<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>David Barbarin &#187; Performance</title>
	<atom:link href="https://blog.developpez.com/mikedavem/pcategory/performance/feed" rel="self" type="application/rss+xml" />
	<link>https://blog.developpez.com/mikedavem</link>
	<description>MVP DataPlatform - MCM SQL Server</description>
	<lastBuildDate>Thu, 09 Sep 2021 21:19:50 +0000</lastBuildDate>
	<language>fr-FR</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.42</generator>
	<item>
		<title>Graphing SQL Server wait stats on Prometheus and Grafana</title>
		<link>https://blog.developpez.com/mikedavem/p13209/devops/graphing-sql-server-wait-stats-on-prometheus-and-grafana</link>
		<comments>https://blog.developpez.com/mikedavem/p13209/devops/graphing-sql-server-wait-stats-on-prometheus-and-grafana#comments</comments>
		<pubDate>Thu, 09 Sep 2021 21:19:22 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[grafana]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[prompQL]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[telegraf]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1816</guid>
		<description><![CDATA[Wait stats are essential performance metrics for diagnosing SQL Server Performance problems. Related metrics can be monitored from different DMVs including sys.dm_os_wait_stats and sys.dm_db_wait_stats (Azure). As you probably know, there are 2 categories of DMVs in SQL Server: Point in &#8230; <a href="https://blog.developpez.com/mikedavem/p13209/devops/graphing-sql-server-wait-stats-on-prometheus-and-grafana">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Wait stats are essential metrics for diagnosing SQL Server performance problems. Related metrics can be monitored from different DMVs, including sys.dm_os_wait_stats and sys.dm_db_wait_stats (Azure).</p>
<p>As you probably know, there are two categories of DMVs in SQL Server: point-in-time versus cumulative, and the DMVs mentioned previously fall into the second category. It means the data in these DMVs is cumulative, incremented every time a wait event occurs. Values reset only when SQL Server restarts or when you intentionally run the DBCC SQLPERF command. Baselining these metric values requires taking snapshots to compare day-to-day activity, or simply trends over a given timeline. Paul Randal kindly provided a T-SQL script for trend analysis over a specified time range in this <a href="https://www.sqlskills.com/blogs/paul/capturing-wait-statistics-period-time/" rel="noopener" target="_blank">blog post</a>. The interesting part of this script is its focus on the most relevant wait types and their corresponding statistics. This is basically the kind of script I used for many years when performing SQL Server audits at customer shops, but today, working as a database administrator for a company, I can rely on our observability stack, which includes Telegraf, Prometheus and Grafana, to do the job.</p>
<p><span id="more-1816"></span></p>
<p>In a previous <a href="https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana" rel="noopener" target="_blank">write-up</a>, I explained the choice of such a platform for SQL Server. Transposing the logic of Paul’s script to Prometheus and Grafana was not trivial, but the result was worth it. It is an interesting topic that I want to share with Ops and DBAs who want to baseline SQL Server telemetry on a Prometheus and Grafana observability platform.</p>
<p>So, let’s start with the metrics provided by the Telegraf collector agent and then scraped by a Prometheus job:<br />
&#8211;	sqlserver_waitstats_wait_time_ms<br />
&#8211;	sqlserver_waitstats_waiting_tasks_count<br />
&#8211;	sqlserver_waitstats_resource_wait_time_ms<br />
&#8211;	sqlserver_waitstats_signal_wait_time_ms</p>
<p>In the context of this blog post we will focus only on the first two metrics of the above list, but the same logic applies to the others.</p>
<p>As a reminder, we want to graph the most relevant wait types and their average value within a time range specified in a Grafana dashboard. In fact, this is a two-step process:</p>
<p>1) Identifying the most relevant wait types by computing their ratio against the total amount of wait time within the specified time range.<br />
2) Graphing in Grafana these most relevant wait types with their corresponding average value for every Prometheus step in the time range.</p>
<p>To address the first point, we need to rely on the Prometheus <a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#rate" rel="noopener" target="_blank">rate()</a> function and the <a href="https://prometheus.io/docs/prometheus/latest/querying/operators/" rel="noopener" target="_blank">group_left</a> modifier. </p>
<p>As per the Prometheus documentation, rate() gives the per-second average rate of change over the specified range interval, using the boundary metric points in it. That is exactly what we need to compute the total average of wait time (in ms) per wait type over a specified time range. rate() needs a range vector as input. Let’s illustrate what a range vector is with the following example. For the sake of simplicity, I filtered the sqlserver_waitstats_wait_time_ms metric to one specific SQL Server instance and wait type (PAGEIOLATCH_EX). A range vector is expressed with a range interval at the end of the query, as you can see below:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;,wait_type=&quot;PAGEIOLATCH_EX&quot;}[1m]</div></div>
<p>The result is a set of metric samples within the specified range interval, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-range-vector.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-range-vector.png" alt="blog 177 - range vector" width="238" height="256" class="alignnone size-full wp-image-1818" /></a></p>
<p>We get, for each sample, the value and the corresponding timestamp in epoch format. You can convert this epoch format to a human-friendly one with <strong>date -j -r</strong> on macOS, for example. Another important point here: the sqlserver_waitstats_wait_time_ms metric is a counter in the Prometheus world, because its value keeps increasing over time, as you can see above (from top to bottom). The same concept exists in SQL Server with the cumulative DMV category, as explained at the beginning. This is why we need the rate() function to draw the right representation of the rate of increase over time between metric points. We got 12 samples with an interval of 5s between each value, because in my context we defined a Prometheus scrape interval of 5s for SQL Server: 60s / 5s = 12 data points and 11 steps. The next question is how rate() calculates the per-second rate of change between data points. Referring to my previous example, I can get the rate value with the following PromQL query:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;,wait_type=&quot;PAGEIOLATCH_EX&quot;}[1m])</div></div>
<p>&#8230; and the corresponding value:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-rate-value.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-rate-value.png" alt="blog 177 - rate value" width="211" height="67" class="alignnone size-full wp-image-1820" /></a></p>
<p>To understand this value, let’s recall a math lesson from school: <a href="https://en.wikipedia.org/wiki/Slope" rel="noopener" target="_blank">slope calculation</a>. </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/Tangent_function_animation.gif"><img src="http://blog.developpez.com/mikedavem/files/2021/09/Tangent_function_animation.gif" alt="Tangent_function_animation" width="300" height="285" class="alignnone size-full wp-image-1823" /></a></p>
<p><em>Image from Wikipedia</em></p>
<p>The basic idea of the slope is the rate of change of one variable relative to another. The smaller the distance between two data points, the more precise our approximation of the slope. And this is exactly what happens with Prometheus when you zoom in or out by changing the range interval. Good resolution is also determined by the Prometheus scrape interval, especially when your metrics are extremely volatile. This is something to keep in mind with Prometheus: we are working with approximations by design. So let&rsquo;s do some math with a slope calculation on the above range vector:</p>
<p>Slope = DV/DT = (332628 &#8211; 332582) / (@1631125796.971 &#8211; @1631125746.962) =~ 0.92</p>
<p>This is essentially how rate() works; the value returned above is slightly lower (=~ 0.83) because rate() also extrapolates the sampled interval toward the boundaries of the full 60s window. The beauty of this function is that the slope calculation is done automatically for all the steps within the range interval.</p>
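<p>The boundary-point arithmetic can be checked in a few lines of Python. This is a simplified sketch, not Prometheus’s real implementation (which additionally extrapolates the sampled interval to the window edges before dividing); the timestamps and counter values are the ones from the range vector above:</p>

```python
# Simplified sketch of the slope behind rate(): take the first and last
# (timestamp, value) pairs of the range vector and compute DV/DT.
# Values taken from the PAGEIOLATCH_EX example above; the real rate()
# also extrapolates to the window boundaries, so its result differs slightly.
samples = [
    (1631125746.962, 332582.0),  # first sample in the [1m] window
    (1631125796.971, 332628.0),  # last sample in the [1m] window
]

(t0, v0), (t1, v1) = samples[0], samples[-1]
naive_slope = (v1 - v0) / (t1 - t0)  # per-second rate between boundary points
print(round(naive_slope, 2))  # ~0.92 before extrapolation
```

<p>Zooming in or out in Prometheus simply changes which boundary samples enter this calculation, which is why the approximation sharpens with a shorter range interval.</p>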
<p>But let’s go back to the initial requirement. We need to calculate, per wait type, the average value of wait time between the first and last points in the specified range vector. We can now go a step further by using a Prometheus aggregation operator as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;}[1m]))</div></div>
<p>Please note we could have written this another way, without the sum by aggregator, but it naturally excludes all unwanted labels from the result metric, which will be particularly helpful for the next part. Here is a sample of the output:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-aggregation-by-waittype.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-aggregation-by-waittype-1024x145.png" alt="blog 177 - aggregation by waittype" width="584" height="83" class="alignnone size-large wp-image-1826" /></a></p>
<p>Then we can compute the ratio (or percentage) per wait_type label. A first, naïve attempt could be as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;}[1m]))/ sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance'}[1m]))</div></div>
<p>But we get an empty query result. Bad joke, right? We need to understand why. </p>
<p>The first part of the query gives the total amount of wait time per wait type. I put a sample of the results here for simplicity:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-aggregation-by-waittype1.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-aggregation-by-waittype1-1024x145.png" alt="blog 177 - aggregation by waittype" width="584" height="83" class="alignnone size-large wp-image-1828" /></a></p>
<p>It produces a new set of metrics with only one label, wait_type. The second part gives the total amount of wait time for all wait types, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-total-waits.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-total-waits.png" alt="blog 177 - total waits" width="479" height="39" class="alignnone size-full wp-image-1829" /></a></p>
<p>With SQL statements, we instinctively join on columns that have matching values in the tables concerned; those columns are often covered by primary or foreign keys. In the Prometheus world, vector matching works the same way, using all labels as the starting point, and samples are selected or dropped from the result vector based on the &laquo;&nbsp;ignoring&nbsp;&raquo; or &laquo;&nbsp;on&nbsp;&raquo; keywords. In my case there are no matching labels, so we must tell Prometheus to ignore the remaining label (wait_type) on the first part of the query:</p>
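<p>To make the matching mechanics concrete, here is a small Python sketch of what we are asking Prometheus to do: divide each per-wait_type rate by the single all-types total, once the wait_type label is ignored on the left-hand side. The rate values are made up for illustration:</p>

```python
# Hypothetical per-wait_type rates (ms of wait accrued per second); the
# numbers are invented for illustration only.
per_wait_type = {
    "PAGEIOLATCH_EX": 0.92,
    "CXPACKET": 1.84,
    "SOS_SCHEDULER_YIELD": 0.46,
}

# Right-hand side: sum(rate(...)) with no grouping label, i.e. one sample
# carrying no wait_type label at all.
total = sum(per_wait_type.values())

# "/ ignoring(wait_type) group_left": once wait_type is ignored, every
# left-hand sample matches the lone right-hand sample (many-to-one division).
ratios = {wait_type: r / total for wait_type, r in per_wait_type.items()}
print(ratios)
```

<p>Without the ignoring step, no left-hand label set would match the unlabeled total, which is exactly why the naïve query returned an empty result.</p>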
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;}[1m]))/ ignoring(wait_type) sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance'}[1m]))</div></div>
<p>But another error message &#8230;</p>
<p><strong>Error executing query: multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)</strong></p>
<p>In many-to-one or one-to-many vector matching with Prometheus, samples are selected using the group_left or group_right keywords. In other words, with this final query we are telling Prometheus to perform a cross join before dividing the values:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;}[1m]))/ ignoring(wait_type) group_left sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance'}[1m]))</div></div>
<p>Here we go!</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-ratio-per-label.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-ratio-per-label-1024x149.png" alt="blog 177 - ratio per label" width="584" height="85" class="alignnone size-large wp-image-1830" /></a></p>
<p>We finally managed to calculate the ratio per wait type over a specified range interval. The last thing is to select the most relevant wait types by first excluding the irrelevant ones. Most of the excluded wait types come from the exclusion list provided by Paul Randal’s script. We also decided to focus only on the top 5 wait types with a ratio &gt; 10%, but it is up to you to change these values:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">topk(5, sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance',measurement_db_type=&quot;SQLServer&quot;,wait_type!~'(BROKER_EVENTHANDLER|BROKER_RECEIVE_WAITFOR|BROKER_TASK_STOP|BROKER_TO_FLUSH|BROKER_TRANSMITTER|CHECKPOINT_QUEUE|CHKPT|CLR_AUTO_EVENT|CLR_MANUAL_EVENT|CLR_SEMAPHORE|DBMIRROR_DBM_EVENT|DBMIRROR_EVENTS_QUEUE|DBMIRROR_WORKER_QUEUE|DBMIRRORING_CMD|DIRTY_PAGE_POLL|DISPATCHER_QUEUE_SEMAPHORE|EXECSYNC|FSAGENT|FT_IFTS_SCHEDULER_IDLE_WAIT|FT_IFTSHC_MUTEX|KSOURCE_WAKEUP|LAZYWRITER_SLEEP|LOGMGR_QUEUE|MEMORY_ALLOCATION_EXT|ONDEMAND_TASK_QUEUE|PARALLEL_REDO_DRAIN_WORKER|PARALLEL_REDO_LOG_CACHE|PARALLEL_REDO_TRAN_LIST|PARALLEL_REDO_WORKER_SYNC|PARALLEL_REDO_WORKER_WAIT_WORK|PREEMPTIVE_OS_FLUSHFILEBUFFERS|PREEMPTIVE_XE_GETTARGETSTATE|PWAIT_ALL_COMPONENTS_INITIALIZED|PWAIT_DIRECTLOGCONSUMER_GETNEXT|QDS_PERSIST_TASK_MAIN_LOOP_SLEEP|QDS_ASYNC_QUEUE|QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP|QDS_SHUTDOWN_QUEUE|REDO_THREAD_PENDING_WORK|REQUEST_FOR_DEADLOCK_SEARCH|RESOURCE_QUEUE|SERVER_IDLE_CHECK|SLEEP_BPOOL_FLUSH|SLEEP_DBSTARTUP|SLEEP_DCOMSTARTUP|SLEEP_MASTERDBREADY|SLEEP_MASTERMDREADY|SLEEP_MASTERUPGRADED|SLEEP_MSDBSTARTUP|SLEEP_SYSTEMTASK|SLEEP_TASK|SLEEP_TEMPDBSTARTUP|SNI_HTTP_ACCEPT|SOS_WORK_DISPATCHER|SP_SERVER_DIAGNOSTICS_SLEEP|SQLTRACE_BUFFER_FLUSH|SQLTRACE_INCREMENTAL_FLUSH_SLEEP|SQLTRACE_WAIT_ENTRIES|VDI_CLIENT_OTHER|WAIT_FOR_RESULTS|WAITFOR|WAITFOR_TASKSHUTDOW|WAIT_XTP_RECOVERY|WAIT_XTP_HOST_WAIT|WAIT_XTP_OFFLINE_CKPT_NEW_LOG|WAIT_XTP_CKPT_CLOSE|XE_DISPATCHER_JOIN|XE_DISPATCHER_WAIT|XE_TIMER_EVENT|MEMORY_ALLOCATION_EXT|ONDEMAND_TASK_QUEUE|PREEMPTIVE_HADR_LEASE_MECHANISM|PREEMPTIVE_SP_SERVER_DIAGNOSTICS|PREEMPTIVE_ODBCOPS|PREEMPTIVE_OS_LIBRARYOPS|PREEMPTIVE_OS_COMOPS|PREEMPTIVE_OS_C
RYPTOPS|PREEMPTIVE_OS_PIPEOPS|PREEMPTIVE_OS_AUTHENTICATIONOPS|PREEMPTIVE_OS_GENERICOPS|PREEMPTIVE_OS_VERIFYTRUST|PREEMPTIVE_OS_FILEOPS|PREEMPTIVE_OS_DEVICEOPS|PREEMPTIVE_OS_QUERYREGISTRY|PREEMPTIVE_OS_WRITEFILE|PREEMPTIVE_XE_CALLBACKEXECUTEPREEMPTIVE_XE_DISPATCHER|PREEMPTIVE_XE_GETTARGETSTATEPREEMPTIVE_XE_SESSIONCOMMIT|PREEMPTIVE_XE_TARGETINITPREEMPTIVE_XE_TARGETFINALIZE|PREEMPTIVE_XHTTP|PWAIT_EXTENSIBILITY_CLEANUP_TASK|PREEMPTIVE_OS_DISCONNECTNAMEDPIPE|PREEMPTIVE_OS_DELETESECURITYCONTEXT|PREEMPTIVE_OS_CRYPTACQUIRECONTEXT|PREEMPTIVE_HTTP_REQUEST|RESOURCE_GOVERNOR_IDLE|HADR_FABRIC_CALLBACK|PVS_PREALLOCATE)'}[1m])) / ignoring(wait_type) group_left sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance',measurement_db_type=&quot;SQLServer&quot;,wait_type!~'(BROKER_EVENTHANDLER|BROKER_RECEIVE_WAITFOR|BROKER_TASK_STOP|BROKER_TO_FLUSH|BROKER_TRANSMITTER|CHECKPOINT_QUEUE|CHKPT|CLR_AUTO_EVENT|CLR_MANUAL_EVENT|CLR_SEMAPHORE|DBMIRROR_DBM_EVENT|DBMIRROR_EVENTS_QUEUE|DBMIRROR_WORKER_QUEUE|DBMIRRORING_CMD|DIRTY_PAGE_POLL|DISPATCHER_QUEUE_SEMAPHORE|EXECSYNC|FSAGENT|FT_IFTS_SCHEDULER_IDLE_WAIT|FT_IFTSHC_MUTEX|KSOURCE_WAKEUP|LAZYWRITER_SLEEP|LOGMGR_QUEUE|MEMORY_ALLOCATION_EXT|ONDEMAND_TASK_QUEUE|PARALLEL_REDO_DRAIN_WORKER|PARALLEL_REDO_LOG_CACHE|PARALLEL_REDO_TRAN_LIST|PARALLEL_REDO_WORKER_SYNC|PARALLEL_REDO_WORKER_WAIT_WORK|PREEMPTIVE_OS_FLUSHFILEBUFFERS|PREEMPTIVE_XE_GETTARGETSTATE|PWAIT_ALL_COMPONENTS_INITIALIZED|PWAIT_DIRECTLOGCONSUMER_GETNEXT|QDS_PERSIST_TASK_MAIN_LOOP_SLEEP|QDS_ASYNC_QUEUE|QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP|QDS_SHUTDOWN_QUEUE|REDO_THREAD_PENDING_WORK|REQUEST_FOR_DEADLOCK_SEARCH|RESOURCE_QUEUE|SERVER_IDLE_CHECK|SLEEP_BPOOL_FLUSH|SLEEP_DBSTARTUP|SLEEP_DCOMSTARTUP|SLEEP_MASTERDBREADY|SLEEP_MASTERMDREADY|SLEEP_MASTERUPGRADED|SLEEP_MSDBSTARTUP|SLEEP_SYSTEMTASK|SLEEP_TASK|SLEEP_TEMPDBSTARTUP|SNI_HTTP_ACCEPT|SOS_WORK_DISPATCHER|SP_SERVER_DIAGNOSTICS_SLEEP|SQLTRACE_BUFFER_FLUSH|SQLTRACE_INCREMENTAL_FLUSH_SLEEP|SQLTRACE_WAIT_ENTRIES|VDI_CLIENT_O
THER|WAIT_FOR_RESULTS|WAITFOR|WAITFOR_TASKSHUTDOW|WAIT_XTP_RECOVERY|WAIT_XTP_HOST_WAIT|WAIT_XTP_OFFLINE_CKPT_NEW_LOG|WAIT_XTP_CKPT_CLOSE|XE_DISPATCHER_JOIN|XE_DISPATCHER_WAIT|XE_TIMER_EVENT|MEMORY_ALLOCATION_EXT|ONDEMAND_TASK_QUEUE|PREEMPTIVE_HADR_LEASE_MECHANISM|PREEMPTIVE_SP_SERVER_DIAGNOSTICS|PREEMPTIVE_ODBCOPS|PREEMPTIVE_OS_LIBRARYOPS|PREEMPTIVE_OS_COMOPS|PREEMPTIVE_OS_CRYPTOPS|PREEMPTIVE_OS_PIPEOPS|PREEMPTIVE_OS_AUTHENTICATIONOPS|PREEMPTIVE_OS_GENERICOPS|PREEMPTIVE_OS_VERIFYTRUST|PREEMPTIVE_OS_FILEOPS|PREEMPTIVE_OS_DEVICEOPS|PREEMPTIVE_OS_QUERYREGISTRY|PREEMPTIVE_OS_WRITEFILE|PREEMPTIVE_XE_CALLBACKEXECUTEPREEMPTIVE_XE_DISPATCHER|PREEMPTIVE_XE_GETTARGETSTATEPREEMPTIVE_XE_SESSIONCOMMIT|PREEMPTIVE_XE_TARGETINITPREEMPTIVE_XE_TARGETFINALIZE|PREEMPTIVE_XHTTP|PWAIT_EXTENSIBILITY_CLEANUP_TASK|PREEMPTIVE_OS_DISCONNECTNAMEDPIPE|PREEMPTIVE_OS_DELETESECURITYCONTEXT|PREEMPTIVE_OS_CRYPTACQUIRECONTEXT|PREEMPTIVE_HTTP_REQUEST|RESOURCE_GOVERNOR_IDLE|HADR_FABRIC_CALLBACK|PVS_PREALLOCATE)'}[1m]))) &gt;= 0.1</div></div>
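<p>The selection logic of that long query, stripped of the PromQL machinery, boils down to three operations: exclude irrelevant wait types, keep at most the top 5 ratios, then drop anything under the 10% threshold. A Python sketch with made-up ratios (and a deliberately tiny stand-in exclusion regex) illustrates it:</p>

```python
import re

# Hypothetical ratios per wait type, invented for illustration; the tiny
# regex below stands in for the long wait_type!~'(...)' exclusion list.
ratios = {
    "PAGEIOLATCH_EX": 0.29,
    "CXPACKET": 0.57,
    "LAZYWRITER_SLEEP": 0.08,
    "SOS_SCHEDULER_YIELD": 0.14,
    "ASYNC_NETWORK_IO": 0.05,
}
excluded = re.compile(r"^(LAZYWRITER_SLEEP|SLEEP_TASK|XE_TIMER_EVENT)$")

# topk(5, ...) ... >= 0.1: keep at most five of the largest remaining
# ratios, then drop anything under the 10% threshold.
kept = sorted((wt for wt in ratios if not excluded.match(wt)),
              key=lambda wt: ratios[wt], reverse=True)[:5]
top = {wt: ratios[wt] for wt in kept if ratios[wt] >= 0.1}
print(top)  # the "most relevant" wait types
```

<p>With these sample numbers, three wait types survive, which mirrors the result shown in the next screenshot.</p>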
<p>I got 3 relevant wait types with their corresponding ratio in the specified time range.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-ratio-per-label-top-5.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-ratio-per-label-top-5-1024x67.png" alt="blog 177 - ratio per label top 5" width="584" height="38" class="alignnone size-large wp-image-1832" /></a></p>
<p>Pretty cool stuff, but we must now address the second requirement. We want to graph the average value of the identified wait types within a specified time range in a Grafana dashboard. The first step consists of including the above Prometheus query as a variable in the Grafana dashboard. Here is how I set up my Top5Waits variable in Grafana:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-granafa-top5waits.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-granafa-top5waits-1024x501.png" alt="blog 177 - granafa top5waits" width="584" height="286" class="alignnone size-large wp-image-1833" /></a></p>
<p>Some interesting points here: variable dependency kicks in, with my $Top5Waits variable depending hierarchically on another $Instance variable in my dashboard (populated from another Prometheus query). You have probably noticed the use of [${__range_s}s] to determine the range interval, but depending on your needs the Grafana $__interval may be a good fit as well. </p>
<p>In turn, $Top5Waits can be used in another query, this time directly in a Grafana dashboard panel, to graph the average value of the most relevant wait types as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-grafana-avg-wait-stats.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-grafana-avg-wait-stats-1024x400.png" alt="blog 177 - grafana avg wait stats" width="584" height="228" class="alignnone size-large wp-image-1834" /></a></p>
<p>Calculating the wait type average is not a hard task by itself. In fact, we can apply the same method as previously: matching the sqlserver_waitstats_wait_time_ms and sqlserver_waitstats_waiting_tasks_count metrics and dividing their corresponding values to obtain the average wait time (in ms) for each step within the time range (remember how the rate() function works). Both metrics share the same set of labels, so we don’t need the &laquo;&nbsp;on&nbsp;&raquo; or &laquo;&nbsp;ignoring&nbsp;&raquo; keywords in this case. But we must introduce the $Top5Waits variable in the label filter of the first metric as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance',wait_type=~&quot;$Top5Waits&quot;,measurement_db_type=&quot;SQLServer&quot;}[$__rate_interval])/rate(sqlserver_waitstats_waiting_tasks_count{sql_instance='$Instance',wait_type=~&quot;$Top5Waits&quot;,measurement_db_type=&quot;SQLServer&quot;}[$__rate_interval])</div></div>
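<p>The per-series division behind that panel query can be sketched in Python. The rate values are invented for illustration; the point is that each wait_type series in the numerator lines up one-to-one with the same series in the denominator:</p>

```python
# Hypothetical per-second rates over one rate interval, echoing the panel
# query: average wait (ms) = rate(wait_time_ms) / rate(waiting_tasks_count).
wait_time_ms_rate = {"PAGEIOLATCH_EX": 0.92, "CXPACKET": 1.84}  # ms waited per second
tasks_count_rate = {"PAGEIOLATCH_EX": 0.23, "CXPACKET": 0.46}   # waits started per second

# Labels match one-to-one across both metrics, so no on()/ignoring() is
# needed: divide series by series.
avg_wait_ms = {wt: wait_time_ms_rate[wt] / tasks_count_rate[wt]
               for wt in wait_time_ms_rate}
print(avg_wait_ms)  # average duration of a single wait, in ms, per wait type
```

<p>Each Prometheus step then yields one such average per wait type, which is exactly what the Grafana panel plots over the selected time range.</p>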
<p>We finally managed to get an interesting, dynamic measurement of SQL Server wait stats telemetry. Hope this blog post helps!<br />
Let me know your feedback if you are using SQL Server wait stats in Prometheus and Grafana in a different way!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why we moved SQL Server monitoring on Prometheus and Grafana</title>
		<link>https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana</link>
		<comments>https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana#comments</comments>
		<pubDate>Tue, 22 Dec 2020 16:55:12 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[Continuous Delivery]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[grafana]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[RED]]></category>
		<category><![CDATA[sqlserver]]></category>
		<category><![CDATA[telegraf]]></category>
		<category><![CDATA[USE]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1722</guid>
		<description><![CDATA[During this year, I spent a part of my job on understanding the processes and concepts around monitoring in my company. The DevOps mindset mainly drove the idea to move our SQL Server monitoring to the existing Prometheus and Grafana &#8230; <a href="https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>During this year, I spent a part of my job on understanding the processes and concepts around monitoring in my company. The DevOps mindset mainly drove the idea to move our SQL Server monitoring to the existing Prometheus and Grafana infrastructure. Obviously, there were some technical decisions behind the scenes, but the most important part of this write-up is dedicated to explaining the other, and likely more important, reasons for this decision. </p>
<p><span id="more-1722"></span></p>
<p>But let’s be clear first: this write-up doesn’t constitute guidance or any kind of best practice for DBAs, only some sharing of my own experience on the topic. As usual, any comment will be appreciated.</p>
<p>That said, let’s continue with the context. At the beginning of this year, I started a new DBA position in a customer-centric company where DevOps culture, microservices and CI/CD are omnipresent. What does that mean exactly? To cut the story short, development and operations teams use a common approach to agile software development and delivery. Tools and processes are used to automate build, test and deploy stages and to monitor applications with speed, quality and control. In other words, we are talking about Continuous Delivery, and in my company the release cycle is faster than in the traditional shops I have encountered so far, with several releases per day, including database changes. Another interesting point is that we follow the &laquo;&nbsp;Operate what you build&nbsp;&raquo; principle: each team that develops a service is also responsible for operating and supporting it. This presents some advantages for both developers and operations, but pushing out changes requires getting feedback and observing the impact on the system on both sides. </p>
<p>In addition, in the operations team we try to act as a centralized team, and each member should understand the global scope and the topics related to the infrastructure and its ecosystem. This is especially true when you&rsquo;re dealing with nightly on-calls. Each member has their own segment of responsibility (in their specialized area), but following DevOps principles, we encourage shared ownership to break down internal silos and optimize feedback and learning. It implies anyone should be able to temporarily take over any operational task to some extent, assuming the process is well documented and the learning has been done correctly. But the world is not perfect, and this model has its downsides. For example, it prioritizes effectiveness in broader domains, which increases the cognitive load of each team member and lowers visibility into vertical topics where deeper expertise is sometimes required. Having an end-to-end observable system, including the infrastructure layer and databases, may help reduce the time spent investigating and fixing issues before end users experience them. </p>
<p><strong>The initial scenario</strong></p>
<p>Let me give some background info and illustration of the initial scenario:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-0-initial-scenario.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-0-initial-scenario-1024x704.jpg" alt="170 - 0 - initial scenario" width="584" height="402" class="alignnone size-large wp-image-1725" /></a></p>
<p>… and my feeling of what could be improved:</p>
<p>1) From a DBA perspective, at first glance there are many potential issues. Indeed, a lot of automated or semi-manual deployment processes are out of our control and may have a direct impact on the stability of the database environment. Without better visibility, there is likely no easy way to address the famous question: Hey, we have been experiencing performance degradation for two days; has something happened on the database side?  </p>
<p>2) Silos between DBAs and DEVs are encouraged in this scenario. The direct consequence is to drastically limit the added value of the DBA role in a DevOps context. Obviously, primary concerns include production tasks like ensuring the integrity, backup and maintenance of databases. But in a DevOps-oriented company where we have automated &laquo;&nbsp;database-as-code&nbsp;&raquo; pipelines, there remains a lot of unnecessary complexity and disruptive scripts that the DBA must take care of. If this role is placed only at the end of the delivery pipeline, collaboration and continuous learning with developer teams will be restricted to a minimum.  </p>
<p>3) There is a dedicated monitoring tool for the SQL Server infrastructure, and this is a good point. It provides the necessary baselining and performance insights for DBAs. But on the other hand, the tool in place targets only DBA profiles, and its usage is limited to the infrastructure team. This doesn’t help improve scalability within the operations team and beyond. Another issue with the existing tooling is that correlation can be difficult with external events coming either from the continuous delivery pipeline or from configuration changes performed by operations teams on the SQL Server instances. In this case, establishing observability (the why) may be limited, and this is what teams need in order to respond quickly and resolve emergencies in modern, distributed software.</p>
<p><strong>What is observability?</strong></p>
<p>You probably noticed the word &laquo;&nbsp;observability&nbsp;&raquo; in my previous sentence, so I think it deserves some explanation before continuing. Observability might seem like a buzzword, but it is in fact not a new concept; it became prominent with DevOps software development lifecycle (SDLC) methodologies and distributed infrastructure systems. Referring to the <a href="https://en.wikipedia.org/wiki/Observability" rel="noopener" target="_blank">Wikipedia</a> definition, <strong>observability is the ability to infer internal states of a system based on the system’s external outputs</strong>. To be honest, this did not help me very much, and further reading was necessary to shed light on what observability exactly is and how it differs from monitoring. </p>
<p>Let’s start instead with monitoring, which is the ability to translate infrastructure log and metric data into meaningful and actionable insights. It helps you know when something goes wrong so you can start your response quickly. This is the basis of any monitoring tool, and the existing one does a good job at it. In the DBA world, monitoring is often related to performance, but performance reporting is only as useful as its ability to accurately represent the internal state of the global system, not only your database environment. For example, in the past I went to customer shops where I was in charge of auditing their SQL Server infrastructure. Generally, customers were able to present their context, but they could not provide real facts or performance metrics about their application. In that case, you usually rely on a top-down approach, and if you are either lucky or experienced enough, you manage to find what is going wrong. But sometimes I had relevant SQL Server metrics that would have highlighted a database performance issue, yet we could not make a clear correlation with those identified on the application side. In this case, relying only on database performance metrics was not enough to infer the internal state of the application. From my experience, many shops run such applications that have been designed for success and not for failure: they often lack debuggability, and monitoring telemetry is often missing. Collecting data is the base of observability.</p>
<p>Observability provides not only the when of an error or issue, but more importantly the why. With modern software architectures including micro-services and the emphasis on DevOps, monitoring goals are no longer limited to collecting and processing log data, metrics, and event traces. Instead, monitoring should be employed to improve observability by getting a better understanding of the properties of an application and its performance across distributed systems and the delivery pipeline. In the new context I’m working in, metric capture and analysis starts with the deployment of each micro-service, providing better observability by measuring all the work done across all dependencies.</p>
<p><strong>White-Box vs. Black-Box Monitoring </strong></p>
<p>In my company, as in many other companies, two approaches are used when it comes to monitoring: white-box and black-box monitoring.<br />
White-box monitoring focuses on exposing the internals of a system. For example, this approach is used by many SQL Server performance tools on the market, which map the system with a bunch of internal statistical data about index or internal cache usage, existing wait stats, locks and so on.</p>
<p>In contrast, black-box monitoring is symptom-oriented and tests externally visible behavior as a user would see it. The goal is only to monitor the system from the outside and to see ongoing problems. There are many ways to achieve black-box monitoring; the first obvious one is using probes which collect CPU or memory usage, network communications, HTTP health checks or latency, and so on. Another option is a set of integration tests that run all the time to test the system from a behavior / business perspective.</p>
<p>White-box vs. black-box monitoring: which is finally more important? Both are, and they can work together. In my company, both are used at different layers of the micro-service architecture, covering software and infrastructure components. </p>
<p><strong>RED vs USE monitoring</strong></p>
<p>When you’re working in a web-oriented and customer-centric company, you are quickly introduced to the Four Golden Signals monitoring concept, which defines a series of metrics originally from <a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener" target="_blank">Google Site Reliability Engineering</a>: latency, traffic, errors and saturation. The RED method is a subset of the Four Golden Signals, focused on micro-service architectures, and includes the following metrics:</p>
<ul>
<li>Rate: number of requests our service is serving per second</li>
<li>Error: number of failed requests per second </li>
<li>Duration: amount of time it takes to process a request</li>
</ul>
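<p>Under Prometheus conventions, the three RED signals can be sketched as PromQL expressions; the metric names below follow common client-library naming and are assumptions, not metrics from our actual setup:</p>

```promql
# Rate: requests per second per service, averaged over 5 minutes
sum(rate(http_requests_total[5m])) by (service)

# Errors: failed (5xx) requests per second
sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)

# Duration: 95th percentile latency derived from a histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```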
<p>Those metrics are relatively straightforward to understand and may reduce the time needed to figure out which service is throwing errors, so you can then look at the logs, restart the service, or whatever is appropriate. </p>
<p>For HTTP metrics the RED method is a good fit, while the USE method is more suitable for the infrastructure side, where the main concern is keeping physical resources under control. The latter is based on 3 metrics:</p>
<ul>
<li>Utilization: mainly expressed as a percentage; indicates whether a resource is underloaded or overloaded. </li>
<li>Saturation: work sitting in a queue, waiting to be processed</li>
<li>Errors: Count of event errors</li>
</ul>
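<p>Mapped to SQL Server, the saturation signal is essentially what wait stats expose. A minimal sketch against sys.dm_os_wait_stats (the excluded wait types are illustrative, not an exhaustive benign-waits list):</p>

```sql
-- Cumulative wait time per wait type since instance start
-- (use sys.dm_db_wait_stats on Azure SQL Database)
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms   -- time spent waiting for a CPU: hints at CPU saturation
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                        'XE_TIMER_EVENT', 'CHECKPOINT_QUEUE')
ORDER BY wait_time_ms DESC;
```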
<p>Those metrics are commonly used by DBAs to monitor performance. It is worth noting that the utilization metric can sometimes be misinterpreted, especially when the maximum value depends on the context and can go over 100%. </p>
<p><strong>SQL Server infrastructure monitoring expectations</strong></p>
<p>Referring to the starting scenario and all the concepts surfaced above, it was clear that we had to evolve our existing SQL Server monitoring architecture to reach the following goals:</p>
<ul>
<li>Keep analyzing long-term trends to answer usual questions such as: how is my daily workload evolving? How big is my database? …</li>
<li>Alert on something broken that we need to fix now, or on an issue that is developing and must be checked soon.</li>
<li>Build comprehensive dashboards – dashboards should answer basic questions about our SQL Server instances, and should include some form of advanced SQL telemetry and logging for deeper analysis.</li>
<li>Conduct ad-hoc retrospective analysis with easier correlation: for example, an HTTP response latency that increased in one service. What happened around it? Is it related to a database issue? To a blocking issue raised on the SQL Server instance? To a new query or schema change deployed from the automated delivery pipeline? In other words, good observability should be part of the new solution.</li>
<li>Automated discovery and telemetry collection for every SQL Server instance installed in our environment, whether on a VM or in a container.</li>
<li>Rely entirely on the common platform monitoring based on Prometheus and Grafana. Having the same tooling often makes communication easier between people (the human factor is also an important aspect of DevOps). </li>
</ul>
<p><strong>Prometheus, Grafana and Telegraf</strong></p>
<p>Prometheus and Grafana are the central monitoring solution for our micro-service architecture. Others exist, but we’ll focus on these tools in the context of this write-up.<br />
Prometheus is an open-source ecosystem for monitoring and alerting. It uses a multi-dimensional data model based on time-series data identified by a metric name and key/value pairs. PromQL is the query language used by Prometheus to aggregate data in real time, and the data is either displayed directly or consumed through the HTTP API by external systems like Grafana. Unlike with the previous tooling, we appreciated being able to collect SQL Server metrics alongside those of the underlying infrastructure such as VMware and others. This gives a comprehensive picture of the full path between the database services and the infrastructure components they rely on. </p>
<p>Grafana is open source software used to display time-series analytics. It allows us to query, visualize and generate alerts from our metrics. It is also possible to integrate a variety of data sources in addition to Prometheus, increasing the correlation and aggregation capabilities across metrics from different sources. Finally, Grafana comes with a native annotation store and the ability to add annotation events directly from the graph panel or via the HTTP API. This feature is especially useful for storing annotations and tags related to external events, and we decided to use it for tracking software releases and SQL Server configuration changes. Having such events directly on the dashboard may reduce the troubleshooting effort by answering the why of an issue faster.  </p>
<p>For collecting data we use the <a href="https://github.com/influxdata/telegraf/tree/master/plugins/inputs/sqlserver" rel="noopener" target="_blank">Telegraf plugin</a> for SQL Server. The plugin exposes all configured metrics to be scraped by a Prometheus server. It can be used for both on-prem and Azure instances, including Azure SQL DB and Azure SQL MI. Automated deployment and configuration requires little effort as well. </p>
<p>The high-level overview of the newly implemented monitoring solution is as follows:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-3-monitoring-architecture.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-3-monitoring-architecture-1024x776.jpg" alt="170 - 3 - monitoring architecture" width="584" height="443" class="alignnone size-large wp-image-1729" /></a></p>
<p>SQL Server telemetry is achieved through Telegraf + Prometheus and includes both black-box and white-box oriented metrics. External events like automated deployments and server-level or database-level configuration changes are monitored through a centralized scheduled framework based on PowerShell. Annotations and tags are then written to Grafana accordingly, and event details are recorded in logging tables for further troubleshooting.</p>
<p><strong>Did the new monitoring meet our expectations?</strong></p>
<p>Well, having used the new monitoring solution during this year, I would say we are on a good track. We worked mainly on 2 dashboards. The first one exposes basic black-box metrics to show quickly if something is going wrong, while the second one is DBA-oriented, with plenty of internal counters to dig further and perform retrospective analysis.</p>
<p>Here is a sample of representative issues we faced this year and managed to fix with the new monitoring solution:</p>
<p>1) Resource pressure and black-box monitoring in action:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-4-grafana-1.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-4-grafana-1-1024x168.jpg" alt="170 - 4 - grafana 1" width="584" height="96" class="alignnone size-large wp-image-1730" /></a></p>
<p>For this scenario, the first dashboard highlighted resource pressure issues, but it is worth noting that even though the infrastructure was burning, users didn’t experience any side effects or slowness on the application side. After the corresponding alerts were raised on our side, we applied proactive, temporary fixes before users could feel them. This scenario is something we would have been able to manage with the previous monitoring, and the good news is we didn’t notice any regression on this topic. </p>
<p>2) Better observability for better resolution of a complex issue</p>
<p>This scenario was more interesting because the first symptom started on the application side without any alert at the infrastructure layer. We started suffering from HTTP request slowness in November around 12:00, and developers were alerted by sporadic timeout issues from the logging system. After traversing the service graph, they spotted that something was wrong on the database service by correlating the HTTP slowness with blocked processes on the SQL Server dashboard, as shown below. I put a simplified view of the dashboards here, but we needed to cross several routes between the front-end services and the databases.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-6-grafana-3.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-6-grafana-3-1024x235.jpg" alt="170 - 6 - grafana 3" width="584" height="134" class="alignnone size-large wp-image-1732" /></a></p>
<p>Then I got a call from them and we started investigating blocking processes from the logging tables in place on the SQL Server side. At first glance, different queries had a longer execution time than usual, and neither release deployments nor configuration updates could explain such a sudden behavior change. The issue persisted, and at 15:42 it started appearing frequently enough to deserve a deeper look at the SQL Server internal metrics. We quickly found some interesting correlations with other metrics and finally managed to figure out why things went wrong, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-7-grafana-4.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-7-grafana-4-706x1024.jpg" alt="170 - 7 - grafana 4" width="584" height="847" class="alignnone size-large wp-image-1733" /></a></p>
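<p>The blocked-process hunt described above boils down to a query of this kind against the live DMVs (our logging tables persist a similar snapshot; the shape below is a generic sketch):</p>

```sql
-- Who is blocked, by whom, and on what resource
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time      AS wait_time_ms,
    r.wait_resource,
    t.text           AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
```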
<p>The root cause was related to transaction replication slowness within the Always On availability group databases, and the error log details on the secondary pointed us directly to a storage issue: </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-8-errorlog.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-8-errorlog-1024x206.jpg" alt="170 - 8 - errorlog" width="584" height="117" class="alignnone size-large wp-image-1734" /></a></p>
<p>End-to-end observability, achieved by including the database services in the new monitoring system, drastically reduced the time needed to find the root cause. But we also learned from this experience, and to continuously improve observability we added a black-box oriented metric for availability group replication latency (see below) to detect any potential issue faster.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-9-avg-replication-metric.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-9-avg-replication-metric.jpg" alt="170 - 9 - avg replication metric" width="160" height="113" class="alignnone size-full wp-image-1736" /></a></p>
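<p>The new metric is fed by a query of this kind against the availability group DMVs (a sketch; alert thresholds are not shown):</p>

```sql
-- Log send and redo queue sizes per database: growing values mean
-- the secondary replica is falling behind the primary
SELECT
    ag.name                  AS ag_name,
    ar.replica_server_name,
    DB_NAME(drs.database_id) AS database_name,
    drs.log_send_queue_size, -- KB generated on the primary, not yet sent
    drs.redo_queue_size      -- KB received on the secondary, not yet redone
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups  AS ag ON ag.group_id   = drs.group_id;
```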
<p><strong>And what’s next? </strong></p>
<p>Having such monitoring is not the end of this story. As said at the beginning of this write-up, continuous delivery comes with its own DBA challenges, as illustrated by the starting scenario. Traditionally the DBA role is siloed, turning requests or tickets into work, and DBAs can lack context about the broader business or the technology used in the company. I experienced several situations myself where you get alerted during the night because a developer’s query exceeds some usage threshold. Having discussed the point with many DBAs, they tend to be conservative about database changes (a normal reaction?), especially when they sit at the end of the delivery process without a clear view of what exactly will be deployed. </p>
<p>Here is the new situation:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-2-new-scenario.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-2-new-scenario-1024x641.jpg" alt="170 - 2 - new scenario" width="584" height="366" class="alignnone size-large wp-image-1737" /></a></p>
<p>Implementing the new monitoring changed the way we observe the system (at least from a DBA perspective). Again, I believe the added value of the DBA role in a company with a strong DevOps mindset is being part of both production DBA and development DBA work. Making observability consistent across the whole delivery pipeline, including databases, is likely part of the success and may help the DBA get a broader picture of the system components. In my context, I now get more interaction with developer teams in the early phases and can provide them contextual (and not generic) feedback for improvements based on SQL production telemetry. They also have access to it and can check the impact of their developments by themselves. In the same way, feedback and work with my team around database infrastructure topics appear more relevant. </p>
<p>It is finally a matter of collaboration.</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Interesting use case of using dummy columnstore indexes and temp tables</title>
		<link>https://blog.developpez.com/mikedavem/p13202/sql-server-vnext/interesting-use-case-of-using-dummy-columnstore-indexes-and-temp-tables</link>
		<comments>https://blog.developpez.com/mikedavem/p13202/sql-server-vnext/interesting-use-case-of-using-dummy-columnstore-indexes-and-temp-tables#comments</comments>
		<pubDate>Fri, 20 Nov 2020 17:06:06 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[batch mode]]></category>
		<category><![CDATA[columnstore]]></category>
		<category><![CDATA[inline index]]></category>
		<category><![CDATA[operation analytics]]></category>
		<category><![CDATA[reporting]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1710</guid>
		<description><![CDATA[Columnstore indexes are a very nice feature and well-suited for analytics queries. Using them for our datawarehouse helped to accelerate some big ETL processing and to reduce resource footprint such as CPU, IO and memory as well. In addition, SQL &#8230; <a href="https://blog.developpez.com/mikedavem/p13202/sql-server-vnext/interesting-use-case-of-using-dummy-columnstore-indexes-and-temp-tables">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Columnstore indexes are a very nice feature and well-suited for analytics queries. Using them for our datawarehouse helped accelerate some big ETL processing and reduce the resource footprint in terms of CPU, IO and memory. In addition, SQL Server 2016 takes columnstore indexes to a new level by allowing a fully updateable non-clustered columnstore index on a rowstore table, making mixed operational and analytics workloads possible. Non-clustered columnstore indexes are a different beast to manage with an OLTP workload, and we have had both good and bad experiences with them. In this blog post, let’s talk about the good effects and an interesting case where we used them to reduce the CPU consumption of a big reporting query.</p>
<p><span id="more-1710"></span></p>
<p>In fact, the concerned query follows a common T-SQL anti-pattern for performance: a complex layer of nested views and CTEs, an interesting mix that maximizes the chances of preventing a clean execution plan. The SQL optimizer gets tricked easily in this case. So, for illustration, let’s start with the following query pattern:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">;WITH CTE1 AS (<br />
&nbsp; &nbsp; SELECT col ..., SUM(col2), ...<br />
&nbsp; &nbsp; FROM [VIEW]<br />
&nbsp; &nbsp; GROUP BY col ...<br />
),<br />
CTE2 AS (<br />
&nbsp; &nbsp; SELECT col ..., ROW_NUMBER() <br />
&nbsp; &nbsp; FROM (<br />
&nbsp; &nbsp; &nbsp; &nbsp; SELECT col ...<br />
&nbsp; &nbsp; &nbsp; &nbsp; JOIN CTE1 ON ...<br />
&nbsp; &nbsp; &nbsp; &nbsp; JOIN [VIEW2] ON ...<br />
&nbsp; &nbsp; &nbsp; &nbsp; JOIN [TABLE] ON ...<br />
&nbsp; &nbsp; ) AS VT <br />
),<br />
CTE3 AS (<br />
&nbsp; &nbsp; SELECT col ...<br />
&nbsp; &nbsp; FROM [VIEW]<br />
&nbsp; &nbsp; JOIN [VIEW4] ON ...<br />
)<br />
...<br />
SELECT col ...<br />
FROM (<br />
&nbsp; &nbsp; SELECT <br />
&nbsp; &nbsp; &nbsp; &nbsp; col,<br />
&nbsp; &nbsp; &nbsp; &nbsp; STUFF((SELECT ', ' + col <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;FROM CTE2 <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;WHERE CTE2.ID = CTE1.ID<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;FOR XML PATH('')), 1, 1, '') AS colconcat, &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; ...<br />
&nbsp; &nbsp; FROM (<br />
&nbsp; &nbsp; &nbsp; &nbsp; SELECT col ...<br />
&nbsp; &nbsp; &nbsp; &nbsp; FROM CTE1<br />
&nbsp; &nbsp; &nbsp; &nbsp; LEFT JOIN CTE2 ON ... &nbsp;<br />
&nbsp; &nbsp; &nbsp; &nbsp; LEFT JOIN (<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; SELECT col <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; FROM CTE3<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; GROUP BY col<br />
&nbsp; &nbsp; &nbsp; &nbsp; ) AS T1 ON ...<br />
&nbsp; &nbsp; ) AS T2 <br />
&nbsp; &nbsp; GROUP BY col ...<br />
)</div></div>
<p>Sometimes splitting a big query into small pieces and storing pre-aggregations in temporary tables may help. This is what was done, and it led to some good effects with a global reduction of the query execution time.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">CREATE TABLE #T1 ...<br />
CREATE TABLE #T2 ...<br />
CREATE TABLE #T3 ...<br />
<br />
<br />
;WITH CTE1 AS (<br />
&nbsp; &nbsp; SELECT col ..., SUM(col2), ...<br />
&nbsp; &nbsp; FROM [VIEW]<br />
&nbsp; &nbsp; GROUP BY col ...<br />
)<br />
INSERT INTO #T1 ...<br />
SELECT col FROM CTE1 ...<br />
;<br />
<br />
WITH CTE2 AS (<br />
&nbsp; &nbsp; SELECT col ..., ROW_NUMBER() <br />
&nbsp; &nbsp; FROM (<br />
&nbsp; &nbsp; &nbsp; &nbsp; SELECT col ...<br />
&nbsp; &nbsp; &nbsp; &nbsp; JOIN #T1 ON ...<br />
&nbsp; &nbsp; &nbsp; &nbsp; JOIN [VIEW2] ON ...<br />
&nbsp; &nbsp; &nbsp; &nbsp; JOIN [TABLE] ON ...<br />
&nbsp; &nbsp; ) AS VT <br />
)<br />
INSERT INTO #T2 ...<br />
SELECT col FROM CTE2 ...<br />
;<br />
<br />
WITH CTE3 AS (<br />
&nbsp; &nbsp; SELECT col ...<br />
&nbsp; &nbsp; FROM [VIEW]<br />
&nbsp; &nbsp; JOIN [VIEW4] ON ...<br />
)<br />
INSERT INTO #T3 ...<br />
SELECT col FROM CTE3 ...<br />
;<br />
<br />
<br />
SELECT col ...<br />
FROM (<br />
&nbsp; &nbsp; SELECT <br />
&nbsp; &nbsp; &nbsp; &nbsp; col,<br />
&nbsp; &nbsp; &nbsp; &nbsp; STUFF((SELECT ', ' + col <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;FROM CTE2 <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;WHERE CTE2.ID = CTE1.ID<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;FOR XML PATH('')), 1, 1, '') AS colconcat, &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; ...<br />
&nbsp; &nbsp; FROM (<br />
&nbsp; &nbsp; &nbsp; &nbsp; SELECT col ...<br />
&nbsp; &nbsp; &nbsp; &nbsp; FROM #T1<br />
&nbsp; &nbsp; &nbsp; &nbsp; LEFT JOIN #T2 ON ... &nbsp;<br />
&nbsp; &nbsp; &nbsp; &nbsp; LEFT JOIN (<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; SELECT col <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; FROM #T3<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; GROUP BY col<br />
&nbsp; &nbsp; &nbsp; &nbsp; ) AS T1 ON ...<br />
&nbsp; &nbsp; ) AS T2 <br />
&nbsp; &nbsp; GROUP BY col ...<br />
)</div></div>
<p>However, it was not enough, and the query continued to consume a lot of CPU time as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/11/169-1-profiler-performance-current-e1605891229645.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/11/169-1-profiler-performance-current-e1605891229645.jpg" alt="169 - 1 - profiler performance current" width="800" height="65" class="alignnone size-full wp-image-1711" /></a></p>
<p>CPU time was around 20s per execution; CPU time is greater than the duration because of parallelism. Depending on your environment, you might say such CPU time is common for reporting queries, and you’d probably be right. But in my context, where all reporting queries are offloaded to a secondary availability group replica (SQL Server 2017), we wanted to keep the read-only CPU footprint as low as possible to guarantee a safety margin of CPU resources for scenarios where all the traffic (both R/W and R/O queries) is redirected to the primary replica (maintenance, failure and so on). The concerned report is executed on demand by users and is a top contributor to high CPU spikes among the reporting queries, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/11/169-2-grafana-CPU-current.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/11/169-2-grafana-CPU-current.jpg" alt="169 - 2 - grafana CPU current" width="406" height="204" class="alignnone size-full wp-image-1712" /></a></p>
<p>Testing this query on the DEV environment gave the following execution statistics:</p>
<p>SQL Server Execution Times:<br />
   <strong>CPU time = 12988 ms,  elapsed time = 6084 ms.</strong><br />
SQL Server parse and compile time:<br />
   CPU time = 0 ms, elapsed time = 0 ms.</p>
<p>… with the related execution plan (actual, not estimated). In fact, I kept only the final SELECT step because it was the main culprit of the high CPU consumption for this query (the plan was anonymized by SQL Sentry Plan Explorer):</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/11/169-3-query-execution-plan-current.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/11/169-3-query-execution-plan-current-1024x149.jpg" alt="169 - 3 - query execution plan current" width="584" height="85" class="alignnone size-large wp-image-1713" /></a></p>
<p>The real content of the query doesn’t matter for this write-up, but you have probably noticed I explicitly showed the concatenation with the XML PATH construct previously, and I identified its execution path in the query plan above. This point will be important in the last section of this write-up. </p>
<p>First, because CPU is my main concern, I selected only the CPU cost, and you may notice the top consumers are the repartition streams and hash match operators, followed by the lazy spool used with XML PATH and the correlated subquery. </p>
<p>Then rewriting the query could have been a good option, but we first tried to find some quick wins to avoid spending too much time on refactoring. Focusing on the different branches of this query plan and on the operators engaged from right to left, we assumed that experimenting with <a href="https://techcommunity.microsoft.com/t5/sql-server/columnstore-index-performance-batchmode-execution/ba-p/385054" rel="noopener" target="_blank">batch mode</a> could help reduce the overall CPU time on the highlighted branch. But because we are not dealing with billions of rows in the temporary tables, we didn’t want the extra overhead of maintaining a compressed columnstore index structure. I remembered reading a very interesting <a href="https://www.itprotoday.com/sql-server/what-you-need-know-about-batch-mode-window-aggregate-operator-sql-server-2016-part-1" rel="noopener" target="_blank">article</a> from 2016 about creating dummy non-clustered columnstore indexes (NCCI) with filter capabilities to enable batch mode, and it seemed to fit our scenario perfectly. In parallel, we relied on inline index creation so as to neither trigger recompilation of the batch statement nor prevent temp table caching. The target is to save CPU time <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /></p>
<p>So, the temp table and inline non-clustered columnstore index DDL was as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">CREATE TABLE #T1 ( col ..., INDEX CCI_IDX_T1 NONCLUSTERED COLUMNSTORE (col) WHERE col &lt; 1 )<br />
CREATE TABLE #T2 ( col ..., INDEX CCI_IDX_T2 NONCLUSTERED COLUMNSTORE (col) WHERE col &lt; 1 )<br />
CREATE TABLE #T3 ( col ..., INDEX CCI_IDX_T3 NONCLUSTERED COLUMNSTORE (col) WHERE col &lt; 1 )<br />
…</div></div>
<p>Note the WHERE clause here with an out-of-range value to create an empty NCCI. </p>
<p>After applying the changes, here are the new execution statistics:</p>
<p>SQL Server Execution Times:<br />
   <strong>CPU time = 2842 ms,  elapsed time = 6536 ms.</strong><br />
SQL Server parse and compile time:<br />
   CPU time = 0 ms, elapsed time = 0 ms.</p>
<p>&#8230; and the related execution plan:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/11/169-4-query-execution-plan-first-optimization.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/11/169-4-query-execution-plan-first-optimization-1024x277.jpg" alt="169 - 4 - query execution plan first optimization" width="584" height="158" class="alignnone size-large wp-image-1715" /></a></p>
<p>A drop in CPU time consumption (2.8s vs 12s) per execution once batch mode kicked in. Good news for sure, but something kept drawing my attention: even though batch mode came into play, it was not propagated to the left and seemed to stop at the XML PATH step. After reading my <a href="http://www.nikoport.com/2018/10/12/batch-mode-part-4-some-of-the-limitations/" rel="noopener" target="_blank">preferred reference</a> on this topic (thank you Niko), I was able to confirm my suspicion that XML operations are not supported with batch mode. Unfortunately, I was out of luck confirming it with the <strong>column_store_expression_filter_apply</strong> extended event, which did not seem to work for me. </p>
<p>Well, to allow the propagation of batch mode to the left side of the execution plan, it was necessary to rewrite the correlated subquery with XML PATH as a simple JOIN plus the STRING_AGG() function – available since SQL Server 2017:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Concat with XML PATH<br />
SELECT <br />
&nbsp; &nbsp; col,<br />
&nbsp; &nbsp; STUFF((SELECT ', ' + col <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; FROM CTE2 <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; WHERE CTE2.col = CTE1.col<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; FOR XML PATH('')), 1, 1, '') AS colconcat,<br />
&nbsp; &nbsp; ...<br />
FROM [TABLE]<br />
<br />
-- Concat with STRING_AGG<br />
SELECT <br />
&nbsp; &nbsp; col,<br />
&nbsp; &nbsp; V.colconcat,<br />
&nbsp; &nbsp; ...<br />
FROM [TABLE] AS T<br />
JOIN (<br />
&nbsp; &nbsp; SELECT <br />
&nbsp; &nbsp; &nbsp; &nbsp; col,<br />
&nbsp; &nbsp; &nbsp; &nbsp; STRING_AGG(col2, ', ') AS colconcat<br />
&nbsp; &nbsp; FROM #T2 <br />
&nbsp; &nbsp; GROUP BY col<br />
) AS V ON V.col = T.col</div></div>
<p>The new change gave this following outcome:</p>
<p>SQL Server Execution Times:<br />
   <strong>CPU time = 2109 ms,  elapsed time = 1872 ms</strong>.</p>
<p>and new execution plan:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/11/169-5-query-execution-plan-2n-optimization.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/11/169-5-query-execution-plan-2n-optimization-1024x189.jpg" alt="169 - 5 - query execution plan 2n optimization" width="584" height="108" class="alignnone size-large wp-image-1719" /></a></p>
<p>First, batch mode is now propagated from the right to the left of the query execution plan because we eliminated all inhibitors, including the XML construct. We got no real CPU reduction this time, but we managed to reduce the global execution time. The hash match aggregate operator is the main CPU consumer and the main candidate to benefit from batch mode. All remaining operators on the left side process few rows, and my guess is that batch mode brings less benefit there than for the main consumer. But anyway, note we also got rid of the lazy spool operator by refactoring the XML PATH correlated subquery into the STRING_AGG() and JOIN construct.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/11/169-6-profiler-performance-optimization-e1605891681487.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/11/169-6-profiler-performance-optimization-1024x59.jpg" alt="169 - 6 - profiler performance optimization" width="584" height="34" class="alignnone size-large wp-image-1716" /></a></p>
<p>The new result is better by far compared to the initial scenario (new CPU time: 3s vs old CPU time: 20s). It also had a good effect on the overall workload of the AG read-only replica:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/11/169-7-grafana-CPU-optimization-e1605891720209.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/11/169-7-grafana-CPU-optimization-1024x199.jpg" alt="169 - 7 - grafana CPU optimization" width="584" height="113" class="alignnone size-large wp-image-1717" /></a></p>
<p>Not so bad for a quick win!<br />
See you</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Universal usage of NVARCHAR type and performance impact</title>
		<link>https://blog.developpez.com/mikedavem/p13195/sql-server-vnext/universal-usage-of-nvarchar-type-and-performance-impact</link>
		<comments>https://blog.developpez.com/mikedavem/p13195/sql-server-vnext/universal-usage-of-nvarchar-type-and-performance-impact#comments</comments>
		<pubDate>Wed, 27 May 2020 17:06:24 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[convert_implicit]]></category>
		<category><![CDATA[nvarchar]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[Query Store]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1604</guid>
<description><![CDATA[A couple of weeks ago, I read an article from Brent Ozar about using NVARCHAR as a universal parameter. It was a good reminder and from my experience, I confirm this habit has never been a good idea. Although it depends &#8230; <a href="https://blog.developpez.com/mikedavem/p13195/sql-server-vnext/universal-usage-of-nvarchar-type-and-performance-impact">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A couple of weeks ago, I read an <a href="https://www.brentozar.com/archive/2020/04/can-you-use-nvarchar-as-a-universal-parameter-almost/" rel="noopener" target="_blank">article</a> from Brent Ozar about using NVARCHAR as a universal parameter. It was a good reminder and, from my experience, I can confirm this habit has never been a good idea. Although it depends on the context, chances are you will almost always find an exception that proves the rule. </p>
<p><span id="more-1604"></span></p>
<p>A couple of days ago, I fell into a situation that perfectly illustrated this issue and, in this blog post, I decided to share my experience and demonstrate what the impact may be in a real production scenario.<br />
So, let’s start with the culprit. I voluntarily masked some contextual information but the principle is here. The query is pretty simple:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">DECLARE @P0 DATETIME <br />
DECLARE @P1 INT<br />
DECLARE @P2 NVARCHAR(4000) <br />
DECLARE @P3 DATETIME <br />
DECLARE @P4 NVARCHAR(4000)<br />
<br />
UPDATE TABLE SET DATE = @P0<br />
WHERE ID = @P1<br />
&nbsp;AND IDENTIFIER = @P2<br />
&nbsp;AND P_DATE &gt;= @P3<br />
&nbsp;AND W_O_ID = (<br />
&nbsp; &nbsp;SELECT TOP 1 ID FROM TABLE2<br />
&nbsp; &nbsp;WHERE Identifier = @P4<br />
&nbsp; &nbsp;ORDER BY ID DESC)</div></div>
<p>And the corresponding execution plan: </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-1-excution_plan_with_implicit_conversion-e1590596511773.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-1-excution_plan_with_implicit_conversion-e1590596511773.jpg" alt="162 - 1 - excution_plan_with_implicit_conversion" width="1000" height="380" class="alignnone size-full wp-image-1605" /></a></p>
<p>The most interesting part concerns the TABLE2 table. As you may notice, the @P4 input parameter type is NVARCHAR and we get a CONVERT_IMPLICIT in the concerned Predicate section above. The CONVERT_IMPLICIT function is required because of <a href="https://docs.microsoft.com/en-us/sql/t-sql/data-types/data-type-precedence-transact-sql?view=sql-server-ver15" rel="noopener" target="_blank">data type precedence</a>. It results in a costly operator that scans all the data of TABLE2. As you probably know, CONVERT_IMPLICIT makes the condition non-sargable, while a seek is normally what we could expect here, referring to the value distribution in the statistics histogram and the underlying index on the Identifier column.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">EXEC sp_helpindex 'TABLE2';</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-8-index-config.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-8-index-config.jpg" alt="162 - 8 - index config" width="1035" height="135" class="alignnone size-full wp-image-1606" /></a></p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">DBCC SHOW_STATISTICS ('TABLE2', 'IX___IDENTIFIER')<br />
WITH HISTOGRAM;</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-10-histogram-stats.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-10-histogram-stats.jpg" alt="162 - 10 - histogram stats" width="877" height="435" class="alignnone size-full wp-image-1620" /></a></p>
<p>Another important point to keep in mind is that scanning all the data of the TABLE2 table (&gt; 1GB) comes at a certain cost, even if the data resides in memory.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">EXEC sp_spaceused 'TABLE2'</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-9-index-space-used.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-9-index-space-used.jpg" alt="162 - 9 - index space used" width="725" height="62" class="alignnone size-full wp-image-1607" /></a></p>
<p>The execution plan warning confirms the potential overhead of retrieving a few rows from the TABLE2 table:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-2-excution_plan_with_implicit_conversion-arning.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-2-excution_plan_with_implicit_conversion-arning.jpg" alt="162 - 2 - excution_plan_with_implicit_conversion arning" width="1160" height="108" class="alignnone size-full wp-image-1608" /></a></p>
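<p>To picture why this matters: a sargable predicate lets SQL Server probe the index b-tree (a logarithmic seek), while CONVERT_IMPLICIT applied to the column side forces a pass over every row. A rough analogy in Python (not SQL Server code, just binary search versus a linear scan over a sorted list):</p>

```python
import bisect

# A sorted "index" on one hundred thousand identifier values.
identifiers = sorted("ID%07d" % i for i in range(100_000))

def seek(key):
    # Sargable: O(log n) probe into the sorted structure, like an index seek.
    i = bisect.bisect_left(identifiers, key)
    return i < len(identifiers) and identifiers[i] == key

def scan(key):
    # Non-sargable (think CONVERT_IMPLICIT on the column): every value is read.
    return any(v == key for v in identifiers)
```

<p>Both functions return the same answer; only the amount of data touched differs, which is exactly the seek versus scan difference visible in the execution plans.</p>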
<p>To set the context a little bit more, the concerned application queries are mainly based on JDBC prepared statements, which implies using NVARCHAR(4000) for string parameters regardless of the column type in the database (VARCHAR / NVARCHAR). This is at least what I noticed during my investigations. </p>
<p>So, what? Well, in our DEV environment the impact was imperceptible. We had interesting discussions with the DEV team on this topic, and we basically need to improve awareness and visibility in this area (another discussion and probably another blog post) … </p>
<p>But chances are your PROD environment will tell you a different story when it comes to a bigger workload and concurrent query executions. In my context, from an infrastructure standpoint, the symptom was an abnormal increase in CPU consumption a couple of days ago. Usually, CPU consumption was roughly 20% up to 30%. In fact, the issue had been around for a longer period, but we didn’t catch it due to a &laquo;&nbsp;normal&nbsp;&raquo; CPU footprint on this server. </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-3-SQL-Processor-dashboard.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-3-SQL-Processor-dashboard.jpg" alt="162 - 3 - SQL Processor dashboard" width="668" height="631" class="alignnone size-full wp-image-1612" /></a></p>
<p>So, what happened here? We&rsquo;re using SQL Server 2017 with Query Store enabled on the concerned database. This feature came to the rescue and brought attention to the first clue: a query plan regression that led to increased IO consumption in the second case (and implicitly additional CPU resource consumption as well).</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-4-QS-regression-plan-e1590597560224.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-4-QS-regression-plan-e1590597560224.jpg" alt="162 - 4 - QS regression plan" width="1000" height="575" class="alignnone size-full wp-image-1613" /></a></p>
<p>You have probably noticed that both execution plans use an index scan on the right, but the more expensive one (at the bottom) uses a different index strategy: instead of the primary key and clustered index (PK_xxx), the second query execution plan uses the non-clustered index on the Identifier column (IX_xxx_Identifier), with the same CONVERT_IMPLICIT issue. </p>
<p>According to the Query Store statistics, the number of executions per business day is roughly 25000, with ~ 8.5H of CPU time consumed during this period (18.05.2020 – 26.05.2020). That is a very different order of magnitude compared to what we may have in the DEV environment <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /></p>
<p>At this stage, I would say investigating why a plan regression occurred doesn’t really matter, because in both cases the most expensive operator is an index scan and, again, we expect an index seek. Getting rid of the implicit conversion by using the VARCHAR type to make the conditional clause sargable was the better option for us. Thus, the execution plan becomes:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-7-Execution-plan-with-seek-e1590597830429.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-7-Execution-plan-with-seek-e1590597830429.jpg" alt="162 - 7 - Execution plan with seek" width="1000" height="158" class="alignnone size-full wp-image-1615" /></a></p>
<p>The first workaround in mind was to force the better plan in the Query Store (automatic tuning with FORCE_LAST_GOOD_PLAN = ON is disabled), but having discussed this point with the DEV team, we managed to deploy a fix very fast to address the issue and to drastically reduce the CPU consumption on this SQL Server instance, as shown below. The picture is self-explanatory: </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-6-SQL-Processor-dashboard-after-optimization-e1590597880869.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-6-SQL-Processor-dashboard-after-optimization-e1590597880869.jpg" alt="162 - 6 - SQL Processor dashboard after optimization" width="1000" height="460" class="alignnone size-full wp-image-1616" /></a></p>
<p>The fix consisted of adding a CAST / CONVERT function on the right side of the equality (the parameter, not the column) to avoid side effects on the JDBC driver. Therefore, we get another version of the query and a different query hash as well. The query update is pretty similar to the following one:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">DECLARE @P0 DATETIME <br />
DECLARE @P1 INT<br />
DECLARE @P2 NVARCHAR(4000) <br />
DECLARE @P3 DATETIME <br />
DECLARE @P4 NVARCHAR(4000)<br />
<br />
UPDATE TABLE SET DATE = @P0<br />
WHERE ID = @P1<br />
&nbsp;AND IDENTIFIER = CAST(@P2 AS varchar(50))<br />
&nbsp;AND P_DATE &gt;= @P3<br />
&nbsp;AND W_O_ID = (<br />
&nbsp; &nbsp;SELECT TOP 1 ID FROM TABLE2<br />
&nbsp; &nbsp;WHERE Identifier = CAST(@P4 AS varchar(50))<br />
&nbsp; &nbsp;ORDER BY ID DESC)</div></div>
<p>Sometime later, we gathered Query Store statistics for both the former and the new query to confirm the performance improvement, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-5-QS-stats-after-optimization.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-5-QS-stats-after-optimization.jpg" alt="162 - 5 - QS stats after optimization" width="923" height="98" class="alignnone size-full wp-image-1617" /></a></p>
<p>Finally, changing the data type enabled an index seek operator, drastically reducing SQL Server CPU consumption and logical read operations. </p>
<p>QED!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SQL Server on Linux and new FUA support for XFS filesystem</title>
		<link>https://blog.developpez.com/mikedavem/p13193/sql-server-vnext/sql-server-on-linux-and-new-fua-support-for-xfs-filesystem</link>
		<comments>https://blog.developpez.com/mikedavem/p13193/sql-server-vnext/sql-server-on-linux-and-new-fua-support-for-xfs-filesystem#comments</comments>
		<pubDate>Mon, 13 Apr 2020 17:34:32 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[blktrace]]></category>
		<category><![CDATA[FUA]]></category>
		<category><![CDATA[iostats]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[xfs]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1568</guid>
<description><![CDATA[I wrote a (dbi services) blog post concerning Linux and SQL Server IO behavior changes before and after SQL Server 2017 CU6. Now, I was looking forward to seeing some new improvements with Force Unit Access (FUA) that was implemented with &#8230; <a href="https://blog.developpez.com/mikedavem/p13193/sql-server-vnext/sql-server-on-linux-and-new-fua-support-for-xfs-filesystem">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I wrote a (dbi services) <a href="https://blog.dbi-services.com/sql-server-on-linux-io-internal-thoughts/" rel="noopener" target="_blank">blog post</a> concerning Linux and SQL Server IO behavior changes before and after SQL Server 2017 CU6. Now, I was looking forward to seeing some new improvements with Force Unit Access (FUA), which was implemented with the Linux XFS enhancements since kernel 4.18.</p>
<p><span id="more-1568"></span></p>
<p>As a reminder, SQL Server 2017 CU6 added a way to guarantee data durability by using the &laquo;&nbsp;forced flush&nbsp;&raquo; mechanism explained <a href="https://support.microsoft.com/en-us/help/4131496/enable-forced-flush-mechanism-in-sql-server-2017-on-linux" rel="noopener" target="_blank">here</a>. To cut a long story short, SQL Server has strict storage requirements such as write ordering and FUA, and things go differently on Linux than on Windows to achieve durability. What is FUA and why is it important for SQL Server? From <a href="https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA)" rel="noopener" target="_blank">Wikipedia</a>: Force Unit Access (aka FUA) is an I/O write command option that forces written data all the way to stable storage. FUA appeared in the SCSI command set but, good news, it was later adopted by other standards over time. SQL Server relies on it to meet its WAL and ACID requirements. </p>
<p>In the Linux world, before kernel 4.18, FUA was handled and optimized only for filesystem journaling. Data writes, however, always went through the multi-step flush process, which could introduce IO slowness for SQL Server (issue the write to the block device for the data, then issue a block device flush to ensure durability with O_DSYNC). </p>
<p>In the Windows world, installing and using a SQL Server instance assumes you are compliant with the Microsoft storage requirements, and therefore the first RTM version shipped on Linux came only with O_DIRECT, assuming you already ensured that SQL Server IOs can be written directly to non-volatile storage through the kernel, drivers and hardware before the acknowledgement. The forced flush mechanism &#8211; based on fdatasync() &#8211; was then introduced to address scenarios without safe DIRECT_IO capabilities. </p>
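<p>The two durability paths can be sketched in a few lines of Python (an illustration of the system calls involved, not SQL Server internals; O_DSYNC is only available on POSIX systems):</p>

```python
import os

def write_forced_flush(path, data):
    # Forced flush mechanism: the write is followed by an explicit
    # fdatasync() to push the data to stable storage (two steps).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fdatasync(fd)
    finally:
        os.close(fd)

def write_o_dsync(path, data):
    # O_DSYNC path: the write request itself carries the durability
    # requirement, so no separate flush call is needed (one step). With
    # kernel >= 4.18 on XFS, such a write can complete with a FUA command
    # instead of a full device cache flush.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```

<p>Shortening the second path is exactly what the XFS FUA enhancement is about: one IO request to stable storage instead of a write plus a flush.</p>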
<p>But referring to Bob Dorr&rsquo;s <a href="https://bobsql.com/sql-server-on-linux-forced-unit-access-fua-internals/" rel="noopener" target="_blank">article</a>, Linux kernel 4.18 comes with XFS enhancements to handle FUA for data storage, and this is obviously of benefit to SQL Server. FUA support is intended to improve write requests by shortening their path, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-1-IO-worklow-e1586796506268.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-1-IO-worklow-e1586796506268.jpg" alt="160 - 1 - IO worklow" width="1000" height="539" class="alignnone size-full wp-image-1569" /></a></p>
<p><em>Picture from existing IO workflow on Bob Dorr&rsquo;s article</em></p>
<p>This is an interesting improvement for write-intensive workloads, and it seems to be confirmed by the tests performed by Microsoft and by Bob Dorr in his article. </p>
<p>Let’s begin the experiment with my lab environment, based on CentOS 7 on Hyper-V with an upgraded kernel version: 5.6.3-1.el7.elrepo.x86_64.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$uname -r<br />
5.6.3-1.el7.elrepo.x86_64<br />
<br />
$cat /etc/os-release | grep VERSION<br />
VERSION=&quot;7 (Core)&quot;<br />
VERSION_ID=&quot;7&quot;<br />
CENTOS_MANTISBT_PROJECT_VERSION=&quot;7&quot;<br />
REDHAT_SUPPORT_PRODUCT_VERSION=&quot;7&quot;</div></div>
<p>Let me point out that my tests are purely experimental; instead of upgrading the kernel to a newer version, you may directly rely on RHEL 8 based distros, which come with kernel version 4.18, for example.</p>
<p>My lab environment includes 2 separate SSD disks to host the DATA + TLOG database files as follows:</p>
<p>I:\ drive : SQL Data volume (sdb – XFS filesystem)<br />
T:\ drive : SQL TLog volume (sda – XFS filesystem)</p>
<p>The general performance is not so bad <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-6-diskmark-tests-storage-env-e1586796679451.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-6-diskmark-tests-storage-env-e1586796679451.jpg" alt="160 - 6 - diskmark tests storage env" width="1000" height="362" class="alignnone size-full wp-image-1571" /></a></p>
<p>Initially I dedicated just one disk to both SQL DATA and TLOG, but I quickly noticed some IO waits (iostat output) that made me unconfident in my test results:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-3-iostats-before-optimization.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-3-iostats-before-optimization.jpg" alt="160 - 3 - iostats before optimization" width="975" height="447" class="alignnone size-full wp-image-1572" /></a></p>
<p>Spreading the IO over physically separate volumes helped reduce these waits drastically afterwards:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-4-iostats-after-optimization.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-4-iostats-after-optimization.jpg" alt="160 - 4 - iostats after optimization" width="984" height="531" class="alignnone size-full wp-image-1573" /></a> </p>
<p>First, I enabled FUA capabilities on Hyper-V side as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Set-VMHardDiskDrive -VMName CENTOS7 -ControllerType SCSI -OverrideCacheAttributes WriteCacheAndFUAEnabled<br />
<br />
Get-VMHardDiskDrive -VMName CENTOS7 | `<br />
&nbsp; &nbsp; ft VMName, ControllerType, &nbsp;ControllerLocation, Path, WriteHardeningMethod -AutoSize</div></div>
<p>Then I checked whether FUA is enabled and supported from an OS perspective for the sda (TLOG) and sdb (SQL DATA) disks:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ lsblk -f<br />
NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;FSTYPE &nbsp; &nbsp; &nbsp;LABEL UUID &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; MOUNTPOINT<br />
sdb<br />
└─sdb1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;xfs &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 06910f69-27a3-4711-9093-f8bf80d15d72 &nbsp; /sqldata<br />
sr0<br />
sda<br />
├─sda2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;xfs &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; f5a9bded-130f-4642-bd6f-9f27563a4e16 &nbsp; /boot<br />
├─sda3 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;LVM2_member &nbsp; &nbsp; &nbsp; QsbKEt-28yT-lpfZ-VCbj-v5W5-vnVr-2l7nih<br />
│ ├─centos-swap swap &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;7eebbb32-cef5-42e9-87c3-7df1a0b79f11 &nbsp; [SWAP]<br />
│ └─centos-root xfs &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 90f6eb2f-dd39-4bef-a7da-67aa75d1843d &nbsp; /<br />
└─sda1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;vfat &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;7529-979E &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;/boot/efi<br />
<br />
$ dmesg | grep sda<br />
[ &nbsp; &nbsp;1.665478] sd 0:0:0:0: [sda] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)<br />
[ &nbsp; &nbsp;1.665479] sd 0:0:0:0: [sda] 4096-byte physical blocks<br />
[ &nbsp; &nbsp;1.665774] sd 0:0:0:0: [sda] Write Protect is off<br />
[ &nbsp; &nbsp;1.665775] sd 0:0:0:0: [sda] Mode Sense: 0f 00 10 00<br />
[ &nbsp; &nbsp;1.670321] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA<br />
[ &nbsp; &nbsp;1.683833] &nbsp;sda: sda1 sda2 sda3<br />
[ &nbsp; &nbsp;1.708938] sd 0:0:0:0: [sda] Attached SCSI disk<br />
[ &nbsp; &nbsp;5.607914] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)</div></div>
<p>Finally, according to the documentation, I configured <strong>trace flag 3979</strong> and the <strong>control.alternatewritethrough=0</strong> parameter in the startup parameters of my SQL Server instance.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ /opt/mssql/bin/mssql-conf traceflag 3979 on<br />
<br />
$ /opt/mssql/bin/mssql-conf set control.alternatewritethrough 0<br />
<br />
$ systemctl restart mssql-server</div></div>
<p>The first test I performed was pretty similar to those in my previous (dbi services) <a href="https://blog.dbi-services.com/sql-server-on-linux-io-internal-thoughts/">blog post</a>.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">CREATE TABLE dummy_test (<br />
&nbsp; &nbsp; id INT IDENTITY,<br />
&nbsp; &nbsp; col1 VARCHAR(2000) DEFAULT REPLICATE('T', 2000)<br />
);<br />
<br />
INSERT INTO dummy_test DEFAULT VALUES;<br />
GO 67</div></div>
<p>For the sake of curiosity, I looked at the corresponding strace output:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ cat sql_strace_fua.txt<br />
% time &nbsp; &nbsp; seconds &nbsp;usecs/call &nbsp; &nbsp; calls &nbsp; &nbsp;errors syscall<br />
------ ----------- ----------- --------- --------- ----------------<br />
&nbsp;78.13 &nbsp;360.618066 &nbsp; &nbsp; &nbsp; 61739 &nbsp; &nbsp; &nbsp;5841 &nbsp; &nbsp; &nbsp;2219 futex<br />
&nbsp; 6.88 &nbsp; 31.731833 &nbsp; &nbsp; 1511040 &nbsp; &nbsp; &nbsp; &nbsp;21 &nbsp; &nbsp; &nbsp; &nbsp;15 restart_syscall<br />
&nbsp; 3.81 &nbsp; 17.592176 &nbsp; &nbsp; &nbsp;130312 &nbsp; &nbsp; &nbsp; 135 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; io_getevents<br />
&nbsp; 2.95 &nbsp; 13.607314 &nbsp; &nbsp; &nbsp; 98604 &nbsp; &nbsp; &nbsp; 138 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; epoll_wait<br />
&nbsp; 2.88 &nbsp; 13.313667 &nbsp; &nbsp; &nbsp;633984 &nbsp; &nbsp; &nbsp; &nbsp;21 &nbsp; &nbsp; &nbsp; &nbsp;21 rt_sigtimedwait<br />
&nbsp; 2.60 &nbsp; 11.997925 &nbsp; &nbsp; 1333103 &nbsp; &nbsp; &nbsp; &nbsp; 9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; nanosleep<br />
&nbsp; 1.79 &nbsp; &nbsp;8.279781 &nbsp; &nbsp; &nbsp; &nbsp; 242 &nbsp; &nbsp; 34256 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; gettid<br />
&nbsp; 0.84 &nbsp; &nbsp;3.876021 &nbsp; &nbsp; &nbsp; &nbsp; 226 &nbsp; &nbsp; 17124 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; getcpu<br />
&nbsp; 0.03 &nbsp; &nbsp;0.138836 &nbsp; &nbsp; &nbsp; &nbsp; 347 &nbsp; &nbsp; &nbsp; 400 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sched_yield<br />
&nbsp; 0.01 &nbsp; &nbsp;0.062348 &nbsp; &nbsp; &nbsp; &nbsp; 254 &nbsp; &nbsp; &nbsp; 245 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; getrusage<br />
&nbsp; 0.01 &nbsp; &nbsp;0.056065 &nbsp; &nbsp; &nbsp; &nbsp; 406 &nbsp; &nbsp; &nbsp; 138 &nbsp; &nbsp; &nbsp; &nbsp;69 readv<br />
&nbsp; 0.01 &nbsp; &nbsp;0.038107 &nbsp; &nbsp; &nbsp; &nbsp; 343 &nbsp; &nbsp; &nbsp; 111 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; read<br />
&nbsp; 0.01 &nbsp; &nbsp;0.037883 &nbsp; &nbsp; &nbsp; &nbsp; 743 &nbsp; &nbsp; &nbsp; &nbsp;51 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; mmap<br />
&nbsp; 0.01 &nbsp; &nbsp;0.037498 &nbsp; &nbsp; &nbsp; &nbsp; 180 &nbsp; &nbsp; &nbsp; 208 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; epoll_ctl<br />
&nbsp; 0.01 &nbsp; &nbsp;0.035654 &nbsp; &nbsp; &nbsp; &nbsp; 517 &nbsp; &nbsp; &nbsp; &nbsp;69 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; writev<br />
&nbsp; 0.01 &nbsp; &nbsp;0.025542 &nbsp; &nbsp; &nbsp; &nbsp; 370 &nbsp; &nbsp; &nbsp; &nbsp;69 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; io_submit<br />
&nbsp; 0.00 &nbsp; &nbsp;0.019760 &nbsp; &nbsp; &nbsp; &nbsp; 282 &nbsp; &nbsp; &nbsp; &nbsp;70 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; write<br />
&nbsp; 0.00 &nbsp; &nbsp;0.019555 &nbsp; &nbsp; &nbsp; &nbsp; 477 &nbsp; &nbsp; &nbsp; &nbsp;41 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; open<br />
&nbsp; 0.00 &nbsp; &nbsp;0.016285 &nbsp; &nbsp; &nbsp; &nbsp;1629 &nbsp; &nbsp; &nbsp; &nbsp;10 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; rt_sigaction<br />
&nbsp; 0.00 &nbsp; &nbsp;0.012359 &nbsp; &nbsp; &nbsp; &nbsp; 301 &nbsp; &nbsp; &nbsp; &nbsp;41 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; close<br />
&nbsp; 0.00 &nbsp; &nbsp;0.010069 &nbsp; &nbsp; &nbsp; &nbsp; 205 &nbsp; &nbsp; &nbsp; &nbsp;49 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; munmap<br />
&nbsp; 0.00 &nbsp; &nbsp;0.006977 &nbsp; &nbsp; &nbsp; &nbsp; 303 &nbsp; &nbsp; &nbsp; &nbsp;23 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; rt_sigprocmask<br />
&nbsp; 0.00 &nbsp; &nbsp;0.006256 &nbsp; &nbsp; &nbsp; &nbsp; 153 &nbsp; &nbsp; &nbsp; &nbsp;41 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; fstat<br />
&nbsp; 0.00 &nbsp; &nbsp;0.004646 &nbsp; &nbsp; &nbsp; &nbsp; 465 &nbsp; &nbsp; &nbsp; &nbsp;10 &nbsp; &nbsp; &nbsp; &nbsp;10 stat<br />
&nbsp; 0.00 &nbsp; &nbsp;0.000860 &nbsp; &nbsp; &nbsp; &nbsp; 215 &nbsp; &nbsp; &nbsp; &nbsp; 4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; madvise<br />
&nbsp; 0.00 &nbsp; &nbsp;0.000321 &nbsp; &nbsp; &nbsp; &nbsp; 161 &nbsp; &nbsp; &nbsp; &nbsp; 2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sched_setaffinity<br />
&nbsp; 0.00 &nbsp; &nbsp;0.000295 &nbsp; &nbsp; &nbsp; &nbsp; 148 &nbsp; &nbsp; &nbsp; &nbsp; 2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; set_robust_list<br />
&nbsp; 0.00 &nbsp; &nbsp;0.000281 &nbsp; &nbsp; &nbsp; &nbsp; 141 &nbsp; &nbsp; &nbsp; &nbsp; 2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; clone<br />
&nbsp; 0.00 &nbsp; &nbsp;0.000236 &nbsp; &nbsp; &nbsp; &nbsp; 118 &nbsp; &nbsp; &nbsp; &nbsp; 2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sigaltstack<br />
&nbsp; 0.00 &nbsp; &nbsp;0.000093 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;47 &nbsp; &nbsp; &nbsp; &nbsp; 2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; arch_prctl<br />
&nbsp; 0.00 &nbsp; &nbsp;0.000046 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;23 &nbsp; &nbsp; &nbsp; &nbsp; 2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sched_getaffinity<br />
------ ----------- ----------- --------- --------- ----------------<br />
100.00 &nbsp;461.546755 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 59137 &nbsp; &nbsp; &nbsp;2334 total</div></div>
<p>… And as I expected, with FUA enabled there are no fsync() / fdatasync() calls anymore: writing to stable storage is achieved directly by FUA commands. Now iomap_dio_rw() determines whether REQ_FUA can be used and whether issuing generic_write_sync() is still necessary. To dig further into the IO layer, we need to rely on another tool, blktrace (mentioned in Bob Dorr&rsquo;s article as well).</p>
<p>In my case I got two different pictures of blktrace output: one with the forced flush mechanism (the default) and one with FUA-oriented IO:</p>
<p>-&gt; With forced flush</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">34.694734500 &nbsp; &nbsp; &nbsp;14225 18425192 &nbsp; &nbsp; 8,16 &nbsp; 0 &nbsp; &nbsp;17164 &nbsp;A &nbsp;WS &nbsp; &nbsp; &nbsp; 2048 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
34.694735000 &nbsp; &nbsp; &nbsp;14225 18425192 &nbsp; &nbsp; 8,16 &nbsp; 0 &nbsp; &nbsp;17165 &nbsp;Q &nbsp;WS &nbsp; &nbsp; &nbsp; 2048 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
34.694737000 &nbsp; &nbsp; &nbsp;14225 18425192 &nbsp; &nbsp; 8,16 &nbsp; 0 &nbsp; &nbsp;17166 &nbsp;X &nbsp;WS &nbsp; &nbsp; &nbsp; 1024 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
34.694738100 &nbsp; &nbsp; &nbsp;14225 18425192 &nbsp; &nbsp; 8,16 &nbsp; 0 &nbsp; &nbsp;17167 &nbsp;G &nbsp;WS &nbsp; &nbsp; &nbsp; 1024 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
34.694739800 &nbsp; &nbsp; &nbsp;14225 18426216 &nbsp; &nbsp; 8,16 &nbsp; 0 &nbsp; &nbsp;17169 &nbsp;G &nbsp;WS &nbsp; &nbsp; &nbsp; 1024 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
34.694740900 &nbsp; &nbsp; &nbsp;14225 18425192 &nbsp; &nbsp; 8,16 &nbsp; 0 &nbsp; &nbsp;17171 &nbsp;D &nbsp;WS &nbsp; &nbsp; &nbsp; 1024 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
34.694747200 &nbsp; &nbsp; &nbsp;14225 18426216 &nbsp; &nbsp; 8,16 &nbsp; 0 &nbsp; &nbsp;17174 &nbsp;D &nbsp;WS &nbsp; &nbsp; &nbsp; 1024 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
34.713665000 &nbsp; &nbsp; &nbsp;14225 0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;8,16 &nbsp; 0 &nbsp; &nbsp;17175 &nbsp;Q FWS &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
34.713668100 &nbsp; &nbsp; &nbsp;14225 0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;8,16 &nbsp; 0 &nbsp; &nbsp;17176 &nbsp;G FWS &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr</div></div>
<p>WS (Write Synchronous) is performed, but SQL Server still needs to go through the multi-step flush process with the additional FWS (PREFLUSH|WRITE|SYNC) requests.</p>
<p>-&gt; FUA</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">0.000000000 &nbsp; &nbsp; &nbsp;16305 55106536 &nbsp; &nbsp; 8,0 &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp;1 &nbsp;A WFS &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
0.000000400 &nbsp; &nbsp; &nbsp;16305 57615336 &nbsp; &nbsp; 8,0 &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp;2 &nbsp;A WFS &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
0.000001100 &nbsp; &nbsp; &nbsp;16305 57615336 &nbsp; &nbsp; 8,0 &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp;3 &nbsp;Q WFS &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
0.000005200 &nbsp; &nbsp; &nbsp;16305 57615336 &nbsp; &nbsp; 8,0 &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp;4 &nbsp;G WFS &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr<br />
0.001377800 &nbsp; &nbsp; &nbsp;16305 55106544 &nbsp; &nbsp; 8,0 &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp;6 &nbsp;A WFS &nbsp; &nbsp; &nbsp; &nbsp; 16 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sqlservr</div></div>
<p>FWS has disappeared, leaving only WFS commands, which are basically <strong>REQ_WRITE combined with the REQ_FUA flag</strong>.</p>
<p>I spent some time reading interesting discussions in addition to Bob Dorr&rsquo;s wonderful article. Here is an interesting <a href="https://lkml.org/lkml/2019/12/3/316" rel="noopener" target="_blank">pointer</a> to a discussion about REQ_FUA, for instance.</p>
<p><strong>But what about performance gain? </strong></p>
<p>I had two simple scenarios to play with in order to bring out FUA&rsquo;s helpfulness: hardening the dirty pages in the buffer pool through the checkpoint process, and hardening the log buffer to disk during the commit phase. When the forced-flush method is used, each component relies on an additional FlushFileBuffers() call to achieve durability. This can easily be tracked from an XE session including the <strong>flush_file_buffers</strong> and <strong>make_writes_durable</strong> events.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-1-1-flushfilebuffers-worklflow.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-1-1-flushfilebuffers-worklflow.jpg" alt="160 - 1 - 1 - flushfilebuffers worklflow" width="839" height="505" class="alignnone size-full wp-image-1575" /></a></p>
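<p>As a minimal sketch of such a session (the two event names come straight from the scenario above; the session name and the event_counter target are my own choices), one could use:</p>

```sql
-- Hypothetical XE session counting forced-flush related events
CREATE EVENT SESSION trace_forced_flush ON SERVER
ADD EVENT sqlserver.flush_file_buffers,
ADD EVENT sqlserver.make_writes_durable
ADD TARGET package0.event_counter;  -- aggregated counts are enough here
GO
ALTER EVENT SESSION trace_forced_flush ON SERVER STATE = START;
```

<p>With FUA enabled, the counters for both events should drop to (or near) zero, since the extra flush calls are no longer issued.</p>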
<p><strong>First scenario (10K inserts within a transaction and checkpoint)</strong></p>
<p>In this scenario my intention was to stress the checkpoint process with a bunch of buffers and dirty pages to flush to disk when it kicks in.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">USE dummy;<br />
<br />
SET NOCOUNT ON;<br />
-- Disable checkpoint to control when it will kick in<br />
DBCC TRACEON(3505);<br />
-- Check traceflag<br />
DBCC TRACESTATUS;<br />
<br />
DECLARE @i INT = 0;<br />
DECLARE @iteration INT = 0;<br />
DECLARE @start_upd DATETIME;<br />
DECLARE @start_chkpt DATETIME;<br />
DECLARE @end_upd DATETIME;<br />
DECLARE @end_chkpt DATETIME;<br />
<br />
TRUNCATE TABLE dummy_test;<br />
<br />
WHILE @iteration &lt; 251<br />
BEGIN<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; SET @start_upd = GETDATE();<br />
<br />
&nbsp; &nbsp; BEGIN TRAN;<br />
<br />
&nbsp; &nbsp; WHILE @i &lt;= 10000<br />
&nbsp; &nbsp; BEGIN<br />
&nbsp; &nbsp; &nbsp; &nbsp; INSERT INTO dummy_test DEFAULT VALUES;<br />
&nbsp; &nbsp; &nbsp; &nbsp; SET @i += 1;<br />
&nbsp; &nbsp; END<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; COMMIT TRAN;<br />
<br />
&nbsp; &nbsp; SET @end_upd = GETDATE();<br />
<br />
&nbsp; &nbsp; SET @i = 0;<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; SET @start_chkpt = GETDATE();<br />
&nbsp; &nbsp; CHECKPOINT;<br />
&nbsp; &nbsp; SET @end_chkpt = GETDATE();<br />
&nbsp; &nbsp; PRINT &#039;INS: &#039; + CAST(DATEDIFF(ms, @start_upd, @end_upd) AS VARCHAR(50)) + &#039; - CHKPT: &#039; + CAST(DATEDIFF(ms, @start_chkpt, @end_chkpt) AS VARCHAR(50));<br />
<br />
&nbsp; &nbsp; SET @iteration += 1;<br />
END</div></div>
<p>The result is as follows:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-5-test-perfs-250_10K_chkpt.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-5-test-perfs-250_10K_chkpt.jpg" alt="160 - 5 - test perfs 250_10K_chkpt" width="974" height="298" class="alignnone size-full wp-image-1576" /></a></p>
<p>In my case, I noticed ~17% improvement for the checkpoint process and ~7% for the insert transaction, including the commit phase that flushes data to the TLog. In parallel, looking at the aggregated extended event output confirms that FUA avoids a lot of the additional operations needed to persist data on disk, as illustrated by the flush_file_buffers and make_writes_durable events.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-6-xe-flush-file-buffers-e1586798220100.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-6-xe-flush-file-buffers-e1586798220100.jpg" alt="160 - 6 - xe flush file buffers" width="1000" height="178" class="alignnone size-full wp-image-1577" /></a></p>
<p><strong>Second scenario (100 single-insert autocommit transactions and checkpoint)</strong></p>
<p>In this scenario, I wanted to stress the log writer by forcing a lot of small transactions to commit. I updated the TSQL code as shown below:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">USE dummy;<br />
<br />
SET NOCOUNT ON;<br />
-- Disable checkpoint to control when it will kick in<br />
DBCC TRACEON(3505);<br />
-- Check traceflag<br />
DBCC TRACESTATUS;<br />
<br />
DECLARE @i INT = 0;<br />
DECLARE @iteration INT = 0;<br />
DECLARE @start_upd DATETIME;<br />
DECLARE @start_chkpt DATETIME;<br />
DECLARE @end_upd DATETIME;<br />
DECLARE @end_chkpt DATETIME;<br />
<br />
TRUNCATE TABLE dummy_test;<br />
<br />
WHILE @iteration &lt; 251<br />
BEGIN<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; SET @start_upd = GETDATE();<br />
<br />
&nbsp; &nbsp; WHILE @i &lt;= 100<br />
&nbsp; &nbsp; BEGIN<br />
&nbsp; &nbsp; &nbsp; &nbsp; INSERT INTO dummy_test DEFAULT VALUES;<br />
&nbsp; &nbsp; &nbsp; &nbsp; SET @i += 1;<br />
&nbsp; &nbsp; END<br />
<br />
&nbsp; &nbsp; SET @end_upd = GETDATE();<br />
<br />
&nbsp; &nbsp; SET @i = 0;<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; SET @start_chkpt = GETDATE();<br />
&nbsp; &nbsp; CHECKPOINT;<br />
&nbsp; &nbsp; SET @end_chkpt = GETDATE();<br />
&nbsp; &nbsp; PRINT &#039;INS: &#039; + CAST(DATEDIFF(ms, @start_upd, @end_upd) AS VARCHAR(50)) + &#039; - CHKPT: &#039; + CAST(DATEDIFF(ms, @start_chkpt, @end_chkpt) AS VARCHAR(50));<br />
<br />
&nbsp; &nbsp; SET @iteration += 1;<br />
END</div></div>
<p>The new picture is the following:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-7-test-perfs-250_100_1K_chkpt.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-7-test-perfs-250_100_1K_chkpt.jpg" alt="160 - 7 - test perfs 250_100_1K_chkpt" width="974" height="298" class="alignnone size-full wp-image-1580" /></a></p>
<p>This time the improvement is definitely more impressive, with a decrease of ~80% in execution time for the INSERT + COMMIT phase and ~77% for the checkpoint phase!</p>
<p>Looking at the extended event session confirms the shortened IO path has something to do with it <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/160-7-xe-flush-file-buffers-2-e1586798367112.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/160-7-xe-flush-file-buffers-2-e1586798367112.jpg" alt="160 - 7 - xe flush file buffers 2" width="1000" height="170" class="alignnone size-full wp-image-1578" /></a></p>
<p>Well, shortening the IO path by relying directly on FUA instructions was definitely a good idea, both for performance and for meeting the WAL and ACID requirements. Anyway, I’m glad to see Microsoft contributing improvements to the Linux kernel!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Mitigating Scalar UDF&#8217;s procedural code performance with SQL 2019 and  Scalar UDF Inlining capabilities</title>
		<link>https://blog.developpez.com/mikedavem/p13189/performance/mitigating-scalar-udf-procedural-code-performance-with-sql-2019-udf-inline-capabilites</link>
		<comments>https://blog.developpez.com/mikedavem/p13189/performance/mitigating-scalar-udf-procedural-code-performance-with-sql-2019-udf-inline-capabilites#comments</comments>
		<pubDate>Thu, 05 Mar 2020 15:10:02 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[Imperative]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[Procedural code]]></category>
		<category><![CDATA[Programing]]></category>
		<category><![CDATA[Scalar UDF Inlining]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1516</guid>
		<description><![CDATA[A couple of days ago, I read the write-up of my former colleague @FranckPachot about refactoring procedural code to SQL. This is recurrent subject in the database world and I was interested in transposing this article to SQL Server because &#8230; <a href="https://blog.developpez.com/mikedavem/p13189/performance/mitigating-scalar-udf-procedural-code-performance-with-sql-2019-udf-inline-capabilites">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A couple of days ago, I read the write-up of my former colleague <a href="https://twitter.com/FranckPachot" rel="noopener" target="_blank">@FranckPachot</a> about <a href="https://blog.dbi-services.com/refactoring-procedural-to-sql-an-example-with-mysql-sakila/" rel="noopener" target="_blank">refactoring procedural code to SQL</a>. This is a recurrent subject in the database world and I was interested in transposing his article to SQL Server, because it was about refactoring a Scalar-Valued function to a SQL view. The latter is a great alternative when it comes to performance, but something new shipped with SQL Server 2019 that could address (or at least mitigate) this recurrent scenario. </p>
<p><span id="more-1516"></span></p>
<p>First of all, Scalar-Valued functions (from the User Defined Function category) are interesting objects for code modularity, factoring and reusability. No surprise to see them widely used by DEVs. But they are not always well suited to performance, especially when it comes to the &ldquo;impedance mismatch&rdquo; problem. This term refers to the problems that occur due to differences between the database model and the programming-language model: on one side, a database world whose SQL language is declarative, with queries that are set- or multiset-oriented; on the other side, a programming world of imperative-oriented languages that access each tuple individually for processing.</p>
<p>To cut a long story short, scalar UDFs provide programming benefits for DEVs, but when performance matters we discourage their use for the aforementioned reasons. Before continuing, let&rsquo;s note that all the scripts and demos in the next sections are based on the <a href="https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/Sakila" rel="noopener" target="_blank">sakila-db</a> project on GitHub. Franck Pachot used the MySQL version and fortunately there is a sample for SQL Server as well. Furthermore, the MySQL function used as the initial example by Franck may be translated to SQL Server as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Scalar function<br />
CREATE OR ALTER FUNCTION inventory_in_stock (@p_inventory_id INT) <br />
RETURNS BIT<br />
BEGIN<br />
&nbsp; &nbsp; DECLARE @v_rentals INT;<br />
&nbsp; &nbsp; DECLARE @v_out &nbsp; &nbsp; INT;<br />
&nbsp; &nbsp; DECLARE @verif &nbsp; &nbsp; BIT;<br />
&nbsp; &nbsp; <br />
<br />
&nbsp; &nbsp; --AN ITEM IS IN-STOCK IF THERE ARE EITHER NO ROWS IN THE rental TABLE<br />
&nbsp; &nbsp; --FOR THE ITEM OR ALL ROWS HAVE return_date POPULATED<br />
<br />
&nbsp; &nbsp; SET @v_rentals = (SELECT COUNT(*) FROM rental WHERE inventory_id = @p_inventory_id);<br />
<br />
&nbsp; &nbsp; IF @v_rentals = 0 <br />
&nbsp; &nbsp; BEGIN<br />
&nbsp; &nbsp; &nbsp; &nbsp; SET @verif = 1<br />
&nbsp; &nbsp; END<br />
&nbsp; &nbsp; ELSE<br />
&nbsp; &nbsp; BEGIN<br />
&nbsp; &nbsp; &nbsp; &nbsp; SET @v_out = (SELECT COUNT(rental_id) <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; FROM inventory <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; LEFT JOIN rental ON inventory.inventory_id = rental.inventory_id<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; WHERE inventory.inventory_id = @p_inventory_id<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; AND rental.return_date IS NULL)<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; IF @v_out &gt; 0 <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; SET @verif = 0;<br />
&nbsp; &nbsp; &nbsp; &nbsp; ELSE<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; SET @verif = 1;<br />
&nbsp; &nbsp; END;<br />
<br />
&nbsp; &nbsp; RETURN @verif;<br />
END <br />
GO</div></div>
<p>During his write-up, Franck provided a natural alternative to this UDF based on a SQL view, and here is a similar solution applied to SQL Server:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">CREATE OR ALTER VIEW v_inventory_stock_status <br />
AS<br />
<br />
SELECT <br />
&nbsp; &nbsp; i.inventory_id,<br />
&nbsp; &nbsp; CASE <br />
&nbsp; &nbsp; &nbsp; &nbsp; WHEN NOT EXISTS (SELECT 1 FROM dbo.rental AS r WHERE r.inventory_id = &nbsp;i.inventory_id AND r.return_date IS NULL) THEN 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; ELSE 0<br />
&nbsp; &nbsp; END AS inventory_in_stock<br />
FROM dbo.inventory AS i<br />
GO</div></div>
<p>Then, similar to what Franck did, we can join this view with the inventory table to get the expected outcome:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">select count(v.inventory_id),inventory_in_stock<br />
from inventory AS i<br />
left join v_inventory_stock_status AS v ON i.inventory_id = v.inventory_id<br />
group by v.inventory_in_stock;<br />
go</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-1-Query-OutPut.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-1-Query-OutPut.jpg" alt="156 - 1 - Query OutPut" width="356" height="82" class="alignnone size-full wp-image-1518" /></a></p>
<p>There is another alternative that could be used here, based on a CTE rather than a TSQL view, as follows. However, the performance is similar in both cases and it is up to each DEV to decide which solution fits their needs:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">;with cte<br />
as<br />
(<br />
&nbsp; &nbsp; SELECT <br />
&nbsp; &nbsp; &nbsp; &nbsp; i.inventory_id,<br />
&nbsp; &nbsp; &nbsp; &nbsp; CASE <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; WHEN NOT EXISTS (SELECT 1 FROM dbo.rental AS r WHERE r.inventory_id = &nbsp;i.inventory_id AND r.return_date IS NULL) THEN 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ELSE 0<br />
&nbsp; &nbsp; &nbsp; &nbsp; END AS inventory_in_stock<br />
&nbsp; &nbsp; FROM dbo.inventory AS i<br />
)<br />
select count(v.inventory_id),inventory_in_stock<br />
from inventory AS i<br />
left join cte AS v ON i.inventory_id = v.inventory_id<br />
group by v.inventory_in_stock;<br />
go</div></div>
<p>I then compared the performance between the UDF-based version and the TSQL view:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- udf<br />
select count(*),dbo.inventory_in_stock(inventory_id) <br />
from inventory <br />
group by dbo.inventory_in_stock(inventory_id)<br />
GO<br />
-- view<br />
select count(v.inventory_id),inventory_in_stock<br />
from inventory AS i<br />
left join v_inventory_stock_status AS v ON i.inventory_id = v.inventory_id<br />
group by v.inventory_in_stock;<br />
go</div></div>
<p>The outcome below (CPU, Reads, Writes, Duration) is as expected: the SQL view is the winner by far. </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-2-UDF-vs-View-performance-e1583417465245.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-2-UDF-vs-View-performance-e1583417465245.jpg" alt="156 - 2 - UDF vs View performance" width="1000" height="166" class="alignnone size-full wp-image-1519" /></a></p>
<p>Similar to Franck&rsquo;s finding, the performance gain comes at the cost of rewriting code for the DEVs in this scenario. But SQL Server 2019 provides another interesting way to keep using the UDF abstraction without compromising on performance: the <a href="https://docs.microsoft.com/en-us/sql/relational-databases/user-defined-functions/scalar-udf-inlining?view=sql-server-ver15" rel="noopener" target="_blank">Scalar T-SQL UDF Inlining</a> feature, and I was curious to see how much improvement we get from such capabilities in this scenario. </p>
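<p>Before benchmarking, it may be worth checking whether a given UDF even qualifies for inlining; SQL Server 2019 exposes this in sys.sql_modules. The sketch below simply restricts the catalog query to scalar functions:</p>

```sql
-- Which scalar UDFs meet the Froid inlining conditions?
SELECT OBJECT_NAME(m.object_id) AS udf_name,
       m.is_inlineable,        -- 1 = eligible for Scalar UDF Inlining
       m.inline_type           -- 1 = inlining currently in effect
FROM sys.sql_modules AS m
JOIN sys.objects AS o ON o.object_id = m.object_id
WHERE o.type = 'FN';           -- scalar user-defined functions
```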
<p>The first time I executed the following UDF-based TSQL script on SQL Server 2019 RTM (be sure to be in compatibility level 150), I ran into an internal query processor error on the second query:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- SQL 2017-<br />
ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING = OFF;<br />
GO<br />
SELECT dbo.inventory_in_stock(10)<br />
GO<br />
-- SQL 2019+<br />
ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING = ON;<br />
GO<br />
SELECT dbo.inventory_in_stock(10)</div></div>
<blockquote><p>Msg 8624, Level 16, State 17, Line 14<br />
Internal Query Processor Error: The query processor could not produce a query plan. For more information, contact Customer Support Services.</p></blockquote>
<p>To be honest, it was not a surprise because I was already aware of it from reading a blog post by <a href="https://twitter.com/sqL_handLe" rel="noopener" target="_blank">@sqL_handle</a> a couple of weeks ago. Updating to CU2 fixed my issue, and the second shot revealed some interesting outcomes.<br />
The query plan of the first query (&lt;= SQL 2017) is what we usually expect from executing a TSQL scalar function: from an execution perspective, this black box is materialized in the form of the Compute Scalar operator, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-3-UDF-2017-query-plan.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-3-UDF-2017-query-plan.jpg" alt="156 - 3 - UDF 2017 query plan" width="610" height="203" class="alignnone size-full wp-image-1521" /></a></p>
<p>But the story has changed with the Scalar UDF Inlining capability. This is illustrated by the pictures below, which are samples of a larger execution plan:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-3-UDF-2019-query-plan-e1583417692798.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-3-UDF-2019-query-plan-e1583417692798.jpg" alt="156 - 3 - UDF 2019 query plan" width="1000" height="375" class="alignnone size-full wp-image-1522" /></a></p>
<p>&#8230;</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-3-UDF-2019-2-query-plan-e1583420250648.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-3-UDF-2019-2-query-plan-e1583420250648.jpg" alt="156 - 3 - UDF 2019 2 query plan" width="1000" height="329" class="alignnone size-full wp-image-1524" /></a></p>
<p>The query optimizer has inferred relational operations from my (imperative) scalar UDF based on the <a href="https://www.microsoft.com/en-us/research/project/froid/" rel="noopener" target="_blank">Froid framework</a>, which provides several benefits including compiler optimizations and parallelism (initially not possible with UDFs).</p>
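<p>For completeness, inlining can also be controlled at a finer grain than the database-scoped configuration; for instance, a single statement can opt out through a documented USE HINT. The query text below reuses the article&rsquo;s UDF:</p>

```sql
-- Same UDF query, with Froid inlining disabled for this statement only
SELECT COUNT(*), dbo.inventory_in_stock(inventory_id)
FROM inventory
GROUP BY dbo.inventory_in_stock(inventory_id)
OPTION (USE HINT('DISABLE_TSQL_SCALAR_UDF_INLINING'));
```

<p>This is handy for A/B-comparing plans without flipping TSQL_SCALAR_UDF_INLINING for the whole database.</p>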
<p>Let&rsquo;s perform the same benchmark test between the UDF-based and the TSQL-view-based queries. In fact, I had to write a slight variation of the query to get the Scalar UDF Inlining capability to kick in:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- First UDF query <br />
select count(*),dbo.inventory_in_stock(inventory_id) <br />
from inventory <br />
group by dbo.inventory_in_stock(inventory_id)<br />
GO<br />
<br />
-- Variation of the first query<br />
;with cte<br />
as<br />
(<br />
&nbsp; &nbsp; select inventory_id,dbo.inventory_in_stock(inventory_id) as inventory_in_stock<br />
&nbsp; &nbsp; from inventory <br />
)<br />
select &nbsp;count(*), inventory_in_stock<br />
from cte<br />
group by inventory_in_stock<br />
GO</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-4-UDF-2019-benchmark-query-plan-e1583420352841.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-4-UDF-2019-benchmark-query-plan-e1583420352841.jpg" alt="156 - 4 - UDF 2019 benchmark query plan" width="1000" height="357" class="alignnone size-full wp-image-1525" /></a></p>
<p>From a performance perspective, it is worth noting that the improvement is not necessarily on the read operations but rather on the CPU and Duration times.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-5-UDF-vs-UDF-inline-performance-e1583420400334.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-5-UDF-vs-UDF-inline-performance-e1583420400334.jpg" alt="156 - 5 - UDF vs UDF inline performance" width="1000" height="110" class="alignnone size-full wp-image-1527" /></a></p>
<p>But let&rsquo;s push the tests further by increasing the amount of data. As a reminder, the performance of the test is tied to the number of UDF executions and, implicitly, to the number of records in the Inventory table. </p>
<p>So, let’s add a bunch of records to the Inventory table …</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">INSERT inventory (film_id, store_id, last_update)<br />
SELECT <br />
&nbsp; &nbsp; film_id,<br />
&nbsp; &nbsp; store_id,<br />
&nbsp; &nbsp; GETDATE()<br />
FROM inventory;</div></div>
<p>&#8230; and let&rsquo;s execute this script repeatedly to get a total of respectively 146592 and 2345472 rows for each test. Here are the corresponding performance outcomes:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-6-UDF-vs-UDF-inline-performance-add-more-rows-e1583420498722.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-6-UDF-vs-UDF-inline-performance-add-more-rows-e1583420498722.jpg" alt="156 - 6 - UDF vs UDF inline performance - add more rows" width="1000" height="225" class="alignnone size-full wp-image-1528" /></a></p>
<p>I noticed that the more rows there are in the inventory table, the better the performance we get for each corresponding test:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-7-UDF-vs-UDF-inline-performance-chart-cpu.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-7-UDF-vs-UDF-inline-performance-chart-cpu.jpg" alt="156 - 7 - UDF vs UDF inline performance - chart cpu" width="876" height="384" class="alignnone size-full wp-image-1529" /></a></p>
<p>&#8230;</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-8-UDF-vs-UDF-inline-performance-chart-duration.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-8-UDF-vs-UDF-inline-performance-chart-duration.jpg" alt="156 - 8 - UDF vs UDF inline performance - chart duration" width="868" height="384" class="alignnone size-full wp-image-1530" /></a></p>
<p>Well, an interesting outcome without rewriting any code, isn&rsquo;t it? An 80% decrease on average in query duration and 61% in CPU time. For the sake of curiosity, let&rsquo;s take a look at the different query plans:</p>
<p><strong>Scalar UDF Inlining not enabled</strong></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-10-UDF-more-rows-execution-plan-e1583420738157.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-10-UDF-more-rows-execution-plan-e1583420738157.jpg" alt="156 - 10 - UDF - more rows execution plan" width="1000" height="155" class="alignnone size-full wp-image-1531" /></a></p>
<p>Again, the real cost is hidden by the UDF black box behind the Compute Scalar operator, but we can easily guess that every row processed by the Compute Scalar operator implies a call to the dbo.inventory_in_stock() function. </p>
<p><strong>Scalar UDF Inlining enabled</strong></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-11-UDF-inlining-more-rows-execution-plan-e1583420789863.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-11-UDF-inlining-more-rows-execution-plan-e1583420789863.jpg" alt="156 - 11 - UDF inlining - more rows execution plan" width="1000" height="255" class="alignnone size-full wp-image-1532" /></a></p>
<p>Without going into the details of the execution plan, something that draws attention is that query optimizer tricks kicked in, including parallelism. All the optimization work done by the query processor helps improve the overall performance of the query.</p>
<p>So, last point: does Scalar UDF Inlining scale better than the SQL view? </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/156-9-UDF-inline-vs-view-performance-e1583420871437.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/156-9-UDF-inline-vs-view-performance-e1583420871437.jpg" alt="156 - 9 - UDF inline vs view performance" width="1200" height="74" class="alignnone size-full wp-image-1533" /></a></p>
<p>This last output seems to confirm that the SQL view remains the winner among the alternatives in this specific scenario; you will have to choose the best solution, and likely the acceptable tradeoff, for your own context.</p>
<p>See you!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Collaborative way and tooling to debug SQL Server blocked processes scenarios</title>
		<link>https://blog.developpez.com/mikedavem/p13186/devops/collaborative-way-and-tooling-to-debug-sql-server-blocked-processes-scenarios</link>
		<comments>https://blog.developpez.com/mikedavem/p13186/devops/collaborative-way-and-tooling-to-debug-sql-server-blocked-processes-scenarios#comments</comments>
		<pubDate>Thu, 30 Jan 2020 14:13:48 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[Extended events]]></category>
		<category><![CDATA[locks]]></category>
		<category><![CDATA[orphan transactions]]></category>
		<category><![CDATA[sp_WhoIsActive]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1437</guid>
		<description><![CDATA[A quick blog post to show how helpful an extended event and few other tools can be to help fixing orphan transactions in a real use case scenario. I often gave training with customers about SQL Server performance and tools, &#8230; <a href="https://blog.developpez.com/mikedavem/p13186/devops/collaborative-way-and-tooling-to-debug-sql-server-blocked-processes-scenarios">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A quick blog post to show how helpful an extended event and a few other tools can be to fix orphan transactions in a real use-case scenario. I often give training to customers about SQL Server performance and tools, but I noticed how difficult it can be to explain the importance of a tool if you only cover theory without illustrating it with a real customer case.<br />
Well, let&rsquo;s start my own story, which began a couple of days ago with a SQL alert indicating a blocking issue. Looking at our SQL dashboard (below), we were quickly able to confirm we were running into an annoying issue that would get worse over time if we did nothing.</p>
<p><span id="more-1437"></span></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-1-SQL-dashboard.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-1-SQL-dashboard.jpg" alt="153 - 1 - SQL dashboard" width="1339" height="669" class="alignnone size-full wp-image-1438" /></a></p>
<p>&#8230;</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-2-SQL-dashboard.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-2-SQL-dashboard.jpg" alt="153 - 2 - SQL dashboard" width="1874" height="663" class="alignnone size-full wp-image-1439" /></a></p>
<p>Before continuing, let me be clear: there are plenty of tools to dig into blocked-process scenarios, and my intention is not to favor one specific tool over another. In my case, the first tool I used was the <a href="http://whoisactive.com/" rel="noopener" target="_blank">sp_WhoIsActive</a> procedure from Adam Machanic. One of its great features is giving a comprehensive picture of what is happening on your system at the moment you execute the procedure.<br />
Here is a sample of the output I got. It does not exactly reflect my context (which was a little more complex), but my intention is not to focus on this specific part.</p>
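<p>For reference, here is a minimal sketch of the kind of call that produces this output (these parameter names come from the sp_WhoIsActive documentation; adjust them to your needs):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:130px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Snapshot of current activity, including transaction details.<br />
-- @get_transaction_info adds open_tran_count; @find_block_leaders exposes blocking chains.<br />
EXEC dbo.sp_WhoIsActive<br />
&nbsp; &nbsp; @get_transaction_info = 1,<br />
&nbsp; &nbsp; @find_block_leaders = 1,<br />
&nbsp; &nbsp; @sort_order = '[blocked_session_count] DESC';</div></div>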
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-3-sp_WhoIsActive.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-3-sp_WhoIsActive.jpg" alt="153 - 3 - sp_WhoIsActive" width="1339" height="328" class="alignnone size-full wp-image-1440" /></a></p>
<p>As you can see, I quickly got interesting information about the blocking leaders, including the session_id, the application name, and the command being executed at that moment. But the interesting part of this story was getting a request_id of NULL together with a status of SLEEPING. After some research, these values likely indicate that SQL Server has completed the command and the connection is simply waiting for the next command from the client. In addition, the open_tran_count value (= 1) confirmed the transaction was still open. We monitored the transaction for a couple of minutes to see if the application would manage to commit (or roll back) it, but nothing happened, so we had to kill the corresponding session to get back to a normal situation. A few minutes later the same pattern showed up again, and we applied the same temporary fix (KILL session).<br />
The next step was to work with the DEV team to fix this annoying issue once and for all. We managed to reproduce the scenario in the DEV environment, but it was not clear what exactly happened inside the application, because we got no specific errors even when looking at the tracing infrastructure. To help the DEV team investigate, we decided to create a dedicated extended event session that both monitors all activity scoped to the application concerned and tracks the transactions that remain open during the tracing window. Events can then be correlated easily thanks to the causality tracking capability of extended events.</p>
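<p>As a side note, the same pattern (a sleeping session still holding an open transaction) can also be spotted with plain DMVs, without any third-party tool; a minimal sketch, assuming SQL Server 2012 or later:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:200px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Sleeping user sessions that still hold an open transaction<br />
SELECT s.session_id, s.status, s.program_name,<br />
&nbsp; &nbsp; &nbsp; &nbsp;s.open_transaction_count, s.last_request_end_time<br />
FROM sys.dm_exec_sessions AS s<br />
JOIN sys.dm_tran_session_transactions AS t<br />
&nbsp; &nbsp; &nbsp;ON t.session_id = s.session_id<br />
WHERE s.status = N'sleeping'<br />
&nbsp; AND s.is_user_process = 1;<br />
-- KILL &lt;session_id&gt; remains the temporary workaround once the culprit is confirmed</div></div>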
<p>The extended event session uses two targets: the event file and the pair matching target. The first writes the workload activity to a file on disk, while the second quickly identifies transactions that were begun but never ended.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">CREATE EVENT SESSION [OrphanedTransactionHunter] ON SERVER <br />
ADD EVENT sqlserver.database_transaction_begin(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.database_transaction_end(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.error_reported(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.module_end(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.module_start(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.rpc_completed(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.rpc_starting(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.sp_statement_completed(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.sp_statement_starting(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.sql_statement_completed(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.sql_statement_starting(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack))<br />
ADD TARGET package0.event_file(SET filename=N'OrphanedTransactionHunter'),<br />
ADD TARGET package0.pair_matching(SET begin_event=N'sqlserver.database_transaction_begin',begin_matching_actions=N'sqlserver.session_id',end_event=N'sqlserver.database_transaction_end',end_matching_actions=N'sqlserver.session_id',respond_to_memory_pressure=(1))<br />
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=5 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)<br />
GO</div></div>
<p>The outputs were as follows:</p>
<p>&gt; Pair matching target (open transactions)</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-4-xe-event-histogram.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-4-xe-event-histogram.jpg" alt="153 - 4 - xe event histogram" width="1473" height="69" class="alignnone size-full wp-image-1441" /></a></p>
<p>The only transaction that remained open during our test concerned session_id = 873 (be careful: noisy, irrelevant records may show up at the moment you start the XE session). The attach_activity_id, attach_activity_id_xfer and session column values were helpful here to correlate these events with the ones recorded in the event file target. </p>
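<p>For completeness, the live content of the pair matching target can also be queried directly with T-SQL; a sketch (the exact XML shape of the target data may vary between versions, so treat any further XQuery shredding of it as an assumption):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:170px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Raw XML of the pair_matching target: unmatched (still open) begin events<br />
SELECT CAST(t.target_data AS xml) AS target_data<br />
FROM sys.dm_xe_sessions AS s<br />
JOIN sys.dm_xe_session_targets AS t<br />
&nbsp; &nbsp; &nbsp;ON t.event_session_address = s.address<br />
WHERE s.name = N'OrphanedTransactionHunter'<br />
&nbsp; AND t.target_name = N'pair_matching';</div></div>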
<p>&gt; Event File (Workload activity)</p>
<p>Here are the events after applying a filter on the above values. </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-5-xe-event-file.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-5-xe-event-file.jpg" alt="153 - 5 - xe event file" width="1333" height="303" class="alignnone size-full wp-image-1442" /></a></p>
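<p>The same event file can also be read with T-SQL instead of the SSMS viewer, which makes it easier to share the raw data with the DEV team; a minimal sketch (the path pattern must match the file name declared in the session):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:130px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Read the session's .xel files; each row carries the event payload as XML<br />
SELECT object_name AS event_name, CAST(event_data AS xml) AS event_data<br />
FROM sys.fn_xe_file_target_read_file(N'OrphanedTransactionHunter*.xel', NULL, NULL, NULL);<br />
-- The attach_activity_id action can then be shredded out of event_data<br />
-- to filter on the activity identified in the pair matching target</div></div>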
<p>We noticed the transaction with session_id = 873 was started but never ended. In addition, we were able to identify the sequence of code executed by the application (mainly prepared statements and stored procedures in our context). This information helped the DEV team focus on the right portion of code to fix. Without going into details, it was very interesting to see that the root cause was a SQL statement hitting a duplicate key error that was neither thrown nor handled correctly by the application. I was just surprised the application didn't catch any errors in such a case. We finally understood that prepared statement and stored procedure calls were made through the DbUtils class, including the closeQuietly() method to close connections. Referring to the <a href="https://commons.apache.org/proper/commons-dbutils/apidocs/org/apache/commons/dbutils/DbUtils.html" rel="noopener" target="_blank">Apache documentation</a>, closeQuietly() is designed to swallow SQL exceptions when they happen, which definitely does not help identifying the issue from the application side. Never mind: thanks to the collaboration with the DEV team, we managed to get rid of this issue <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /></p>
<p>David Barbarin </p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
