David Barbarin » SQL Server 2017

Creating dynamic Grafana dashboard for SQL Server

mikedavem — Sun, 11 Apr 2021 19:52:09 +0000

A couple of months ago I wrote about “Why we moved SQL Server monitoring to Prometheus and Grafana”. I talked about the creation of two dashboards. The first one is blackbox monitoring-oriented and aims to spot in (near) real-time resource pressure / saturation issues with self-explained gauges, numbers and colors indicating healthy (green) or unhealthy resources (orange / red). We also include availability group synchronization health metric in the dashboard. We will focus on it in this write-up.

As a reminder, this Grafana dashboard gets information from Prometheus server and metrics related to MSSQL environments. For a sake of clarity, in this dashboard, environment defines one availability group and a set of 2 AG replicas (A or B) in synchronous replication mode. In other words, ENV1 value corresponds to availability group name and to SQL instance names member of the AG group with SERVERA\ENV1 (first replica), SERVERB\ENV1 (second replica).

In the picture above, you can notice 2 sections. One is for availability group and health monitoring and the second includes a set of black box metrics related to saturation and latencies (CPU, RAM, Network, AG replication delay, SQL Buffer Pool, blocked processes …). Good job for one single environment but what if I want to introduce more availability groups and SQL instances in the game?

The first and easiest (or naïve) way we went through when we started writing this dashboard was to copy / paste all the stuff done for one environment the panels as shown below:

After creating a new row (can be associated to section in the present context) at the bottom, all panels were copied from ENV1 to the new fresh section ENV2. New row is created by converting anew panel into row as show below:

Then I need to modify manually ALL the new metrics with the new environment. Let’s illustrate the point with Batch Requests/sec metric as example. The corresponding Prometheus query for the first replica (A) is: (the initial query has been simplified for the purpose of this blog post):

irate(sqlserver_performance{sql_instance='SERVERA:ENV1',counter="Batch Requests/sec"}[$__range])

Same query exists for secondary replica (B) but with a different label value:

irate(sqlserver_performance{sql_instance='SERVERB:ENV1',counter="Batch Requests/sec"}[$__range])

SERVERA:ENV1 and SERVERB:ENV1 are static values that correspond to the name of each SQL Server instance – respectively SERVERA\ENV1 and SERVERB\ENV1. As you probably already guessed and according to our naming convention, for the new environment and related panels, we obviously changed initial values ENV1 with new one ENV2. But having more environments or providing filtering capabilities to focus only on specific environments make the current process tedious and we need introduce dynamic stuff in the game … Good news, Grafana provides such capabilities with dynamic creation of rows and panels. and rows.

Generating dynamic panels in the same section (row)

Referring to the dashboard, first section concerns availability group health metric. When adding a new environment – meaning a new availability group – we want a new dedicated panel creating automatically in the same section (AG health).
Firstly, we need to add a multi-value variable in the dashboard. Values can be static or dynamic from another query regarding your context. (up to you to choose the right solution according to your context).

Once created, a drop-down list appears at the upper left in the dashboard and now we can perform multi selections or we can filter to specific ones.

Then we need to make panel in the AG Heath section dynamic as follows:
– Change the title value with corresponding dashboard (optional)
– Configure repeat options values with the variable (mandatory). You can also define max panel per row

According to this setup, we can display 4 panels (or availability groups) max per row. The 5th will be created and placed to a new line in the same section as shown below:

Finally, we must replace static label values defined in the query by the variable counterpart. For the availability group we are using sqlserver_hadr_replica_states_replica_synchronization_health metric as follows (again, I voluntary put a sample of the entire query for simplicity purpose):

… sqlserver_hadr_replica_states_replica_synchronization_health{sql_instance=~'SERVER[A|B]:$ENV',measurement_db_type="SQLServer"}) …

You can notice the regex expression used to get information from SQL Instances either from primary (A) or secondary (B). The most interesting part concerns the environment that is now dynamic with $ENV variable.

Generating dynamic sections (rows)

As said previously, sections are in fact rows in the Grafana dashboard and rows can contain panels. If we add new environment, we want also to see a new section (and panels) related to it. Configuring dynamic rows is pretty similar to panels. We only need to change the “Repeat for section” with the environment variable as follows (Title remains optional):

As for AG Health panel, we also need to replace static label values in ALL panels with the new environment variable. Thus, referring to the previous Batch Requests / sec example, the updated Prometheus query will be as follows: (respectively for primary and secondary replicas):

irate(sqlserver_performance{sql_instance='SERVERA:$ENV',counter="Batch Requests/sec"}[$__range])

…

irate(sqlserver_performance{sql_instance='SERVERB:$ENV',counter="Batch Requests/sec"}[$__range])

The dashboard is now ready, and all dynamic kicks in when a new SQL Server instance is added to the list of monitored items. Here an example of outcome in our context:

Happy monitoring!

Extending SQL Server monitoring with Raspberry PI and Lametric

mikedavem — Thu, 07 Jan 2021 21:59:25 +0000

First blog of this new year 2021 and I will start with a fancy and How-To Geek topic

In my last blog post, I discussed about monitoring and how it should help to address quickly a situation that is going degrading. Alerts are probably the first way to raise your attention and, in my case, they are often in the form of emails in a dedicated folder. That remains a good thing, at least if you’re not focusing too long in other daily tasks or projects. In work office, I know I would probably better focus on new alerts but as I said previously, telework changed definitely the game.

I wanted to find a way to address this concern at least for main SQL Server critical alerts and I thought about relying on my existing home lab infrastructure to address the point. Reasons are it is always a good opportunity to learn something and to improve my skills by referring to a real case scenario.

My home lab infrastructure includes a cluster of Raspberry PI 4 nodes. Initially, I use it to improve my skills on K8s or to study some IOT stuff for instance. It is a good candidate for developing and deploying a new app for detecting new incoming alerts in my mailbox and sending notifications to my Lametric accordingly.

Lametric is a basically a connected clock but works also as a highly-visible display showing notifications from devices or apps via REST APIs. First time I saw such device in action was in a DevOps meetup in 2018 around Docker and Jenkins deployment with Eric Dusquenoy and Tim Izzo (@5ika_). In addition, one of my previous customers had also one in his office and we had some discussions about cool customization through Lametric apps.

Connection through VPN to my company network is mandatory to work from home and unfortunately Lametric device doesn’t support this scenario because communication is limited to local network only. So, I need an app that run on my local (home) network and able to connect to my mailbox, get new incoming emails and finally sending notifications to my Lametric device.

Here my setup:

There are plenty of good blog posts to create a Raspberry cluster on the internet and I would suggest to read that of Andrew Pruski (@dbafromthecold).

As shown above, there are different paths for SQL alerts referring our infrastructure (On-prem and Azure SQL databases) but all of them are send to a dedicated distribution list for DBA.

The app is a simple PowerShell script that relies on Exchange Webservices APIs for connecting to the mailbox and to get new mails. Sending notifications to my Lametric device is achieved by a simple REST API call with well-formatted body. Details can be found the Lametric documentation. As prerequisite, you need to create a notification app from Lametric Developer site as follows:

As said previously, I used PowerShell for this app. It can help to find documentation and tutorials when it comes Microsoft product. But if you are more confident with Python, APIs are also available in a dedicated package. But let’s precise that using PowerShell doesn’t necessarily mean using Windows-based container and instead I relied on Linux-based image with PowerShell core for ARM architecture. Image is provided by Microsoft on Docker Hub. Finally, sensitive information like Lametric Token or mailbox credentials are stored in K8s secret for security reasons. My app project is available on my GitHub. Feel free to use it.

Here some results:

– After deploying my pod:

– The app is running and checking new incoming emails (kubectl logs command)

When email is detected, notification is sendig to Lametric device accordingly

Geek fun good (bad?) idea to start this new year 2021

Why we moved SQL Server monitoring on Prometheus and Grafana

mikedavem — Tue, 22 Dec 2020 16:55:12 +0000

During this year, I spent a part of my job on understanding the processes and concepts around monitoring in my company. The DevOps mindset mainly drove the idea to move our SQL Server monitoring to the existing Prometheus and Grafana infrastructure. Obviously, there were some technical decisions behind the scene, but the most important part of this write-up is dedicated to explaining other and likely most important reasons of this decision.

But let’s precise first, this write-up doesn’t constitute any guidance or any kind of best practices for DBAs but only some sharing of my own experience on the topic. As usual, any comment will be appreciated.

That’s said, let’s continue with the context. At the beginning of this year, I started my new DBA position in a customer-centric company where DevOps culture, microservices and CI/CD are omnipresent. What does it mean exactly? To cut the story short, development and operation teams are used a common approach for agile software development and delivery. Tools and processes are used to automate build, test, deploy and to monitor applications with speed, quality and control. In other words, we are talking about Continuous Delivery and in my company, release cycle is faster than traditional shops I encountered so far with several releases per day including database changes. Another interesting point is that we are following the « Operate what you build » principle each team that develops a service is also responsible for operating and supporting it. It presents some advantages for both developers and operations but pushing out changes requires to get feedback and to observe impact on the system on both sides.

In addition, in operation teams we try to act as a centralized team and each member should understand the global scope and topics related to the infrastructure and its ecosystem. This is especially true when you’re dealing with nightly on-calls. Each has its own segment responsibility (regarding their specialized areas) but following DevOps principles, we encourage shared ownership to break down internal silos for optimizing feedback and learning. It implies anyone should be able to temporarily overtake any operational task to some extent assuming the process is well-documented, and learnin has been done correctly. But world is not perfect and this model has its downsides. For example, it will prioritize effectiveness in broader domains leading to increase cognitive load of each team member and to lower visibility in for vertical topics when deeper expertise is sometimes required. Having an end-to-end observable system including infrastructure layer and databases may help to reduce time for investigating and fixing issues before end users experience them.

The initial scenario

Let me give some background info and illustration of the initial scenario:

… and my feeling of what could be improved:

1) From a DBA perspective, at a first glance there are many potential issues. Indeed, a lot of automated or semi-manual deployment processes are out of the control and may have a direct impact on the database environment stability. Without better visibility, there is likely no easy way to address the famous question: He, we are experiencing performance degradations for two days, has something happened on database side?

2) Silos are encouraged between DBA and DEVs in this scenario. Direct consequence is to limit drastically the adding value of the DBA role in a DevOps context. Obviously, primary concerns include production tasks like ensuring integrity, backups and maintenance of databases. But in a DevOps oriented company where we have automated « database-as-code » pipelines, they remain lots of unnecessary complexity and disruptive scripts that DBA should take care. If this role is placed only at the end of the delivery pipeline, collaboration and continuous learning with developer teams will restricted at minimum.

3) There is a dedicated monitoring tool for SQL Server infrastructure and this is a good point. It provides necessary baselining and performance insights for DBAs. But in other hand, the tool in place targets only DBA profiles and its usage is limited to the infrastructure team. This doesn’t contribute to help improving the scalability in the operations team and beyond. Another issue with the existing tooling is correlation can be difficult with external events that come from either the continuous delivery pipeline or configuration changes performed by operations teams on the SQL Server instances. In this case, establishment of observability (the why) may be limited and this is what teams need to respond quickly and resolve emergencies in modern and distributed software.

What is observability?

You probably noticed the word « observability » in my previous sentence, so I think it deserves some explanations before to continue. Observability might seem like a buzzword but in fact it is not a new concept but became prominent in DevOps software development lifecycle (SDLC) methodologies and distributed infrastructure systems. Referring to the Wikipedia definition, Observability is the ability to infer internal states of a system based on the system’s external outputs. To be honest, it has not helped me very much and further readings were necessary to shed the light on what observability exactly is and what difference exist with monitoring.

Let’s start instead with monitoring which is the ability to translate infrastructure log metrics data into meaningful and actionable insights. It helps knowing when something goes wrong and starting your response quickly. This is the basis for monitoring tool and the existing one is doing a good job on it. In DBA world, monitoring is often related to performance but reporting performance is only as useful as that reporting accurately represents the internal state of the global system and not only your database environment. For example, in the past I went to some customer shops where I was in charge to audit their SQL Server infrastructure. Generally, customers were able to present their context, but they didn’t get the possibility to provide real facts or performance metrics of their application. In this case, you usually rely on a top-down approach and if you’re either lucky or experimented enough, you manage to find what is going wrong. But sometimes I got relevant SQL Server metrics that would have highlighted a database performance issue, but we didn’t make a clear correlation with those identified on application side. In this case, relying only on database performance metrics was not enough for inferring the internal state of the application. From my experience, many shops are concerned with such applications that have been designed for success and not for failure. They often lake of debuggability monitoring and telemetry is often missing. Collecting data is as the base of observability.

Observability provides not only the when of an error or issue, but more importantly the why. With modern software architectures including micro-services and the emphasis of DevOps, monitoring goals are no longer limited to collecting and processing log data, metrics, and event traces. Instead, it should be employed to improve observability by getting a better understanding of the properties of an application and its performance across distributed systems and delivery pipeline. Referring to the new context I’m working now, metric capture and analysis is started with deployment of each micro-service and it provides better observability by measuring all the work done across all dependencies.

White-Box vs. Black-Box Monitoring

In my company as many other companies, different approaches are used when it comes monitoring: White-box and Black-Box monitoring.
White-box monitoring focuses on exposing internals of a system. For example, this approach is used by many SQL Server performance tools on the market that make effort to set a map of the system with a bunch of internal statistic data about index or internal cache usage, existing wait stats, locks and so on …

In contrast, black-Box monitoring is symptom oriented and tests externally visible behavior as a user would see it. Goal is only monitoring the system from the outside and seeing ongoing problems in the system. There are many ways to achieve black-box monitoring and the first obvious one is using probes which will collect CPU or memory usage, network communications, HTTP health check or latency and so on … Another option is to use a set of integration tests that run all the time to test the system from a behavior / business perspective.

White-Box vs. Black-Box Monitoring: Which is finally more important? All are and can work together. In my company, both are used at different layers of the micro-service architecture including software and infrastructure components.

RED vs USE monitoring

When you’re working in a web-oriented and customer-centric company, you are quickly introduced to The Four Golden Signals monitoring concept which defines a series of metrics originally from Google Site Reliability Engineering including latency, traffic, errors and saturation. The RED method is a subset of “Four Golden Signals” and focus on micro-service architectures and include following metrics:

Rate: number of requests our service is serving per second
Error: number of failed requests per second
Duration: amount of time it takes to process a request

Those metrics are relatively straightforward to understand and may reduce time to figure out which service was throwing the errors and then eventually look at the logs or to restart the service, whatever.

For HTTP Metrics the RED Method is a good fit while the USE Method is more suitable for infrastructure side where main concern is to keep physical resources under control. The latter is based on 3 metrics:

Utilization: Mainly represented in percentage and indicates if a resource is in underload or overload state.
Saturation: Work in a queue and waiting to be processed
Errors: Count of event errors

Those metrics are commonly used by DBAs to monitor performance. It is worth noting that utilization metric can be sometimes misinterpreted especially when maximum value depends of the context and can go over 100%.

SQL Server infrastructure monitoring expectations

Referring to the starting scenario and all concepts surfaced above, it was clear for us to evolve our existing SQL Server monitoring architecture to improve our ability to reach the following goals:

Keeping analyzing long-term trends to respond usual questions like how my daily-workload is evolving? How big is my database? …
Alerting to respond for a broken issue we need to fix or for an issue that is going on and we must check soon.
Building comprehensive dashboards – dashboards should answer basic questions about our SQL Server instances, and should include some form of the advanced SQL telemetry and logging for deeper analysis.
Conducting an ad-hoc retrospective analysis with easier correlation: from example an http response latency that increased in one service. What happened around? Is-it related to database issue? Or blocking issue raised on the SQL Server instance? Is it related to a new query or schema change deployed from the automated delivery pipeline? In other words, good observability should be part of the new solution.
Automated discovery and telemetry collection for every SQL Server instance installed on our environment, either on VM or in container.
To rely entirely on the common platform monitoring based on Prometheus and Grafana. Having the same tooling make often communication easier between people (human factor is also an important aspect of DevOps)

Prometheus, Grafana and Telegraf

Prometheus and Grafana are the central monitoring solution for our micro-service architecture. Some others exist but we’ll focus on these tools in the context of this write-up.
Prometheus is an open-source ecosystem for monitoring and alerting. It uses a multi-dimensional data model based on time series data identified by metric name and key/value pairs. WQL is the query language used by Prometheus to aggregate data in real time and data are directly shown or consumed via HTTP API to allow external system like Grafana. Unlike previous tooling, we appreciated collecting SQL Server metrics as well as those of the underlying infrastructure like VMWare and others. It allows to comprehensive picture of a full path between the database services and infrastructure components they rely on.

Grafana is an open source software used to display time series analytics. It allows us to query, visualize and generate alerts from our metrics. It is also possible to integrate a variety of data sources in addition of Prometheus increasing the correlation and aggregation capabilities of metrics from different sources. Finally, Grafana comes with a native annotation store and the ability to add annotation events directly from the graph panel or via the HTTP API. This feature is especially useful to store annotations and tags related to external events and we decided to use it for tracking software releases or SQL Server configuration changes. Having such event directly on dashboard may reduce troubleshooting effort by responding faster to the why of an issue.

For collecting data we use Telegraf plugin for SQL Server. The plugin exposes all configured metrics to be polled by a Prometheus server. The plugin can be used for both on-prem and Azure instances including Azure SQL DB and Azure SQL MI. Automated deployment and configuration requires low effort as well.

The high-level overview of the new implemented monitoring solution is as follows:

SQL Server telemetry is achieved through Telegraf + Prometheus and includes both Black-box and White-box oriented metrics. External events like automated deployment, server-level and database-level configuration changes are monitored through a centralized scheduled framework based on PowerShell. Then annotations + tags are written accordingly to Grafana and event details are recorded to logging tables for further troubleshooting.

Did the new monitoring met our expectations?

Well, having experienced the new monitoring solution during this year, I would say we are on a good track. We worked mainly on 2 dashboards. The first one exposes basic black-box metrics to show quickly if something is going wrong while the second one is DBA oriented with a plenty of internal counters to dig further and to perform retrospective analysis.

Here a sample of representative issues we faced this year and we managed to fix with the new monitoring solution:

1) Resource pressure and black-box monitoring in action:

For this scenario, the first dashboard highlighted resource pressure issues, but it is worth noting that even if the infrastructure was burning, users didn’t experience any side effects or slowness on application side. After corresponding alerts raised on our side, we applied proactive and temporary fixes before users experience them. I would say, this scenario is something we would able to manage with previous monitoring and the good news is we didn’t notice any regression on this topic.

2) Better observability for better resolution of complex issue

This scenario was more interesting because the first symptom started from the application side without alerting the infrastructure layer. We started suffering from HTTP request slowness on November around 12:00am and developers got alerted with sporadic timeout issues from the logging system. After they traversed the service graph, they spotted on something went wrong on the database service by correlating http slowness with blocked processes on SQL Server dashboard as shown below. I put a simplified view on the dashboards, but we need to cross several routes between the front-end services and databases.

Then I got a call from them and we started investigating blocking processes from the logging tables in place on SQL Server side. At a first glance, different queries with a longer execution time than usual and neither release deployments nor configuration updates may explain such sudden behavior change. The issue kept around and at 15:42 it started appearing more frequently to deserve a deeper look at the SQL Server internal metrics. We quickly found out some interesting correlation with other metrics and we finally managed to figure out why things went wrong as show below:

Root cause was related to transaction replication slowness within Always On availability group databases and we directly jumped on storage issue according to error log details on secondary:

End-to-End observability by including the database services to the new monitoring system drastically reduces the time for finding the root cause. But we also learnt from this experience and to continuously improve the observability we added a black-box oriented metric related to availability group replication latency (see below) to detect faster any potential issue.

And what’s next?

Having such monitoring is not the endpoint of this story. As said at the beginning of this write-up, continuous delivery comes with its own DBA challenges illustrated by the starting scenario. Traditionally the DBA role is siloed, turning requests or tickets into work and they can be lacking context about the broader business or technology used in the company. I experienced myself several situations where you get alerted during the night when developer’s query exceeds some usage threshold. Having discussed the point with many DBAs, they tend to be conservative about database changes (normal reaction?) especially when you are at the end of the delivery process without clear view of what will could deployed exactly.

Here the new situation:

Implementing new monitoring stuff changed the way to observe the system (at least from a DBA perspective). Again, I believe the adding value of DBA role in a company with a strong DevOps mindset is being part of both production DBAs and Development DBAs. Making observability consistent across all the delivery pipeline including databases is likely part of the success and may help DBA getting a broader picture of system components. Referring to my context, I’m now able to get more interaction with developer teams on early phases and to provide them contextual feedbacks (and not generic feedbacks) for improvements regarding SQL production telemetry. They also have access to them and can check by themselves impact of their development. In the same way, feedbacks and work with my team around database infrastructure topic may appear more relevant.

It is finally a matter of collaboration

Interesting use case of using dummy columnstore indexes and temp tables

mikedavem — Fri, 20 Nov 2020 17:06:06 +0000

Columnstore indexes are a very nice feature and well-suited for analytics queries. Using them for our datawarehouse helped to accelerate some big ETL processing and to reduce resource footprint such as CPU, IO and memory as well. In addition, SQL Server 2016 takes columnstore index to a new level and allows a fully updateable non-clustered columnstore index on a rowstore table making possible operational and analytics workloads. Non-clustered columnstore index are a different beast to manage with OLTP workload and we got both good and bad experiences on it. In this blog post, let’s talk about good effects and an interesting case where we use them for reducing CPU consumption of a big reporting query.

In fact, the concerned query follows a common T-SQL anti-pattern for performance: A complex layer of nested views and CTEs which are an interesting mix to getting chances to prevent a cleaned-up execution plan. The SQL optimizer gets tricked easily in this case. So, for illustration let’s start with the following query pattern:

;WITH CTE1 AS (
SELECT col ..., SUM(col2), ...
FROM [VIEW]
GROUP BY col ...
),
CTE2 AS (
SELECT col ..., ROW_NUMBER()
FROM (
SELECT col ...
JOIN CTE1 ON ...
JOIN [VIEW2] ON ...
JOIN [TABLE] ON ...
) AS VT
),
CTE3 A (
SELECT col ...
FROM [VIEW]
JOIN [VIEW4] ON ...
)
...
SELECT col ...
FROM (
SELECT
col,
STUFF((SELECT '', '' + col
FROM CTE2
WHERE CTE2.ID = CTE1.ID
FOR XML PATH('''')), 1, 1, '''') AS colconcat,
...
FROM (
SELECT col ...
FROM CTE1
LEFT JOIN CTE2 ON ...
LEFT JOIN (
SELECT col
FROM CTE3
GROUP BY col
) AS T1 ON ...
) AS T2
GROUP BY col ...
)

Sometimes splitting the big query into small pieces and storing pre-aggregation within temporary tables may help. This is what it has been done and led to some good effects with a global reduction of the query execution time.

CREATE TABLE #T1 ...
CREATE TABLE #T2 ...
CREATE TABLE #T3 ...

;WITH CTE1 AS (
SELECT col ..., SUM(col2), ...
FROM [VIEW]
GROUP BY col ...
)
INSERT INTO #T1 ...
SELECT col FROM CTE1 ...
;

WITH CTE2 AS (
SELECT col ..., ROW_NUMBER()
FROM (
SELECT col ...
JOIN #T1 ON ...
JOIN [VIEW2] ON ...
JOIN [TABLE] ON ...
) AS VT
)
INSERT INTO #T2 ...
SELECT col FROM CTE2 ...
;

WITH CTE3 A (
SELECT col ...
FROM [VIEW]
JOIN [VIEW4] ON ...
)
INSERT INTO #T3 ...
SELECT col FROM CTE3 ...
;

SELECT col ...
FROM (
SELECT
col,
STUFF((SELECT '', '' + col
FROM CTE2
WHERE CTE2.ID = CTE1.ID
FOR XML PATH('''')), 1, 1, '''') AS colconcat,
...
FROM (
SELECT col ...
FROM #T1
LEFT JOIN #T2 ON ...
LEFT JOIN (
SELECT col
FROM #T3
GROUP BY col
) AS T1 ON ...
) AS T2
GROUP BY col ...
)

However, it was not enough, and the query continued to consume a lot of CPU time as shown below:

CPU time was around 20s per execution. CPU time is greater than duration time due to parallelization. Regarding the environment you are, you would say having such CPU time can be common for reporting queries and you’re probably right. But let’s say in my context where all reporting queries are offloaded in a secondary availability group replica (SQL Server 2017), we wanted to keep the read only CPU footprint as low as possible to guarantee a safety margin of CPU resources to address scenarios where all the traffic (including both R/W and R/O queries) is redirected to the primary replica (maintenance, failure and so on). The concerned report is executed on-demand by users and mostly contribute to create high CPU spikes among other reporting queries as shown below:

Testing this query on DEV environment gave following statistic execution outcomes:

SQL Server Execution Times:
CPU time = 12988 ms, elapsed time = 6084 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

… with the related execution plan (not estimated). In fact, I put only the final SELECT step because it was the main culprit of high CPU consumption for this query – (plan was anonymized by SQL Sentry Plan Explorer):

Real content of the query doesn’t matter for this write-up, but you probably have noticed I put explicitly concatenation stuff with XML PATH construct previously and I identified the execution path in the query plan above. This point will be important in the last section of this write-up.

First, because CPU is my main concern, I only selected CPU cost and you may notice top consumers are repartition streams and hash match operators followed by Lazy spool used with XML PATH and correlated subquery.

Then rewriting the query could be a good option but we first tried to find out some quick wins to avoid engaging too much time for refactoring stuff. Focusing on the different branches of this query plan and operators engaged from the right to the left, we make assumption that experimenting batch mode could help reducing the overall CPU time on the highlighted branch. But because we are not dealing with billion of rows within temporary tables, we didn’t want to get extra overhead of maintaining compressed columnstore index structure. I remembered reading an very interesting article in 2016 about the creation of dummy non-clustered columnstore indexes (NCCI) with filter capabilities to enable batch mode and it seemed perfectly fit with our scenario. In parallel, we went through inline index creation capabilities to neither trigger recompilation of the batch statement nor to prevent temp table caching. The target is to save CPU time

So, the temp table and inline non-clustered columnstore index DDL was as follows:

CREATE TABLE #T1 ( col ..., INDEX CCI_IDX_T1 NONCLUSTERED COLUMNSTORE (col) ) WHERE col < 1
CREATE TABLE #T2 ( col ..., INDEX CCI_IDX_T2 NONCLUSTERED COLUMNSTORE (col) ) WHERE col < 1
CREATE TABLE #T3 ( col ..., INDEX CCI_IDX_T3 NONCLUSTERED COLUMNSTORE (col) ) WHERE col < 1
…

Note the WHERE clause here with an out-of-range value to create an empty NCCI.

After applying the changes here, the new statistic execution metrics we got:

SQL Server Execution Times:
CPU time = 2842 ms, elapsed time = 6536 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

… and the related execution plan:

A drop of CPU time consumption (2.8s vs 12s) per execution when the batch mode kicked-in. A good news for sure but something continued to draw my attention because even if batch mode came into play here, it was not propagated to the left and seem to stop at the level of XML PATH execution. After reading my preferred reference on this topic (thank you Niko), I was able to confirm my suspicion of unsupported XML operation with batch mode. Unfortunately, I was out of luck to confirm with column_store_expression_filter_apply extended event that seem to not work for me.

Well, to allow the propagation of batch mode to the left side of the execution plan, it was necessary to write and move the correlated subquery with XML path to a simple JOIN and STRING_AGG() function – available since SQL Server 2016:

-- Concat with XML PATH
SELECT
col,
STUFF((SELECT '', '' + col
FROM CTE2
WHERE CTE2.col = CTE1.col
FOR XML PATH('''')), 1, 1, '''') AS colconcat,
...
FROM [TABLE]

-- Concat with STRING_AGG
SELECT
col,
V.colconcat,
...
FROM [TABLE] AS T
JOIN (
SELECT
col,
STRING_AGG(col2, ', ') AS colconcat
FROM #T2
GROUP BY col
) AS V ON V.col = T.col

The new change gave this following outcome:

SQL Server Execution Times:
CPU time = 2109 ms, elapsed time = 1872 ms.

and new execution plan:

First, batch mode is now propagated from the right to the left of the query execution plan because we eliminated all inhibitors including XML construct. We got not real CPU reduction this time, but we managed to reduce global execution time. The hash match aggregate operator is the main CPU consumer and it is the main candidate to benefit from batch mode. All remaining operators on the left side process few rows and my guess is that batch mode may appear less efficient than with the main consumer in this case. But anyway, note we also got rid of the Lazy Spool operator with the refactoring of the XML path and correlated subquery by STRING_AGG() and JOIN construct.

The new result is better by far compared to the initial scenario (New CPU time: 3s vs Old CPU Time: 20s). It also had good effect of the overall workload on the AG read only replica:

Not so bad for a quick win!
See you

Building a more robust and efficient statistic maintenance with large tables

mikedavem — Mon, 26 Oct 2020 21:05:34 +0000

In a past, I went to different ways for improving update statistic maintenance in different shops according to their context, requirement and constraints as well as the SQL Server version used at this moment. All are important inputs for creating a good maintenance strategy which can be very simple with execution of sp_updatestats or specialized scripts to focus on some tables.

One of my latest experiences on this topic was probably one of the best although we go to circuitous way for dealing with long update statistic maintenance task on a large database. We used a mix of statistic analysis stuff and improvements provided by SQL Server 2014 SP1 CU6 and parallel update statistic capabilities. I wrote a blog post if you are interested in learning more on this experience.

I’m working now for a new company meaning a different context … At the moment of this write-up, we are running on SQL Server 2017 CU21 and database sizes are in different order of magnitude (more than 100GB compressed) compared to my previous experience. However, switching from default sampling method to FULLSCAN for some large tables drastically increased the update statistic task beyond to the allowed Windows time frame (00:00AM to 03:00AM) without any optimization.

Why to change the update statistic sampling method?

Let’s start from the beginning: why we need to change default statistic sample? In fact, this topic has been already covered in detail in the internet and to make the story short, good statistics are part of the recipe for efficient execution plans and queries. Default sampling size used by both auto update mechanism or UPDATE STATISTIC command without any specification come from a non-linear algorithm and may not produce good histogram with large tables. Indeed, the sampling size decreases as the table get bigger leading to a rough picture of values in the table which may affect cardinality estimation in execution plan … Exactly the side effects we experienced on with a couple of our queries and we wanted to minimize in the future. Therefore, we decided to improve cardinality estimation by switching to FULLSCAN method only for some big tables to produce better histogram. But this method comes also at the cost of a direct impact on consumed resources and execution time because the optimizer needs to read more data to build a better picture of data distribution and sometimes with an higher tempdb usage. Our first attempt on ACC environment increased the update statistic maintenance task from initially 5min with default sampling size to 3.5 hours with the FULLSCAN method and only for large tables … Obviously an unsatisfactory solution because we were out of the allowed Windows maintenance timeframe.

Context matters

But first let’s set the context a little bit more: The term “large” can be relative according to the environment. In my context, it means tables with more than 100M of rows and less than 100GB in size for the biggest one and 10M of rows and 10GB in size for lower ones. In fact, for partitioned tables total size includes the archive partition’s compression.

Another gusty detail: concerned databases are part of availability groups and maxdop for primary replica was setup to 1. There is a long story behind this value with some side effects encountered in the past when switching to maxdop > 1 and cost threshold for parallelism = 50. At certain times of the year, the workload increased a lot and we faced memory allocation issues for some parallel queries (parallel queries usually require more memory). This is something we need to investigate further but we switched back to maxdop=1 for now and I would say so far so good …

Because we don’t really have index structures heavily fragmented between two rebuild index operations, we’re not in favor of frequent rebuilding index operations. Even if such operation can be either done online or is resumable with SQL Server 2017 EE, it remains a very resource intensive operation including log block replication on the underlying Always On infrastructure. In addition, there is a strong commitment of minimizing resource overhead during the Windows maintenance because of concurrent business workload in the same timeframe.

Options available to speed-up update statistic task

Using MAXDOP / PERSIST_SAMPLE_PERCENT with UPDATE STATISTICS command

KB4041809 describes new support added for MAXDOP option for the CREATE STATISTICS and UPDATE STATISTICS statements in Microsoft SQL Server 2014, 2016 and 2017. This is especially helpful to override MAXDOP settings defined at the server or database-scope level. As a reminder, maxdop value is forced to 1 in our context on availability group primary replicas.

For partitioned tables we don’t go through this setting because update statistic is done at partition level (see next section). The concerned tables own 2 partitions, respectively CURRENT and ARCHIVE. We keep the former small in size and with a relative low number of rows (only last 2 weeks of data). Therefore, there is no real benefit of using MAXDOP to force update statistics to run with parallelism in this case.

But non-partitioned large tables (>=10 GB) are good candidate. According to the following picture, we noticed an execution time reduction of 57% by increasing maxdop value to 4 for some large tables with these specifications:
– ~= 10GB
– ~ 11M rows
– 112 columns
– 71 statistics

Another feature we went through is described in KB4039284 and available since with SQL Server 2016+. In our context, the maintenance of statistics relies on a custom stored procedure (not Ola maintenance scripts yet) and we have configured default sampling rate method for all statistics and we wanted to make exception only for targeted large tables. In the past, we had to use NO_RECOMPUTE option to exclude statistics for automatic updates. The new PERSIST_SAMPLE_PERCENT option indicates SQL Server to lock the sampling rate for future update operations and we are using it for non-partitioned large tables.

Incremental statistics

SQL Server 2017 provides interesting options to reduce maintenance overhead. Surprisingly some large tables were already partitioned but no incremental statistics were configured. Incremental statistics are especially useful for tables where only few partitions are changed at a time and are a great feature to improve efficiency of statistic maintenance because operations are done at the partition level since SQL Server 2014. Another blog post written a couple of years ago and here was a great opportunity to apply theorical concepts to a practical use case. Because we already implemented partition-level maintenance for indexes, it made sense to apply the same method for statistics to minimize overhead with FULLSCAN method and to benefit from statistic update threshold at the partition level. As said in the previous section, partitioned tables own 2 partitions CURRENT (last 2 weeks) and ARCHIVE and the goal was to only update statistics on the CURRENT partition on daily basis. However, let’s precise that although statistic objects are managed are the partition level, the SQL Server optimizer is not able to use them directly (no change since SQL Server 2014 to SQL Server 2019 as far as I know) and refers instead to the global statistic object.

Let’s demonstrate with the following example:

Let’s consider BIG TABLE with 2 partitions for CURRENT (last 2 weeks) and ARCHIVE values as shown below:

SELECT
s.object_id,
s.name AS stat_name,
sp.rows,
sp.rows_sampled,
sp.node_id,
sp.left_boundary,
sp.right_boundary,
sp.partition_number
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties_internal(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('[dbo].[BIG TABLE]')
AND s.name = 'XXXX_OID'

Statistic object is incremental, and we got an internal picture of per-partition statistics and the global one. You need to enable trace flag 2309 and to add node id reference to the DBCC SHOW_STATISTICS command as well. Let’s dig into the ARCHIVE partition to find a specific value within the histogram step:

DBCC TRACEON ( 2309 );
GO
DBCC SHOW_STATISTICS('[dbo].[BIG TABLE]', 'XXX_OID', 7) WITH HISTOGRAM;

Then, I used the value 9246258 in the WHERE clause of the following query:

SELECT *
FROM dbo.[BIG TABLE]
WHERE XXXX_OID = 9246258

It gives an estimated cardinality of 37.689 rows as show below …

… Cardinality estimation is 37.689 while we should expect a value of 12 rows here referring to the statistic histogram above. Let’s now have a look at the global statistic (nodeid = 1):

DBCC SHOW_STATISTICS('[dbo].[BIG TABLE]', 'XXX_OID', 1) WITH HISTOGRAM;

In fact, the query optimizer estimates rows by using AVG_RANGE_ROWS value between 9189129 and 9473685 in the global statistic. Well, it is likely not as perfect as we may expect. Incremental statistics do helps in reducing time taken to gather stats for sure, but it may not be enough to represent the entire data distribution in the table – We are still limited to 200 steps in the global statistic object. Pragmatically, I think we may mitigate this point by saying things could be worst somehow if we need either to use default sample algorithm or to decrease the sample size of your update statistic operation.

Let’s illustrate with the BIG TABLE. To keep things simple, I have voluntary chosen a (real) statistic where data is evenly distributed. Here some pictures of real data distribution:

The first one is a simple view of MIN, MAX boundaries as well as AVG of occurrences (let’s say duplicate records for a better understanding) by distinct value:

Referring to the picture above, we may notice there is no high variation of number of occurrences per distinct value represented by the leading XXX_OID column in the related index. In the picture below, another representation of data distribution where each histogram bucket includes the number of distinct values per number of occurrences.

For example, we have roughly 2.3% of distinct values in the BIG TABLE with 29 duplicate records. The same applies for values 28, 31 and so on … In short, this histogram confirms a certain degree of homogeneity of data distribution and avg_occurences value is not so far from the truth.

Let’s using default sample value for UPDATE STATISTICS. A very low sample of rows are taken into account leading to very approximative statistics as show below:

SELECT
rows,
rows_sampled,
CAST(rows_sampled * 100. / rows AS DECIMAL(5,2)) AS [sample_%],
steps
FROM sys.dm_db_stats_properties(OBJECT_ID('[dbo].[BIG TABLE]), 1)

SELECT *
FROM sys.dm_db_stats_histogram(OBJECT_ID('[dbo].[BIG TABLE]), 1)

Focusing on average_range_rows colum values, we may notice estimation is not representative of real distribution in the BIG TABLE.

After running FULLSCAN method with UPDATE STATISTICS command, the story has changed, and estimation is now closer to the reality:

As a side note, one additional benefit of using FULLSCAN method is to get a representative statistic histogram in fewer steps. This is well-explained in the SQL Tiger team’s blog post and we noticed this specific behavior with some statistic histograms where frequency is low … mainly primary key and unique index related statistics.

How benefit was incremental statistic?

The picture below refers to one of our biggest partitioned large table with the following characteristics:
– ~ 410M rows
– ~ 63GB in size (including compressed partition size)
– 67 columns
– 30 statistics

As noticed in the picture above, overriding maxdop setting at the database-scoped level resulted to an interesting drop in execution time when FULLSCAN method is used (from 03h30 to 17s in the best case)
Similarly, combining efforts done for both non-partitioned and partitioned larges tables resulted to reduced execution time of update statistic task from ~ 03h30 to 15min – 30min in production that is a better fit with our requirements.

Going through more sophisticated process to update statistic may seem more complicated but strongly required in some specific scenarios. Fortunately, SQL Server provides different features to help optimizing this process. I’m looking forward to seeing features that will be shipped with next versions of SQL Server.

Curious case of locking scenario with SQL Server audits

mikedavem — Mon, 05 Oct 2020 19:25:47 +0000

In high mission-critical environments, ensuring high level of availability is a prerequisite and usually IT department addresses required SLAs (the famous 9’s) with high available architecture solutions. As stated by Wikipedia: availability measurement is subject to some degree of interpretation. Thus, IT department generally focus on uptime metric whereas for other departments availability is often related to application response time or tied to slowness / unresponsiveness complains. The latter is about application throughput and database locks may contribute to reduce it. This is something we are constantly monitoring in addition of the uptime in my company.

A couple of weeks ago, we began to experience suddenly some unexpected blocking issues that included some specific query patterns and SQL Server audit feature. This is all more important as this specific scenario began from one specific database and led to create a long hierarchy tree of blocked processes with blocked SQL Server audit operation first and then propagated to all databases on the SQL Server instance. A very bad scenario we definitely want to avoid … Here a sample of the blocking processes tree:

First, let’s set the context :

We are using SQL Server audit for different purposes since the SQL Server 2014 version and we actually running on SQL Server 2017 CU21 at the moment of this write-up. The obvious one is for security regulatory compliance with login events. We also rely on SQL Server audits to extend the observability of our monitoring system (based on Prometheus and Grafana). Configuration changes are audited with specific events and we link concerned events with annotations in our SQL Server Grafana dashboards. Thus, we are able to quickly correlate events with some behavior changes that may occur on the database side. The high-level of the audit infrastructure is as follows:

As shown in the picture above, a PowerShell script carries out stopping and restarting the audit target and then we use the archive audit file to import related data to a dedicated database.
Let’s precise we use this process without any issues since a couple of years and we were surprised to experience such behavior at this moment. Enough surprising for me to write a blog post … Digging further to the root cause, we pointed out to a specific pattern that seemed to be the root cause of our specific issue:

1. Open transaction
2. Foreach row in a file execute an UPSERT statement
3. Commit transaction

This is a RBAR pattern and it may become slow according the number of lines it has to deal with. In addition, the logic is encapsulated within a single transaction leading to accumulate locks during all the transaction duration. Thinking about it, we didn’t face the specific locking issue with other queries so far because they are executed within short transactions by design.

This point is important because enabling SQL Server audits implies also extra metadata locks. We decided to mimic this behavior on a TEST environment in order to figure out what happened exactly.

Here the scripts we used for that purpose:

TSQL script:

- Create audit
USE [master]
GO

CREATE SERVER AUDIT [Audit-Target-Login]
TO FILE
( FILEPATH = N'/var/opt/mssql/log/'
,MAXSIZE = 0 MB
,MAX_ROLLOVER_FILES = 2147483647
,RESERVE_DISK_SPACE = OFF
)
WITH
( QUEUE_DELAY = 1000
,ON_FAILURE = CONTINUE
)
WHERE (
[server_principal_name] like '%\%'
AND NOT [server_principal_name] like '%\svc%'
AND NOT [server_principal_name] like 'NT SERVICE\%'
AND NOT [server_principal_name] like 'NT AUTHORITY\%'
AND NOT [server_principal_name] like '%XDCP%'
);

ALTER SERVER AUDIT [Audit-Target-Login] WITH (STATE = ON);
GO

CREATE SERVER AUDIT SPECIFICATION [Server-Audit-Target-Login]
FOR SERVER AUDIT [Audit-Target-Login]
ADD (FAILED_DATABASE_AUTHENTICATION_GROUP),
ADD (SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP),
ADD (FAILED_LOGIN_GROUP),
ADD (SUCCESSFUL_LOGIN_GROUP),
ADD (LOGOUT_GROUP)
WITH (STATE = ON)
GO

USE [DBA]
GO

-- Tables to simulate the scenario
CREATE TABLE dbo.T (
id INT,
col1 VARCHAR(50)
);

CREATE TABLE dbo.T2 (
id INT,
col1 VARCHAR(50)
);

INSERT INTO dbo.T VALUES (1, REPLICATE('T',20));
INSERT INTO dbo.T2 VALUES (1, REPLICATE('T',20));

PowerShell scripts:

Session 1: Simulating SQL pattern

# Scenario simulation
$server ='127.0.0.1'
$Database ='DBA'

$Connection =New-Object System.Data.SQLClient.SQLConnection
$Connection.ConnectionString = "Server=$server;Initial Catalog=$Database;Integrated Security=false;User ID=sa;Password=P@SSw0rd1;Application Name=TESTLOCK"
$Connection.Open()

$Command = New-Object System.Data.SQLClient.SQLCommand
$Command.Connection = $Connection
$Command.CommandTimeout = 500

$sql =
"
MERGE T AS T
USING T2 AS S ON T.id = S.id
WHEN MATCHED THEN UPDATE SET T.col1 = 'TT'
WHEN NOT MATCHED THEN INSERT (col1) VALUES ('TT');

WAITFOR DELAY '00:00:03'
"

#Begin Transaction
$command.Transaction = $connection.BeginTransaction()

# Simulate for each file => Execute merge statement
while(1 -eq 1){

$Command.CommandText =$sql
$Result =$Command.ExecuteNonQuery()

}

$command.Transaction.Commit()
$Connection.Close()

Session 2: Simulating stopping / starting SQL Server audit for archiving purpose

$creds = New-Object System.Management.Automation.PSCredential -ArgumentList ($user, $password)

$Query = "
USE master;
ALTER SERVER AUDIT [Audit-Target-Login]
WITH ( STATE = OFF );

ALTER SERVER AUDIT [Audit-Target-Login]
WITH ( STATE = ON );
"

Invoke-DbaQuery `
-SqlInstance $server `
-Database $Database `
-SqlCredential $creds `
-Query $Query

First, we wanted to get a comprehensive picture of locks acquired during the execution of this specific SQL pattern by with an extended event session and lock_acquired event as follows:

CREATE EVENT SESSION [locks]
ON SERVER
ADD EVENT sqlserver.lock_acquired
(
ACTION(sqlserver.client_app_name,
sqlserver.session_id,
sqlserver.transaction_id)
WHERE ([sqlserver].[client_app_name]=N'TESTLOCK'))
ADD TARGET package0.histogram
(
SET filtering_event_name=N'sqlserver.lock_acquired',
source=N'resource_type',source_type=(0)
)
WITH
(
MAX_MEMORY=4096 KB,
EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY=30 SECONDS,
MAX_EVENT_SIZE=0 KB,
MEMORY_PARTITION_MODE=NONE,
TRACK_CAUSALITY=OFF,
STARTUP_STATE=OFF
)
GO

Here the output we got after running the first PowerShell session:

We confirm METADATA locks in addition to usual locks acquired to the concerned structures. We correlated this output with sp_WhoIsActive (and @get_locks = 1) after running the second PowerShell session. Let’s precise that you may likely have to run the 2nd query several times to reproduce the initial issue.

Here a picture of locks respectively acquired by session 1 and in waiting state by session 2:

…

We may identify clearly metadata locks acquired on the SQL Server audit itself (METDATA.AUDIT_ACTIONS with Sch-S) and the second query with ALTER SERVER AUDIT … WITH (STATE = OFF) statement that is waiting on the same resource (Sch-M). Unfortunately, my google Fu didn’t provide any relevant information on this topic excepted the documentation related to sys.dm_tran_locks DMV. My guess is writing events to audits requires a stable the underlying infrastructure and SQL Server needs to protect concerned components (with Sch-S) against concurrent modifications (Sch-M). Anyway, it is easy to figure out that subsequent queries could be blocked (with incompatible Sch-S on the audit resource) while the previous ones are running.

The query pattern exposed previously (unlike short transactions) is a good catalyst for such blocking scenario due to the accumulation and duration of locks within one single transaction. It may be confirmed by the XE’s output:

We managed to get a reproductible scenario with TSQL and PowerShell scripts. In addition, I also ran queries from other databases to confirm it may compromise responsiveness of the entire workload on the same instance (respectively DBA3 and DBA4 databases in my test).

How we fixed this issue?

Even it is only one part of the solution, I’m a strong believer this pattern remains a performance killer and using a set-bases approach may help to reduce drastically number and duration of locks and implicitly chances to make this blocking scenario happen again. Let’s precise it is not only about MERGE statement because I managed to reproduce the same issue with INSERT and UPDATE statements as well.

Then, this scenario really made us think about a long-term solution because we cannot guarantee this pattern will not be used by other teams in the future. Looking further at the PowerShell script which carries out steps of archiving the audit file and inserting data to the audit database, we finally added a QueryTimeout parameter value to 10s to the concerned Invoke-DbaQuery command as follows:

...

$query = "
USE [master];

IF EXISTS (SELECT 1
FROM sys.dm_server_audit_status
WHERE [name] = '$InstanceAuditPrefix-$AuditName')
BEGIN
ALTER SERVER AUDIT [$InstanceAuditPrefix-$AuditName]
WITH (STATE = OFF);
END

ALTER SERVER AUDIT [$InstanceAuditPrefix-$AuditName]
WITH (STATE = ON);
"

Invoke-DbaQuery `
-SqlInstance $Instance `
-SqlCredential $SqlCredential `
-Database master `
-Query $query `
-EnableException `
-QueryTimeout 5

...

Therefore, because we want to prioritize the business workload over the SQL Server audit operation, if such situation occurs again, stopping the SQL Server audit will timeout after reaching 5s which was relevant in our context. The next iteration of the PowerShell is able to restart at the last stage executed previously.

Hope this blog post helps.

See you!

SQL Server index rebuid online and blocking scenario

mikedavem — Sun, 30 Aug 2020 21:18:28 +0000

A couple of months ago, I experienced a problem about index rebuild online operation on SQL Server. In short, the operation was supposed to be online and to never block concurrent queries. But in fact, it was not the case (or to be more precise, it was partially the case) and to make the scenario more complex, we experienced different behaviors regarding the context. Let’s start the story with the initial context: in my company, we usually go through continuous deployment including SQL modification scripts and because we usually rely on daily pipeline, we must ensure related SQL operations are not too disruptive to avoid impacting the user experience.

Sometimes, we must introduce new indexes to deployment scripts and according to how disruptive the script can be, a discussion between Devs and Ops is initiated, and it results either to manage manually by the Ops team or to deploy it automatically through the automatic deployment pipeline by Devs.

Non-disruptive operations can be achieved in many ways and ONLINE capabilities of SQL Server may be part of the solution and this is what I suggested with one of our scripts. Let’s illustrate this context with the following example. I created a table named dbo.t1 with a bunch of rows:

USE [test];

SET NOCOUNT ON;

DROP TABLE IF EXISTS dbo.t1;
GO

CREATE TABLE dbo.t1 (
id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
col1 VARCHAR(50) NULL
);
GO

INSERT INTO dbo.t1 (col1) VALUES (REPLICATE('T', 50));
GO …
EXEC sp_spaceused 'dbo.t1'
--name rows reserved data index_size unused
--t1 5226496 1058000 KB 696872 KB 342888 KB 18240 KB

Go ahead and let’ set the context with a pattern of scripts deployment we went through during this specific deployment. Let’s precise this script is over simplified, but I keep the script voluntary simple to focus only on the most important part. You will notice the script includes two steps with operations on the same table including updating / fixing values in col2 first and then rebuilding index on col1.

/* Code before */

-- Update some values in the col1 colum
UPDATE [dbo].[t1]
SET col1 = REPLICATE('B', 50)

-- Then create an index on col1 column
CREATE INDEX [col1]
ON [dbo].[t1] (col1) WITH (ONLINE = ON);
GO

At the initial stage, the creation of index was by default (OFFLINE). Having discussed this point with the DEV team, we decided to create the index ONLINE in this context. The choice between OFFLINE / ONLINE operation is often not trivial and should be evaluated carefully but to keep simple, let’s say it was the right way to go in our context. Generally speaking, online operations are slower, but the tradeoff was acceptable in order to minimize blocking issues during this deployment. At least, this is what I thought …

In my demo, without any concurrent workload against the dbo.t1 table, creating the index offline took 6s compared to the online method with 12s. So, an expected result here …

Let’s run this another query in another session:

SELECT id, col1
FROM dbo.t1
WHERE id BETWEEN 1 AND 2

In a normal situation, this query should be blocked in a short time corresponding to the duration of the update operation. But once the update is done, blocking situation should disappear even during the index rebuild operation that is performed ONLINE.

But now let’s add Flyway to the context. Flyway is an open source tool we are using for automatic deployment of SQL objects. The deployment script was executed from it in ACC environment and we noticed longer blocked concurrent accesses this time. This goes against what we would ideally like. Digging through this issue with the DEV team, we also noticed the following message when running the deployment script:

Warning: Online index operation on table ‘dbo.t1 will proceed but concurrent access to the table may be limited due to residual lock on the table from a previous operation in the same transaction.

This is something I didn’t noticed from SQL Server Management Studio when I tested the same deployment script. So, what happened here?

Referring to the Flyway documentation, it is mentioned that Flyway always wraps the execution of an entire migration within a single transaction by default and it was exactly the root cause of the issue.

Let’s try with some experimentations:

Test 1: Update + rebuilding index online in implicit transaction mode (one transaction per query).

-- Update some values in the col1 colum
UPDATE [dbo].[t1]
SET col1 = REPLICATE('B', 50)

-- Then create an index on col1 column
CREATE INDEX [col1]
ON [dbo].[t1] (col1) WITH (ONLINE = ON);
GO
-- In another session
SELECT id, col1
FROM dbo.t1
WHERE id BETWEEN 1 AND 2

Test 2: Update + rebuilding index online within one single explicit transaction

BEGIN TRAN;

-- Update some values in the col1 colum
UPDATE [dbo].[t1]
SET col1 = REPLICATE('B', 50)

-- Then create an index on col1 column
CREATE INDEX [col1]
ON [dbo].[t1] (col1) WITH (ONLINE = ON);
GO
COMMIT TRAN;
-- In another session
SELECT id, col1
FROM dbo.t1
WHERE id BETWEEN 1 AND 2

After running these two scripts, we can notice the blocking duration of SELECT query is longer in test2 as shown in the picture below:

In the test 1, the duration of the blocking session corresponds to that for updating operation (first step of the script). However, in the test 2, we must include the time for creating the index but let’s precise the index is not the blocking operation at all, but it increases the residual locking put by the previous update operation. In short, this is exactly what the warning message is telling us. I think you can imagine easily which impact such situation may implies if the index creation takes a long time. You may get exactly the opposite of what you really expected.

Obviously, this is not a recommended situation and creating an index should be run in very narrow and constrained transaction.But from my experience, things are never always obvious and regarding your context, you should keep an eye of how transactions are managed especially when it comes automatic deployment stuff that could be quickly out of the scope of the DBA / Ops team. Strong collaboration with DEV team is recommended to anticipate this kind of issue.

See you !!

Universal usage of NVARCHAR type and performance impact

mikedavem — Wed, 27 May 2020 17:06:24 +0000

A couple of weeks, I read an article from Brent Ozar about using NVARCHAR as a universal parameter. It was a good reminder and from my experience, I confirm this habit has never been a good idea. Although it depends on the context, chances are you will almost find an exception that proves the rule.

A couple of days ago, I felt into a situation that illustrated perfectly this issue, and, in this blog, I decided to share my experience and demonstrate how the impact may be in a real production scenario.
So, let’s start with the culprit. I voluntary masked some contextual information but the principal is here. The query is pretty simple:

DECLARE @P0 DATETIME
DECLARE @P1 INT
DECLARE @P2 NVARCHAR(4000)
DECLARE @P3 DATETIME
DECLARE @P4 NVARCHAR(4000)

UPDATE TABLE SET DATE = @P0
WHERE ID = @P1
AND IDENTIFIER = @P2
AND P_DATE >= @P3
AND W_O_ID = (
SELECT TOP 1 ID FROM TABLE2
WHERE Identifier = @P4
ORDER BY ID DESC)

And the corresponding execution plan:

The most interesting part concerns the TABLE2 table. As you may notice the @P4 input parameter type is NVARCHAR and it is evident we get a CONVERT_IMPLICIT in the concerned Predicate section above. The CONVERT_IMPLICIT function is required because of data type precedence. It results to a costly operator that will scan all the data from TABLE2. As you probably know, CONVERT_IMPLICT prevents sargable condition and normally this is something we could expect here referring to the distribution value in the statistic histogram and the underlying index on the Identifier column.

EXEC sp_helpindex 'TABLE2';

DBCC SHOW_STATISTICS ('TABLE2', 'IX___IDENTIFIER')
WITH HISTOGRAM;

Another important point to keep in mind is that scanning all the data from the TABLE 2 table may be at a certain cost (> 1GB) even if data resides in memory.

EXEC sp_spaceused 'TABLE2'

The execution plan warning confirms the potential overhead of retrieving few rows in the TABLE2 table:

To set a little bit more the context, the concerned application queries are mainly based on JDBC Prepared statements which imply using NVARCHAR(4000) with string parameters regardless the column type in the database (VARCHAR / NVARCHAR). This is at least what we noticed from during my investigations.

So, what? Well, in our DEV environment the impact was imperceptible, and we had interesting discussions with the DEV team on this topic and we basically need to improve the awareness and the visibility on this field. (Another discussion and probably another blog post) …

But chances are your PROD environment will tell you a different story when it comes a bigger workload and concurrent query executions. In my context, from an infrastructure standpoint, the symptom was an abnormal increase of the CPU consumption a couple of days ago. Usually, the CPU consumption was roughly 20% up to 30% and in fact, the issue was around for a longer period, but we didn’t catch it due to a « normal » CPU footprint on this server.

So, what happened here? We’re using SQL Server 2017 with Query Store enabled on the concerned database. This feature came to the rescue and brought attention to the first clue: A query plan regression that led increasing IO consumption in the second case (and implicitly the additional CPU resource consumption as well).

You have probably noticed both the execution plans are using an index scan at the right but the more expensive one (at the bottom) uses a different index strategy. Instead of using the primary key and clustered index (PK_xxx), a non-clustered index on the IX_xxx_Identifier column in the second query execution plan is used with the same CONVERT_IMPLICIT issue.

According to the query store statistics, number of executions per business day is roughly 25000 executions with ~ 8.5H of CPU time consumed during this period (18.05.2020 – 26.05.2020) that was a very different order of magnitude compared to what we may have in the DEV environment

At this stage, I would say investigating why a plan regression occurred doesn’t really matter because in both cases the most expensive operator concerns an index scan and again, we expect an index seek. Getting rid of the implicit conversion by using VARCHAR type to make the conditional clause sargable was a better option for us. Thus, the execution plan would be:

The first workaround in mind was to force the better plan in the query store (automatic tuning with FORCE_LAST_GOOD_PLAN = ON is disabled) but having discussed this point with the DEV team, we managed to deploy a fix very fast to address this issue and to reduce drastically the CPU consumption on this SQL Server instance as shown below. The picture is self-explanatory:

The fix consisted in adding CAST / CONVERT function to the right side of the equality (parameter and not the column) to avoid side effect on the JDBC driver. Therefore, we get another version of the query and a different query hash as well. The query update is pretty similar to the following one:

DECLARE @P0 DATETIME
DECLARE @P1 INT
DECLARE @P2 NVARCHAR(4000)
DECLARE @P3 DATETIME
DECLARE @P4 NVARCHAR(4000)

UPDATE TABLE SET DATE = @P0
WHERE ID = @P1
AND IDENTIFIER = CAST(@P2 AS varchar(50))
AND P_DATE >= @P3
AND W_O_ID = (
SELECT TOP 1 ID FROM TABLE2
WHERE Identifier = CAST(@P4 AS varchar(50))
ORDER BY ID DESC)

Sometime later, we gathered query store statistics of both the former and new query to confirm the performance improvement as shown below:

Finally changing the data type led to enable using an index seek operator to reduce drastically the SQL Server CPU consumption and logical read operations by far.

QED!

dbachecks and AlwaysOn availability group checks

mikedavem — Mon, 20 Apr 2020 19:57:31 +0000

When I started my DBA position in my new company, I was looking for a tool that was able to check periodically the SQL Server database environments for several reasons. First, as DBA one of my main concern is about maintaining and keeping the different mssql environments well-configured against an initial standard. It is also worth noting I’m not the only person to interact with databases and anyone in my team, which is member of sysadmin server role as well, is able to change any server-level configuration settings at any moment. In this case, chances are that having environments shifting from our initial standard over the time and my team and I need to keep confident by checking periodically the current mssql environment configurations, be alerting if configuration drifts exist and obviously fix it as faster as possible.

A while ago, I relied on SQL Server Policy Based Management feature (PBM) to carry out this task at one of my former customers and I have to say it did the job but with some limitations. Indeed, PBM is the instance-scope feature and doesn’t allow to check configuration settings outside the SQL Server instance for example. During my investigation, dbachecks framework drew my attention for several reasons:

– It allows to check different settings at different scopes including Operating System and SQL Server instance items
– It is an open source project and keeps evolving with SQL / PowerShell community contributions.
– It is extensible, and we may include custom checks to the list of predefined checks shipped with the targeted version.
– It is based on PowerShell, Pester framework and fits well with existing automation and GitOps process in my company

The first dbacheck version we deployed in production a couple of month ago was 1.2.24 and unfortunately it didn’t include reliable tests for availability groups. It was the starting point of my first contributions to open source projects and I felt proud and got honored when I noticed my 2 PRs validated for the dbacheck tool including Test Disk Allocation Unit and Availability Group checks:

Obviously, this is just an humble contribution and to be clear, I didn’t write the existing tests for AGs but I spent some times to apply fixes for a better detection of all AG environments including their replicas in a simple and complex topologies (several replicas on the same server and non-default ports for example).

So, here the current list of AG checks in the version 1.2.29 at the moment of this write-up:

– Cluster node should be up
– AG resource + IP Address in the cluster should be online
– Cluster private and public network should be up
– HADR should be enabled on each AG replica
– AG Listener + AG replicas should be pingable and reachable from client connections
– AG replica should be in the correct domain name
– AG replica port number should be equal to the port specified in your standard
– AG availability mode should not be in unknown state and should be in synchronized or synchronizing state regarding the replication type
– Each high available database (member of an AG) should be in synchronized / synchronizing state, ready for failover, joined to the AG and not in suspended state
– Each AG replica should have an extended event session called AlwaysOn_health which is in running state and configured in auto start mode

Mandatory parameters are app.cluster and domain.name.

Get-DbcCheck -Tag HADR | ft Group, Type, AllTags, Config -AutoSize

Group Type AllTags Config
----- ---- ------- ------
HADR ClusterNode ClusterHealth, HADR app.sqlinstance app.cluster skip.hadr.listener.pingcheck domain.name policy...

The starting point of the HADR checks is the Windows Failover Cluster component and hierarchically other tests are performed on each sub component including availability group, AG replicas and AG databases.

Then you may change the behavior on the HADR check process according to your context by using the following parameters:

– skip.hadr.listener.pingcheck => Skip ping check of hadr listener
– skip.hadr.listener.tcpport => Skip check of standard tcp port about AG listerners
– skip.hadr.replica.tcpport => Skip check of standard tcp port about AG replicas

For instance, in my context, I configured the hadr.replica.tcpport parameter to skip checks on replica ports because we own different environments that including several replicas on the same server and which listen on a non-default port.

Get-DbcConfig skip.hadr.*
Name Value Description
---- ----- -----------
skip.hadr.listener.pingcheck False Skip the HADR listener ping test (especially useful for Azure and AWS)
skip.hadr.listener.tcpport False Skip the HADR AG Listener TCP port number (If port number is not standard acro...
skip.hadr.replica.tcpport True Skip the HADR Replica TCP port number (If port number is not standard across t...

Running the HADR check can be simply run by using HADR tag as follows:

Invoke-DbcCheck -Tag HADR
Pester v4.10.1 Executing all tests in 'C:\Program Files\WindowsPowerShell\Modules\dbachecks\1.2.29\checks\HADR.Tests.ps1' with Tags HADR
...

Well, this a good start but I think some almost of checks are state-oriented and some configuration checks are missing. I’m already willing to add some of them in a near the future or/and feel free to add your own contribution as well

Stay tuned!

SQL Server on Linux and new FUA support for XFS filesystem

mikedavem — Mon, 13 Apr 2020 17:34:32 +0000

I wrote a (dbi services) blog post concerning Linux and SQL Server IO behavior changes before and after SQL Server 2017 CU6. Now, I was looking forward seeing some new improvements with Force Unit Access (FUA) that was implemented with Linux XFS enhancements since the Linux Kernel 4.18.

As reminder, SQL Server 2017 CU6 provides added a way to guarantee data durability by using « forced flush » mechanism explained here. To cut the story short, SQL Server has strict storage requirement such as Write Ordering, FUA and things go differently on Linux than Windows to achieve durability. What is FUA and why is it important for SQL Server? From Wikipedia: Force Unit Access (aka FUA) is an I/O write command option that forces written data all the way to stable storage. FUA appeared in the SCSI command set but good news, it was later adopted by other standards over the time. SQL Server relies on it to meet WAL and ACID capabilities.

On the Linux world and before the Kernel 4.18, FUA was handled and optimized only for the filesystem journaling. However, data storage always used the multi-step flush process that could introduce SQL Server IO storage slowness (Issue write to block device for the data + issue block device flush to ensure durability with O_DSYNC).

On the Windows world, installing and using a SQL Server instance assumes you are compliant with the Microsoft storage requirements and therefore the first RTM version shipped on Linux came only with O_DIRECT assuming you already ensure that SQL Server IO are able to be written directly into a non-volatile storage through the kernel, drivers and hardware before the acknowledgement. Forced flush mechanism – based on fdatasync() – was then introduced to address scenarios with no safe DIRECT_IO capabilities.

But referring to the Bob Dorr article, Linux Kernel 4.18 comes with XFS enhancements to handle FUA for data storage and it is obviously of benefit to SQL Server. FUA support is intended to improve write requests by shorten the path of write requests as shown below:

Picture from existing IO workflow on Bob Dorr’s article

This is an interesting improvement for write intensive workload and it seems to be confirmed from the tests performed by Microsoft and Bob Dorr in his article.

Let’s the experiment begins with my lab environment based on a Centos 7 on Hyper-V with an upgraded kernel version: 5.6.3-1.e17.elrepo.x86_64.

$uname -r
5.6.3-1.el7.elrepo.x86_64

$cat /etc/os-release | grep VERSION
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Let’s precise that my tests are purely experimental and instead of upgrading the Kernel to a newer version you may directly rely on RHEL 8 based distros which comes with kernel version 4.18 for example.

My lab environment includes 2 separate SSD disks to host the DATA + TLOG database files as follows:

I:\ drive : SQL Data volume (sdb – XFS filesystem)
T:\ drive : SQL TLog volume (sda – XFS filesystem)

The general performance is not so bad

Initially I just dedicated on disk for both SQL DATA and TLOG but I quickly noticed some IO waits (iostats output) leading to make me lunconfident with my test results

Spreading IO on physically separate volumes helped to reduce drastically these phenomena afterwards:

First, I enabled FUA capabilities on Hyper-V side as follows:

Set-VMHardDiskDrive -VMName CENTOS7 -ControllerType SCSI -OverrideCacheAttributes WriteCacheAndFUAEnabled

Get-VMHardDiskDrive -VMName CENTOS7 | `
ft VMName, ControllerType, ControllerLocation, Path, WriteHardeningMethod -AutoSize

Then I checked if FUA is enabled and supported from an OS perspective including sda (TLOG) and sdb (SQL DATA) disks:

$ lsblk -f
NAME FSTYPE LABEL UUID MOUNTPOINT
sdb
└─sdb1 xfs 06910f69-27a3-4711-9093-f8bf80d15d72 /sqldata
sr0
sda
├─sda2 xfs f5a9bded-130f-4642-bd6f-9f27563a4e16 /boot
├─sda3 LVM2_member QsbKEt-28yT-lpfZ-VCbj-v5W5-vnVr-2l7nih
│ ├─centos-swap swap 7eebbb32-cef5-42e9-87c3-7df1a0b79f11 [SWAP]
│ └─centos-root xfs 90f6eb2f-dd39-4bef-a7da-67aa75d1843d /
└─sda1 vfat 7529-979E /boot/efi

$ dmesg | grep sda
[ 1.665478] sd 0:0:0:0: [sda] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
[ 1.665479] sd 0:0:0:0: [sda] 4096-byte physical blocks
[ 1.665774] sd 0:0:0:0: [sda] Write Protect is off
[ 1.665775] sd 0:0:0:0: [sda] Mode Sense: 0f 00 10 00
[ 1.670321] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 1.683833] sda: sda1 sda2 sda3
[ 1.708938] sd 0:0:0:0: [sda] Attached SCSI disk
[ 5.607914] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)

Finally according to the documentation, I configured the trace flag 3979 and control.alternatewritethrough=0 parameters at startup parameters for my SQL Server instance.

$ /opt/mssql/bin/mssql-conf traceflag 3979 on

$ /opt/mssql/bin/mssql-conf set control.alternatewritethrough 0

$ systemctl restart mssql-server

The first I performed was pretty similar to those in my previous (dbi services) blog post.

CREATE TABLE dummy_test (
id INT IDENTITY,
col1 VARCHAR(2000) DEFAULT REPLICATE('T', 2000)
);

INSERT INTO dummy_test DEFAULT VALUES;
GO 67

For a sake of curiosity, I looked at the corresponding strace output:

$ cat sql_strace_fua.txt
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
78.13 360.618066 61739 5841 2219 futex
6.88 31.731833 1511040 21 15 restart_syscall
3.81 17.592176 130312 135 io_getevents
2.95 13.607314 98604 138 epoll_wait
2.88 13.313667 633984 21 21 rt_sigtimedwait
2.60 11.997925 1333103 9 nanosleep
1.79 8.279781 242 34256 gettid
0.84 3.876021 226 17124 getcpu
0.03 0.138836 347 400 sched_yield
0.01 0.062348 254 245 getrusage
0.01 0.056065 406 138 69 readv
0.01 0.038107 343 111 read
0.01 0.037883 743 51 mmap
0.01 0.037498 180 208 epoll_ctl
0.01 0.035654 517 69 writev
0.01 0.025542 370 69 io_submit
0.00 0.019760 282 70 write
0.00 0.019555 477 41 open
0.00 0.016285 1629 10 rt_sigaction
0.00 0.012359 301 41 close
0.00 0.010069 205 49 munmap
0.00 0.006977 303 23 rt_sigprocmask
0.00 0.006256 153 41 fstat
0.00 0.004646 465 10 10 stat
0.00 0.000860 215 4 madvise
0.00 0.000321 161 2 sched_setaffinity
0.00 0.000295 148 2 set_robust_list
0.00 0.000281 141 2 clone
0.00 0.000236 118 2 sigaltstack
0.00 0.000093 47 2 arch_prctl
0.00 0.000046 23 2 sched_getaffinity
------ ----------- ----------- --------- --------- ----------------
100.00 461.546755 59137 2334 total

… And as I expected, with FUA enabled no fsync() / fdatasync() called anymore and writing to a stable storage is achieved directly by FUA commands. Now iomap_dio_rw() is determining if REQ_FUA can be used and issuing generic_write_sync() is still necessary. To dig further to the IO layer we need to rely to another tool blktrace (mentioned to the Bob Dorr’s article as well).

In my case I got to different pictures of blktrace output between forced flushed mechanism (the default) and FUA oriented IO:

-> With forced flush

34.694734500 14225 18425192 8,16 0 17164 A WS 2048 sqlservr
34.694735000 14225 18425192 8,16 0 17165 Q WS 2048 sqlservr
34.694737000 14225 18425192 8,16 0 17166 X WS 1024 sqlservr
34.694738100 14225 18425192 8,16 0 17167 G WS 1024 sqlservr
34.694739800 14225 18426216 8,16 0 17169 G WS 1024 sqlservr
34.694740900 14225 18425192 8,16 0 17171 D WS 1024 sqlservr
34.694747200 14225 18426216 8,16 0 17174 D WS 1024 sqlservr
34.713665000 14225 0 8,16 0 17175 Q FWS 0 sqlservr
34.713668100 14225 0 8,16 0 17176 G FWS 0 sqlservr

WS (Write Synchronous) is performed but SQL Server still needs to go through the multi-step flush process with the additional FWS (PERFLUSH|WRITE|SYNC).

-> FUA

0.000000000 16305 55106536 8,0 0 1 A WFS 8 sqlservr
0.000000400 16305 57615336 8,0 0 2 A WFS 8 sqlservr
0.000001100 16305 57615336 8,0 0 3 Q WFS 8 sqlservr
0.000005200 16305 57615336 8,0 0 4 G WFS 8 sqlservr
0.001377800 16305 55106544 8,0 0 6 A WFS 16 sqlservr

FWS has disappeared with only WFS commands which are basically REQ_WRITE with the REQ_FUA request

I spent some times to read some interesting discussions in addition to the Bob Dorr’s wonderful article. Here an interesting pointer to a a discussion about REQ_FUA for instance.

But what about performance gain?

I had 2 simple scenarios to play with in order to bring out FUA helpfulness including the harden the dirty pages in the BP with checkpoint process and harden the log buffer to disk during the commit phase. When forced flush method is used, each component relies on additional FlushFileBuffers() function to achieve durability. This event can be easily tracked from an XE session including flush_file_buffers and make_writes_durable events.

First scenario (10K inserts within a transaction and checkpoint)

In this scenario my intention was to stress the checkpoint process with a bunch of buffers and dirty pages to flush to disk when it kicks in.

USE dummy;

SET NOCOUNT ON;
-- Disable checkpoint to control when it will kick in
DBCC TRACEON(3505);
-- Check traceflag
DBCC TRACESTATUS;

DECLARE @i INT = 0;
DECLARE @iteration INT = 0;
DECLARE @start_upd DATETIME;
DECLARE @start_chkpt DATETIME;
DECLARE @end_upd DATETIME;
DECLARE @end_chkpt DATETIME;

TRUNCATE TABLE dummy_test;

WHILE @iteration < 251
BEGIN

SET @start_upd = GETDATE();

BEGIN TRAN;

WHILE @i <= 10000
BEGIN
INSERT INTO dummy_test DEFAULT VALUES;
SET @i += 1;
END

COMMIT TRAN;

SET @end_upd = GETDATE();

SET @i = 0;

SET @start_chkpt = GETDATE();
CHECKPOINT;
SET @end_chkpt = GETDATE();
PRINT 'INS: ' + CAST(DATEDIFF(ms, @start_upd, @end_upd) AS VARCHAR(50)) + ' - CHKPT: ' + CAST(DATEDIFF(ms, @start_chkpt, @end_chkpt) AS VARCHAR(50));

SET @iteration += 1;
END

The result is as follows:

In my case, I noticed ~ 17% of improvement for the checkpoint process and ~7% for the insert transaction including the commit phase with flushing data to the TLog. In parallel, looking at the extended event aggregated output confirms that FUA avoids a lot of additional operations to persist data on disk illustrated by flush_file_buffers and make_writes_durable events.

Second scenario (100x 1 insert within a transaction and checkpoint)

In this scenario, I wanted to stress the log writer by forcing a lot of small transactions to commit. I updated the TSQL code as shown below:

USE dummy;

SET NOCOUNT ON;
-- Disable checkpoint to control when it will kick in
DBCC TRACEON(3505);
-- Check traceflag
DBCC TRACESTATUS;

DECLARE @i INT = 0;
DECLARE @iteration INT = 0;
DECLARE @start_upd DATETIME;
DECLARE @start_chkpt DATETIME;
DECLARE @end_upd DATETIME;
DECLARE @end_chkpt DATETIME;

TRUNCATE TABLE dummy_test;

WHILE @iteration < 251
BEGIN

SET @start_upd = GETDATE();

WHILE @i <= 100
BEGIN
INSERT INTO dummy_test DEFAULT VALUES;
SET @i += 1;
END

SET @end_upd = GETDATE();

SET @i = 0;

SET @start_chkpt = GETDATE();
CHECKPOINT;
SET @end_chkpt = GETDATE();
PRINT 'INS: ' + CAST(DATEDIFF(ms, @start_upd, @end_upd) AS VARCHAR(50)) + ' - CHKPT: ' + CAST(DATEDIFF(ms, @start_chkpt, @end_chkpt) AS VARCHAR(50));

SET @iteration += 1;
END

The new picture is the following:

This time the improvement is definitely more impressive with a decrease of ~80% of the execution time about the INSERT + COMMIT and ~77% concerning the checkpoint phase!!!

Looking at the extended event session confirms the shorten IO path has something to do with it

Well, shortening the IO path and relying directing on initial FUA instructions was definitely a good idea both to join performance and to meet WAL and ACID capabilities. Anyway, I’m glad to see Microsoft to contribute improving to the Linux Kernel!!!