<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>David Barbarin &#187; monitoring</title>
	<atom:link href="https://blog.developpez.com/mikedavem/ptag/monitoring/feed" rel="self" type="application/rss+xml" />
	<link>https://blog.developpez.com/mikedavem</link>
	<description>MVP DataPlatform - MCM SQL Server</description>
	<lastBuildDate>Thu, 09 Sep 2021 21:19:50 +0000</lastBuildDate>
	<language>fr-FR</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.42</generator>
	<item>
		<title>Graphing SQL Server wait stats on Prometheus and Grafana</title>
		<link>https://blog.developpez.com/mikedavem/p13209/devops/graphing-sql-server-wait-stats-on-prometheus-and-grafana</link>
		<comments>https://blog.developpez.com/mikedavem/p13209/devops/graphing-sql-server-wait-stats-on-prometheus-and-grafana#comments</comments>
		<pubDate>Thu, 09 Sep 2021 21:19:22 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[grafana]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[PromQL]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[telegraf]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1816</guid>
		<description><![CDATA[Wait stats are essential performance metrics for diagnosing SQL Server Performance problems. Related metrics can be monitored from different DMVs including sys.dm_os_wait_stats and sys.dm_db_wait_stats (Azure). As you probably know, there are 2 categories of DMVs in SQL Server: Point in &#8230; <a href="https://blog.developpez.com/mikedavem/p13209/devops/graphing-sql-server-wait-stats-on-prometheus-and-grafana">Read more <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Wait stats are essential performance metrics for diagnosing SQL Server performance problems. Related metrics can be monitored from different DMVs, including sys.dm_os_wait_stats and sys.dm_db_wait_stats (Azure).</p>
<p>As you probably know, there are two categories of DMVs in SQL Server, point-in-time versus cumulative, and the DMVs mentioned previously are in the second category. It means the data in these DMVs is cumulative and incremented every time wait events occur. Values reset only when SQL Server restarts or when you intentionally run the DBCC SQLPERF command. Baselining these metric values requires taking snapshots to compare day-to-day activity, or simply trends over a given timeline. Paul Randal kindly provided a TSQL script for trend analysis over a specified time range in this <a href="https://www.sqlskills.com/blogs/paul/capturing-wait-statistics-period-time/" rel="noopener" target="_blank">blog post</a>. The interesting part of this script is its focus on the most relevant wait types and their corresponding statistics. This is basically the kind of script I used for many years when I performed SQL Server audits at customer shops, but today, working as a database administrator for a company, I can rely on our observability stack, which includes Telegraf / Prometheus and Grafana, to do the job.</p>
<p><span id="more-1816"></span></p>
<p>In a previous <a href="https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana" rel="noopener" target="_blank">write-up</a>, I explained the choice of such a platform for SQL Server. Transposing Paul’s script logic to Prometheus and Grafana was not trivial, but the result was worth it. It is an interesting topic that I want to share with Ops and DBAs who want to baseline SQL Server telemetry on a Prometheus and Grafana observability platform.  </p>
<p>So, let’s start with the metrics provided by the Telegraf collector agent and then scraped by the Prometheus job:<br />
&#8211;	sqlserver_waitstats_wait_time_ms<br />
&#8211;	sqlserver_waitstats_waiting_tasks_count<br />
&#8211;	sqlserver_waitstats_resource_wait_time_ms<br />
&#8211;	sqlserver_waitstats_signal_wait_time_ms</p>
<p>In the context of this blog post we will focus only on the first two metrics of the list above, but the same logic applies to the others. </p>
<p>As a reminder, we want to graph the most relevant wait types and their average value within a time range specified in a Grafana dashboard. In fact, this is a two-step process: </p>
<p>1) Identifying the most relevant wait types by computing their ratio of the total amount of wait time within the specified time range.<br />
2) Graphing in Grafana these most relevant wait types with their corresponding average value for every Prometheus step in the time range.</p>
<p>To address the first point, we need to rely on the special Prometheus <a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#rate" rel="noopener" target="_blank">rate()</a> function and the <a href="https://prometheus.io/docs/prometheus/latest/querying/operators/" rel="noopener" target="_blank">group_left</a> modifier. </p>
<p>As per the Prometheus documentation, rate() gives the per-second average rate of change over the specified range interval by using the boundary metric points in it. That is exactly what we need to compute the total average of wait time (in ms) per wait type in a specified time range. rate() needs a range vector as input. Let’s illustrate what a range vector is with the following example. For the sake of simplicity, I filtered the sqlserver_waitstats_wait_time_ms metric to one specific SQL Server instance and wait type (PAGEIOLATCH_EX). A range vector is expressed with a range interval at the end of the query, as you can see below:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;,wait_type=&quot;PAGEIOLATCH_EX&quot;}[1m]</div></div>
<p>The result is a set of data points within the specified range interval, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-range-vector.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-range-vector.png" alt="blog 177 - range vector" width="238" height="256" class="alignnone size-full wp-image-1818" /></a></p>
<p>For each data point we get the value and the corresponding timestamp in epoch format. You can convert this epoch format to a user-friendly one by using <strong>date -r &lt;epoch&gt;</strong> on macOS / BSD, for example. Another important point here: the sqlserver_waitstats_wait_time_ms metric is a counter in the Prometheus world, because its value keeps increasing over time as you can see above (from top to bottom). The same concept exists in SQL Server with the cumulative DMV category, as explained at the beginning. This is why we need the rate() function to draw the right representation of the increase / decrease rate over time between data points. We get 12 data points with an interval of 5s between each value, because in my context we defined a Prometheus scrape interval of 5s for SQL Server =&gt; 60s/5s = 12 data points and 11 steps. The next question is how rate() calculates the per-second rate of change between data points. Referring to my previous example, I can get the rate value by using the following PromQL query:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;,wait_type=&quot;PAGEIOLATCH_EX&quot;}[1m])</div></div>
<p>&#8230; and the corresponding value:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-rate-value.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-rate-value.png" alt="blog 177 - rate value" width="211" height="67" class="alignnone size-full wp-image-1820" /></a></p>
<p>To understand this value, let’s recall a math lesson from school: <a href="https://en.wikipedia.org/wiki/Slope" rel="noopener" target="_blank">slope calculation</a>. </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/Tangent_function_animation.gif"><img src="http://blog.developpez.com/mikedavem/files/2021/09/Tangent_function_animation.gif" alt="Tangent_function_animation" width="300" height="285" class="alignnone size-full wp-image-1823" /></a></p>
<p><em>Image from Wikipedia</em></p>
<p>The basic idea of the slope is to find the rate of change of one variable compared to another. The smaller the distance between two data points, the better the chance of getting a precise approximation of the slope. This is exactly what happens with Prometheus when you zoom in or out by changing the range interval. Good resolution is also determined by the Prometheus scrape interval, especially when your metrics are extremely volatile. This is something to keep in mind with Prometheus: we are working with approximations by design. So let&rsquo;s do some math with a slope calculation over the above range vector:</p>
<p>Slope = &#916;V/&#916;T = (332628 &#8211; 332582) / (@1631125796.971 &#8211; @1631125746.962) =~ 0.92</p>
<p>This naive two-point slope (~0.92) is slightly higher than the value returned by rate() (~0.83), because rate() also extrapolates the increase to the boundaries of the 1m window before dividing by its full duration. Excellent! This is how rate() works, and the beauty of this function is that the slope calculation is done automatically for all the steps within the range interval.</p>
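<p>As a side note, the per-second rate and the total increase over the window are two views of the same calculation: by definition, increase() is just rate() multiplied by the number of seconds in the range interval (60 here). A quick sketch, reusing the same filters as above:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Total wait time (ms) accumulated over the last minute &#8230;
increase(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;,wait_type=&quot;PAGEIOLATCH_EX&quot;}[1m])
# &#8230; which is strictly equivalent to the per-second rate scaled back to 60s
rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;,wait_type=&quot;PAGEIOLATCH_EX&quot;}[1m]) * 60</div></div>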
<p>But let’s go back to the initial requirement. We need to calculate, per wait type, the average value of wait time between the first and last points in the specified range vector. We can now go a step further by using a Prometheus aggregation operator as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;}[1m]))</div></div>
<p>Note that we could have written it another way, without the sum by aggregator, but this form naturally excludes all unwanted labels from the result metric, which will be particularly helpful for the next part. Anyway, here is a sample of the output:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-aggregation-by-waittype.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-aggregation-by-waittype-1024x145.png" alt="blog 177 - aggregation by waittype" width="584" height="83" class="alignnone size-large wp-image-1826" /></a></p>
<p>Then we can compute the ratio (or percentage) per label (wait type). A first and naïve attempt could be as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;}[1m]))/ sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance'}[1m]))</div></div>
<p>But we get an empty query result. Bad joke, right? Let’s understand why. </p>
<p>The first part of the query gives the total amount of wait time per wait type. I put a sample of the results here for simplicity:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-aggregation-by-waittype1.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-aggregation-by-waittype1-1024x145.png" alt="blog 177 - aggregation by waittype" width="584" height="83" class="alignnone size-large wp-image-1828" /></a></p>
<p>It results in a new set of metrics with only one label, wait_type. The second part gives the total amount of wait time for all wait types, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-total-waits.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-total-waits.png" alt="blog 177 - total waits" width="479" height="39" class="alignnone size-full wp-image-1829" /></a></p>
<p>With a SQL statement, we instinctively join columns that have matching values in the tables concerned; those columns are often covered by primary or foreign keys. In the Prometheus world, vector matching works the same way by using all labels as the starting point, and samples are selected or dropped from the result vector based on the &laquo;&nbsp;ignoring&nbsp;&raquo; and &laquo;&nbsp;on&nbsp;&raquo; keywords. In my case, there are no matching labels, so we must tell Prometheus to ignore the remaining label (wait_type) on the first part of the query:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;}[1m]))/ ignoring(wait_type) sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance'}[1m]))</div></div>
<p>But another error message &#8230;</p>
<p><strong>Error executing query: multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)</strong></p>
<p>In many-to-one or one-to-many vector matching with Prometheus, samples are matched using the group_left or group_right keywords. In other words, with this final query we are telling Prometheus to perform a many-to-one match (a cross join of sorts in this case) before performing the division between values:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance=&quot;$Instance&quot;}[1m]))/ ignoring(wait_type) group_left sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance'}[1m]))</div></div>
<p>Here we go!</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-ratio-per-label.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-ratio-per-label-1024x149.png" alt="blog 177 - ratio per label" width="584" height="85" class="alignnone size-large wp-image-1830" /></a></p>
<p>We finally managed to calculate the ratio per wait type over a specified range interval. The last thing is to select the most relevant wait types by first excluding the irrelevant ones. Most of the excluded wait types come from the exclusion list provided by Paul Randal’s script. We also decided to focus only on the top 5 wait types with a ratio &gt; 10%, but it is up to you to change these values:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">topk(5, sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance',measurement_db_type=&quot;SQLServer&quot;,wait_type!~'(BROKER_EVENTHANDLER|BROKER_RECEIVE_WAITFOR|BROKER_TASK_STOP|BROKER_TO_FLUSH|BROKER_TRANSMITTER|CHECKPOINT_QUEUE|CHKPT|CLR_AUTO_EVENT|CLR_MANUAL_EVENT|CLR_SEMAPHORE|DBMIRROR_DBM_EVENT|DBMIRROR_EVENTS_QUEUE|DBMIRROR_WORKER_QUEUE|DBMIRRORING_CMD|DIRTY_PAGE_POLL|DISPATCHER_QUEUE_SEMAPHORE|EXECSYNC|FSAGENT|FT_IFTS_SCHEDULER_IDLE_WAIT|FT_IFTSHC_MUTEX|KSOURCE_WAKEUP|LAZYWRITER_SLEEP|LOGMGR_QUEUE|MEMORY_ALLOCATION_EXT|ONDEMAND_TASK_QUEUE|PARALLEL_REDO_DRAIN_WORKER|PARALLEL_REDO_LOG_CACHE|PARALLEL_REDO_TRAN_LIST|PARALLEL_REDO_WORKER_SYNC|PARALLEL_REDO_WORKER_WAIT_WORK|PREEMPTIVE_OS_FLUSHFILEBUFFERS|PREEMPTIVE_XE_GETTARGETSTATE|PWAIT_ALL_COMPONENTS_INITIALIZED|PWAIT_DIRECTLOGCONSUMER_GETNEXT|QDS_PERSIST_TASK_MAIN_LOOP_SLEEP|QDS_ASYNC_QUEUE|QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP|QDS_SHUTDOWN_QUEUE|REDO_THREAD_PENDING_WORK|REQUEST_FOR_DEADLOCK_SEARCH|RESOURCE_QUEUE|SERVER_IDLE_CHECK|SLEEP_BPOOL_FLUSH|SLEEP_DBSTARTUP|SLEEP_DCOMSTARTUP|SLEEP_MASTERDBREADY|SLEEP_MASTERMDREADY|SLEEP_MASTERUPGRADED|SLEEP_MSDBSTARTUP|SLEEP_SYSTEMTASK|SLEEP_TASK|SLEEP_TEMPDBSTARTUP|SNI_HTTP_ACCEPT|SOS_WORK_DISPATCHER|SP_SERVER_DIAGNOSTICS_SLEEP|SQLTRACE_BUFFER_FLUSH|SQLTRACE_INCREMENTAL_FLUSH_SLEEP|SQLTRACE_WAIT_ENTRIES|VDI_CLIENT_OTHER|WAIT_FOR_RESULTS|WAITFOR|WAITFOR_TASKSHUTDOW|WAIT_XTP_RECOVERY|WAIT_XTP_HOST_WAIT|WAIT_XTP_OFFLINE_CKPT_NEW_LOG|WAIT_XTP_CKPT_CLOSE|XE_DISPATCHER_JOIN|XE_DISPATCHER_WAIT|XE_TIMER_EVENT|MEMORY_ALLOCATION_EXT|ONDEMAND_TASK_QUEUE|PREEMPTIVE_HADR_LEASE_MECHANISM|PREEMPTIVE_SP_SERVER_DIAGNOSTICS|PREEMPTIVE_ODBCOPS|PREEMPTIVE_OS_LIBRARYOPS|PREEMPTIVE_OS_COMOPS|PREEMPTIVE_OS_CRYPTOPS|PREEMPTIVE_OS_PIPEOPS|PREEMPTIVE_OS_AUTHENTICATIONOPS|PREEMPTIVE_OS_GENERICOPS|PREEMPTIVE_OS_VERIFYTRUST|PREEMPTIVE_OS_FILEOPS|PREEMPTIVE_OS_DEVICEOPS|PREEMPTIVE_OS_QUERYREGISTRY|PREEMPTIVE_OS_WRITEFILE|PREEMPTIVE_XE_CALLBACKEXECUTEPREEMPTIVE_XE_DISPATCHER|PREEMPTIVE_XE_GETTARGETSTATEPREEMPTIVE_XE_SESSIONCOMMIT|PREEMPTIVE_XE_TARGETINITPREEMPTIVE_XE_TARGETFINALIZE|PREEMPTIVE_XHTTP|PWAIT_EXTENSIBILITY_CLEANUP_TASK|PREEMPTIVE_OS_DISCONNECTNAMEDPIPE|PREEMPTIVE_OS_DELETESECURITYCONTEXT|PREEMPTIVE_OS_CRYPTACQUIRECONTEXT|PREEMPTIVE_HTTP_REQUEST|RESOURCE_GOVERNOR_IDLE|HADR_FABRIC_CALLBACK|PVS_PREALLOCATE)'}[1m])) / ignoring(wait_type) group_left 
sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance',measurement_db_type=&quot;SQLServer&quot;,wait_type!~'(BROKER_EVENTHANDLER|BROKER_RECEIVE_WAITFOR|BROKER_TASK_STOP|BROKER_TO_FLUSH|BROKER_TRANSMITTER|CHECKPOINT_QUEUE|CHKPT|CLR_AUTO_EVENT|CLR_MANUAL_EVENT|CLR_SEMAPHORE|DBMIRROR_DBM_EVENT|DBMIRROR_EVENTS_QUEUE|DBMIRROR_WORKER_QUEUE|DBMIRRORING_CMD|DIRTY_PAGE_POLL|DISPATCHER_QUEUE_SEMAPHORE|EXECSYNC|FSAGENT|FT_IFTS_SCHEDULER_IDLE_WAIT|FT_IFTSHC_MUTEX|KSOURCE_WAKEUP|LAZYWRITER_SLEEP|LOGMGR_QUEUE|MEMORY_ALLOCATION_EXT|ONDEMAND_TASK_QUEUE|PARALLEL_REDO_DRAIN_WORKER|PARALLEL_REDO_LOG_CACHE|PARALLEL_REDO_TRAN_LIST|PARALLEL_REDO_WORKER_SYNC|PARALLEL_REDO_WORKER_WAIT_WORK|PREEMPTIVE_OS_FLUSHFILEBUFFERS|PREEMPTIVE_XE_GETTARGETSTATE|PWAIT_ALL_COMPONENTS_INITIALIZED|PWAIT_DIRECTLOGCONSUMER_GETNEXT|QDS_PERSIST_TASK_MAIN_LOOP_SLEEP|QDS_ASYNC_QUEUE|QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP|QDS_SHUTDOWN_QUEUE|REDO_THREAD_PENDING_WORK|REQUEST_FOR_DEADLOCK_SEARCH|RESOURCE_QUEUE|SERVER_IDLE_CHECK|SLEEP_BPOOL_FLUSH|SLEEP_DBSTARTUP|SLEEP_DCOMSTARTUP|SLEEP_MASTERDBREADY|SLEEP_MASTERMDREADY|SLEEP_MASTERUPGRADED|SLEEP_MSDBSTARTUP|SLEEP_SYSTEMTASK|SLEEP_TASK|SLEEP_TEMPDBSTARTUP|SNI_HTTP_ACCEPT|SOS_WORK_DISPATCHER|SP_SERVER_DIAGNOSTICS_SLEEP|SQLTRACE_BUFFER_FLUSH|SQLTRACE_INCREMENTAL_FLUSH_SLEEP|SQLTRACE_WAIT_ENTRIES|VDI_CLIENT_OTHER|WAIT_FOR_RESULTS|WAITFOR|WAITFOR_TASKSHUTDOW|WAIT_XTP_RECOVERY|WAIT_XTP_HOST_WAIT|WAIT_XTP_OFFLINE_CKPT_NEW_LOG|WAIT_XTP_CKPT_CLOSE|XE_DISPATCHER_JOIN|XE_DISPATCHER_WAIT|XE_TIMER_EVENT|MEMORY_ALLOCATION_EXT|ONDEMAND_TASK_QUEUE|PREEMPTIVE_HADR_LEASE_MECHANISM|PREEMPTIVE_SP_SERVER_DIAGNOSTICS|PREEMPTIVE_ODBCOPS|PREEMPTIVE_OS_LIBRARYOPS|PREEMPTIVE_OS_COMOPS|PREEMPTIVE_OS_CRYPTOPS|PREEMPTIVE_OS_PIPEOPS|PREEMPTIVE_OS_AUTHENTICATIONOPS|PREEMPTIVE_OS_GENERICOPS|PREEMPTIVE_OS_VERIFYTRUST|PREEMPTIVE_OS_FILEOPS|PREEMPTIVE_OS_DEVICEOPS|PREEMPTIVE_OS_QUERYREGISTRY|PREEMPTIVE_OS_WRITEFILE|PREEMPTIVE_XE_CALLBACKEXECUTEPREEMPTIVE_XE_DISPATCHER|PREEMPTIVE_XE_GETTARGETSTATEPREEMPTIVE_XE_SESSIONCOMMIT|PREEMPTIVE_XE_TARGETINITPREEMPTIVE_XE_TARGETFINALIZE|PREEMPTIVE_XHTTP|PWAIT_EXTENSIBILITY_CLEANUP_TASK|PREEMPTIVE_OS_DISCONNECTNAMEDPIPE|PREEMPTIVE_OS_DELETESECURITYCONTEXT|PREEMPTIVE_OS_CRYPTACQUIRECONTEXT|PREEMPTIVE_HTTP_REQUEST|RESOURCE_GOVERNOR_IDLE|HADR_FABRIC_CALLBACK|PVS_PREALLOCATE)'}[1m]))) &amp;gt;= 0.1</div></div>
<p>I got 3 relevant wait types with their corresponding ratios in the specified time range.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-ratio-per-label-top-5.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-ratio-per-label-top-5-1024x67.png" alt="blog 177 - ratio per label top 5" width="584" height="38" class="alignnone size-large wp-image-1832" /></a></p>
<p>Pretty cool stuff, but we must now go through the second requirement. We want to graph the average value of the identified wait types within a specified time range in a Grafana dashboard. The first thing consists in including the above Prometheus query as a variable in the Grafana dashboard. Here is how I set up my Top5Waits variable in Grafana:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-granafa-top5waits.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-granafa-top5waits-1024x501.png" alt="blog 177 - granafa top5waits" width="584" height="286" class="alignnone size-large wp-image-1833" /></a></p>
<p>Some interesting points here: variable dependency kicks in, with my $Top5Waits variable depending hierarchically on another $Instance variable in my dashboard (populated from another Prometheus query). You have probably noticed the use of [${__range_s}s] to determine the range interval, but depending on your needs the Grafana $__interval variable may be a good fit as well. </p>
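<p>For reference, here is a minimal sketch of what such a variable query could look like with the query_result() function of the Grafana Prometheus data source; I shortened the wait type exclusion list to &#8230; for readability, and a regex such as /wait_type=&quot;([^&quot;]+)&quot;/ then extracts the wait type names from the returned samples:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">query_result(topk(5, sum by (wait_type) (rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance',wait_type!~'(&#8230;)'}[${__range_s}s])) / ignoring(wait_type) group_left sum(rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance',wait_type!~'(&#8230;)'}[${__range_s}s]))) &gt;= 0.1)</div></div>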
<p>In turn, $Top5Waits can be used by another query, this time directly in a Grafana dashboard panel, to graph the average value of the most relevant wait types as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-grafana-avg-wait-stats.png"><img src="http://blog.developpez.com/mikedavem/files/2021/09/blog-177-grafana-avg-wait-stats-1024x400.png" alt="blog 177 - grafana avg wait stats" width="584" height="228" class="alignnone size-large wp-image-1834" /></a></p>
<p>Calculating the wait type average is not a hard task by itself. In fact, we can apply the same method as previously by matching the sqlserver_waitstats_wait_time_ms and sqlserver_waitstats_waiting_tasks_count metrics and dividing their corresponding values to obtain the average wait time (in ms) for each step within the time range (remember how the rate() function works). Both metrics own the same set of labels, so we don’t need to use the &laquo;&nbsp;on&nbsp;&raquo; or &laquo;&nbsp;ignoring&nbsp;&raquo; keywords in this case. But we must introduce the $Top5Waits variable in the label filter of the first metric as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">rate(sqlserver_waitstats_wait_time_ms{sql_instance='$Instance',wait_type=~&quot;$Top5Waits&quot;,measurement_db_type=&quot;SQLServer&quot;}[$__rate_interval])/rate(sqlserver_waitstats_waiting_tasks_count{sql_instance='$Instance',wait_type=~&quot;$Top5Waits&quot;,measurement_db_type=&quot;SQLServer&quot;}[$__rate_interval])</div></div>
<p>We finally managed to get an interesting dynamic measurement of SQL Server wait stats telemetry. Hope this blog post helps!<br />
Let me know your feedback if you are using SQL Server wait stats in Prometheus and Grafana in a different way!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Creating dynamic Grafana dashboard for SQL Server</title>
		<link>https://blog.developpez.com/mikedavem/p13207/sql-server-2008-r2/creating-dynamic-grafana-dashboard-for-sql-server</link>
		<comments>https://blog.developpez.com/mikedavem/p13207/sql-server-2008-r2/creating-dynamic-grafana-dashboard-for-sql-server#comments</comments>
		<pubDate>Sun, 11 Apr 2021 19:52:09 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[SQL Server 2008 R2]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[AlwaysOn;groupes de disponibilité;availability groups]]></category>
		<category><![CDATA[grafana]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1784</guid>
		<description><![CDATA[A couple of months ago I wrote about “Why we moved SQL Server monitoring to Prometheus and Grafana”. I talked about the creation of two dashboards. The first one is blackbox monitoring-oriented and aims to spot in (near) real-time resource &#8230; <a href="https://blog.developpez.com/mikedavem/p13207/sql-server-2008-r2/creating-dynamic-grafana-dashboard-for-sql-server">Read more <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A couple of months ago I wrote about “<a href="https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana" rel="noopener" target="_blank">Why we moved SQL Server monitoring to Prometheus and Grafana</a>”. I talked about the creation of two dashboards. The first one is black-box monitoring-oriented and aims to spot, in (near) real time, resource pressure / saturation issues with self-explanatory gauges, numbers and colors indicating healthy (green) or unhealthy (orange / red) resources. We also include an availability group synchronization health metric in the dashboard, which we will focus on in this write-up.</p>
<p><span id="more-1784"></span></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/04/174-1-mssql-dashboard.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-1-mssql-dashboard-1024x158.jpg" alt="174 - 1 - mssql dashboard" width="584" height="90" class="alignnone size-large wp-image-1785" /></a></p>
<p>As a reminder, this Grafana dashboard gets its information from the Prometheus server and metrics related to MSSQL environments. For the sake of clarity, in this dashboard an environment defines one availability group and a set of 2 AG replicas (A and B) in synchronous replication mode. In other words, the <strong>ENV1</strong> value corresponds to the availability group name and to the SQL instance names that are members of the AG: <strong>SERVERA\ENV1</strong> (first replica) and <strong>SERVERB\ENV1</strong> (second replica). </p>
<p>In the picture above, you can notice 2 sections. One is for availability group and health monitoring, and the second includes a set of black-box metrics related to saturation and latencies (CPU, RAM, network, AG replication delay, SQL buffer pool, blocked processes &#8230;). Good job for one single environment, but what if I want to bring more availability groups and SQL instances into the game?</p>
<p>The first and easiest (or naïve) way we went through when we started writing this dashboard was to copy / paste all the panels done for one environment, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/04/174-2-mssql-dashboard-static.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-2-mssql-dashboard-static-1024x242.jpg" alt="174 - 2 - mssql dashboard static" width="584" height="138" class="alignnone size-large wp-image-1786" /></a></p>
<p>After creating a new row (which can be thought of as a section in the present context) at the bottom, all panels were copied from ENV1 to the fresh new section ENV2. A new row is created by converting a new panel into a row, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/04/174-3-convert-panel-to-row.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-3-convert-panel-to-row-1024x199.jpg" alt="174 - 3 - convert panel to row" width="584" height="113" class="alignnone size-large wp-image-1787" /></a></p>
<p>Then I need to manually modify ALL the new metrics for the new environment. Let’s illustrate the point with the Batch Requests/sec metric as an example. The corresponding Prometheus query for the first replica (A) is as follows (the initial query has been simplified for the purpose of this blog post):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">irate(sqlserver_performance{sql_instance='SERVERA:ENV1',counter=&quot;Batch Requests/sec&quot;}[$__range])</div></div>
<p>The same query exists for the secondary replica (B), but with a different label value:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">irate(sqlserver_performance{sql_instance='SERVERB:ENV1',counter=&quot;Batch Requests/sec&quot;}[$__range])</div></div>
<p>SERVERA:ENV1 and SERVERB:ENV1 are static values that correspond to the name of each SQL Server instance &#8211; respectively SERVERA\ENV1 and SERVERB\ENV1. As you probably already guessed, and according to our naming convention, for the new environment and related panels we simply replaced the initial value ENV1 with the new one, ENV2. But having more environments, or providing filtering capabilities to focus only on specific environments, makes the current process tedious, and we need to introduce something dynamic into the game &#8230; Good news: Grafana provides such capabilities with dynamic creation of rows and panels. </p>
<p><strong>Generating dynamic panels in the same section (row)</strong></p>
<p>Referring to the dashboard, the first section concerns the availability group health metric. When adding a new environment &#8211; meaning a new availability group &#8211; we want a new dedicated panel created automatically in the same section (AG Health).<br />
Firstly, we need to add a multi-value variable to the dashboard. Values can be static or fetched dynamically from another query; it is up to you to choose the right solution according to your context (a sketch of the dynamic option follows the screenshot below).</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/04/174-4-grafana_variable.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-4-grafana_variable.jpg" alt="174 - 4 - grafana_variable" width="968" height="505" class="alignnone size-full wp-image-1789" /></a></p>
<p>Once created, a drop-down list appears at the upper left of the dashboard, and we can now select multiple environments or filter down to specific ones.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/04/174-5-grafana_variable.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-5-grafana_variable.jpg" alt="174 - 5 - grafana_variable" width="202" height="346" class="alignnone size-full wp-image-1790" /></a></p>
<p>Then we need to make the panel in the AG Health section dynamic as follows:<br />
&#8211; Change the title value to include the corresponding variable (optional)<br />
&#8211; Configure the repeat option with the variable (mandatory). You can also define the max number of panels per row</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/04/174-6-panel-variabilisation.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-6-panel-variabilisation.jpg" alt="174 - 6 - panel variabilisation" width="279" height="414" class="alignnone size-full wp-image-1792" /></a></p>
<p>According to this setup, we can display a maximum of 4 panels (or availability groups) per row. The 5th will be created and placed on a new line in the same section, as shown below:<br />
<a href="http://blog.developpez.com/mikedavem/files/2021/04/174-7-panel-same-section.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-7-panel-same-section-1024x125.jpg" alt="174 - 7 - panel same section" width="584" height="71" class="alignnone size-large wp-image-1793" /></a></p>
<p>Finally, we must replace the static label values defined in the query with their variable counterpart. For the availability group we are using the <strong>sqlserver_hadr_replica_states_replica_synchronization_health</strong> metric as follows (again, I voluntarily show only a sample of the entire query for simplicity):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">… sqlserver_hadr_replica_states_replica_synchronization_health{sql_instance=~'SERVER[A|B]:$ENV',measurement_db_type=&quot;SQLServer&quot;}) …</div></div>
<p>You can notice the regular expression used to get information from the SQL instances, either the primary (A) or the secondary (B). The most interesting part concerns the environment, which is now dynamic thanks to the $ENV variable.</p>
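<p>To make it concrete, here is a hedged sketch of what the complete panel query could look like; the min() aggregation is my assumption here (reporting the worst synchronization health across both replicas), not necessarily the exact expression used in our dashboard:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">min(sqlserver_hadr_replica_states_replica_synchronization_health{sql_instance=~'SERVER[A|B]:$ENV',measurement_db_type=&quot;SQLServer&quot;})</div></div>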
<p><strong>Generating dynamic sections (rows)</strong></p>
<p>As said previously, sections are in fact rows in the Grafana dashboard, and rows can contain panels. If we add a new environment, we also want to see a new section (and panels) related to it. Configuring dynamic rows is pretty similar to configuring panels. We only need to set the &#8220;Repeat for&#8221; option of the section (row) with the environment variable as follows (the title remains optional):</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/04/174-8-row.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-8-row-1024x173.jpg" alt="174 - 8 - row" width="584" height="99" class="alignnone size-large wp-image-1794" /></a></p>
<p>As for the AG Health panel, we also need to replace the static label values in ALL panels with the new environment variable. Thus, referring to the previous Batch Requests/sec example, the updated Prometheus queries will be as follows (respectively for the primary and secondary replicas):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">irate(sqlserver_performance{sql_instance='SERVERA:$ENV',counter=&quot;Batch Requests/sec&quot;}[$__range])</div></div>
<p>&#8230;</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">irate(sqlserver_performance{sql_instance='SERVERB:$ENV',counter=&quot;Batch Requests/sec&quot;}[$__range])</div></div>
<p>The dashboard is now ready, and all the dynamic behavior kicks in when a new SQL Server instance is added to the list of monitored items. Here is an example of the outcome in our context:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/04/174-0-final-dashboard.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/04/174-0-final-dashboard-1024x404.jpg" alt="174 - 0 - final dashboard" width="584" height="230" class="alignnone size-large wp-image-1795" /></a></p>
<p>Happy monitoring!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Extending SQL Server monitoring with Raspberry PI and Lametric</title>
		<link>https://blog.developpez.com/mikedavem/p13204/sql-server-2005/extending-sql-server-monitoring-with-raspberry-pi-and-lametric</link>
		<comments>https://blog.developpez.com/mikedavem/p13204/sql-server-2005/extending-sql-server-monitoring-with-raspberry-pi-and-lametric#comments</comments>
		<pubDate>Thu, 07 Jan 2021 21:59:25 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[K8s]]></category>
		<category><![CDATA[SQL Azure]]></category>
		<category><![CDATA[SQL Server 2005]]></category>
		<category><![CDATA[SQL Server 2008]]></category>
		<category><![CDATA[SQL Server 2008 R2]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[Lametric]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[Powershell]]></category>
		<category><![CDATA[Raspberry]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1742</guid>
		<description><![CDATA[First blog of this new year 2021 and I will start with a fancy and How-To Geek topic In my last blog post, I discussed about monitoring and how it should help to address quickly a situation that is going &#8230; <a href="https://blog.developpez.com/mikedavem/p13204/sql-server-2005/extending-sql-server-monitoring-with-raspberry-pi-and-lametric">Read more <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>First blog post of this new year 2021, and I will start with a fancy, How-To Geek kind of topic. </p>
<p>In my <a href="https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana" rel="noopener" target="_blank">last blog post</a>, I discussed monitoring and how it should help to quickly address a situation that is degrading. Alerts are probably the first way to draw your attention and, in my case, they often come in the form of emails in a dedicated folder. That remains a good thing, at least if you’re not too deeply focused on other daily tasks or projects. At the office, I know I would probably pay better attention to new alerts but, as I said previously, teleworking has definitely changed the game.  </p>
<p><span id="more-1742"></span></p>
<p>I wanted to find a way to address this concern, at least for the main SQL Server critical alerts, and I thought about relying on my existing home lab infrastructure to do so. The reason is that it is always a good opportunity to learn something and to improve my skills on a real-case scenario. </p>
<p>My home lab infrastructure includes a cluster of <a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/" rel="noopener" target="_blank">Raspberry PI 4</a> nodes. Initially, I used it to improve my skills on K8s or to study some IoT topics, for instance. It is a good candidate for developing and deploying a new app that detects new incoming alerts in my mailbox and sends notifications to my Lametric accordingly. </p>
<p><a href="https://lametric.com/" rel="noopener" target="_blank">Lametric</a> is a basically a connected clock but works also as a highly-visible display showing notifications from devices or apps via REST APIs. First time I saw such device in action was in a DevOps meetup in 2018 around Docker and Jenkins deployment with <a href="https://www.linkedin.com/in/duquesnoyeric/" rel="noopener" target="_blank">Eric Dusquenoy</a> and Tim Izzo (<a href="https://twitter.com/5ika_" rel="noopener" target="_blank">@5ika_</a>). In addition, one of my previous customers had also one in his office and we had some discussions about cool customization through Lametric apps. </p>
<p>Connecting through VPN to my company network is mandatory to work from home, and unfortunately the Lametric device doesn’t support this scenario because communication is limited to the local network only. So, I need an app that runs on my local (home) network and is able to connect to my mailbox, fetch new incoming emails and finally send notifications to my Lametric device. </p>
<p>Here is my setup:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/01/171-0-lametric_infra.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/01/171-0-lametric_infra-1024x711.jpg" alt="171 - 0 - lametric_infra" width="584" height="405" class="alignnone size-large wp-image-1743" /></a></p>
<p>There are plenty of good blog posts on the internet about creating a Raspberry cluster, and I would suggest reading <a href="https://dbafromthecold.com/2020/11/30/building-a-raspberry-pi-cluster-to-run-azure-sql-edge-on-kubernetes/" rel="noopener" target="_blank">the one</a> by Andrew Pruski (<a href="https://twitter.com/dbafromthecold" rel="noopener" target="_blank">@dbafromthecold</a>). </p>
<p>As shown above, there are different paths for SQL alerts across our infrastructure (on-prem and Azure SQL databases), but all of them are sent to a dedicated DBA distribution list. </p>
<p>The app is a simple PowerShell script that relies on the Exchange Web Services APIs to connect to the mailbox and fetch new mails. Sending notifications to my Lametric device is achieved by a simple REST API call with a well-formatted body. Details can be found in the <a href="https://lametric-documentation.readthedocs.io/en/latest/reference-docs/device-notifications.html" rel="noopener" target="_blank">Lametric documentation</a>. As a prerequisite, you need to create a notification app on the Lametric Developer site as follows:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/01/171-3-lametric-app-token.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/01/171-3-lametric-app-token-1024x364.jpg" alt="171 - 3 - lametric app token" width="584" height="208" class="alignnone size-large wp-image-1744" /></a></p>
<p>As said previously, I used PowerShell for this app. It helps for finding documentation and tutorials when it comes to Microsoft products. But if you are more confident with Python, the APIs are also available in a <a href="https://pypi.org/project/py-ews/" rel="noopener" target="_blank">dedicated package</a>. Let’s be clear, though: using PowerShell doesn’t necessarily mean using a Windows-based container; instead, I relied on a Linux-based image with PowerShell Core for the ARM architecture. The image is provided by Microsoft on <a href="https://hub.docker.com/_/microsoft-powershell" rel="noopener" target="_blank">Docker Hub</a>. Finally, sensitive information like the Lametric token or mailbox credentials is stored in a K8s secret for security reasons. My app project is available on my <a href="https://github.com/mikedavem/lametric" rel="noopener" target="_blank">GitHub</a>. Feel free to use it.</p>
<p>Here are some results:</p>
<p>&#8211; After deploying my pod:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/01/171-1-lametric-pod.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/01/171-1-lametric-pod.jpg" alt="171 - 1 - lametric pod" width="483" height="82" class="alignnone size-full wp-image-1745" /></a></p>
<p>&#8211; The app is running and checking for new incoming emails (kubectl logs command)</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/01/171-2-lametric-pod-logs.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/01/171-2-lametric-pod-logs.jpg" alt="171 - 2 - lametric pod logs" width="828" height="438" class="alignnone size-full wp-image-1747" /></a></p>
<p>When an email is detected, a <a href="https://youtu.be/EcdSFziNc3U" title="Notification" rel="noopener" target="_blank">notification</a> is sent to the Lametric device accordingly.</p>
<p>A geeky, fun, good (bad?) idea to start this new year 2021 <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":-)" class="wp-smiley" /></p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why we moved SQL Server monitoring on Prometheus and Grafana</title>
		<link>https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana</link>
		<comments>https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana#comments</comments>
		<pubDate>Tue, 22 Dec 2020 16:55:12 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[Continuous Delivery]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[grafana]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[RED]]></category>
		<category><![CDATA[sqlserver]]></category>
		<category><![CDATA[telegraf]]></category>
		<category><![CDATA[USE]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1722</guid>
		<description><![CDATA[During this year, I spent a part of my job on understanding the processes and concepts around monitoring in my company. The DevOps mindset mainly drove the idea to move our SQL Server monitoring to the existing Prometheus and Grafana &#8230; <a href="https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana">Read more <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>During this year, I spent part of my job understanding the processes and concepts around monitoring in my company. The DevOps mindset mainly drove the idea of moving our SQL Server monitoring to the existing Prometheus and Grafana infrastructure. Obviously, there were some technical decisions behind the scenes, but the most important part of this write-up is dedicated to explaining the other, and likely more important, reasons for this decision. </p>
<p><span id="more-1722"></span></p>
<p>But let’s be clear first: this write-up doesn’t constitute any guidance or any kind of best practices for DBAs, but only some sharing of my own experience on the topic. As usual, any comment will be appreciated.</p>
<p>That said, let’s continue with the context. At the beginning of this year, I started a new DBA position in a customer-centric company where DevOps culture, microservices and CI/CD are omnipresent. What does it mean exactly? To cut a long story short, development and operations teams use a common approach for agile software development and delivery. Tools and processes are used to automate build, test and deploy stages and to monitor applications with speed, quality and control. In other words, we are talking about Continuous Delivery, and in my company the release cycle is faster than in the traditional shops I have encountered so far, with several releases per day including database changes. Another interesting point is that we follow the &laquo;&nbsp;Operate what you build&nbsp;&raquo; principle: each team that develops a service is also responsible for operating and supporting it. It presents some advantages for both developers and operations, but pushing out changes requires getting feedback and observing the impact on the system on both sides. </p>
<p>In addition, in the operations team we try to act as a centralized team, and each member should understand the global scope and topics related to the infrastructure and its ecosystem. This is especially true when you&rsquo;re dealing with nightly on-calls. Each member has their own segment of responsibility (regarding their specialized areas) but, following DevOps principles, we encourage shared ownership to break down internal silos and optimize feedback and learning. It implies anyone should be able to temporarily take over any operational task to some extent, assuming the process is well-documented and learning has been done correctly. But the world is not perfect, and this model has its downsides. For example, it prioritizes effectiveness in broader domains, leading to an increased cognitive load for each team member and lower visibility into vertical topics where deeper expertise is sometimes required. Having an end-to-end observable system, including the infrastructure layer and databases, may help reduce the time spent investigating and fixing issues before end users experience them. </p>
<p><strong>The initial scenario</strong></p>
<p>Let me give some background info and illustration of the initial scenario:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-0-initial-scenario.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-0-initial-scenario-1024x704.jpg" alt="170 - 0 - initial scenario" width="584" height="402" class="alignnone size-large wp-image-1725" /></a></p>
<p>… and my feeling of what could be improved:</p>
<p>1) From a DBA perspective, at first glance there are many potential issues. Indeed, a lot of automated or semi-manual deployment processes are out of our control and may have a direct impact on the database environment’s stability. Without better visibility, there is likely no easy way to address the famous question: hey, we have been experiencing performance degradation for two days, has something happened on the database side?  </p>
<p>2) Silos are encouraged between DBAs and DEVs in this scenario. The direct consequence is to drastically limit the added value of the DBA role in a DevOps context. Obviously, primary concerns include production tasks like ensuring the integrity, backups and maintenance of databases. But in a DevOps-oriented company where we have automated &laquo;&nbsp;database-as-code&nbsp;&raquo; pipelines, there remains a lot of unnecessary complexity and disruptive scripts that the DBA should take care of. If this role is placed only at the end of the delivery pipeline, collaboration and continuous learning with developer teams will be restricted to a minimum.  </p>
<p>3) There is a dedicated monitoring tool for the SQL Server infrastructure, and this is a good point. It provides the necessary baselining and performance insights for DBAs. On the other hand, the tool in place targets only DBA profiles, and its usage is limited to the infrastructure team. This doesn’t help improve scalability within the operations team and beyond. Another issue with the existing tooling is that correlation can be difficult with external events that come either from the continuous delivery pipeline or from configuration changes performed by operations teams on the SQL Server instances. In this case, the establishment of observability (the why) may be limited, and this is what teams need to respond quickly and resolve emergencies in modern, distributed software.</p>
<p><strong>What is observability?</strong></p>
<p>You probably noticed the word &laquo;&nbsp;observability&nbsp;&raquo; in my previous sentence, so I think it deserves some explanation before continuing. Observability might seem like a buzzword, but in fact it is not a new concept; it simply became prominent with DevOps software development lifecycle (SDLC) methodologies and distributed infrastructure systems. Referring to the <a href="https://en.wikipedia.org/wiki/Observability" rel="noopener" target="_blank">Wikipedia</a> definition, <strong>Observability is the ability to infer internal states of a system based on the system’s external outputs</strong>. To be honest, that did not help me very much, and further reading was necessary to shed light on what observability exactly is and how it differs from monitoring. </p>
<p>Let’s start instead with monitoring, which is the ability to translate infrastructure log and metric data into meaningful and actionable insights. It helps you know when something goes wrong and start your response quickly. This is the basis for a monitoring tool, and the existing one does a good job at it. In the DBA world, monitoring is often related to performance, but reporting performance is only as useful as how accurately that reporting represents the internal state of the global system, and not only your database environment. For example, in the past I went to customer shops where I was in charge of auditing their SQL Server infrastructure. Generally, customers were able to present their context, but they were not able to provide real facts or performance metrics about their application. In this case, you usually rely on a top-down approach and, if you’re either lucky or experienced enough, you manage to find what is going wrong. Sometimes I got relevant SQL Server metrics that would have highlighted a database performance issue, but we couldn’t make a clear correlation with those identified on the application side. In this case, relying only on database performance metrics was not enough to infer the internal state of the application. From my experience, many shops are concerned, with applications that have been designed for success and not for failure. They often lack debuggability, and monitoring telemetry is often missing. Collecting data is the basis of observability.</p>
<p>Observability provides not only the when of an error or issue but, more importantly, the why. With modern software architectures, including micro-services, and the emphasis on DevOps, monitoring goals are no longer limited to collecting and processing log data, metrics and event traces. Instead, monitoring should be employed to improve observability by getting a better understanding of the properties of an application and its performance across distributed systems and the delivery pipeline. Referring to the context I&rsquo;m working in now, metric capture and analysis starts with the deployment of each micro-service, and it provides better observability by measuring all the work done across all dependencies.</p>
<p><strong>White-Box vs. Black-Box Monitoring </strong></p>
<p>In my company, as in many other companies, different approaches are used when it comes to monitoring: white-box and black-box monitoring.<br />
White-box monitoring focuses on exposing the internals of a system. For example, this approach is used by many SQL Server performance tools on the market, which make an effort to map the system with a bunch of internal statistics about index or internal cache usage, existing wait stats, locks and so on &#8230;</p>
<p>In contrast, black-box monitoring is symptom-oriented and tests externally visible behavior as a user would see it. The goal is only to monitor the system from the outside and to see ongoing problems in the system. There are many ways to achieve black-box monitoring; the first obvious one is using probes, which will collect CPU or memory usage, network communications, HTTP health checks or latency and so on &#8230; Another option is to use a set of integration tests that run all the time to test the system from a behavior / business perspective.</p>
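<p>To make the probe idea concrete, here is a small PromQL sketch assuming the Prometheus blackbox_exporter and an illustrative job name:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Availability: share of successful HTTP probes over the last 5 minutes
avg_over_time(probe_success{job=&quot;http-checks&quot;}[5m])
# Latency: duration of the last probe, per target
probe_duration_seconds{job=&quot;http-checks&quot;}</div></div>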
<p>White-box vs. black-box monitoring: which is ultimately more important? Both are, and they can work together. In my company, both are used at different layers of the micro-service architecture, including software and infrastructure components. </p>
<p><strong>RED vs USE monitoring</strong></p>
<p>When you’re working in a web-oriented and customer-centric company, you are quickly introduced to the Four Golden Signals monitoring concept, which defines a series of metrics originally from <a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener" target="_blank">Google Site Reliability Engineering</a>: latency, traffic, errors and saturation. The RED method is a subset of the “Four Golden Signals” that focuses on micro-service architectures and includes the following metrics:</p>
<ul>
<li>Rate: number of requests our service is serving per second</li>
<li>Errors: number of failed requests per second</li>
<li>Duration: amount of time it takes to process a request</li>
</ul>
<p>Those metrics are relatively straightforward to understand and may reduce the time needed to figure out which service is throwing errors, so you can then look at its logs, restart it, and so on.</p>
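<p>As an illustration, here is a hedged PowerShell sketch that computes a RED-style request rate by calling the Prometheus HTTP query API; the metric name http_requests_total and the Prometheus address are assumptions for the example:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Query Prometheus for a RED-style rate: requests per second over the last 5 minutes<br />
# Assumes a Prometheus server on localhost:9090 and a counter named http_requests_total<br />
$promql = 'sum(rate(http_requests_total[5m]))'<br />
$uri = 'http://localhost:9090/api/v1/query?query=' + [uri]::EscapeDataString($promql)<br />
$result = Invoke-RestMethod -Uri $uri -Method Get<br />
# Each instant vector sample comes back as [timestamp, value]<br />
$result.data.result | ForEach-Object { Write-Output (&quot;Rate: {0} req/s&quot; -f $_.value[1]) }</div></div>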
<p>For HTTP metrics the RED method is a good fit, while the USE method is more suitable on the infrastructure side, where the main concern is keeping physical resources under control. The latter is based on 3 metrics:</p>
<ul>
<li>Utilization: usually expressed as a percentage; indicates whether a resource is underloaded or overloaded.</li>
<li>Saturation: work sitting in a queue, waiting to be processed</li>
<li>Errors: count of error events</li>
</ul>
<p>Those metrics are commonly used by DBAs to monitor performance. It is worth noting that the utilization metric can sometimes be misinterpreted, especially when its maximum value depends on the context and can go over 100%.</p>
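<p>On the utilization side, here is a quick sketch of what a USE-style probe could look like on a Windows host, using the built-in performance counters (counter paths may vary with the system locale):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Sample CPU utilization (USE: utilization) and processor queue length (USE: saturation)<br />
$cpu   = Get-Counter '\Processor(_Total)\% Processor Time'<br />
$queue = Get-Counter '\System\Processor Queue Length'<br />
Write-Output (&quot;CPU utilization : {0:N1} %&quot; -f $cpu.CounterSamples[0].CookedValue)<br />
Write-Output (&quot;Processor queue : {0}&quot; -f $queue.CounterSamples[0].CookedValue)</div></div>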
<p><strong>SQL Server infrastructure monitoring expectations</strong></p>
<p>Referring to the starting scenario and all the concepts surfaced above, it was clear we had to evolve our existing SQL Server monitoring architecture to reach the following goals:</p>
<ul>
<li>Keeping on analyzing long-term trends to answer usual questions such as: how is my daily workload evolving? How big is my database? …</li>
<li>Alerting on something that is broken and needs fixing now, or on something that is degrading and must be checked soon.</li>
<li>Building comprehensive dashboards – dashboards should answer basic questions about our SQL Server instances and include some form of advanced SQL telemetry and logging for deeper analysis.</li>
<li>Conducting ad-hoc retrospective analysis with easier correlation: for example, an HTTP response latency increase in one service. What happened around it? Is it related to a database issue, to blocking raised on the SQL Server instance, or to a new query or schema change deployed from the automated delivery pipeline? In other words, good observability should be part of the new solution.</li>
<li>Automated discovery and telemetry collection for every SQL Server instance installed in our environment, whether on a VM or in a container.</li>
<li>Relying entirely on the common platform monitoring based on Prometheus and Grafana. Having the same tooling often makes communication between people easier (the human factor is also an important aspect of DevOps).</li>
</ul>
<p><strong>Prometheus, Grafana and Telegraf</strong></p>
<p>Prometheus and Grafana are the central monitoring solution for our micro-service architecture. Others exist, but we’ll focus on these tools in the context of this write-up.<br />
Prometheus is an open-source ecosystem for monitoring and alerting. It uses a multi-dimensional data model based on time series identified by metric name and key/value pairs. PromQL is the query language used by Prometheus to aggregate data in real time, and data can be displayed directly or consumed via the HTTP API by external systems like Grafana. Unlike our previous tooling, it lets us collect SQL Server metrics alongside those of the underlying infrastructure (VMware and others), which gives a comprehensive picture of the full path between the database services and the infrastructure components they rely on.</p>
<p>Grafana is an open-source software used to display time series analytics. It allows us to query, visualize and generate alerts from our metrics. It is also possible to integrate a variety of data sources in addition to Prometheus, increasing the correlation and aggregation capabilities across metrics from different sources. Finally, Grafana comes with a native annotation store and the ability to add annotation events directly from the graph panel or via the HTTP API. This feature is especially useful for storing annotations and tags related to external events, and we decided to use it for tracking software releases and SQL Server configuration changes. Having such events directly on a dashboard can reduce troubleshooting effort by answering the why of an issue faster.</p>
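<p>As an example, here is a minimal sketch of pushing such an annotation from PowerShell through Grafana&rsquo;s annotations HTTP API; the Grafana URL and the API token are placeholders:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Post a release annotation to Grafana (URL and token are placeholders)<br />
$body = @{<br />
&nbsp; &nbsp; time = [DateTimeOffset]::UtcNow.ToUnixTimeMilliseconds()<br />
&nbsp; &nbsp; tags = @('release', 'sqlserver')<br />
&nbsp; &nbsp; text = 'Release deployed / sp_configure change applied'<br />
} | ConvertTo-Json<br />
Invoke-RestMethod -Uri 'http://grafana.local:3000/api/annotations' -Method Post `<br />
&nbsp; &nbsp; -Headers @{ Authorization = 'Bearer &lt;api-token&gt;' } `<br />
&nbsp; &nbsp; -ContentType 'application/json' -Body $body</div></div>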
<p>For collecting data we use the <a href="https://github.com/influxdata/telegraf/tree/master/plugins/inputs/sqlserver" rel="noopener" target="_blank">Telegraf plugin</a> for SQL Server. The plugin exposes all configured metrics to be polled by a Prometheus server. It can be used for both on-prem and Azure instances, including Azure SQL DB and Azure SQL MI. Automated deployment and configuration require little effort as well.</p>
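<p>As a quick sanity check that an instance actually exposes its telemetry, you can read the metrics endpoint published by Telegraf&rsquo;s prometheus_client output; a minimal sketch, assuming the plugin&rsquo;s usual default port (adjust the host and port to your configuration):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># List a few SQL Server metrics exposed by Telegraf for Prometheus scraping<br />
$metrics = (Invoke-WebRequest -Uri 'http://sqlhost:9273/metrics' -UseBasicParsing).Content<br />
$metrics -split &quot;`n&quot; | Where-Object { $_ -like 'sqlserver_*' } | Select-Object -First 10</div></div>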
<p>The high-level overview of the newly implemented monitoring solution is as follows:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-3-monitoring-architecture.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-3-monitoring-architecture-1024x776.jpg" alt="170 - 3 - monitoring architecture" width="584" height="443" class="alignnone size-large wp-image-1729" /></a></p>
<p>SQL Server telemetry is achieved through Telegraf + Prometheus and includes both black-box and white-box oriented metrics. External events like automated deployments and server-level or database-level configuration changes are monitored through a centralized, scheduled framework based on PowerShell. Annotations + tags are then written to Grafana accordingly, and event details are recorded in logging tables for further troubleshooting.</p>
<p><strong>Did the new monitoring meet our expectations?</strong></p>
<p>Well, having run the new monitoring solution for a year, I would say we are on the right track. We worked mainly on 2 dashboards. The first one exposes basic black-box metrics to show quickly if something is going wrong, while the second one is DBA-oriented, with plenty of internal counters to dig further and perform retrospective analysis.</p>
<p>Here is a sample of representative issues we faced this year and managed to fix with the new monitoring solution:</p>
<p>1) Resource pressure and black-box monitoring in action:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-4-grafana-1.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-4-grafana-1-1024x168.jpg" alt="170 - 4 - grafana 1" width="584" height="96" class="alignnone size-large wp-image-1730" /></a></p>
<p>For this scenario, the first dashboard highlighted resource pressure issues, but it is worth noting that even though the infrastructure was burning, users didn’t experience any side effects or slowness on the application side. After the corresponding alerts were raised on our side, we applied proactive, temporary fixes before users could feel them. This scenario is something we would have been able to manage with the previous monitoring, and the good news is we didn’t notice any regression on this front.</p>
<p>2) Better observability for better resolution of complex issues</p>
<p>This scenario was more interesting because the first symptom started on the application side without alerting the infrastructure layer. We started suffering from HTTP request slowness in November around 12:00, and developers got alerted to sporadic timeout issues by the logging system. After traversing the service graph, they spotted that something was going wrong on the database service by correlating HTTP slowness with blocked processes on the SQL Server dashboard, as shown below. I show a simplified view of the dashboards, but we had to cross several routes between the front-end services and the databases.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-6-grafana-3.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-6-grafana-3-1024x235.jpg" alt="170 - 6 - grafana 3" width="584" height="134" class="alignnone size-large wp-image-1732" /></a></p>
<p>Then I got a call from them and we started investigating blocked processes from the logging tables in place on the SQL Server side. At first glance, different queries were running longer than usual, and neither release deployments nor configuration updates could explain such a sudden behavior change. The issue lingered and, at 15:42, started appearing frequently enough to deserve a deeper look at the SQL Server internal metrics. We quickly found interesting correlations with other metrics and finally managed to figure out why things went wrong, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-7-grafana-4.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-7-grafana-4-706x1024.jpg" alt="170 - 7 - grafana 4" width="584" height="847" class="alignnone size-large wp-image-1733" /></a></p>
<p>The root cause was replication slowness within the Always On availability group databases, and the error log details on the secondary pointed us straight to a storage issue:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-8-errorlog.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-8-errorlog-1024x206.jpg" alt="170 - 8 - errorlog" width="584" height="117" class="alignnone size-large wp-image-1734" /></a></p>
<p>End-to-end observability, with the database services included in the new monitoring system, drastically reduced the time needed to find the root cause. We also learnt from this experience: to continuously improve observability, we added a black-box oriented metric for availability group replication latency (see below) to detect any potential issue faster.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-9-avg-replication-metric.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-9-avg-replication-metric.jpg" alt="170 - 9 - avg replication metric" width="160" height="113" class="alignnone size-full wp-image-1736" /></a></p>
<p><strong>And what’s next? </strong></p>
<p>Having such monitoring is not the end of this story. As said at the beginning of this write-up, continuous delivery comes with its own DBA challenges, illustrated by the starting scenario. Traditionally the DBA role is siloed, turning requests or tickets into work, and DBAs can lack context about the broader business or the technology used in the company. I experienced several situations myself where you get alerted during the night because a developer’s query exceeds some usage threshold. Having discussed the point with many DBAs, they tend to be conservative about database changes (a normal reaction?), especially when they sit at the end of the delivery process without a clear view of what exactly will be deployed.</p>
<p>Here is the new situation:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-2-new-scenario.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-2-new-scenario-1024x641.jpg" alt="170 - 2 - new scenario" width="584" height="366" class="alignnone size-large wp-image-1737" /></a></p>
<p>Implementing the new monitoring changed the way we observe the system (at least from a DBA perspective). Again, I believe the added value of the DBA role in a company with a strong DevOps mindset comes from being part production DBA and part development DBA. Making observability consistent across the whole delivery pipeline, including databases, is likely part of the success and may help the DBA get a broader picture of the system components. In my context, I’m now able to interact more with developer teams in the early phases and to give them contextual (rather than generic) feedback for improvements based on SQL production telemetry. They also have access to it and can check the impact of their developments by themselves. In the same way, feedback and work with my team on database infrastructure topics become more relevant.</p>
<p>In the end, it is a matter of collaboration.</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Monitoring Azure SQL Databases with Azure Monitor and Automation</title>
		<link>https://blog.developpez.com/mikedavem/p13198/sql-server-2012/monitoring-azure-sql-databases-with-azure-monitor-and-automation</link>
		<comments>https://blog.developpez.com/mikedavem/p13198/sql-server-2012/monitoring-azure-sql-databases-with-azure-monitor-and-automation#comments</comments>
		<pubDate>Sun, 23 Aug 2020 15:32:07 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[SQL Azure]]></category>
		<category><![CDATA[SQL Server 2012]]></category>
		<category><![CDATA[Azure]]></category>
		<category><![CDATA[Azure Alerts]]></category>
		<category><![CDATA[Azure Monitor]]></category>
		<category><![CDATA[Azure SQL Database]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1653</guid>
		<description><![CDATA[Supervising Cloud Infrastructure is an important aspect of Cloud administration and Azure SQL Databases are no exception. This is something we are continuously improving at my company. On-prem, DBAs often rely on well-established products but with Cloud-based architectures, often implemented &#8230; <a href="https://blog.developpez.com/mikedavem/p13198/sql-server-2012/monitoring-azure-sql-databases-with-azure-monitor-and-automation">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Supervising Cloud Infrastructure is an important aspect of Cloud administration and Azure SQL Databases are no exception. This is something we are continuously improving at my company. </p>
<p>On-prem, DBAs often rely on well-established products, but with Cloud-based architectures, often implemented through DevOps projects and developers, monitoring has to be redefined and must include new topics such as:</p>
<p><span id="more-1653"></span></p>
<p>1)	Observability of Cloud service usage and fees<br />
2)	Detection of metrics and events that could affect the bottom line<br />
3)	Implementing a single platform to report all data coming from different sources<br />
4)	Triggering rules when the workload rises above or drops below certain levels, or when an event is relevant enough because it doesn&rsquo;t meet the configuration standard and implies unwanted extra billing, or when it compromises the company security rules<br />
5)	Monitoring of the user experience</p>
<p>A key benefit often discussed about Cloud computing, and mainly driven by DevOps, is how it enables agility. One meaning of the term agility is tied to the rapid provisioning of compute resources (in seconds or minutes), and this shorter provisioning path lets work start quickly. You may be tempted to grant some provisioning permissions to DEV teams and, in my opinion, this is not a bad thing, but it may come with drawbacks if it is not kept under control by the Ops team, databases included. For example, I have in mind some real cases: architecture configuration drift, security breaches created by unwanted item changes, or idle orphan resources you keep being charged for. All these scenarios may lead either to security issues or to extra billing, and I believe it is important to get clear visibility of such events.</p>
<p>In my company, the Azure built-in capabilities around the Azure Monitor architecture are our first target (at least in a first stage) and seem to address the aforementioned topics. To set the context, we already relied on the Azure Monitor infrastructure for different things, including Query Performance Insight, SQL Audit analysis through Log Analytics, and Azure alerts for some performance metrics. Therefore, the obvious way to go further was to add activity log events to the story.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/08/165-1-Azure-Monitor.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/08/165-1-Azure-Monitor.jpg" alt="165 - 1 - Azure Monitor" width="843" height="474" class="alignnone size-full wp-image-1655" /></a></p>
<p>In this blog post, let’s focus on items 2) and 4). I would like to share some experiments and thoughts about them. As a reminder, items 2) and 4) are about catching relevant events to help identify configuration and security drifts and acting on them. In addition, as with many event-based architectures, new events may appear or evolve over time, and we started thinking about the concept with the following basic diagram …</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/08/165-2-Workflow-chart-e1598182358607.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/08/165-2-Workflow-chart-e1598182358607.jpg" alt="165 - 2 - Workflow chart" width="800" height="533" class="alignnone size-full wp-image-1657" /></a></p>
<p>… that led to the creation of the two following workflows:<br />
&#8211;	Workflow 1: get notified immediately of critical events that may compromise security or quickly lead to significant extra billing<br />
&#8211;	Workflow 2: get a report of other misconfigured items (including critical ones) on a scheduled basis, for events that don’t require quick responsiveness from the Ops team.</p>
<p>Concerning the first workflow, using <a href="https://docs.microsoft.com/en-us/azure/azure-monitor/platform/activity-log-alerts" rel="noopener" target="_blank">alerts on activity logs</a>, action groups and webhooks as input to an Azure Automation runbook appeared to be a good solution. The second one only requires running an Azure Automation runbook on a scheduled basis. In fact, it is the same runbook but with different input parameters according to the targeted environment (e.g. PROD / ACC / INT). In addition, the runbook should be able to identify unmanaged events and notify the Ops team, who will decide either to skip them or to integrate them into the runbook processing.</p>
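<p>As a sketch of how the second workflow can be wired up with the Az.Automation module (the automation account, resource group and runbook names below are placeholders), the weekly schedule and its binding to the runbook could look like this:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Create a weekly schedule and bind it to the runbook (workflow 2)<br />
$params = @{ EnvTarget = 'PROD'; DebugMode = $false }<br />
New-AzAutomationSchedule -AutomationAccountName 'aa-monitoring' -ResourceGroupName 'rg-automation' `<br />
&nbsp; &nbsp; -Name 'WeeklySqlCheck' -StartTime (Get-Date).AddDays(1) -WeekInterval 1 -DaysOfWeek Sunday<br />
Register-AzAutomationScheduledRunbook -AutomationAccountName 'aa-monitoring' -ResourceGroupName 'rg-automation' `<br />
&nbsp; &nbsp; -RunbookName 'Check-AzureSqlEnvironment' -ScheduleName 'WeeklySqlCheck' -Parameters $params</div></div>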
<p>Azure alerts can be divided into different categories, including metric alerts, log alerts and activity log alerts. The last one drew our attention because it allows getting notified of operations on specific resources, either by email or by generating a JSON payload reusable from an Azure Automation runbook. Focusing on the latter, we came up with what we thought was a reasonable solution.</p>
<p>Here is the high-level picture of the architecture we implemented:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/08/165-3-Architecture-e1598182462929.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/08/165-3-Architecture-e1598182462929.jpg" alt="165 - 3 - Architecture" width="800" height="347" class="alignnone size-full wp-image-1659" /></a></p>
<p>1-	During the creation of an Azure SQL server or database, the corresponding alerts are added with the Administrative category and a specific scope. Note that the operations concerned must be registered with Azure Resource Manager in order to show up in the activity log; fortunately, in this case they are all included in the <a href="https://docs.microsoft.com/en-us/azure/role-based-access-control/resource-provider-operations" rel="noopener" target="_blank">Microsoft.Sql</a> resource provider.<br />
2-	When an event occurs on the targeted environment, an alert is triggered, as well as the concerned runbook.<br />
3-	The execution of the same runbook, but with different input parameters, is scheduled on a weekly basis to produce a general configuration report of our Azure SQL environments.<br />
4-	Depending on the event, the Ops team gets notified and acts (updating the misconfigured item, deleting the unauthorized item, updating the runbook code in the Git repo to handle the new event, and so on …)</p>
<p>The skeleton of the Azure automation runbook is pretty similar to the following one:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">[OutputType(&quot;PSAzureOperationResponse&quot;)]<br />
param<br />
(<br />
&nbsp; &nbsp; [Parameter (Mandatory=$false)]<br />
&nbsp; &nbsp; [object] $WebhookData<br />
&nbsp; &nbsp; ,<br />
&nbsp; &nbsp; [parameter(Mandatory=$False)]<br />
&nbsp; &nbsp; [ValidateSet(&quot;PROD&quot;,&quot;ACC&quot;,&quot;INT&quot;)]<br />
&nbsp; &nbsp; [String]$EnvTarget<br />
&nbsp; &nbsp; ,<br />
&nbsp; &nbsp; [parameter(Mandatory=$False)]<br />
&nbsp; &nbsp; [Boolean]$DebugMode = $False<br />
)<br />
<br />
<br />
<br />
<br />
If ($WebhookData)<br />
{<br />
<br />
&nbsp; &nbsp; # Logic to allow for testing in test pane<br />
&nbsp; &nbsp; If (-Not $WebhookData.RequestBody){<br />
&nbsp; &nbsp; &nbsp; &nbsp; $WebhookData = (ConvertFrom-Json -InputObject $WebhookData)<br />
&nbsp; &nbsp; }<br />
<br />
&nbsp; &nbsp; $WebhookBody = (ConvertFrom-Json -InputObject $WebhookData.RequestBody)<br />
<br />
&nbsp; &nbsp; $schemaId = $WebhookBody.schemaId<br />
<br />
&nbsp; &nbsp; If ($schemaId -eq &quot;azureMonitorCommonAlertSchema&quot;) {<br />
&nbsp; &nbsp; &nbsp; &nbsp; # This is the common Metric Alert schema (released March 2019)<br />
&nbsp; &nbsp; &nbsp; &nbsp; $Essentials = [object] ($WebhookBody.data).essentials<br />
&nbsp; &nbsp; &nbsp; &nbsp; # Get the first target only as this script doesn't handle multiple<br />
&nbsp; &nbsp; &nbsp; &nbsp; $status = $Essentials.monitorCondition<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; # Focus only on succeeded or Fired Events<br />
&nbsp; &nbsp; &nbsp; &nbsp; If ($status -eq &quot;Succeeded&quot; -Or $Status -eq &quot;Fired&quot;)<br />
&nbsp; &nbsp; &nbsp; &nbsp; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Extract info from webhook <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $alertTargetIdArray = (($Essentials.alertTargetIds)[0]).Split(&quot;/&quot;)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $SubId = ($alertTargetIdArray)[2]<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $ResourceGroupName = ($alertTargetIdArray)[4]<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $ResourceType = ($alertTargetIdArray)[6] + &quot;/&quot; + ($alertTargetIdArray)[7]<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Determine code path depending on the resourceType<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if ($ResourceType -eq &quot;microsoft.sql/servers&quot;)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # DEBUG<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;This is a SQL Server Resource.&quot;<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $firedDate = $Essentials.firedDateTime<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $AlertContext = [object] ($WebhookBody.data).alertContext<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $channel = $AlertContext.channels<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $EventSource = $AlertContext.eventSource<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $Level = $AlertContext.level<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $Operation = $AlertContext.operationName<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $Properties = [object] ($WebhookBody.data).alertContext.properties<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $EventName = $Properties.eventName<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $EventStatus = $Properties.status<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $Description = $Properties.description_scrubbed<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $Caller = $Properties.caller<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $IPAddress = $Properties.ipAddress<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $ResourceName = ($alertTargetIdArray)[8]<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $DatabaseName = ($alertTargetIdArray)[10]<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $Operation_detail = $Operation.Split('/')<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Check firewall rules<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; If ($EventName -eq 'OverwriteFirewallRules'){<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;Firewall Overwrite is detected ...&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Code to handle firewall update event<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Update DB =&gt; No need to be monitored in real time<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Elseif ($EventName -eq 'UpdateDatabase') {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Code to handle Database config update event or skip <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Create DB<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Elseif ($EventName -eq 'CreateDatabase' -Or `<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $Operation -eq 'Microsoft.Sql/servers/databases/write'){<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;Azure Database creation has been detected ...&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Code to handle Database creation event or skip <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Delete DB<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Elseif ($EventName -eq 'DeleteDatabase' -Or `<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $Operation -eq 'Microsoft.Sql/servers/databases/delete') {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;Azure Database has been deleted ...&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Code to handle Database deletion event or skip <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Elseif ($Operation -eq 'Microsoft.Sql/servers/databases/transparentDataEncryption/write') {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;Azure Database Encryption update has been detected ...&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Code to handle Database encryption update event or skip <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Elseif ($Operation -eq 'Microsoft.Sql/servers/databases/auditingSettings/write') {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;Azure Database Audit update has been detected ...&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Code to handle Database audit update event or skip <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Elseif ($Operation -eq 'Microsoft.Sql/servers/databases/securityAlertPolicies/write' -or $Operation -eq 'Microsoft.Sql/servers/databases/vulnerabilityAssessments/write') {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;Azure ADS update has been detected ...&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Code to handle ADS update event or skip <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ElseIf ($Operation -eq 'Microsoft.Sql/servers/databases/backupShortTermRetentionPolicies/write'){<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;Azure Retention Backup has been modified ...&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Code to handle Database retention backup update event or skip <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # ... other ones <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Else {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Output &quot;Event not managed yet &nbsp; &nbsp;&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; else {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # ResourceType not supported<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Error &quot;$ResourceType is not a supported resource type for this runbook.&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; Else {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # The alert status was not 'Succeeded' or 'Fired' so no action taken<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Write-Verbose (&quot;No action taken. Alert status: &quot; + $status) -Verbose<br />
&nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; }<br />
&nbsp; &nbsp; Else{<br />
&nbsp; &nbsp; &nbsp; &nbsp;# SchemaID doesn't correspond to azureMonitorCommonAlertSchema =&gt; skip<br />
&nbsp; &nbsp; &nbsp; &nbsp;Write-Host &quot;Skip ...&quot; <br />
&nbsp; &nbsp; }<br />
}<br />
Else {<br />
&nbsp; &nbsp; Write-Output &quot;No Webhook detected ... switch to normal mode ...&quot;<br />
<br />
&nbsp; &nbsp; If ([String]::IsNullOrEmpty($EnvTarget)){<br />
&nbsp; &nbsp; &nbsp; &nbsp; Write-Error '$EnvTarget is mandatory in normal mode'<br />
&nbsp; &nbsp; }<br />
<br />
&nbsp; &nbsp; #########################################################<br />
&nbsp; &nbsp; # Code for a complete check of Azure SQL DB environment #<br />
&nbsp; &nbsp; #########################################################<br />
}</div></div>
<p>Some comments about the PowerShell script:</p>
<p>1)	Input parameters should include either the webhook data or specific parameter values for a complete Azure SQL DB check.<br />
2)	The first section should include your own functions to respond to the different events. In our context, we currently draw on <a href="https://github.com/sqlcollaborative/dbachecks" rel="noopener" target="_blank">DBAChecks</a> with the idea of developing a derived model, but why not use DBAChecks directly in the near future?<br />
3)	When an event is triggered, a JSON payload is generated and provides insight. The point here is that you must navigate through different properties depending on the operation type (cf. <a href="https://docs.microsoft.com/en-us/azure/azure-monitor/platform/activity-log-schema" rel="noopener" target="_blank">BOL</a>).<br />
4)	The growing number of events to manage could become an issue and make the runbook fat, especially if we keep both the core functions and the event processing in it. To mitigate this, we are thinking of moving the functions into Azure Automation modules (next step).</p>
<p><strong>Bottom line</strong></p>
<p>Thanks to the Azure built-in capabilities, we improved our visibility of the events that occur on the Azure SQL environment (both expected and unexpected) and we’re now able to act accordingly. But I should tell you that going this way is not a free lunch: we reached a reasonable solution only after some programming and testing effort. If you can invest the time, it is probably the kind of solution worth adding to your study.</p>
<p>See you</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>dbachecks and AlwaysOn availability group checks</title>
		<link>https://blog.developpez.com/mikedavem/p13194/sql-server-2012/dbachecks-and-alwayson-availability-group-checks</link>
		<comments>https://blog.developpez.com/mikedavem/p13194/sql-server-2012/dbachecks-and-alwayson-availability-group-checks#comments</comments>
		<pubDate>Mon, 20 Apr 2020 19:57:31 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[SQL Server 2012]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[automation]]></category>
		<category><![CDATA[dbachecks]]></category>
		<category><![CDATA[dbatools]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[Powershell]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1591</guid>
		<description><![CDATA[When I started my DBA position in my new company, I was looking for a tool that was able to check periodically the SQL Server database environments for several reasons. First, as DBA one of my main concern is about &#8230; <a href="https://blog.developpez.com/mikedavem/p13194/sql-server-2012/dbachecks-and-alwayson-availability-group-checks">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
<content:encoded><![CDATA[<p>When I started my DBA position at my new company, I was looking for a tool able to check the SQL Server database environments periodically, for several reasons. First, as a DBA, one of my main concerns is maintaining and keeping the different mssql environments well-configured against an initial standard. It is also worth noting that I’m not the only person interacting with the databases: anyone in my team, being a member of the sysadmin server role as well, can change any server-level configuration setting at any moment. Chances are the environments will drift from our initial standard over time, so my team and I need to stay confident by periodically checking the current mssql environment configurations, alerting if configuration drifts exist and, obviously, fixing them as fast as possible.  </p>
<p><span id="more-1591"></span></p>
<p>A while ago, I relied on the SQL Server Policy Based Management feature (PBM) to carry out this task at one of my former customers, and I have to say it did the job, but with some limitations. Indeed, PBM is an instance-scoped feature and doesn’t allow checking configuration settings outside the SQL Server instance, for example. During my investigation, the <a href="https://dbachecks.readthedocs.io/en/latest/" rel="noopener" target="_blank">dbachecks</a> framework drew my attention for several reasons:</p>
<p>&#8211;	It allows checking different settings at different scopes, including operating system and SQL Server instance items<br />
&#8211;	It is an open source project and keeps evolving with SQL / PowerShell community contributions.<br />
&#8211;	It is extensible, and we may add custom checks to the list of predefined checks shipped with the targeted version.<br />
&#8211;	It is based on PowerShell and the Pester framework and fits well with the existing automation and GitOps processes in my company</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/161-0-dbachecks-process.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/161-0-dbachecks-process.jpg" alt="161 - 0 - dbachecks process" width="1003" height="395" class="alignnone size-full wp-image-1592" /></a></p>
<p>The first dbachecks version we deployed in production a couple of months ago was 1.2.24 and, unfortunately, it didn’t include reliable tests for availability groups. It was the starting point of my first contributions to open source projects, and I felt proud and honored when I saw my 2 PRs validated for the dbachecks tool, covering the disk allocation unit test and the availability group checks:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/161-1-release-note.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/161-1-release-note.jpg" alt="161 - 1 - release note" width="798" height="348" class="alignnone size-full wp-image-1593" /></a></p>
<p>Obviously, this is just a humble contribution and, to be clear, I didn’t write the existing tests for AGs, but I spent some time applying fixes for a better detection of all AG environments and their replicas in simple and complex topologies (several replicas on the same server and non-default ports, for example).</p>
<p>So, here is the current list of AG checks in version 1.2.29, at the time of this write-up:</p>
<p>&#8211;	Cluster node should be up<br />
&#8211;	AG resource + IP Address in the cluster should be online<br />
&#8211;	Cluster private and public network should be up<br />
&#8211;	HADR should be enabled on each AG replica<br />
&#8211;	AG Listener + AG replicas should be pingable and reachable from client connections<br />
&#8211;	AG replica should be in the correct domain name<br />
&#8211;	AG replica port number should be equal to the port specified in your standard<br />
&#8211;	AG availability mode should not be in unknown state and should be in synchronized or synchronizing state according to the replication type<br />
&#8211;	Each high available database (member of an AG) should be in synchronized / synchronizing state, ready for failover, joined to the AG and not in suspended state<br />
&#8211;	Each AG replica should have an extended event session called AlwaysOn_health which is in running state and configured in auto start mode</p>
<p>Mandatory parameters are <strong>app.cluster</strong> and <strong>domain.name</strong>.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Get-DbcCheck -Tag HADR | ft Group, Type, AllTags, Config -AutoSize<br />
<br />
Group Type &nbsp; &nbsp; &nbsp; &nbsp;AllTags &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Config<br />
----- ---- &nbsp; &nbsp; &nbsp; &nbsp;------- &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ------<br />
HADR &nbsp;ClusterNode ClusterHealth, HADR app.sqlinstance app.cluster skip.hadr.listener.pingcheck domain.name policy...</div></div>
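<p>For illustration, setting those mandatory values (the cluster and domain names below are placeholders for your own environment) is done with Set-DbcConfig:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Set the mandatory HADR configuration values (placeholder values)<br />
Set-DbcConfig -Name app.cluster -Value 'mycluster.mydomain.local'<br />
Set-DbcConfig -Name domain.name -Value 'mydomain.local'</div></div>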
<p>The starting point of the HADR checks is the Windows Failover Cluster component, and other tests are then performed hierarchically on each subcomponent, including the availability group, the AG replicas and the AG databases.</p>
<p>Then you may change the behavior of the HADR check process according to your context by using the following parameters:</p>
<p>&#8211;	skip.hadr.listener.pingcheck =&gt; Skip the ping check of the HADR listener<br />
&#8211;	skip.hadr.listener.tcpport   =&gt; Skip the standard TCP port check for AG listeners<br />
&#8211;	skip.hadr.replica.tcpport    =&gt; Skip the standard TCP port check for AG replicas</p>
<p>For instance, in my context, I set the <strong>skip.hadr.replica.tcpport</strong> parameter to skip checks on replica ports, because we have environments that include several replicas on the same server listening on a non-default port.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Get-DbcConfig skip.hadr.*<br />
Name &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Value Description<br />
---- &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ----- -----------<br />
skip.hadr.listener.pingcheck False Skip the HADR listener ping test (especially useful for Azure and AWS)<br />
skip.hadr.listener.tcpport &nbsp; False Skip the HADR AG Listener TCP port number (If port number is not standard acro...<br />
skip.hadr.replica.tcpport &nbsp; &nbsp; True Skip the HADR Replica TCP port number (If port number is not standard across t...</div></div>
<p>The HADR checks can simply be run by using the HADR tag as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Invoke-DbcCheck -Tag HADR &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br />
Pester v4.10.1 &nbsp;Executing all tests in 'C:\Program Files\WindowsPowerShell\Modules\dbachecks\1.2.29\checks\HADR.Tests.ps1' with Tags HADR &nbsp; <br />
...</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/161-2-hadr-checks-e1587413256656.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/161-2-hadr-checks-e1587413256656.jpg" alt="161 - 2 - hadr checks" width="1000" height="426" class="alignnone size-full wp-image-1599" /></a>                               </p>
<p>Well, this is a good start, but most of the checks are state-oriented and some configuration checks are still missing. I’m willing to add some of them in the near future, and feel free to add your own contribution as well <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /></p>
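<p>As an illustration of that extensibility, a custom configuration-oriented check is essentially a Pester test. Here is a minimal sketch, assuming dbatools&rsquo; Get-DbaAgReplica, a placeholder instance name and synchronous commit as the (arbitrary) standard to enforce:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># A minimal custom check: every AG replica should use synchronous commit<br />
Describe &quot;AG replica availability mode&quot; -Tags CustomHADR {<br />
&nbsp; &nbsp; $replicas = Get-DbaAgReplica -SqlInstance 'sql-node1'<br />
&nbsp; &nbsp; foreach ($replica in $replicas) {<br />
&nbsp; &nbsp; &nbsp; &nbsp; It &quot;replica $($replica.Name) should use synchronous commit&quot; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $replica.AvailabilityMode | Should -Be 'SynchronousCommit'<br />
&nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; }<br />
}</div></div>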
<p>Stay tuned! </p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SQL DB Azure, performance scaling thoughts</title>
		<link>https://blog.developpez.com/mikedavem/p13188/sql-azure/sql-db-azure-performance-scaling-thoughts</link>
		<comments>https://blog.developpez.com/mikedavem/p13188/sql-azure/sql-db-azure-performance-scaling-thoughts#comments</comments>
		<pubDate>Thu, 20 Feb 2020 21:09:54 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[SQL Azure]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[SQL Azure DB]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1490</guid>
		<description><![CDATA[Let’s continue with Azure stories and performance scaling &#8230; A couple of weeks ago, we studied opportunities to replace existing clustered indexes (CI) with columnstore indexes (CCI) for some facts. To cut the story short and to focus on the &#8230; <a href="https://blog.developpez.com/mikedavem/p13188/sql-azure/sql-db-azure-performance-scaling-thoughts">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Let’s continue with Azure stories and performance scaling &#8230;</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/02/155-0-banner-e1582232926354.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/02/155-0-banner-e1582232926354.jpg" alt="155 - 0 - banner" width="500" height="288" class="alignnone size-full wp-image-1507" /></a></p>
<p>A couple of weeks ago, we studied opportunities to replace existing clustered indexes (CI) with clustered columnstore indexes (CCI) for some fact tables. To cut a long story short and to focus on the topic of this write-up, we prepared a creation script for the specific CCIs, based on a variation of <a href="http://www.nikoport.com/2014/04/16/clustered-columnstore-indexes-part-29-data-loading-for-better-segment-elimination/" rel="noopener" target="_blank">Niko’s technique</a> (no MAXDOP = 1, meaning we enable parallelism), in order to get better segment alignment. </p>
<p><span id="more-1490"></span></p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Recreation of clustered index<br />
CREATE CLUSTERED INDEX [PK_FACT_IDX] <br />
ON dbo.FactTable (KeyColumn)<br />
WITH (DROP_EXISTING = ON, DATA_COMPRESSION = PAGE);<br />
<br />
-- Creation of the CCI<br />
CREATE CLUSTERED COLUMNSTORE INDEX [PK_FACT_IDX] <br />
ON dbo.FactTable <br />
WITH (DROP_EXISTING = ON);<br />
<br />
-- Recreation of [1 ... n] nonclustered indexes<br />
CREATE INDEX [IDX_xxx … n]<br />
ON dbo.FactTable (column)<br />
WITH (DROP_EXISTING = ON, DATA_COMPRESSION = PAGE);</div></div>
<p>Before deploying those indexes to our Azure SQL DB environment, we staged a first scenario on an on-premises instance, and the creation of all indexes took ~1h. It is worth noting that our tests are based on the same database with the same data in all cases. But guess what, the story was different in Azure <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /> and I got feedback from the other team responsible for deploying the indexes in Azure: the creation script ran quite a bit longer (~4h).<br />
I definitely enjoyed this story because we got a deeper understanding of the Azure SQL DB performance topic.</p>
<p><strong>=&gt; Does moving to the cloud mean we’ll get slower performance? </strong></p>
<p>Before drawing conclusions too quickly, a good habit is to compare the specifications of both environments; we don’t want to compare apples and oranges. So let me set my own context: on one side, the on-premises virtual SQL Server environment specification includes 8 vCPUs (Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz), 64 GB of RAM and a high-performance storage array with micro-latency devices dedicated to our IO-intensive workloads. From the vendor specifications, we may expect very interesting IO performance, with a general throughput greater than 100 KIOPs (random) or 1GB/s (sequential). On the other side, the Azure SQL DB is based on the General Purpose: Serverless, Gen5, 8 vCores service pricing tier. We use the vCore purchasing model and, referring to the <a href="https://docs.microsoft.com/bs-latn-ba/Azure/sql-database/sql-database-vcore-resource-limits-single-databases" rel="noopener" target="_blank">Microsoft documentation</a>, hardware generation 5 includes a compute specification based on Intel E5-2673 v4 (Broadwell) 2.3-GHz and Intel SP-8160 (Skylake) processors. Added to this, the service pricing tier comes with remote SSD-based storage, with IO latency around 5-7ms and 2560 IOPs max. Thanks to the elasticity of the infrastructure, we could scale up to 16 vCores, 48GB of RAM and 5120 IOPs for data. Obviously, latency remains the same in this case.</p>
<p>As an illustration, the creation of all indexes (CI + CCI + NCIs) performed in our on-premises environment gave the following storage performance figures: ~700MB/s and 13K IOPs for maximum values, which were an aggregation of DATA + LOG activity on the D: drive. Rebuilding indexes is a resource-intensive operation in terms of CPU as well, and we obviously noticed CPU saturation at different steps of the operation.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/02/155-1-on-premises-storage-performance-e1582231532623.gif"><img src="http://blog.developpez.com/mikedavem/files/2020/02/155-1-on-premises-storage-performance-e1582231532623.gif" alt="155 - 1 - on-premises-storage-performance" width="900" height="448" class="alignnone size-full wp-image-1491" /></a></p>
<p>&#8230;</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/02/155-2-on-premises-cpu-performance-e1582231566692.gif"><img src="http://blog.developpez.com/mikedavem/files/2020/02/155-2-on-premises-cpu-performance-e1582231566692.gif" alt="155 - 2 - on-premises-cpu-performance" width="800" height="398" class="alignnone size-full wp-image-1492" /></a></p>
<p>As an aside, we may notice that the creation of the CCI is a less resource-intensive operation, and we find the same pattern in Azure below. Talking of which, let’s compare with our Azure SQL DB. There are different ways to get performance metrics, including the portal, which enables monitoring performance through an easy-to-use interface, or per-database DMVs like sys.dm_db_resource_stats. It is worth noting that in Azure SQL DB metrics are expressed as a percentage of the service tier limit, so you need to adjust your analysis to the tier you’re using. First, we observed the same resource utilization pattern for all steps of the creation script, but over a different timeline: the duration increased to 4h (as mentioned by the other team). The picture clearly shows we reached the limit of the configured service tier, especially for Log IO (green line), and we had already switched from the GP_S_Gen5_8 to the GP_S_Gen5_16 service tier.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/02/155-3-Az-CCI_Gen5_16_General_Purpose_CI_CCI_compressed_page-e1582231670221.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/02/155-3-Az-CCI_Gen5_16_General_Purpose_CI_CCI_compressed_page-e1582231670221.jpg" alt="155 - 3 - Az - CCI_Gen5_16_General_Purpose_CI_CCI_compressed_page" width="1200" height="278" class="alignnone size-full wp-image-1494" /></a></p>
<p>In addition, Wait stats gave interesting insights as well:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/02/155-5-wait_stats_CCI_index_Gen5_8_16_GP_CI_CCI_compressed_page_-e1582231763875.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/02/155-5-wait_stats_CCI_index_Gen5_8_16_GP_CI_CCI_compressed_page_-e1582231763875.jpg" alt="155 - 5 - wait_stats_CCI_index_Gen5_8_16_GP_CI_CCI_compressed_page_" width="1200" height="226" class="alignnone size-full wp-image-1496" /></a></p>
<p>Excluding the traditional PAGEIOLATCH_xx waits, the LOG_RATE_GOVERNOR wait type appeared among the top waits and confirmed that we had bumped into the limits imposed on transaction log I/O by our service tier.</p>
<p><strong>=&gt; Scaling vs Upgrading the Service for better performance?  </strong></p>
<p>With Azure SQL DB PaaS, we may benefit from an elastic architecture. Firstly, scaling the number of vCores is a factor of improvement, and there is a direct relationship with storage (IOPs), memory or the disk space allocated for tempdb, for instance. But the order of magnitude varies with the service tier, as shown below:</p>
<p>For General Purpose ServerLess Generation 5 service tier &#8211; Resources per Core</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/02/155-6-Gen5_8_16_GP_service_tier_perf_.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/02/155-6-Gen5_8_16_GP_service_tier_perf_.jpg" alt="155 - 6 - Gen5_8_16_GP_service_tier_perf_" width="1002" height="175" class="alignnone size-full wp-image-1499" /></a></p>
<p>Something relevant here: even though performance increases with the number of provisioned vCores, the Log IO saturation we can deduce from our Azure test (especially in the first step of the CI creation) results from the max log rate limitation, which doesn’t scale in the same way. This is especially relevant because, as said previously, index creation is a resource-intensive operation with a huge impact on the transaction log.</p>
<p><strong>What would be a solution to speed up this operation? </strong></p>
<p>A first viable solution in our context would be to switch to the SIMPLE recovery model, which fits our scenario perfectly: we would get minimal logging capabilities and a lower impact on the transaction log, and it is suitable for DW environments. Unfortunately, at the time of this write-up, this is not supported, and I suggest you vote on <a href="https://feedback.azure.com/forums/217321-sql-database/suggestions/36400585-allow-recovery-model-to-be-changed-to-simple-in-az" rel="noopener" target="_blank">Azure feedback</a> if you are interested.<br />
From an infrastructure standpoint, improving the max log rate throughput is only possible by upgrading to a higher service tier (but at the cost of higher fees, obviously). For the sake of curiosity, I gave the <strong>BC_Gen5_16</strong> service tier specifications a try:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/02/155-6-Gen5_8_16_BC_service_tier_perf_.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/02/155-6-Gen5_8_16_BC_service_tier_perf_.jpg" alt="155 - 6 - Gen5_8_16_BC_service_tier_perf_" width="1002" height="175" class="alignnone size-full wp-image-1500" /></a></p>
<p>Even if this new service tier seems to be a better fit (as suggested by the relative percentage of resource usage) …</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/02/155-4-CCI_index_Gen5_16_Business_Critical_CI_CCI_compressed_page_-e1582232230338.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/02/155-4-CCI_index_Gen5_16_Business_Critical_CI_CCI_compressed_page_-e1582232230338.jpg" alt="155 - 4 - CCI_index_Gen5_16_Business_Critical_CI_CCI_compressed_page_" width="1200" height="203" class="alignnone size-full wp-image-1501" /></a></p>
<p>… there are important notes here:</p>
<p>1) Business Critical Tier is not available for Serverless architecture</p>
<p>2) Moving to a different service tier is not instantaneous and may take several hours depending on the database size (~3h for a ~500GB database in my case). So this is not a viable option even if we get better performance: we have to add the time to upgrade to the higher service tier (3h) to the time to run the creation script (3h, a 25% performance gain compared to the previous GP_S_Gen5_16 service tier). We could obviously upgrade again to get performance closer to our on-premises environment, but is it worth fighting for here, only for an index creation script? </p>
<p>Concerning our scenario (data warehouse), it is generally easy to schedule an off-peak time frame that doesn&rsquo;t overlap with the processing-oriented workload, but that may not be the case for everyone!  </p>
<p>See you!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SQL Server on Linux: Introduction to DBFS</title>
		<link>https://blog.developpez.com/mikedavem/p13165/sql-server-vnext/sql-server-sur-linux-introduction-a-dbfs</link>
		<comments>https://blog.developpez.com/mikedavem/p13165/sql-server-vnext/sql-server-sur-linux-introduction-a-dbfs#comments</comments>
		<pubDate>Tue, 02 Jan 2018 19:03:51 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[DBFS]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[monitoring]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1367</guid>
		<description><![CDATA[A few months ago, Microsoft announced 2 additional command-line tools for SQL Server, notably mssql-scripter and DBFS. The latter particularly caught my attention because it exposes real-time data from the &#8230; <a href="https://blog.developpez.com/mikedavem/p13165/sql-server-vnext/sql-server-sur-linux-introduction-a-dbfs">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
<content:encoded><![CDATA[<p>A few months ago, Microsoft announced <a href="https://blogs.technet.microsoft.com/dataplatforminsider/2017/05/17/try-new-sql-server-command-line-tools-to-generate-t-sql-scripts-and-monitor-dynamic-management-views/" rel="noopener" target="_blank">2 additional command-line tools</a> for SQL Server, notably mssql-scripter and DBFS. The latter particularly caught my attention because it exposes real-time data from the famous DMVs through a pseudo virtual file system in the manner of the Linux <a href="https://en.wikipedia.org/wiki/Procfs" rel="noopener" target="_blank">procfs</a>.</p>
<p>&gt; <a href="https://blog.dbi-services.com/sql-server-on-linux-introduction-to-dbfs-experimental-tool/" rel="noopener" target="_blank">Read more</a> (in English)</p>
<p>David Barbarin<br />
MVP &amp; MCM SQL Server</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SQL Saturdays 2013 in Paris: the slides</title>
		<link>https://blog.developpez.com/mikedavem/p12265/sql-server-2008/sql-saturdays-2013-paris-les-slides</link>
		<comments>https://blog.developpez.com/mikedavem/p12265/sql-server-2008/sql-saturdays-2013-paris-les-slides#comments</comments>
		<pubDate>Mon, 07 Oct 2013 14:43:19 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[Evénements]]></category>
		<category><![CDATA[SQL Server 2008]]></category>
		<category><![CDATA[SQL Server 2008 R2]]></category>
		<category><![CDATA[SQL Server 2012]]></category>
		<category><![CDATA[événements étendues]]></category>
		<category><![CDATA[Extended events]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[SQL Saturdays]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[troubleshooting]]></category>
		<category><![CDATA[XE]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=715</guid>
		<description><![CDATA[As promised, here are the slides of my session on extended events. Unfortunately, I couldn&#8217;t run all the demos I wanted... well, you do have to start by explaining how extended events work, and that &#8230; <a href="https://blog.developpez.com/mikedavem/p12265/sql-server-2008/sql-saturdays-2013-paris-les-slides">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
<content:encoded><![CDATA[<p>As promised, here are the <a href="https://skydrive.live.com/?cid=624fd44664395fc5#cid=624FD44664395FC5&amp;id=624FD44664395FC5%214792">slides</a> of my session on extended events. Unfortunately, I couldn’t run all the demos I wanted... well, you do have to start by explaining how extended events work, and that took me a bit longer than planned <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":-)" class="wp-smiley" /> . But it is only postponed: I’m keeping that part for a demo-only session, which I hope to present to you soon.</p>
<p>Thanks to everyone who came to my presentation and, more generally, attended this first SQL Saturday Paris. It was a great moment of exchanges and meetings!</p>
<p>David BARBARIN (Mikedavem)    <br />MVP SQL Server</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
