<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>David Barbarin &#187; sqlserver</title>
	<atom:link href="https://blog.developpez.com/mikedavem/ptag/sqlserver/feed" rel="self" type="application/rss+xml" />
	<link>https://blog.developpez.com/mikedavem</link>
	<description>MVP DataPlatform - MCM SQL Server</description>
	<lastBuildDate>Thu, 09 Sep 2021 21:19:50 +0000</lastBuildDate>
	<language>fr-FR</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.42</generator>
	<item>
		<title>Extending SQL Server monitoring with Raspberry PI and Lametric</title>
		<link>https://blog.developpez.com/mikedavem/p13204/sql-server-2005/extending-sql-server-monitoring-with-raspberry-pi-and-lametric</link>
		<comments>https://blog.developpez.com/mikedavem/p13204/sql-server-2005/extending-sql-server-monitoring-with-raspberry-pi-and-lametric#comments</comments>
		<pubDate>Thu, 07 Jan 2021 21:59:25 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[K8s]]></category>
		<category><![CDATA[SQL Azure]]></category>
		<category><![CDATA[SQL Server 2005]]></category>
		<category><![CDATA[SQL Server 2008]]></category>
		<category><![CDATA[SQL Server 2008 R2]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[Lametric]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[Powershell]]></category>
		<category><![CDATA[Raspberry]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1742</guid>
		<description><![CDATA[First blog post of this new year 2021, and I will start with a fancy, How-To-Geek-style topic. In my last blog post, I discussed monitoring and how it should help address a degrading situation &#8230; <a href="https://blog.developpez.com/mikedavem/p13204/sql-server-2005/extending-sql-server-monitoring-with-raspberry-pi-and-lametric">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>First blog post of this new year 2021, and I will start with a fancy, How-To-Geek-style topic </p>
<p>In my <a href="https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana" rel="noopener" target="_blank">last blog post</a>, I discussed monitoring and how it should help address a degrading situation quickly. Alerts are probably the first thing to catch your attention and, in my case, they usually arrive as emails in a dedicated folder. That works well enough, at least as long as you are not absorbed too long in other daily tasks or projects. In the office, I know I would probably notice new alerts faster but, as I said previously, teleworking has definitely changed the game.  </p>
<p><span id="more-1742"></span></p>
<p>I wanted to find a way to address this concern, at least for the main critical SQL Server alerts, and I thought about relying on my existing home lab infrastructure. It is always a good opportunity to learn something and to improve my skills against a real-case scenario. </p>
<p>My home lab infrastructure includes a cluster of <a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/" rel="noopener" target="_blank">Raspberry PI 4</a> nodes. I initially use it to improve my skills on K8s or to study some IoT topics, and it is a good candidate for developing and deploying a small app that detects new incoming alerts in my mailbox and sends notifications to my Lametric device accordingly. </p>
<p><a href="https://lametric.com/" rel="noopener" target="_blank">Lametric</a> is a basically a connected clock but works also as a highly-visible display showing notifications from devices or apps via REST APIs. First time I saw such device in action was in a DevOps meetup in 2018 around Docker and Jenkins deployment with <a href="https://www.linkedin.com/in/duquesnoyeric/" rel="noopener" target="_blank">Eric Dusquenoy</a> and Tim Izzo (<a href="https://twitter.com/5ika_" rel="noopener" target="_blank">@5ika_</a>). In addition, one of my previous customers had also one in his office and we had some discussions about cool customization through Lametric apps. </p>
<p>Connecting through VPN to my company network is mandatory to work from home and unfortunately the Lametric device doesn’t support this scenario because communication is limited to the local network only. So, I need an app that runs on my local (home) network, connects to my mailbox, fetches new incoming emails and finally sends notifications to my Lametric device. </p>
<p>Here is my setup:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/01/171-0-lametric_infra.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/01/171-0-lametric_infra-1024x711.jpg" alt="171 - 0 - lametric_infra" width="584" height="405" class="alignnone size-large wp-image-1743" /></a></p>
<p>There are plenty of good blog posts on the internet about building a Raspberry cluster and I would suggest reading <a href="https://dbafromthecold.com/2020/11/30/building-a-raspberry-pi-cluster-to-run-azure-sql-edge-on-kubernetes/" rel="noopener" target="_blank">the one</a> by Andrew Pruski (<a href="https://twitter.com/dbafromthecold" rel="noopener" target="_blank">@dbafromthecold</a>). </p>
<p>As shown above, there are different paths for SQL alerts across our infrastructure (on-prem and Azure SQL databases), but all of them are sent to a dedicated DBA distribution list. </p>
<p>The app is a simple PowerShell script that relies on the Exchange Web Services (EWS) API to connect to the mailbox and fetch new mails. Sending notifications to my Lametric device is achieved by a simple REST API call with a well-formatted body. Details can be found in the <a href="https://lametric-documentation.readthedocs.io/en/latest/reference-docs/device-notifications.html" rel="noopener" target="_blank">Lametric documentation</a>. As a prerequisite, you need to create a notification app on the Lametric Developer site as follows:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/01/171-3-lametric-app-token.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/01/171-3-lametric-app-token-1024x364.jpg" alt="171 - 3 - lametric app token" width="584" height="208" class="alignnone size-large wp-image-1744" /></a></p>
<p>As said previously, I used PowerShell for this app. It makes it easier to find documentation and tutorials when it comes to Microsoft products, but if you are more comfortable with Python, the APIs are also available in a <a href="https://pypi.org/project/py-ews/" rel="noopener" target="_blank">dedicated package</a>. Let’s be clear that using PowerShell doesn’t necessarily mean using a Windows-based container: instead, I relied on a Linux-based image with PowerShell Core for the ARM architecture, provided by Microsoft on <a href="https://hub.docker.com/_/microsoft-powershell" rel="noopener" target="_blank">Docker Hub</a>. Finally, sensitive information like the Lametric token or the mailbox credentials is stored in K8s secrets for security reasons. My app project is available on my <a href="https://github.com/mikedavem/lametric" rel="noopener" target="_blank">GitHub</a>. Feel free to use it.</p>
<p>Here are some results:</p>
<p>&#8211; After deploying my pod:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/01/171-1-lametric-pod.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/01/171-1-lametric-pod.jpg" alt="171 - 1 - lametric pod" width="483" height="82" class="alignnone size-full wp-image-1745" /></a></p>
<p>&#8211; The app is running and checking for new incoming emails (kubectl logs command)</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2021/01/171-2-lametric-pod-logs.jpg"><img src="http://blog.developpez.com/mikedavem/files/2021/01/171-2-lametric-pod-logs.jpg" alt="171 - 2 - lametric pod logs" width="828" height="438" class="alignnone size-full wp-image-1747" /></a></p>
<p>When an email is detected, a <a href="https://youtu.be/EcdSFziNc3U" title="Notification" rel="noopener" target="_blank">notification</a> is sent to the Lametric device accordingly.</p>
<p>A geeky, fun, good (bad?) idea to start this new year 2021 <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":-)" class="wp-smiley" /></p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why we moved SQL Server monitoring on Prometheus and Grafana</title>
		<link>https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana</link>
		<comments>https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana#comments</comments>
		<pubDate>Tue, 22 Dec 2020 16:55:12 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[Continuous Delivery]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[grafana]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[RED]]></category>
		<category><![CDATA[sqlserver]]></category>
		<category><![CDATA[telegraf]]></category>
		<category><![CDATA[USE]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1722</guid>
		<description><![CDATA[During this year, I spent part of my job understanding the processes and concepts around monitoring in my company. The DevOps mindset mainly drove the idea of moving our SQL Server monitoring to the existing Prometheus and Grafana &#8230; <a href="https://blog.developpez.com/mikedavem/p13203/sql-server-2014/why-we-moved-sql-server-monitoring-on-prometheus-and-grafana">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>During this year, I spent part of my job understanding the processes and concepts around monitoring in my company. The DevOps mindset mainly drove the idea of moving our SQL Server monitoring to the existing Prometheus and Grafana infrastructure. Obviously, there were some technical decisions behind the scenes, but the most important part of this write-up is dedicated to explaining the other, and likely more important, reasons for this decision. </p>
<p><span id="more-1722"></span></p>
<p>Let’s be clear first: this write-up doesn’t constitute guidance or any kind of best practice for DBAs, it is only a sharing of my own experience on the topic. As usual, any comments are appreciated.</p>
<p>That said, let’s continue with the context. At the beginning of this year, I started a new DBA position in a customer-centric company where DevOps culture, microservices and CI/CD are omnipresent. What does it mean exactly? To cut a long story short, development and operations teams use a common approach for agile software development and delivery. Tools and processes are used to automate build, test and deployment and to monitor applications with speed, quality and control. In other words, we are talking about Continuous Delivery and, in my company, the release cycle is faster than in the traditional shops I have encountered so far, with several releases per day including database changes. Another interesting point is that we follow the &laquo;&nbsp;Operate what you build&nbsp;&raquo; principle: each team that develops a service is also responsible for operating and supporting it. It presents some advantages for both developers and operations, but pushing out changes requires getting feedback and observing the impact on the system on both sides. </p>
<p>In addition, within operations we try to act as one centralized team and each member should understand the global scope and the topics related to the infrastructure and its ecosystem. This is especially true when you&rsquo;re dealing with nightly on-calls. Each member has their own area of responsibility (their specialized domain) but, following DevOps principles, we encourage shared ownership to break down internal silos and optimize feedback and learning. It implies anyone should be able to temporarily take over any operational task to some extent, assuming the process is well documented and the learning has been done correctly. But the world is not perfect and this model has its downsides. For example, it prioritizes effectiveness across broader domains, which increases the cognitive load of each team member and lowers visibility on vertical topics where deeper expertise is sometimes required. Having an end-to-end observable system, including the infrastructure layer and the databases, may help reduce the time spent investigating and fixing issues before end users experience them. </p>
<p><strong>The initial scenario</strong></p>
<p>Let me give some background info and illustration of the initial scenario:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-0-initial-scenario.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-0-initial-scenario-1024x704.jpg" alt="170 - 0 - initial scenario" width="584" height="402" class="alignnone size-large wp-image-1725" /></a></p>
<p>… and my feeling of what could be improved:</p>
<p>1) From a DBA perspective, at first glance there are many potential issues. Indeed, a lot of automated or semi-manual deployment processes are out of our control and may have a direct impact on the stability of the database environment. Without better visibility, there is likely no easy way to address the famous question: Hey, we have been experiencing performance degradation for two days, has something happened on the database side?  </p>
<p>2) Silos between DBAs and DEVs are encouraged in this scenario. The direct consequence is to drastically limit the added value of the DBA role in a DevOps context. Obviously, primary concerns include production tasks like ensuring integrity, backups and maintenance of databases. But in a DevOps-oriented company with automated &laquo;&nbsp;database-as-code&nbsp;&raquo; pipelines, there remains a lot of unnecessary complexity and disruptive scripts the DBA should take care of. If this role is placed only at the end of the delivery pipeline, collaboration and continuous learning with developer teams will be kept to a minimum.  </p>
<p>3) There is a dedicated monitoring tool for the SQL Server infrastructure and this is a good point: it provides the necessary baselining and performance insights for DBAs. On the other hand, the tool in place targets only DBA profiles and its usage is limited to the infrastructure team. This doesn’t help improve scalability within the operations team and beyond. Another issue with the existing tooling is that correlation with external events, coming either from the continuous delivery pipeline or from configuration changes performed by operations teams on the SQL Server instances, can be difficult. In this case, establishing observability (the why) may be limited, and this is exactly what teams need to respond quickly and resolve emergencies in modern, distributed software.</p>
<p><strong>What is observability?</strong></p>
<p>You probably noticed the word &laquo;&nbsp;observability&nbsp;&raquo; in my previous sentence, so I think it deserves some explanation before continuing. Observability might seem like a buzzword but it is in fact not a new concept; it simply became prominent with DevOps software development lifecycle (SDLC) methodologies and distributed infrastructure systems. Referring to the <a href="https://en.wikipedia.org/wiki/Observability" rel="noopener" target="_blank">Wikipedia</a> definition, <strong>observability is the ability to infer internal states of a system based on the system’s external outputs</strong>. To be honest, that has not helped me very much, and further reading was necessary to shed light on what observability exactly is and how it differs from monitoring. </p>
<p>Let’s start instead with monitoring, which is the ability to translate infrastructure logs and metrics into meaningful, actionable insights. It helps you know when something goes wrong and start your response quickly. This is the purpose of a monitoring tool, and the existing one does a good job of it. In the DBA world, monitoring is often related to performance, but performance reporting is only useful if it accurately represents the internal state of the global system and not only your database environment. For example, in the past I visited customer shops where I was in charge of auditing their SQL Server infrastructure. Generally, customers were able to present their context, but they had no way to provide real facts or performance metrics for their application. In that case, you usually rely on a top-down approach and, if you’re either lucky or experienced enough, you manage to find what is going wrong. Sometimes I got relevant SQL Server metrics that would have highlighted a database performance issue, but we couldn’t make a clear correlation with those identified on the application side. In that case, relying only on database performance metrics was not enough to infer the internal state of the application. From my experience, many shops run applications that have been designed for success and not for failure: they often lack debuggability, and monitoring and telemetry are often missing. Collecting data is the basis of observability.</p>
<p>Observability provides not only the when of an error or issue, but more importantly the why. With modern software architectures including micro-services and the emphasis on DevOps, monitoring goals are no longer limited to collecting and processing log data, metrics and event traces. Instead, monitoring should be employed to improve observability by getting a better understanding of the properties of an application and its performance across distributed systems and the delivery pipeline. In the new context I&rsquo;m working in now, metric capture and analysis starts with the deployment of each micro-service, providing better observability by measuring all the work done across all dependencies.</p>
<p><strong>White-Box vs. Black-Box Monitoring </strong></p>
<p>In my company, as in many other companies, different approaches are used when it comes to monitoring: white-box and black-box monitoring.<br />
White-box monitoring focuses on exposing the internals of a system. For example, this approach is used by many SQL Server performance tools on the market that try to map the system with a bunch of internal statistical data about index or internal cache usage, existing wait stats, locks and so on …</p>
<p>In contrast, black-box monitoring is symptom-oriented and tests externally visible behavior as a user would see it. The goal is to monitor the system from the outside and spot ongoing problems. There are many ways to achieve black-box monitoring and the first obvious one is using probes, which will collect CPU or memory usage, network communications, HTTP health checks or latency and so on … Another option is to use a set of integration tests that run all the time to exercise the system from a behavior / business perspective.</p>
<p>White-box vs. black-box monitoring: which one is ultimately more important? Both are, and they can work together. In my company, both are used at different layers of the micro-service architecture, including software and infrastructure components. </p>
<p><strong>RED vs USE monitoring</strong></p>
<p>When you’re working in a web-oriented and customer-centric company, you are quickly introduced to the Four Golden Signals monitoring concept, which defines a series of metrics originally from <a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener" target="_blank">Google Site Reliability Engineering</a>: latency, traffic, errors and saturation. The RED method is a subset of the “Four Golden Signals”, focuses on micro-service architectures and includes the following metrics:</p>
<ul>
<li>Rate: number of requests our service is serving per second</li>
<li>Error: number of failed requests per second </li>
<li>Duration: amount of time it takes to process a request</li>
</ul>
<p>Those metrics are relatively straightforward to understand and may reduce the time needed to figure out which service is throwing errors, so that you can then look at its logs or restart it as required. </p>
<p>For HTTP metrics the RED method is a good fit, while the USE method is more suitable for the infrastructure side, where the main concern is keeping physical resources under control. The latter is based on three metrics:</p>
<ul>
<li>Utilization: mainly expressed as a percentage, it indicates whether a resource is underloaded or overloaded. </li>
<li>Saturation: work sitting in a queue, waiting to be processed</li>
<li>Errors: count of error events</li>
</ul>
<p>Those metrics are commonly used by DBAs to monitor performance. It is worth noting that the utilization metric can sometimes be misinterpreted, especially when its maximum value depends on the context and can go over 100%. </p>
<p><strong>SQL Server infrastructure monitoring expectations</strong></p>
<p>Referring to the starting scenario and all the concepts surfaced above, it was clear we needed to evolve our existing SQL Server monitoring architecture to improve our ability to reach the following goals:</p>
<ul>
<li>Keeping the analysis of long-term trends to answer usual questions like: how is my daily workload evolving? How big is my database? …</li>
<li>Alerting on something broken that we need to fix now, or on an issue that is building up and must be checked soon.</li>
<li>Building comprehensive dashboards – dashboards should answer basic questions about our SQL Server instances, and should include some form of advanced SQL telemetry and logging for deeper analysis.</li>
<li>Conducting ad-hoc retrospective analysis with easier correlation: for example, HTTP response latency increased in one service. What happened around that time? Is it related to a database issue, or to a blocking issue raised on the SQL Server instance? Is it related to a new query or a schema change deployed from the automated delivery pipeline? In other words, good observability should be part of the new solution.</li>
<li>Automating discovery and telemetry collection for every SQL Server instance installed in our environment, either on a VM or in a container.</li>
<li>Relying entirely on the common monitoring platform based on Prometheus and Grafana. Having the same tooling often makes communication easier between people (the human factor is also an important aspect of DevOps). </li>
</ul>
<p><strong>Prometheus, Grafana and Telegraf</strong></p>
<p>Prometheus and Grafana are the central monitoring solution for our micro-service architecture. Some others exist but we’ll focus on these tools in the context of this write-up.<br />
Prometheus is an open-source ecosystem for monitoring and alerting. It uses a multi-dimensional data model based on time-series data identified by metric name and key/value pairs. PromQL is the query language used by Prometheus to aggregate data in real time, and the data can be displayed directly or consumed via the HTTP API by external systems like Grafana. Unlike with the previous tooling, we appreciated being able to collect SQL Server metrics alongside those of the underlying infrastructure, such as VMware and others. It gives a comprehensive picture of the full path between the database services and the infrastructure components they rely on. </p>
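<p>As a quick illustration, and not as part of our production tooling, the following minimal PowerShell sketch pulls a metric from the Prometheus HTTP API. The server address and the metric name are placeholders and depend on how the Telegraf plugin is configured on your side:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Instant query against the Prometheus HTTP API (GET /api/v1/query)<br />
$prometheus = &quot;http://prometheus.local:9090&quot; # placeholder Prometheus server address<br />
$query = &quot;sqlserver_waitstats_wait_time_ms&quot; # placeholder metric name<br />
<br />
$result = Invoke-RestMethod -Method Get -Uri &quot;$prometheus/api/v1/query&quot; -Body @{ query = $query }<br />
<br />
# Each element carries the metric labels and the latest sampled value<br />
$result.data.result | ForEach-Object {<br />
&nbsp; &nbsp; &quot;{0} = {1}&quot; -f $_.metric.__name__, $_.value[1]<br />
}</div></div>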
<p>Grafana is open-source software used to display time-series analytics. It allows us to query and visualize our metrics and to generate alerts from them. It is also possible to integrate a variety of data sources in addition to Prometheus, increasing the correlation and aggregation capabilities across metrics from different sources. Finally, Grafana comes with a native annotation store and the ability to add annotation events directly from the graph panel or via the HTTP API. This feature is especially useful for storing annotations and tags related to external events, and we decided to use it for tracking software releases and SQL Server configuration changes. Having such events directly on the dashboard may reduce the troubleshooting effort by answering the why of an issue faster.  </p>
<p>For collecting data we use the <a href="https://github.com/influxdata/telegraf/tree/master/plugins/inputs/sqlserver" rel="noopener" target="_blank">Telegraf plugin</a> for SQL Server. The plugin exposes all configured metrics to be scraped by a Prometheus server. It can be used for both on-prem and Azure instances, including Azure SQL DB and Azure SQL MI. Automated deployment and configuration require little effort as well. </p>
<p>The high-level overview of the new implemented monitoring solution is as follows:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-3-monitoring-architecture.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-3-monitoring-architecture-1024x776.jpg" alt="170 - 3 - monitoring architecture" width="584" height="443" class="alignnone size-large wp-image-1729" /></a></p>
<p>SQL Server telemetry is achieved through Telegraf + Prometheus and includes both black-box and white-box oriented metrics. External events like automated deployments and server-level or database-level configuration changes are monitored through a centralized, scheduled framework based on PowerShell. Annotations and tags are then written to Grafana accordingly, and event details are recorded in logging tables for further troubleshooting.</p>
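<p>For illustration purposes only, posting such an annotation from PowerShell could look like the sketch below. The Grafana URL, the API token and the event text are placeholders; the call simply targets Grafana’s /api/annotations HTTP endpoint:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Minimal sketch: record an external event as a Grafana annotation<br />
$grafana = &quot;https://grafana.local&quot; # placeholder Grafana URL<br />
$apiToken = &quot;&lt;grafana-api-token&gt;&quot; # API key with editor rights<br />
<br />
# Annotation describing a (hypothetical) SQL Server configuration change<br />
$annotation = @{<br />
&nbsp; &nbsp; time = [DateTimeOffset]::UtcNow.ToUnixTimeMilliseconds() # epoch time in milliseconds<br />
&nbsp; &nbsp; tags = @(&quot;sqlserver&quot;, &quot;config-change&quot;)<br />
&nbsp; &nbsp; text = &quot;sp_configure change detected on instance SQL01&quot;<br />
} | ConvertTo-Json<br />
<br />
Invoke-RestMethod -Method Post -Uri &quot;$grafana/api/annotations&quot; -Headers @{ Authorization = &quot;Bearer $apiToken&quot; } -ContentType &quot;application/json&quot; -Body $annotation</div></div>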
<p><strong>Did the new monitoring meet our expectations?</strong></p>
<p>Well, having used the new monitoring solution during this year, I would say we are on a good track. We worked mainly on two dashboards: the first one exposes basic black-box metrics to show quickly if something is going wrong, while the second one is DBA-oriented, with plenty of internal counters to dig further and perform retrospective analysis.</p>
<p>Here is a sample of representative issues we faced this year and managed to fix with the new monitoring solution:</p>
<p>1) Resource pressure and black-box monitoring in action:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-4-grafana-1.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-4-grafana-1-1024x168.jpg" alt="170 - 4 - grafana 1" width="584" height="96" class="alignnone size-large wp-image-1730" /></a></p>
<p>For this scenario, the first dashboard highlighted resource pressure issues, but it is worth noting that even though the infrastructure was burning, users didn’t experience any side effects or slowness on the application side. After the corresponding alerts were raised on our side, we applied proactive, temporary fixes before users were impacted. I would say this scenario is something we would have been able to manage with the previous monitoring, and the good news is we didn’t notice any regression on this topic. </p>
<p>2) Better observability for better resolution of a complex issue</p>
<p>This scenario was more interesting because the first symptom started on the application side without any alert from the infrastructure layer. We started suffering from HTTP request slowness in November around 12:00 and developers got alerted by sporadic timeout issues from the logging system. After traversing the service graph, they spotted that something had gone wrong at the database service level by correlating the HTTP slowness with blocked processes on the SQL Server dashboard, as shown below. I put a simplified view of the dashboards here, but we had to cross several routes between the front-end services and the databases.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-6-grafana-3.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-6-grafana-3-1024x235.jpg" alt="170 - 6 - grafana 3" width="584" height="134" class="alignnone size-large wp-image-1732" /></a></p>
<p>Then I got a call from them and we started investigating blocking processes from the logging tables in place on the SQL Server side. At first glance, different queries showed longer execution times than usual, and neither release deployments nor configuration updates could explain such a sudden behavior change. The issue stuck around and at 15:42 it started appearing frequently enough to deserve a deeper look at the SQL Server internal metrics. We quickly found some interesting correlations with other metrics and finally managed to figure out why things went wrong, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-7-grafana-4.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-7-grafana-4-706x1024.jpg" alt="170 - 7 - grafana 4" width="584" height="847" class="alignnone size-large wp-image-1733" /></a></p>
<p>The root cause was related to transaction replication slowness within the Always On availability group databases, and we jumped directly to a storage issue according to the error log details on the secondary: </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-8-errorlog.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-8-errorlog-1024x206.jpg" alt="170 - 8 - errorlog" width="584" height="117" class="alignnone size-large wp-image-1734" /></a></p>
<p>End-to-end observability, achieved by including the database services in the new monitoring system, drastically reduced the time needed to find the root cause. We also learnt from this experience: to continuously improve observability, we added a black-box oriented metric related to availability group replication latency (see below) to detect any potential issue faster.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-9-avg-replication-metric.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-9-avg-replication-metric.jpg" alt="170 - 9 - avg replication metric" width="160" height="113" class="alignnone size-full wp-image-1736" /></a></p>
<p><strong>And what’s next? </strong></p>
<p>Having such monitoring in place is not the end of this story. As said at the beginning of this write-up, continuous delivery comes with its own DBA challenges, illustrated by the starting scenario. Traditionally the DBA role is siloed, turning requests or tickets into work, and DBAs can lack context about the broader business or technology used in the company. I have experienced several situations myself where you get alerted during the night because a developer’s query exceeds some usage threshold. Having discussed the point with many DBAs, they tend to be conservative about database changes (a normal reaction?), especially when you are at the end of the delivery process without a clear view of what exactly will be deployed. </p>
<p>Here is the new situation:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/12/170-2-new-scenario.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/12/170-2-new-scenario-1024x641.jpg" alt="170 - 2 - new scenario" width="584" height="366" class="alignnone size-large wp-image-1737" /></a></p>
<p>Implementing the new monitoring changed the way we observe the system (at least from a DBA perspective). Again, I believe the added value of the DBA role in a company with a strong DevOps mindset lies in being part of both production DBA and development DBA work. Making observability consistent across the whole delivery pipeline, including databases, is likely part of the success and may help the DBA get a broader picture of the system components. In my context, I’m now able to interact more with developer teams in the early phases and to provide them contextual feedback (rather than generic feedback) for improvements regarding SQL production telemetry. They also have access to the same telemetry and can check the impact of their developments by themselves. In the same way, the feedback and work with my team around database infrastructure topics appear more relevant. </p>
<p>In the end, it is a matter of collaboration. </p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Dealing with SQL Server on Linux on WSL2</title>
		<link>https://blog.developpez.com/mikedavem/p13196/sql-server-2012/dealing-with-sql-server-on-linux-on-wsl2</link>
		<comments>https://blog.developpez.com/mikedavem/p13196/sql-server-2012/dealing-with-sql-server-on-linux-on-wsl2#comments</comments>
		<pubDate>Mon, 27 Jul 2020 06:53:32 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[SQL Server 2012]]></category>
		<category><![CDATA[Init]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[sqlserver]]></category>
		<category><![CDATA[start-stop-daemon]]></category>
		<category><![CDATA[WSL]]></category>
		<category><![CDATA[WSL2]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1626</guid>
		<description><![CDATA[This is a blog post I intended to write some time ago &#8230; about using SQL Server on Linux on WSL2. For those who have already installed it on Windows 10, version 2004, you are already aware it doesn’t come with the support &#8230; <a href="https://blog.developpez.com/mikedavem/p13196/sql-server-2012/dealing-with-sql-server-on-linux-on-wsl2">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This is a blog post I intended to write some time ago &#8230; about using SQL Server on Linux on WSL2. For those who have already installed it on Windows 10, version 2004, you are already aware it doesn’t come with systemd support. Indeed, although it does exist in the file system, systemd is not running. If you intend to use SQL Server directly on WSL2, you need to read this carefully, because installation and management rely precisely on systemd!</p>
<p><span id="more-1626"></span></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-0-banner.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-0-banner.jpg" alt="163 - 0 - banner" width="508" height="187" class="alignnone size-full wp-image-1639" /></a></p>
<p>First, let&rsquo;s say that if you want to run SQL Server on WSL2 in a supported way, Docker containers are probably the better fit. Podman may also be an alternative, but it still seems to be experimental so far. The interesting point is that some people prefer to use Docker directly inside WSL2, whereas others prefer Docker Desktop for Windows with the WSL2-based engine integration. The option has been available for a few months, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-1-Docker-Desktop-WSL2.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-1-Docker-Desktop-WSL2.jpg" alt="163 - 1 - Docker Desktop WSL2" width="824" height="495" class="alignnone size-full wp-image-1627" /></a></p>
<p>But this is not the main topic of this blog post. Following interesting discussions with some of my colleagues, we wanted to know whether we could install SQL Server directly on WSL2, and the good news is we finally managed to do it, at the cost of some (dirty) tricks required to make SQL Server start and stop correctly.</p>
<p>Before continuing, let’s be clear we did it in a <strong>purely experimental and academic way</strong> and this is definitely <strong>not supported by Microsoft</strong>. But just before starting my holidays in August, it was a lot of fun and a good refresher on some Linux concepts … </p>
<p>I did my test with the Ubuntu-18.04 distro in WSL v2.</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-2-WSL2-Ubuntu18.04.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-2-WSL2-Ubuntu18.04.jpg" alt="163 - 2 - WSL2 Ubuntu18.04" width="387" height="86" class="alignnone size-full wp-image-1629" /></a></p>
<p>The first step consisted in installing SQL Server in the usual way as per the <a href="https://docs.microsoft.com/en-us/sql/linux/quickstart-install-connect-ubuntu?view=sql-server-ver15" rel="noopener" target="_blank">Microsoft BOL</a>. The installation went fine but, after running the setup, an error message quickly appeared in the last step related to service initialization:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-3-install-sql.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-3-install-sql.jpg" alt="163 - 3 - install sql" width="716" height="96" class="alignnone size-full wp-image-1630" /></a></p>
<p>In fact, if you think about it, this message is expected because systemd is not running on WSL2 and the configuration process attempts to initialize the mssql-server service through systemd. But let&rsquo;s dig further into the installation process &#8230; Referring to how a deb package is made, there are preinst, postinst, prerm and postrm scripts for the mssql-server deb package located in <strong>/var/lib/dpkg/info/</strong>:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ sudo ls -l /var/lib/dpkg/info/mssql-server*<br />
-rwxr-xr-x 1 root root &nbsp; 108 Jun 16 20:01 /var/lib/dpkg/info/mssql-server.config<br />
-rw-r--r-- 1 root root &nbsp;8579 Jul 26 00:31 /var/lib/dpkg/info/mssql-server.list<br />
-rw-r--r-- 1 root root 11543 Jun 16 20:01 /var/lib/dpkg/info/mssql-server.md5sums<br />
-rwxr-xr-x 1 root root &nbsp;1436 Jun 16 20:01 /var/lib/dpkg/info/mssql-server.postinst<br />
-rwxr-xr-x 1 root root &nbsp; 289 Jun 16 20:01 /var/lib/dpkg/info/mssql-server.postrm<br />
-rwxr-xr-x 1 root root &nbsp;1353 Jun 16 20:01 /var/lib/dpkg/info/mssql-server.preinst<br />
-rwxr-xr-x 1 root root &nbsp; 365 Jun 16 20:01 /var/lib/dpkg/info/mssql-server.prerm<br />
-rw-r--r-- 1 root root &nbsp; &nbsp;72 Jun 16 20:01 /var/lib/dpkg/info/mssql-server.shlibs<br />
-rw-r--r-- 1 root root &nbsp; 305 Jun 16 20:00 /var/lib/dpkg/info/mssql-server.templates<br />
-rw-r--r-- 1 root root &nbsp; &nbsp;74 Jun 16 20:01 /var/lib/dpkg/info/mssql-server.triggers</div></div>
<p>My suspicion was that those files contain references to systemctl commands and indeed, there are some in the preinst/prerm/postinst/postrm files:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-4-install-files-and-systemctl-dependencies-e1595827795971.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-4-install-files-and-systemctl-dependencies-e1595827795971.jpg" alt="163 - 4 - install files and systemctl dependencies" width="800" height="170" class="alignnone size-full wp-image-1631" /></a></p>
<p>Well, at least we found a reasonable explanation for the previous error message. At this stage, I would say you can use your freshly installed SQL Server instance, but you have to start / stop it manually because there are no systemctl commands to handle it. But thinking about it … we are still using the famous init daemon (PID 1) on WSL2 and it works for regular services. </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-5-top-tree-init.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-5-top-tree-init.jpg" alt="163 - 5 - top tree init" width="812" height="142" class="alignnone size-full wp-image-1632" /></a></p>
<p>A good alternative could be to rely on init scripts and start-stop-daemon, wrapped into an LSB-compliant init script with:</p>
<p>&#8211;	at least the start, stop, restart, force-reload and status actions<br />
&#8211;	proper exit codes<br />
&#8211;	documented run-time dependencies</p>
<p>LSB provides a default set of functions in <strong>/lib/lsb/init-functions</strong> and we can make use of them in our init script. The script file is located in <strong>/etc/init.d</strong> and named mssql-server.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ sudo ls -l /etc/init.d/mssql*<br />
-rwxr-xr-x 1 root root 1606 Jul 26 22:13 /etc/init.d/mssql-server</div></div>
<p>Here is the content of my mssql-server script file:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">#! /bin/sh -e<br />
#<br />
### BEGIN INIT INFO<br />
# Provides: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;sqlserver<br />
# Required-Start: &nbsp; &nbsp;$all<br />
# Required-Stop:<br />
# Default-Start: &nbsp; &nbsp; 2 3 4 5<br />
# Default-Stop: &nbsp; &nbsp; &nbsp;0 1 6<br />
# Short-Description: Manages SQL Server instance on Linux<br />
### END INIT INFO<br />
<br />
DAEMON=&quot;/opt/mssql/bin/sqlservr&quot;<br />
DAEMON_OPTS=&quot;&quot;<br />
DAEMONUSER=&quot;mssql&quot;<br />
daemon_NAME=&quot;sqlservr&quot;<br />
<br />
export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:/var/opt/mssql<br />
<br />
# Check sqlserver is present<br />
if [ ! -x $DAEMON ]; then<br />
&nbsp; &nbsp; &nbsp; &nbsp; log_failure_msg &quot;$DAEMON not present or not executable&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; exit 1<br />
fi<br />
<br />
# Load init functions<br />
. /lib/lsb/init-functions<br />
<br />
<br />
d_start () {<br />
&nbsp; &nbsp; &nbsp; &nbsp; log_daemon_msg &quot;Starting system $daemon_NAME Daemon&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; start-stop-daemon --start --background --oknodo --quiet --name $daemon_NAME --chuid $DAEMONUSER --umask 007 --exec $DAEMON -- $DAEMON_OPTS<br />
&nbsp; &nbsp; &nbsp; &nbsp; log_end_msg $?<br />
}<br />
<br />
d_stop () {<br />
&nbsp; &nbsp; &nbsp; &nbsp; log_daemon_msg &quot;Stopping system $daemon_NAME Daemon&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; start-stop-daemon --stop --oknodo --quiet --retry 5 --name $daemon_NAME<br />
&nbsp; &nbsp; &nbsp; &nbsp; log_end_msg $?<br />
}<br />
<br />
case &quot;$1&quot; in<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; start|stop)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; d_${1}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ;;<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; restart|reload|force-reload)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; d_stop<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; d_start<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ;;<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; force-stop)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;d_stop<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ;;<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; status)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; status_of_proc &quot;$daemon_NAME&quot; &quot;$DAEMON&quot; &quot;system-wide $daemon_NAME&quot; &amp;&amp; exit 0 || exit $?<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ;;<br />
&nbsp; &nbsp; &nbsp; &nbsp; *)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; echo &quot;Usage: /etc/init.d/$daemon_NAME {start|stop|force-stop|restart|reload|force-reload|status}&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; exit 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ;;<br />
esac<br />
exit 0</div></div>
<p>Thus, I was now able to manage the status / start / stop / restart operations:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ sudo service mssql-server status<br />
&nbsp;/opt/mssql/bin/sqlservr is not running<br />
$ sudo service mssql-server start<br />
&nbsp;Starting system sqlservr Daemon &nbsp; &nbsp; &nbsp; &nbsp; [ OK ]</div></div>
<p>The SQL Server engine started successfully, and I could double-check with the top command &#8230;</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-6-top-sqlservr-e1595828331414.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-6-top-sqlservr-e1595828331414.jpg" alt="163 - 6 - top sqlservr" width="800" height="303" class="alignnone size-full wp-image-1634" /></a></p>
<p>&#8230; and with a quick connection to my SQL Server instance:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-7-mssql-cli-e1595828387503.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-7-mssql-cli-e1595828387503.jpg" alt="163 - 7 - mssql-cli" width="800" height="257" class="alignnone size-full wp-image-1635" /></a></p>
<p>Like starting my SQL Server instance, stopping it can be performed with the service command:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ sudo service mssql-server stop<br />
&nbsp;Stopping system sqlservr Daemon &nbsp; &nbsp; &nbsp; &nbsp; [ OK ]<br />
<br />
$ sudo service mssql-server status<br />
/opt/mssql/bin/sqlservr is not running</div></div>
<p>Generally speaking, there is no concept of runlevels with WSL, so starting the mssql-server service automatically at WSL2 startup can be achieved in different ways, with entries in the .bashrc file or tricks from <a href="https://github.com/shayne/wsl2-hacks/blob/master/README.md" rel="noopener" target="_blank">GitHub projects</a> (not tested on my side). </p>
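<p>Another option, which I haven’t tested either, would be to trigger the init service from the Windows side, for example from a logon scheduled task or a profile script. A minimal PowerShell sketch, assuming the distro is named Ubuntu-18.04, could look like this:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># Hypothetical Windows-side helper: start the mssql-server init service inside the distro<br />
$distro = &quot;Ubuntu-18.04&quot;<br />
wsl.exe -d $distro -u root -- service mssql-server start<br />
<br />
# Check the result from Windows<br />
wsl.exe -d $distro -u root -- service mssql-server status</div></div>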
<p>Finally, let’s talk about removing SQL Server from WSL2. As a reminder, during the installation / setup we faced an error message related to systemd. We also identified some dependencies on systemd in some dpkg files, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/07/163-4-install-files-and-systemctl-dependencies-e1595827795971.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/07/163-4-install-files-and-systemctl-dependencies-e1595827795971.jpg" alt="163 - 4 - install files and systemctl dependencies" width="800" height="170" class="alignnone size-full wp-image-1631" /></a></p>
<p>Similarly to the installation step, you will experience the same kind of issue with the prerm / postrm files when removing SQL Server. In my case, I had to comment out the concerned lines in those files to uninstall my SQL Server instance successfully. </p>
<p>Hope this blog post helps! </p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Universal usage of NVARCHAR type and performance impact</title>
		<link>https://blog.developpez.com/mikedavem/p13195/sql-server-vnext/universal-usage-of-nvarchar-type-and-performance-impact</link>
		<comments>https://blog.developpez.com/mikedavem/p13195/sql-server-vnext/universal-usage-of-nvarchar-type-and-performance-impact#comments</comments>
		<pubDate>Wed, 27 May 2020 17:06:24 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[convert_implicit]]></category>
		<category><![CDATA[nvarchar]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[Query Store]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1604</guid>
		<description><![CDATA[A couple of weeks ago, I read an article from Brent Ozar about using NVARCHAR as a universal parameter type. It was a good reminder and, from my experience, I can confirm this habit has never been a good idea. Although it depends &#8230; <a href="https://blog.developpez.com/mikedavem/p13195/sql-server-vnext/universal-usage-of-nvarchar-type-and-performance-impact">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A couple of weeks ago, I read an <a href="https://www.brentozar.com/archive/2020/04/can-you-use-nvarchar-as-a-universal-parameter-almost/" rel="noopener" target="_blank">article</a> from Brent Ozar about using NVARCHAR as a universal parameter type. It was a good reminder and, from my experience, I can confirm this habit has never been a good idea. Although it depends on the context, chances are you will find, at most, the exception that proves the rule. </p>
<p><span id="more-1604"></span></p>
<p>A couple of days ago, I fell into a situation that illustrated this issue perfectly and, in this blog post, I decided to share my experience and demonstrate what the impact can be in a real production scenario.<br />
So, let’s start with the culprit. I voluntarily masked some contextual information but the principle is here. The query is pretty simple:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">DECLARE @P0 DATETIME <br />
DECLARE @P1 INT<br />
DECLARE @P2 NVARCHAR(4000) <br />
DECLARE @P3 DATETIME <br />
DECLARE @P4 NVARCHAR(4000)<br />
<br />
UPDATE TABLE SET DATE = @P0<br />
WHERE ID = @P1<br />
&nbsp;AND IDENTIFIER = @P2<br />
&nbsp;AND P_DATE &gt;= @P3<br />
&nbsp;AND W_O_ID = (<br />
&nbsp; &nbsp;SELECT TOP 1 ID FROM TABLE2<br />
&nbsp; &nbsp;WHERE Identifier = @P4<br />
&nbsp; &nbsp;ORDER BY ID DESC)</div></div>
<p>And the corresponding execution plan: </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-1-excution_plan_with_implicit_conversion-e1590596511773.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-1-excution_plan_with_implicit_conversion-e1590596511773.jpg" alt="162 - 1 - excution_plan_with_implicit_conversion" width="1000" height="380" class="alignnone size-full wp-image-1605" /></a></p>
<p>The most interesting part concerns the TABLE2 table. As you may notice, the @P4 input parameter type is NVARCHAR and we get a CONVERT_IMPLICIT in the concerned Predicate section above. The CONVERT_IMPLICIT function is required because of <a href="https://docs.microsoft.com/en-us/sql/t-sql/data-types/data-type-precedence-transact-sql?view=sql-server-ver15" rel="noopener" target="_blank">data type precedence</a>. It results in a costly operator that scans all the data from TABLE2. As you probably know, CONVERT_IMPLICIT prevents the condition from being sargable, whereas a seek is normally what we could expect here given the distribution of values in the statistics histogram and the underlying index on the Identifier column.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">EXEC sp_helpindex 'TABLE2';</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-8-index-config.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-8-index-config.jpg" alt="162 - 8 - index config" width="1035" height="135" class="alignnone size-full wp-image-1606" /></a></p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">DBCC SHOW_STATISTICS ('TABLE2', 'IX___IDENTIFIER')<br />
WITH HISTOGRAM;</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-10-histogram-stats.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-10-histogram-stats.jpg" alt="162 - 10 - histogram stats" width="877" height="435" class="alignnone size-full wp-image-1620" /></a></p>
<p>Another important point to keep in mind is that scanning all the data from the TABLE2 table comes at a certain cost (&gt; 1GB) even if the data resides in memory.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">EXEC sp_spaceused 'TABLE2'</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-9-index-space-used.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-9-index-space-used.jpg" alt="162 - 9 - index space used" width="725" height="62" class="alignnone size-full wp-image-1607" /></a></p>
<p>The execution plan warning confirms the potential overhead of retrieving a few rows from the TABLE2 table:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-2-excution_plan_with_implicit_conversion-arning.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-2-excution_plan_with_implicit_conversion-arning.jpg" alt="162 - 2 - excution_plan_with_implicit_conversion arning" width="1160" height="108" class="alignnone size-full wp-image-1608" /></a></p>
<p>To set the context a little more, the concerned application queries are mainly based on JDBC prepared statements, which implies using NVARCHAR(4000) for string parameters regardless of the column type in the database (VARCHAR / NVARCHAR). This is at least what we noticed during my investigation. </p>
<p>So, what? Well, in our DEV environment the impact was imperceptible. We had interesting discussions with the DEV team on this topic, and we basically need to improve awareness and visibility in this area (another discussion and probably another blog post) … </p>
<p>But chances are your PROD environment will tell you a different story when it comes to a bigger workload and concurrent query executions. In my context, from an infrastructure standpoint, the symptom was an abnormal increase in CPU consumption a couple of days ago. Usually, CPU consumption was roughly 20% to 30%; in fact, the issue had been around for a longer period, but we didn’t catch it because the CPU footprint on this server remained &laquo;&nbsp;normal&nbsp;&raquo;. </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-3-SQL-Processor-dashboard.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-3-SQL-Processor-dashboard.jpg" alt="162 - 3 - SQL Processor dashboard" width="668" height="631" class="alignnone size-full wp-image-1612" /></a></p>
<p>So, what happened here? We&rsquo;re using SQL Server 2017 with Query Store enabled on the concerned database. This feature came to the rescue and brought attention to the first clue: a query plan regression that led to increased IO consumption in the second case (and implicitly additional CPU resource consumption as well).</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-4-QS-regression-plan-e1590597560224.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-4-QS-regression-plan-e1590597560224.jpg" alt="162 - 4 - QS regression plan" width="1000" height="575" class="alignnone size-full wp-image-1613" /></a></p>
<p>You have probably noticed that both execution plans use an index scan on the right, but the more expensive one (at the bottom) uses a different index strategy. Instead of the primary key and clustered index (PK_xxx), a non-clustered index on the Identifier column (IX_xxx_Identifier) is used in the second execution plan, with the same CONVERT_IMPLICIT issue. </p>
<p>According to the Query Store statistics, the number of executions per business day is roughly 25000, with ~ 8.5 hours of CPU time consumed over the period (18.05.2020 – 26.05.2020). That is a very different order of magnitude compared to what we may have in the DEV environment <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /></p>
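<p>For reference, this kind of figure can be pulled directly from the Query Store catalog views. Here is a minimal sketch (the query_id value is hypothetical and avg_cpu_time is expressed in microseconds):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">SELECT q.query_id,<br />
&nbsp; &nbsp; &nbsp; &nbsp;SUM(rs.count_executions) AS nb_executions,<br />
&nbsp; &nbsp; &nbsp; &nbsp;SUM(rs.count_executions * rs.avg_cpu_time) / 1000000. AS total_cpu_time_s<br />
FROM sys.query_store_query AS q<br />
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id<br />
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id<br />
JOIN sys.query_store_runtime_stats_interval AS i ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id<br />
WHERE q.query_id = 42 -- hypothetical query_id<br />
&nbsp; AND i.start_time &gt;= '20200518' AND i.start_time &lt; '20200527'<br />
GROUP BY q.query_id;</div></div>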
<p>At this stage, I would say that investigating why the plan regression occurred doesn’t really matter, because in both cases the most expensive operator is an index scan and, again, we expect an index seek. Getting rid of the implicit conversion by using the VARCHAR type to make the conditional clause sargable was a better option for us. Thus, the execution plan would be:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-7-Execution-plan-with-seek-e1590597830429.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-7-Execution-plan-with-seek-e1590597830429.jpg" alt="162 - 7 - Execution plan with seek" width="1000" height="158" class="alignnone size-full wp-image-1615" /></a></p>
<p>The first workaround that came to mind was to force the better plan in the Query Store (automatic tuning with FORCE_LAST_GOOD_PLAN = ON is disabled), but after discussing this point with the DEV team, we managed to deploy a fix very quickly to address the issue and to reduce drastically the CPU consumption on this SQL Server instance, as shown below. The picture is self-explanatory (a sketch of the plan-forcing workaround follows the chart): </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-6-SQL-Processor-dashboard-after-optimization-e1590597880869.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-6-SQL-Processor-dashboard-after-optimization-e1590597880869.jpg" alt="162 - 6 - SQL Processor dashboard after optimization" width="1000" height="460" class="alignnone size-full wp-image-1616" /></a></p>
<p>The fix consisted in adding a CAST / CONVERT function to the right side of the equality (the parameter, not the column) to avoid side effects on the JDBC driver side. Therefore, we got another version of the query and a different query hash as well. The updated query is pretty similar to the following one:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">DECLARE @P0 DATETIME <br />
DECLARE @P1 INT<br />
DECLARE @P2 NVARCHAR(4000) <br />
DECLARE @P3 DATETIME <br />
DECLARE @P4 NVARCHAR(4000)<br />
<br />
UPDATE TABLE SET DATE = @P0<br />
WHERE ID = @P1<br />
&nbsp;AND IDENTIFIER = CAST(@P2 AS varchar(50))<br />
&nbsp;AND P_DATE &gt;= @P3<br />
&nbsp;AND W_O_ID = (<br />
&nbsp; &nbsp;SELECT TOP 1 ID FROM TABLE2<br />
&nbsp; &nbsp;WHERE Identifier = CAST(@P4 AS varchar(50))<br />
&nbsp; &nbsp;ORDER BY ID DESC)</div></div>
<p>Some time later, we gathered the Query Store statistics for both the former and the new query to confirm the performance improvement, as shown below:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/05/162-5-QS-stats-after-optimization.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/05/162-5-QS-stats-after-optimization.jpg" alt="162 - 5 - QS stats after optimization" width="923" height="98" class="alignnone size-full wp-image-1617" /></a></p>
<p>Finally, changing the parameter data type enabled the use of an index seek operator, which drastically reduced the SQL Server CPU consumption and logical read operations. </p>
<p>QED!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>dbachecks and AlwaysOn availability group checks</title>
		<link>https://blog.developpez.com/mikedavem/p13194/sql-server-2012/dbachecks-and-alwayson-availability-group-checks</link>
		<comments>https://blog.developpez.com/mikedavem/p13194/sql-server-2012/dbachecks-and-alwayson-availability-group-checks#comments</comments>
		<pubDate>Mon, 20 Apr 2020 19:57:31 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[SQL Server 2012]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[automation]]></category>
		<category><![CDATA[dbachecks]]></category>
		<category><![CDATA[dbatools]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[Powershell]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1591</guid>
		<description><![CDATA[When I started my DBA position in my new company, I was looking for a tool that was able to check periodically the SQL Server database environments for several reasons. First, as DBA one of my main concern is about &#8230; <a href="https://blog.developpez.com/mikedavem/p13194/sql-server-2012/dbachecks-and-alwayson-availability-group-checks">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>When I started my DBA position in my new company, I was looking for a tool able to check the SQL Server database environments periodically, for several reasons. First, as a DBA, one of my main concerns is maintaining and keeping the different mssql environments well-configured against an initial standard. It is also worth noting that I’m not the only person to interact with the databases: anyone in my team, who is also a member of the sysadmin server role, can change any server-level configuration setting at any moment. In this case, chances are that environments will drift from our initial standard over time, and my team and I need to stay confident by checking the current mssql environment configurations periodically, being alerted if configuration drifts exist and, obviously, fixing them as fast as possible.  </p>
<p><span id="more-1591"></span></p>
<p>A while ago, I relied on the SQL Server Policy Based Management feature (PBM) to carry out this task at one of my former customers, and I have to say it did the job, but with some limitations. Indeed, PBM is an instance-scoped feature and doesn’t allow checking configuration settings outside the SQL Server instance, for example. During my investigation, the <a href="https://dbachecks.readthedocs.io/en/latest/" rel="noopener" target="_blank">dbachecks</a> framework drew my attention for several reasons:</p>
<p>&#8211;	It allows checking different settings at different scopes, including operating system and SQL Server instance items<br />
&#8211;	It is an open source project and keeps evolving thanks to SQL / PowerShell community contributions.<br />
&#8211;	It is extensible, and we may add custom checks to the list of predefined checks shipped with the targeted version.<br />
&#8211;	It is based on PowerShell and the Pester framework and fits well with the existing automation and GitOps processes in my company</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/161-0-dbachecks-process.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/161-0-dbachecks-process.jpg" alt="161 - 0 - dbachecks process" width="1003" height="395" class="alignnone size-full wp-image-1592" /></a></p>
<p>The first dbachecks version we deployed in production a couple of months ago was 1.2.24 and unfortunately it didn’t include reliable tests for availability groups. It was the starting point of my first contributions to open source projects, and I felt proud and honored when I saw my 2 PRs accepted for dbachecks, including the Disk Allocation Unit and Availability Group checks:</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/161-1-release-note.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/161-1-release-note.jpg" alt="161 - 1 - release note" width="798" height="348" class="alignnone size-full wp-image-1593" /></a></p>
<p>Obviously, this is just a humble contribution and, to be clear, I didn’t write the existing tests for AGs; I spent some time applying fixes for a better detection of all AG environments, including their replicas, in both simple and complex topologies (several replicas on the same server and non-default ports, for example). </p>
<p>So, here is the current list of AG checks in version 1.2.29 at the time of this write-up (a quick T-SQL spot-check of some of them is sketched right after the list):</p>
<p>&#8211;	Cluster nodes should be up<br />
&#8211;	The AG resource and its IP address in the cluster should be online<br />
&#8211;	The cluster private and public networks should be up<br />
&#8211;	HADR should be enabled on each AG replica<br />
&#8211;	The AG listener and AG replicas should be pingable and reachable by client connections<br />
&#8211;	Each AG replica should be in the correct domain<br />
&#8211;	The AG replica port number should match the port specified in your standard<br />
&#8211;	The AG availability mode should not be in an unknown state and should be synchronized or synchronizing depending on the replication type<br />
&#8211;	Each highly available database (member of an AG) should be in a synchronized / synchronizing state, ready for failover, joined to the AG and not suspended<br />
&#8211;	Each AG replica should have an extended event session called AlwaysOn_health which is running and configured to start automatically</p>
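<p>If you want to cross-check a few of these conditions directly with T-SQL, outside of dbachecks, a minimal sketch against the AlwaysOn DMVs would look like this (to be run on each replica):</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Replica and database synchronization state per availability group<br />
SELECT ag.name AS ag_name,<br />
&nbsp; &nbsp; &nbsp; &nbsp;ar.replica_server_name,<br />
&nbsp; &nbsp; &nbsp; &nbsp;ars.role_desc,<br />
&nbsp; &nbsp; &nbsp; &nbsp;ars.synchronization_health_desc,<br />
&nbsp; &nbsp; &nbsp; &nbsp;drs.database_id,<br />
&nbsp; &nbsp; &nbsp; &nbsp;drs.synchronization_state_desc,<br />
&nbsp; &nbsp; &nbsp; &nbsp;drs.is_suspended<br />
FROM sys.availability_groups AS ag<br />
JOIN sys.availability_replicas AS ar ON ar.group_id = ag.group_id<br />
JOIN sys.dm_hadr_availability_replica_states AS ars ON ars.replica_id = ar.replica_id<br />
LEFT JOIN sys.dm_hadr_database_replica_states AS drs ON drs.replica_id = ar.replica_id<br />
ORDER BY ag.name, ar.replica_server_name;<br />
<br />
-- The AlwaysOn_health extended event session should exist, be running and start automatically<br />
SELECT s.name,<br />
&nbsp; &nbsp; &nbsp; &nbsp;CASE WHEN rs.name IS NULL THEN 'STOPPED' ELSE 'RUNNING' END AS state,<br />
&nbsp; &nbsp; &nbsp; &nbsp;s.startup_state<br />
FROM sys.server_event_sessions AS s<br />
LEFT JOIN sys.dm_xe_sessions AS rs ON rs.name = s.name<br />
WHERE s.name = 'AlwaysOn_health';</div></div>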
<p>Mandatory parameters are <strong>app.cluster</strong> and <strong>domain.name</strong>.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Get-DbcCheck -Tag HADR | ft Group, Type, AllTags, Config -AutoSize<br />
<br />
Group Type &nbsp; &nbsp; &nbsp; &nbsp;AllTags &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Config<br />
----- ---- &nbsp; &nbsp; &nbsp; &nbsp;------- &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ------<br />
HADR &nbsp;ClusterNode ClusterHealth, HADR app.sqlinstance app.cluster skip.hadr.listener.pingcheck domain.name policy...</div></div>
<p>The starting point of the HADR checks is the Windows Failover Cluster component, and other tests are then performed hierarchically on each sub-component, including the availability group, AG replicas and AG databases. </p>
<p>Then you may change the behavior of the HADR check process according to your context by using the following parameters:</p>
<p>&#8211;	skip.hadr.listener.pingcheck =&gt; Skip the ping check of the HADR listener<br />
&#8211;	skip.hadr.listener.tcpport   =&gt; Skip the standard TCP port check for AG listeners<br />
&#8211;	skip.hadr.replica.tcpport    =&gt; Skip the standard TCP port check for AG replicas</p>
<p>For instance, in my context, I configured the <strong>skip.hadr.replica.tcpport</strong> parameter to skip the checks on replica ports, because we have environments that include several replicas on the same server, listening on non-default ports.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Get-DbcConfig skip.hadr.*<br />
Name &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Value Description<br />
---- &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ----- -----------<br />
skip.hadr.listener.pingcheck False Skip the HADR listener ping test (especially useful for Azure and AWS)<br />
skip.hadr.listener.tcpport &nbsp; False Skip the HADR AG Listener TCP port number (If port number is not standard acro...<br />
skip.hadr.replica.tcpport &nbsp; &nbsp; True Skip the HADR Replica TCP port number (If port number is not standard across t...</div></div>
<p>The HADR checks can simply be run by using the HADR tag as follows:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Invoke-DbcCheck -Tag HADR &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br />
Pester v4.10.1 &nbsp;Executing all tests in 'C:\Program Files\WindowsPowerShell\Modules\dbachecks\1.2.29\checks\HADR.Tests.ps1' with Tags HADR &nbsp; <br />
...</div></div>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/04/161-2-hadr-checks-e1587413256656.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/04/161-2-hadr-checks-e1587413256656.jpg" alt="161 - 2 - hadr checks" width="1000" height="426" class="alignnone size-full wp-image-1599" /></a>                               </p>
<p>Well, this is a good start but I think most of the checks are state-oriented and some configuration checks are still missing. I’m already willing to add some of them in the near future, and feel free to add your own contribution as well <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /> </p>
<p>Stay tuned! </p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Availability Group 2017 Direct seeding and updated default path policy</title>
		<link>https://blog.developpez.com/mikedavem/p13190/sql-server-2012/availability-group-2017-direct-seeding-and-updated-default-path-policy</link>
		<comments>https://blog.developpez.com/mikedavem/p13190/sql-server-2012/availability-group-2017-direct-seeding-and-updated-default-path-policy#comments</comments>
		<pubDate>Fri, 13 Mar 2020 18:10:10 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[SQL Server 2012]]></category>
		<category><![CDATA[SQL Server 2017]]></category>
		<category><![CDATA[SQL Server 2019]]></category>
		<category><![CDATA[AlwaysOn;groupes de disponibilité;availability groups]]></category>
		<category><![CDATA[direct seeding]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1542</guid>
		<description><![CDATA[A couple of days ago, I ran into an issue when adding a new database in direct seeding mode that led me to reconsider refreshing my skills on this feature. Going through the AG database wizard for adding database, I &#8230; <a href="https://blog.developpez.com/mikedavem/p13190/sql-server-2012/availability-group-2017-direct-seeding-and-updated-default-path-policy">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A couple of days ago, I ran into an issue when adding a new database in direct seeding mode, which led me to refresh my skills on this feature. Going through the AG wizard for adding a database, I faced the following error message …</p>
<p><span id="more-1542"></span></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/158-1-AG-wizard-failed.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/158-1-AG-wizard-failed.jpg" alt="158 - 1 - AG wizard failed" width="774" height="497" class="alignnone size-full wp-image-1543" /></a></p>
<p>… and I was surprised by the required directories value (L:\SQL\Data) because the correct topology should be:</p>
<p>&#8211;	D:\SQL\Data (SQL data files)<br />
&#8211;	L:\SQL\Logs (SQL Log files)</p>
<p>SQL Server 2016 required a symmetric storage layout for both AG replicas, but SQL Server 2017 and above seem to tell another story, as specified in the <a href="https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/automatic-seeding-secondary-replicas?view=sql-server-ver15" rel="noopener" target="_blank">BOL</a>. In my context, I captured the check script executed by the wizard and it became obvious that the direct seeding feature checks whether folders based on the default path values exist on each replica.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">declare @SmoAuditLevel int<br />
&nbsp; &nbsp; &nbsp; &nbsp; exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'AuditLevel', @SmoAuditLevel OUTPUT<br />
&nbsp; &nbsp; &nbsp; <br />
<br />
&nbsp;<br />
&nbsp; &nbsp; &nbsp; &nbsp; declare @NumErrorLogs int<br />
&nbsp; &nbsp; &nbsp; &nbsp; exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'NumErrorLogs', @NumErrorLogs OUTPUT<br />
&nbsp; &nbsp; &nbsp; <br />
<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; declare @SmoLoginMode int<br />
&nbsp; &nbsp; &nbsp; &nbsp; exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'LoginMode', @SmoLoginMode OUTPUT<br />
&nbsp; &nbsp; &nbsp; <br />
<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; declare @ErrorLogSizeKb int<br />
&nbsp; &nbsp; &nbsp; &nbsp; exec master.dbo.xp_instance_regread &nbsp;N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'ErrorLogSizeInKb', @ErrorLogSizeKb OUTPUT<br />
&nbsp; &nbsp; &nbsp; <br />
<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; declare @SmoMailProfile nvarchar(512)<br />
&nbsp; &nbsp; &nbsp; &nbsp; exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'MailAccountName', @SmoMailProfile OUTPUT<br />
&nbsp; &nbsp; &nbsp; <br />
<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; declare @BackupDirectory nvarchar(512)<br />
&nbsp; &nbsp; &nbsp; &nbsp; if 1=isnull(cast(SERVERPROPERTY('IsLocalDB') as bit), 0)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; select @BackupDirectory=cast(SERVERPROPERTY('instancedefaultdatapath') as nvarchar(512))<br />
&nbsp; &nbsp; &nbsp; &nbsp; else<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'BackupDirectory', @BackupDirectory OUTPUT<br />
&nbsp; &nbsp; &nbsp; <br />
<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; declare @SmoPerfMonMode int<br />
&nbsp; &nbsp; &nbsp; &nbsp; exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'Performance', @SmoPerfMonMode OUTPUT<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; if @SmoPerfMonMode is null<br />
&nbsp; &nbsp; &nbsp; &nbsp; begin <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; set @SmoPerfMonMode = 1000<br />
&nbsp; &nbsp; &nbsp; &nbsp; end<br />
&nbsp; &nbsp; &nbsp; <br />
<br />
SELECT<br />
@SmoAuditLevel AS [AuditLevel],<br />
ISNULL(@NumErrorLogs, -1) AS [NumberOfLogFiles],<br />
(case when @SmoLoginMode &lt; 3 then @SmoLoginMode else 9 end) AS [LoginMode],<br />
ISNULL(SERVERPROPERTY('instancedefaultdatapath'),'') AS [DefaultFile],<br />
SERVERPROPERTY('instancedefaultlogpath') AS [DefaultLog],<br />
ISNULL(@ErrorLogSizeKb, 0) AS [ErrorLogSizeKb],<br />
-1 AS [TapeLoadWaitTime],<br />
ISNULL(@SmoMailProfile,N'') AS [MailProfile],<br />
@BackupDirectory AS [BackupDirectory],<br />
@SmoPerfMonMode AS [PerfMonMode]</div></div>
<p><strong>Primary replica</strong></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/158-3-AG-config-node-1-e1584122296250.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/158-3-AG-config-node-1-e1584122296250.jpg" alt="158 - 3 - AG config node 1" width="1000" height="69" class="alignnone size-full wp-image-1544" /></a></p>
<p><strong>Secondary replica</strong></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/03/158-2-AG-config-node-2.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/03/158-2-AG-config-node-2.jpg" alt="158 - 2 - AG config node 2" width="964" height="82" class="alignnone size-full wp-image-1545" /></a></p>
<p>Even if direct seeding allows an asymmetric storage layout, a mistake was introduced in my context and both replicas should have been aligned. It is therefore all the more important to note that using the direct seeding capabilities from PowerShell cmdlets like <a href="https://docs.dbatools.io/#Restore-DbaDatabase" rel="noopener" target="_blank">Add-DbaAgDatabase</a> doesn&rsquo;t generate any errors, and that fixing the default path values for data and log requires restarting the SQL Server instance (see the sketch below).</p>
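<p>For reference, here is a minimal sketch of how the default paths can be checked and re-aligned with the expected topology. The registry values below are the usual instance-level settings behind the default paths, and the new values are only picked up after a SQL Server service restart:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Current default data and log paths on the replica<br />
SELECT SERVERPROPERTY('instancedefaultdatapath') AS default_data_path,<br />
&nbsp; &nbsp; &nbsp; &nbsp;SERVERPROPERTY('instancedefaultlogpath') AS default_log_path;<br />
<br />
-- Re-align the defaults with the expected topology (D:\SQL\Data and L:\SQL\Logs)<br />
EXEC master.dbo.xp_instance_regwrite N'HKEY_LOCAL_MACHINE',<br />
&nbsp; &nbsp; &nbsp;N'Software\Microsoft\MSSQLServer\MSSQLServer', N'DefaultData', REG_SZ, N'D:\SQL\Data';<br />
EXEC master.dbo.xp_instance_regwrite N'HKEY_LOCAL_MACHINE',<br />
&nbsp; &nbsp; &nbsp;N'Software\Microsoft\MSSQLServer\MSSQLServer', N'DefaultLog', REG_SZ, N'L:\SQL\Logs';<br />
-- Restart the SQL Server instance for the new default paths to take effect</div></div>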
<p>Hope this tip helps! </p>
<p>See you!</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Collaborative way and tooling to debug SQL Server blocked processes scenarios</title>
		<link>https://blog.developpez.com/mikedavem/p13186/devops/collaborative-way-and-tooling-to-debug-sql-server-blocked-processes-scenarios</link>
		<comments>https://blog.developpez.com/mikedavem/p13186/devops/collaborative-way-and-tooling-to-debug-sql-server-blocked-processes-scenarios#comments</comments>
		<pubDate>Thu, 30 Jan 2020 14:13:48 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[Extended events]]></category>
		<category><![CDATA[locks]]></category>
		<category><![CDATA[orphan transactions]]></category>
		<category><![CDATA[sp_WhoIsActive]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1437</guid>
		<description><![CDATA[A quick blog post to show how helpful an extended event and few other tools can be to help fixing orphan transactions in a real use case scenario. I often gave training with customers about SQL Server performance and tools, &#8230; <a href="https://blog.developpez.com/mikedavem/p13186/devops/collaborative-way-and-tooling-to-debug-sql-server-blocked-processes-scenarios">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A quick blog post to show how helpful an extended event session and a few other tools can be to fix orphan transactions in a real use case scenario. I have often given training to customers about SQL Server performance and tooling, but I noticed how difficult it can be to explain the importance of a tool if you only explain the theory without illustrating it with a real customer case.<br />
Well, let’s start my own story, which began a couple of days ago with a SQL alert indicating a blocking issue. Looking at our SQL dashboard (below), we were quickly able to confirm that we were running into an annoying issue and that it would get worse over time if we did nothing.</p>
<p><span id="more-1437"></span></p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-1-SQL-dashboard.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-1-SQL-dashboard.jpg" alt="153 - 1 - SQL dashboard" width="1339" height="669" class="alignnone size-full wp-image-1438" /></a></p>
<p>&#8230;</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-2-SQL-dashboard.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-2-SQL-dashboard.jpg" alt="153 - 2 - SQL dashboard" width="1874" height="663" class="alignnone size-full wp-image-1439" /></a></p>
<p>Let me be clear before continuing: there are plenty of tools to help dig into blocked processes scenarios, and my intention is not to favor one specific tool over another. So, in my case the first tool I used was the <a href="http://whoisactive.com/" rel="noopener" target="_blank">sp_WhoIsActive</a> procedure from Adam Machanic. One great feature of it is that it gives a comprehensive picture of what is happening on your system at the moment you execute the procedure.<br />
Here is a sample of the output I got. It doesn’t exactly reflect my context (which was a little more complex), but anyway my intention is not to focus on this specific part.  </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-3-sp_WhoIsActive.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-3-sp_WhoIsActive.jpg" alt="153 - 3 - sp_WhoIsActive" width="1339" height="328" class="alignnone size-full wp-image-1440" /></a></p>
<p>As you may see, I quickly got interesting information about the blocking leaders, including the session_id, the application name and the command executed at that moment. But the interesting part of this story was a request_id of NULL together with a status of SLEEPING. After some research, these values most likely indicate that SQL Server has completed the command and that the connection is waiting for the next command to come from the client. In addition, looking at the open_tran_count value (=1) confirmed the transaction was still open. We monitored the transaction for a couple of minutes to see whether the application would manage to commit (or roll back) the transaction, but nothing happened. So, we had to kill the corresponding session to get back to a normal situation. A few minutes later, the situation came back with the same pattern and we applied the same temporary fix (KILL session).</p>
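<p>As a side note, a quick way to spot this pattern (sleeping sessions that still hold an open transaction) is to combine a couple of DMVs. A minimal sketch:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Sleeping sessions that still own an open transaction (candidate orphan transactions)<br />
SELECT es.session_id,<br />
&nbsp; &nbsp; &nbsp; &nbsp;es.status,<br />
&nbsp; &nbsp; &nbsp; &nbsp;es.program_name,<br />
&nbsp; &nbsp; &nbsp; &nbsp;es.host_name,<br />
&nbsp; &nbsp; &nbsp; &nbsp;es.login_name,<br />
&nbsp; &nbsp; &nbsp; &nbsp;es.last_request_end_time,<br />
&nbsp; &nbsp; &nbsp; &nbsp;st.transaction_id<br />
FROM sys.dm_exec_sessions AS es<br />
JOIN sys.dm_tran_session_transactions AS st ON st.session_id = es.session_id<br />
LEFT JOIN sys.dm_exec_requests AS er ON er.session_id = es.session_id<br />
WHERE es.status = 'sleeping'<br />
&nbsp; AND er.session_id IS NULL -- no active request, matching the pattern above<br />
ORDER BY es.last_request_end_time;</div></div>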
<p>The next step consisted in discussing with the DEV team how to fix this annoying issue once and for all. We managed to reproduce the scenario in the DEV environment, but it was not clear what exactly happened inside the application, because we got no specific errors even when looking at the tracing infrastructure. To help the DEV team investigate the issue, we decided to create a dedicated extended event session that both monitors all the activity scoped to the application concerned and tracks the transactions that remain open during the tracing timeframe. Events can then easily be correlated by relying on the causality tracking capability of extended events.  </p>
<p>So, the extended event session used two targets: the event file and pair matching. The first one was intended to write the workload activity into a file on disk and the second one aimed at quickly identifying the transactions that were opened but never closed.</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;height:450px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">CREATE EVENT SESSION [OrphanedTransactionHunter] ON SERVER <br />
ADD EVENT sqlserver.database_transaction_begin(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.database_transaction_end(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.error_reported(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.module_end(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.module_start(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.rpc_completed(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.rpc_starting(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.sp_statement_completed(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.sp_statement_starting(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.sql_statement_completed(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),<br />
ADD EVENT sqlserver.sql_statement_starting(<br />
&nbsp; &nbsp; ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack))<br />
ADD TARGET package0.event_file(SET filename=N'OrphanedTransactionHunter'),<br />
ADD TARGET package0.pair_matching(SET begin_event=N'sqlserver.database_transaction_begin',begin_matching_actions=N'sqlserver.session_id',end_event=N'sqlserver.database_transaction_end',end_matching_actions=N'sqlserver.session_id',respond_to_memory_pressure=(1))<br />
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=5 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)<br />
GO</div></div>
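<p>Once the session is started, the unmatched database_transaction_begin events (in other words, the transactions that never ended) can also be pulled with T-SQL instead of the SSMS live data viewer. A sketch based on the session name above:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:650px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Raw content of the pair_matching target: the remaining (unmatched) begin events<br />
SELECT CAST(t.target_data AS XML) AS unmatched_begin_events<br />
FROM sys.dm_xe_sessions AS s<br />
JOIN sys.dm_xe_session_targets AS t ON t.event_session_address = s.address<br />
WHERE s.name = 'OrphanedTransactionHunter'<br />
&nbsp; AND t.target_name = 'pair_matching';</div></div>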
<p>The outputs were as follows:</p>
<p>&gt; Pair matching target (open transactions)</p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-4-xe-event-histogram.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-4-xe-event-histogram.jpg" alt="153 - 4 - xe event histogram" width="1473" height="69" class="alignnone size-full wp-image-1441" /></a></p>
<p>The only transaction that remained open during our test concerned session_id = 873 (just be careful: there may be other noisy records, not relevant here, captured at the moment you start the XE session). The attach_activity_id, attach_activity_id_xfer and session column values were helpful here to correlate the events recorded in the event file target. </p>
<p>&gt; Event File (Workload activity)</p>
<p>Here are the events after applying a filter with the above values. </p>
<p><a href="http://blog.developpez.com/mikedavem/files/2020/01/153-5-xe-event-file.jpg"><img src="http://blog.developpez.com/mikedavem/files/2020/01/153-5-xe-event-file.jpg" alt="153 - 5 - xe event file" width="1333" height="303" class="alignnone size-full wp-image-1442" /></a></p>
<p>We noticed the transaction with session_id = 873 was started but never ended. In addition, we were able to identify the sequence of code executed by the application (mainly based on prepared statements and stored procedures in our context). This information helped the DEV team focus on the right portion of code to fix. Without getting into details, it was very interesting to see that the root cause was a duplicate key error on a SQL statement that was not thrown and handled correctly by the application. I was just surprised the application didn’t catch any errors in such a case. We finally understood that the prepared statements and stored procedure calls were done through the DbUtils class, including the closeQuietly() method to close connections. Referring to the <a href="https://commons.apache.org/proper/commons-dbutils/apidocs/org/apache/commons/dbutils/DbUtils.html" rel="noopener" target="_blank">Apache documentation</a>, closeQuietly() is designed to hide SQL exceptions when they happen, which definitely does not help identifying the issue easily from the application side. Never mind, thanks to the collaboration with the DEV team we managed to get rid of this issue <img src="https://blog.developpez.com/mikedavem/wp-includes/images/smilies/icon_smile.gif" alt=":)" class="wp-smiley" /></p>
<p>David Barbarin </p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Experiencing updating statistics on a big table by unusual ways</title>
		<link>https://blog.developpez.com/mikedavem/p13167/sql-server-2014/experimentation-dune-mise-a-jour-de-statistiques-sur-une-grosse-table-par-des-voies-detournees</link>
		<comments>https://blog.developpez.com/mikedavem/p13167/sql-server-2014/experimentation-dune-mise-a-jour-de-statistiques-sur-une-grosse-table-par-des-voies-detournees#comments</comments>
		<pubDate>Thu, 25 Jan 2018 06:52:08 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[sqlserver]]></category>
		<category><![CDATA[statistiques]]></category>
		<category><![CDATA[TF7471]]></category>
		<category><![CDATA[update statistic]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1375</guid>
		<description><![CDATA[This is my first blog post of 2018, and the first in a while for that matter. Indeed, last year I put all my energy into refreshing my Linux knowledge in light of Microsoft&#8217;s new open source strategy. But at the same time, I &#8230; <a href="https://blog.developpez.com/mikedavem/p13167/sql-server-2014/experimentation-dune-mise-a-jour-de-statistiques-sur-une-grosse-table-par-des-voies-detournees">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This is my first blog post of 2018, and the first in a while for that matter. Indeed, last year I put all my energy into refreshing my Linux knowledge in light of Microsoft&rsquo;s new open source strategy. At the same time, I carried out a number of interesting tasks for some customers, and here is one of them to start the new year. In this post, I would like to highlight a somewhat unusual approach (in my opinion) to optimizing a statistics update on a big table.</p>
<p>&gt; <a href="https://blog.dbi-services.com/experiencing-updating-statistics-on-a-big-table-by-unusual-ways/" rel="noopener" target="_blank">Lire la suite</a> (en anglais)</p>
<p>David Barbarin<br />
MVP &amp; MCM SQL Server</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>When an index seek operation is not always your friend</title>
		<link>https://blog.developpez.com/mikedavem/p13111/sql-server-2005/lorsquune-recherche-dindex-nest-pas-forcement-adequate</link>
		<comments>https://blog.developpez.com/mikedavem/p13111/sql-server-2005/lorsquune-recherche-dindex-nest-pas-forcement-adequate#comments</comments>
		<pubDate>Thu, 13 Oct 2016 18:35:29 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[SQL Server 2005]]></category>
		<category><![CDATA[SQL Server 2008]]></category>
		<category><![CDATA[SQL Server 2008 R2]]></category>
		<category><![CDATA[SQL Server 2012]]></category>
		<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[SQL Server 2016]]></category>
		<category><![CDATA[index seek]]></category>
		<category><![CDATA[partial scan]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[sqlserver]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1274</guid>
		<description><![CDATA[Have you ever considered an index seek to be a problem? Let me tell you a story about one of my customers, with a simple context: a specific query that did not meet the required performance targets (around 200ms of &#8230; <a href="https://blog.developpez.com/mikedavem/p13111/sql-server-2005/lorsquune-recherche-dindex-nest-pas-forcement-adequate">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Have you ever considered an index seek to be a problem? Let me tell you a story about one of my customers, with a simple context: a specific query that did not meet the required performance targets (around 200ms of average execution time). The execution plan associated with the query was similar to what you can see here</p>
<p>&gt; <a href="http://blog.dbi-services.com/when-index-seek-operation-is-not-always-your-friend/" target="_blank">Lire la suite</a> (en anglais)</p>
<p>David Barbarin<br />
MVP &amp; MCM SQL Server</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Changing an existing partition configuration &#8230; not so easy</title>
		<link>https://blog.developpez.com/mikedavem/p13004/sql-server-2014/changement-une-partition-existante-pas-si-facile-que-cela</link>
		<comments>https://blog.developpez.com/mikedavem/p13004/sql-server-2014/changement-une-partition-existante-pas-si-facile-que-cela#comments</comments>
		<pubDate>Thu, 03 Mar 2016 05:25:03 +0000</pubDate>
		<dc:creator><![CDATA[mikedavem]]></dc:creator>
				<category><![CDATA[SQL Server 2014]]></category>
		<category><![CDATA[MERGE]]></category>
		<category><![CDATA[mouvement de données]]></category>
		<category><![CDATA[partition]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[SPLIT]]></category>
		<category><![CDATA[sqlserver]]></category>
		<category><![CDATA[SWITCH]]></category>

		<guid isPermaLink="false">http://blog.developpez.com/mikedavem/?p=1179</guid>
		<description><![CDATA[This time, let&#8217;s talk about an interesting customer case with a 100 GB partitioned table on SQL Server 2014. In this context, partitioning was intended to save disk space (compressed archive data) and to help reduce the &#8230; <a href="https://blog.developpez.com/mikedavem/p13004/sql-server-2014/changement-une-partition-existante-pas-si-facile-que-cela">Lire la suite <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This time, let&rsquo;s talk about an interesting customer case with a 100 GB partitioned table on SQL Server 2014. In this context, partitioning was intended to save disk space (compressed archive data) and to help reduce maintenance time as well as the resources consumed (partition-level index and statistics operations). At the same time, it would help improve query performance on the table concerned, since the workload focuses only on recent customer orders. </p>
<p>&gt; <a href="http://blog.dbi-services.com/changing-an-existing-partition-configuration-well-not-so-easy/" target="_blank">Lire la suite </a>(en anglais)</p>
<p>David Barbarin<br />
MVP &amp; MCM SQL Server</p>
]]></content:encoded>
			<wfw:commentRss></wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
