David Barbarin » devops

FinOps with Azure Cost management and Azure Log Analytics

mikedavem — Wed, 12 May 2021 15:37:47 +0000

In a previous blog post, I surfaced Azure Monitor capabilities for extending observability of Azure SQL databases. We managed to correlate different metrics and SQL logs to identify new execution patterns against our Azure SQL DB, and we finally moved to a new compute tier model that better fits our new context. In this blog post, I would like to share some new experiences about combining Azure cost analysis and Azure Log Analytics to spot an “abnormal” trend and fix it.

If you deal with cloud services and infrastructure, FinOps is a discipline you should get into to keep your costs under control and to get actionable insights that lead to more efficient cloud spending. Azure Cost Management provides visibility and control. Azure cost analysis is my favorite tool when I want to figure out the costs of the different services and to visualize improvements after applying quick wins or architecture upgrades to the environment. It is also a good place to identify stale resources to clean up. I will focus on Azure SQL DB here. From a cost perspective, the Azure SQL DB service includes different meter subcategories depending on the options and the service tier you use. You may have to pay for the compute, the dedicated storage for your database, your backups (PITR or LTR) and so on … Cost analysis allows drill-down analysis along different axes with aggregation and forecast capabilities.

In our context, we wanted to know whether moving from the Azure SQL DB Serverless compute tier (Pay-As-You-Go) to the Provisioned tier (+ Azure Hybrid Benefit + Reserved Instances for 3 years) had a positive effect on costs. A first look at the cost analysis section, with the correct filters and data aggregation on compute tier applied, confirmed our initial assumption that Serverless no longer fit our context. The chart uses a monthly timeframe with daily aggregation. We switched to the new model in mid-April as shown below:

Real numbers are confidential but not so important here. We can easily notice a drop in daily cost (~ 0.5) between the Serverless and Provisioned compute tiers.

If we take a higher-level view of all services and costs for the previous months, the trend is also confirmed for April, with combined serverless + provisioned tier costs lower than the serverless-only compute tier costs of the previous months. But we need to wait for the next months to confirm the trend.

At the same time (and this is the focus of this write-up), we detected a sudden increase in backup storage cost in March that may ruin the optimization efforts we made for compute, right? :). To explain this new trend, Log Analytics came to the rescue. As explained in the previous blog post, we configured streaming of Azure SQL DB telemetry into a Log Analytics target to benefit from solutions like SQL Insights and from custom queries against different Azure logs.

Basic metrics are part of Azure SQL DB telemetry and are stored in the AzureMetrics table. We can use a Kusto query to extract backup metrics and get an idea of the trends of the different backup types over time, including FULL, DIFF and LOG backups. The following query shows backup trends within the same timeframe used for billing in cost management (February to May). It also uses the series_fit_line function to draw a trendline in the time chart.

// Daily maximum backup size (in TB) over the last 90 days, with a fitted trendline
AzureMetrics
| where TimeGenerated >= ago(90d)
| where Resource == 'myDB'
// Swap the metric name to chart the other backup types: 'diff_backup_size_bytes', 'log_backup_size_bytes'
| where MetricName == 'full_backup_size_bytes'
| make-series SizeBackupTB=max(Maximum/1024/1024/1024/1024) on TimeGenerated from ago(90d) to now() step 1d
| extend (RSquare,Slope,Variance,RVariance,Interception,TrendLine)=series_fit_line(SizeBackupTB)
| render timechart

Full backup time chart

FULL backup size is relatively steady and cannot explain the sudden increase of storage backup cost in our case.

DIFF and LOG backup time chart

…

LOG and DIFF backup charts are more relevant and the trendline suggests a noticeable change starting mid-March. From the first part of the month onward, the trendline starts to diverge from the backup size series.
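To make the trendline idea more concrete, here is a minimal Python sketch of what a least-squares line fit does, which is the principle behind series_fit_line. It is not the Kusto implementation; the function names fit_line and first_departure, the tolerance value and the sample backup sizes are all assumptions made up for illustration.

```python
from statistics import mean

def fit_line(values):
    """Least-squares fit of y = slope * x + intercept over x = 0, 1, 2, ..."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    trend = [slope * x + intercept for x in xs]
    return slope, intercept, trend

def first_departure(values, trend, tolerance):
    """Index of the first point farther than `tolerance` from the trendline, or None."""
    for i, (y, t) in enumerate(zip(values, trend)):
        if abs(y - t) > tolerance:
            return i
    return None

# Hypothetical daily DIFF backup sizes in TB: steady, then a step up.
sizes_tb = [0.2] * 10 + [1.2] * 10
slope, intercept, trend = fit_line(sizes_tb)
print(f"slope per day: {slope:.3f} TB")
print("first day off the trendline:", first_departure(sizes_tb, trend, tolerance=0.3))
```

A positive slope over the window is the signal worth chasing; the point where the series leaves the fitted line gives a rough date to look up in the change log.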

At this stage, we had found the cause of the cost increase, but we were interested in understanding the reasons behind such a trend. After investigating our ITSM system, we were able to find a correlation with the deployment of a new maintenance tool: the Ola Hallengren maintenance solution + custom scripts to rebuild columnstore indexes. The latter aggressively rebuild 2 big fact tables with CCI in our DW (unlike the former tool), which explains the increase of DIFF and LOG backup sizes (~ 1TB).

This is where the collaboration with the data engineering team starts, to find an efficient and durable way to minimize the impact of the maintenance:

– Reviewing the custom script threshold may result in a more relaxed detection of fragmented columnstore indexes. However, this is only one piece of the solution because when a columnstore index becomes a good candidate for the next maintenance operation, it will still lead to a resource-intensive and time-consuming operation (> 2.5h dedicated to these two tables). We are using Azure Automation jobs with fair share to execute the maintenance and we are limited to 3h max per job execution. We may use a divide-and-conquer strategy to fit within the permitted execution timeframe, but it would lead to more complexity and we want to keep maintenance as simple as possible.

– We need to find another way to keep the execution time of index and statistics maintenance jobs under control. Introducing partitioning for these tables is probably a good move and another piece of the solution. Indeed, the tables concerned are currently not partitioned, and we could benefit from partition-level maintenance for both indexes and statistics.

Bottom line

Azure Cost Management and Log Analytics are a powerful recipe in the FinOps practice. The Kusto Query Language (KQL) is a flexible tool for finding and correlating all kinds of log entries and events, assuming you configured telemetry to the right target. I definitely like an annotation-like system such as the one we are using with Grafana because it makes correlation with external changes and workflows easier. Next step: investigate annotations on metric charts in Application Insights?

See you!!

Azure monitor as observability platform for Azure SQL Databases and more

mikedavem — Mon, 08 Feb 2021 16:57:26 +0000

In a previous blog post, I wrote about the reasons we moved our monitoring of on-prem SQL Server instances to Prometheus and Grafana. But what about cloud and database services?

We have different options and obviously, in my company, we first thought about moving our Azure SQL Database workload telemetry to the on-prem central monitoring infrastructure as well. But the main blocker is the serverless compute tier: the Telegraf agent would need to initiate a connection that could prevent the database from auto-pausing, or at least it would make monitoring more complex because it would require a predictable workload all the time.

The second option was to rely on Azure Monitor, which is a common platform combining several logging, monitoring and dashboard solutions across a wide set of Azure resources. It is a scalable, fully managed platform that provides a powerful query language and native features like alerts when logs or metrics match specific conditions. Another important point is that there is no vendor lock-in with this solution, as we can always fall back to our self-hosted Prometheus and Grafana instances if the compute tier no longer fits or if Azure Monitor is not an option anymore!

Firstly, to achieve good observability with Azure SQL Database, we need to put both diagnostic telemetry and SQL Server audit events in a common Log Analytics workspace. A quick illustration below:

Diagnostic settings are configured per database and include basic metrics (CPU, IO, memory etc …) as well as different SQL Server internal metrics such as deadlocks, blocked processes or Query Store information about query execution statistics and waits etc… For more details please refer to the Microsoft BOL.

Azure SQL DB auditing is either a server-level or a database-level configuration setting. In our context, we defined a template of events at the server level which is then applied to all databases within the logical server. By default, 3 events are automatically audited:
– BATCH_COMPLETED_GROUP
– SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP
– FAILED_DATABASE_AUTHENTICATION_GROUP

The first one on the list is probably open to discussion depending on the environment because of its impact, but in our context that’s ok because we are dealing with a data warehouse workload. However, we added other ones to meet our security requirements:
– PERMISSION_CHANGE_GROUP
– DATABASE_PRINCIPAL_CHANGE_GROUP
– DATABASE_ROLE_MEMBER_CHANGE_GROUP
– USER_CHANGE_PASSWORD_GROUP

But if you look closely at Log Analytics as a target for SQL audits, you will notice it is still a feature in preview as shown below:

To be clear, we usually don’t consider using Azure preview features in production, especially when they remain in this state for a long time, but in this specific context we were interested in the observability capabilities of the platform. On the one hand, we get very useful performance insights through SQL Analytics dashboards (again in preview) and on the other hand we can easily query logs and traces through Log Analytics for correlation with other metrics. Obviously, we hope Microsoft will move a step further and provide this feature in GA in the near future.

Let’s talk briefly about SQL Analytics first. It is an advanced and free cloud monitoring solution for Azure SQL database performance monitoring, and it relies mainly on your Azure diagnostic metrics and Azure Monitor views to present data in a structured way through performance dashboards.

Here is an example of the built-in dashboards we are using to track activity and high CPU / IO bound queries against our data warehouse.

You can use drill-down capabilities to reach different contextual dashboards and get insights into resource-intensive queries. For example, we identified some LOG IO intensive queries against a clustered columnstore index and, after refactoring an UPDATE statement into DELETE + INSERT, we drastically reduced LOG IO waits.

In addition, Azure Monitor helped us in another scenario where we tried to figure out recent workload patterns and to know whether the current compute tier still fit them. As said previously, we are relying on the Serverless compute tier to handle the data warehouse-oriented workload with both auto-scaling and auto-pausing capabilities. At first glance, we might expect a typical nightly workload as illustrated in the Microsoft BOL and a cost optimized for this workload:

Images from Microsoft BOL

It could have been true when the activity started on Azure, but the game has changed with new incoming projects over time. Starting with the general performance dashboard, the workload seems to follow the right pattern for the Serverless compute tier, but we noticed billing kept going during unexpected timeframes as shown below. Note that I deliberately show only a sample of two days, but this pattern is a good representation of the general workload in our context.

Indeed, the workload should be mostly nightly-oriented with sporadic activity during the day, but a quick correlation with other basic metrics like CPU or memory percentage usage confirmed a persistent activity all day. We have CPU spikes and probably small batches that keep a minimum of memory in use at other moments.

As per the Microsoft documentation, the minimum auto-pause delay value is 1h and requires an inactive database (number of sessions = 0 and CPU = 0 for user workload) during this timeframe. Basic metrics didn’t provide any further insights about connections, applications or users that could generate such « noisy » activity, so we had to go another way by looking at the SQL audit logs stored in Azure Monitor Logs. Data can be read through KQL, which stands for Kusto Query Language (and not Kibana Query Language). It’s the language used to query the Azure log databases (Azure Monitor Logs, Azure Monitor Application Insights and others) and it is pretty similar to SQL in its constructs.

Here is the first query I used to correlate the number of events with metrics. It counts the events that could prevent auto-pausing from kicking in for the database concerned, including RPC COMPLETED, BATCH COMPLETED, DATABASE AUTHENTICATION SUCCEEDED or DATABASE AUTHENTICATION FAILED:

AzureDiagnostics
| where Category == 'SQLSecurityAuditEvents' and (action_name_s in ('RPC COMPLETED','BATCH COMPLETED') or action_name_s contains "DATABASE AUTHENTICATION") and LogicalServerName_s == 'xxxx' and database_name_s == 'xxxx'
| summarize count() by bin(event_time_t, 1h),action_name_s
| render columnchart

Results are aggregated and bucketized per hour on the event time with the bin() function. Finally, for a quick and easy read, I chose a simple and unformatted column chart render. Here is the outcome:

As you probably noticed, daytime activity is pretty small compared to the nightly one and seems to consist of SQL batches and remote procedure calls. From this unclear picture, we can confirm anyway that the daytime workload is enough to keep the billing going because there is no hourly timeframe without activity.
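The "no idle hour" reasoning can be sketched offline in a few lines of Python. This is only an illustration of the hourly bucketing done by bin(event_time_t, 1h), not something we ran against Azure; the function names and the sample timestamps are hypothetical. Keep in mind that an empty calendar hour is only a rough proxy for the auto-pause condition, which applies to any continuous period of inactivity.

```python
from collections import Counter
from datetime import datetime, timedelta

def events_per_hour(timestamps):
    """Count events per hour bucket, similar to summarize count() by bin(event_time_t, 1h)."""
    return Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)

def idle_hours(timestamps, start, end):
    """Return the hour buckets in [start, end) that contain no event at all."""
    counts = events_per_hour(timestamps)
    idle = []
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        if counts[hour] == 0:  # a Counter returns 0 for missing buckets
            idle.append(hour)
        hour += timedelta(hours=1)
    return idle

# Hypothetical audit event times: activity at 07:xx and 09:xx, nothing at 08:xx.
events = [
    datetime(2021, 1, 26, 7, 5),
    datetime(2021, 1, 26, 7, 40),
    datetime(2021, 1, 26, 9, 15),
]
gaps = idle_hours(events, datetime(2021, 1, 26, 7), datetime(2021, 1, 26, 10))
print(gaps)
```

An empty result over the whole observation window corresponds to the situation described above: the database never gets a chance to pause.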

Let’s write another KQL query to draw a clearer picture of which applications ran during the daytime window 07:00 – 20:00:

let start=datetime("2021-01-26");
let end=datetime("2021-01-29");
let dailystart=7;
let dailyend=20;
let timegrain=1d;
AzureDiagnostics
| project action_name_s, event_time_t, application_name_s, server_principal_name_s, Category, LogicalServerName_s, database_name_s
| where Category == 'SQLSecurityAuditEvents' and (action_name_s in ('RPC COMPLETED','BATCH COMPLETED') or action_name_s contains "DATABASE AUTHENTICATION")
| where LogicalServerName_s == 'xxxx' and database_name_s == 'xxxx'
| where event_time_t > start and event_time_t < end
| where datetime_part("Hour",event_time_t) between (dailystart .. dailyend)
| summarize count() by bin(event_time_t, 1h), application_name_s
| render columnchart with (xtitle = 'Date', ytitle = 'Nb events', title = 'Prod SQL Workload pattern')

And here is the new outcome:

The new chart reveals some activity from SQL Server Management Studio, but most of it comes from applications using the .Net SQL Data Provider. For better clarity, we need more information about these applications and, in my context, I managed to address the point by reducing the search scope to the service principal name that issued the related audit events. It results in this new outcome, which is pretty similar to the previous one:

Good job so far. For the sake of clarity, the service principal obfuscated above is used by our Reporting Server infrastructure and its reports to get data from this data warehouse. By investigating daytime activity at different moments on the Azure SQL database concerned, we came to the conclusion that using the Serverless compute tier didn’t make sense anymore and that we likely need to move to another compute tier.

Additional thoughts

Azure Monitor is definitely a must-have if you are running resources on Azure and if you don’t own a platform for observability (metrics, logs and traces). Even if you do, it can be beneficial for freeing up your on-prem monitoring infrastructure resources if scalability is a concern. Furthermore, there is no vendor lock-in and you can decide to stream Azure Monitor data to another place, but at the cost of additional network transfer fees depending on the target scenario. For example, Azure Monitor can be used directly as a data source with Grafana. Azure SQL telemetry can be collected with the Telegraf agent whereas audit logs can be recorded in another logging system like Kibana. In this blog post, we just scratched the surface of Azure Monitor capabilities but, as demonstrated above, performing deep correlation analysis from different sources in very few steps is a strong point of this platform.

Why we moved SQL Server monitoring on Prometheus and Grafana

mikedavem — Tue, 22 Dec 2020 16:55:12 +0000

During this year, I spent a part of my job understanding the processes and concepts around monitoring in my company. The DevOps mindset mainly drove the idea of moving our SQL Server monitoring to the existing Prometheus and Grafana infrastructure. Obviously, there were some technical decisions behind the scenes, but most of this write-up is dedicated to explaining the other, and likely more important, reasons for this decision.

But to be clear first, this write-up doesn’t constitute any guidance or any kind of best practices for DBAs but only some sharing of my own experience on the topic. As usual, any comment will be appreciated.

That said, let’s continue with the context. At the beginning of this year, I started my new DBA position in a customer-centric company where DevOps culture, microservices and CI/CD are omnipresent. What does it mean exactly? To cut a long story short, development and operations teams use a common approach for agile software development and delivery. Tools and processes are used to automate build, test, deploy and to monitor applications with speed, quality and control. In other words, we are talking about Continuous Delivery and, in my company, the release cycle is faster than in the traditional shops I have encountered so far, with several releases per day including database changes. Another interesting point is that we are following the « Operate what you build » principle: each team that develops a service is also responsible for operating and supporting it. It presents some advantages for both developers and operations, but pushing out changes requires getting feedback and observing the impact on the system on both sides.

In addition, in the operations teams we try to act as a centralized team and each member should understand the global scope and topics related to the infrastructure and its ecosystem. This is especially true when you’re dealing with nightly on-calls. Each member has their own segment of responsibility (regarding their specialized areas) but, following DevOps principles, we encourage shared ownership to break down internal silos and optimize feedback and learning. It implies anyone should be able to temporarily take over any operational task to some extent, assuming the process is well-documented and learning has been done correctly. But the world is not perfect and this model has its downsides. For example, it prioritizes effectiveness in broader domains, which increases the cognitive load of each team member and lowers visibility into vertical topics when deeper expertise is sometimes required. Having an end-to-end observable system including the infrastructure layer and databases may help to reduce the time for investigating and fixing issues before end users experience them.

The initial scenario

Let me give some background info and illustration of the initial scenario:

… and my feeling of what could be improved:

1) From a DBA perspective, at first glance there are many potential issues. Indeed, a lot of automated or semi-manual deployment processes are out of the DBA’s control and may have a direct impact on the stability of the database environment. Without better visibility, there is likely no easy way to address the famous question: Hey, we have been experiencing performance degradation for two days, has something happened on the database side?

2) Silos are encouraged between DBAs and DEVs in this scenario. The direct consequence is to drastically limit the added value of the DBA role in a DevOps context. Obviously, primary concerns include production tasks like ensuring integrity, backups and maintenance of databases. But in a DevOps-oriented company where we have automated « database-as-code » pipelines, there remains a lot of unnecessary complexity and disruptive scripts that the DBA should take care of. If this role is placed only at the end of the delivery pipeline, collaboration and continuous learning with developer teams will be restricted to a minimum.

3) There is a dedicated monitoring tool for the SQL Server infrastructure and this is a good point. It provides the necessary baselining and performance insights for DBAs. But on the other hand, the tool in place targets only DBA profiles and its usage is limited to the infrastructure team. This doesn’t help improve scalability in the operations team and beyond. Another issue with the existing tooling is that correlation can be difficult with external events that come from either the continuous delivery pipeline or configuration changes performed by operations teams on the SQL Server instances. In this case, establishing observability (the why) may be limited, and this is what teams need to respond quickly and resolve emergencies in modern and distributed software.

What is observability?

You probably noticed the word « observability » in my previous sentence, so I think it deserves some explanation before continuing. Observability might seem like a buzzword, but in fact it is not a new concept; it has become prominent in DevOps software development lifecycle (SDLC) methodologies and distributed infrastructure systems. Referring to the Wikipedia definition, observability is the ability to infer internal states of a system based on the system’s external outputs. To be honest, it did not help me very much and further reading was necessary to shed light on what observability exactly is and how it differs from monitoring.

Let’s start instead with monitoring, which is the ability to translate infrastructure log and metrics data into meaningful and actionable insights. It helps you know when something goes wrong and start your response quickly. This is the basis of a monitoring tool and the existing one is doing a good job on it. In the DBA world, monitoring is often related to performance, but reporting performance is only useful insofar as that reporting accurately represents the internal state of the global system and not only your database environment. For example, in the past I went to some customer shops where I was in charge of auditing their SQL Server infrastructure. Generally, customers were able to present their context, but they were not able to provide real facts or performance metrics for their application. In this case, you usually rely on a top-down approach and if you’re either lucky or experienced enough, you manage to find what is going wrong. But sometimes I got relevant SQL Server metrics that would have highlighted a database performance issue, yet we couldn’t make a clear correlation with those identified on the application side. In this case, relying only on database performance metrics was not enough for inferring the internal state of the application. From my experience, many shops are concerned by such applications that have been designed for success and not for failure. They often lack debuggability, and monitoring and telemetry are often missing. Collecting data is the base of observability.

Observability provides not only the when of an error or issue, but more importantly the why. With modern software architectures including micro-services and the emphasis on DevOps, monitoring goals are no longer limited to collecting and processing log data, metrics, and event traces. Instead, monitoring should be employed to improve observability by getting a better understanding of the properties of an application and its performance across distributed systems and the delivery pipeline. Referring to the new context I’m working in now, metric capture and analysis start with the deployment of each micro-service, and they provide better observability by measuring all the work done across all dependencies.

White-Box vs. Black-Box Monitoring

In my company, as in many other companies, different approaches are used when it comes to monitoring: white-box and black-box monitoring.
White-box monitoring focuses on exposing the internals of a system. For example, this approach is used by many SQL Server performance tools on the market that strive to map the system with a bunch of internal statistics about index or internal cache usage, existing wait stats, locks and so on …

In contrast, black-box monitoring is symptom-oriented and tests externally visible behavior as a user would see it. The goal is only to monitor the system from the outside and to see ongoing problems in the system. There are many ways to achieve black-box monitoring and the first obvious one is using probes which collect CPU or memory usage, network communications, HTTP health checks or latency and so on … Another option is to use a set of integration tests that run all the time to test the system from a behavior / business perspective.

White-box vs. black-box monitoring: which is finally more important? Both are, and they can work together. In my company, both are used at different layers of the micro-service architecture including software and infrastructure components.

RED vs USE monitoring

When you’re working in a web-oriented and customer-centric company, you are quickly introduced to the Four Golden Signals monitoring concept, which defines a series of metrics originating from Google Site Reliability Engineering: latency, traffic, errors and saturation. The RED method is a subset of the “Four Golden Signals”, focuses on micro-service architectures and includes the following metrics:

Rate: number of requests our service is serving per second
Error: number of failed requests per second
Duration: amount of time it takes to process a request

Those metrics are relatively straightforward to understand and may reduce the time needed to figure out which service is throwing errors, before eventually looking at the logs or restarting the service.
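As a small illustration of the three RED metrics listed above, the following Python sketch computes them from a batch of request records observed during a time window. The Request class, the red_metrics function and the sample values are assumptions for the example only; in practice these numbers come from the metrics pipeline rather than from hand-written code.

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    failed: bool       # True if the request ended in an error

def red_metrics(requests, window_s):
    """Compute Rate, Errors and Duration over a window of `window_s` seconds."""
    total = len(requests)
    failures = sum(1 for r in requests if r.failed)
    durations = [r.duration_s for r in requests]
    return {
        "rate_per_s": total / window_s,
        "errors_per_s": failures / window_s,
        "avg_duration_s": sum(durations) / total if total else 0.0,
        "max_duration_s": max(durations) if durations else 0.0,
    }

# Hypothetical sample: 4 requests seen during a 2-second window, one of them failed.
sample = [
    Request(0.1, False),
    Request(0.3, True),
    Request(0.2, False),
    Request(0.4, False),
]
metrics = red_metrics(sample, window_s=2)
print(metrics)
```

Real systems usually track duration as percentiles or histograms rather than a plain average; the average is used here only to keep the sketch short.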

For HTTP metrics the RED method is a good fit, while the USE method is more suitable for the infrastructure side, where the main concern is to keep physical resources under control. The latter is based on 3 metrics:

Utilization: Mainly represented in percentage and indicates if a resource is in underload or overload state.
Saturation: Work in a queue and waiting to be processed
Errors: Count of event errors

Those metrics are commonly used by DBAs to monitor performance. It is worth noting that the utilization metric can sometimes be misinterpreted, especially when the maximum value depends on the context and can go over 100%.
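One common way a utilization figure ends up above 100% is when CPU time is reported relative to a single core on a multi-core machine. The Python sketch below shows the arithmetic under that assumption; the function name and the numbers are hypothetical and only meant to show why the denominator matters.

```python
def cpu_utilization(cpu_seconds_used, interval_s, cores=1):
    """CPU utilization in percent, relative to `cores` cores over the interval."""
    return 100.0 * cpu_seconds_used / (interval_s * cores)

# 90 seconds of CPU time consumed during a 60-second interval.
relative_to_one_core = cpu_utilization(90, 60)          # above 100%
relative_to_four_cores = cpu_utilization(90, 60, cores=4)
print(relative_to_one_core, relative_to_four_cores)
```

The same workload reads as overloaded or mostly idle depending on the reference chosen, so it helps to state that reference next to the chart.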

SQL Server infrastructure monitoring expectations

Referring to the starting scenario and all the concepts surfaced above, it was clear to us that we had to evolve our existing SQL Server monitoring architecture to improve our ability to reach the following goals:

Analyzing long-term trends to answer usual questions like: how is my daily workload evolving? How big is my database? …
Alerting, to respond to something broken that we need to fix now or to an ongoing issue that we must check soon.
Building comprehensive dashboards – dashboards should answer basic questions about our SQL Server instances, and should include some form of advanced SQL telemetry and logging for deeper analysis.
Conducting ad-hoc retrospective analysis with easier correlation: for example, an HTTP response latency that increased in one service. What happened around it? Is it related to a database issue? Or a blocking issue raised on the SQL Server instance? Is it related to a new query or schema change deployed from the automated delivery pipeline? In other words, good observability should be part of the new solution.
Automated discovery and telemetry collection for every SQL Server instance installed in our environment, either on a VM or in a container.
Relying entirely on the common monitoring platform based on Prometheus and Grafana. Having the same tooling often makes communication easier between people (the human factor is also an important aspect of DevOps).

Prometheus, Grafana and Telegraf

Prometheus and Grafana are the central monitoring solution for our micro-service architecture. Some others exist but we’ll focus on these tools in the context of this write-up.
Prometheus is an open-source ecosystem for monitoring and alerting. It uses a multi-dimensional data model based on time series data identified by metric name and key/value pairs. PromQL is the query language used by Prometheus to aggregate data in real time, and data are either shown directly or consumed via the HTTP API by external systems like Grafana. Unlike with the previous tooling, we appreciated collecting SQL Server metrics as well as those of the underlying infrastructure like VMware and others. It gives a comprehensive picture of the full path between the database services and the infrastructure components they rely on.

Grafana is open source software used to display time series analytics. It allows us to query, visualize and generate alerts from our metrics. It is also possible to integrate a variety of data sources in addition to Prometheus, increasing the correlation and aggregation capabilities of metrics from different sources. Finally, Grafana comes with a native annotation store and the ability to add annotation events directly from the graph panel or via the HTTP API. This feature is especially useful to store annotations and tags related to external events and we decided to use it for tracking software releases or SQL Server configuration changes. Having such events directly on the dashboard may reduce troubleshooting effort by answering the why of an issue faster.

For collecting data we use the Telegraf plugin for SQL Server. The plugin exposes all configured metrics to be polled by a Prometheus server. The plugin can be used for both on-prem and Azure instances including Azure SQL DB and Azure SQL MI. Automated deployment and configuration require little effort as well.

The high-level overview of the new implemented monitoring solution is as follows:

SQL Server telemetry is achieved through Telegraf + Prometheus and includes both black-box and white-box oriented metrics. External events like automated deployments and server-level or database-level configuration changes are monitored through a centralized scheduled framework based on PowerShell. Annotations + tags are then written to Grafana accordingly and event details are recorded in logging tables for further troubleshooting.
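To give an idea of what writing an annotation involves, here is a hedged Python sketch that only builds the JSON body expected by Grafana's annotation HTTP API (POST /api/annotations, with time in epoch milliseconds, tags and text). Our own framework is written in PowerShell, so this is not that code; the function name, the tag values and the release text are made up for the example, and the HTTP call itself (URL, API token) is deliberately left out.

```python
import json
from datetime import datetime, timezone

def build_annotation(text, tags, when):
    """Build the JSON body for a Grafana annotation (organization-level, no dashboard bound)."""
    return {
        "time": int(when.timestamp() * 1000),  # Grafana expects epoch milliseconds
        "tags": list(tags),
        "text": text,
    }

# Hypothetical event: a configuration change recorded on a SQL Server instance.
event_time = datetime(2021, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
body = build_annotation(
    text="max degree of parallelism changed",
    tags=["sqlserver", "config-change"],
    when=event_time,
)
print(json.dumps(body))
```

Tags are what make the events useful later on: a dashboard can filter annotations by tag, for instance to show only releases or only configuration changes.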

Did the new monitoring meet our expectations?

Well, having experienced the new monitoring solution during this year, I would say we are on the right track. We worked mainly on 2 dashboards. The first one exposes basic black-box metrics to quickly show if something is going wrong, while the second one is DBA-oriented with plenty of internal counters to dig further and to perform retrospective analysis.

Here is a sample of representative issues we faced this year and managed to fix with the new monitoring solution:

1) Resource pressure and black-box monitoring in action:

For this scenario, the first dashboard highlighted resource pressure issues, but it is worth noting that even if the infrastructure was burning, users didn’t experience any side effects or slowness on the application side. After the corresponding alerts were raised on our side, we applied proactive and temporary fixes before users experienced any impact. I would say this scenario is something we would have been able to manage with the previous monitoring, and the good news is we didn’t notice any regression on this topic.

2) Better observability for better resolution of complex issue

This scenario was more interesting because the first symptom started on the application side without alerting the infrastructure layer. We started suffering from HTTP request slowness in November around 12:00am and developers got alerted to sporadic timeout issues by the logging system. After traversing the service graph, they spotted that something had gone wrong on the database service by correlating HTTP slowness with blocked processes on the SQL Server dashboard as shown below. I show a simplified view of the dashboards, but in reality we need to cross several routes between the front-end services and the databases.

Then I got a call from them and we started investigating the blocked processes from the logging tables in place on the SQL Server side. At first glance, we saw different queries with longer execution times than usual, and neither release deployments nor configuration updates could explain such a sudden behavior change. The issue persisted, and at 15:42 it started appearing frequently enough to deserve a deeper look at the SQL Server internal metrics. We quickly found some interesting correlations with other metrics and finally managed to figure out why things went wrong, as shown below:

The root cause was related to transaction replication slowness within Always On availability group databases, and the error log details on the secondary pointed us directly to a storage issue:

End-to-end observability, achieved by including the database services in the new monitoring system, drastically reduced the time needed to find the root cause. We also learnt from this experience: to keep improving observability, we added a black-box oriented metric for availability group replication latency (see below) to detect potential issues faster.

And what’s next?

Having such monitoring is not the end of this story. As said at the beginning of this write-up, continuous delivery comes with its own DBA challenges, illustrated by the starting scenario. Traditionally the DBA role is siloed, turning requests or tickets into work, and DBAs can lack context about the broader business or the technology used in the company. I have myself experienced several situations where you get alerted during the night because a developer’s query exceeds some usage threshold. Having discussed the point with many DBAs, I find they tend to be conservative about database changes (a normal reaction?), especially when they sit at the end of the delivery process without a clear view of what exactly will be deployed.

Here is the new situation:

Implementing the new monitoring changed the way we observe the system (at least from a DBA perspective). Again, I believe the added value of the DBA role in a company with a strong DevOps mindset lies in acting as both a production DBA and a development DBA. Making observability consistent across the whole delivery pipeline, databases included, is likely part of the success and may help DBAs get a broader picture of the system components. In my context, I’m now able to interact more with developer teams in the early phases and to provide them with contextual (rather than generic) feedback for improvements based on SQL production telemetry. They also have access to it and can check the impact of their development by themselves. In the same way, feedback and work with my team around database infrastructure topics may appear more relevant.

It is finally a matter of collaboration

Database maintenance thoughts with Azure SQL databases

mikedavem — Sun, 29 Mar 2020 20:55:33 +0000

As a DBA, your priority is to ensure your data is consistent and safely backed up, and that your databases deliver steady performance. In on-prem environments, these tasks are generally performed through scheduled jobs, including backups, integrity checks and index / statistics maintenance tasks.

But moving databases to the cloud, in Azure or elsewhere, tells a different story. Indeed, even if the same concerns and tasks remain, some of them fall under the responsibility of the cloud provider and others do not. If you’re working with Azure SQL databases, like me, some questions arise very quickly on this topic, and that was my motivation for this write-up. I would like to share some new experiences by digging into the different maintenance items. If you have a different story to tell, please feel free to comment and share your own experience!

Database backups

Microsoft takes over the database backups with a strategy based on FULL (every week), DIFF (every 12 hours) and LOG (every 5 to 10min) backups, with cross-datacenter replication of the backup data. As far as I know, we cannot change this strategy, but we may change the retention period and extend it with an archiving period of up to 10 years by enabling long-term retention (LTR). The latter assumes LTR is supported by your database service level and the options that come with it. For instance, we are using some SQL Azure databases in serverless mode, which doesn’t support LTR. This strategy provides different methods to restore an Azure database, including PITR, Geo-Restore or the ability to restore a deleted database. We use some of them for our database refreshes between Azure SQL Servers or sometimes to restore previous database states for testing. However, just be aware that even if restoring a database may be a trivial operation in Azure, it may take a long time depending on your context and the factors described here. In our context, a restore operation may take up to 2.5h (600GB of data to restore on GEN5

In addition, it is worth noting that there is no free lunch here: you will pay for storing your backups, and probably more than you initially expect. Cost is obviously tied to your backup size for FULL, DIFF and LOG and to the retention period, which sometimes makes the budget hard to predict. According to discussions with some colleagues and other MVPs, it seems we are not alone in this case, and my advice is to keep an eye on your costs. Here is a quick and real picture of the cost ratio between compute + database storage versus backup storage (PITR + LTR) with a PITR retention of 35 days and LTR (max retention of one year)

…

As you may notice, half of the total fees for the Azure SQL Database may come from backup storage alone. On our side, we are working hard on reducing this ratio, but this is another topic out of the scope of this blog post.
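
To get a feel for why the retention period weighs on the bill, here is a rough sketch that counts how many backups the cadence described above produces over a retention window. Billing depends on backup size rather than on the number of backups, so this only illustrates volume; the function name and the 10-minute log interval (the upper bound of the 5 to 10min range) are assumptions.

```python
def backups_in_window(retention_days, full_every_days=7,
                      diff_every_hours=12, log_every_minutes=10):
    """Rough count of backups produced during a retention window."""
    hours = retention_days * 24
    return {
        "full": retention_days // full_every_days,
        "diff": hours // diff_every_hours,
        "log": hours * 60 // log_every_minutes,
    }


# PITR retention of 35 days, as in our context.
counts = backups_in_window(35)
print(counts)  # {'full': 5, 'diff': 70, 'log': 5040}
```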

Database integrity check

Should we continue to use the famous DBCC CHECKDB command? Well, the answer is no: the Azure SQL Database engineering team takes responsibility for managing data integrity. During internal team discussions we wondered what the process would be to recover corrupt data and how fast corruptions are handled by the Azure team. All these questions seem to be addressed in this Microsoft blog post. For us, it was important to know the Microsoft response time in case of database corruption because it may impact the retention policy: the faster Microsoft warns you about an integrity issue, the shorter the retention needed to rewind to the last consistent point (within a reasonable order of magnitude, obviously).

Database maintenance (statistics and indexes)

A common misunderstanding with Azure SQL Database is that the maintenance of indexes and statistics is no longer the DBA's responsibility. In the discussions around me, automatic index tuning was often mentioned to support this misconception. Automatic tuning aims to adapt the database dynamically to a changing workload by applying tuning recommendations: creating new indexes, dropping redundant and duplicate indexes, or forcing the last good plan for queries. Even if this feature (not enabled by default) surely helps improve performance, it substitutes for neither updating statistics nor rebuilding fragmented indexes. Concerning statistics, it is true that some improvements have shipped with SQL Server over time, like TF2371, which makes the update threshold formula for large tables more dynamic (the default behavior since SQL Server 2016). But we may arguably say that there remain situations where statistics should be updated manually, and as a database administrator it is still your responsibility to maintain them.
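
To illustrate what "more dynamic" means, here is a small sketch comparing the legacy auto-update statistics threshold with the one Microsoft documents for SQL Server 2016+ (compatibility level 130), namely MIN(500 + 0.20 * n, SQRT(1000 * n)) modifications for a table of n rows. The function names are mine, and the sketch only covers tables with more than 500 rows.

```python
import math


def legacy_threshold(rows):
    """Legacy default for tables over 500 rows: 500 + 20% of the row count."""
    return 500 + 0.20 * rows


def dynamic_threshold(rows):
    """Dynamic threshold: MIN(500 + 20% of n, SQRT(1000 * n))."""
    return min(legacy_threshold(rows), math.sqrt(1000 * rows))


for rows in (10_000, 1_000_000, 100_000_000):
    print(f"{rows:>11,} rows: legacy={legacy_threshold(rows):>12,.0f} "
          f"dynamic={dynamic_threshold(rows):>10,.0f}")
```

For a 1,000,000-row table, the legacy rule waits for about 200,500 modifications while the dynamic rule triggers after roughly 31,600. That helps, but a skewed or fast-growing table can still need a manual update well before either threshold is reached.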

Database maintenance and scheduling in Azure?

As said at the beginning of this write-up, database maintenance is a different story with Azure SQL DB, and the same applies when it comes to scheduling. Indeed, we quickly noticed that we lacked built-in job scheduler capabilities like the traditional SQL Server Agent of on-premises installations, but that doesn’t mean we were unable to schedule any job at all. In fact, there are different options to replace the traditional SQL Server Agent for database maintenance in Azure, and we had to look at them:

1) SQL Agent jobs still exist but are only available for SQL Managed Instances. In our context, we use Azure Single Database with the GP_S_Gen5 SKU, so this was definitely not an option for us.

2) Elastic database jobs can run across multiple servers and allow writing DB maintenance tasks in T-SQL or PowerShell. But this feature has some limitations which excluded it from the equation:
– It’s still in preview and we cannot rely on it for production scenarios
– Serverless and auto-pausing / auto-resuming used with our GP_S_Gen5 SKU database are not supported

3) Data factory could be an option because it is already part of the Azure Services consumed in our context, but we wanted to be decoupled from ETL / Business workflow.

4) Finally, what interested us in Data Factory was especially the integration with Git and Azure DevOps, and the same capabilities ship with Azure Automation. Another important decision factor was cost, because Azure Automation is free for up to 500 minutes of job execution per month. In our context, we have a weekly schedule for our maintenance plan and we estimated one hour per runbook execution. Thus, we stay under the threshold that triggers additional fees.
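
The free-tier reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch using the figures quoted in this post (500 free minutes per month, one weekly run of about one hour); the function name and the worst-case assumption of 5 runs in a 31-day month are mine.

```python
import math

FREE_MINUTES_PER_MONTH = 500  # free job execution time quoted in this post


def monthly_job_minutes(runs_per_week, minutes_per_run, days_in_month=31):
    """Worst-case job minutes in a month for a weekly-based schedule."""
    weeks = math.ceil(days_in_month / 7)  # a 31-day month can contain 5 run days
    return runs_per_week * weeks * minutes_per_run


used = monthly_job_minutes(runs_per_week=1, minutes_per_run=60)
print(used, used <= FREE_MINUTES_PER_MONTH)  # 300 True
```

Even in the worst case we keep a margin of 200 minutes, which leaves room for one extra run per week of up to 40 minutes before any fee applies.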

Azure Automation brings good control over credentials, but we already use Azure Key Vault to protect sensitive information. We found that using both the native Azure Automation capabilities and Azure Key Vault would be redundant, and could decentralize our secret management and make it more complex. Here is the big picture of the process to perform the maintenance of our Azure databases from a scheduled runbook in Azure Automation:

First, we use a PowerShell-based runbook which in turn calls different stored procedures on the target Azure database to perform the database maintenance. To comply with our DevOps processes, the runbook is stored in a source control repository (Git) and published to Azure Automation through the built-in sync process. The runbook runs with the “Run As Account” option to get access to Azure Key Vault and retrieve the AppID of the dedicated application identity. This identity is then used to connect to the SQL Azure DB and to perform the database maintenance, based on the corresponding token authentication and the permissions granted on the DB side. Token-based authentication has been available since Azure SQL DB v12 and helped us meet our security policy, which prevents using SQL logins when possible. To generate the token, we still use the old ADAL.PS module. This is something we need to update in the future.

Here is a sample of the interesting parts of the PowerShell code to authenticate to the Azure database:

# Run the runbook as a special account to get access to the Azure Key Vault
$AzureAutomationConnectionName = "xxxx"
$ServicePrincipalConnection = Get-AutomationConnection -Name $AzureAutomationConnectionName

…

$clientId = (Get-AzKeyVaultSecret -VaultName $KeyvaultName -Name "xxxxx").SecretValueText
$response = Get-ADALToken -ClientId $clientId -ClientSecret $clientSecret -Resource $resourceUri -Authority $authorityUri -TenantId $tenantName

# Connection String
$connectionString = "Server=tcp:$SqlInstance,1433;Initial Catalog=$Database;Persist Security Info=False;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;"

# Create the connection object
$connection = New-Object System.Data.SqlClient.SqlConnection($connectionString)

# Set identity by using the corresponding token to connect to the Azure DB
$connection.AccessToken = $response.AccessToken

...

Yes, Azure is a different beast (like other clouds) and requires DBAs to review their habits. It may be very confusing at the beginning, but everything you did in the past is still possible, or at least can be achieved in a different way, in Azure. Think differently: that would be my best advice in this case!

Collaborative way and tooling to debug SQL Server blocked processes scenarios

mikedavem — Thu, 30 Jan 2020 14:13:48 +0000

A quick blog post to show how helpful an extended event session and a few other tools can be to fix orphaned transactions in a real use case scenario. I have often given training to customers about SQL Server performance and tools, and I noticed how difficult it can be to convey the importance of a tool if you only explain theory without illustrating it with a real customer case.
Well, let’s start my own story, which began a couple of days ago with an SQL alert indicating a blocking scenario. Looking at our SQL dashboard (below), we were able to quickly confirm that we were running into an annoying issue and that it would get worse over time if we did nothing.

…

Before continuing, let me point out that there are plenty of tools to help dig into blocked processes scenarios, and my intention is not to favor one specific tool over another. In my case, the first tool I used was the sp_WhoIsActive procedure from Adam Machanic. One great feature of it is that you get a comprehensive picture of what is happening on your system at the moment you execute the procedure.
Here is a sample of the output I got. Note that it doesn’t reflect my context exactly (which was a little more complex), but my intention is not to focus on this specific part.

As you may see, I quickly got interesting information about the blocking leaders, including the session_id, the application name and the command executed at this moment. But the interesting part of this story was a request_id of NULL together with a status of SLEEPING. After some research, it turned out that these values likely indicate that SQL Server has completed the command and the connection is waiting for the next command to come from the client. In addition, the open_tran_count value (=1) confirmed the transaction was still open. We monitored the transaction for a couple of minutes to see if the application would manage to commit (or roll back) it, but nothing happened. So, we had to kill the corresponding session to get back to a normal situation. A few minutes later, the situation came back with the same pattern and we applied the same temporary fix (KILL session).
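
The pattern described above (idle session, no current request, transaction still open) is easy to express as a filter. Here is a minimal sketch, assuming the monitoring output has been exported as rows with the column names used above; the function name and the sample rows are hypothetical.

```python
def find_orphan_candidates(sessions):
    """Return sessions that are idle but still hold an open transaction."""
    return [
        s for s in sessions
        if s["status"].lower() == "sleeping"   # command completed, waiting for the client
        and s["request_id"] is None            # no request currently running
        and s["open_tran_count"] > 0           # but a transaction is still open
    ]


# Hypothetical rows, shaped like a subset of the sp_WhoIsActive columns.
sample = [
    {"session_id": 61, "status": "running", "request_id": 0, "open_tran_count": 1},
    {"session_id": 55, "status": "sleeping", "request_id": None, "open_tran_count": 1},
    {"session_id": 92, "status": "sleeping", "request_id": None, "open_tran_count": 0},
]
print([s["session_id"] for s in find_orphan_candidates(sample)])  # [55]
```

A match is only a candidate: a sleeping session with an open transaction can be legitimate for a short time, which is why we watched ours for a couple of minutes before killing it.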
The next step consisted in discussing with the DEV team how to fix this annoying issue once and for all. We managed to reproduce the scenario in the DEV environment, but it was not clear what exactly happened inside the application because we got no specific errors, even when looking at the tracing infrastructure. To help the DEV team investigate the issue, we decided to create a special extended event session that monitors both all activity scoped to the concerned application and the transactions that remain open during the tracing timeframe. Events can then be easily correlated thanks to the causality tracking capabilities of extended events.

So, the extended event session used two targets: the event file and a pair matching target (package0.pair_matching in the definition below, labeled histogram in the outputs). The first one was intended to write workload activity into a file on disk, and the second one aimed at quickly identifying open transactions.

CREATE EVENT SESSION [OrphanedTransactionHunter] ON SERVER
ADD EVENT sqlserver.database_transaction_begin(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),
ADD EVENT sqlserver.database_transaction_end(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),
ADD EVENT sqlserver.error_reported(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),
ADD EVENT sqlserver.module_end(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),
ADD EVENT sqlserver.module_start(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),
ADD EVENT sqlserver.rpc_completed(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),
ADD EVENT sqlserver.rpc_starting(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),
ADD EVENT sqlserver.sp_statement_completed(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),
ADD EVENT sqlserver.sp_statement_starting(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack)),
ADD EVENT sqlserver.sql_statement_completed(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.tsql_stack)),
ADD EVENT sqlserver.sql_statement_starting(
ACTION(sqlserver.database_id,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack))
ADD TARGET package0.event_file(SET filename=N'OrphanedTransactionHunter'),
ADD TARGET package0.pair_matching(SET begin_event=N'sqlserver.database_transaction_begin',begin_matching_actions=N'sqlserver.session_id',end_event=N'sqlserver.database_transaction_end',end_matching_actions=N'sqlserver.session_id',respond_to_memory_pressure=(1))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=5 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)
GO

The outputs were as follows:

> Target histogram (with opened transactions)

The only transaction that remained open during our test concerned session_id = 873 (just beware of other noisy, irrelevant records captured at the moment you start the XE session). The attach_activity_id, attach_activity_id_xfer and session column values were helpful here to correlate with the events recorded to the event file target.
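
Conceptually, the second target pairs each database_transaction_begin event with a database_transaction_end event on session_id and keeps only the begin events left without a partner. Here is a simplified sketch of that logic, not the engine's actual implementation; the function name and the sample events are hypothetical, except session 873, which is the one reported above.

```python
def unmatched_begins(events):
    """Return begin events that never got a matching end event,
    pairing on session_id as configured in the session definition."""
    pending = {}
    for ev in events:
        key = ev["session_id"]
        if ev["name"] == "database_transaction_begin":
            pending.setdefault(key, []).append(ev)
        elif ev["name"] == "database_transaction_end" and pending.get(key):
            pending[key].pop()  # one end event closes one pending begin event
    return [ev for stack in pending.values() for ev in stack]


events = [
    {"name": "database_transaction_begin", "session_id": 60},
    {"name": "database_transaction_begin", "session_id": 873},
    {"name": "database_transaction_end", "session_id": 60},
]
print([ev["session_id"] for ev in unmatched_begins(events)])  # [873]
```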

> Event File (Workload activity)

Here are the events after applying a filter with the above values.

We noticed the transaction with session_id = 873 was started but never ended. In addition, we were able to identify the sequence of code executed by the application (mainly based on prepared statements and stored procedures in our context). This information helped the DEV team focus on the right portion of code to fix. Without getting into details, it was very interesting to see that the root cause was a SQL statement raising a duplicate key error that was neither thrown nor handled correctly by the application. I was just surprised the application didn’t catch any error in such a case. We finally understood that prepared statements and stored procedure calls were done through the DbUtils class, including the closeQuietly() method to close connections. According to the Apache documentation, closeQuietly() is designed to hide SQL exceptions when they happen, which definitely does not help to identify the issue easily from the application side. Never mind, thanks to the collaboration with the DEV team we managed to get rid of this issue.

David Barbarin