Introducing SQL Server with Portworx and storage orchestration

Stateful applications like databases need special considerations on K8s world. This is because data persistence is important and we need also something at the storage layer communicating with the container orchestrator to take advantage of its scheduling capabilities. For Stateful applications, StatefulSet may be only part of the solution because it primary focuses on the Pod availability and we have to rely on the application capabilities for data replication stuff. But StatefulSet doesn’t address that of the underlying storage at all. At the moment of this write-up, StatefulSet-based solutions for SQL Server such availability groups are not supported yet on production.

So, with Stateful applications we may consider other solutions like GlusterFS or NFS as distributed storage spanning all the nodes of the K8s cluster, but they often don’t meet the requirements of a database workload running in production with high throughput and IOPS requirement and data migration.

Products exist in the market and seem to address these specific requirements and I was very curious to get a better picture of their capabilities. During my investigation, I went through a very interesting one named Portworx for a potential customer’s project. The interesting part of Portworx consists of a container-native, orchestration-aware storage fabric including the storage operation and administration inside K8s. It aggregates underlying storage and exposes it as a software-defined, programmable block device.

From a high-level perspective, Portworx is using a custom scheduler – STORK (STorage Orchestration Runtime for Kubernetes) to assist K8s in placing a Pod in the same node where the associated PVC resides. It reduces drastically some complex stuff around annotations and labeling to perfmon some affinity rules.

In this blog post, I will focus only on the high-availability topic which is addressed by Portworx with volume’s content synchronization between K8s nodes and aggregated disks. Therefore Portworx requires to define the redundancy of the dataset between replicas through a replication factor value by the way.

I cannot expose my customer’s architecture here but let’s try top apply the concept to my lab environment.

151 - 0 - 0 - K Lab architecture

As shown above, my lab environment includes 4 k8s nodes with 3 nodes that will act as worker. Each worker node owns its local storage based on SSD disks (One for the SQL Server data files and the another one will handle Portworx metadata activity – Journal disk). After deploying Portworx on my K8s cluster here a big picture of my configuration:

$ kubectl get daemonset -n kube-system | egrep "(stork|portworx|px)"
portworx       3         3         3       3            3        
portworx-api   3         3         3       3            3

Portworx is a DaemonSet-based installation. Each Portworx node will discover the availability storage to create a container-native block storage device with:
– /dev/sdb for my SQL Server data
– /dev/sdc for hosting my journal

$ kubectl get pod -n kube-system | egrep "(stork|portworx|px)"

portworx-555wf                                  1/1     Running            18         2d23h
portworx-api-2pv6s                              1/1     Running            8          2d23h
portworx-api-s8zzr                              1/1     Running            8          2d23h
portworx-api-vnqh2                              1/1     Running            4          2d23h
portworx-pjxl8                                  1/1     Running            17         2d23h
portworx-wrcdf                                  1/1     Running            389        2d10h
px-lighthouse-55db75b59c-qd2nc                  3/3     Running            0          35h
stork-5d568485bb-ghlt9                          1/1     Running            0          35h
stork-5d568485bb-h2sqm                          1/1     Running            13         2d23h
stork-5d568485bb-xxd4b                          1/1     Running            1          2d4h
stork-scheduler-56574cdbb5-7td6v                1/1     Running            0          35h
stork-scheduler-56574cdbb5-skw5f                1/1     Running            4          2d4h
stork-scheduler-56574cdbb5-v5slj                1/1     Running            9          2d23h

The above picture shows different stork pods that may influence scheduling based on the location of volumes that a pod requires. In addition, the PX cluster (part of Portworx Enterprise Platform) includes all the Portworx pods and allows getting to monitor and performance insights of each related pod (SQL Server instance here).

Let’s have a look at the global configuration by using the pxctl command (first section):

$ PX_POD=$(kubectl get pods -l name=portworx -n kube-system -o jsonpath='{.items[0]}')
$ kubectl exec $PX_POD -n kube-system -- /opt/pwx/bin/pxctl status
Status: PX is operational
License: Trial (expires in 28 days)
Node ID: 590d7afd-9d30-4624-8082-5f9cb18ecbfd
        Local Storage Pool: 1 pool
        0       HIGH            raid0           20 GiB  8.5 GiB Online  default default
        Local Storage Devices: 1 device
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_MAGNETIC 20 GiB          08 Dec 19 21:59 UTC
        total                   -                       20 GiB
        Cache Devices:
        No cache devices
        Journal Device:
        1       /dev/sdc1       STORAGE_MEDIUM_MAGNETIC

Portworx has created a pool composed of my 3 replicas / Kubernetes nodes with a 20GB SSD each. I just used a default configuration without specifying any zone or region stuff for fault tolerance capabilities. This is not my focus at this moment. According to Portworx’s performance tuning documentation, I configured a journal device to improve I/O performance by offloading PX metadata writes to a separate storage.

Second section:

Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode      Used    Capacity        Status  StorageStatus   Version         Kernel                  OS    b0ac4fa3-29c2-40a8-9033-1d0558ec31fd    k8n2.dbi-services.test  Yes     3.1 GiB  20 GiB          Online  Up     3.10.0-1062.1.2.el7.x86_64      CentOS Linux 7 (Core)   9fc5bc45-5602-4926-ab38-c74f0a8a8b2c    k8n1.dbi-services.test  Yes     8.6 GiB  20 GiB          Online  Up     3.10.0-1062.1.2.el7.x86_64      CentOS Linux 7 (Core)   590d7afd-9d30-4624-8082-5f9cb18ecbfd    k8n3.dbi-services.test  Yes     8.5 GiB  20 GiB          Online  Up (This node) 3.10.0-1062.1.2.el7.x86_64      CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  20 GiB
        Total Capacity  :  60 GiB

All my nodes are up for a total storage of 60 GiB. Let’s deploy a Portworx Storage Class with the following specification:

kind: StorageClass
  name: portworx-sc
  repl: "3"
  nodes: "b0ac4fa3-29c2-40a8-9033-1d0558ec31fd,9fc5bc45-5602-4926-ab38-c74f0a8a8b2c,590d7afd-9d30-4624-8082-5f9cb18ecbfd"
  label: "name=mssqlvol"
  fs: "xfs"
  io_profile: "db"
  priority_io: "high"
  journal: "true"
allowVolumeExpansion: true

The important parameters are:

repl: « 3 » => Number of replicas (K8s nodes) where data will be replicated

nodes: « b0ac4fa3-29c2-40a8-9033-1d0558ec31fd,9fc5bc45-5602-4926-ab38-c74f0a8a8b2c,590d7afd-9d30-4624-8082-5f9cb18ecbfd » => Number of replicas used for data replication. Replicas are identified by their ID. Each write is synchronously replicated to a quorum set of nodes whereas read throughput is aggregated, where multiple nodes can service one read request in parallel streams.

fs: « xfs » => I used a Linux FS supported by SQL Server on Linux

io_profile: « db » => By default, Portworx is able to use some profiles according to the access pattern. Here I just forced it to use db profile that implements a write-back flush coalescing algorithm.

priority_io: « high » => I deliberately configured the IO priority value to high for my pool in order to favor maximum throughput and low latency transactional workloads. I used SSD storage accordingly.

journal: « true » => The volumes used by this storage class will use the journal dedicated device

allowVolumeExpansion: true => is an interesting parameter to allow online expansion of the concerned volume(s). As an aside, it is worth noting that volume expansion capabilities is pretty new (> v1.11+) on K8s word for the following in-tree volume plugins: AWS-EBS, GCE-PD, Azure Disk, Azure File, Glusterfs, Cinder, Portworx, and Ceph RBD

Then, let’s use Dynamic Provisioning with the following PVC specification:

kind: PersistentVolumeClaim
apiVersion: v1
  name: pvcsc001
  annotations: portworx-sc
    - ReadWriteOnce
      storage: 20Gi

Usual specification for a PVC … I just claimed 20Gi of storage based on my portworx storage class. After deploying both the Storage Class and PVC here the new picture of my configuration:

$ kubectl get sc
NAME                     PROVISIONER                     AGE
portworx-sc       3d14h
stork-snapshot-sc        stork-snapshot                  3d23h

$ kubectl get pvc
NAME       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvcsc001   Bound    pvc-98d12db5-17ff-11ea-9d3a-00155dc4b604   20Gi       RWO            portworx-sc    3d13h

Note that there is also a special storage class implemention for snapshot capabilities, we will talk about this capability in next write-up. My PVC pvcs001 is ready to be used by my Stateful application. Now it’s time to deploy a Stateful application with my SQL Server pod and the specification below. Let’s say that Portworx volumes are usable for non-root execution containers when specifying fsGroup parameter (securityContext section). So,this is a good fit with the non-root execution capabilities shipped with SQL Server pod :) You will also notice there is no special labeling or affinity stuff between my pod and the PVC. I just defined the volume mount, the corresponding PVC and that’s it!

apiVersion: apps/v1beta1
kind: Deployment
  name: mssql-deployment
  replicas: 1
        app: mssql
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
      terminationGracePeriodSeconds: 10
      - name: mssql
        - containerPort: 1433
        - name: MSSQL_PID
          value: "Developer"
        - name: ACCEPT_EULA
          value: "Y"
        - name: MSSQL_SA_PASSWORD
              name: sql-secrets
              key: sapassword
        - name: mssqldb
          mountPath: /var/opt/mssql
            cpu: "3500m"
            cpu: "2000m"
      - name: mssqldb
          claimName: pvcsc001

apiVersion: v1
kind: Service
  name: mssql-deployment
    app: mssql
    - protocol: TCP
      port: 1470
      targetPort: 1433
  type: LoadBalancer

Let’s take a look at the deployment status:

$ kubectl get deployment,pod,svc
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/mssql-deployment   1/1     1            1           3d7h

NAME                                   READY   STATUS    RESTARTS   AGE
pod/mssql-deployment-67fdd4759-vtzmz   1/1     Running   0          45m

NAME                       TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
service/kubernetes         ClusterIP                 443/TCP          4d
service/mssql-deployment   LoadBalancer   1470:32374/TCP   3d7h

We’re now ready to test the HA capabilities of Portworx! Let’s see how STORK influences the scheduling to get my SQL Server pod on the same node where my PVC resides. The pxctl command provides different options to get information about the PX cluster and volumes as well as configuration and management capabilities. Here a picture of my volumes:

$ kubectl exec $PX_POD -n kube-system -- /opt/pwx/bin/pxctl volume list
ID                      NAME                                            SIZE    HA      SHARED  ENCRYPTED        IO_PRIORITY     STATUS                          SNAP-ENABLED
675137742462835449      pvc-98d12db5-17ff-11ea-9d3a-00155dc4b604        20 GiB  2       no      no               HIGH            up - attached on  no
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE                     NOMINATED NODE   READINESS GATES
mssql-deployment-67fdd4759-vtzmz   1/1     Running   0          48m   k8n1.dbi-services.test

My SQL Server pod and my Portworx storage sit together on the K8n1.dbi-services.test node. The PX web console is also available and provides the same kind of information as pxctl command does.

151 - 1 - PX web console volume

Let’s now simulate the K8n1.dbi-services.test node failure. In this scenario both my PVC and my SQL Server pod are going to move to the next available – K8n2 ( This is where STORK comes into play to stick my pod with my PVC location.

151 - 2 - PX web console volume after failover

$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP               NODE                     NOMINATED NODE   READINESS GATES
mssql-deployment-67fdd4759-rbxcb   1/1     Running   0          31m   k8n2.dbi-services.test

Another important point, my SQL Server data survived to my pod restart and remained available through my SQL Server instance as expected !! It was a short introduction to Portworx capabilities here and I will continue to share about it in a near future!

See you !

David Barbarin

Laisser un commentaire