Tech ONTAP Blogs

Trident Controller Parallelism

cknight
NetApp

Trident 1.0 shipped in December 2016, and its growth since then has mirrored that of the container ecosystem, now largely dominated by Kubernetes.  Relentless growth in customer deployments and provisioned petabytes has yielded production environments with tens of thousands of volumes and a high arrival rate of provisioning requests.  Newer technologies such as Kubernetes-based virtualization add to the load, as golden images must be rapidly cloned in large numbers to produce running virtual machines in seconds.  Performance at scale was less of a concern years ago, but meeting these expectations has made it far more critical today.

 

The Controller Lock

The heart of Trident is its controller, a Kubernetes Deployment that handles all CSI requests, manages all of Trident's custom resources, and directs all interaction with storage system APIs.  The controller employs a layered architecture, with frontends that receive incoming work (CSI, Kubernetes, REST, Docker, Prometheus), backends for managing storage systems (ONTAP, FSxN, ANF, GCNV), and a "core" layer in the middle that routes requests appropriately.  Since its inception, the Trident controller has done only one thing at a time.  The single-threaded nature of Trident is enforced by a mutual exclusion lock (mutex) in the core layer that must be held for almost any workflow to proceed.  The mutex does a few things:

  • Serializes all workflows, thereby preventing API overload to any storage system
  • Protects the core cache from concurrent access contention
  • Prevents all manner of concurrency problems in and below the core layer
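
To make that pattern concrete, here is a minimal Go sketch of a core layer guarded by a single mutex.  It is illustrative only, not Trident's actual code: the core type and the CreateVolume/PublishVolume method names are invented for the example.

    package main

    import "sync"

    // core is a stand-in for Trident's core layer: one mutex guards every
    // workflow and the cached objects behind it.
    type core struct {
        mu sync.Mutex // the controller lock
    }

    func (c *core) CreateVolume(name string) error {
        c.mu.Lock()         // every workflow takes the same lock...
        defer c.mu.Unlock() // ...and holds it until the storage API call returns
        return nil          // the backend driver call would go here
    }

    func (c *core) PublishVolume(volume, node string) error {
        c.mu.Lock()         // attach requests queue behind any in-flight workflow,
        defer c.mu.Unlock() // even one on a completely different backend
        return nil
    }

    func main() {
        c := &core{}
        _ = c.CreateVolume("pvc-123")
        _ = c.PublishVolume("pvc-123", "node-1")
    }

Because every workflow funnels through the same lock, a slow operation on one backend delays unrelated work on every other backend.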

The effect of the controller lock is like a master chef having only one stove burner.  Like the chef, if Trident is busy doing something that takes a while, such as creating a FlexGroup volume or an Azure NetApp Files (ANF) snapshot, then all other operations have to wait.  tridentctl commands appear to hang while waiting, and operations where latency is particularly noticeable, such as attaching volumes to pods, appear to take much longer than they should.  The issue is particularly acute when deploying a large number of pods and PVCs all at once; the pods come up slowly because PV attachment calls must compete for the same lock as PV creation requests.

 

[Image: Chef1.png]

 

Early in Trident's history, the controller lock was of little concern, for a couple of reasons.

  1. The first backend drivers were ONTAP NAS/SAN and SolidFire, for which the storage systems provide fast provisioning operations.  Creating an ONTAP volume or snapshot might take a second or two, so no Trident workflow added much latency, and Trident remained responsive to arriving requests.
  2. Most early deployments were of a small-to-medium scale that Trident could easily manage.

In more recent years, Trident has added drivers for cloud-based storage exhibiting significantly longer provisioning times, and customers have deployed Trident with ever greater scale requirements.  The Trident team responded by deploying multiple strategies to keep up, but we can only do so much with the controller lock in place.

 

Concurrency vs. Parallelism

The ultimate goal is to improve Trident's performance and responsiveness.  But a quick measurement shows that Trident uses very little CPU, so the Trident code isn't adding much latency through its own execution.  Instead we observe that the Trident controller spends nearly all of its time waiting on other things, most notably storage system APIs.  So it seems obvious that Trident should do more than one thing at once, just as our master chef might have multiple pans on the stove while mostly waiting for them.  As we all know, concurrency is not parallelism.  But is our solution merely one of concurrency, where all of the parallelism takes place on the storage systems and Trident is merely a single-threaded smart scheduler (like the chef), or is true parallelism needed in Trident itself?

 

[Image: Chef2.png]

 

The answer lies in the implementation of CSI and the sidecars we ship with Trident.  Each of the four controller sidecars (provisioner, attacher, resizer, snapshotter) is its own controller, and each maintains its own pool of worker threads.  Each defaults to having 10 workers (100 for the provisioner!), and Trident has not needed to configure them differently.  So in the unlikely worst case, the sidecars could bombard Trident with 130 simultaneous CSI gRPC calls.  In theory, Trident could serialize those onto a work queue and handle them in parallel as resources become available.  But CSI calls are synchronous, so funneling them onto a single-threaded queue adds complexity with no benefit.  The cache-contention problem is the same either way, and since the requests already arrive on separate threads of execution, we might as well handle them as such.  So the answer seems to be true parallelism, not to gain performance from additional CPU, but to more closely mirror the nature of the multi-threaded calls into Trident from its CSI, Kubernetes, and REST frontends.
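
As a rough illustration of why this pays off even though Trident itself uses little CPU, the following self-contained Go sketch compares funneling waiting-bound requests through a single lock against letting each request wait on its own goroutine.  The 200 ms "storage call" is an invented placeholder, not a measured number.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // slowStorageCall stands in for a provisioning call that spends its time
    // waiting on a storage system API rather than burning CPU.
    func slowStorageCall(id int) {
        _ = id
        time.Sleep(200 * time.Millisecond)
    }

    func main() {
        const requests = 10
        var mu sync.Mutex
        var wg sync.WaitGroup

        // Serialized: every request takes the same lock, as with the controller lock.
        start := time.Now()
        for i := 0; i < requests; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                mu.Lock()
                defer mu.Unlock()
                slowStorageCall(i)
            }(i)
        }
        wg.Wait()
        fmt.Println("serialized:", time.Since(start)) // roughly requests x 200ms

        // Concurrent: each request waits on the storage system independently.
        start = time.Now()
        for i := 0; i < requests; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                slowStorageCall(i)
            }(i)
        }
        wg.Wait()
        fmt.Println("concurrent:", time.Since(start)) // roughly 200ms total
    }

The total CPU consumed is the same in both cases; only the overlap of the waiting changes, which is exactly the opportunity the concurrent controller exploits.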

 

Hierarchical Read-Write Locks

It is obvious that there are many unsafe concurrent scenarios that we need to prevent, such as:

  • Resizing a volume while deleting or snapshotting it
  • Snapshotting a volume before it is fully created
  • Reverting a snapshot to a volume that is being resized or deleted
  • Updating a backend while creating volumes on it
  • Attaching a volume that is being deleted or modified

But there are many more concurrent scenarios that are safe and should be allowed:

  • Creating two volumes on different backends
  • Creating two volumes on the same backend
  • Creating one volume while deleting another
  • Creating one volume while performing snapshot operations on another
  • Creating or updating a backend while performing operations on any other backend
  • Attaching a volume while performing other operations on any other volume or snapshot
  • Listing objects of any type while operating on objects of that type
  • Pause/drain all activity

It should be apparent that some type of locking is needed to prevent the unsafe scenarios.  It is also clear that no combination of simple mutexes will allow all of the safe scenarios while still blocking the unsafe ones.  After much consideration, a couple of key insights offered a solution:

  1. Our cached objects are partially hierarchical, in that a backend owns a set of volumes, each of which may own a set of snapshots.  So locking a backend should not interfere with any operations on a different backend.
  2. Instead of using a mutex per object, we can use a read-write mutex per object.  A read-write lock allows concurrent access for read-only operations, whereas write operations require exclusive access.  Applied to Trident's cached objects, a write lock on an object allows modification or deletion of that object while preventing locking operations on any of that object's children.  Holding a read lock on an object is both necessary and sufficient to obtain a read or write lock on any of its immediate children.  So, for example, we can grab a write lock on a volume while resizing it, which prevents concurrently creating a snapshot of the same volume (since that would require a read lock on the volume).  If the resize operation wins the race, the snapshot workflow has to wait, and vice versa, as sketched below.
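
Here is a minimal Go sketch of that hierarchical read-write locking scheme.  It is illustrative only, not Trident's implementation: the node type and lockForWrite helper are invented for the example, and a real implementation also needs read-lock and list variants.

    package main

    import "sync"

    // node is one cached object (backend, volume, or snapshot), each with its
    // own read-write mutex and a pointer to its parent in the hierarchy.
    type node struct {
        mu     sync.RWMutex
        parent *node
    }

    // lockForWrite read-locks every ancestor from the root down, then takes an
    // exclusive lock on the target.  Acquiring locks in the same order
    // everywhere is what keeps concurrent workflows from deadlocking.
    func lockForWrite(n *node) (unlock func()) {
        var chain []*node
        for p := n.parent; p != nil; p = p.parent {
            chain = append(chain, p)
        }
        var release []func()
        for i := len(chain) - 1; i >= 0; i-- { // root first
            chain[i].mu.RLock()
            release = append(release, chain[i].mu.RUnlock)
        }
        n.mu.Lock()
        release = append(release, n.mu.Unlock)
        return func() { // release in reverse order of acquisition
            for i := len(release) - 1; i >= 0; i-- {
                release[i]()
            }
        }
    }

    func main() {
        backend := &node{}
        volume := &node{parent: backend}
        snapshot := &node{parent: volume}

        // Resize: read lock on the backend, write lock on the volume.
        unlockResize := lockForWrite(volume)
        // ... storage API call to resize the volume ...
        unlockResize()

        // Snapshot create: read locks on the backend and volume, write lock on
        // the snapshot.  If a resize already held the volume's write lock, this
        // would wait (and vice versa); volumes on other backends are unaffected.
        unlockSnap := lockForWrite(snapshot)
        // ... storage API call to create the snapshot ...
        unlockSnap()
    }

With per-object locks like these, creating volumes on two different backends proceeds fully in parallel, while a resize and a snapshot of the same volume still serialize, matching the safe and unsafe scenarios listed above.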

 

Early Results

To isolate the controller improvement without including node operations such as iSCSI scanning and LUN formatting, Hendrik Land measured the time taken to create 400 and 1200 bound iSCSI PVC/PV pairs.  Using ONTAP's older ZAPI interface showed a 3x-6x improvement, while REST was 9x-19x faster.  Note that REST is slower than ZAPI at creating a single volume because it is asynchronous, requiring Trident to poll for job completion; but parallel operation keeps ONTAP's API queue full, so REST actually runs faster than ZAPI, with average PV creation times as low as 0.14 seconds.

 

[Image: iscsiResults.png]

 

The Road Ahead

Trident 25.06 contains the first installment of controller parallelism to enable much higher performance.  It ships as a Tech Preview feature, not suitable for production deployments.  An optional installation argument ("--enable-concurrency" for tridentctl install, "enableConcurrency: true" in the operator's TridentOrchestrator (Torc) CR, "--set enableConcurrency=true" for the Helm chart) activates the new concurrent core, which is temporarily hard-limited to ontap-san backends with iSCSI or FCP.  Some features, such as Prometheus metrics, replication, and group snapshots, are not yet available in the concurrent core.  Within those limitations, there should be no issue switching between concurrent and single-threaded operation.  Over the coming releases, we anticipate adding the missing features and backends while the concurrent mode is hardened with long-term stress testing.  Early benchmarking results are exciting, and we have more ideas for making Trident run as fast as possible without overwhelming the storage system or Kubernetes APIs.  Please give it a try and let us know your experience!
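
For example, the three installation paths would be enabled roughly as follows.  Only the enable-concurrency/enableConcurrency settings come from this release; the namespace, Helm repo, and release names shown here are illustrative, so check the Trident documentation for your environment.

    # tridentctl-based installation
    tridentctl install -n trident --enable-concurrency

    # Helm-based installation
    helm install trident netapp-trident/trident-operator -n trident \
      --set enableConcurrency=true

    # TridentOrchestrator (Torc) custom resource, when using the operator directly
    spec:
      enableConcurrency: true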
