Tech ONTAP Blogs
Picture this: Your application is humming along, serving millions of users with ease, scaling up as demand spikes, and never missing a beat. That’s the dream Kubernetes delivers. Kubernetes—or K8s, as the cool kids call it—is the unsung hero orchestrating containerized applications at scale. It’s like the conductor of a symphony, ensuring that every note (or container) plays in harmony. But as Kubernetes has evolved from a geeky experiment to the engine of modern computing, its storage demands have surged. Enter NetApp® Trident™, the Container Storage Interface (CSI) driver that bridges Kubernetes with NetApp ONTAP® storage, so that your apps get the persistent, scalable storage they need to thrive.
But as workloads surged, we hit a snag—a bottleneck that threatened to slow the music down. This blog takes you behind the scenes of how we tackled this challenge for the Node part of the Trident (for the controller: Trident Controller Parallelism), the clever fix we engineered, and the jaw-dropping results that followed, turning Trident into a lean, mean, scalability machine ready for the Kubernetes revolution.
Kubernetes isn’t just growing—it’s booming. The stats tell the story. The Cloud Native Computing Foundation (CNCF) reports that Kubernetes adoption has surged, with 96% of enterprises using the platform in 2024, and 80% deploying it in production environments. A 2025 study from Mordor Intelligence shows that the market for Kubernetes is experiencing robust financial growth, projected to expand from US$2.57 billion in 2025 to US$7.07 billion by 2030, at a compound annual growth rate (CAGR) of 22.4%. This expansion is significantly driven by the increasing demand for managed services and the rapid rise of artificial intelligence and machine learning workloads, which increasingly rely on Kubernetes as their foundational infrastructure.
Why? Because it’s fast, flexible, and lets you scale like a boss.
But here’s the rub: As these workloads pile in, they need dynamic volume provisioning—storage that can scale up fast and flawlessly. Trident was built for this provisioning, but early on, it relied on a single global lock mechanism to manage requests.
Picture this: A busy airport with just one check-in counter for all flights. Passengers (requests) stack up, waiting their turn, even if they’re on different airlines or headed to different destinations. That’s what the Trident Node single global lock was like. Every provisioning request—attaching, mounting, unmounting, formatting—had to queue up, even if they were for unrelated volumes. That kept things orderly when Kubernetes workloads were light, but as the crowds grew, it turned into a scalability challenge. In a cluster with 500 pods, each needing a volume, a 1-second-per-request delay meant 500 seconds (over 8 minutes) of waiting. In today’s world of microsecond SLAs, that’s an eternity.
Serializing everything sacrificed speed for safety. It was like forcing all planes to take off one at a time, no matter how many runways were free. We needed a smarter way to manage the traffic.
Before we charge ahead with this epic Kubernetes storage adventure, let’s hit the brakes for just a second. I know I’m interrupting the flow, but trust me, this little detour will be worth it. The Node component of the CSI driver operates in accordance with a defined specification that governs its behavior, particularly through four primary gRPC calls: NodeStage, NodeUnstage, NodePublish, and NodeUnpublish. (These gRPC calls are made with respect to the node Pods.) These calls are crucial to the management of storage resources when a Pod with a Persistent Volume Claim (PVC) is deployed.
To elaborate, here’s the sequence of operations on the Node side:
1. NodeStage: The volume is attached to the node, formatted with a filesystem if needed, and mounted to a node-global staging path.
2. NodePublish: The staged volume is bind-mounted into the Pod’s target path, making it visible to the application.
3. NodeUnpublish: When the Pod goes away, the Pod-level mount is removed.
4. NodeUnstage: Once no Pod on the node is using the volume, the staging mount is torn down and the volume is detached.
These slick moves are powered by gRPC calls, a fancy way of saying that the CSI driver and Kubernetes Nodes chat back and forth like a well-rehearsed crew, making sure that every step is perfectly timed. When you deploy a Pod with a PVC, it’s like the director yelling “Action!” The CSI driver leaps in, running this choreography to get the volume ready, hooked up, and eventually tidied away. Now that you’ve got the behind-the-scenes magic, let’s jump back into the main plot.
We weren’t about to let a bottleneck stop us. Drawing inspiration from concurrency gurus like Katherine Cox-Buday (Concurrency in Go: Tips and Techniques for Developers) and Rob Pike’s Go concurrency talks, we engineered a two-step solution that’s as elegant as it is practical.
Out went the single lock; in came a per-volume lock keyed on each volume’s unique UUID. Think of it as giving each airline its own check-in counter, with further lanes split by destination. Requests for the same volume wait their turn (to avoid chaos), but requests for different volumes? They race ahead in parallel.
Check out the code:
lockContext := "NodeStageVolume"
defer locks.Unlock(ctx, lockContext, req.GetVolumeId())
if !attemptLock(ctx, lockContext, req.GetVolumeId(), csiNodeLockTimeout) {
    return nil, status.Error(codes.Aborted, "request waited too long for the lock")
}
These locks are placed at the start of the gRPC calls that we just discussed.
But freedom comes with a catch. If the kubelet (Kubernetes’ Node agent) flooded Trident with hundreds of requests at once—like formatting tons of volumes—it could clog the system. Imagine opening every airport gate at once, only to jam the security check-ins. Enter the limiter, our smart traffic cop. It caps how many requests can run simultaneously, tailored to the protocol (iSCSI, NAS) and operation (NodeStage, NodePublish, etc.). Check it out:
if err := p.limiterSharedMap[NodeStageNFSVolume].Wait(ctx); err != nil {
    return nil, err
}
defer p.limiterSharedMap[NodeStageNFSVolume].Release(ctx)
The maximum number of allowable requests is a configurable parameter in the code, but we don't currently expose it for users to modify. If demand for tuning it grows, exposing it would be a simple one- or two-line code change.
Currently, this parallelization has been enabled only for the NAS and SAN (iSCSI) protocols. In the upcoming Trident 25.10 release, it will be extended to support the remaining SAN protocols, FCP and NVMe.
Here are the limits that Trident is currently using:
maxNodeStageNFSVolumeOperations = 10
maxNodeStageSMBVolumeOperations = 10
maxNodeUnstageNFSVolumeOperations = 10
maxNodeUnstageSMBVolumeOperations = 10
maxNodePublishNFSVolumeOperations = 10
maxNodePublishSMBVolumeOperations = 10
maxNodeUnpublishVolumeOperations = 10
maxNodeStageISCSIVolumeOperations = 5
maxNodeUnstageISCSIVolumeOperations = 10
maxNodePublishISCSIVolumeOperations = 10
maxNodeExpandVolumeOperations = 10
So, did our fix work? You bet. Although exact metrics depend on your setup, here’s a taste of what we’ve seen.
These are the configurations under which these tests were recorded:
number_of_k8_nodes = Single
volume_count = 30
luks = Enabled
fs_type = ext4
formatting_options = None
access_mode = ReadWriteOnce
And here are some numbers with the added controller concurrency (for the controller: Trident Controller Parallelism), which was recently introduced as a tech preview in the Trident 25.06 release.
These figures also include the improvements observed from adding a few of the luks arguments to the luksFormat command. If you look closely, you'll notice that adding controller parallelism doesn't deliver the speedup you might expect. That’s because attaching a volume to a Pod hinges more on the Node than on the controller, with tasks like formatting and LUKS encryption accounting for a significant portion of the time.
Here are some numbers excluding LUKS formatting from our calculations, which more clearly highlight the performance gains we achieved purely through parallelism.
To use this feature, you don't need to do anything specific: just install or upgrade Trident, and it's business as usual. GitHub link: https://github.com/NetApp/trident
Trident is not just keeping up—it’s leading the charge.
Kubernetes is here to stay, and NetApp Trident is ready to rock the house. By ditching the single lock and embracing UUID-based locking and the limiter, we’ve built a storage solution that’s fast, scalable, and simple to maintain. Whether you’re running a handful of containers or thousands, Trident has your back.
Running Kubernetes with NetApp ONTAP? Give these upgrades a spin and tell us what you think—your feedback keeps us sharp!
References: