Tech ONTAP Blogs
Tech ONTAP Blogs
In two recent posts by my colleague Mike Oglesby, he walked through how to implement KV cache offloading with vLLM and LMCache, and how to extend that design with ONTAP S3 for shared storage across multiple serving instances. If you have not read them yet, start here:
Those posts showed how to configure the stack. They covered the commands, connectors, routers, and benchmark results. I needed to see the big picture... it left me with a question: what is happening inside these cache layers, and why should we care?
At a high level, KV cache is not a performance trick. It is a memory strategy. Every time an LLM processes a prompt, it builds internal key-value tensors for each token. Those tensors live in GPU memory. As prompts grow longer and as more users hit the system at once, that memory footprint grows linearly. Eventually, something gives. Either you reduce concurrency, shrink context length, or start evicting useful cache entries.
This is where architecture matters. vLLM manages KV cache efficiently inside a single node. LMCache turns KV cache into a portable, shareable asset that can move across CPU memory and shared storage. And NetApp ONTAP provides the durable, scalable tier that makes that sharing practical across multiple instances. The "why" is not just speed. The "why" is control over memory, concurrency, and cost in real production environments.
If you are running a single model on a single GPU, vLLM's native KV cache is hard to beat. It keeps the key-value tensors directly in GPU memory and manages them with a paged attention system designed for high throughput and continuous batching. That means you get efficient memory allocation and automatic prefix reuse without bolting on anything extra.
Why do we care about KV cache at all? Because without it, every new token forces the model to recompute attention over the entire prompt. With caching, the model reuses previously computed tensors for the shared prefix. In practice, that means faster time-to-first-token for repeated prompts and much better performance in multi-turn chat. For a minimal example, if your RAG system reuses the same prompt or retrieved context, vLLM will automatically leverage it.
The tangible benefits show up quickly. You can support more concurrent users before hitting GPU memory limits. You can keep longer context windows active. And you do not need to design a complex cache tier just to get basic prefix reuse. For many teams, especially those running a single inference server, that is enough. Keep it local. Keep it fast.
And from an operational standpoint, it is almost boringly simple. Install vLLM, point it at a supported Hugging Face model or one on local disk, and expose the OpenAI-compatible endpoint. No Kubernetes required (but that doesn't preclude you from running it in a cluster). No distributed coordination layer. You stand up an endpoint and serve a high-performance LLM with built-in KV caching. For single-node inference, that simplicity is a feature, not a limitation.
Single-node inference is clean and efficient. But the moment you add a second vLLM instance behind a load balancer, the rules change. Now, each node has its own GPU memory and local KV cache. Without coordination, one instance cannot see what the other has already computed. You end up recomputing the same prefixes across nodes, wasting GPU cycles, and increasing latency.
This is where LMCache changes the design. Instead of treating KV cache as a private, in-process memory structure, LMCache treats it as shared infrastructure. KV blocks can be offloaded to CPU memory and shared storage, and multiple vLLM instances can reference the same cached content. In a multi-node setup, that means the second instance does not need to recompute what the first one already processed. You are no longer scaling GPUs in isolation. You are scaling with a shared memory plane.
The benefit becomes obvious in real workloads. In chat systems, RAG pipelines, or agent workflows, there is almost always a shared structure: system prompts, tool schemas, retrieved passages, and long conversation prefixes.
With LMCache connected to shared storage such as ONTAP S3, both vLLM instances can access the same offloaded KV cache entries. That improves throughput and reduces redundant computation as concurrency grows.
Now let's focus on an important detail: using the LMCache v1 connector for both prefill and decode (kv_role: "kv_both"). Prefill is where the model processes the prompt and builds the initial KV tensors. Decode is where tokens are generated incrementally. If you only optimize prefill, you still risk fragmentation and duplication across workers during generation. Using the connector for both phases creates a unified KV lifecycle. The same connector manages storage, reuse, and transfer across the full request path. That keeps the architecture consistent and avoids subtle mismatches between what was cached during prefill and what is needed during decode.
In multi-node environments, this consistency matters. You want predictable behavior across instances, especially when routing requests round-robin behind a router. A shared, connector-driven KV layer ensures that cache semantics are stable regardless of which node handles the request. The result is not just faster responses. It is better resource utilization, smoother scaling, and fewer surprises when you move from one GPU to many.
For specific implementation and setup details, check out the before mentioned blog post: KV cache offloading – exploring the benefits of shared storage.
Step back for a moment. Why is everyone suddenly obsessed with inference efficiency? Because inference is now the dominant cost in most real-world AI systems. Training made the headlines. Inference pays the power bill.
We are seeing this shift everywhere. NVIDIA's continued push into inference infrastructure, like with Inference Context Memory Storage Platform, and memory-tier innovation, including moves around high-performance AI silicon and data movement, signals that the bottleneck is no longer just raw compute. It is memory, bandwidth, and power efficiency. At the same time, companies like DeepSeek have shown that you can compete not by training the biggest model, but by engineering smarter inference paths. Cheaper tokens win markets.
KV cache architecture sits directly in the middle of this shift. When you reuse KV blocks instead of recomputing them, you reduce GPU cycles. When you offload intelligently to CPU memory or shared storage, you avoid overprovisioning expensive accelerators. When you design for multi-node reuse rather than isolated GPUs, you turn memory into shared infrastructure rather than a per-GPU constraint.
This is also why quantization is gaining so much attention. Smaller models and lower-precision weights reduce memory pressure. KV cache reuse reduces recomputation. Together, they change the economics of deployment. Instead of throwing more GPUs at the problem, you engineer the system. And engineering almost always beats brute force in the long run.
It is easy to reduce KV cache to a simple headline: "it makes inference faster." That is true, but it misses the real point. KV cache is about reducing redundant work. Every reused prefix is GPU compute you did not have to pay for twice. Every offloaded block is memory pressure you did not push back onto expensive accelerators.
And that translates directly into power and cooling savings. GPUs are incredible, but they are also energy-hungry. If you can reduce recomputation, reuse context intelligently, and tier memory across GPU, CPU, and shared storage, you do not need to scale hardware as aggressively. You get more throughput per watt and more value per rack unit. That is not an optimization detail. That is an infrastructure strategy.
To truly grasp how transformative this is, I highly encourage you to read Mike Oglesby's recent technical walkthroughs detailing this setup. He explicitly demonstrates that in a multi-tier configuration (utilizing GPU memory, CPU memory, and an ONTAP S3 bucket) there is virtually no downside to enabling that S3 tier. Because the S3 tier and CPU tier operate synergistically. He outlines the precise implementation commands for vLLM and LMCache, highlighting prerequisites like having an active S3 bucket within your ONTAP cluster and ensuring network access from your GPU server to the S3 data interface. Go read his posts for the granular configuration details.
This holistic approach to resource management is exactly what I will be dissecting at the Open Data Science Conference (ODSC) East in my upcoming session, "Less Compute, More Impact: How Model Quantization Fuels the Next Wave of Agentic AI." Quantization and KV cache reuse are two sides of the same coin. Both are about getting a better return on investment from the hardware you already have. Smaller or quantized models reduce weight memory. Smarter caching reduces attention recomputation. Together, they change the math. I was recently on an episode of the ODSC AI X Podcast titled Smarter Per Watt where we talk about the upcoming energy crunch that is going to have an impact on deploying new GPU platforms in 2026 (and beyond).
In the end, the companies that win will not be the ones that throw the most GPUs at a problem. They will be the ones who design carefully. They will treat memory as a first-class architectural concern. They will engineer their inference stack rather than blindly scale it. And that is the real "why" behind vLLM, LMCache, and shared storage tiers.