Tech ONTAP Blogs

KV Cache Offloading - When is it Beneficial?

moglesby
NetApp

Today, we continue our series on LLM inference with a deeper exploration of KV cache offloading. This post builds on my previous post, in which I showed you how to offload your vLLM deployment's KV cache to a NetApp ONTAP storage system using NVIDIA GPUDirect Storage (GDS). That post covered the how; in this post, I'll explain the why and the when. I'll cut through the hype and outline the scenarios in which enabling offloading is beneficial, as well as the scenarios in which it can actually be a drag on inference performance.

 

Technical Concepts

 

Let's begin with a quick refresher on the key technical concepts. First off, what is the "KV cache" that I am referring to here? Simply put, KV (key-value) caching is a technique for optimizing LLM inference: the attention key and value tensors computed for previous tokens are stored in a KV cache so that they don't have to be recalculated for every new token that is generated. This article from HuggingFace provides a good overview.
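
To make that concrete, here is a toy, single-head sketch of the idea in plain NumPy (purely illustrative; this is not code from any real inference engine): each decode step computes the key and value vectors for the newest token only, appends them to the cache, and attends over everything already cached.

import numpy as np

d_model = 64                                  # toy hidden size
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

k_cache, v_cache = [], []                     # the "KV cache": one K and one V vector per past token

def decode_step(hidden):
    # Compute K and V for the newest token only and append them to the cache;
    # everything already in the cache is reused rather than recalculated.
    k_cache.append(hidden @ W_k)
    v_cache.append(hidden @ W_v)
    q = hidden @ W_q
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                        # attention output for the newest token

for _ in range(5):                            # each generated token adds exactly one K and one V entry
    decode_step(np.random.randn(d_model))
print(len(k_cache), "cached K/V entries")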

 

So...what is "KV cache offloading?" The standard behavior for most inference engines is to store their KV cache in GPU memory. This introduces a challenge. The size of the KV cache scales linearly with batch size (the number of prompts being processed simultaneously) and sequence length (the length of each individual generation stream). As model context windows grow ever larger, multi-turn chats become more common, and inference platforms are utilized by more and more users (and agents), the size of the KV cache can quickly outpace the amount of available GPU memory. In certain scenarios, there is a high probability that useful KV cache entries will be evicted from GPU memory. To address this challenge, there has been considerable interest in enabling the offloading of KV cache entries to CPU memory and/or storage so that they can be kept indefinitely and referenced again in the future. 
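
To get a feel for how quickly this adds up, here is a rough back-of-the-envelope sizing sketch. The model dimensions below are illustrative values in the ballpark of a large grouped-query attention model, not measurements from any specific deployment.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    # Per token, every layer stores one key and one value vector per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * batch_size   # scales linearly with sequence length and batch size

# Illustrative numbers (80 layers, 8 KV heads, head_dim 128, FP16),
# serving 32 concurrent sequences of 32K tokens each:
total = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=32_000, batch_size=32)
print(f"~{total / 2**30:.0f} GiB of KV cache")  # roughly 312 GiB -- far beyond a single GPU's memory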

 


 

When is Enabling Offloading Beneficial?

 

Now that you have a good baseline understanding of the concepts, let's explore some scenarios. Enabling KV cache offloading can greatly enhance your inference performance, but only under certain usage patterns. Under other usage patterns, it can actually reduce your performance.

 

Many Unique One-Shot Prompts

 

First, let's explore the scenario in which an inference server typically receives unique prompts and rarely serves multi-turn chats. In other words, it mostly receives what are referred to as "one-shot" prompts. This scenario is representative of many general-purpose chatbot deployments.

 

In this scenario, enabling KV cache offloading will not be beneficial. Virtually all inference server deployments have enough space in GPU memory to hold the KV cache entries for at least one full generation sequence, so the entries will remain in GPU memory during processing of a single prompt. If offloading is enabled, the entries will be offloaded after the generation sequence is complete. However, since every prompt is a unique one-shot prompt, these previously-calculated KV cache values won't ever be useful. This means that offloading is nothing more than an additional and unnecessary I/O step, which will reduce your overall inference throughput.

 

Many Prompts with Long Shared Prefixes

 

Next, let's dive into the scenario in which an inference server typically receives many prompts that share the exact same prefix, and this prefix represents a large portion of the prompt. This scenario covers both one-shot prompts and multi-turn chats, as long as each sequence contains the same long prefix. It is representative of many departmental or team-specific agentic coding and RAG deployments.

 

In this scenario, enabling KV cache offloading will not be beneficial. This may seem counterintuitive, since retaining the KV cache entries for the long shared prefix clearly is valuable. It is, but in this scenario those entries will almost always already be resident in GPU memory. As I previously mentioned, virtually all inference server deployments have enough space in GPU memory to hold the KV cache entries for at least one full generation sequence. Since most prompts in this scenario contain the exact same long prefix, those entries will rarely, if ever, be evicted from GPU memory. This means that, once again, offloading is nothing more than an additional and unnecessary I/O step, which will reduce your overall inference throughput.
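
The GPU-resident reuse I'm describing here is what a serving engine like vLLM provides out of the box via automatic prefix caching. Here is a rough sketch of what that looks like; the flag and its default behavior vary by vLLM version, and the model name and prompts are purely illustrative.

from vllm import LLM, SamplingParams

# enable_prefix_caching is shown explicitly here; recent vLLM versions enable it by default
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_prefix = "<one long shared context, e.g. a system prompt plus reference documents>"
questions = ["Summarize the key points.", "List any important dates."]

# Because every prompt starts with the same prefix, its KV blocks stay hot in GPU memory
# and are reused directly; offloading them elsewhere would add I/O without adding reuse.
outputs = llm.generate([shared_prefix + "\n\n" + q for q in questions],
                       SamplingParams(max_tokens=256))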

 

Many Long Prompts with Multiple Shared Prefixes

 

Lastly, let's dive into the scenario in which an inference server receives many prompts that share common long prefixes, but not every prompt uses the same prefix. Just as with the previous scenario, this one covers both one-shot prompts and multi-turn chats, as long as the individual sequences draw on a common set of long prefixes. Some examples of deployments that fit this scenario are agentic coding applications that service multiple codebases and RAG applications that serve multiple departments.

 

In this scenario, enabling KV cache offloading can be extremely beneficial. Offloading helps here, when it did not in the previous scenario, because there are multiple shared prefixes. If the number of common prefixes is high enough, the cache entries for these prefixes will routinely be evicted from GPU memory. When a new prompt arrives that contains a common prefix whose cache entries were evicted, the serving engine can simply retrieve them from the offload target instead of having to recalculate them.
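
As a reminder of what such a deployment roughly looks like (the previous post covers the full ONTAP/GDS setup in detail), here is a sketch of wiring vLLM to LMCache for offloading. Treat it as an assumption-laden illustration: the exact constructor shape and connector name depend on your vLLM and LMCache versions, and the configuration file path is hypothetical.

import os
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Hypothetical path; this YAML holds the LMCache backend settings (ONTAP/GDS) from the previous post.
os.environ["LMCACHE_CONFIG_FILE"] = "/etc/lmcache/ontap-gds.yaml"

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",      # illustrative model choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",         # LMCache's vLLM connector
        kv_role="kv_both",                         # both save and load KV cache entries
    ),
)

# Prompts drawn from several different long prefixes: when a prefix's cache entries have been
# evicted from GPU memory, they can be reloaded from the offload target instead of recomputed.
outputs = llm.generate(["<prefix A> ... unique question 1", "<prefix B> ... unique question 2"],
                       SamplingParams(max_tokens=128))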

 

To quantify the benefit of enabling offloading, we simulated this scenario. We generated a large number of 30K-token prompts, each containing a 28K-token prefix drawn from a set of 20 common prefixes. We then sent large batches of prompts to the server simultaneously, iterating across multiple batches, to simulate a high-traffic deployment. We first ran this workload against a standard vLLM deployment with no offloading. Then we killed our vLLM instance, deployed a new instance with offloading to ONTAP using LMCache's GDS backend, and re-ran the exact same workload. Enabling offloading yielded an impressive 61% improvement in total token throughput, a 34% reduction in mean TTFT (time to first token), and a 43% reduction in mean ITL (inter-token latency).
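
To make the shape of that workload concrete, here is a simplified sketch of how prompts like these could be generated. It is illustrative only; the placeholder "tokens" below stand in for real tokenized text, and this is not our actual benchmark harness.

import random

NUM_PREFIXES = 20
PREFIX_TOKENS = 28_000     # shared portion of each prompt
SUFFIX_TOKENS = 2_000      # unique portion, bringing each prompt to ~30K "tokens"

def make_tokens(n, tag):
    return " ".join(f"{tag}{i}" for i in range(n))

prefixes = [make_tokens(PREFIX_TOKENS, f"p{p}_") for p in range(NUM_PREFIXES)]

def make_prompt(i):
    # Each prompt reuses one of the 20 long common prefixes and adds a unique tail.
    return random.choice(prefixes) + " " + make_tokens(SUFFIX_TOKENS, f"u{i}_")

batch = [make_prompt(i) for i in range(64)]    # one of many batches sent to the server concurrently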

 

Conclusion

 

KV cache offloading is a powerful technique. However, it is not a magic bullet. It is important to know when it will be beneficial and when it won't be. For scenarios in which it is beneficial, it can deliver extremely impressive performance gains, as we demonstrated above. Best of all, NetApp supports KV cache offloading using industry-standard protocols and tooling. No custom proprietary clients are required. You can continue to use the same NFS protocol that you have used for decades.
