Tech ONTAP Blogs

KV cache offloading - CPU RAM vs. storage

moglesby
NetApp

Today, we continue our exploration of KV cache offloading. If you missed my previous posts on this topic, be sure to check them out here and here. In this post, I will explore the nuances of offloading your KV cache to CPU RAM vs. storage. Using empirical results, I will demonstrate when you might want to go with one over the other.

 

KV cache offloading refresher

 

Let's begin with a quick refresher on the KV cache offloading concept. If you read my previous posts and are already familiar with KV cache offloading, you can skip this section.

 

First off, what is a "KV cache?" Simply put, KV (key-value) caching is a technique used to optimize LLM inference: the key and value tensors computed for previous tokens are stored in a KV cache so that they don't have to be recomputed for every new token that is generated. This article from HuggingFace provides a good overview.

 

So...what is "KV cache offloading?" The standard behavior for most inference engines is to store their KV cache in GPU memory. This introduces a challenge. The size of the KV cache scales linearly with batch size (the number of prompts being processed simultaneously) and sequence length (the length of each individual generation stream). As model context windows grow ever larger, multi-turn chats become more common, and inference platforms are utilized by more and more users (and agents), the size of the KV cache can quickly outpace the amount of available GPU memory. In certain scenarios, there is a high probability that useful KV cache entries will be evicted from GPU memory. To address this challenge, there has been considerable interest in enabling the offloading of KV cache entries to CPU memory and/or storage so that they can be kept indefinitely and referenced again in the future. You can think of these KV cache offload "targets" as functioning as a "swap space" for GPU memory.
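
To make that scaling concrete, here is a quick back-of-the-envelope calculation. The model dimensions below (36 layers, 8 KV heads, a head dimension of 128, FP16 values) are illustrative assumptions for an 8B-class model with grouped-query attention, not measurements from any particular deployment.

# Rough KV cache sizing - illustrative numbers only
LAYERS=36; KV_HEADS=8; HEAD_DIM=128; BYTES_PER_VALUE=2                    # FP16
BYTES_PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE))   # 2x for keys and values
echo "KV cache per token: ${BYTES_PER_TOKEN} bytes"                       # ~144 KiB

# 15 concurrent sequences of 20,000 tokens each
TOTAL_BYTES=$((BYTES_PER_TOKEN * 20000 * 15))
echo "Total KV cache: $((TOTAL_BYTES / 1024 / 1024 / 1024)) GiB"          # ~41 GiB

Double the number of users or the context length and the cache quickly outgrows what a single GPU can hold alongside the model weights, which is exactly the pressure that offloading relieves.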

 

Comparing offload targets

 

In a previous post, I demonstrated how to enable KV cache offloading using two open-source tools, vLLM and LMCache. This combination has become quite popular in the community, and I will continue to focus on these tools in this post.
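
As a quick recap, LMCache plugs into vLLM through vLLM's KV connector interface. The command below is a minimal sketch of that wiring; the model and port are just placeholders, and you should confirm the flag and connector names against the vLLM and LMCache versions you are running.

# Minimal sketch: serve a model with the LMCache connector enabled
vllm serve Qwen/Qwen3-8B \
    --port 8000 \
    --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'

With the connector in place, LMCache's own configuration (environment variables or a config file) determines which backend the offloaded KV cache entries actually land on.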

 

LMCache supports a number of different backends for offloading. These backends generally fall into three categories - CPU RAM, local storage, and external storage. It is a bit more nuanced than this, as several backends span categories, but you get the picture. This raises the question: when might you want to use one over the other? As with most things in the world of AI, it depends on your use case. CPU RAM is generally faster than storage, but it is limited in capacity and can't easily be shared across nodes. Depending on your particular setup, local storage may or may not be faster than external storage, and it also can't easily be shared across nodes. External storage may not be the fastest, but it is trivial to share across a large number of nodes, and technologies like NVIDIA GPUDirect Storage (GDS) are narrowing the performance gap. The table below outlines the pros and cons of the different offload "targets."

 

Attribute | CPU RAM | Local storage | External storage
Speed | Fastest | Slower than RAM; may be faster than external storage | Slower than RAM; may be faster than local storage
Capacity | Lowest | Limited | Extremely high
Shareability | Not easily shareable across nodes | Not easily shareable across nodes | Shareable across a large number of nodes

 

At this point, you may be asking: "why not use both CPU RAM and external storage?" At the time of this writing, using the CPU RAM backend alongside GDS with LMCache is not recommended because, as mentioned in the GDS backend documentation, CPU RAM offloading can interfere with GDS operations. However, the community and ecosystem are working towards supporting such multi-tier deployments, so this is sure to change in the future.

 

Empirical results

 

Benchmark

 

To prove out the conclusions from the previous section, we tested several setups and compared the results. For this testing, we used the LMCache project's multi-round QA benchmark. We performed the same set of benchmark runs for three scenarios - standalone vLLM (no LMCache), vLLM with LMCache's CPU RAM backend (100 GB capacity), and vLLM with LMCache's GDS backend (offloading to a 10 TB ONTAP NFS volume).
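
For reference, here is roughly how the LMCache side of the two offloaded scenarios can be expressed. The environment variable names below are a sketch based on LMCache's documented configuration options (a CPU offload buffer size in GB, and a path for the GDS backend); the NFS mount path is a placeholder, and you should double-check the exact option names against your LMCache version.

# Scenario 2: CPU RAM backend with a 100 GB offload buffer
export LMCACHE_LOCAL_CPU=True
export LMCACHE_MAX_LOCAL_CPU_SIZE=100             # GB

# Scenario 3: GDS backend offloading to an ONTAP NFS volume
export LMCACHE_LOCAL_CPU=False                    # don't mix CPU RAM with GDS (see above)
export LMCACHE_GDS_PATH=/mnt/ontap_nfs/lmcache    # placeholder mount path

The standalone vLLM scenario simply omits the LMCache connector and these settings.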

 

For each run, we held the shared system prompt constant at 1,000 tokens, the maximum number of concurrent users constant at 15, the number of rounds per user constant at 20, and the benchmark runtime constant at 20 minutes. We re-ran the benchmark with different values for the length of the user-specific context. This value represents the length of a user's chat history at the beginning of the first round. For each run, we served the Qwen3-8B model on a single GPU. I've included the exact benchmark command that we used below. You can use it to reproduce our testing in your own environment.

 

# 15 concurrent users, 20 rounds each, 20-minute run; the user-specific
# context length varies between runs via $USER_CONTEXT_LENGTH.
python3 $LMCACHE_REPO_PATH/benchmarks/multi_round_qa/multi-round-qa.py \
    --num-users 15 \
    --num-rounds 20 \
    --qps 1 \
    --shared-system-prompt 1000 \
    --user-history-prompt $USER_CONTEXT_LENGTH \
    --answer-len 100 \
    --time 1200 \
    --model Qwen/Qwen3-8B \
    --output $WORKING_DIR/${SCENARIO_NAME}_${USER_CONTEXT_LENGTH}.csv \
    --base-url http://localhost:8000/v1

 

The benchmark starts with 15 users. Each user submits a long initial prompt (including a long chat history) and then submits shorter follow-up prompts for 20 rounds. When a user finishes their 20th round, a new user is added, so the number of concurrent users is always capped at 15. This pattern continues until the time limit is reached, at which point the benchmark waits for all in-progress requests to be completed and then exits.

 

Results

 

The results of our testing are captured in the following chart. This chart shows the processing speed across multiple initial prompt lengths for each of the three deployment setups.

 

[Chart: processing speed across initial prompt lengths - standalone vLLM vs. vLLM + LMCache (CPU RAM) vs. vLLM + LMCache (GDS)]

 

As you can clearly see from the results, there is very little downside to enabling offloading. The only time you might not want to bother with it is if you are hosting a low-traffic service where most requests feature very short prompts. If you are dealing with multiple concurrent users and medium-to-long prompts, then you will generally benefit from enabling offloading. Whether to use CPU RAM or external storage is a trickier question to answer. Based on our testing, if your prompts are in the 6,000 - 21,000 token range, then you are clearly better off using CPU RAM. For prompts that are 26,000 tokens and longer, you are better off using external storage. Remember that we used a capacity of 100 GB for our CPU RAM scenario. If you are working with less RAM, then you will hit the "cutover" point at a lower token count. Conversely, if you have more RAM, you'll hit it at a higher token count.
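
To see why the cutover point tracks how much RAM you dedicate to the cache, you can turn the earlier per-token estimate around and ask how many tokens a given offload capacity can hold. The ~144 KiB/token figure again comes from assumed model dimensions, so treat the result as an order-of-magnitude estimate rather than an exact threshold.

# Roughly how many tokens of KV cache fit in a 100 GB CPU RAM buffer?
BYTES_PER_TOKEN=147456                            # ~144 KiB/token (assumed model dims)
CAPACITY_BYTES=$((100 * 1000 * 1000 * 1000))      # 100 GB
echo "Approx. cacheable tokens: $((CAPACITY_BYTES / BYTES_PER_TOKEN))"   # ~680,000

At the longer context lengths, the amount of unique KV data generated over a 20-minute run (growing chat histories, plus new users rotating in) pushes past that budget, so the CPU tier has to start evicting entries that the much larger external volume can simply keep - which likely explains the cutover we observed.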

 

It is important to note, however, that we tested a single-instance deployment, meaning that, for each benchmark run, we ran a single vLLM instance on a single GPU. This means that we didn't benefit from the ability to share external storage across instances/nodes. We do plan to repeat this testing with a multi-instance deployment, so stay tuned for another blog with those results.

 

Conclusion

 

KV cache offloading is a powerful technique. However, there is no "one size fits all" solution; it is important to choose the right deployment type for your use case. If external storage is a good fit, remember that NetApp supports KV cache offloading using industry-standard protocols and tooling. No custom proprietary clients are required. You can continue to use the same NFS protocol that you have used for decades. To learn more about NetApp, visit netapp.com.

 

 
