Tech ONTAP Blogs

KV cache offloading - exploring the benefits of shared storage

moglesby
NetApp

Today, we continue our exploration of KV cache offloading. If you missed my previous posts on this topic, be sure to check them out here, here, and here. In this post, I will dig further into the benefits of offloading your KV cache to shared storage and explain why, with certain configurations, there is virtually no downside to including a shared storage tier.

 

KV cache offloading refresher

 

Let's begin with a quick refresher on the KV cache offloading concept. If you read my previous posts and are already familiar with KV cache offloading, you can skip this section.

 

First off, what is a "KV cache?" Simply put, KV (key-value) caching is a technique used to optimize large language model (LLM) inference by storing the attention key and value tensors that have already been calculated so that they don't need to be recalculated for every new token that is generated. This article from Hugging Face provides a good overview.
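
To make the concept concrete, here is a minimal sketch of KV cache reuse using the Hugging Face transformers API. This is purely illustrative: the model name is just an example, and inference engines like vLLM manage their KV caches internally and far more efficiently than this loop.

# Minimal illustration of KV caching with Hugging Face transformers.
# Inference engines like vLLM handle this internally; shown here for concept only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # example model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tok("The capital of France is", return_tensors="pt")
generated = inputs["input_ids"]
past_key_values = None  # the KV cache

with torch.no_grad():
    for _ in range(10):
        # After the first step, only the newest token needs to be processed,
        # because the keys/values for all earlier tokens are already cached.
        step_input = generated if past_key_values is None else generated[:, -1:]
        out = model(input_ids=step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tok.decode(generated[0], skip_special_tokens=True))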

 

So...what is "KV cache offloading?" The standard behavior for most inference engines is to store their KV cache in GPU memory. This introduces a challenge. The size of the KV cache scales linearly with batch size (the number of prompts being processed simultaneously) and sequence length (the length of each individual generation stream). As model context windows grow ever larger, multi-turn chats become more common, and inference platforms are utilized by more and more users (and agents), the size of the KV cache can quickly outpace the amount of available GPU memory. In certain scenarios, there is a high probability that useful KV cache entries will be evicted from GPU memory. To address this challenge, there has been considerable interest in enabling the offloading of KV cache entries to CPU memory and/or storage so that they can be kept indefinitely (as space permits) and referenced again in the future. You can think of these KV cache offload "targets" as functioning as a "swap space" for GPU memory.
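
To get a feel for the numbers, the rough KV cache footprint is 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens per sequence × concurrent sequences. The sketch below uses illustrative model dimensions (placeholders, not the exact configuration of any particular model) to show how quickly this outgrows a 24 GB GPU.

# Back-of-the-envelope KV cache sizing. The model dimensions below are illustrative
# placeholders, not the exact configuration of any specific model.
def kv_cache_bytes(concurrent_seqs, tokens_per_seq, num_layers, num_kv_heads,
                   head_dim, bytes_per_value=2):  # 2 bytes per value for fp16/bf16
    return (2 * num_layers * num_kv_heads * head_dim * bytes_per_value
            * tokens_per_seq * concurrent_seqs)

# 15 concurrent users, 20,000 tokens of context each, hypothetical 36-layer model
# with 8 KV heads (GQA) and a head dimension of 128:
size_gib = kv_cache_bytes(15, 20_000, num_layers=36, num_kv_heads=8, head_dim=128) / 1024**3
print(f"{size_gib:.1f} GiB")  # ~41.2 GiB -- far more than the 24.14 GB of GPU memory used in our tests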

 

Shared storage

 

In a previous post, we demonstrated, using vLLM and LMCache, several scenarios in which including a shared storage tier is beneficial. However, at that time, we only tested with a single serving engine instance. One of the primary benefits of shared storage is the ability to access the same files/objects across multiple instances/nodes, and our previous testing did not explore this benefit.

 

Additionally, we used LMCache's GPUDirect Storage (GDS) backend to connect to our shared storage tier. Using LMCache's CPU RAM backend alongside GDS is not recommended because, as mentioned in the GDS backend documentation, the CPU RAM backend can interfere with GDS operations. This means that for the scenarios in which our shared storage tier was enabled, we had to disable the CPU memory tier, forcing us to operate with a two-tier setup (GPU memory and external shared storage). A multi-tier setup incorporating CPU memory would have been preferable because it allows cache blocks to be staged to a faster tier (e.g., CPU memory) as new prompts come into the request queue.
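
To illustrate the staging idea, here is a conceptual toy sketch (this is not LMCache's implementation): a multi-tier cache serves lookups from the fastest tier it can and promotes blocks from slower tiers as they are touched, so a block that is hit in shared storage on its first access becomes a CPU-memory hit on the next.

# Conceptual sketch of multi-tier KV cache lookup with promotion to a faster tier.
# This is NOT LMCache's implementation -- just an illustration of the staging idea.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, cpu_capacity_blocks, shared_store):
        self.cpu = OrderedDict()               # fast tier: CPU memory (LRU)
        self.cpu_capacity = cpu_capacity_blocks
        self.shared = shared_store             # slow tier: shared NFS/S3 (a dict here)

    def get(self, block_hash):
        if block_hash in self.cpu:             # CPU hit: cheapest path
            self.cpu.move_to_end(block_hash)
            return self.cpu[block_hash]
        if block_hash in self.shared:          # shared-storage hit: promote to CPU
            block = self.shared[block_hash]
            self._put_cpu(block_hash, block)
            return block
        return None                            # miss: the GPU must recompute this block

    def put(self, block_hash, block):
        self._put_cpu(block_hash, block)
        self.shared[block_hash] = block        # shared tier keeps blocks long-term

    def _put_cpu(self, block_hash, block):
        self.cpu[block_hash] = block
        self.cpu.move_to_end(block_hash)
        while len(self.cpu) > self.cpu_capacity:
            self.cpu.popitem(last=False)       # evict least recently used from CPU only

cache = TieredKVCache(cpu_capacity_blocks=2, shared_store={})
cache.put("blk-a", b"...kv tensors...")
print(cache.get("blk-a") is not None)          # True: served from CPU, or promoted from shared storage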

 


 

In this post, we expand our testing to include additional scenarios in which the benefits of a shared storage tier really shine. With certain types of setups, there is virtually no downside to including a shared storage tier. We will demonstrate this in the next section.

 

Empirical results

 

Benchmark

 

For this testing, we used the same benchmark as in our previous testing, the LMCache project's multi-round QA benchmark. This time, however, we ran it against two separate vLLM instances behind a load balancer. We repeated the same scenarios as last time and added one more. The four scenarios are outlined in the following table.

 

Scenario 1: Standalone vLLM (no LMCache)
KV cache tiers:
  1. GPU memory (24.14 GB capacity)

Scenario 2: vLLM with LMCache's CPU RAM backend
KV cache tiers:
  1. GPU memory (24.14 GB capacity)
  2. CPU memory (100 GB capacity)

Scenario 3: vLLM with LMCache's GDS backend (offloading to an ONTAP NFS volume)
KV cache tiers:
  1. GPU memory (24.14 GB capacity)
  2. ONTAP NFS (GDS) (10 TB capacity)

Scenario 4: vLLM with LMCache's CPU RAM backend and S3 backend (offloading to an ONTAP S3 bucket)
KV cache tiers:
  1. GPU memory (24.14 GB capacity)
  2. CPU memory (100 GB capacity)
  3. ONTAP S3 (10 TB capacity)

 

With LMCache, when the CPU RAM backend and the S3 backend are enabled at the same time, LMCache will stage KV cache blocks from the S3 tier to the CPU tier when new prompts/requests enter the queue, effectively implementing a three-tier setup. Since this is not a scenario that we have written about before, I have included the LMCache configuration file for our CPU+S3 setup below, for reference.

 

local_cpu: True
max_local_cpu_size: 100
save_decode_cache: True
remote_url: "s3://<bucket>.<ontap_s3_server_fqdn>"
remote_serde: "naive"
extra_config:
  s3_num_io_threads: 320
  s3_prefer_http2: False
  s3_enable_s3express: False
  save_chunk_meta: False
  disable_tls: <True/False>
  aws_access_key_id: "<ontap_s3_access_key>"
  aws_secret_access_key: "<ontap_s3_secret_key>"
  s3_region: "us-east-1"
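
If you want to confirm that KV cache blocks are actually landing in the bucket during a run, a quick object listing against the ONTAP S3 endpoint is enough. The snippet below is a hypothetical helper using boto3; the endpoint, bucket name, and credentials are placeholders that should match the values in the LMCache configuration above.

# Hypothetical check that offloaded KV cache objects are accumulating in the ONTAP S3 bucket.
# Endpoint, bucket, and credentials are placeholders -- use the same values as the LMCache config.
# Use http:// for the endpoint if TLS is disabled on the ONTAP S3 server.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://<ontap_s3_server_fqdn>",
    aws_access_key_id="<ontap_s3_access_key>",
    aws_secret_access_key="<ontap_s3_secret_key>",
    region_name="us-east-1",
)

resp = s3.list_objects_v2(Bucket="<bucket>", MaxKeys=10)
print("objects in bucket:", resp.get("KeyCount", 0))
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])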

 

Consistent with our last round of testing, we held the shared system prompt constant at 1,000 tokens, the maximum number of concurrent users constant at 15, the number of rounds per user constant at 20, and the benchmark runtime constant at 20 minutes. This time, however, we increased the queries per second (QPS) from one to two since we were testing against two vLLM instances. We once again re-ran the benchmark with different values for the length of the user-specific context. This value represents the length of a user's chat history at the beginning of the first round. For each vLLM instance, we served the Qwen3-8B model on a single GPU. I've included the exact benchmark command that we used below. You can use this to run the same tests against your own environment.

 

python3 $LMCACHE_REPO_PATH/benchmarks/multi_round_qa/multi-round-qa.py \
    --num-users 15 \
    --num-rounds 20 \
    --qps 2 \
    --shared-system-prompt 1000 \
    --user-history-prompt $USER_CONTEXT_LENGTH \
    --answer-len 100 \
    --time 1200 \
    --model Qwen/Qwen3-8B \
    --output $WORKING_DIR/${SCENARIO_NAME}_${USER_CONTEXT_LENGTH}.csv \
    --base-url http://localhost:8000/v1

 

The benchmark starts with 15 users. Each user submits a long initial prompt (including a long chat history) and then submits shorter follow-up prompts for 20 rounds. When a user finishes their 20th round, a new user is added, so the number of concurrent users holds steady at 15. This pattern continues until the time limit is reached, at which point the benchmark waits for all in-progress requests to complete and then exits.
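
As a quick sanity check outside the benchmark, you can send the same long prompt twice through the load balancer and compare response times; with a shared offload tier, the second request should be served largely from cached KV blocks even if it lands on the other vLLM instance. The snippet below is a rough, hypothetical check using the OpenAI-compatible API, not part of the benchmark itself.

# Rough sanity check (not part of the benchmark): time two identical long-prompt requests.
# With a shared KV cache tier, the second request's prefill should be much faster,
# even if the load balancer routes it to the other vLLM instance.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # load balancer endpoint
long_prompt = "Background context for the assistant. " * 500  # stand-in for a long chat history

for attempt in (1, 2):
    start = time.time()
    client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": long_prompt}],
        max_tokens=10,
    )
    print(f"attempt {attempt}: {time.time() - start:.2f} seconds")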

 

Results

 

The results of our testing are captured in the following chart. This chart shows the processing speed across multiple initial prompt lengths for each of the four deployment setups.

 

Chart: upstream vLLM + LMCache, two instances, processing speed across variable prompt lengths

 

 

As you can clearly see from the results, there is virtually no downside to enabling the S3 tier. Because the S3 tier and the CPU tier can be used together, there is no scenario in which you are better off leaving the S3 tier disabled. This differs from the GDS tier, where there is a range of prompt lengths for which you are better off using the CPU tier instead (since you can't use both together). However, with the two-instance setup, this range is smaller than it was with a single instance (see our previous results) because both vLLM instances can access the full set of offloaded KV cache entries through the shared GDS tier. The CPU tier is not shareable, since it is local to each vLLM instance, so one vLLM instance can't access KV cache entries that were offloaded by the other without implementing a P2P channel, which introduces additional latency.

 

Conclusion

 

KV cache offloading is a powerful technique, and the inclusion of a shared storage tier can significantly enhance your inference performance, especially with setups that incorporate multiple serving engine instances. When exploring your options, remember that NetApp supports KV cache offloading using industry-standard protocols and tooling. No custom proprietary clients are required. You can continue to use the same NFS and S3 protocols that you have used for years. To learn more about NetApp, visit netapp.com.

 
