Tech ONTAP Blogs

KV cache offloading with vLLM, LMCache, and StorageGRID

moglesby
NetApp

Today, I will continue my series on LMCache benchmarking. In a previous post, I demonstrated the benefits of offloading your KV cache to shared storage using ONTAP pNFS over RDMA and showed that there is virtually no downside to enabling this shared storage tier. In a subsequent post, I explained how to tune LMCache to better handle longer-context workloads when working with external storage. However, you may prefer working with object storage rather than NAS. Today, I will demonstrate that StorageGRID is a compelling platform for KV cache offloading, yielding up to a 99% decrease in time to first token (TTFT) when compared to vanilla vLLM with no KV cache offloading, and up to a 54% decrease when compared to a vLLM+LMCache setup that uses only CPU memory (system RAM) for offload. I will also discuss various factors that might lead you to choose StorageGRID over ONTAP, or vice versa, for your KV cache offloading solution.

 

StorageGRID overview

 

NetApp StorageGRID delivers high-performance, globally scalable, S3-compatible object storage designed for AI and modern data workloads. With a unified namespace across distributed environments, it simplifies data management while enabling efficient data processing at scale across hybrid and multi-cloud architectures. For more information, check out the StorageGRID product page.

 

KV cache offloading refresher

 

If you read my previous posts and are already familiar with KV cache offloading, you can skip this section. If it's a new concept for you, read on.

 

First off, what is a KV cache? Simply put, KV (key-value) caching is a technique that optimizes large language model (LLM) inference by storing the key and value tensors already computed for earlier tokens, so that they don't need to be recomputed for every newly generated token. This article from Hugging Face provides a good overview.
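
To make the idea concrete, here is a minimal sketch of single-head attention decoding with and without a KV cache. It is a toy under stated assumptions, not how vLLM implements attention: the `project` function is a hypothetical stand-in for learned key/value projections, and each token's embedding doubles as its query.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def project(embedding, factor):
    # Hypothetical stand-in for a learned K or V projection (a simple scaling).
    return [factor * x for x in embedding]

def decode_without_cache(embeddings):
    """Recompute K and V for the entire prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(embeddings) + 1):
        prefix = embeddings[:step]
        keys = [project(e, 0.5) for e in prefix]
        values = [project(e, 2.0) for e in prefix]
        projections += 2 * step
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections

def decode_with_cache(embeddings):
    """Compute K and V once per token and keep them in a cache."""
    keys, values, outputs, projections = [], [], [], 0
    for e in embeddings:
        keys.append(project(e, 0.5))
        values.append(project(e, 2.0))
        projections += 2
        outputs.append(attend(e, keys, values))
    return outputs, projections

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
no_cache_out, no_cache_work = decode_without_cache(embeddings)
cache_out, cache_work = decode_with_cache(embeddings)
print(no_cache_work, cache_work)  # projections computed: 20 vs. 8
```

Both paths produce identical outputs; the cached path simply performs far fewer projections, and the gap widens as the sequence grows.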

 

KV cache offloading is simply the offloading of a previously-processed prompt's KV cache to an external target outside of GPU/accelerator memory, such as system RAM, local disk, or shared storage. You can think of these KV cache offload "targets" as functioning as tiered "swap spaces" for GPU memory.
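
The "tiered swap space" analogy can be sketched in a few lines. This is an illustrative toy, not LMCache's implementation: the class names, the capacities, and the policy of demoting the least recently stored entry to the next tier are all assumptions made for the example.

```python
from collections import OrderedDict

class Tier:
    """One cache tier with a fixed capacity (counted in entries)."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = OrderedDict()

    def put(self, key, value):
        """Store an entry; return the evicted (key, value) pair, if any."""
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)
        return None

class TieredKVCache:
    """Tiers are ordered fastest first, e.g. GPU, then CPU, then shared storage."""
    def __init__(self, tiers):
        self.tiers = tiers

    def store(self, key, value, level=0):
        evicted = self.tiers[level].put(key, value)
        if evicted is not None and level + 1 < len(self.tiers):
            # Demote the evicted entry to the next, slower tier.
            self.store(evicted[0], evicted[1], level + 1)

    def lookup(self, key):
        """Return the name of the fastest tier holding the key, or None on a miss."""
        for tier in self.tiers:
            if key in tier.entries:
                return tier.name
        return None  # miss everywhere: the prompt must be prefilled again

cache = TieredKVCache([Tier("gpu", 1), Tier("cpu", 1), Tier("remote", 10)])
for prompt in ["a", "b", "c"]:
    cache.store(prompt, f"kv-{prompt}")
print([cache.lookup(p) for p in ["c", "b", "a", "d"]])
```

The oldest entry ends up in the largest, slowest tier instead of being discarded, so a later request for it avoids a full recomputation.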

 

Shared storage

 

If you read my previous post on ONTAP pNFS, you will recall that we tested with two separate vLLM instances behind a load balancer. Each vLLM instance was configured to use LMCache to offload KV cache blocks to the same shared ONTAP NAS volume. The results showed there is virtually no downside to enabling the shared storage tier, and that a significant system performance gain can be achieved across certain scenarios.

 

moglesby_0-1782761617718.png

 

In this post, we repeat the same test, but instead of using an ONTAP NAS volume for our shared storage tier, we use a StorageGRID S3 bucket. To connect to a StorageGRID S3 bucket, we utilize LMCache's S3 backend. The S3 backend enables you to implement a three-tier setup incorporating GPU memory, CPU memory (system RAM), and a shared storage tier.
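
To illustrate why a shared tier helps when requests are spread across instances, here is a toy sketch in which two instances reuse each other's work through a common store. Everything in it is an assumption for illustration: a plain dictionary stands in for the S3 bucket, the chunk size is hypothetical, and the prefix-chained hashing scheme is a simplification rather than LMCache's actual key format.

```python
import hashlib

CHUNK_TOKENS = 4  # hypothetical chunk size, chosen small for readability

def chunk_keys(token_ids, chunk_tokens=CHUNK_TOKENS):
    """Return one key per full chunk; each key covers the whole prefix up to that chunk."""
    keys, digest = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_tokens
    for start in range(0, full, chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        digest.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(digest.copy().hexdigest())
    return keys

class Instance:
    """A serving instance with a private local tier and a shared remote tier."""
    def __init__(self, name, shared_store):
        self.name = name
        self.local = {}
        self.shared = shared_store

    def serve(self, token_ids):
        """Count chunks reused locally, fetched from the shared tier, or computed."""
        stats = {"local": 0, "shared": 0, "computed": 0}
        for key in chunk_keys(token_ids):
            if key in self.local:
                stats["local"] += 1
            elif key in self.shared:
                self.local[key] = self.shared[key]
                stats["shared"] += 1
            else:
                kv = f"kv-from-{self.name}"
                self.local[key] = kv
                self.shared[key] = kv
                stats["computed"] += 1
        return stats

shared_bucket = {}  # stand-in for the shared storage tier
instance_a = Instance("a", shared_bucket)
instance_b = Instance("b", shared_bucket)
system_prompt = list(range(8))  # two chunks shared by every user

first = instance_a.serve(system_prompt + [100, 101, 102, 103])
second = instance_b.serve(system_prompt + [200, 201, 202, 203])
print(first, second)
```

The second instance never processed the system prompt itself, yet it only has to compute the user-specific chunk, because the shared chunks were already written by the first instance.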

 

Empirical results

 

Benchmark

 

For this testing, we used the same benchmark that we used in our previous testing, the LMCache project's multi-round QA benchmark. Again, we tested against two separate vLLM instances behind a load balancer. The four scenarios that we tested this time are outlined in the following table.

 

| Scenario | Description (KV cache tiers) |
|---|---|
| Standalone vLLM (no LMCache) | 1. GPU memory |
| vLLM with LMCache's CPU RAM backend | 1. GPU memory; 2. CPU memory (100 GB capacity per instance) |
| vLLM with LMCache's CPU RAM backend and Remote FS plugin (offloading to an ONTAP NAS volume using pNFS over RDMA) | 1. GPU memory; 2. CPU memory (100 GB capacity per instance); 3. ONTAP pNFS over RDMA |
| vLLM with LMCache's CPU RAM backend and S3 backend (offloading to a StorageGRID S3 bucket) | 1. GPU memory; 2. CPU memory (100 GB capacity per instance); 3. StorageGRID S3 |

 

Since it is not a scenario that we have written about before, I have included the LMCache configuration file for our StorageGRID S3 setup below, for reference.

 

Note: Your StorageGRID system must be configured to support virtual hosted style requests.
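
For context, virtual-hosted-style addressing places the bucket name in the hostname, whereas path-style addressing places it in the URL path. The sketch below illustrates the difference. The bucket name, endpoint, and object key are hypothetical placeholders, and the helper functions are illustrative only; they are not part of LMCache or StorageGRID.

```python
def virtual_hosted_url(bucket, endpoint_fqdn, key, scheme="https"):
    """Bucket appears as a subdomain of the S3 endpoint."""
    return f"{scheme}://{bucket}.{endpoint_fqdn}/{key}"

def path_style_url(bucket, endpoint_fqdn, key, scheme="https"):
    """Bucket appears as the first path segment."""
    return f"{scheme}://{endpoint_fqdn}/{bucket}/{key}"

def lmcache_remote_url(bucket, endpoint_fqdn):
    """Mirror the remote_url pattern shown in this post's configuration file."""
    return f"s3://{bucket}.{endpoint_fqdn}"

# Hypothetical values for illustration only.
bucket = "kv-cache"
endpoint = "s3.storagegrid.example.com"

print(virtual_hosted_url(bucket, endpoint, "chunk-0001"))
print(path_style_url(bucket, endpoint, "chunk-0001"))
print(lmcache_remote_url(bucket, endpoint))
```

Because the bucket becomes part of the hostname, names of the form `<bucket>.<endpoint>` typically need to resolve to your S3 endpoint for virtual-hosted-style requests to work.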

 

local_cpu: True
max_local_cpu_size: 100
save_decode_cache: True
remote_url: "s3://<bucket>.<storagegrid_s3_endpoint_fqdn>"
remote_serde: "naive"
extra_config:
  s3_num_io_threads: 320
  s3_prefer_http2: False
  s3_enable_s3express: False
  save_chunk_meta: False
  disable_tls: <True/False>
  aws_access_key_id: "<storagegrid_s3_access_key>"
  aws_secret_access_key: "<storagegrid_s3_secret_key>"
  s3_region: "us-east-1"

 

Consistent with our last round of testing, we held the shared system prompt constant at 1,000 tokens, the maximum number of concurrent users constant at 15, the number of rounds per user constant at 20, the benchmark runtime constant at 20 minutes, and the queries per second (QPS) constant at 2. We once again re-ran the benchmark with different values for the length of the user-specific context. This value represents the length of the unique context included in each user's first message. For each vLLM instance, we served the Qwen3-8B model on a single GPU. I've included the exact benchmark command that we used below. You can use this to run the same tests against your own environment.

 

python3 $LMCACHE_REPO_PATH/benchmarks/multi_round_qa/multi-round-qa.py \
    --num-users 15 \
    --num-rounds 20 \
    --qps 2 \
    --shared-system-prompt 1000 \
    --user-history-prompt $USER_CONTEXT_LENGTH \
    --answer-len 100 \
    --time 1200 \
    --model Qwen/Qwen3-8B \
    --output $WORKING_DIR/${SCENARIO_NAME}_${USER_CONTEXT_LENGTH}.csv \
    --base-url http://localhost:8000/v1
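
If you want to sweep several user-context lengths, a small driver script can generate the command above for each value. This is a sketch under stated assumptions: the context lengths, repository path, working directory, and scenario name below are hypothetical placeholders (substitute your own), while the flags are copied from the command above.

```python
import os
import shlex

def build_benchmark_command(repo_path, working_dir, scenario_name, user_context_length):
    """Build the multi-round QA benchmark command as an argument list."""
    script = os.path.join(repo_path, "benchmarks", "multi_round_qa", "multi-round-qa.py")
    output = os.path.join(working_dir, f"{scenario_name}_{user_context_length}.csv")
    return [
        "python3", script,
        "--num-users", "15",
        "--num-rounds", "20",
        "--qps", "2",
        "--shared-system-prompt", "1000",
        "--user-history-prompt", str(user_context_length),
        "--answer-len", "100",
        "--time", "1200",
        "--model", "Qwen/Qwen3-8B",
        "--output", output,
        "--base-url", "http://localhost:8000/v1",
    ]

# Hypothetical sweep values and paths; replace with your own.
CONTEXT_LENGTHS = [1000, 5000, 10000]
commands = [
    build_benchmark_command("/opt/LMCache", "/tmp/results", "storagegrid_s3", length)
    for length in CONTEXT_LENGTHS
]

for command in commands:
    # Print only; to execute, pass the list to subprocess.run(command, check=True).
    print(shlex.join(command))
```

Each run takes 20 minutes plus drain time, so the sketch prints the commands rather than executing them.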

 

The benchmark begins by immediately creating 15 user sessions, staggered in simulated time so they appear to have started at evenly spaced offsets in the past (roughly 9.5 seconds apart, given QPS=2). Each user's first request includes the 1,000-token shared system prompt concatenated with their user-specific context, followed by a question. In subsequent rounds, the full accumulated chat history is sent with each request, so prompt length grows with every round. Each user sends one request every 7.5 seconds (15 users / 2 QPS) for up to 20 rounds. New user sessions are added by the manager on a fixed time interval (roughly every 9.5 seconds). When the 20-minute time limit is reached, the simulation loop exits and the benchmark waits for all in-flight requests to complete before writing results and exiting.
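
The timing figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the gap between session starts equals one session's lifetime divided by the number of users. That derivation is an assumption, chosen because it reproduces the 7.5-second and 9.5-second figures quoted above, and the function name is hypothetical.

```python
def session_timing(num_users, num_rounds, qps):
    """Derive per-user request spacing and session stagger from benchmark settings."""
    gap_between_requests = num_users / qps  # seconds between one user's requests
    session_lifetime = gap_between_requests * (num_rounds - 1)  # first to last request
    gap_between_users = session_lifetime / num_users  # seconds between session starts
    return gap_between_requests, session_lifetime, gap_between_users

request_gap, lifetime, user_gap = session_timing(num_users=15, num_rounds=20, qps=2)
print(f"request gap: {request_gap} s")      # request gap: 7.5 s
print(f"session lifetime: {lifetime} s")    # session lifetime: 142.5 s
print(f"gap between users: {user_gap} s")   # gap between users: 9.5 s
```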

 

Results

 

The results of our testing are captured in the following chart. This chart shows the aggregate system processing speed across multiple initial prompt lengths for each of the four deployment setups.

 

upstream vLLM+LMCache - 2x - variable prompt lengths _3.png

 

StorageGRID performs admirably, and just as with pNFS, there is virtually no downside to enabling the S3 tier (there is no scenario in which it meaningfully underperforms a standard setup). Overall, adding the StorageGRID S3 tier yields impressive system performance gains:

 

KV cache: GPU + CPU + StorageGRID S3, when compared to...

| Metric | KV cache: GPU memory only | KV cache: GPU + CPU |
|---|---|---|
| Aggregate system processing speed | Up to 173% increase | Up to 76% increase |
| Average TTFT per request | Up to 99% decrease | Up to 54% decrease |
| Average decode throughput per request | Up to 290% increase | Up to 88% increase |
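
To help interpret these percentages, the sketch below shows how a percent change is conventionally computed and what it implies as a multiplicative factor. The TTFT values passed to `percent_change` are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from our testing.

```python
def percent_change(baseline, new):
    """Signed percent change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0

def factor_from_decrease(percent_decrease):
    """A p% decrease in latency means the baseline was 1 / (1 - p/100) times slower."""
    return 1.0 / (1.0 - percent_decrease / 100.0)

def factor_from_increase(percent_increase):
    """A p% increase in throughput means 1 + p/100 times the baseline."""
    return 1.0 + percent_increase / 100.0

# Hypothetical TTFT values (seconds), for illustration only.
example_change = percent_change(baseline=10.0, new=0.1)
print(f"{example_change:.1f}%")

print(f"{factor_from_decrease(99):.1f}x")   # a 99% TTFT decrease
print(f"{factor_from_decrease(54):.1f}x")   # a 54% TTFT decrease
print(f"{factor_from_increase(173):.2f}x")  # a 173% processing speed increase
```

Read this way, a 99% decrease in TTFT corresponds to roughly a hundredfold reduction, while a 54% decrease corresponds to a little over a twofold reduction.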

 

When to choose ONTAP vs. StorageGRID

 

Based on our testing, ONTAP pNFS over RDMA delivers a greater improvement in system performance than StorageGRID S3. If you want to maximize system performance to the greatest extent possible, ONTAP pNFS over RDMA is the right choice.

 

However, there are many reasons why you might want to choose StorageGRID instead. While the magnitude is not quite as large as with ONTAP pNFS, StorageGRID S3 still delivers very meaningful system performance gains. If a global object storage namespace better fits your way of working, StorageGRID is likely the right choice for you. There are also other constraints that you need to take into account, such as your project budget and your team's expertise. Many AI engineers and developers have extensive experience with object storage and prefer working in that paradigm. Also, with StorageGRID’s caching feature, we may observe even greater performance gains. We plan to test this and publish an updated blog post in the future. My hope is that the main thing you will take away from this post is the knowledge that you can build a very compelling KV cache offloading solution using either ONTAP or StorageGRID.

 

Conclusion

 

KV cache offloading is a powerful technique, and the inclusion of a shared storage tier can significantly enhance your inference performance. When exploring your options, remember that NetApp supports KV cache offloading using industry-standard protocols and tooling. No custom proprietary clients are required. You can continue to use the same NFS and S3 protocols that you have used for years. To learn more about NetApp, visit netapp.com.
