Tech ONTAP Blogs
Today, we continue our series on LLM inference with an exploration of a more advanced topic: KV cache offloading. This post builds on my previous post, in which I walked you through the deployment of an LLM inference stack using vLLM, a popular open-source LLM inference server. The vLLM deployment that I demonstrated in that post stores its KV cache entirely in GPU memory (VRAM). In this post, I will show you how to offload your KV cache to a NetApp ONTAP storage system using NVIDIA GPUDirect Storage, often referred to as GDS.
KV (key-value) caching is a technique that optimizes LLM inference by storing the attention keys and values computed for previous tokens in a KV cache so that they don't have to be recomputed for every new token that is generated. This article from HuggingFace provides a good overview. KV caching is a standard feature in most LLM inference engines (vLLM, TensorRT-LLM, SGLang, DeepSpeed, etc.).
Most inference engines store their KV cache in GPU memory by default. KV caching therefore involves a trade-off: additional GPU memory is consumed in order to reduce the compute intensity of the inference workload. This introduces a challenge. The size of the KV cache scales linearly with batch size (the number of prompts being processed simultaneously) and context length (the length of each generation stream). For example, a model with 32 transformer layers, 8 KV heads, and a head dimension of 128, stored in FP16, needs roughly 128 KB of KV cache per token, so a single 128K-token context consumes about 16 GB. As model context windows grow ever larger and inference platforms serve more and more users, the size of the KV cache can quickly outpace the amount of available GPU memory.
What can be done about this? A portion of the KV cache can be offloaded to CPU memory and/or storage in order to scale beyond the limits of GPU memory capacity. As reasoning models and agentic AI gain adoption, so does KV cache offloading. However, the offloading target needs to be extremely fast in order to minimize the impact on inference latency. Additionally, in multi-node deployments it is increasingly desirable for the offloaded KV cache to live on shared storage so that it can be accessed by every compute node.
This is where GDS can help. GDS enables a direct data path for transfers between GPU memory and storage, bypassing a bounce buffer in CPU memory and thus reducing latency. NetApp ONTAP supports GDS through NFS over RDMA, and the LMCache project has enabled KV cache offloading in vLLM using GDS. In the remainder of this post, I will show you how to create a vLLM deployment that is configured to offload its KV cache to ONTAP using GDS.
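Before you start, it's worth confirming that GDS is functional on your GPU worker nodes and that your ONTAP NFS mounts are actually using RDMA. Below is a quick sanity check to run on a worker node; the gdscheck path assumes a default CUDA installation, and the exact output will vary by driver and CUDA version.
# Verify that the nvidia-fs (GDS) kernel module is loaded
lsmod | grep nvidia_fs
# Run NVIDIA's GDS platform checker (path assumes a default CUDA installation)
/usr/local/cuda/gds/tools/gdscheck -p
# Once an ONTAP volume is mounted, confirm that NFS is using RDMA as the transport
nfsstat -m | grep -i rdma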
First, you need to create a volume for storing your offloaded KV cache. Create a pvc-vllm-kv-cache.yaml file.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: vllm-kv-cache
  namespace: vllm-stack
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Ti
  storageClassName: aipod-flexgroups-retain-rdma # Change to your FlexGroup + NFS over RDMA StorageClass
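The StorageClass that you reference here must provision FlexGroup volumes that are mounted using NFS over RDMA. As a rough illustration only, a Trident-based StorageClass might look something like the following sketch; the mount options shown are assumptions, so consult the NetApp Trident and ONTAP NFS-over-RDMA documentation for the values appropriate to your environment.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: aipod-flexgroups-retain-rdma
provisioner: csi.trident.netapp.io   # NetApp Trident CSI driver
parameters:
  backendType: "ontap-nas-flexgroup" # FlexGroup backend
reclaimPolicy: Retain
mountOptions:                        # illustrative only; verify against your environment
  - vers=4.1
  - proto=rdma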
Next, use kubectl to create the PersistentVolumeClaim, which will trigger the provisioning of your volume.
kubectl create -f pvc-vllm-kv-cache.yaml
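Before moving on, verify that the claim has been bound to a volume. The STATUS column should show Bound once provisioning completes.
kubectl -n vllm-stack get pvc vllm-kv-cache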
Now, you are ready to deploy vLLM with KV cache offloading enabled. You will configure your vLLM deployment to offload its KV cache to the volume that you just created.
As I demonstrated in the previous post, you can use the Helm chart published by the vLLM Production Stack project to deploy vLLM on Kubernetes. First, create a vllm-prod-stack-values.yaml file.
servingEngineSpec:
  vllmApiKey: "my-key" # Set an API key for your vLLM API
  modelSpec:
    - name: "mistral"
      repository: "lmcache/vllm-openai"
      tag: "latest"
      modelURL: "mistralai/Mistral-Small-3.2-24B-Instruct-2506" # Choose a model from HuggingFace
      replicaCount: 1
      requestCPU: 14 # Set based on model requirements
      requestMemory: "64Gi" # Set based on model requirements
      requestGPU: 2 # Set based on model requirements
      pvcStorage: "1Ti"
      vllmConfig:
        enablePrefixCaching: true
        tensorParallelSize: 2 # Set to number of GPUs
        maxModelLen: 16384 # Set based on model requirements
        v1: 1
        extraArgs: # Set based on model requirements
          - "--tokenizer-mode"
          - "mistral"
          - "--config_format"
          - "mistral"
          - "--load_format"
          - "mistral"
          - "--limit_mm_per_prompt"
          - "image=4"
          - "--gpu-memory-utilization"
          - "0.85"
          - "--kv-transfer-config"
          - '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
      hf_token: "<huggingface_token>" # Enter your HuggingFace access token
      env:
        - name: LMCACHE_USE_EXPERIMENTAL
          value: "True"
        - name: LMCACHE_CHUNK_SIZE
          value: "256"
        - name: LMCACHE_GDS_PATH
          value: "/mnt/gds/cache"
        - name: LMCACHE_CUFILE_BUFFER_SIZE
          value: "4096"
        - name: LMCACHE_LOCAL_CPU
          value: "False"
      extraVolumes:
        - name: kv-cache
          persistentVolumeClaim:
            claimName: vllm-kv-cache
      extraVolumeMounts:
        - name: kv-cache
          mountPath: "/mnt/gds"
routerSpec:
  serviceType: NodePort
Next, use Helm to deploy vLLM.
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install vllm vllm/vllm-stack -n vllm-stack --create-namespace -f vllm-prod-stack-values.yaml
After deploying, confirm that vLLM is running.
kubectl -n vllm-stack get pod
The output of this command should show two running pods. It may take several minutes for the inference server pod to reach the Running state, as vLLM needs to download the model from HuggingFace.
NAME                                            READY   STATUS    RESTARTS   AGE
vllm-deployment-router-75d66f7fb4-6s5p7         1/1     Running   0          20m
vllm-mistral-deployment-vllm-86b7bc4bf5-9kkhz   1/1     Running   0          20m
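You can also check the inference server's logs to confirm that LMCache has initialized its GDS backend and is writing cache data to the mounted volume. The exact log messages vary by LMCache and vLLM version, so treat this as a rough check; the deployment name below corresponds to the pod listing above.
# Look for LMCache / GDS initialization messages in the inference server logs
kubectl -n vllm-stack logs deploy/vllm-mistral-deployment-vllm | grep -i -E "lmcache|gds"
# After sending a few prompts, confirm that cache files are appearing on the ONTAP volume
kubectl -n vllm-stack exec deploy/vllm-mistral-deployment-vllm -- ls -lh /mnt/gds/cache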
Run the following commands to retrieve the URL for vLLM's OpenAI-compatible API endpoint. You can use this API endpoint to chat with your model.
export NODE_PORT=$(kubectl get -n vllm-stack -o jsonpath="{.spec.ports[0].nodePort}" services vllm-router-service)
export NODE_IP=$(kubectl get nodes -o jsonpath="{.items[0].status.addresses[0].address}")
echo http://$NODE_IP:$NODE_PORT
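As a quick test, you can send a chat completion request to this endpoint. The example below assumes the API key ("my-key") and the model name from the values file above.
curl http://$NODE_IP:$NODE_PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-key" \
  -d '{
        "model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
        "messages": [{"role": "user", "content": "Briefly explain what a KV cache is."}],
        "max_tokens": 128
      }'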
Additionally, you can use Open WebUI to quickly deploy a chat UI in front of your vLLM deployment. To install Open WebUI, follow the instructions that I outlined in the previous post.
Why run all of this on NetApp? The answer is simple. If your goal is to build an enterprise-class, production-ready GenAI application as opposed to a "science project" or proof-of-concept, then you need enterprise-class storage. NetApp supports KV cache offloading using industry-standard protocols and tooling. No custom proprietary clients are required. You can continue to use the same NFS protocol that you have used for decades.
Additionally, if you run your LLM inference stack on NetApp, then your KV cache, model versions, chat UI configurations, chat history, RAG knowledge base, etc., will be safe and protected. You can use snapshot copies to efficiently back up your data, ensuring that you will never lose anything. You can also utilize NetApp's Autonomous Ransomware Protection to ensure that your critical data is protected. Finally, since NetApp is the only storage vendor to offer native services co-engineered with each of the three major public cloud providers, you can seamlessly move data between the cloud and the datacenter, and between different cloud providers, as needed to optimize costs or take advantage of specialized infrastructure offerings.
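For example, if your cluster has the Kubernetes CSI snapshot controller installed and a VolumeSnapshotClass configured for Trident, you could request an on-demand snapshot of one of the PVCs from this post with a manifest like the following sketch (the VolumeSnapshotClass name is an assumption for your environment).
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vllm-kv-cache-snap
  namespace: vllm-stack
spec:
  volumeSnapshotClassName: trident-snapshotclass # assumption: your Trident VolumeSnapshotClass
  source:
    persistentVolumeClaimName: vllm-kv-cache     # the PVC created earlier in this post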
Stay tuned for more posts on building out your LLM inference stack! To learn more about NetApp, visit netapp.com.