Tech ONTAP Blogs
Today, we continue our series on LLM inference with an exploration of a more advanced topic: KV cache offloading. This post builds on my previous post, in which I walked you through the deployment of an LLM inference stack using vLLM, a popular open-source LLM inference server. The vLLM deployment that I demonstrated in that post stores its KV cache entirely in GPU memory (VRAM). In this post, I will show you how to offload your KV cache to a NetApp ONTAP storage system using NVIDIA GPUDirect Storage, often referred to as GDS.
KV (key-value) caching is a technique that optimizes LLM inference by storing the attention keys and values computed for previous tokens in a KV cache so that they don't have to be recomputed for every new token that is generated. This article from HuggingFace provides a good overview. KV caching is a standard feature in most LLM inference engines (vLLM, TensorRT-LLM, SGLang, DeepSpeed, etc.).
Most inference engines store their KV cache in GPU memory by default. KV caching therefore involves a trade-off: additional GPU memory is consumed in order to reduce the compute intensity of the inference workload. This introduces a challenge. The size of the KV cache scales linearly with batch size (the number of prompts being processed simultaneously) and context length (the length of each generation stream). For example, a model with 32 transformer layers, 8 KV heads, and a head dimension of 128, stored in FP16, needs roughly 128 KB of KV cache per token, so a single 128K-token context consumes about 16 GB. As model context windows grow ever larger and inference platforms serve more and more users, the size of the KV cache can quickly outpace the amount of available GPU memory.
What can be done about this? A portion of the KV cache can be offloaded to CPU memory and/or storage in order to scale beyond the limits of GPU memory capacity. As reasoning models and agentic AI gain adoption, so does KV cache offloading. However, the offloading target needs to be extremely fast in order to minimize the impact on inference latency. Additionally, in multi-node deployments it is increasingly desirable for the offloaded KV cache to live on shared storage so that it can be accessed by every compute node.
This is where GDS can help. GDS enables a direct data path for transfers between GPU memory and storage, bypassing a bounce buffer in CPU memory and thus reducing latency. NetApp ONTAP supports GDS through NFS over RDMA, and the LMCache project has enabled KV cache offloading in vLLM using GDS. In the remainder of this post, I will show you how to create a vLLM deployment that is configured to offload its KV cache to ONTAP using GDS.
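Before you start, it's worth confirming that GDS is functional on your GPU worker nodes and that your ONTAP NFS mounts are actually using RDMA. Below is a quick sanity check to run on a worker node; the gdscheck path assumes a default CUDA installation, and the exact output will vary by driver and CUDA version.
# Verify that the nvidia-fs (GDS) kernel module is loaded
lsmod | grep nvidia_fs
# Run NVIDIA's GDS platform checker (path assumes a default CUDA installation)
/usr/local/cuda/gds/tools/gdscheck -p
# Once an ONTAP volume is mounted, confirm that NFS is using RDMA as the transport
nfsstat -m | grep -i rdma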
First, you need to create a volume for storing your offloaded KV cache. Create a pvc-vllm-kv-cache.yaml file.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: vllm-kv-cache
  namespace: vllm-stack
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Ti
  storageClassName: aipod-flexgroups-retain-rdma # Change to your FlexGroup + NFS over RDMA StorageClass
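The StorageClass that you reference here must provision FlexGroup volumes that are mounted using NFS over RDMA. As a rough illustration only, a Trident-based StorageClass might look something like the following sketch; the mount options shown are assumptions, so consult the NetApp Trident and ONTAP NFS-over-RDMA documentation for the values appropriate to your environment.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: aipod-flexgroups-retain-rdma
provisioner: csi.trident.netapp.io   # NetApp Trident CSI driver
parameters:
  backendType: "ontap-nas-flexgroup" # FlexGroup backend
reclaimPolicy: Retain
mountOptions:                        # illustrative only; verify against your environment
  - vers=4.1
  - proto=rdma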
Next, use kubectl to create the PersistentVolumeClaim, which will trigger the provisioning of your volume.
kubectl create -f pvc-vllm-kv-cache.yaml
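Before moving on, verify that the claim has been bound to a volume. The STATUS column should show Bound once provisioning completes.
kubectl -n vllm-stack get pvc vllm-kv-cache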
Now, you are ready to deploy vLLM with KV cache offloading enabled. You will configure your vLLM deployment to offload its KV cache to the volume that you just created.
As I demonstrated in the previous post, you can use the Helm chart published by the vLLM Production Stack project to deploy vLLM on Kubernetes. First, create a vllm-prod-stack-values.yaml file.
servingEngineSpec:
  vllmApiKey: "my-key" # Set an API key for your vLLM API
  modelSpec:
    - name: "mistral"
      repository: "lmcache/vllm-openai"
      tag: "latest"
      modelURL: "mistralai/Mistral-Small-3.2-24B-Instruct-2506" # Choose a model from HuggingFace
      replicaCount: 1
      requestCPU: 14 # Set based on model requirements
      requestMemory: "64Gi" # Set based on model requirements
      requestGPU: 2 # Set based on model requirements
      pvcStorage: "1Ti"
      vllmConfig:
        enablePrefixCaching: true
        tensorParallelSize: 2 # Set to number of GPUs
        maxModelLen: 16384 # Set based on model requirements
        v1: 1
        extraArgs: # Set based on model requirements
          - "--tokenizer-mode"
          - "mistral"
          - "--config_format"
          - "mistral"
          - "--load_format"
          - "mistral"
          - "--limit_mm_per_prompt"
          - "image=4"
          - "--gpu-memory-utilization"
          - "0.85"
          - "--kv-transfer-config"
          - '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
      hf_token: "<huggingface_token>" # Enter your HuggingFace access token
      env:
        - name: LMCACHE_USE_EXPERIMENTAL
          value: "True"
        - name: LMCACHE_CHUNK_SIZE
          value: "256"
        - name: LMCACHE_GDS_PATH
          value: "/mnt/gds/cache"
        - name: LMCACHE_CUFILE_BUFFER_SIZE
          value: "4096"
        - name: LMCACHE_LOCAL_CPU
          value: "False"
      extraVolumes:
        - name: kv-cache
          persistentVolumeClaim:
            claimName: vllm-kv-cache
      extraVolumeMounts:
        - name: kv-cache
          mountPath: "/mnt/gds"
routerSpec:
  serviceType: NodePort
Next, use Helm to deploy vLLM.
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install vllm vllm/vllm-stack -n vllm-stack --create-namespace -f vllm-prod-stack-values.yaml
After deploying, confirm that vLLM is running.
kubectl -n vllm-stack get pod
The output of this command should show two running pods. It may take several minutes for the inference server pod to reach the Running state, as vLLM needs to download the model from HuggingFace.
NAME                                            READY   STATUS    RESTARTS   AGE
vllm-deployment-router-75d66f7fb4-6s5p7         1/1     Running   0          20m
vllm-mistral-deployment-vllm-86b7bc4bf5-9kkhz   1/1     Running   0          20m
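You can also check the inference server's logs to confirm that LMCache has initialized its GDS backend and is writing cache data to the mounted volume. The exact log messages vary by LMCache and vLLM version, so treat this as a rough check; the deployment name below corresponds to the pod listing above.
# Look for LMCache / GDS initialization messages in the inference server logs
kubectl -n vllm-stack logs deploy/vllm-mistral-deployment-vllm | grep -i -E "lmcache|gds"
# After sending a few prompts, confirm that cache files are appearing on the ONTAP volume
kubectl -n vllm-stack exec deploy/vllm-mistral-deployment-vllm -- ls -lh /mnt/gds/cache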
Run the following commands to retrieve the URL for vLLM's OpenAI-compatible API endpoint. You can use this API endpoint to chat with your model.
export NODE_PORT=$(kubectl get -n vllm-stack -o jsonpath="{.spec.ports[0].nodePort}" services vllm-router-service)
export NODE_IP=$(kubectl get nodes -o jsonpath="{.items[0].status.addresses[0].address}")
echo http://$NODE_IP:$NODE_PORT
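As a quick test, you can send a chat completion request to this endpoint. The example below assumes the API key ("my-key") and the model name from the values file above.
curl http://$NODE_IP:$NODE_PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-key" \
  -d '{
        "model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
        "messages": [{"role": "user", "content": "Briefly explain what a KV cache is."}],
        "max_tokens": 128
      }'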
Additionally, you can use Open WebUI to quickly deploy a chat UI in front of your vLLM deployment. To install Open WebUI, follow the instructions that I outlined in the previous post.
Why run all of this on NetApp? The answer is simple. If your goal is to build an enterprise-class, production-ready GenAI application as opposed to a "science project" or proof-of-concept, then you need enterprise-class storage. NetApp supports KV cache offloading using industry-standard protocols and tooling. No custom proprietary clients are required. You can continue to use the same NFS protocol that you have used for decades.
Additionally, if you run your LLM inference stack on NetApp, then your KV cache, model versions, chat UI configurations, chat history, RAG knowledge base, etc., will be safe and protected. You can use snapshot copies to efficiently back up your data, ensuring that you will never lose anything. You can also utilize NetApp's Autonomous Ransomware Protection to ensure that your critical data is protected. Finally, since NetApp is the only storage vendor to offer native services co-engineered with each of the three major public cloud providers, you can seamlessly move data between the cloud and the datacenter, and between different cloud providers, as needed to optimize costs or take advantage of specialized infrastructure offerings.
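For example, if your cluster has the Kubernetes CSI snapshot controller installed and a VolumeSnapshotClass configured for Trident, you could request an on-demand snapshot of one of the PVCs from this post with a manifest like the following sketch (the VolumeSnapshotClass name is an assumption for your environment).
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vllm-kv-cache-snap
  namespace: vllm-stack
spec:
  volumeSnapshotClassName: trident-snapshotclass # assumption: your Trident VolumeSnapshotClass
  source:
    persistentVolumeClaimName: vllm-kv-cache     # the PVC created earlier in this post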
Stay tuned for more posts on building out your LLM inference stack! To learn more about NetApp, visit netapp.com.