Today, we continue our exploration of KV cache offloading. In my last post, I demonstrated the benefits of offloading your KV cache to shared storage and showed that there is virtually no downside to enabling an S3 tier. If you have not yet read that post, I recommend reading it before continuing with this one.
You will recall that, to demonstrate the benefits of an S3 tier, we tested with two separate vLLM instances behind a load balancer, each configured to use LMCache to offload KV cache blocks to the same shared ONTAP S3 bucket. In this post, I will walk you through implementing this exact setup in your environment.
Prerequisites
- You must have S3 enabled within your ONTAP cluster, an empty S3 bucket available for use, and credentials for reading from and writing to that bucket. For more details on enabling S3 and creating buckets in ONTAP, refer to the ONTAP S3 documentation. If you want to verify bucket access before proceeding, see the optional sanity check after this list.
- You must have a server with NVIDIA GPUs. Note that we tested on a server with two NVIDIA L40S GPUs. If you have a different model of GPU, you may need to make some slight modifications to our vllm serve commands.
- You must have network access from your GPU server to an S3 data interface within your ONTAP cluster.
- You must have uv installed on your GPU server. For more details, refer to the uv documentation.
- You must have docker installed on your GPU server. For more details, refer to the docker documentation.
- You must have the NVIDIA Container Toolkit installed on your GPU server. For more details, refer to the NVIDIA Container Toolkit documentation.
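If you want to confirm bucket access from your GPU server before proceeding, you can run a quick sanity check. The sketch below assumes that you have the AWS CLI installed and uses placeholder values for the bucket name, S3 server FQDN, and credentials; adjust the endpoint scheme to http:// if TLS is disabled on your ONTAP S3 server.
# Optional sanity check: list the (empty) bucket via your ONTAP S3 data interface
export AWS_ACCESS_KEY_ID='<ontap_s3_access_key>'
export AWS_SECRET_ACCESS_KEY='<ontap_s3_secret_key>'
aws s3 ls s3://<bucket_name> --endpoint-url https://<ontap_s3_server_fqdn>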
Install vLLM and LMCache
Before you can run vLLM with LMCache on your GPU server, you need to install both vLLM and LMCache. First, create a Python virtual environment in which to install and run vLLM and LMCache.
# Create a virtual environment
uv venv .venv
# Activate the virtual environment
source .venv/bin/activate
Next, install vLLM and LMCache.
uv pip install --upgrade lmcache==0.3.12 vllm==0.13.0
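Optionally, you can confirm that both packages are installed in the virtual environment by printing their versions:
# Optional: confirm installed versions
python -c "from importlib.metadata import version; print(version('vllm'), version('lmcache'))"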
Start vLLM instances
Now, you can start your two vLLM instances. First, create your LMCache configuration file:
vi lmcache-config.yaml
local_cpu: True
max_local_cpu_size: 100
save_decode_cache: True
remote_url: "s3://<bucket_name>.<ontap_s3_server_fqdn>"
remote_serde: "naive"
extra_config:
  s3_num_io_threads: 320
  s3_prefer_http2: False
  s3_enable_s3express: False
  save_chunk_meta: False
  disable_tls: <True/False>
  aws_access_key_id: "<ontap_s3_access_key>"
  aws_secret_access_key: "<ontap_s3_secret_key>"
  s3_region: "us-east-1"
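As a concrete illustration, with a hypothetical bucket named kv-cache served by an ONTAP S3 server at s3.lab.example.com, the remote_url line would read:
remote_url: "s3://kv-cache.s3.lab.example.com"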
Next, start your first vLLM instance.
# Specify LMCache config file
export LMCACHE_CONFIG_FILE=$(pwd)/lmcache-config.yaml
# Specify GPU
export CUDA_VISIBLE_DEVICES="0"
# Ensure that both vLLM instances use the same hash seed
export PYTHONHASHSEED=0
# Set your vLLM API key (must be the same for both instances)
export VLLM_API_KEY='<api_key>'
vllm serve \
  Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
  --port 8001
Next, in a separate terminal session, start your second vLLM instance.
# Specify LMCache config file
export LMCACHE_CONFIG_FILE=$(pwd)/lmcache-config.yaml
# Specify GPU
export CUDA_VISIBLE_DEVICES="1"
# Ensure that both vLLM instances use the same hash seed
export PYTHONHASHSEED=0
# Set your vLLM API key (must be the same for both instances)
export VLLM_API_KEY='<api_key>'
vllm serve \
  Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
  --port 8002
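Each instance may take several minutes to download model weights and finish initializing. As an optional readiness check, you can poll each instance's /health endpoint, which should return HTTP 200 once the server is ready to accept requests:
# Optional readiness check; expect a 200 from each instance
curl -s -o /dev/null -w "instance 1: %{http_code}\n" http://localhost:8001/health
curl -s -o /dev/null -w "instance 2: %{http_code}\n" http://localhost:8002/health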
Deploy a vLLM router
After both of your vLLM instances have completed startup and are serving APIs, you are ready to deploy a vLLM router to act as a load balancer in front of your two vLLM instances. This router will provide a single OpenAI-compatible API endpoint and will route requests to your two vLLM instances in round-robin fashion. In a separate terminal session, use docker to deploy a vLLM router.
docker run -d --rm --name vllm-router \
  --network host \
  ghcr.io/vllm-project/production-stack/router:latest \
  --port 8000 \
  --service-discovery static \
  --static-backends "http://localhost:8001,http://localhost:8002" \
  --static-models "Qwen/Qwen3-8B,Qwen/Qwen3-8B" \
  --routing-logic roundrobin
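To confirm that the router is up, you can list the models it serves; because the router fronts OpenAI-compatible backends, this should report Qwen/Qwen3-8B. If the request fails, inspect the container logs with docker logs vllm-router.
curl -s http://localhost:8000/v1/models -H 'Authorization: Bearer <api_key>' | jq .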
Submit prompts
Now, you are ready to submit prompts to your vLLM router! You can run the following command to submit a test prompt.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H 'Authorization: Bearer <api_key>' \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is 2x2?"}]
  }' | jq .
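To observe the shared cache tier in action, you can submit the same long prompt twice in a row. Because the router alternates between backends, the second request typically lands on the other vLLM instance, which can load the prefill KV blocks from the shared S3 bucket instead of recomputing them. The rough sketch below uses a hypothetical <long_prompt> placeholder and curl's total-time counter with max_tokens set to 1, so the measured time approximates prefill time (time to first token) rather than full generation:
# Submit the same prompt twice and compare timings; the second request
# should benefit from KV cache blocks already present in the S3 bucket
for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H 'Authorization: Bearer <api_key>' \
    -d '{"model": "Qwen/Qwen3-8B", "messages": [{"role": "user", "content": "<long_prompt>"}], "max_tokens": 1}'
done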
Conclusion
KV cache offloading is a powerful technique, and the inclusion of a shared storage tier can significantly enhance your inference performance, especially in setups that incorporate multiple serving engine instances. When exploring your options, remember that NetApp supports KV cache offloading using industry-standard protocols and tooling; no proprietary clients are required. You can continue to use the same NFS and S3 protocols that you have used for years. To learn more about NetApp, visit netapp.com.