Today, we continue our exploration of KV cache offloading. In my last post, I demonstrated the benefits of offloading your KV cache to shared storage and showed that there is virtually no downside to enabling an S3 tier. If you have not yet read that post, I recommend reading it before continuing with this one.
You will recall that, to demonstrate the benefits of an S3 tier, we tested with two separate vLLM instances behind a load balancer, each configured to use LMCache to offload KV cache blocks to the same shared ONTAP S3 bucket. In this post, I will walk you through implementing this exact setup in your environment.
Prerequisites
- You must have S3 enabled within your ONTAP cluster, an empty S3 bucket available for use, and credentials for reading from and writing to that bucket. For more details on enabling S3 and creating buckets in ONTAP, refer to the ONTAP S3 documentation. If you want to verify bucket access before proceeding, see the optional sanity check after this list.
- You must have a server with NVIDIA GPUs. Note that we tested on a server with two NVIDIA L40S GPUs. If you have a different model of GPU, you may need to make some slight modifications to our vllm serve commands.
- You must have network access from your GPU server to an S3 data interface within your ONTAP cluster.
- You must have uv installed on your GPU server. For more details, refer to the uv documentation.
- You must have docker installed on your GPU server. For more details, refer to the docker documentation.
- You must have the NVIDIA Container Toolkit installed on your GPU server. For more details, refer to the NVIDIA Container Toolkit documentation.
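If you want to confirm bucket access from your GPU server before proceeding, you can run a quick sanity check. The sketch below assumes that you have the AWS CLI installed and uses placeholder values for the bucket name, S3 server FQDN, and credentials; adjust the endpoint scheme to http:// if TLS is disabled on your ONTAP S3 server.
# Optional sanity check: list the (empty) bucket via your ONTAP S3 data interface
export AWS_ACCESS_KEY_ID='<ontap_s3_access_key>'
export AWS_SECRET_ACCESS_KEY='<ontap_s3_secret_key>'
aws s3 ls s3://<bucket_name> --endpoint-url https://<ontap_s3_server_fqdn>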
Install vLLM and LMCache
Before you can run vLLM with LMCache on your GPU server, you need to install both vLLM and LMCache. First, create a Python virtual environment in which to install and run vLLM and LMCache.
# Create a virtual environment
uv venv .venv
# Activate the virtual environment
source .venv/bin/activate
Next, install vLLM and LMCache.
uv pip install --upgrade lmcache==0.3.12 vllm==0.13.0
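Optionally, you can confirm that both packages are installed in the virtual environment by printing their versions:
# Optional: confirm installed versions
python -c "from importlib.metadata import version; print(version('vllm'), version('lmcache'))"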
Start vLLM instances
Now, you can start your two vLLM instances. First, create your LMCache configuration file:
vi lmcache-config.yaml
local_cpu: True
max_local_cpu_size: 100
save_decode_cache: True
remote_url: "s3://<bucket_name>.<ontap_s3_server_fqdn>"
remote_serde: "naive"
extra_config:
  s3_num_io_threads: 320
  s3_prefer_http2: False
  s3_enable_s3express: False
  save_chunk_meta: False
  disable_tls: <True/False>
  aws_access_key_id: "<ontap_s3_access_key>"
  aws_secret_access_key: "<ontap_s3_secret_key>"
  s3_region: "us-east-1"
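As a concrete illustration, with a hypothetical bucket named kv-cache served by an ONTAP S3 server at s3.lab.example.com, the remote_url line would read:
remote_url: "s3://kv-cache.s3.lab.example.com"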
Next, start your first vLLM instance.
# Specify LMCache config file
export LMCACHE_CONFIG_FILE=$(pwd)/lmcache-config.yaml
# Specify GPU
export CUDA_VISIBLE_DEVICES="0"
# Ensure that both vLLM instances use the same hash seed
export PYTHONHASHSEED=0
# Set your vLLM API key (must be the same for both instances)
export VLLM_API_KEY='<api_key>'
vllm serve \
  Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
  --port 8001
Next, in a separate terminal session, start your second vLLM instance.
# Specify LMCache config file
export LMCACHE_CONFIG_FILE=$(pwd)/lmcache-config.yaml
# Specify GPU
export CUDA_VISIBLE_DEVICES="1"
# Ensure that both vLLM instances use the same hash seed
export PYTHONHASHSEED=0
# Set your vLLM API key (must be the same for both instances)
export VLLM_API_KEY='<api_key>'
vllm serve \
  Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
  --port 8002
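Each instance may take several minutes to download model weights and finish initializing. As an optional readiness check, you can poll each instance's /health endpoint, which should return HTTP 200 once the server is ready to accept requests:
# Optional readiness check; expect a 200 from each instance
curl -s -o /dev/null -w "instance 1: %{http_code}\n" http://localhost:8001/health
curl -s -o /dev/null -w "instance 2: %{http_code}\n" http://localhost:8002/health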
Deploy a vLLM router
After both of your vLLM instances have completed startup and are serving APIs, you are ready to deploy a vLLM router to act as a load balancer in front of your two vLLM instances. This router will provide a single OpenAI-compatible API endpoint and will route requests to your two vLLM instances in round-robin fashion. In a separate terminal session, use docker to deploy a vLLM router.
docker run -d --rm --name vllm-router \
  --network host \
  ghcr.io/vllm-project/production-stack/router:latest \
  --port 8000 \
  --service-discovery static \
  --static-backends "http://localhost:8001,http://localhost:8002" \
  --static-models "Qwen/Qwen3-8B,Qwen/Qwen3-8B" \
  --routing-logic roundrobin
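To confirm that the router is up, you can list the models it serves; because the router fronts OpenAI-compatible backends, this should report Qwen/Qwen3-8B. If the request fails, inspect the container logs with docker logs vllm-router.
curl -s http://localhost:8000/v1/models -H 'Authorization: Bearer <api_key>' | jq .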
Submit prompts
Now, you are ready to submit prompts to your vLLM router! You can run the following command to submit a test prompt.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H 'Authorization: Bearer <api_key>' \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is 2x2?"}]
  }' | jq .
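To observe the shared cache tier in action, you can submit the same long prompt twice in a row. Because the router alternates between backends, the second request typically lands on the other vLLM instance, which can load the prefill KV blocks from the shared S3 bucket instead of recomputing them. The rough sketch below uses a hypothetical <long_prompt> placeholder and curl's total-time counter with max_tokens set to 1, so the measured time approximates prefill time (time to first token) rather than full generation:
# Submit the same prompt twice and compare timings; the second request
# should benefit from KV cache blocks already present in the S3 bucket
for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H 'Authorization: Bearer <api_key>' \
    -d '{"model": "Qwen/Qwen3-8B", "messages": [{"role": "user", "content": "<long_prompt>"}], "max_tokens": 1}'
done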
Conclusion
KV cache offloading is a powerful technique, and the inclusion of a shared storage tier can significantly enhance your inference performance, especially in setups that incorporate multiple serving engine instances. When exploring your options, remember that NetApp supports KV cache offloading using industry-standard protocols and tooling; no proprietary clients are required. You can continue to use the same NFS and S3 protocols that you have used for years. To learn more about NetApp, visit netapp.com.