Tech ONTAP Blogs

KV CacheBlend - What is it, and why it matters

MaxAmende
NetApp

In this blog article, I want to continue Mike's work on KV cache offloading by adding the concept and principle of CacheBlend. I also recommend reading his vLLM stack article before approaching this one.

 

Before going into the details about CacheBlend, let’s summarize the essential points of the previous articles.

 

While AI training workloads currently capture the attention of the stock markets and the media, it is expected that by 2030 more money will be spent on AI accelerators for inferencing workloads than for training workloads, and Nvidia claims that even today AI spending has already shifted from a training focus towards an inferencing one. This shift is especially visible if you look at who spends their money on what. While large AI companies and hyperscalers account for the majority of AI training infrastructure expenditure, it is all the other companies that are interested in executing, i.e. inferencing, the trained models.

 

Technical Concepts

 

Executing models locally 

 

For executing trained models, there are plenty of hardware and software options. While applications like LM Studio or Ollama are great tools for running LLMs locally as an individual, they struggle when it comes to executing LLMs at scale in enterprise environments.

 

In enterprise environments, we see our customers primarily leveraging vLLM or SGLang. Both are brilliant tools for executing LLMs at scale while being largely independent of the hardware stack. Whether you want to leverage a fully featured NetApp AIPod with Nvidia DGX or a smaller AIPod Mini based on Intel hardware, both applications can get the most out of the hardware. For this article, we will focus on vLLM, since it is the model server we see most often at our customers.

 

Bottlenecks while executing models

 

For executing LLMs on your hardware, there are three primary bottlenecks on the side of the AI accelerator:

  1. AI accelerator performance, meaning how many operations the chip can actually execute.
  2. AI accelerator memory bandwidth/performance, meaning how fast the AI accelerator chip can access the model weights/caches.
  3. AI accelerator memory amount, meaning how much memory the chip has access to in total for storing weights/caches.

 

Increasing demand for AI accelerator memory

 

The last of these bottlenecks is becoming more and more critical. On the one hand, the trend towards Mixture of Experts models decreases the demand for sheer AI accelerator performance while, in most cases, increasing the amount of data stored in memory compared to dense models. On the other hand, the usable context lengths of models are getting longer and longer.

 

For example, the current version of DeepSeek R1 supports a context length of 163,840 tokens. If you want to leverage that full context length for a single user, you need to reserve about 266 GB of AI accelerator memory just for the KV cache, in addition to roughly 720 GB for the model weights. This means you would need at least six Nvidia B200 chips to execute the model for one user. Of course, there are tricks like quantization to decrease the total memory need, but the problem stays the same: AI accelerator memory is very limited and expensive.
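
To get a feeling for where such numbers come from, here is a minimal back-of-the-envelope sketch in Python. The per-token KV cache footprint of a standard transformer is roughly 2 (keys and values) x layers x KV heads x head dimension x bytes per value; the parameters below are purely illustrative assumptions for a large dense model with grouped-query attention and are not meant to reproduce DeepSeek R1's actual architecture or the 266 GB figure above.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=2):
    # Keys + values, stored per layer, per KV head, per token (FP16/BF16 by default)
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_len

# Purely illustrative parameters, not DeepSeek R1's real architecture
per_user = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, context_len=163840)
print(f"KV cache per user at full context: {per_user / 1024**3:.1f} GiB")
# The footprint grows linearly with the context length and the number of concurrent users.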

 

KV Cache Offloading

 

In his blog, Mike showed how to offload the KV cache from the GPU to a NetApp storage system leveraging LMCache and Nvidia GPUDirect Storage (GDS). This article focuses on how to advance that approach.

 

With the current setup, there is one key limitation: the cache can only be reused if the new request starts with the same tokens as a previous one. At the first point where the new request differs from the previous one, the cache for all subsequent tokens needs to be calculated from scratch.

 

To demonstrate this with a more practical example, think about a coding assistant. Your current code is 10,000 tokens (about 1,000 lines of code) long, and you want the coding assistant to help you find a bug in it. With that request, vLLM calculates the KV cache for the prompt and stores it. Since the bug was apparently in line 700, you modify the code in that area. The next day, you want to implement a new feature and feed the current version of the code to the assistant. With the current approach, the KV cache for every line before the changed line can be reused, but for the lines afterwards, the cache has to be calculated from scratch. In this example, that would still mean a significant speedup over not storing the KV cache at all. However, let's assume that in the meantime you also renamed a variable in line 10 of your code. Now you can only reuse the KV cache for the first nine lines of code and need to recalculate everything after them, which negates any significant speedup.
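
A minimal sketch of this prefix-matching behavior is shown below. It is deliberately simplified: real prefix caching in vLLM and LMCache works on fixed-size blocks of tokens rather than individual tokens, and the token values are just stand-ins.

def reusable_prefix_len(cached_tokens, new_tokens):
    # With plain prefix caching, reuse stops at the first token that differs
    n = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        n += 1
    return n

cached = list(range(10000))   # yesterday's 10,000-token prompt
edited = cached.copy()
edited[90] = -1               # a single early edit, e.g. the renamed variable around line 10
print(reusable_prefix_len(cached, edited), "of", len(edited), "tokens reusable")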

 

CacheBlend

 

This is exactly where the idea and research behind CacheBlend come in. During their research, the team at the University of Chicago realized that you can precalculate the KV cache for "blocks" of context, and when those blocks are retrieved (no matter in which order), only about 15% of the overall tokens have to be recalculated to achieve a good result.

 

To better understand how that works, let's go back to the earlier coding assistant example. In the first scenario, we changed some lines of code around line 700, meaning we would have to recalculate the cache for the roughly 300 lines after it. Leveraging CacheBlend, we can grab the precalculated blocks of KV cache and use them, with the performance penalty of having to recalculate about 15% of the overall tokens. In this example, the overall gain might not be staggering, but that changes if we look at the second scenario, where we changed a variable name in line 10 of the code. Traditionally, we could hardly leverage the KV cache at all, but with CacheBlend, we can simply grab the blocks after the change and recalculate about 15% of the overall tokens. In this scenario, the performance advantage is significant.
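
To put rough numbers on both scenarios, here is a small sketch comparing how many tokens have to be recomputed with plain prefix caching versus CacheBlend. The 15% ratio matches the blend_recompute_ratios value used in the configuration later in this article; the token counts themselves are illustrative.

PROMPT_TOKENS = 10000            # ~1,000 lines of code at ~10 tokens per line
BLEND_RECOMPUTE_RATIO = 0.15     # fraction of tokens CacheBlend recomputes

def recompute_prefix_caching(first_changed_token):
    # Everything from the first changed token onwards has to be recomputed
    return PROMPT_TOKENS - first_changed_token

def recompute_cacheblend():
    return int(PROMPT_TOKENS * BLEND_RECOMPUTE_RATIO)

for changed_line in (700, 10):
    first_changed_token = changed_line * 10
    print(f"Change in line {changed_line}: prefix caching recomputes "
          f"{recompute_prefix_caching(first_changed_token)} tokens, "
          f"CacheBlend about {recompute_cacheblend()} tokens")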

 

RAG supercharged by CacheBlend

 

CacheBlend also opens up completely new approaches. Let's look at Retrieval-Augmented Generation (RAG) specifically. Since CacheBlend does not care about the order in which cache blocks are used, we can precalculate the caches for all chunks in our vector database. Then, instead of having to calculate the caches when a user asks a question, we can simply load the precalculated caches and merge them leveraging CacheBlend. This way, we can significantly reduce answer times for users and reduce the overall load on the AI accelerator, while blocking less AI accelerator memory and thus allowing more concurrent requests.
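
Conceptually, the serving path then looks like the sketch below. The retrieve_chunks and generate helpers are hypothetical placeholders for your vector database and your vLLM endpoint, and the separator string matches the blend_special_str configured later in this article; this is our reading of how the pieces fit together, not an official LMCache API.

BLEND_SEPARATOR = " # # "   # must match blend_special_str in the LMCache configuration

def build_blended_prompt(question, chunks):
    # Each retrieved chunk corresponds to a precalculated KV cache block; the separator
    # marks the block boundaries so CacheBlend can reuse the blocks in any order.
    return BLEND_SEPARATOR.join(chunks) + BLEND_SEPARATOR + question

# Hypothetical usage:
#   chunks = retrieve_chunks("How do I reset my password?", top_k=4)   # vector database lookup
#   answer = generate(build_blended_prompt("How do I reset my password?", chunks))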

 

Why use CacheBlend with NetApp?

 

While the scientific paper proposes offloading the cache blocks to system memory or local storage, we argue that offloading the cache blocks via GDS to NetApp storage is the better approach. In addition to Mike's points, such as enterprise readiness, security, and stability, also take into account that the caches can get surprisingly big. If we wanted to feed a knowledge base of 1,000 documents with 10 pages each to an employee chatbot, the precalculated caches would amount to about 130 TB of data. With NetApp storage, that data can be served blazingly fast without the storage becoming the bottleneck.
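
As a back-of-the-envelope illustration of how quickly such numbers add up, the sketch below multiplies the corpus size by a per-token cache footprint. The tokens-per-page value and the per-token footprints are illustrative assumptions and are not meant to reproduce the 130 TB figure exactly; the point is that the total scales linearly with both the size of the knowledge base and the model's cache footprint.

NUM_DOCS = 1000
PAGES_PER_DOC = 10
TOKENS_PER_PAGE = 500            # assumption: dense text page

total_tokens = NUM_DOCS * PAGES_PER_DOC * TOKENS_PER_PAGE
# Illustrative per-token KV cache footprints for small, large, and very large models
for kv_mib_per_token in (0.3, 4.0, 20.0):
    total_tib = total_tokens * kv_mib_per_token / 1024**2
    print(f"{kv_mib_per_token:5.1f} MiB/token -> {total_tib:7.1f} TiB of precalculated cache")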

 

Implementation

 

Starting with a standard vLLM/LMCache setup, or one leveraging GDS as described in Mike’s previous blogs, we need to make the following changes:

Note: In contrast to Mike’s guide, this guide does not follow a containerized approach.

Install the following Python library with the package manager of your choice:

 

pip install flashinfer-python

 

If we leave out this step, we will receive errors later on.
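
To quickly check that the library is available in the Python environment vLLM runs in, we can try to import it; the command should return without an error:

python -c "import flashinfer"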

 

Next, we need to modify gpu_worker.py in the vLLM library.

To do that, we first need to find its installation path (which varies based on your installation/configuration).

We can find it, for example, by executing:

 

sudo find / -path "*/vllm/v1/worker/gpu_worker.py" 2>/dev/null

 

After changing into that directory, we should create a copy of gpu_worker.py so that we can revert if our changes do not work.

 

cp gpu_worker.py gpu_worker_backup.py

 

Now, we need to apply a series of changes to the file.

After opening it, we first add the following two import lines to the top of the file:

 

from lmcache.v1.compute.models.utils import VLLMModelTracker
from lmcache.integration.vllm.utils import ENGINE_NAME

 

Next, we search for the init_worker_distributed_environment function in the file.

In that function, we comment out the following line:

 

ensure_kv_transfer_initialized(vllm_config)

 

by adding # before that line.
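
After the change, the relevant part of the function looks roughly like the sketch below; the signature and the rest of the body are shortened here and may differ between vLLM versions.

def init_worker_distributed_environment(vllm_config, *args, **kwargs):  # signature shortened
    ...  # the existing distributed setup code stays unchanged
    # ensure_kv_transfer_initialized(vllm_config)  # commented out; re-added in load_model below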

 

Afterwards, in the same file, we search for the load_model function.

 

At the end of that function, at the base indentation level of the function body, we add the following two lines:

 

VLLMModelTracker.register_model(ENGINE_NAME, self.model_runner.model)
ensure_kv_transfer_initialized(self.vllm_config)
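
In context, the end of the method then looks roughly like this sketch; the new lines sit at the same indentation level as the rest of the method body, and the surrounding code (shortened here) may differ between vLLM versions.

# Sketch: end of the load_model method in gpu_worker.py (body shortened)
def load_model(self) -> None:
    ...  # the existing model loading code stays unchanged
    VLLMModelTracker.register_model(ENGINE_NAME, self.model_runner.model)
    ensure_kv_transfer_initialized(self.vllm_config)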

 

After saving the file and jumping back into the directory where we were working before, we must add the following lines to our lmcache_config file:

 

enable_blending: true
blend_special_str: " # # "
use_layerwise: true
blend_check_layers: 1
blend_recompute_ratios: 0.15
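
With the configuration saved, vLLM can then be started with LMCache as its KV connector, pointing the LMCACHE_CONFIG_FILE environment variable at that file. The model name and config path below are placeholders, and the exact command may differ depending on your setup and on the deployment described in Mike's blogs:

LMCACHE_CONFIG_FILE=/path/to/lmcache_config.yaml \
vllm serve <your-model> \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'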

 

One word of warning: during our testing, we realized that CacheBlend is a very innovative approach but is not yet mature enough for enterprise production use cases. This will likely change as the project evolves. Note that LMCache version 0.3.8 or higher is required to run the technical part of this article.
