At CES 2026 in Las Vegas, NVIDIA unveiled several groundbreaking technologies to accelerate AI adoption—not just improving performance but also providing greater energy efficiency. One standout announcement was the Inference Context Memory Storage (ICMS) platform, introduced during Jensen Huang’s keynote.
What is ICMS and Why Does it Matter?
ICMS targets gigascale AI inference environments and large AI factories, where models increasingly rely on iterative, multi-step reasoning that reuses context repeatedly. Traditionally, this context—stored in the KV cache (Key-Value Cache)—resided in each GPU’s high-bandwidth memory (HBM). In long-context, multi-turn inference, GPUs often regenerate the same context multiple times, wasting compute cycles and energy.
By enabling GPUs to retain and share context memory across inference processes, ICMS dramatically boosts tokens-per-second throughput and improves energy efficiency.
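To make the reuse idea concrete, here is a minimal, purely illustrative Python sketch: a plain dictionary stands in for the KV cache, and encode_tokens is a hypothetical placeholder for the expensive attention pass that builds KV entries. The only point is that each new turn encodes just the new segments instead of regenerating the whole history.

```python
def encode_tokens(tokens):
    """Stand-in for the expensive attention pass that builds KV entries."""
    return {t: f"kv({t})" for t in tokens}

kv_cache = {}  # segment -> cached key/value entry (illustrative only)

def run_turn(history):
    # Without reuse, every turn would re-encode the full history.
    missing = [t for t in history if t not in kv_cache]
    kv_cache.update(encode_tokens(missing))   # compute only what is new
    return len(missing)                       # work actually performed

history = []
for turn, new_text in enumerate(["hello", "tell me more", "summarize that"]):
    history.append(new_text)
    work = run_turn(history)
    print(f"turn {turn}: encoded {work} new segment(s), reused {len(history) - work}")
```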
For a technical deep dive, see NVIDIA’s blog: Introducing NVIDIA BlueField-4 Powered Inference Context Memory Storage.
Where does ICMS fit in the Memory and Storage Hierarchy?
ICMS adds a new tier to the AI memory/storage hierarchy, designed specifically for transient inference context data. Current tiers include:
- GPU memory (HBM) – nanosecond access
- System RAM
- Local flash storage (SSD)
- Network storage – large-scale shared datasets that are protected, highly available, and managed with enterprise-grade data services. The network storage layer has its own hierarchy (hot, warm, cold) to meet application performance requirements at the lowest cost and highest power efficiency.
As the context cache grows beyond GPU memory capacity, it spills into RAM and SSDs. ICMS introduces a new "tier 3.5" between local flash and network storage, designed to scale mixture-of-experts (MoE) models that perform multi-step inferencing operations.
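A rough sketch of how such a tiered lookup might behave, assuming a simple ordered set of tiers and a recompute fallback; none of the names below reflect the actual ICMS interface.

```python
# Tiers ordered fastest to slowest; plain dicts stand in for real memory/storage.
TIERS = {
    "hbm":  {},   # GPU high-bandwidth memory (fastest, smallest)
    "ram":  {},   # system memory
    "ssd":  {},   # local flash
    "icms": {},   # shared context-memory tier on networked storage
}

def get_kv(block_id, recompute):
    # Search tiers from fastest to slowest; promote a hit back into HBM.
    for name, tier in TIERS.items():
        if block_id in tier:
            TIERS["hbm"][block_id] = tier[block_id]
            return tier[block_id], name
    # Miss everywhere: pay the cost of regenerating the context.
    value = recompute(block_id)
    TIERS["hbm"][block_id] = value
    return value, "recomputed"

TIERS["icms"]["ctx-42"] = "kv-block-42"
print(get_kv("ctx-42", recompute=lambda b: f"regenerated-{b}"))  # hit in icms tier
print(get_kv("ctx-99", recompute=lambda b: f"regenerated-{b}"))  # full recompute
```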
However, the KV cache is ephemeral and can be recreated, though recreation adds latency for users. For a scalable inferencing system that relies on hundreds of agents, each creating its own context, it is important to save the query history, answer history, and context of any paused conversations on enterprise storage systems. Furthermore, AI still depends on durable, secure, high-performance enterprise storage for reference datasets, model training data, and other mission-critical assets. Protecting that data remains essential, and NetApp delivers industry-leading secure storage to meet those needs.
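As one possible shape for that durable state, the sketch below checkpoints an agent's query history, answer history, and paused context as JSON. The function names and fields are hypothetical; in practice the root path would point at an enterprise file share, and a temporary directory is used here only so the sketch runs anywhere.

```python
import json, pathlib, tempfile, time

def checkpoint_conversation(agent_id, queries, answers, context, root):
    """Persist query history, answer history, and paused context durably."""
    state = {
        "agent_id": agent_id,
        "saved_at": time.time(),
        "queries": queries,
        "answers": answers,
        "context": context,   # enough to rebuild or re-warm the KV cache later
    }
    path = pathlib.Path(root) / f"{agent_id}.json"
    path.write_text(json.dumps(state))
    return path

def resume_conversation(agent_id, root):
    return json.loads((pathlib.Path(root) / f"{agent_id}.json").read_text())

share = tempfile.mkdtemp()   # stand-in for an enterprise storage mount
checkpoint_conversation("agent-007", ["q1"], ["a1"], "paused context", root=share)
print(resume_conversation("agent-007", root=share)["queries"])
```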
NVIDIA ICMS, NVIDIA BlueField-4, and the NetApp AI Data Engine
The NetApp AI Data Engine (AIDE) is an end-to-end, storage-integrated AI data service that simplifies and secures your AI data pipelines. AIDE helps you find relevant data across your enterprise, keeps the data in your pipeline current, secures it with guardrails, and transforms it for use with apps or agents.
ICMS is powered by the NVIDIA BlueField-4 processor. This combination gives GPU clusters direct, efficient access to shared, external storage. The architecture makes KV entries quickly available to GPUs from networked storage, so inferencing processes don't need to recalculate the context. ICMS and BlueField-4 expand KV context reuse by storing those entries in shared storage accessible to all GPU nodes in the cluster.
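The cluster-wide effect can be sketched with a toy shared store: the first node to see a prompt pays the prefill cost and publishes the KV block, and every other node reuses it. The SharedContextStore class below is hypothetical and simply stands in for ICMS-managed external storage.

```python
class SharedContextStore:
    """Stand-in for externally shared KV-cache storage visible to all nodes."""
    def __init__(self):
        self._blocks = {}
    def put(self, key, kv_block):
        self._blocks[key] = kv_block
    def get(self, key):
        return self._blocks.get(key)

def infer(node, prompt_key, store):
    kv = store.get(prompt_key)
    if kv is None:
        kv = f"kv-computed-by-{node}"   # expensive prefill happens only once
        store.put(prompt_key, kv)
    return kv

store = SharedContextStore()
print(infer("gpu-node-0", "session-7", store))  # computes and publishes the block
print(infer("gpu-node-1", "session-7", store))  # reuses it, no recompute needed
```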
The location index of KV entries is stored in NVIDIA BlueField-4 memory and synchronized across GPU nodes by NetApp storage orchestration. NetApp storage orchestration on the GPU node can help balance the load across all external storage elements, while GPUDirect RDMA transfers from storage to GPU minimize the latency of accessing KV cache entries.
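As an illustration of the idea, here is a toy location index that spreads KV blocks across storage targets with naive round-robin placement. The index layout and balancing policy are assumptions made for the sketch, not the BlueField-4 or NetApp implementation.

```python
from itertools import cycle

class LocationIndex:
    """Maps a KV block id to the storage target that holds it."""
    def __init__(self, targets):
        self._placement = cycle(targets)   # naive load spreading across targets
        self._index = {}                   # block_id -> storage target
    def place(self, block_id):
        target = next(self._placement)
        self._index[block_id] = target
        return target
    def locate(self, block_id):
        return self._index.get(block_id)

idx = LocationIndex(["storage-a", "storage-b", "storage-c"])
for blk in ["ctx-1", "ctx-2", "ctx-3", "ctx-4"]:
    idx.place(blk)
print(idx.locate("ctx-4"))  # blocks are spread across targets -> "storage-a"
```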
As part of this architecture, AIDE further helps with low-latency inferencing. The knowledge graph exported by AIDE can act as an aggregator, working with Retrieval-Augmented Generation (RAG) pipelines and other data sources to provide the required context for the prompt. And because the prompt and context pass through the aggregator before reaching the LLM, the aggregator can predict the tokens and pass hints to the storage engine to prefetch the proper KV entries.
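The prefetch idea can be sketched as follows: an aggregator inspects the prompt, consults a (here, trivially small) knowledge graph to predict which context blocks will be needed, and asks the storage layer to stage them before the LLM requests them. All names below are hypothetical.

```python
def predict_needed_blocks(prompt, knowledge_graph):
    """Rough stand-in: map prompt terms to related context blocks."""
    return [blk for term, blk in knowledge_graph.items() if term in prompt.lower()]

def prefetch(blocks, staged):
    # In a real system this would trigger reads from shared storage into a
    # faster tier; here it just records which blocks were staged ahead of time.
    staged.update(blocks)

knowledge_graph = {"invoice": "kv-finance-ctx", "contract": "kv-legal-ctx"}
staged = set()
prompt = "Summarize the invoice disputes from last quarter."
prefetch(predict_needed_blocks(prompt, knowledge_graph), staged)
print(staged)  # {'kv-finance-ctx'} is ready before the LLM asks for it
```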
NVIDIA BlueField-4, ICMS, and NetApp AIDE align to streamline AI data pipelines, accelerate TTFT (Time to First Token), and reduce $/token.
Looking Ahead
Per NVIDIA’s announcement, ICMS is expected to ship later in 2026. Over the coming weeks, we will share more details on how NetApp and NVIDIA technologies can be combined to help customers maximize performance, efficiency, and value from their AI investments, across both training and inferencing workstreams. NetApp and NVIDIA have a long-standing partnership, and we are working closely to deliver co-designed solutions that help customers on their AI-powered business transformation journeys. Stay tuned for more on how NetApp will support the new BlueField-4 based architectures for AI workloads.