Is small the next big thing in AI? Do all AI workloads need GPUs?
As we reach the crucial general availability (GA) milestone, the final stage of our product launch this month, I reflected on these two customer-centric questions in light of AI infrastructure costs and recent developments in tokenization and the economics of deploying large language models (LLMs).
When it comes to AI inferencing and LLM deployments, not all AI workloads or business use cases require the powerful capabilities of large language models (LLMs) with 20+ billion parameters. Small language models (SLMs) have emerged as a powerful and practical alternative, especially for tasks that require specific domain expertise and for customers with resource-constrained environments. Research suggests that SLMs fine-tuned on tailored datasets tend to outperform larger general-purpose models in specific tasks and domains, such as medical diagnostics or legal analysis. A recent example is VeriGen, a model fine-tuned from the open-source, 16-billion-parameter CodeGen-16B to generate Verilog code, a hardware description language (HDL) used for design automation in the semiconductor and electronics industry. Furthermore, the CEO of AI startup Hugging Face once suggested that up to 99% of use cases could be addressed using SLMs. As SLMs demonstrate these capabilities and reinforce the growing notion that “small is the next big thing in AI,” AI engineers face new architectural considerations when choosing among RAG deployment options.
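For readers curious what such domain adaptation might look like in practice, here is a minimal, hypothetical sketch of fine-tuning a small model on a curated domain corpus with LoRA adapters. The base model, the two-sentence stand-in "corpus," and the hyperparameters are illustrative placeholders, not a recipe drawn from the research referenced above.

```python
# Illustrative sketch: domain fine-tuning of a small language model with LoRA.
# Model, dataset, and hyperparameters are placeholders, not a cited recipe.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen2.5-0.5B"  # hypothetical small base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# A tiny stand-in for a curated domain corpus (e.g., clinical or legal text).
corpus = Dataset.from_dict({"text": [
    "Patient note: mild hypertension, continue current dosage.",
    "Clause 4.2: the licensee shall indemnify the licensor.",
]})
tokenized = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=256),
                       remove_columns=["text"])

# Attach low-rank adapters so only a small fraction of the weights are trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-domain-ft", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```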
On the other hand, the choice between CPU and GPU depends on the specific requirements of the AI application. While some AI workloads benefit from the parallel processing capabilities of GPUs, others may prioritize the low latency that CPUs can provide. Moreover, procuring GPUs can be challenging because of hardware availability and supply-chain constraints, in addition to an organization’s budget considerations, slowing customers on the way to the finish line they aspire to reach in their AI product development and deployment lifecycle. A thorough understanding of the customer’s workload characteristics and performance requirements leads to sound AI design decisions and prevents over-engineering a system with unnecessary complexity.
The collaboration between NetApp and Intel takes these perspectives and customer pain points into account with a product strategy built on feasibility and viability: the NetApp® AIPod™ Mini, a retrieval-augmented generation (RAG) system designed for air-gapped AI inferencing workloads without the need for GPUs.
Announcing the General Availability of NetApp AIPod Mini
A growing number of organizations are leveraging RAG applications and LLMs to interpret user prompts. These prompts and responses can include text, code, images, or even therapeutic protein structures retrieved from an organization’s internal knowledge base. RAG accelerates knowledge retrieval and efficient literature review by quickly providing researchers and business leaders with relevant and reliable information.
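As a rough illustration of the retrieve-then-generate pattern described above, the sketch below embeds a handful of internal documents, finds the ones most relevant to a question, and passes them to an LLM endpoint. The embedding model, sample documents, and OpenAI-compatible endpoint URL are assumptions for illustration, not components of the AIPod Mini stack.

```python
# Minimal retrieve-then-generate sketch (illustrative only).
# Assumes a local OpenAI-compatible inference endpoint and a small
# embedding model; these are placeholders, not AIPod Mini components.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

documents = [
    "Policy A: All lab notebooks must be archived quarterly.",
    "Protocol B: Candidate proteins are screened with assay X.",
    "Guide C: Internal code reviews require two approvers.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity (vectors are normalized)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def generate(question: str, context: list[str]) -> str:
    """Send the question plus retrieved context to an LLM endpoint."""
    prompt = ("Answer using only this context:\n" + "\n".join(context)
              + f"\n\nQuestion: {question}")
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # hypothetical endpoint
        json={"model": "local-llm",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]

question = "How often should lab notebooks be archived?"
print(generate(question, retrieve(question)))
```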

The NetApp® AIPod™ Mini combines NetApp’s intelligent data infrastructure, NetApp AFF A-Series systems powered by NetApp ONTAP® data management software, with compute servers based on Intel® Xeon® 6 processors featuring Intel® Advanced Matrix Extensions (Intel® AMX), Intel AI for Enterprise RAG, and the OPEA software stack.
The NetApp AIPod Mini supports pre-trained models of up to 20 billion parameters (e.g., Llama-13B, DeepSeek-R1-8B, Qwen 14B, Mistral 7B). Intel AMX accelerates AI inferencing across a combination of data types (e.g., INT4, INT8, BF16). Jointly tested by NetApp and Intel using optimization techniques such as activation-aware weight quantization (AWQ) for accuracy and speculative decoding for inference speed, the NetApp AIPod Mini delivers up to 2,000 input/output tokens for 30+ concurrent users at 500+ tokens per second (TPS), balancing the trade-off between speed and accuracy for a better user experience. The released benchmark results can be found in MLPerf Inference 5.0.
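As a rough sketch of the kind of CPU-side BF16 inference that Intel AMX accelerates, the example below loads a small instruction-tuned model with Hugging Face Transformers and applies Intel Extension for PyTorch optimizations. The model name and generation settings are placeholders, and this is not the jointly validated AIPod Mini configuration, which additionally applies AWQ quantization and speculative decoding.

```python
# Sketch of BF16 inference on a Xeon CPU, the data type Intel AMX accelerates.
# Model name and settings are placeholders, not the validated AIPod Mini stack.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model under 20B parameters
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply IPEX operator fusions and BF16 optimizations for Xeon CPUs.
model = ipex.optimize(model, dtype=torch.bfloat16)

prompt = "Summarize the benefits of retrieval-augmented generation."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```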
Advantages of running a RAG system with NetApp AIPod Mini:
- NetApp ONTAP data management provides enterprise-grade storage for various types of AI workloads, including batch and real-time inferencing. It delivers the velocity and scalability to handle large datasets and versioning; multiprotocol data access that lets client AI applications read data over S3, NFS, and SMB file-sharing protocols, facilitating data access in multimodal LLM inference scenarios (see the S3 sketch after this list); and data protection and confidentiality through built-in NetApp Autonomous Ransomware Protection (ARP) and both software- and hardware-based encryption, enhancing security for RAG applications that retrieve knowledge from a company’s document repositories.
- In addition to a data pipeline powered by NetApp intelligent data infrastructure, you get OPEA for Intel® AI for Enterprise RAG. OPEA simplifies transforming your enterprise data into actionable insights and includes a comprehensive framework featuring LLMs, datastores, prompt engines, and RAG architectural blueprints. Intel AI for Enterprise RAG provides customers with key features that enhance scalability, security, and user experience.
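To make the multiprotocol access point above concrete, here is a minimal sketch of a client application reading a knowledge-base document over the S3 protocol from an ONTAP S3-compatible endpoint. The endpoint URL, bucket name, object key, and credentials are hypothetical placeholders.

```python
# Sketch: read a knowledge-base document over S3 from an ONTAP S3-compatible
# endpoint. Endpoint, bucket, key, and credentials below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://ontap-s3.example.internal",  # hypothetical ONTAP S3 endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# The same files could also be reached over NFS or SMB mounts by other clients.
obj = s3.get_object(Bucket="rag-knowledge-base", Key="policies/archival_policy.txt")
document_text = obj["Body"].read().decode("utf-8")
print(document_text[:500])
```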

RAG systems and LLMs are technologies that work together to provide accurate, context-aware responses grounded in your organization’s internal knowledge repository. NetApp has long been a leader in data management, data mobility, data governance, and data security across the ecosystem of edge, data center, and cloud. The NetApp AIPod Mini delivers an air-gapped RAG inferencing pipeline that helps enterprise customers deploy generative AI technologies with significantly less computational power and boost their business productivity.
To learn more about the solution design and validation, please refer to NetApp AIPod Mini for Enterprise RAG Inferencing | NetApp