Tech ONTAP Blogs

Building a Secure, Enterprise-Ready AI Infrastructure with FlexPod AI for Model Training

ReeseLloyd
NetApp

The enterprise AI landscape is evolving rapidly. Organizations are moving beyond pre-built AI models and investing instead in training and fine-tuning models on their own data. This shift brings tremendous opportunity, but it also introduces new challenges around data infrastructure performance, management, and security. To meet these demands, FlexPod AI introduces a new model training solution built on the Cisco UCS C885A server and backed by the comprehensive security framework essential for enterprise AI deployments.

 

Why Enterprise AI Training Matters

Pre-trained AI models are a great starting point, but they only get you so far without customization options that reflect an organization’s unique terminology, processes, and domain expertise. To unlock real business value, organizations need the ability to train custom models on proprietary data and fine-tune foundation models.

 

Training in your own data center also addresses critical concerns around data sovereignty and security. For organizations in regulated industries like healthcare, financial services, and government, keeping sensitive training data on premises is often a requirement rather than a choice. And for any organization looking to iterate quickly, local training eliminates cloud egress costs and latency while accelerating time-to-insight.

 

Introducing FlexPod AI for Model Training

FlexPod AI introduces a new model training solution designed for large-scale, GPU-accelerated workloads. This validated, turnkey platform, FlexPod AI for Model Training, brings together best-in-class infrastructure components to enable faster, more secure iteration across AI training workflows.

 


 

The FlexPod AI for Model Training architecture is built on three core components designed for production-scale AI training workloads:

 

  • Compute: The Cisco UCS C885A (C885A) server is built on the NVIDIA HGX reference architecture, featuring 8 NVIDIA H200 GPUs with NVLink interconnect. The C885A delivers the multi-GPU performance required for training and fine-tuning large language models, deep learning algorithms, and other demanding AI workloads.
  • Storage: NetApp AFF A90 all-flash arrays provide the high-throughput, low-latency storage that GPU-intensive training workloads demand. Support for NFS over RDMA and NVIDIA GPUDirect Storage enables data to flow directly between storage and GPU memory, minimizing latency and maximizing throughput for data-intensive training jobs.
  • Network: Cisco Nexus switching delivers the high-bandwidth connectivity required for GPU-to-GPU and storage-to-compute communication.

Key Benefits

Together, these tightly integrated components form a platform that translates architectural design into measurable business outcomes for AI training teams.

 

  • Performance at Scale: The combination of NVIDIA HGX-based compute, NVLink GPU interconnect, and advanced storage technologies like NFS over RDMA and GPUDirect Storage ensures that your infrastructure can keep pace with the most demanding training workloads.
  • Validated and Supported: FlexPod's design methodology and lab validation give you tested configurations and detailed implementation guidance, reducing deployment risk and accelerating time-to-value.
  • Operational Simplicity: Unified management through Cisco Intersight and NetApp tools simplifies day-2 operations, while the converged FlexPod architecture reduces infrastructure sprawl.
  • Investment Protection: A single FlexPod AI infrastructure can support the complete AI lifecycle. Train models today and deploy them for inferencing tomorrow, all on the same platform.

 

The Breadth of FlexPod AI

While model training represents a growing opportunity, it is only one part of a broader set of AI workflows enterprises need to support. The new model training solution joins a comprehensive portfolio of FlexPod AI solutions that address the full spectrum of enterprise AI requirements:

 

  • FlexPod AI for Model Training (new): LLM/SLM training, fine-tuning, and HPC
  • FlexPod AI for Generative AI Inferencing: Production GenAI deployment
  • FlexPod AI for RAG Pipelines: Retrieval-Augmented Generation with NVIDIA NIM
  • FlexPod AI for MLOps: End-to-end ML lifecycle management
  • FlexPod AI for GPU Intensive Applications: General GPU-accelerated AI/ML and scaling guidance
  • FlexPod AI with SUSE Rancher: Kubernetes-based AI/ML with SUSE Rancher multi-cluster management

 

Whether you are starting with inferencing, building RAG pipelines, or training custom models, FlexPod AI provides a consistent, validated infrastructure foundation that grows with your AI initiatives.

 

The Challenge of Securing AI

As AI adoption accelerates, new security challenges emerge, making security an integral component of any AI deployment. AI systems often have access to sensitive data, and trained models themselves can represent valuable intellectual property. Rapid deployments may bypass security best practices, resulting in shadow IT implementations, ungoverned data pipelines, and ad-hoc infrastructure deployments.

 

Without appropriate security controls, these gaps create meaningful risk and increase an organization's exposure to data breaches, model theft, and compliance violations.

 

A Secure Foundation for AI

Addressing these risks begins with designing security into the infrastructure itself rather than adding it later. You can't bolt security on after the fact and expect it to be effective; as a foundational component of your infrastructure, security must be built in from the ground up.

 

FlexPod delivers this secure foundation through the combined capabilities of its core components. Cisco UCS and Nexus provide hardware root of trust, secure boot, and encrypted communications. And with NetApp ONTAP, the most secure storage on the planet, you can count on secure boot, hardware root of trust, encryption for data in flight and at rest, immutable snapshots, ransomware detection, rapid recovery, and more to keep your foundation secure.

 

This foundation matters for AI workloads because training data, model weights, and inference results all need to be protected. When your infrastructure is secure by design, you can focus on building AI solutions rather than worrying about whether your data is protected.

 

Putting the Security Pieces Together

Building on the secure foundation, FlexPod provides a comprehensive set of security solutions that give you definitive guidance on hardening your environment and protecting against threats.

 

  • FlexPod Security Hardening with VMware provides detailed guidance on securing FlexPod deployments running VMware virtualization. This includes configuration recommendations for Cisco UCS, Nexus, and NetApp ONTAP that align with industry security benchmarks.
  • FlexPod Security Hardening with Red Hat OpenShift extends this guidance to Red Hat containerized and virtualized environments, addressing the unique security considerations of Kubernetes-based deployments.
  • FlexPod Ransomware Protection and Recovery focuses on protecting your data from ransomware attacks and ensuring rapid recovery if an attack occurs. This solution leverages NetApp ONTAP capabilities including Autonomous Ransomware Protection, Snapshot copies, SnapLock, and SnapMirror to create a multi-layered defense. To add visibility, the solution brings together NetApp Data Infrastructure Insights, NetApp Console, and Cisco Splunk for full stack monitoring and response.
  • FlexPod Zero Trust Framework ties everything together. The Design Guide and Deployment Guide provide a comprehensive approach to implementing Zero Trust principles across your FlexPod environment. Rather than assuming trust based on network location, Zero Trust implements workload isolation and continuous verification of every user, device, and workload.

These solutions work together to create a defense-in-depth approach that protects your infrastructure, your data, and your AI workloads.

 

AI-Centric Security with Cisco Secure AI Factory

While a secure foundation and comprehensive infrastructure hardening are essential, AI workloads also have unique security requirements that demand purpose-built solutions. This is where Cisco Secure AI Factory with FlexPod comes in.

 

As discussed in a recent blog post, Secure AI Factory with FlexPod AI, this solution addresses the specific security challenges of AI deployments. It provides visibility into AI workloads, protects AI data pipelines, and helps ensure that AI systems are deployed in compliance with organizational policies. Cisco Secure AI Factory is a reference architecture built on the strong foundations already in place with FlexPod AI solutions, extending those capabilities to meet the unique demands of AI security.

 


 

Intelligent Data Management for AI

Beyond infrastructure security, managing and protecting AI data at scale requires intelligent tooling. NetApp is delivering new capabilities that expand what your data can offer:

 

  • NetApp AIDE (AI Data Engine) provides a unified approach to managing data across the AI lifecycle. From data preparation and curation to training and inferencing, AIDE helps organizations maintain visibility and control over their AI data pipelines. This includes capabilities for data lineage, governance, and optimization that are critical for enterprise AI deployments.
  • NetApp AFX is a scale-out intelligent data platform purpose-built for AI workloads. AFX delivers massive scalability, starting at 2 PB and expanding to over 100 PB in a single namespace, with linear throughput scaling and intelligent data mobility across hybrid and multi-cloud environments.

These capabilities complement the Cisco Secure AI Factory architecture by providing the high-performance storage and data management intelligence that AI workloads require. Brought together, they create a comprehensive approach to AI security that addresses infrastructure, network, and data protection.

 

Cisco Secure AI Factory integrates with the FlexPod security foundation and hardening guidance to deliver a complete security solution for enterprise AI. You get the performance and scalability of FlexPod AI combined with the security controls that modern AI deployments require.

 

Conclusion

Enterprise AI is moving fast, and organizations need infrastructure that can keep pace. FlexPod AI for Model Training with the Cisco C885A server delivers the performance required for the most demanding training workloads, while the broader FlexPod AI portfolio addresses use cases from inferencing to RAG pipelines to MLOps.

 

Performance alone is not enough. AI systems must be built on a secure foundation, hardened against threats, and protected by purpose-built security solutions. FlexPod provides all of this through its secure-by-design architecture, comprehensive hardening guides, Zero Trust framework, and Cisco Secure AI Factory integration.

 

This comprehensive approach to enterprise AI is only possible through the strong partnership between NetApp, Cisco, and NVIDIA. Each partner brings critical capabilities to the table: NVIDIA delivers the GPU compute and AI software stack, Cisco provides the compute and networking infrastructure along with security expertise, and NetApp contributes high-performance storage and intelligent data management. Together, these partners have built and validated a complete platform that addresses the full AI lifecycle from training to inferencing, all on a secure foundation.

 

Whether you are just starting your AI journey or scaling existing initiatives, FlexPod AI gives you the performance, flexibility, and security you need to succeed. And, for the most extreme AI and HPC environments where sustained parallel I/O and scratch performance dominate, NetApp also offers purpose-built EF-Series solutions paired with parallel file systems such as Lustre.

 

Explore the FlexPod AI for Model Training (NEW) deployment guide for configuration and implementation details.

To learn more about FlexPod, FlexPod AI for Model Training, and the complete FlexPod AI portfolio, contact your Cisco or NetApp account team.
