Introduction
Automation is the need of the hour. It is a powerful tool that can scale a business, provide significant cost savings, and allow IT staff to focus on strategic rather than administrative work.
A wide range of datacenter and cloud operations can be automated, making day-to-day work faster. Thanks to automation, IT environments scale more quickly, with fewer errors, and are more responsive to business needs. The automation capabilities of the FlexPod Datacenter and Express solutions are a major step toward simplifying processes, minimizing errors, and increasing efficiency.
FlexPod is a best practice datacenter architecture that includes the following components - Cisco Unified Computing System, Cisco Nexus switches, Cisco MDS switches, and NetApp AFF/FAS/ASA systems. FlexPod's validated architecture provides a solid infrastructure foundation for a variety of business applications and solutions. With continuous integration with Cisco and NetApp technologies, FlexPod is at the forefront of innovation: through simplified management with Cisco Intersight, advanced hybrid cloud capabilities with NetApp ONTAP, and faster performance with end-to-end NVMe. FlexPod Datacenter and FlexPod Express deliver a baseline configuration and have the flexibility to be sized and optimized to accommodate many different use cases and requirements.
Existing FlexPod Datacenter customers can manage their FlexPod Express systems with the same set of tools they are already familiar with, and new FlexPod Express customers can easily manage their FlexPod solutions as they scale and grow their environment.
FlexPod automation with Ansible can further improve IT efficiency and reduce operational errors. Infrastructure deployment and service provisioning that used to take hours can now be completed in minutes. With FlexPod automation you can save time and be more productive.
Ansible Overview
Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. Ansible is agentless, temporarily connecting remotely via SSH or Windows Remote Management to do its tasks. It can automate IT environments whether they are hosted on traditional bare metal servers, virtualization platforms, or in the cloud. Ansible includes its own declarative language to describe system configuration. Vendor-specific modules for automation are available via Ansible Galaxy collections.
For more information, refer to the Ansible documentation. Now let's look at deploying FlexPod Datacenter with Ansible in UCSM and IMM modes, and then at deploying FlexPod Express with Ansible.
FlexPod Datacenter Automation with Ansible
FlexPod Datacenter delivers an optimal multipurpose foundation for various workloads and applications. FlexPod automation delivers a fully automated solution deployment that covers all sections of the infrastructure and application layer. There are two modes for configuring Cisco UCS: UCSM (UCS Managed) and IMM (Intersight Managed Mode).
FlexPod with UCSM (UCS Managed)
You can leverage Ansible playbooks that have been designed to set up the Day 0 FlexPod configuration using UCSM. This includes the configuration of NetApp storage, Cisco network and compute, and VMware. This automation capability augments the standard manual deployment procedures provided in the deployment guide.
Watch the demo videos below to gain more insight into FlexPod Datacenter with UCSM automation via Ansible:
Cisco Nexus Setup
NetApp ONTAP Setup
Cisco UCS Setup
Cisco MDS Setup
VMware ESXi Setup
VMware vCenter Setup
FlexPod with IMM (Intersight Managed Mode)
The Cisco Intersight platform is a Software-as-a-Service (SaaS) infrastructure lifecycle management platform that delivers simplified configuration, deployment, maintenance, and support. It is designed to be modular, so customers can adapt services based on their individual requirements.
This section covers the Day 0 FlexPod setup using UCS IMM. The configuration of the compute, networking, storage, and hypervisor layers is automated through Ansible playbooks that set up the Cisco, NetApp, and VMware components according to the solution best practices identified during testing and validation.
Refer to the FlexPod IaC document for step-by-step deployment of FlexPod via Ansible. The standard manual procedures for configuring FlexPod are explained in the deployment guide.
Watch the demo videos below to gain more insight into FlexPod Datacenter with UCS IMM automation via Ansible:
Cisco Nexus Setup
NetApp ONTAP Setup
Cisco UCS Setup
Cisco MDS Setup
VMware ESXi Setup
VMware vCenter Setup
FlexPod Express Automation with Ansible
FlexPod Express offers customers an entry-level solution with technologies available from Cisco and NetApp. FlexPod® Express with Cisco UCS C-series Standalone Rack Servers and NetApp AFF is a predesigned, best practice architecture built on the Cisco Unified Computing System (Cisco UCS), the Cisco Nexus family of switches, and NetApp storage technologies.
You can leverage Ansible playbooks that have been designed to set up the Day 0 FlexPod Express configuration with local boot using the NVMe/TCP and NFS storage protocols. With Ansible, the Day 0 deployment of FlexPod Express takes less than 2 hours. The majority of steps are automated to provide a simple and seamless deployment experience.
Refer to the IaC document for step-by-step deployment via Ansible. To understand the design and manual deployment steps of FlexPod Express, refer to the latest deployment guide.
Highlights of Ansible deployment
FlexPod deployment with Ansible has different phases that involve the exchange of parameters or attributes between compute, network, storage, and virtualization, and may also involve some manual intervention. All phases are clearly demarcated, and the automated implementation is split into equivalent phases via Ansible playbooks, with tag-based execution of specific sections of each component's configuration.
To configure the different sections of the solution, the Ansible playbooks invoke a set of roles and consume the associated variables required to set up the solution. The variables fall into two categories: user input and defaults/best practices. Based on the installation environment, customers can modify the variables to suit their requirements and proceed with the automated installation.
The ONTAP automation is scalable and can configure anything from a single HA pair to a fully scaled 24-node ONTAP AFF/FAS cluster. After the base infrastructure is set up with NetApp ONTAP, Cisco network and compute, and VMware, customers can also deploy NetApp ONTAP Tools for VMware vSphere (formerly Virtual Storage Console), the SnapCenter Plug-in for VMware vSphere, and Active IQ Unified Manager in an automated fashion. Another key benefit of this automation package is that customers can reuse parts of the code/roles to execute repeatable tasks using the tags associated with the fine-grained tasks within the roles.
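To make that tag-based execution concrete, here is a minimal sketch that wraps the ansible-playbook CLI from Python. It assumes Ansible is installed, and the inventory, playbook, and tag names are hypothetical placeholders to be replaced with the ones shipped in your copy of the FlexPod automation repository.

```python
# Minimal sketch: run only the section of the configuration associated with a tag.
# Assumes Ansible is installed; inventory, playbook, and tag names are hypothetical.
import subprocess

def run_section(playbook: str, tag: str, inventory: str = "inventory") -> None:
    """Execute one tagged section of a FlexPod playbook via the ansible-playbook CLI."""
    subprocess.run(
        ["ansible-playbook", "-i", inventory, playbook, "--tags", tag],
        check=True,  # raise if the play fails, so failures are not silently ignored
    )

# Example: re-run only one part of the ONTAP configuration without touching compute or network.
run_section("Setup_ONTAP.yml", "ontap_config_part_2")
```

Because each role's tasks carry fine-grained tags, the same pattern lets you repeat a single step, such as re-applying the storage configuration after adding a host, without re-running the entire deployment.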
Conclusion
Automation is imperative because business requirements are dynamic, and the ability to adapt quickly to enterprise needs is critical. Infrastructure administrators need processes that deliver operational efficiency, avoid human error, and improve the productivity of their IT landscape. FlexPod automation helps customers build repeatable building blocks that are continuously updated to align with technology innovations, incorporating the best practices of the joint Cisco and NetApp reference architectures.
Deep Learning (DL) is the subfield of Artificial Intelligence (AI) that focuses on creating large neural network models capable of data-driven decisions [1].
While GPUs often take the spotlight in AI/ML infrastructure, storage plays a critical role throughout the pipeline. From storing raw datasets and engineered features to feeding data into GPUs during training, the performance of the storage system has a significant impact on the efficiency and scalability of these workloads.
Understanding how to configure a storage solution and clients to support AI/ML pipelines isn't just helpful, it's essential.
In this series, we will delve into:
Part I - Identifying storage demands for Deep Learning workloads through workflow analysis
Part II - Deep Learning I/O: An approach to overcome storage benchmarking challenges for Deep Learning workloads
Part III - The methodology for benchmarking storage performance for Training a UNET-3D model and its performance results
Part IV - The methodology for benchmarking storage performance for checkpointing a LLM and its performance results
We structured the series in the order above as it's important to understand the challenges, tools, and methods behind the data before diving into the performance results and insights.
*** TERMINOLOGY ALERT ***
If you are a data scientist, a machine learning engineer, a data engineer, or a data platform engineer, please note that throughout this series the term "storage" refers specifically to the infrastructure component acting as a file system for your data. This includes cloud-based services such as Amazon FSx for NetApp ONTAP, Azure NetApp Files, and Google Cloud NetApp Volumes, as well as on-premises NetApp engineered systems like the AFF A-series and C-series. This distinction is important because "storage" can mean different things depending on your role or the system architecture you're working with.
Identifying Storage Demands for Deep Learning Workloads Through Workflow Analysis
One of the core challenges in measuring storage performance for deep learning workloads is identifying which phases (data ingestion, preprocessing, model training, inference, etc.) place the greatest demands on storage. This insight is essential for designing meaningful benchmarks, especially when data is accessed from multiple storage tiers based on the chosen data management strategy.
As deep learning models grow in complexity and scale, the performance of underlying storage systems becomes increasingly critical. From ingesting massive datasets to training models across distributed environments, each stage of the AI/ML pipeline interacts with storage in distinct ways.
We will walk through each phase of the AI/ML workflow to explain its purpose, expected load and I/O patterns. To support this analysis, we will introduce the "Medallion Data Architecture" (Figure 1) and the AI/ML workflow template (Figure 2). This combined view allows us to examine the AI/ML process in the context of the underlying data infrastructure.
The "Medallion Architecture" is a popular data management strategy that organizes data into multiple layers (typically bronze, silver, gold) to progressively improve data quality and usability. This layered approach, often used in data lakehouses, facilitates data processing, cleansing, and transformation, making data more suitable for various analytics, business intelligence, and AI use cases [2].
Figure 1 shows an example of a "Medallion Architecture". The bronze layer acts as the landing zone for raw, unprocessed data from various sources. It focuses on capturing data as it arrives, without any transformations or quality checks. The silver layer is where data from the bronze layer is refined. This includes tasks like data validation, cleansing, and deduplication, ensuring a more reliable and consistent dataset. The gold layer hosts curated data. Here, domain-specific features can be extracted, and the data is optimized for consumption by business intelligence tools, dashboards, decision-making applications, and AI/ML pipelines.
Figure 1. Storage plays a central role in the data management lifecycle. Adapted from (Reis & Housley, 2023) [4], with modifications.
Figure 2 illustrates an AI/ML workflow template developed by Francesca Lazzeri, PhD. In her book, Lazzeri emphasizes the significance of each phase within the workflow. While her template is tailored for time series forecasting, its structure is broadly applicable to a wide range of AI/ML workflows [3].
Figure 2. AI/ML Workflow Template. Adapted from (Lazzeri, 2020) [3], with modifications.
Let's walk through the AI/ML workflow template and examine how each stage interacts with, or places demands on storage systems.
Business Understanding Phase
In this phase there are no direct storage-related concerns. The focus is on understanding the business problem. Data scientists, machine learning engineers, and data engineers collaborate to define the problem, identify the types of data needed to solve it, and determine how to measure the success of the AI/ML solution.
Data Preparation Phase
In this phase, storage considerations begin to play a role. As shown in Figure 2 above, the data preparation phase subdivides further into specific stages, namely:
The data ingestion stage
The data exploration and understanding stage
The data pre-processing and feature development stage
During the Data Ingestion stage, data from multiple sources—whether in batch or streaming form—is ingested into the bronze layer of the data architecture. At this layer, the storage I/O pattern is primarily characterized by sequential write operations, driven by concurrent data streams from these sources.
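As a minimal sketch of this ingestion pattern, the snippet below appends each incoming batch to the bronze layer as a new Parquet file. It assumes pandas and pyarrow are installed, and the mount path and event fields are hypothetical.

```python
# Minimal ingestion sketch: each source stream appends batches to the bronze layer
# as new Parquet files (sequential writes). Paths and fields are hypothetical.
import pandas as pd
from datetime import datetime, timezone

BRONZE_DIR = "/mnt/bronze/clickstream"  # landing zone exported by the storage system

def ingest(batch: list[dict]) -> None:
    """Write one batch of raw events as a new file in the bronze layer."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    pd.DataFrame(batch).to_parquet(f"{BRONZE_DIR}/events_{stamp}.parquet", index=False)

# Several concurrent producers calling ingest() yields the multiple sequential
# write streams described above.
ingest([{"user_id": 1, "event": "page_view"}, {"user_id": 2, "event": "click"}])
```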
The next stage is Data Exploration and Understanding. At this stage a data engineer or data scientist reads CSV or Parquet files from the bronze layer, exploring a subset of the dataset via a Jupyter Notebook to understand the data's shape, distribution, and cleaning requirements. The I/O pattern at this stage is mostly a light load of sequential read operations against the underlying storage.
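A lightweight exploration pass might look like the sketch below, typically run from a Jupyter Notebook; the file path and columns are hypothetical, and pandas/pyarrow are assumed to be installed.

```python
# Minimal exploration sketch: a few sequential reads of one bronze file.
import pandas as pd

sample = pd.read_parquet("/mnt/bronze/clickstream/events_20250101.parquet")  # hypothetical file

print(sample.shape)         # rows x columns
print(sample.dtypes)        # column types
print(sample.describe())    # distribution of numeric columns
print(sample.isna().sum())  # missing values that will drive the cleaning strategy
```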
Now that the data is understood, data transformation begins at the Data Pre-Processing & Feature Engineering stage.
The first step of this stage, Data Pre-Processing, involves reading data from the bronze layer. Data engineers/scientists clean the full dataset, writing the results to the silver layer.
The second step, Feature Engineering, uses the silver layer as the input source. New features are derived from the cleaned data, and this new dataset is then written to the gold layer.
The I/O pattern of this multi-step stage involves multiple streams of sequential reads from bronze and multiple streams of sequential writes to silver during the cleaning step, as well as multiple streams of sequential reads from silver and multiple streams of sequential writes to gold during feature engineering.
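The sketch below illustrates these two steps end to end, reading from bronze, writing the cleaned dataset to silver, and writing the derived features to gold. Paths, column names, and the cleaning rules are hypothetical, and pandas, numpy, and pyarrow are assumed to be installed.

```python
# Minimal pre-processing and feature-engineering sketch across the medallion layers.
# Paths and columns are hypothetical; reads and writes are sequential.
import numpy as np
import pandas as pd

BRONZE = "/mnt/bronze/transactions.parquet"
SILVER = "/mnt/silver/transactions_clean.parquet"
GOLD = "/mnt/gold/transactions_features.parquet"

# Step 1 - pre-processing: read raw data from bronze, clean it, write it to silver.
raw = pd.read_parquet(BRONZE)
clean = raw.drop_duplicates().dropna(subset=["customer_id", "amount", "timestamp"])
clean.to_parquet(SILVER, index=False)

# Step 2 - feature engineering: read the cleaned data from silver, derive features,
# and write the training-ready dataset to gold.
features = clean.copy()
features["amount_log"] = np.log1p(features["amount"])
features["day_of_week"] = pd.to_datetime(features["timestamp"]).dt.dayofweek
features.to_parquet(GOLD, index=False)
```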
Data Modeling Phase
This phase is divided into three stages, Model Building, Model Selection, and Model Deployment.
Training takes place during the Model Building stage. It is an iterative process, with batches read from storage into memory. Each batch is passed through the neural network to produce predictions (forward pass), the loss is evaluated and gradients are computed (backward pass), and the optimizer updates the model's weights. This process continues until all samples have been processed by the accelerators in play. If configured by the data scientist, checkpoints are periodically triggered to save the model's weights and state to persistent storage.
The I/O pattern involves multiple streams of sequential reads served by the gold layer feeding the forward pass, and multiple streams of sequential writes to persistent storage as part of the checkpoint process. Be aware that neither the backward pass nor the gradient/optimizer updates issue storage operations.
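For illustration, here is a minimal PyTorch-style training loop showing where the reads and writes described above occur. The in-memory dataset stands in for samples loaded from the gold layer, and the model, hyperparameters, and checkpoint path are hypothetical.

```python
# Minimal training-loop sketch: batch reads feed the forward pass, periodic
# checkpoints generate sequential writes. Dataset, model, and paths are hypothetical.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for samples materialized from the gold layer.
dataset = TensorDataset(torch.randn(10_000, 64), torch.randint(0, 2, (10_000,)))
loader = DataLoader(dataset, batch_size=256, shuffle=True)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for batch, target in loader:   # reads from storage feed the forward pass
        optimizer.zero_grad()
        loss = loss_fn(model(batch), target)
        loss.backward()            # backward pass: no storage I/O
        optimizer.step()           # optimizer update: no storage I/O
    # Periodic checkpoint: sequential write of model and optimizer state.
    torch.save(
        {"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        f"/mnt/checkpoints/ckpt_epoch{epoch}.pt",  # hypothetical checkpoint location
    )
```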
Once a model is selected, the process moves to the Model Deployment stage, where the chosen model is integrated into a production environment via a deployment pipeline, making it available for application-level consumption.
Neither the Model Selection nor the Model Deployment stage is storage demanding.
Business Validation
This is the final phase. Here data scientists are responsible for verifying that the pipeline, model, and production deployment align with both customer and end-user goals.
Evaluating Resource Utilization
Having examined the AI/ML workflow and its storage demands, we can now evaluate which stages are the most resource intensive. This involves identifying phases where computational demand is highest and system utilization is sustained, as opposed to idle or waiting states.
Data scientists spend approximately 80% of their time preparing data [5]. Table 1 below highlights the most time-consuming resource for each phase and stage of the AI/ML workflow. Stages that involve human interaction tend to place a lighter load on the system. This is because, during activities such as analyzing data, evaluating data cleaning strategies, or designing new features, the system usage remains low while humans perform cognitive tasks. In contrast, stages with minimal human involvement, such as model training, typically apply higher pressure on system resources.
Table 1. Time-consuming resource for each phase and stage of the AI/ML workflow.
Based on this information, we addressed our first challenge by identifying "Model Building (Training)" as the AI/ML workflow stage that should be prioritized in our benchmark efforts.
The next challenge is determining how to measure model training performance with a focus on storage, in a world where GPUs are the most sought-after and expensive computational resource on Earth. This is where Deep Learning I/O (DLIO) comes into play to solve that challenge.
References
[1] Mathew, A., Amudha, P., & Sivakumari, S. (2021). Deep learning techniques: an overview. Advanced Machine Learning Technologies and Applications: Proceedings of AMLTA 2020, 599-608.
[2] What is the medallion lakehouse architecture?. Available from <https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion>. Accessed on 2025-08-05.
[3] Lazzeri, F. (2020). Machine learning for time series forecasting with Python. Wiley.
[4] Reis, J., Housley, M. (2023). Fundamentals of Data Engineering: plan and build robust data systems. O'Reilly.
[5] AI Data Pipelines: The Ultimate Guide. MLTwist. Available from: <https://mltwist.com/wp-content/uploads/2024/03/MLtwist-AI-Data-Pipelines-The-Ultimate-Guide.pdf>. Accessed on 2025-08-07.