Deep Learning (DL) is a subfield of Artificial Intelligence (AI) that focuses on creating large neural network models capable of making data-driven decisions [1].
While GPUs often take the spotlight in AI/ML infrastructure, storage plays a critical role throughout the pipeline. From storing raw datasets and engineered features to feeding data into GPUs during training, the performance of the storage system has a significant impact on the efficiency and scalability of these workloads.
Understanding how to configure a storage solution and its clients to support AI/ML pipelines isn't just helpful; it's essential.
In this series, we will delve into:
Part I - Identifying storage demands for Deep Learning workloads through workflow analysis
Part II - Deep Learning I/O: An approach to overcoming storage benchmarking challenges for Deep Learning workloads
Part III - The methodology for benchmarking storage performance for Training a UNET-3D model and its performance results
Part IV - The methodology for benchmarking storage performance for checkpointing a LLM and its performance results
We structured the series in the order above as it's important to understand the challenges, tools, and methods behind the data before diving into the performance results and insights.
*** TERMINOLOGY ALERT ***
If you are a data scientist, a machine learning engineer, a data engineer, or a data platform engineer, please note that throughout this series, the term "storage" refers specifically to the infrastructure component acting as a file system for your data. This includes cloud-based services such as Amazon FSx for NetApp ONTAP, Azure NetApp Files, and Google Cloud NetApp Volumes, as well as on-premises NetApp engineered systems like the AFF A-series and C-series. This distinction is important because "storage" can mean different things depending on your role or the system architecture you're working with.
Identifying Storage Demands for Deep Learning Workloads Through Workflow Analysis
One of the core challenges in measuring storage performance for deep learning workloads is identifying which phases (data ingestion, preprocessing, model training, inference, etc.) place the greatest demands on storage. This insight is essential for designing meaningful benchmarks, especially when data is accessed from multiple storage tiers based on the chosen data management strategy.
As deep learning models grow in complexity and scale, the performance of underlying storage systems becomes increasingly critical. From ingesting massive datasets to training models across distributed environments, each stage of the AI/ML pipeline interacts with storage in distinct ways.
We will walk through each phase of the AI/ML workflow to explain its purpose, expected load, and I/O patterns. To support this analysis, we will introduce the "Medallion Data Architecture" (Figure 1) and the AI/ML workflow template (Figure 2). This combined view allows us to examine the AI/ML process in the context of the underlying data infrastructure.
The "Medallion Architecture" is a popular data management strategy that organizes data into multiple layers (typically bronze, silver, gold) to progressively improve data quality and usability. This layered approach, often used in data lakehouses, facilitates data processing, cleansing, and transformation, making data more suitable for various analytics, business intelligence, and AI use cases [2].
Figure 1 shows an example of a "Medallion Architecture". The bronze layer acts as the landing zone for raw, unprocessed data from various sources. It focuses on capturing data as it arrives, without any transformations or quality checks. The silver layer is where data from the bronze layer is refined. This includes tasks like data validation, cleansing, and deduplication, ensuring a more reliable and consistent dataset. The gold layer hosts curated data. Here, domain-specific features can be extracted, and the data is optimized for consumption by business intelligence tools, dashboards, decision-making applications, and AI/ML pipelines.
Figure 1. Storage plays a central role in the data management lifecycle. Adapted from (REIS & HOUSLEY, 2023) with modifications.
Figure 2 illustrates an AI/ML workflow template developed by Francesca Lazzeri, PhD. In her book, Lazzeri emphasizes the significance of each phase within the workflow. While her template is tailored for time series forecasting, its structure is broadly applicable to a wide range of AI/ML workflows [3].
Figure 2. AI/ML Workflow Template. Adapted from (LAZZERI, 2020) with modifications.
Let's walk through the AI/ML workflow template and examine how each stage interacts with, or places demands on storage systems.
Business Understanding Phase
In this phase, there are no direct storage-related concerns. The focus is on understanding the business problem. Data scientists, machine learning engineers, and data engineers collaborate to define the problem, identify the types of data needed to solve it, and determine how to measure the success of the AI/ML solution.
Data Preparation Phase
In this phase, storage considerations begin to play a role. As shown in Figure 2 above, the data preparation phase subdivides further into specific stages, namely:
The data ingestion stage
The data exploration and understanding stage
The data pre-processing and feature development stage
During the Data Ingestion stage, data from multiple sources—whether in batch or streaming form—is ingested into the bronze layer of the data architecture. At this layer, the storage I/O pattern is primarily characterized by sequential write operations, driven by concurrent data streams from these sources.
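To make this pattern concrete, here is a minimal ingestion sketch that writes one raw batch to a file-based bronze layer with pyarrow. The mount point, schema, and records are hypothetical, and each concurrent source stream would run its own writer:

```python
# Minimal ingestion sketch: append record batches from a source stream
# to the bronze layer as Parquet files. Paths and schema are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

BRONZE_DIR = "/mnt/bronze/events"  # hypothetical mount point for the bronze layer

schema = pa.schema([
    ("event_id", pa.int64()),
    ("payload", pa.string()),
])

def ingest_batch(batch_id: int, records: list[dict]) -> None:
    """Write one batch of raw records as a sequential Parquet file."""
    table = pa.Table.from_pylist(records, schema=schema)
    # Each concurrent source stream produces its own sequential write stream.
    pq.write_table(table, f"{BRONZE_DIR}/batch-{batch_id:08d}.parquet")

# Example: one incoming batch from a single source
ingest_batch(0, [{"event_id": 1, "payload": "raw, unvalidated data"}])
```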
The next stage is Data Exploration and Understanding. It is at this stage that a data engineer or data scientist reads CSV or Parquet files from the bronze layer, exploring a subset of the dataset via a Jupyter Notebook to understand the data's shape, distribution, and cleaning requirements. The I/O pattern at this stage is mostly a light load of sequential read operations against the underlying storage.
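As an illustration, a notebook cell for this kind of exploratory read might look like the following sketch, assuming a hypothetical Parquet file on the bronze layer and using pandas:

```python
# Exploration sketch: read a subset of the bronze data in a notebook
# to inspect shape, distribution, and cleaning needs. Paths are hypothetical.
import pandas as pd

df = pd.read_parquet("/mnt/bronze/events/batch-00000000.parquet")  # light sequential read

print(df.shape)         # rows x columns
print(df.dtypes)        # column types
print(df.describe())    # basic distribution statistics
print(df.isna().sum())  # missing values inform cleaning requirements
```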
Now that the data is understood, data transformation begins at the Data Pre-Processing & Feature Engineering stage.
The first step of this stage, Data Pre-Processing, involves reading data from the bronze layer. Data engineers/scientists clean the full dataset, writing the results to the silver layer.
The second step, Feature Engineering, uses the silver layer as the input source. New features are derived from the cleaned data, and this new dataset is then written to the gold layer.
The I/O pattern of this multi-step stage involves multiple streams of sequential reads from bronze and multiple streams of sequential writes to silver during the cleaning step, as well as multiple streams of sequential reads from silver and multiple streams of sequential writes to gold during the feature engineering step.
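A minimal sketch of these two steps, assuming hypothetical bronze, silver, and gold mount points and an illustrative payload column, might look like this in pandas:

```python
# Two-step sketch: bronze to silver (cleaning), then silver to gold
# (feature engineering). Mount points and columns are hypothetical.
import pandas as pd

# Step 1, Data Pre-Processing: sequential reads from bronze,
# sequential writes to silver.
raw = pd.read_parquet("/mnt/bronze/events")     # read the full raw dataset
clean = raw.dropna().drop_duplicates()          # validate, cleanse, deduplicate
clean.to_parquet("/mnt/silver/events.parquet")

# Step 2, Feature Engineering: sequential reads from silver,
# sequential writes to gold.
silver = pd.read_parquet("/mnt/silver/events.parquet")
silver["payload_len"] = silver["payload"].str.len()  # derive a new feature
silver.to_parquet("/mnt/gold/events_features.parquet")
```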
Data Modeling Phase
This phase is divided into three stages: Model Building, Model Selection, and Model Deployment.
Training takes place during the Model Building stage. It is an iterative process: batches are read from storage into memory, each batch is passed through the neural network to produce predictions (forward pass), the loss is evaluated and gradients are computed (backward pass), and the optimizer updates the model's weights. This process repeats until every sample has been processed by the accelerators in play. If configured by the data scientist, checkpoints are periodically triggered to save the model's weights and state to persistent storage.
The I/O pattern involves multiple streams of sequential reads served by the gold layer feeding the forward pass, and multiple streams of sequential writes to persistent storage as part of the checkpoint process. Note that neither the backward pass nor the gradient/optimizer updates issue storage operations.
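To make the read and write points explicit, here is a minimal PyTorch-style training-loop sketch with a synthetic stand-in dataset and a hypothetical checkpoint directory. It illustrates where storage I/O occurs, not the benchmark methodology used later in this series:

```python
# Training-loop sketch showing where storage I/O occurs: the DataLoader
# reads batches (in practice, from the gold layer), and torch.save writes
# checkpoints. Dataset, model, and paths are hypothetical.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a dataset materialized from the gold layer
features = torch.randn(1024, 16)
labels = torch.randint(0, 2, (1024,))
loader = DataLoader(TensorDataset(features, labels), batch_size=64)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for x, y in loader:              # storage reads: batches from the gold layer
        loss = loss_fn(model(x), y)  # forward pass and loss evaluation
        optimizer.zero_grad()
        loss.backward()              # backward pass: no storage I/O
        optimizer.step()             # optimizer update: no storage I/O
    torch.save(                      # storage writes: periodic checkpoint
        {"epoch": epoch, "model": model.state_dict(), "optim": optimizer.state_dict()},
        f"/mnt/ckpt/checkpoint-{epoch}.pt",  # hypothetical checkpoint directory
    )
```

Note how the only storage operations in the loop are the DataLoader reads and the torch.save call; everything in between is compute.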
Once a model is selected, the process moves to the Model Deployment stage, where the chosen model is integrated into a production environment via a deployment pipeline, making it available for application-level consumption.
Neither the Model Selection nor the Model Deployment stage is storage demanding.
Business Validation
This is the final phase. Here data scientists are responsible for verifying that the pipeline, model, and production deployment align with both customer and end-user goals.
Evaluating Resource Utilization
Having examined the AI/ML workflow and its storage demands, we can now evaluate which stages are the most resource intensive. This involves identifying phases where computational demand is highest and system utilization is sustained, as opposed to idle or waiting states.
Data scientists spend approximately 80% of their time preparing data [5]. Table 1 below highlights the most time-consuming resource for each phase and stage of the AI/ML workflow. Stages that involve human interaction tend to place a lighter load on the system. This is because, during activities such as analyzing data, evaluating data cleaning strategies, or designing new features, the system usage remains low while humans perform cognitive tasks. In contrast, stages with minimal human involvement, such as model training, typically apply higher pressure on system resources.
Table 1. Time-consuming resource for each phase and stage of the AI/ML workflow.
Based on this information, we addressed our first challenge by identifying "Model Building (Training)" as the AI/ML workflow stage that should be prioritized in our benchmark efforts.
The next challenge is determining how to measure model training performance with a focus on storage, in a world where GPUs are among the most sought-after and expensive computational resources. This is where Deep Learning I/O (DLIO) comes into play to solve this challenge.
References
[1] Mathew, A., Amudha, P., & Sivakumari, S. (2021). Deep learning techniques: an overview. Advanced Machine Learning Technologies and Applications: Proceedings of AMLTA 2020, 599-608.
[2] What is the medallion lakehouse architecture? Available from <https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion>. Accessed on 2025-08-05.
[3] Lazzeri, F. (2020). Machine learning for time series forecasting with Python. Wiley.
[4] Reis, J., Housley, M. (2023). Fundamentals of Data Engineering: plan and build robust data systems. O'Reilly.
[5] AI Data Pipelines: The Ultimate Guide. MLTwist. Available from: <https://mltwist.com/wp-content/uploads/2024/03/MLtwist-AI-Data-Pipelines-The-Ultimate-Guide.pdf>. Accessed on 2025-08-07.
In today's fast-paced cloud environment, managing complex storage solutions efficiently is paramount. Google Cloud NetApp Volumes offers a fully managed, high-performance file storage service with integrated data protection. Its capabilities help you accelerate cloud deployments and migrations while retaining the performance and features of on-premises storage.
Gemini Cloud Assist now offers comprehensive support to NetApp Volumes users, providing valuable assistance across various aspects of product usage, from understanding fundamental concepts to helping with troubleshooting.
Product concepts
Gemini Cloud Assist helps users understand key product concepts related to NetApp Volumes. Users can ask questions about:
Volume service levels and their use cases. Understand the differences between various service levels, such as Standard, Premium, Extreme, and Flex, and when to use each based on performance and cost requirements.
Replication and disaster recovery. Learn about replication mechanisms, setting up disaster recovery solutions, and ensuring data availability.
Snapshots and backups. Gain insight into how snapshots work, their role in data protection, and best practices for creating and managing backups.
Networking and connectivity. Understand the networking requirements for NetApp Volumes, including VPC peering and private service access.
Pricing and cost optimization. Get explanations of the billing model, cost drivers, and strategies for optimizing expenses.
Instructions on how to use Cloud console and gcloud CLI
Gemini Cloud Assist provides step-by-step instructions for performing tasks using both the Cloud console and the gcloud CLI, including guidance on how to:
Create and manage volumes. Provision new volumes, modify existing ones, and delete them when they’re no longer needed (a brief programmatic sketch follows this list).
Configure replication. Set up replication relationships between volumes for high availability and disaster recovery.
Manage snapshots. Create, restore, and delete snapshots from the Cloud console and gcloud CLI.
Mount volumes. Connect and mount NetApp Volumes with virtual machines.
Monitor and log. Use Cloud Monitoring and Cloud Logging to track volume performance and troubleshoot issues.
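For readers who prefer programmatic access, a minimal sketch using the google-cloud-netapp Python client might look like the following. The project and location are placeholders, and the library (pip install google-cloud-netapp) plus application credentials are assumed to be set up:

```python
# Sketch: enumerate NetApp Volumes with the google-cloud-netapp client.
# Project and location below are placeholder values.
from google.cloud import netapp_v1

client = netapp_v1.NetAppClient()
parent = "projects/my-project/locations/us-central1"  # placeholder

for volume in client.list_volumes(parent=parent):
    print(volume.name, volume.state)
```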
Best practice suggestions
Gemini Cloud Assist offers valuable best practice suggestions to optimize the use of NetApp Volumes:
Performance optimization. Recommendations for choosing appropriate volume types, adjusting quotas, and configuring network settings for optimal performance.
Cost efficiency. Tips on right-sizing volumes, leveraging snapshots for cost-effective backups, and managing data lifecycles.
Security hardening. Suggestions for implementing robust security measures, including IAM policies, network security groups, and encryption.
High availability and disaster recovery. Best practices for designing resilient architectures and setting up effective disaster recovery plans.
Data migration strategies. Guidance on migrating data from on-premises NetApp® systems or other cloud environments to NetApp Volumes.
Troubleshooting guidance
When users encounter issues with their NetApp Volumes deployment, Gemini Cloud Assist can help with troubleshooting by providing insights into possible causes and solutions for common problems. It can also help by documenting the environment for others such as Cloud Customer Care support engineers. Guidance includes:
Connectivity issues. Diagnose network configuration problems, firewall rules, or VPC peering issues that are preventing access to volumes.
Performance bottlenecks. Identify factors contributing to slow performance, such as I/O limits, network latency, or application misconfiguration.
Volume creation failures. Understand common reasons why volume creation might fail, such as insufficient quotas or invalid parameters.
Snapshot and replication errors. Troubleshoot issues with snapshot creation, restoration, or replication failures.
Access and permissions problems. Resolve issues related to incorrect IAM roles or permissions that are preventing users from accessing volumes.
By using Gemini Cloud Assist, Google Cloud NetApp Volumes users can enhance their understanding of the product, streamline their operations, and effectively troubleshoot issues, resulting in a more efficient and productive experience. To get started, simply click the Gemini icon in Cloud console as shown in the following screenshot. If Gemini Cloud Assist, your trusted advisor, surprises you with its findings, feel free to leave a comment below.
Unlock blazing-fast, scalable search and analytics—anywhere. Discover how to deploy OpenSearch, the open-source search and observability powerhouse, on NetApp ONTAP (FSxN) storage across AWS, Azure, Google Cloud, or on-premises. This guide walks you through a robust reference architecture, step-by-step deployment, and real-world performance benchmarks. Learn how NetApp’s enterprise-grade storage supercharges OpenSearch clusters with seamless scalability, high availability, and operational simplicity—no matter where your data lives.