Tech ONTAP Blogs
Deep Learning (DL) is a subfield of Artificial Intelligence (AI) that focuses on creating large neural network models capable of making data-driven decisions [1].
While GPUs often take the spotlight in AI/ML infrastructure, storage plays a critical role throughout the pipeline. From storing raw datasets and engineered features to feeding data into GPUs during training, the performance of the storage system has a significant impact on the efficiency and scalability of these workloads.
Understanding how to configure a storage solution and clients to support AI/ML pipelines isn't just helpful, it's essential.
In this series, we will delve into the challenges, tools, and methods behind the data, and then into the performance results and insights.
We structured the series in this order because understanding how the data is produced and consumed is essential context for interpreting the benchmark results that follow.
*** TERMINOLOGY ALERT ***
If you are a data scientist, a machine learning engineer, a data engineer, or a data platform engineer, please note that throughout this series the term "storage" refers specifically to the infrastructure component acting as a file system for your data. This includes cloud-based services such as Amazon FSx for NetApp ONTAP, Azure NetApp Files, and Google Cloud NetApp Volumes, as well as on-premises NetApp engineered systems like the AFF A-series and C-series. This distinction is important because "storage" can mean different things depending on your role or the system architecture you're working with.
One of the core challenges in measuring storage performance for deep learning workloads is identifying which phases (data ingestion, preprocessing, model training, inference, etc.) place the greatest demands on storage. This insight is essential for designing meaningful benchmarks, especially when data is accessed from multiple storage tiers based on the chosen data management strategy.
As deep learning models grow in complexity and scale, the performance of underlying storage systems becomes increasingly critical. From ingesting massive datasets to training models across distributed environments, each stage of the AI/ML pipeline interacts with storage in distinct ways.
We will walk through each phase of the AI/ML workflow to explain its purpose, expected load and I/O patterns. To support this analysis, we will introduce the "Medallion Data Architecture" (Figure 1) and the AI/ML workflow template (Figure 2). This combined view allows us to examine the AI/ML process in the context of the underlying data infrastructure.
The "Medallion Architecture" is a popular data management strategy that organizes data into multiple layers (typically bronze, silver, gold) to progressively improve data quality and usability. This layered approach, often used in data lakehouses, facilitates data processing, cleansing, and transformation, making data more suitable for various analytics, business intelligence, and AI use cases [2].
Figure 1 shows an example of a "Medallion Architecture". The bronze layer acts as the landing zone for raw, unprocessed data from various sources. It focuses on capturing data as it arrives, without any transformations or quality checks. The silver layer is where data from the bronze layer is refined. This includes tasks like data validation, cleansing, and deduplication, ensuring a more reliable and consistent dataset. The gold layer hosts curated data. Here, domain-specific features can be extracted, and the data is optimized for consumption by business intelligence tools, dashboards, decision-making applications, and AI/ML pipelines.
Figure 1. Storage plays a central role in the data management lifecycle. Adapted from (Reis & Housley, 2023) with modifications.
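To make the layer boundaries concrete from a storage point of view, here is a minimal sketch, assuming a hypothetical setup in which each medallion layer maps to a directory on an NFS-mounted volume. The mount points are illustrative only, and the same paths reappear in the sketches later in this post.

# Hypothetical mapping of medallion layers to NFS-mounted volume paths.
# These paths are illustrative; adjust them to your environment.
MEDALLION_LAYERS = {
    "bronze": "/mnt/datalake/bronze",  # raw, unprocessed landing zone
    "silver": "/mnt/datalake/silver",  # validated, cleansed, deduplicated data
    "gold":   "/mnt/datalake/gold",    # curated, feature-ready data for BI and AI/ML
}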
Figure 2 illustrates an AI/ML workflow template developed by Francesca Lazzeri, PhD. In her book, Lazzeri emphasizes the significance of each phase within the workflow. While her template is tailored for time series forecasting, its structure is broadly applicable to a wide range of AI/ML workflows [3].
Figure 2. AI/ML Workflow Template. Adapted from (Lazzeri, 2020) with modifications.
Let's walk through the AI/ML workflow template and examine how each stage interacts with, or places demands on storage systems.
In the first phase of the workflow, there are no direct storage-related concerns. The focus is on understanding the business problem. Data scientists, machine learning engineers, and data engineers collaborate to define the problem, identify the types of data needed to solve it, and determine how to measure the success of the AI/ML solution.
In the data preparation phase, storage considerations begin to play a role. As shown in Figure 2 above, this phase subdivides further into specific stages: Data Ingestion, Data Exploration and Understanding, and Data Pre-Processing & Feature Engineering.
During the Data Ingestion stage, data from multiple sources—whether in batch or streaming form—is ingested into the bronze layer of the data architecture. At this layer, the storage I/O pattern is primarily characterized by sequential write operations, driven by concurrent data streams from these sources.
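As a minimal sketch of what this looks like from the client side, the snippet below assumes a hypothetical micro-batch ingestion job that lands incoming records as Parquet files in the bronze directory introduced earlier; the paths, schema, and function names are illustrative, not part of the original workflow.

import datetime
import os
import pandas as pd

BRONZE = "/mnt/datalake/bronze"  # hypothetical NFS-mounted bronze layer volume

def ingest_batch(records: list[dict], source: str) -> str:
    """Persist one micro-batch of raw records as a Parquet file (one sequential write)."""
    df = pd.DataFrame(records)
    os.makedirs(f"{BRONZE}/{source}", exist_ok=True)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S%f")
    path = f"{BRONZE}/{source}/events_{stamp}.parquet"
    df.to_parquet(path, index=False)  # written as a single sequential stream
    return path

# Several ingestion workers running this concurrently produce the
# multi-stream sequential-write pattern described above.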
The next stage is Data Exploration and Understanding. It is at this stage that a data engineer or data scientist reads CSV or Parquet files from the bronze layer. They explore a subset of the dataset via a Jupyter Notebook to understand the data's shape, distribution, and cleaning requirements. The I/O pattern at this stage will be mostly a light load of sequential read operations against the underlying storage.
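A hedged example of this kind of exploratory read, assuming pandas in a notebook and the hypothetical bronze path used above (the file name and columns are placeholders):

import pandas as pd

BRONZE = "/mnt/datalake/bronze"  # hypothetical NFS-mounted bronze layer volume

# Read a subset of one raw file; a light load of mostly sequential reads.
df = pd.read_parquet(f"{BRONZE}/clickstream/events_sample.parquet",
                     columns=["user_id", "event_type", "event_ts"])
df.info()                            # shape and dtypes
print(df.describe(include="all"))    # distribution overview
print(df.isna().mean())              # per-column null ratio -> cleaning requirements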
Once the data is understood, transformation begins in the Data Pre-Processing & Feature Engineering stage.
The first step of this stage, Data Pre-Processing, involves reading data from the bronze layer. Data engineers/scientists clean the full dataset, writing the results to the silver layer.
The second step, Feature Engineering, uses the silver layer as the input source. New features are derived from the cleaned data, and this new dataset is then written to the gold layer.
The I/O pattern of this multi-step stage involves multiple streams of sequential reads from bronze and multiple streams of sequential writes to silver during the cleaning step, as well as multiple streams of sequential reads from silver and multiple streams of sequential writes to gold during feature engineering.
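A minimal sketch of both steps, assuming pandas and the hypothetical layer paths introduced earlier; the cleaning rules and derived features are placeholders chosen only to show where the reads and writes land.

import pandas as pd

BRONZE = "/mnt/datalake/bronze"
SILVER = "/mnt/datalake/silver"
GOLD = "/mnt/datalake/gold"

# Step 1: Data Pre-Processing -- read raw data from bronze, clean it, write to silver.
raw = pd.read_parquet(f"{BRONZE}/clickstream/")          # sequential reads from bronze
clean = (raw.drop_duplicates()
            .dropna(subset=["user_id", "event_ts"]))
clean.to_parquet(f"{SILVER}/clickstream_clean.parquet", index=False)  # sequential writes to silver

# Step 2: Feature Engineering -- read cleaned data from silver, derive features, write to gold.
clean = pd.read_parquet(f"{SILVER}/clickstream_clean.parquet")        # sequential reads from silver
features = (clean.assign(event_ts=pd.to_datetime(clean["event_ts"]))
                 .assign(event_hour=lambda d: d["event_ts"].dt.hour)
                 .groupby("user_id")
                 .agg(events_per_user=("event_type", "count"),
                      most_active_hour=("event_hour", lambda s: s.mode().iloc[0])))
features.to_parquet(f"{GOLD}/user_features.parquet")                  # sequential writes to gold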
The modeling phase is divided into three stages: Model Building, Model Selection, and Model Deployment.
Training takes place during the Model Building stage. It is an iterative process in which batches are read from storage into memory. Each batch feeds the neural network's forward pass, the loss is evaluated and gradients are computed during the backward pass, and the optimizer updates the model's weights. This process continues until all samples have been processed by the accelerators in play. If configured by the data scientist, checkpoints are periodically triggered to save the model's weights and state to persistent storage.
The I/O pattern involves multiple streams of sequential reads served by the gold layer feeding the forward pass, and multiple streams of sequential writes to persistent storage as part of the checkpoint process. Be aware that neither the backward pass nor the gradient/optimizer updates issue storage operations.
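The sketch below illustrates this pattern with PyTorch, assuming a hypothetical set of pre-serialized sample files in the gold layer and a hypothetical checkpoint volume; the model, dataset class, and paths are placeholders, not part of the original post.

import glob
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

GOLD = "/mnt/datalake/gold"    # hypothetical NFS-mounted gold layer volume
CKPT_DIR = "/mnt/checkpoints"  # hypothetical checkpoint volume

class GoldDataset(Dataset):
    """Placeholder dataset: one pre-serialized (sample, label) tensor pair per file."""
    def __init__(self, path):
        self.files = sorted(glob.glob(f"{path}/train/*.pt"))
    def __len__(self):
        return len(self.files)
    def __getitem__(self, idx):
        # Each call reads one sample file from the gold layer (storage read).
        return torch.load(self.files[idx])

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# Multiple DataLoader workers produce the multi-stream reads served by the gold layer.
loader = DataLoader(GoldDataset(GOLD), batch_size=256, shuffle=True, num_workers=8)

for epoch in range(10):
    for samples, labels in loader:               # batches read from storage into memory
        optimizer.zero_grad()
        loss = loss_fn(model(samples), labels)   # forward pass (no storage I/O)
        loss.backward()                          # backward pass (no storage I/O)
        optimizer.step()                         # optimizer update (no storage I/O)
    # Periodic checkpoint: sequential writes of model weights and state to persistent storage.
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict()},
               f"{CKPT_DIR}/ckpt_epoch_{epoch}.pt")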
Once a model is selected, the process moves to the Model Deployment stage, where the chosen model is integrated into a production environment via a deployment pipeline, making it available for application-level consumption.
Neither Model Selection nor Model Deployment is a storage-demanding stage.
This is the final phase of the workflow. Here, data scientists are responsible for verifying that the pipeline, model, and production deployment align with both customer and end-user goals.
Having examined the AI/ML workflow and its storage demands, we can now evaluate which stages are the most resource intensive. This involves identifying phases where computational demand is highest and system utilization is sustained, as opposed to idle or waiting states.
Data scientists spend approximately 80% of their time preparing data [5]. Table 1 below highlights the most time-consuming resource for each phase and stage of the AI/ML workflow. Stages that involve human interaction tend to place a lighter load on the system. This is because, during activities such as analyzing data, evaluating data cleaning strategies, or designing new features, the system usage remains low while humans perform cognitive tasks. In contrast, stages with minimal human involvement, such as model training, typically apply higher pressure on system resources.
Table 1. Time-consuming resource for each phase and stage of the AI/ML workflow.
Based on this information, we addressed our first challenge by identifying "Model Building (Training)" as the AI/ML workflow stage that should be prioritized in our benchmark efforts.
The next challenge is determining how to measure model training performance with a focus on storage, in a world where GPUs are the most sought-after and expensive computational resource on Earth. This is where the Deep Learning I/O (DLIO) benchmark comes into play to solve this challenge.
[1] Mathew, A., Amudha, P., & Sivakumari, S. (2021). Deep learning techniques: an overview. Advanced Machine Learning Technologies and Applications: Proceedings of AMLTA 2020, 599-608.
[2] What is the medallion lakehouse architecture? Available from <https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion>. Accessed on 2025-08-05.
[3] Lazzeri, F. (2020). Machine learning for time series forecasting with Python. Wiley.
[4] Reis, J., & Housley, M. (2023). Fundamentals of Data Engineering: plan and build robust data systems. O'Reilly.
[5] AI Data Pipelines: The Ultimate Guide. MLTwist. Available from <https://mltwist.com/wp-content/uploads/2024/03/MLtwist-AI-Data-Pipelines-The-Ultimate-Guide.pdf>. Accessed on 2025-08-07.