Models don't usually fail because the code went rogue. They fail because the data moved. Schemas shift, labels drift, "latest.csv" isn't what you think it is, and auditors (compliance regulators, or even lawsuits) show up without an invite. If you've ever retrained the same code and got a different model, you've met the real culprit: unpinned and unprovenanced data.
Almost everyone talks about solutions "at inference," which is far too late to start thinking about compliance-based AI architectures. This blog post starts at the beginning, where it should: from the data scientist's perspective. How can we prove that the data, the most critical component, was used to produce the model, LLM, fine-tune, and embeddings in our solution?
Deep Learning (DL) is the subfield of Artificial Intelligence (AI) that focuses on creating large neural network models capable of data-driven decisions [1].
While GPUs often take the spotlight in AI/ML infrastructure, storage plays a critical role throughout the pipeline. From storing raw datasets and engineered features to feeding data into GPUs during training, the performance of the storage system has a significant impact on the efficiency and scalability of these workloads.
Understanding how to configure a storage solution and clients to support AI/ML pipelines isn't just helpful, it's essential.
In this series, we will delve into:
Part I - Identifying storage demands for Deep Learning workloads through workflow analysis
Part II - Deep Learning I/O: An approach to overcome storage benchmarking challenges for Deep Learning workloads
Part III - The methodology for benchmarking storage performance for training a UNET-3D model, and its performance results
Part IV - The methodology for benchmarking storage performance for checkpointing an LLM, and its performance results
We structured the series in the order above as it's important to understand the challenges, tools, and methods behind the data before diving into the performance results and insights.
*** TERMINOLOGY ALERT ***
If you are a data scientist, a machine learning engineer, a data engineer, or a data platform engineer, please note that throughout this series the term "storage" refers specifically to the infrastructure component acting as a file system for your data. This includes cloud-based services such as AWS FSx for NetApp ONTAP, Azure NetApp Files, and Google Cloud NetApp Volumes, as well as on-premises NetApp engineered systems like the AFF A-series and C-series. This distinction is important because "storage" can mean different things depending on your role or the system architecture you're working with.
Identifying Storage Demands for Deep Learning Workloads Through Workflow Analysis
One of the core challenges in measuring storage performance for deep learning workloads is identifying which phases (data ingestion, preprocessing, model training, inference, etc.) place the greatest demands on storage. This insight is essential for designing meaningful benchmarks, especially when data is accessed from multiple storage tiers based on the chosen data management strategy.
As deep learning models grow in complexity and scale, the performance of underlying storage systems becomes increasingly critical. From ingesting massive datasets to training models across distributed environments, each stage of the AI/ML pipeline interacts with storage in distinct ways.
We will walk through each phase of the AI/ML workflow to explain its purpose, expected load and I/O patterns. To support this analysis, we will introduce the "Medallion Data Architecture" (Figure 1) and the AI/ML workflow template (Figure 2). This combined view allows us to examine the AI/ML process in the context of the underlying data infrastructure.
The "Medallion Architecture" is a popular data management strategy that organizes data into multiple layers (typically bronze, silver, gold) to progressively improve data quality and usability. This layered approach, often used in data lakehouses, facilitates data processing, cleansing, and transformation, making data more suitable for various analytics, business intelligence, and AI use cases [2].
Figure 1 shows an example of a "Medallion Architecture". The bronze layer acts as the landing zone for raw, unprocessed data from various sources. It focuses on capturing data as it arrives, without any transformations or quality checks. The silver layer is where data from the bronze layer is refined. This includes tasks like data validation, cleansing, and deduplication, ensuring a more reliable and consistent dataset. The gold layer hosts curated data. Here, domain-specific features can be extracted, and the data is optimized for consumption by business intelligence tools, dashboards, decision-making applications, and AI/ML pipelines.
Figure 1. Storage plays a central role in the data management lifecycle. Adapted from Reis & Housley (2023) [4], with modifications.
Figure 2 illustrates an AI/ML workflow template developed by Francesca Lazzeri, PhD. In her book, Lazzeri emphasizes the significance of each phase within the workflow. While her template is tailored for time series forecasting, its structure is broadly applicable to a wide range of AI/ML workflows [3].
Figure 2. AI/ML Workflow Template. Adapted from Lazzeri (2020) [3], with modifications.
Let's walk through the AI/ML workflow template and examine how each stage interacts with, or places demands on storage systems.
Business Understanding Phase
In this phase there are no direct storage-related concerns. The focus is on understanding the business problem. Data scientists, machine learning engineers, and data engineers collaborate to define the problem, identify the types of data needed to solve it, and determine how to measure the success of the AI/ML solution.
Data Preparation Phase
In this phase, storage considerations begin to play a role. As shown in Figure 2 above, the data preparation phase subdivides further into specific stages, namely:
The data ingestion stage
The data exploration and understanding stage
The data pre-processing and feature development stage
During the Data Ingestion stage, data from multiple sources—whether in batch or streaming form—is ingested into the bronze layer of the data architecture. At this layer, the storage I/O pattern is primarily characterized by sequential write operations, driven by concurrent data streams from these sources.
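As a rough illustration, an ingestion job for one such stream might look like the minimal Python sketch below. The paths and the fetch_next_batch() helper are hypothetical; the point is that each incoming micro-batch lands in the bronze layer as a new file, producing the sequential write streams described above.
# Hypothetical ingestion sketch: each micro-batch is appended to the bronze layer as a new file.
import time
import pandas as pd

BRONZE = "/data/bronze/clickstream"   # illustrative bronze-layer path

def fetch_next_batch() -> pd.DataFrame:
    # Placeholder for a real source (message queue, REST endpoint, CDC feed, etc.)
    return pd.DataFrame({"user_id": [1, 2], "event": ["click", "view"], "ts": [time.time()] * 2})

for i in range(10):
    batch = fetch_next_batch()
    # Sequential write of a new file into the bronze landing zone, with no transformations
    batch.to_parquet(f"{BRONZE}/events_{i}.parquet", index=False)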
The next stage is Data Exploration and Understanding. At this stage, a data engineer or data scientist reads CSV or Parquet files from the bronze layer, exploring a subset of the dataset via a Jupyter Notebook to understand the data's shape, distribution, and cleaning requirements. The I/O pattern at this stage is mostly a light load of sequential read operations against the underlying storage.
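In practice, this exploration often amounts to a few notebook cells like the hypothetical sketch below, which reads a sample of the bronze data and summarizes it:
# Hypothetical exploration sketch run from a Jupyter Notebook.
import pandas as pd

# Read a subset of columns from the bronze layer (a light stream of sequential reads)
df = pd.read_parquet("/data/bronze/clickstream", columns=["user_id", "event"])
print(df.shape)          # shape of the dataset
print(df.describe())     # distribution summary
print(df.isna().mean())  # fraction of missing values, hinting at cleaning requirements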
Now that data is understood, it’s at the Data Pre-Processing & Feature Engineering stage that data transformation begins.
The first step of this stage, Data Pre-Processing, involves reading data from the bronze layer. Data engineers/scientists clean the full dataset, writing the results to the silver layer.
The second step, Feature Engineering, uses the silver layer as the input source. New features are derived from the cleaned data, and this new dataset is then written to the gold layer.
The I/O pattern of this multi-step stage involves multiple streams of sequential reads from bronze and multiple streams of sequential writes to silver during the cleaning phase, as well as multiple streams of sequential reads from silver and multiple streams of sequential writes to gold.
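A minimal sketch of these two hops, again with hypothetical paths and columns, could look like this:
# Hypothetical pre-processing and feature engineering sketch across the medallion layers.
import pandas as pd

BRONZE = "/data/bronze/clickstream"
SILVER = "/data/silver/clickstream"
GOLD = "/data/gold/clickstream"

# Step 1: Data Pre-Processing -- sequential reads from bronze, sequential writes to silver
raw = pd.read_parquet(BRONZE)
clean = raw.dropna().drop_duplicates()
clean.to_parquet(f"{SILVER}/events_clean.parquet", index=False)

# Step 2: Feature Engineering -- sequential reads from silver, sequential writes to gold
clean = pd.read_parquet(f"{SILVER}/events_clean.parquet")
features = clean.assign(events_per_user=clean.groupby("user_id")["event"].transform("count"))
features.to_parquet(f"{GOLD}/events_features.parquet", index=False)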
Data Modeling Phase
This phase is divided into three stages, Model Building, Model Selection, and Model Deployment.
Training takes place during the Model Building stage. It is an iterative process: batches are read from storage into memory and passed through the neural network to produce predictions (forward pass), the loss and gradients are computed (backward pass), and the optimizer updates the model's weights. This process continues until all samples have been processed by the accelerators in play. If configured by the data scientist, checkpoints are periodically triggered to save the model's weights and state to persistent storage.
The I/O pattern involves multiple streams of sequential reads served by the gold layer feeding the forward pass, and multiple streams of sequential writes to persistent storage as part of the checkpoint process. Note that neither the backward pass nor the gradient/optimizer updates issue storage operations.
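A stripped-down PyTorch-style sketch (the dataset, model, and checkpoint path are placeholders) makes these storage touchpoints explicit: the DataLoader iteration reads batches, the forward/backward/optimizer steps are pure computation, and torch.save periodically writes a checkpoint.
# Hypothetical training-loop sketch highlighting which steps touch storage.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))  # stand-in for gold-layer data
loader = DataLoader(dataset, batch_size=64, shuffle=True)             # batches read from storage into memory
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step, (x, y) in enumerate(loader, start=1):
    loss = nn.functional.mse_loss(model(x), y)   # forward pass (compute only, no storage I/O)
    loss.backward()                              # backward pass (compute only, no storage I/O)
    optimizer.step()                             # gradient/optimizer update (compute only)
    optimizer.zero_grad()
    if step % 8 == 0:                            # periodic checkpoint: sequential writes to persistent storage
        torch.save({"step": step, "model": model.state_dict()}, "checkpoint.pt")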
During the Model Selection stage, candidate models are evaluated against the chosen success metrics and the best-performing one is picked. Once a model is selected, the process moves to the Model Deployment stage, where the chosen model is integrated into a production environment via a deployment pipeline, making it available for application-level consumption.
Neither the Model Selection nor the Model Deployment stage places significant demands on storage.
Business Validation
This is the final phase. Here data scientists are responsible for verifying that the pipeline, model, and production deployment align with both customer and end-user goals.
Evaluating Resource Utilization
Having examined the AI/ML workflow and its storage demands, we can now evaluate which stages are the most resource intensive. This involves identifying phases where computational demand is highest and system utilization is sustained, as opposed to idle or waiting states.
Data scientists spend approximately 80% of their time preparing data [5]. Table 1 below highlights the most time-consuming resource for each phase and stage of the AI/ML workflow. Stages that involve human interaction tend to place a lighter load on the system. This is because, during activities such as analyzing data, evaluating data cleaning strategies, or designing new features, the system usage remains low while humans perform cognitive tasks. In contrast, stages with minimal human involvement, such as model training, typically apply higher pressure on system resources.
Table 1. Time-consuming resource for each phase and stage of the AI/ML workflow.
Based on this information, we addressed our first challenge by identifying "Model Building (Training)" as the AI/ML workflow stage that should be prioritized in our benchmark efforts.
The next challenge is determining how to measure model training performance with a focus on storage, in a world where GPUs are the most sought-after and expensive computational resource on Earth. This is where Deep Learning I/O (DLIO) comes into play.
References
[1] Mathew, A., Amudha, P., & Sivakumari, S. (2021). Deep learning techniques: an overview. Advanced Machine Learning Technologies and Applications: Proceedings of AMLTA 2020, 599-608.
[2] What is the medallion lakehouse architecture?. Available from <https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion>. Accessed on 2025-08-05.
[3] Lazzeri, F. (2020). Machine learning for time series forecasting with Python. Wiley.
[4] Reis, J., Housley, M. (2023). Fundamentals of Data Engineering: plan and build robust data systems. O'Reilly.
[5] AI Data Pipelines: The Ultimate Guide. MLTwist. Available from: <https://mltwist.com/wp-content/uploads/2024/03/MLtwist-AI-Data-Pipelines-The-Ultimate-Guide.pdf>. Accessed on 2025-08-07.
In the first post of our series, we explored the AI/ML workflow through the lens of a Medallion Data Architecture. We explained our rationale to identify the key stages of the pipeline to target for storage benchmarking.
In this post, we introduce DLIO, a benchmarking tool purpose-built to simulate the I/O patterns of Deep Learning (DL) workloads. We'll walk you through its capabilities and show how it enables storage benchmarking without the need for any AI hardware.
Deep Learning I/O (DLIO)
DLIO is a benchmark tool to emulate the I/O pattern and behavior of deep learning applications [1a]. It was designed to emulate the AI/ML training process with the intent to measure how fast data is served from storage to RAM.
During the training process, data is loaded in batches concurrently through multiple threads while accelerators execute training. After processing each batch, the accelerator triggers a request to the host, prompting the loading of another batch from storage. This iterative cycle guarantees uninterrupted data processing, contributing to the efficiency of the training process [1b].
Many new AI accelerators (e.g., GPUs, DPUs, TPUs, Cerebras systems) have been designed and deployed to accelerate computation during training [2]. This hardware is not cheap, but the good news is that you don't need any AI hardware to run DLIO and benchmark your storage solution for your AI/ML pipeline.
Deep learning frameworks like PyTorch and TensorFlow provide an abstraction called a data loader, which simplifies key aspects of data handling such as batching, shuffling, and parallel data loading.
When you iterate over a data loader instance, it triggers I/O operations - this is when the data loader opens files, reads samples, and prepares them for processing. Once the data is transferred to the GPU, the computation phase begins, including forward and back propagation. Interestingly, during this computation phase, no I/O operations related to training occur.
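A small PyTorch sketch (the file list and dataset class are hypothetical) shows where the I/O actually happens: each __getitem__ call opens and reads a sample file, so storage is exercised while you iterate over the loader, not while the accelerator computes.
# Hypothetical sketch: the I/O happens inside the data loader, not during computation.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NpyFileDataset(Dataset):
    def __init__(self, file_list):
        self.file_list = file_list
    def __len__(self):
        return len(self.file_list)
    def __getitem__(self, idx):
        sample = np.load(self.file_list[idx])     # file open + read: this is the storage I/O
        return torch.from_numpy(sample)

files = [f"/data/gold/samples/sample_{i}.npy" for i in range(1000)]      # placeholder file list
loader = DataLoader(NpyFileDataset(files), batch_size=4, num_workers=8)  # parallel loading workers

for batch in loader:    # iterating the loader triggers the file reads
    pass                # forward and back propagation would run here, issuing no training I/O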
Therefore, if you want to measure how efficiently your storage solution delivers data to the GPU, you should focus specifically on the performance of the data loading mechanism. The authors of DLIO recognized this pattern and came up with an elegant solution, shown in Figure 1: replacing the computation stage with a sleep function.
The duration of the sleep function should match the time a specific GPU model takes to perform the forward and back propagation when training a given model. This approach allows researchers to isolate and accurately measure the performance of the data loading stage without the need to invest in GPU hardware.
Figure 1. DLIO solution for storage benchmark. Adapted from [1b] with modifications.
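Conceptually, the emulation loop looks like the sketch below (the loader stand-in and the 0.323-second value are assumptions used for illustration): the forward and back propagation are replaced by a sleep of the measured duration, so any additional wall-clock time is attributable to data loading.
# Conceptual sketch of DLIO's approach: replace accelerator computation with a sleep of equal duration.
import time

computation_time = 0.323   # assumed seconds per step for the emulated accelerator
loader = range(20)         # stand-in for a real data loader

total_compute = 0.0
start = time.perf_counter()
for batch in loader:                 # data loading: the part we actually want to measure
    time.sleep(computation_time)     # emulates the forward and back propagation on the accelerator
    total_compute += computation_time
elapsed = time.perf_counter() - start
print(f"time not covered by compute (i.e., data loading): {elapsed - total_compute:.2f} s of {elapsed:.2f} s")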
The DLIO benchmark achieves over 90% similarity in I/O behavior with the real applications it emulates, which validates DLIO as an accurate representation of those applications. The remaining 3-6% difference exists because real applications have a distribution of transfer request sizes, which is represented as a single median request size within the benchmark [2].
DLIO includes a variety of deep learning workload examples, such as UNET-3D, Cosmoflow, ResNet50, and LLaMA 3. It also supports the creation of customized workloads through a flexible configuration system.
Let's take a closer look at DLIO. Let me show you the steps I followed to get it working on my Ubuntu 22.04 virtual machine.
DLIO Installation Steps
I began by setting up a virtual machine running Ubuntu Server, opting for the minimal installation to keep the environment lightweight. I'm currently using Ubuntu 22.04, which includes Python 3.10.12 by default. As of this writing, Python 3.10.12 is the required version for installing DLIO without any compatibility issues. Once your VM is ready, you need to follow the steps outlined below.
1. Begin by installing the OS packages required by DLIO. Pay special attention to the MPI package. Based on my experience, MPICH tends to be more straightforward to work with compared to OpenMPI.
sudo apt install -y build-essential git vim sysstat cmake libhdf5-dev hwloc libhwloc-dev mpich libmpich-dev bc
2. Clone the DLIO repository.
git clone https://github.com/argonne-lcf/dlio_benchmark.git
3. Install the python modules required by DLIO:
pip3 install -r dlio_benchmark/requirements.txt
4. To avoid some warning messages thrown out by TensorFlow when running a workload, install the following package:
pip3 install tensorflow-cpu
5. Change to the DLIO directory and install dlio_benchmark:
cd dlio_benchmark ; pip3 install .
6. Run the dlio_benchmark command to test if your installation has been successful:
mpirun -np 8 dlio_benchmark workload=unet3d_a100 ++workload.workflow.generate_data=True ++workload.workflow.train=False
If you run dlio_benchmark and encounter an error indicating that the shared library libmpi.so.12 is missing, execute the command below and try again:
cd /lib/x86_64-linux-gnu ; ln -s libmpich.so.12 libmpi.so.12
Next, let me show you how DLIO works its magic to measure storage performance for deep learning workloads, from loading the datasets to simulating the computation stage.
DLIO Execution Flow
DLIO begins by initializing the MPI stack via the DLIOMPI.get_instance().initialize() method.
# dlio_benchmark/main.py
def main() -> None:
"""
The main method to start the benchmark runtime.
"""
DLIOMPI.get_instance().initialize()
run_benchmark()
DLIOMPI.get_instance().finalize()
The DLIOMPI.initialize() method sets up the MPI environment by calling MPI.Init() , updates the MPI state to MPIState.MPI_INITIALIZED , and opens the MPI.COMM_WORLD communicator, which encompasses all participating processes.
# dlio_benchmark/utils/utility.py
class DLIOMPI:
...
def initialize(self):
from mpi4py import MPI
if self.mpi_state == MPIState.UNINITIALIZED:
# MPI may have already been initialized by dlio_benchmark_test.py
if not MPI.Is_initialized():
MPI.Init()
self.mpi_state = MPIState.MPI_INITIALIZED
self.mpi_rank = MPI.COMM_WORLD.rank
self.mpi_size = MPI.COMM_WORLD.size
self.mpi_world = MPI.COMM_WORLD
split_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
# Get the number of nodes
self.mpi_ppn = split_comm.size
self.mpi_local_rank = split_comm.rank
self.mpi_nodes = self.mpi_size//split_comm.size
elif self.mpi_state == MPIState.CHILD_INITIALIZED:
raise Exception(f"method {self.classname()}.initialize() called in a child process")
else:
pass # redundant call
Next, the run_benchmark() function is invoked, which instantiates a DLIOBenchmark object using a workload configuration. This configuration defines parameters such as the directory where training and checkpoint files are stored, the number of training files, and the batch size, among other options needed to set up a training workload. The benchmark is then executed through a sequence of method calls: initialize(), run(), finalize() .
# dlio_benchmark/main.py
@hydra.main(version_base=None, config_path="configs", config_name="config")
def run_benchmark(cfg: DictConfig):
benchmark = DLIOBenchmark(cfg['workload'])
benchmark.initialize()
benchmark.run()
benchmark.finalize()
The run() method coordinates the training process across all epochs. For each epoch, it prepares the dataset for reading, performs training, and records execution stats using the StatsCounter class via the stats property of the benchmark object.
Training is initiated by the line steps = self._train(epoch) . To understand the training execution in detail, let's examine the _train(self, epoch) method.
# dlio_benchmark/main.py
...
class DLIOBenchmark:
...
@dlp.log
def run(self):
...
if (not self.generate_only) and (not self.args.checkpoint_only):
...
for epoch in range(1, self.epochs + 1):
self.stats.start_epoch(epoch)
self.next_checkpoint_step = self.steps_between_checkpoints
self.stats.start_train(epoch)
steps = self._train(epoch)
self.stats.end_train(epoch, steps)
self.logger.debug(f"{utcnow()} Rank {self.my_rank} returned after {steps} steps.")
self.framework.get_loader(DatasetType.TRAIN).finalize()
# Perform evaluation if enabled
if self.do_eval and epoch >= next_eval_epoch:
next_eval_epoch += self.epochs_between_evals
self.stats.start_eval(epoch)
self._eval(epoch)
self.stats.end_eval(epoch)
self.framework.get_loader(DatasetType.VALID).finalize()
self.args.reconfigure(epoch + 1) # reconfigure once per epoch
self.stats.end_epoch(epoch)
if (self.args.checkpoint_only):
self._checkpoint()
self.stats.end_run()
The data is loaded in batches via the for batch in loader.next(): loop. The interesting part here is how the training computation is simulated using a sleep function. This simulation begins with the call to self.framework.compute(batch, epoch, block_step, self.computation_time) .
# dlio_benchmark/main.py
...
class DLIOBenchmark:
...
def _train(self, epoch):
"""
Training loop for reading the dataset and performing training computations.
:return: returns total steps.
"""
block = 1 # A continuous period of training steps, ended by checkpointing
block_step = overall_step = 1 # Steps are taken within blocks
max_steps = math.floor(self.num_samples * self.num_files_train / self.batch_size / self.comm_size)
self.steps_per_epoch = max_steps
# Start the very first block
self.stats.start_block(epoch, block)
loader = self.framework.get_loader(dataset_type=DatasetType.TRAIN)
self.stats.start_loading()
for batch in loader.next():
self.stats.batch_loaded(epoch, overall_step, block)
computation_time = self.args.computation_time
if (isinstance(computation_time, dict) and len(computation_time) > 0) or (isinstance(computation_time, float) and computation_time > 0):
self.framework.trace_object("Train", overall_step, 1)
self.stats.start_compute()
self.framework.compute(batch, epoch, block_step, self.computation_time)
self.stats.batch_processed(epoch, overall_step, block)
# This is the barrier to simulate allreduce. It is required to simulate the actual workloads.
self.comm.barrier()
if self.do_checkpoint and (
self.steps_between_checkpoints >= 0) and overall_step == self.next_checkpoint_step:
self.stats.end_block(epoch, block, block_step)
self.stats.start_save_ckpt(epoch, block, overall_step)
self.checkpointing_mechanism.save_checkpoint(epoch, overall_step)
self.stats.end_save_ckpt(epoch, block)
block += 1
# Reset the number of steps after every checkpoint to mark the start of a new block
block_step = 1
self.next_checkpoint_step += self.steps_between_checkpoints
else:
block_step += 1
overall_step += 1
if overall_step > max_steps or ((self.total_training_steps > 0) and (overall_step > self.total_training_steps)):
if self.args.my_rank == 0:
self.logger.info(f"{utcnow()} Maximum number of steps reached")
if (block_step != 1 and self.do_checkpoint) or (not self.do_checkpoint):
self.stats.end_block(epoch, block, block_step - 1)
break
# start a new block here
if block_step == 1 and block != 1:
self.stats.start_block(epoch, block)
self.stats.start_loading()
self.comm.barrier()
if self.do_checkpoint and (self.steps_between_checkpoints < 0) and (epoch == self.next_checkpoint_epoch):
self.stats.end_block(epoch, block, block_step-1)
self.stats.start_save_ckpt(epoch, block, overall_step-1)
self.checkpointing_mechanism.save_checkpoint(epoch, overall_step)
self.stats.end_save_ckpt(epoch, block)
self.next_checkpoint_epoch += self.epochs_between_checkpoints
return overall_step
The compute method is implemented by the Framework class, which serves as an abstract base class defining the required methods for the classes implementing a framework like PyTorch or TensorFlow.
In the PyTorch implementation, the compute method invokes the model() method, which in turn calls a sleep function located in the utils/utility.py module. Specifically, the line base_sleep(sleep_time) simulates the time an accelerator takes to complete the computation stage, which includes the forward pass, the backward pass, and the weight and bias updates.
# dlio_benchmark/utils/utility.py
...
def sleep(config):
sleep_time = 0.0
if isinstance(config, dict) and len(config) > 0:
if "type" in config:
if config["type"] == "normal":
sleep_time = np.random.normal(config["mean"], config["stdev"])
elif config["type"] == "uniform":
sleep_time = np.random.uniform(config["min"], config["max"])
elif config["type"] == "gamma":
sleep_time = np.random.gamma(config["shape"], config["scale"])
elif config["type"] == "exponential":
sleep_time = np.random.exponential(config["scale"])
elif config["type"] == "poisson":
sleep_time = np.random.poisson(config["lam"])
else:
if "mean" in config:
if "stdev" in config:
sleep_time = np.random.normal(config["mean"], config["stdev"])
else:
sleep_time = config["mean"]
elif isinstance(config, (int, float)):
sleep_time = config
sleep_time = abs(sleep_time)
if sleep_time > 0.0:
base_sleep(sleep_time)
return sleep_time
For example, during UNET-3D training, an NVIDIA H100 was measured to take approximately 0.323 seconds to complete this stage. This value is passed to DLIO via the workload configuration file using the train.computation_time key.
The use of base_sleep(sleep_time) allows performance testing of storage systems for deep learning workloads without requiring expensive accelerators of any type in the lab. It's worth noting that DLIO's authors chose to alias Python's native sleep function as base_sleep in their implementation.
# dlio_benchmark/utils/utility.py
...
from time import time, sleep as base_sleep
Key Takeaways
DLIO does not require any accelerator (e.g., GPU, TPU, DPU) to benchmark your storage system.
Benchmark pass criteria are based on both throughput and latency. Therefore, focusing solely on high throughput is insufficient. You must also ensure the system responds quickly enough to maintain high accelerator utilization (AU).
Accelerator utilization depends on the workload type. For example:
To pass a UNET-3D benchmark, AU must be ≥ 90%
To pass a CosmoFlow benchmark, AU must be ≥ 70%
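As a simplified illustration (not the exact formula used by the benchmark), accelerator utilization can be thought of as the fraction of wall-clock time the emulated accelerator spends computing rather than waiting for data. The numbers below are hypothetical:
# Simplified AU calculation; step count, compute time, and runtime are illustrative values.
def accelerator_utilization(steps: int, compute_time_s: float, total_runtime_s: float) -> float:
    """Percentage of the total runtime spent in (emulated) computation."""
    return 100.0 * (steps * compute_time_s) / total_runtime_s

# 1000 steps at 0.323 s of emulated compute each, finishing in 350 s of wall-clock time
print(f"AU = {accelerator_utilization(1000, 0.323, 350.0):.1f}%")   # ~92.3%, which would pass UNET-3D (>= 90%)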
Closing Thoughts
Thanks for sticking with us through this deep dive! We know it's a lot to take in, but by now you should have a solid understanding of the context and challenges involved in devising a cost-efficient method for measuring storage performance for deep learning workloads, as well as the rationale behind our approach to overcoming those challenges.
In the next post of this series, we'll explore our methodology and share performance results from training a UNET-3D model using an AWS FSx for NetApp ONTAP scale-out file system.
References
[1a] DLIO Benchmark. Available from: <https://dlio-benchmark.readthedocs.io/en/latest/>
[1b] DLIO Benchmark Overview. Available from: <https://dlio-benchmark.readthedocs.io/en/latest/overview.html>
[2] H. Devarajan, H. Zheng, A. Kougkas, X. -H. Sun and V. Vishwanath, DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications., 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Melbourne, Australia, 2021, pp. 81-91, doi: 10.1109/CCGrid51090.2021.00018.
If you’re running Microsoft Hyper-V, you’ve probably felt the pain of juggling many siloed tools, chasing down performance issues, and never quite having the full picture of how your VMs, hosts, and storage all fit together.
This is where NetApp Data Infrastructure Insights (DII) really changes the game.
Deploying new infrastructure requires some pre-work to make sure that the hardware you selected will meet your performance requirements. In this post, I guide you through making the right choices when sizing FSx for ONTAP so that it delivers optimal performance for your workloads.