Tech ONTAP Blogs

UNET-3D Training with DLIO: Storage Sizing and Performance Insights on Amazon FSx for NetApp ONTAP

RodrigoNascimento
NetApp
In this third part of our AI/ML storage benchmarking series, we dive into the practical application of the DLIO benchmark to evaluate storage performance for training a UNET-3D model.
 
In Part 1, we identified storage demands for deep learning workloads through workflow analysis (https://community.netapp.com/t5/Tech-ONTAP-Blogs/Identifying-Storage-Demands-for-Deep-Learning-Workloads-Through-Workflow/bc-p/463044#M795), and in Part 2, we introduced DLIO as a tool that helps address key benchmarking challenges (https://community.netapp.com/t5/Tech-ONTAP-Blogs/DLIO-An-Approach-to-Overcome-Storage-Benchmarking-Challenges-for-Deep-Learning/ba-p/462887).
 
We explore how to size and tune a storage subsystem to meet the demanding I/O profile of UNET-3D training workloads, using Amazon FSx for NetApp ONTAP as the underlying file system. By simulating accelerator utilization and analyzing throughput across multiple GPUs, we demonstrate how DLIO can be used to validate storage readiness for high-performance deep learning environments.
 

Sizing the Storage


Sizing a storage solution is a complex task involving countless variables, such as workload characteristics, hardware and software limitations, and performance targets. There are certainly many ways to approach this challenge.

For this benchmark, we focus on sizing the storage service based on the application's throughput and latency requirements.

You might be wondering: How can we determine the storage performance requirements for simulating the training of a UNET-3D model using DLIO?

To answer this question, we can use the formula shown in Figure 1 to estimate the theoretical throughput required to achieve 100% accelerator utilization during training.
 
Figure 1. Formula to calculate the theoretical throughput requirement per accelerator.
 
To achieve ideal throughput, the system must process a given volume of data (recordSize * batchSize) within a time window defined by the computation time in seconds.
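Put in notation, the relationship behind Figure 1 can be restated as follows, using the same terms described above:

\[
\text{throughput per accelerator} = \frac{\text{recordSize} \times \text{batchSize}}{\text{computationTime}}
\]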
 
Table 1 shows the theoretical throughput requirements for training UNET-3D and ResNet50 models on NVIDIA A100 and H100 accelerators.
 
Table 1. Theoretical throughput requirements for UNET-3D and ResNet50 considering NVIDIA accelerators A100 and H100.
 
It's important to note that UNET-3D and ResNet50 differ in terms of record size, batch size, and computation time. As a result, their storage throughput requirements also vary.

For UNET-3D, if your benchmark measurements using the A100 accelerator show a sustained throughput between 1,386 MiB/sec and 1,540 MiB/sec per simulated accelerator (corresponding to roughly 90% and 100% accelerator utilization, respectively), your storage subsystem can be considered to have passed the DLIO benchmark.
 
Figure 2. Theoretical Throughput Requirement for DLIO Benchmarks UNET-3D and ResNet50 versus Number of Accelerators.

 

As shown in Figure 2, simulating the training of a UNET-3D model on 16 A100 GPUs requires a storage subsystem capable of delivering up to 24,654 MiB/sec. In contrast, when simulating ResNet50 training using the same GPU model, it's possible to scale up to 256 GPUs, with a required storage throughput of 25,747 MiB/sec.
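To make that scaling concrete, here is a minimal Python sketch that simply multiplies the per-accelerator requirement by the accelerator count; the UNET-3D on A100 values are taken from the numbers above, and the function and variable names are illustrative only.

# Minimal sketch: aggregate storage throughput needed to keep N accelerators fed.
# Per-accelerator values are the UNET-3D/A100 numbers quoted in this post.

def required_throughput_mib_s(per_accelerator_mib_s: float, num_accelerators: int) -> float:
    """Scale the per-accelerator requirement linearly to the accelerator count."""
    return per_accelerator_mib_s * num_accelerators

UNET3D_A100_AT_90_PCT_UTIL = 1386.0    # MiB/sec per accelerator (90% utilization)
UNET3D_A100_AT_100_PCT_UTIL = 1540.9   # MiB/sec per accelerator (100% utilization)

gpus = 16
print(f"90% target:  {required_throughput_mib_s(UNET3D_A100_AT_90_PCT_UTIL, gpus):,.0f} MiB/sec")   # ~22,176
print(f"100% target: {required_throughput_mib_s(UNET3D_A100_AT_100_PCT_UTIL, gpus):,.0f} MiB/sec")  # ~24,654

For ResNet50 on A100, Figure 2 implies a much smaller per-accelerator requirement (roughly 25,747 / 256 ≈ 100 MiB/sec), which is why that workload can scale to far more GPUs before reaching a comparable aggregate throughput.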

Next, let's check out our lab environment details.
 

Lab Environment


Figure 3 illustrates an environment built on an Amazon FSx for NetApp ONTAP scale-out file system, configured with five HA pairs using the 6 GB/sec SKU. This setup is capable of delivering up to 30 GiB/sec of throughput.

The system uses the NFSv4.1 protocol with pNFS enabled and exposes a FlexGroup volume, a single namespace spanning all five HA pairs, through five endpoints (one per active node), all managed by pNFS.
 
Figure 3. Lab environment diagram for DLIO UNET-3D training.
 
The environment includes four m6idn.16xlarge EC2 instances, each running Ubuntu 22.04.5. Each instance provides 100 Gbps (12.5 GB/sec) of network bandwidth, giving the four clients an aggregate bandwidth of 400 Gbps (50 GB/sec).

The FlexGroup endpoint was mounted on all four clients using the following options: 
vers=4.1,rsize=65536,wsize=65536,proto=tcp,nconnect=16

 

Additionally, the Linux NFS module parameter max_session_slots was increased from 64 to 512 on all clients to support higher concurrency.
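If you want to confirm that tuning before kicking off a run, an optional Python check like the one below can be executed on each client; it assumes the standard Linux location for NFS client module parameters under /sys/module/nfs/parameters.

# Optional pre-flight check: verify the NFS client concurrency tuning on this host.
# Assumes the Linux nfs kernel module is loaded and exposes its parameters in sysfs.
from pathlib import Path

param = Path("/sys/module/nfs/parameters/max_session_slots")
value = int(param.read_text().strip())
print(f"max_session_slots = {value}")
if value < 512:
    print("Consider raising max_session_slots (e.g., 'options nfs max_session_slots=512' "
          "in /etc/modprobe.d) before running the benchmark.")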
 
With each client simulating four A100 accelerators, the environment has a total of sixteen A100 GPUs. Based on Figure 2, this configuration can drive up to 24,654 MiB/sec of storage throughput demand.
 

Results


As illustrated in Figure 2, passing the DLIO benchmark using 16 A100 accelerators requires a sustained storage throughput between 22,176 MiB/sec (for 90% accelerator utilization) and 24,654 MiB/sec (for 100% accelerator utilization).

After executing DLIO and collecting the results, the benchmark achieved an average accelerator utilization of 92.2% (standard deviation of 0.80%) and an average storage throughput of 22,495 MiB/sec (standard deviation of 196 MiB/sec), as shown in Figure 4.
 
Figure 4. DLIO UNET-3D training average results.
 
A closer look at the storage throughput statistics in Figure 5 reveals that the maximum throughput reached during training was 29,930 MiB/sec, which is near the upper limit of what the file system service can deliver.
 
Figure 5. Storage throughput statistics during UNET-3D benchmark.
The first quartile was 13,968 MiB/sec, the second quartile (median) was 20,129 MiB/sec, and the third quartile reached 22,298 MiB/sec. These results suggest that our storage sizing was well-aligned with the workload requirements.
 
Amazon FSx for NetApp ONTAP sustained the required throughput to pass the benchmark, ensuring data was available in memory in time for the accelerator to consume it. This continuous data availability kept the accelerator busy, which in this context translates to operational efficiency.
 

Key Takeaways


  • It is possible to estimate the theoretical throughput requirement per accelerator for a given model once you know the record (sample) size, the batch size, and the computation time in seconds for the target GPU model.

  • The methodology presented in this post is storage-agnostic, allowing you to evaluate and compare various storage solutions to determine which one best meets your requirements.
 

Closing Thoughts

 
When done right, benchmarking plays a critical role in uncovering performance bottlenecks and assessing the scalability of your infrastructure.
 
Amazon FSx for NetApp ONTAP is a fully managed file system service that delivers the throughput required by demanding AI/ML workloads. Its flexible scale-out architecture allows you to expand file system resources dynamically, aggregating performance and/or capacity to meet evolving workload demands.

Thank you for reading! My goal with this blog series is to share insights from my journey exploring AI/ML workloads, and I hope the information provided here proves useful in your own work wherever it may take you.