In this third part of our AI/ML storage benchmarking series, we dive into the practical application of the DLIO benchmark to evaluate storage performance for training a UNET-3D model.
In Part 1, we identified storage demands for Deep Learning workloads through workflow analysis (https://community.netapp.com/t5/Tech-ONTAP-Blogs/Identifying-Storage-Demands-for-Deep-Learning-Workloads-Through-Workflow/bc-p/463044#M795), and in part 2, we introduced DLIO as a tool which helps to address key benchmarking challenges (https://community.netapp.com/t5/Tech-ONTAP-Blogs/DLIO-An-Approach-to-Overcome-Storage-Benchmarking-Challenges-for-Deep-Learning/ba-p/462887).
We explore how to size and tune a storage subsystem to meet the demanding I/O profile of UNET-3D training workloads, using AWS FSx for NetApp ONTAP as the underlying file system. By simulating accelerator utilization and analyzing throughput across multiple GPUs, we demonstrate how DLIO can be used to validate storage readiness for high-performance deep learning environments.
Sizing the Storage
Sizing a storage solution is a complex task with countless variables: workload characteristics, hardware and software limitations, and performance targets, among others. There are many ways to approach this challenge.
For this benchmark, we focus on sizing the storage service based on the application's throughput and latency requirements.
You might be wondering: How can we determine the storage performance requirements for simulating the training of a UNET-3D model using DLIO?
To answer this question, we can use the formula shown in Figure 1 to estimate the theoretical throughput required to achieve 100% accelerator utilization during training.
Figure 1. Formula to calculate the theoretical throughput requirement per accelerator.
To achieve ideal throughput, the system must process a given volume of data (recordSize * batchSize) within a time window defined by the computation time in seconds.
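The formula in Figure 1 can be expressed in a few lines of Python as a quick sanity check. The record size, batch size, and computation time below are hypothetical placeholders for illustration, not DLIO's actual model parameters:

```python
def per_accelerator_throughput_mib_s(record_size_mib: float,
                                     batch_size: int,
                                     computation_time_s: float) -> float:
    """Theoretical throughput (MiB/s) one accelerator needs for 100% utilization:
    a full batch of data must arrive within the computation window."""
    return (record_size_mib * batch_size) / computation_time_s

def cluster_throughput_mib_s(per_accel_mib_s: float, n_accelerators: int) -> float:
    """The aggregate requirement scales linearly with the accelerator count."""
    return per_accel_mib_s * n_accelerators

# Hypothetical example: 128 MiB records, batch of 4, 0.5 s of compute per batch
per_accel = per_accelerator_throughput_mib_s(128, 4, 0.5)  # 1024.0 MiB/s
total = cluster_throughput_mib_s(per_accel, 16)            # 16384.0 MiB/s
```

Plugging in a model's real record size, batch size, and per-batch computation time for a given GPU yields the per-accelerator figures shown in Table 1.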
Table 1 shows the theoretical throughput requirements for training UNET-3D and ResNet50 models considering NVIDIA accelerators A100 and H100.
Table 1. Theoretical throughput requirements for UNET-3D and ResNet50 considering NVIDIA accelerators A100 and H100.
It's important to note that UNET-3D and ResNet50 differ in terms of record size, batch size, and computation time. As a result, their storage throughput requirements also vary.
For UNET-3D, if your benchmark measurements using the A100 accelerator show a sustained throughput between 1,386 MiB/sec and 1,540 MiB/sec per simulated accelerator, your storage subsystem can be considered to have passed the DLIO benchmark.
Figure 2. Theoretical Throughput Requirement for DLIO Benchmarks UNET-3D and ResNet50 versus Number of Accelerators.
As shown in Figure 2, simulating the training of a UNET-3D model on 16 A100 GPUs requires a storage subsystem capable of delivering up to 24,654 MiB/sec. In contrast, when simulating ResNet50 training using the same GPU model, it's possible to scale up to 256 GPUs, with a required storage throughput of 25,747 MiB/sec.
Next, let's check out our lab environment details.
Lab Environment
Figure 3 illustrates an environment built on an Amazon FSx for NetApp ONTAP scale-out file system, configured with five HA pairs using the 6-GBps SKU. This setup is capable of delivering up to 30 GiB/sec of throughput.
The system uses the NFS v4.1 protocol with pNFS enabled and features a FlexGroup volume: a single namespace spanning all five HA pairs, exposed through five endpoints (one per active node), all managed by pNFS.
Figure 3. Lab environment diagram for DLIO UNET-3D training.
The environment includes four m6idn.16xlarge EC2 instances, each running Ubuntu 22.04.5. These instances provide 100 Gbps (or 12.5 GiB/sec) of network bandwidth per client. Collectively, the four clients offer an aggregate bandwidth of 400 Gbps (or 50 GiB/sec).
The FlexGroup endpoint was mounted on all four clients using the following options: vers=4.1,rsize=65536,wsize=65536,proto=tcp,nconnect=16
Additionally, the Linux NFS module parameter max_session_slots was increased from 64 to 512 on all clients to support higher concurrency.
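The client settings above can be captured in a short configuration sketch. This is a config fragment run as root on each client; the server endpoint and mount point shown are hypothetical placeholders:

```shell
# Raise the NFS v4.1 session slot count before the nfs module loads
# (takes effect on next module load or reboot)
echo "options nfs max_session_slots=512" > /etc/modprobe.d/nfs-slots.conf

# Mount the FlexGroup with the options used in this benchmark
mkdir -p /mnt/dlio
mount -t nfs \
    -o vers=4.1,rsize=65536,wsize=65536,proto=tcp,nconnect=16 \
    svm-endpoint.fsx.example.com:/flexgroup /mnt/dlio
```

The nconnect=16 option opens sixteen TCP connections per mount, and the larger session slot table allows more NFS operations in flight per session, both of which help keep a high-bandwidth link busy.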
If each client is equipped with four A100 accelerators, the environment will have a total of sixteen A100 GPUs. Based on the information from Figure 2, this configuration has the potential to achieve up to 24,654 MiB/sec of storage throughput.
Results
As illustrated in Figure 2, passing the DLIO benchmark using 16 A100 accelerators requires a sustained storage throughput between 22,176 MiB/sec (for 90% accelerator utilization) and 24,654 MiB/sec (for 100% accelerator utilization).
After executing DLIO and collecting its results, the benchmark achieved an average accelerator utilization of 92.2% with a standard deviation of 0.80%, and an average storage throughput of 22,495 MiB/sec with a standard deviation of 196 MiB/sec as shown by Figure 4.
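The pass criterion can be stated programmatically. The 90% floor and the measured values come from the figures above; note that the throughput-derived ratio only approximates the 92.2% utilization DLIO reports directly:

```python
def implied_utilization_pct(measured_mib_s: float, theoretical_mib_s: float) -> float:
    """Measured storage throughput relative to the 100%-utilization requirement."""
    return 100.0 * measured_mib_s / theoretical_mib_s

def passes_dlio(utilization_pct: float, floor_pct: float = 90.0) -> bool:
    """DLIO passes when accelerator utilization stays at or above the floor."""
    return utilization_pct >= floor_pct

util = implied_utilization_pct(22_495, 24_654)  # ~91.2%
assert passes_dlio(util)
```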
Figure 4. DLIO UNET-3D training average results.
A closer look at the storage throughput statistics in Figure 5 reveals that the maximum throughput reached during training was 29,930 MiB/sec, which is near the upper limit of what the file system service can deliver.
Figure 5. Storage throughput statistics during UNET-3D benchmark.
The first quartile was 13,968 MiB/sec, the second quartile (median) was 20,129 MiB/sec, and the third quartile reached 22,298 MiB/sec. These results suggest that our storage sizing was well-aligned with the workload requirements.
Amazon FSx for NetApp ONTAP sustained the required throughput to pass the benchmark, ensuring data was available in memory in time for the accelerator to consume it. This continuous data availability kept the accelerator busy, which in this context translates to operational efficiency.
Key Takeaways
It is possible to estimate the theoretical throughput requirement per accelerator for a given model once you know the record (sample) size, batch size, GPU model, and its computation time in seconds.
The methodology presented in this post is storage-agnostic, allowing you to evaluate and compare various storage solutions to determine which one best meets your requirements.
Closing Thoughts
When done right, benchmarking plays a critical role in uncovering performance bottlenecks and assessing the scalability of your infrastructure.
Amazon FSx for NetApp ONTAP is a fully managed file system service that delivers the throughput required by demanding AI/ML workloads. Its flexible scale-out architecture allows you to expand file system resources dynamically, aggregating performance and/or capacity to meet evolving workload demands.
Thank you for reading! My goal with this blog series is to share insights from my journey exploring AI/ML workloads, and I hope the information provided here proves useful in your own work wherever it may take you.
We are thrilled to announce that our very own Anthony Mashford has been recognized as a Microsoft Most Valuable Professional (MVP). This is a significant achievement, and we couldn’t be prouder to have his expertise and leadership on our team. Please join us in congratulating Anthony on this well-deserved honor!
What is the Microsoft MVP Award?
The Microsoft MVP Award is a prestigious recognition for technology experts who are passionate about sharing their knowledge with the community. It’s not just about what you know; it’s about what you share. For over 30 years, Microsoft has honored individuals who demonstrate exceptional community leadership and a deep commitment to helping others get the most out of Microsoft technologies.
This is an exclusive group. The global MVP community consists of just over 4,000 technical experts across more than 90 countries. These individuals are a vital part of the tech ecosystem, providing invaluable real-world feedback that helps shape the future of Microsoft products and services.
Why this recognition matters
Becoming an MVP is a testament to an individual's dedication, expertise, and community spirit. MVPs are leaders who:
Share knowledge freely: They write articles, speak at events, lead user groups, and contribute to online forums to help others solve technical challenges.
Provide crucial feedback: MVPs have a direct line to Microsoft product teams, offering insights that lead to better products and innovations.
Foster a strong community: They are advocates for building positive, inclusive, and helpful tech communities for everyone.
The benefits are substantial, including early access to Microsoft products, direct communication channels with product development teams, and an invitation to the exclusive MVP Global Summit. It’s an opportunity to network with fellow experts and the minds behind the technology.
A round of applause for Anthony Mashford
Anthony’s recognition as a Microsoft MVP is a direct reflection of his unwavering dedication and inspiring contributions to the tech community. His passion for sharing knowledge and helping others succeed embodies the spirit of the MVP program. We are incredibly fortunate to benefit from his expertise and leadership.
His work not only elevates our team but also strengthens the broader technology community.
Join the celebration
Help us celebrate this remarkable achievement. Please leave a comment below to congratulate Anthony Mashford on his Microsoft MVP award!
Curious about what it takes to become an MVP? You can learn more about the program and its impact on the official Microsoft MVP site.
Introducing self-service partnerships through NetApp Console: Empowering partners to collaborate seamlessly with enterprises!
In today’s hybrid cloud world, enterprises rely heavily on trusted partners such as managed service providers, service providers, and resellers to manage their NetApp® intelligent data infrastructures. These partners play a crucial role in deploying, operating, and optimizing NetApp storage systems across on-premises and cloud environments. But until now, there hasn’t been a secure, scalable, and seamless way for partners to manage customer resources with NetApp Console.
We’re excited to announce the launch of self-service partnerships through Console, a powerful new capability that enables enterprises and their approved partners to collaborate effortlessly and securely.
A Strategic Perspective from Product Leadership at NetApp and Starburst Data
The enterprise AI revolution isn't waiting for anyone. As organizations rush to implement AI, with 87% planning deployments within 12 months, a harsh reality is emerging. Most AI projects will fail not due to inadequate AI models, but because of insufficient data infrastructure. The critical challenge most customers face is integrating their existing data, whether in the cloud, on-premises, or both, with their enterprise AI workflows. The strategic partnership between NetApp and Starburst Data represents a fundamental reimagining of how enterprises should architect for AI success.
The $100 Billion Question: Why Data Infrastructure Determines AI Success
Every large enterprise faces the same challenge. AI initiatives demand instant access to all data, everywhere, with zero compromise on governance or security. Traditional storage vendors have responded with incremental improvements. We've chosen to break the mold.
The Fatal Flaw in Current Approaches
Traditional storage vendors continue to treat storage as a passive repository, a place where data waits to be moved, processed, and returned. This outdated paradigm creates three critical failures:
A Data Movement Tax: Every AI workflow requires multiple hops between storage and processing layers, introducing latency, cost, and complexity.
Siloed Intelligence: Data remains trapped in infrastructure silos, preventing the holistic view AI requires.
Static Architecture: Traditional storage can't adapt to AI's dynamic requirements for training, inference, and real-time processing.
Beyond Traditional Storage: The Intelligent Data Platform
The new design center in NetApp® ONTAP® is built on the tenets of disaggregation and composable architecture. It is ideally suited to workloads that demand high bandwidth and scale, and it provides cost-effective scaling of data infrastructure for deep learning. NetApp's new disaggregated architecture includes an enhanced metadata management engine that helps customers understand all the data assets in their organization, simplifying AI training and fine-tuning. The engine automatically captures changes to your data and generates highly compressed vector embeddings, making that data available for search and for AI/RAG inferencing workloads that leverage generative AI technologies and large language models (LLMs). This shift holds several important implications. Let’s unpack them one by one.
Data preparation and data management happen where data lives. This approach reduces the inefficiency of external processing.
AI workflows accelerate. This allows the value of AI to scale by orders of magnitude through in-platform optimization.
Governance remains intact. This means that, with Starburst Data’s hybrid data federation capabilities, data never leaves the secure perimeter, enabling in-place query execution without data movement.
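Conceptually, the embed-and-search flow described above looks like the following sketch. The toy bag-of-words "embedding" and the sample documents are illustrative stand-ins, not NetApp's actual engine:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real engine would use a learned model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# As files change, the engine (re)embeds them; queries then search the
# vectors in place, without moving the underlying data.
corpus = {
    "doc1": "quarterly sales report for storage products",
    "doc2": "gpu cluster training logs",
}
index = {name: embed(text) for name, text in corpus.items()}

query = embed("storage sales figures")
best = max(index, key=lambda name: cosine(query, index[name]))
print(best)  # doc1
```

A production engine would substitute dense neural embeddings and an approximate-nearest-neighbor index, but the in-place retrieval pattern is the same.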
The Starburst Multiplier Effect
Starburst's distributed and hybrid platform amplifies this vision by providing federated data access across all data sources, whether in NetApp ONTAP or StorageGRID, cloud object stores, or legacy databases, along with the ability to secure, govern, and maintain the compliance of the data with AI.
Together, we deliver:
Single data access interface across petabytes of distributed and hybrid data for a unified data view. NetApp ONTAP unifies storage through a simplified, software-defined approach for secure and efficient data management. Customers can leverage ONTAP for NAS, SAN, and S3 object protocols, enabling effective use of cloud resources for improved performance and seamless scalability.
Performant 50+ native connectors to every major enterprise data platform. NetApp ONTAP offers extended REST API capabilities enabling S3 object administration and AI service integration, including enhanced OAuth 2.0 support.
Ability to not only store but govern the data to enforce and maintain compliance across all data sources with AI. ONTAP includes several synchronous and asynchronous data protection and compliance policies that ensure client I/O is not disrupted across your AI workflow.
The NetApp-Starburst Advantage: Quantifiable Business Outcomes
The adoption of NetApp and Starburst together results in several advantages and quantifiable improvements in business outcomes. Let’s unpack them one by one.
Accelerated AI Time-to-Value
Using NetApp and Starburst shortens the time to value compared with traditional approaches. Consider the typical scenario below.
Traditional Approach        | NetApp-Starburst Platform
----------------------------|---------------------------
6-12 months to production   | 6-8 weeks to production
Multiple data movements     | Zero data movement
5-7 tools required          | Single integrated platform
Limited to structured data  | All data types supported
Dramatic Cost Reduction
Using NetApp and Starburst also results in a dramatic reduction in cost. The sample below shows a typical cost reduction using this strategy.
70% reduction in data preparation costs through in-platform processing
50% lower cloud egress fees via intelligent data placement
78% storage efficiency, which is best in class
Eliminated redundant infrastructure, recouping dollars for priority projects
Enterprise-Grade AI Governance
NetApp and Starburst offer rock-solid, enterprise-ready data governance. While competitors struggle with compliance, we deliver a compliance-ready approach no matter where your data lives: on-premises, in the cloud, or a mix of both.
This results in:
NSA-validated encryption (NetApp is the only storage vendor to achieve this validation)
Immutable audit trails for regulatory compliance
Unified access control across all data sources
Real-time lineage tracking for AI model governance
Why NetApp and Starburst Help You to Act on AI Today
AI adoption is transforming the business landscape at every organization worldwide. Organizations that fail to modernize their data infrastructure today will find themselves permanently disadvantaged. As you decide how and when to adopt AI at your organization, there are several pathways to consider.
Address the hurdle to AI implementation and adoption: Data, data access, and governed data management are the foundation of AI project success.
Compound Learning: Every day of AI experience compounds into a head start on your competitors.
Convert on the AI productivity and cost premise: Successful AI projects drive organizational and business productivity and cost gains - every day.
NetApp + Starburst: Your Unfair Advantage
NetApp and Starburst are a perfect pairing. Together, they provide a complete transformation of your AI readiness, offering:
The only platform with embedded data intelligence + cloud-neutral hybrid lakehouse, with federated access to all data and data management
The only unified solution spanning all protocols, all clouds, all data types, without the need to move data
The only partnership delivering true federated data intelligence at scale
The enterprise race to AI success will not be won by those with the largest models, but by those with the strongest data foundations. NetApp and Starburst together redefine what it means to be AI-ready. We unite intelligent storage, federated access, and governed data management into one cohesive ecosystem.
While traditional storage vendors remain trapped in legacy architectures, NetApp and Starburst are here to help you build an AI-ready, hybrid, future-proof data platform. This partnership transforms data from a static asset into an active enabler of insight, speed, and innovation. The enterprises that act now to modernize their data infrastructure will be the ones setting the pace in the AI-driven economy.
Because in the AI era, your data infrastructure isn't just supporting your business, it IS your business value.