Optimize GPU-Accelerated Workloads on NetApp Storage Systems using NVIDIA GPUDirect Storage

arnette · 2022-02-28

Computing performance is growing at an astonishing rate, and NVIDIA is outpacing Moore’s law with its accelerated computing platform by doubling compute performance every 18 months or less. Advances in scientific computing and deep learning (DL) make it possible to tackle problems with more parameters, bigger data sets, and higher-resolution sensor data, which drives the need for higher bandwidth and lower latency to keep up with advances in algorithms and multi-node, multi-GPU performance. Increased storage performance means algorithms can scale out for faster execution to deliver faster time-to-insight, or handle more users concurrently to deliver better ROI on GPU compute infrastructure. In this blog, I'm going to explain how NVIDIA Magnum IO GPUDirect Storage (GDS) using Remote Direct Memory Access (RDMA) can solve storage bottleneck issues for some workloads, and talk about how NetApp can support GDS for a variety of use cases and customer infrastructure preferences.

What is GPUDirect Storage?

As the first release in the NVIDIA Magnum IO™ family of solutions, GPUDirect RDMA has been around for a few years and extends RDMA to allow movement of data directly from GPU memory to other devices on the PCI bus, either network cards or other GPUs, to accelerate the exchange of weights and biases in a distributed training job. Networking is typically the bottleneck in large-scale distributed training, and GPUDirect RDMA significantly reduced latency and enabled a massive increase in the number of GPUs that could be applied to a single training process. GDS is used to provide a similar increase in storage performance for use cases where data access speeds are the main bottleneck. This includes things like large-scale batch analytics or inferencing against massive pools of customer data using tools like Apache Spark. It also includes workloads like CGI video production and video gaming where scene data and skins are loaded and rendered in real time, and better storage performance means that the cast and crew spend more time filming and less time waiting for data to load.

GDS is focused on optimizing data handling within the host by using RDMA to move data directly between GPUs and a storage system, which reduces latency and CPU utilization. Traditionally, RDMA IO is performed between the storage system and main system RAM, but this doesn't provide optimal performance for GPU operations because the data still needs to be copied from CPU RAM to GPU RAM. GDS uses the same technology as GPUDirect RDMA, allowing remote storage systems to address memory directly on the GPU and eliminating the bounce buffer in main memory and the extra copy it requires. As shown in the graphic below, this is done by inserting a new kernel driver (nvidia-fs.ko), which works together with the NVIDIA GPU driver (nvidia.ko), into the Virtual File System stack. The driver manages the GPU memory address space and directs IO to the appropriate blocks in either system RAM or GPU RAM. This extends all the way down to the PCI bus, allowing data to move directly between the GPU and the network interface. Many operations, such as metadata handling, will still use CPU RAM, while data blocks can go directly into GPU memory.
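To make the difference between the two data paths concrete, here is a toy Python model. It does not call any GDS or driver API: the `TransferCost` type, the function names, and the hop labels are all illustrative assumptions. It only records which hops a read takes, how many payload bytes are staged in system RAM on each path, and where metadata and data IO end up according to the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TransferCost:
    """Toy description of one read: the hops it takes and the bytes staged in system RAM."""
    hops: List[str]
    bytes_staged_in_system_ram: int


def bounce_buffer_read(nbytes: int) -> TransferCost:
    # Traditional RDMA path: the NIC lands data in system RAM,
    # then a second copy moves it into GPU RAM.
    return TransferCost(["storage", "NIC", "system RAM", "GPU RAM"], nbytes)


def gds_read(nbytes: int) -> TransferCost:
    # GDS path: the NIC writes straight into GPU RAM over the PCI bus,
    # so no payload bytes are staged in system RAM.
    return TransferCost(["storage", "NIC", "GPU RAM"], 0)


def destination(io_kind: str) -> str:
    # Hypothetical routing rule mirroring the text: metadata stays in CPU RAM,
    # data blocks can go directly to GPU memory.
    if io_kind == "metadata":
        return "system RAM"
    if io_kind == "data":
        return "GPU RAM"
    raise ValueError(f"unknown IO kind: {io_kind!r}")


if __name__ == "__main__":
    one_gib = 1 << 30
    for name, read in (("bounce buffer", bounce_buffer_read), ("GDS", gds_read)):
        cost = read(one_gib)
        print(f"{name}: {' -> '.join(cost.hops)}; "
              f"{cost.bytes_staged_in_system_ram} bytes staged in system RAM")
```

The model is deliberately simple. Its only point is that the GDS path has one fewer hop and stages no payload in system RAM, which is where the latency and CPU savings come from.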

There are a couple of ways this can be implemented with existing protocols, and NetApp supports both: E/EF-Series systems with BeeGFS for InfiniBand (IB) RDMA connectivity, and AFF/FAS systems using NFS over RDMA with RoCE as the transport.

GDS with EF600 and BeeGFS

For customers who prefer HPC-style infrastructures, or who are working on ML/DL problems at the largest scale, InfiniBand-based Parallel File Systems (PFS) are often preferred because of the massively scalable performance and capacity they offer. NetApp now offers the BeeGFS filesystem, including L1/L2 software support, with E- and EF-Series storage systems to provide customers with a complete PFS solution. By combining BeeGFS with the high-performance EF600 storage system with 200Gb/s HDR InfiniBand support, we've created a building block that offers up to 74GB/s of read performance in 8RU. Services can be deployed with high availability within each block, and blocks can be replicated as many times as necessary for deployments of any scale. And with Ansible, Prometheus/Grafana, and Kubernetes CSI integration, the full stack is easily deployed and managed.
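As a rough sizing aid, the building-block figures quoted above (up to 74GB/s of read performance in 8RU) can be turned into a back-of-the-envelope calculation. The sketch below assumes read throughput adds up linearly as blocks are replicated, which is a simplification. The function names and the 200 GB/s target are made up for illustration.

```python
import math

# Figures quoted in the text for one BeeGFS + EF600 building block.
READ_GBYTES_PER_SEC_PER_BLOCK = 74
RACK_UNITS_PER_BLOCK = 8


def blocks_needed(target_read_gbytes_per_sec: float) -> int:
    """Smallest number of blocks whose combined peak read rate meets the target.

    Assumes throughput scales linearly across blocks (a simplification).
    """
    if target_read_gbytes_per_sec <= 0:
        raise ValueError("target must be positive")
    return math.ceil(target_read_gbytes_per_sec / READ_GBYTES_PER_SEC_PER_BLOCK)


def rack_units(blocks: int) -> int:
    """Rack space consumed by the given number of building blocks."""
    return blocks * RACK_UNITS_PER_BLOCK


if __name__ == "__main__":
    target = 200  # GB/s, an arbitrary example target
    n = blocks_needed(target)
    print(f"{target} GB/s of reads -> {n} blocks, {rack_units(n)}RU")
```

Real deployments should be sized against measured workload behavior, since the 74GB/s figure is an "up to" number.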

NetApp is supporting the modification of BeeGFS to support GDS operations, including changes to client processes to understand the GPU address space as well as the server processes that actually store data chunks on disk. GPUDirect Storage with BeeGFS is currently available in a limited Tech Preview capacity and will be publicly available in the upcoming BeeGFS 7.3 release in March 2022. Customers interested in trying it out before March can reach out to sales@thinkparq.com for more information.

GDS and ONTAP

For customers who require advanced data management capabilities or prefer traditional enterprise network technologies, NetApp's ONTAP storage operating system now offers NFS over RDMA to bring standard NFS operations to the next level. RDMA has been around for decades, and while it is typically associated with InfiniBand networks, RDMA over Converged Ethernet (RoCE) leverages advanced flow control mechanisms to provide a low-latency, dropless fabric on Ethernet networks. NFS over RDMA is supported on most current Linux distributions with appropriate hardware and drivers, in this case NVIDIA ConnectX-5 and -6 cards, and it integrates easily into existing workflows because applications require no changes. GDS uses the NVIDIA Mellanox™ Open Fabrics Enterprise Distribution (MOFED) software to extend basic NFS over RDMA capability to include GPU memory space and bypass the CPU and main memory by making RPC/RDMA calls directly to the NIC buffers. This allows the highest level of performance for applications while enabling data pipelines that can span cloud or remote locations as well as primary datacenters.

ONTAP support for NFS over RDMA is available starting in ONTAP 9.10.1 with the NFSv4.0 protocol only, and GDS will be supported at that time so customers can start to test GDS functionality. RDMA feature parity for GDS will align with an upcoming release of ONTAP that will include support for NFSv3 and/or NFSv4.1 to significantly improve performance and close any other feature gaps.
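On the Linux client side, NFS over RDMA is selected through ordinary NFS mount options. The Python sketch below builds such a mount command and checks whether a mount's option string uses RDMA. The `vers`, `proto=rdma`, and `port` options are standard Linux NFS mount options, and 20049 is the registered NFS over RDMA port. The server address, export path, mount point, and helper names are hypothetical. Confirm the exact options for your environment in the ONTAP documentation. GDS-specific client configuration is not shown here.

```python
import shlex
from typing import Dict, List

NFS_RDMA_PORT = 20049  # IANA-registered port for NFS over RDMA


def nfs_rdma_mount_command(server: str, export: str, mountpoint: str,
                           nfs_version: str = "4.0") -> List[str]:
    """Build the argv for a Linux NFS mount that uses RDMA as the transport."""
    options = f"vers={nfs_version},proto=rdma,port={NFS_RDMA_PORT}"
    return ["mount", "-t", "nfs", "-o", options, f"{server}:{export}", mountpoint]


def parse_mount_options(option_string: str) -> Dict[str, str]:
    """Split a comma-separated option string into a dict; flags map to ''."""
    parsed = {}
    for item in option_string.split(","):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def uses_rdma(option_string: str) -> bool:
    """True if the mount options select RDMA as the transport."""
    return parse_mount_options(option_string).get("proto") == "rdma"


if __name__ == "__main__":
    # Hypothetical data LIF address, export path, and mount point.
    argv = nfs_rdma_mount_command("192.0.2.10", "/gds_vol", "/mnt/gds")
    print(shlex.join(argv))
    # Option string in the style reported for a mounted filesystem (illustrative).
    print(uses_rdma("rw,relatime,vers=4.0,proto=rdma,port=20049"))
```

The default of NFSv4.0 reflects the protocol support described above for ONTAP 9.10.1. The script only prints the command rather than running it, because mounting requires root privileges and a reachable RDMA-capable export.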

For more information on GDS and NFS over RDMA with ONTAP 9.10.1, please see the ONTAP 9 Documentation Center.

Conclusion

Just as GPUDirect RDMA changed the way compute clusters communicate, GDS is likely to change the way we think about storage. Emerging models and training techniques are driving data throughput and latency requirements higher than ever, and other inferencing and GPU rendering use cases are only possible with next-level storage performance. It's going to take some time for the software around those use cases to reach production readiness, but when that happens we will likely see another explosion in storage performance requirements, as each GPU-accelerated system may require the same storage performance that a full cluster does today. NVIDIA is extending software frameworks like NVIDIA RAPIDS™ to use GDS when applicable, and software developers in a variety of industries are working to update code to take full advantage. NetApp is uniquely positioned to offer a solution for any GDS-based workload or requirement while continuing to offer industry-leading data management capabilities across our entire range of products. For more information on NetApp GDS solutions, you can watch the video of my Insight presentation, Accelerating Workloads on NVIDIA GPUs with GPU Direct Storage (BRK-1359-3), or visit www.netapp.com/ai for information about all of NetApp’s solutions for ML/DL workloads.