Optimize MLOps with Google Cloud NetApp Volumes and Google Kubernetes Engine
NetApp
2025-04-08
In the world of AI and machine learning (ML), the scalability and efficiency of the underlying infrastructure play a critical role in the success of an AI project. MLOps solutions, combined with the back-end infrastructure, must be able to optimize costs at every step of an AI project so that the project is sustainable in the long run.
The combination of Google Cloud NetApp Volumes and Google Kubernetes Engine has proven to be a game changer for ML workflows. In this blog post, we cover the steps involved in training a model. Along the way, we explore how this integration improves data handling, performance, resource usage, and model training while optimizing costs.
And don’t miss the demonstration that showcases all the capabilities that are presented in this blog post!
Solution overview
The following graphic shows the basic components of the solution.
At the foundation of this solution is Google Cloud NetApp Volumes, which serves as a high-performance, scalable file storage solution. The dataset for training an ML model is hosted and served out of this layer.
Moving into the compute layer, there is Google Kubernetes Engine (GKE), with the NetApp® Trident™ Container Storage Interface (CSI) driver to orchestrate storage and present the dataset to the compute layer. To add more power to the compute, NVIDIA GPU accelerators are added as worker nodes in GKE, which are essential for speeding up the model training process.
At the top of the stack, the ML framework is deployed in a containerized form, using JupyterLab for code execution and experimentation. This setup gives you the flexibility to consume resources on demand and to fully use them when they’re provisioned.
Workflow
Training a model is a multistep process; the following series of operations walks you through it. Throughout this flow, we discuss how the combined capabilities of NetApp Volumes and GKE deliver a highly efficient, feature-rich infrastructure back end for MLOps.
Step 1: Data preparation with Google Cloud NetApp Volumes
The first step in any ML workflow is to prepare the data. In this context, the dataset typically resides in a volume in NetApp Volumes, ideally at the Extreme service level. The Extreme service level of NetApp Volumes delivers up to 30GiBps of throughput and supports volumes as large as 1PiB, which is optimal for data-heavy, performance-demanding AI/ML workloads.
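In NetApp Volumes, the service level is a property of the storage pool that hosts the volume. The following is a minimal sketch with the gcloud CLI that creates an Extreme pool and a dataset volume in it; the names, region, network, and capacities (in GiB) are illustrative placeholders, and flag syntax may vary by gcloud version.

# Create a storage pool at the Extreme service level (illustrative names and sizes)
gcloud netapp storage-pools create ml-pool \
    --location=us-central1 \
    --service-level=EXTREME \
    --capacity=10240 \
    --network=name=default

# Create the dataset volume in that pool
gcloud netapp volumes create training-data \
    --location=us-central1 \
    --storage-pool=ml-pool \
    --capacity=5120 \
    --protocols=nfsv3 \
    --share-name=training-data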
Step 2: Kubernetes setup with GKE
Next comes a GKE cluster, where the NetApp Trident CSI driver is configured to manage NetApp Volumes. This driver enables GKE to use Google Cloud NetApp Volumes as the storage backend, making it easy to integrate the dataset into the Kubernetes environment.
Storage classes in Kubernetes make it easy to map a performance profile (service level) to the volume that will host the dataset, as the sketch below shows. Another advantage of running ML workflows in Google Cloud with GKE is the ability to scale up the infrastructure only when you need to. For general workloads, standard compute instances are an excellent fit, and for model training, GPU-powered nodes can be added when the workload demands it.
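As a sketch, a Trident-backed storage class might look like the following. The provisioner name csi.trident.netapp.io is Trident's CSI driver; the backendType value shown is an assumption that depends on your Trident release and backend configuration, so check the Trident documentation for the exact driver name.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netapp-extreme
provisioner: csi.trident.netapp.io
parameters:
  backendType: "google-cloud-netapp-volumes"  # assumption: driver name varies by Trident release
allowVolumeExpansion: true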
Step 3: Data presentation to Kubernetes
With GKE set up and the dataset stored in Google Cloud NetApp Volumes, the data can now be presented to the model.
By using the Trident volume import feature, the volume in NetApp Volumes that contains the dataset can be presented to the GKE cluster through a PersistentVolumeClaim (PVC) reference:
tridentctl import volume <backend_name> <volume_name> -f pvc.yaml
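The pvc.yaml file that the command references is a standard PersistentVolumeClaim manifest, which Trident binds to the imported volume. A minimal sketch, with the claim name, size, and storage class as placeholders:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Ti
  storageClassName: netapp-extreme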
After the import, the data can be presented to the ML framework through a deployment spec that mounts the PVC (see the deployment example in step 4).
Step 4: Running the ML framework with JupyterLab
After the data has been presented to Kubernetes, the ML framework can be deployed on GKE by using a container image of your choice. Images that include a Jupyter implementation provide an interactive notebook interface for running and testing ML models.
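A minimal sketch of such a deployment follows. The jupyter/tensorflow-notebook image from the Jupyter Docker Stacks is one illustrative choice (it runs as the jovyan user with /home/jovyan as the working directory); the claim name matches the PVC from step 3.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterlab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterlab
  template:
    metadata:
      labels:
        app: jupyterlab
    spec:
      containers:
      - name: jupyterlab
        image: jupyter/tensorflow-notebook:latest  # illustrative image choice
        ports:
        - containerPort: 8888                      # JupyterLab web interface
        volumeMounts:
        - name: dataset
          mountPath: /home/jovyan/data             # dataset appears here in the notebook
      volumes:
      - name: dataset
        persistentVolumeClaim:
          claimName: training-data                 # PVC from the volume import in step 3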
Step 5: Scaling with GPU-powered compute
Training ML models, especially with large datasets, can require significant computational power. This need is where the integration with NVIDIA GPUs comes into play. A GPU-powered worker node can be added to the GKE cluster through a node pool, and GKE can automatically install the requisite NVIDIA drivers so that the GPUs are ready for immediate use.
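As a sketch, the following gcloud command adds such a node pool; the cluster name, region, machine type, and GPU type are placeholders, and the gpu-driver-version=default option asks GKE to install the driver automatically (available on recent GKE versions).

gcloud container node-pools create gpu-pool \
    --cluster=ml-cluster \
    --region=us-central1 \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --num-nodes=1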
Step 6: Model training and fine-tuning
This step involves loading the dataset into the ML framework and performing the training.
ML is an iterative process, and the first round of training may not always yield the best results. As the model evolves, its state (weights, biases, and other parameters) can be saved to another volume in NetApp Volumes that is designated for storing model artifacts. This approach maintains separate, efficiently sized storage layers for training data and for model artifacts.
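A minimal sketch of such an artifacts volume, again with placeholder names: a second PVC is created and added to the deployment from step 4 so that checkpoints can be written to it from the notebook.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: model-artifacts
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: netapp-extreme

# Added to the deployment's pod template:
#   volumeMounts:
#   - name: artifacts
#     mountPath: /home/jovyan/models   # save checkpoints here
#   volumes:
#   - name: artifacts
#     persistentVolumeClaim:
#       claimName: model-artifacts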
Step 7: Versioning and cost optimization
The state of both the dataset and the model after training can be captured by using NetApp Snapshot™ technology. This step creates point-in-time Snapshot copies of the volumes that host the dataset and the model, both of which are vital for versioning and reproducibility in your ML projects. Because Snapshot copies consume capacity only for changed blocks, they help you build a cost-optimized data lineage for your AI projects.
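With the Trident CSI driver, these Snapshot copies can be taken directly from Kubernetes through the CSI snapshot API (the external-snapshotter CRDs must be installed in the cluster). A minimal sketch, with placeholder names:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: trident-snapshotclass
driver: csi.trident.netapp.io      # Trident's CSI driver handles the snapshot
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: training-data-v1           # one point-in-time version of the dataset
spec:
  volumeSnapshotClassName: trident-snapshotclass
  source:
    persistentVolumeClaimName: training-data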
Now see the steps in action
These steps make up one complete round of model training. At this point, the GPU-powered worker node can be released because it is no longer needed, and the focus switches back to model optimization and dataset hydration for the next round of training.
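Releasing the GPU capacity is a single command against the node pool from step 5 (again with placeholder names):

gcloud container node-pools delete gpu-pool \
    --cluster=ml-cluster \
    --region=us-central1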
For a complete walkthrough of these steps with an example model-training use case, watch this demonstration.
Optimize your ML workflows today
Through Google Kubernetes Engine and Google Cloud NetApp Volumes, you gain significant benefits for running your ML workflows. With on-demand access to compute resources like GPUs, combined with scalable, efficient, high-performance storage, you can be confident that your ML workflows are fast, reusable, and highly cost-effective.
Get started with Google Cloud NetApp Volumes for MLOps.
Happy training!