NetApp Volumes and Google Cluster Toolkit: Two Ways to Integrate Shared Storage

okrause

Google Cluster Toolkit (open source, gcluster) helps you deploy repeatable AI/ML and HPC environments on Google Cloud—compute, networking, storage, and schedulers—using composable modules and blueprints. Google Cloud NetApp Volumes is a fully managed file and block storage service with multiprotocol access (NFS, SMB, iSCSI), snapshots, replication, FlexCache, and tiering, so you can run enterprise and technical workloads without redesigning how applications see storage.

Cluster Toolkit integrates NetApp Volumes through dedicated filesystem modules and through a generic “mount what already exists” module. This post introduces both patterns, when each fits, and the trade-offs that matter for production clusters: lifecycle safety, feature coverage, and performance on large capacity volumes, which expose multiple access IPs.

What Cluster Toolkit provides for NetApp Volumes

The toolkit splits NetApp Volumes provisioning across two Terraform-backed modules:

netapp-storage-pool — Creates a storage pool (region, VPC attachment via Private Service Access, service level, capacity, optional CMEK, Active Directory, LDAP, auto-tiering flags, and so on). Pools are the capacity container; volumes live inside pools.
netapp-volume — Creates one or more NFS volumes inside a pool referenced by the blueprint, with export policy, protocols, tiering, and client mount wiring for modules that use the volume.

NetApp Volumes requires Private Service Access (PSA) between your VPC and the NetApp Volumes service network. Blueprints typically pair the pool module with the toolkit’s private service access module; see the examples below.

For a broader comparison of network storage options in the toolkit, see the project’s network storage documentation in the repository.

Example blueprints to start from

examples/netapp-volumes.yaml — End-to-end sample: VPC, PSA to netapp.servicenetworking.goog, a storage pool, a volume, and VMs that mount the share. Good minimal reference for “NetApp Volumes + Cluster Toolkit” in one deployment.
community/examples/eda — Electronic Design Automation reference architectures: shared NetApp Volumes NFS, Slurm, and optional hybrid patterns with FlexCache and pre-existing volumes. The README explains deployment groups (base, optional software_installation, cluster) so you can tear down compute without destroying shared storage—an important operational pattern.
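The deployment-group split described in the EDA README can be sketched roughly as follows. This is a minimal illustration, not the EDA blueprint itself: the module ids are hypothetical, the source paths for the NetApp modules are assumed, and required settings are left out for brevity.

```yaml
# Sketch only: ids are hypothetical, NetApp module paths are assumed,
# and the settings each module requires are omitted.
deployment_groups:
- group: base             # long-lived: network, pool, and volume
  modules:
  - id: network
    source: modules/network/vpc

  - id: netapp_pool
    source: modules/file-system/netapp-storage-pool   # assumed path
    use: [network]

  - id: shared_volume
    source: modules/file-system/netapp-volume          # assumed path
    use: [netapp_pool]

- group: cluster          # disposable: compute that mounts the volume
  modules:
  - id: workers
    source: modules/compute/vm-instance
    use: [network, shared_volume]
```

Because the volume lives in the base group, tearing down only the cluster group leaves the shared data in place.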

Approach 1: Mount existing NetApp Volumes with pre-existing-network-storage

The pre-existing-network-storage module does not create a NetApp pool or volume. It describes storage that already exists and produces the client install and mount runners other modules expect, so compute nodes (or Slurm images) mount an NFS export the same way they would for Filestore or another NFS server.

You (or another team) provision pools and volumes in the Google Cloud console, gcloud, Terraform outside the toolkit, or your standard automation. The blueprint only needs the server identity (IP or DNS name), export path, local mount path, and NFS options.
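As a minimal sketch, a blueprint fragment that mounts an existing NFS export might look like the following. The IP address, export path, mount options, and module ids are hypothetical placeholders, and the fragment assumes the VPC already has Private Service Access to NetApp Volumes.

```yaml
# Sketch only: IP, export path, and ids are hypothetical placeholders.
deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/pre-existing-vpc   # assumes PSA is already configured on this VPC

  - id: shared_data
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: 10.0.0.4           # hypothetical mount IP of the existing volume
      remote_mount: /shared_vol     # hypothetical export path
      local_mount: /shared          # mount point on the clients
      fs_type: nfs
      mount_options: rw,hard,vers=3 # example options; adjust to the volume's protocol

  - id: workstation
    source: modules/compute/vm-instance
    use: [network, shared_data]     # "use" hands the install and mount runners to the VM
    settings:
      instance_count: 1
```

Nothing in this fragment creates or owns the volume, so the deployment's Terraform state never includes it.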

Why choose this approach

Independent lifecycle — Storage is owned by its own process (landing zone, storage team, or separate Terraform root module). Cluster deployments focus on compute and schedulers.
Full product surface — You are not limited to what the netapp-storage-pool / netapp-volume modules expose today. For example, the netapp-storage-pool module documents support for Standard, Premium, and Extreme service levels; Flex tiers and other options may require provisioning outside the module and attaching here. Similarly, FlexCache, ONTAP-mode workflows, or SMB-heavy designs may be easier to manage outside toolkit-native NetApp modules (the netapp-volume module is oriented toward Linux NFS clients).
Safer teardown — Destroying a Cluster Toolkit deployment does not delete volumes that were never created by that deployment’s Terraform state. Valuable datasets are far less likely to disappear because someone ran gcluster destroy on a cluster blueprint.
Large volumes and multiple IPs — Large capacity volumes can expose six IP addresses for the same data. For very large client counts, you should spread mounts across those IPs to avoid hot-spotting a single endpoint. The EDA community documentation describes using a Cloud DNS record with all six IPs and mounting by FQDN so clients resolve to different addresses (round-robin). You can point pre-existing-network-storage at that DNS name instead of a single IP. (The netapp-volume module notes that pre-existing-network-storage does not accept a list of IPs; DNS is the practical way to represent multi-IP exports in one field.)
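For the multi-IP case, the same module can be pointed at a DNS name, as described above. The sketch below assumes a Cloud DNS A record that already lists all six volume IPs; the FQDN, export path, and module id are hypothetical.

```yaml
# Sketch only: the FQDN and export path are hypothetical. Assumes a Cloud DNS
# A record for this name that contains every IP of the large capacity volume.
  - id: shared_data
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: bigvol.storage.example.internal   # DNS name in place of a single IP
      remote_mount: /bigvol
      local_mount: /shared
      fs_type: nfs
      mount_options: rw,hard,vers=3
```

Each client resolves the name when it mounts, so the spread is per client rather than per request; with many clients, mounts should land roughly evenly across the addresses.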

Trade-offs

Two planes of management — Storage and compute are coordinated by convention (naming, networking, exports), not a single Terraform graph. Teams that want one blueprint to own everything may find this split less convenient—though many enterprises prefer it for blast-radius and ownership reasons.

Approach 2: Provision pools and volumes with netapp-storage-pool + netapp-volume

Here, Terraform in the Cluster Toolkit deployment creates the pool and volumes and wires exports to clients through module use relationships—similar to the netapp-volumes.yaml example.
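As a rough sketch of this approach, the fragment below places PSA, a pool, a volume, and one VM in a single deployment group. The module source paths and the setting names (pool_name, capacity_gib, service_level, volume_name, protocols, local_mount) are assumptions inferred from the module descriptions earlier in this post, and all values are placeholders; confirm them against each module's README and examples/netapp-volumes.yaml before use.

```yaml
# Sketch only. Source paths, setting names, and values are assumptions;
# confirm them against the module READMEs in the repository.
# Assumes vars.region is defined in the blueprint's vars block.
deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: private_service_access
    source: modules/network/private-service-access   # assumed path
    use: [network]
    settings:
      service_name: netapp.servicenetworking.goog

  - id: netapp_pool
    source: modules/file-system/netapp-storage-pool   # assumed path
    use: [network, private_service_access]
    settings:
      pool_name: demo-pool         # hypothetical name
      region: $(vars.region)
      service_level: PREMIUM
      capacity_gib: 2048

  - id: shared_volume
    source: modules/file-system/netapp-volume          # assumed path
    use: [netapp_pool]
    settings:
      volume_name: demo-vol        # hypothetical name
      capacity_gib: 1024
      protocols: [NFSV3]
      local_mount: /shared

  - id: workstation
    source: modules/compute/vm-instance
    use: [network, shared_volume]  # the volume's mount wiring reaches the VM here
```

Everything in this group shares one Terraform state, which is what makes the single-blueprint flow convenient and also what ties the volume's lifetime to the deployment.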

Why choose this approach

Single blueprint — One gcluster create / deploy flow can stand up VPC, PSA, pool, volumes, and VMs or Slurm partitions that consume those volumes, which simplifies onboarding and CI-driven environments.
Repeatable dev/test — Ephemeral projects benefit from defining storage and compute together for reproducible performance tests.

Trade-offs

Destroy = delete data risk — The netapp-volume module documentation states clearly that it does not implement deletion protection: running destroy on the deployment that created the volume deletes the volume and its data. Operational discipline (separate deployment groups, destroy only the cluster group, backups, snapshots) is essential. For long-lived data stores, many teams still prefer Approach 1.
Large volumes and Slurm — For large capacity volumes, the module exposes multiple server_ips, but when clients attach via the toolkit’s use directive, only the first IP is used today; spreading across all six IPs is listed as a future improvement. The EDA reference notes the same limitation for Slurm-managed VMs: Slurm currently uses only the first IP of a large volume. For maximum throughput across all IPs on large volumes with Slurm, you may need the DNS-based pattern from Approach 1 or other client-side distribution until toolkit support evolves.

Choosing a path

Concern	Prefer pre-existing-network-storage	Prefer netapp-storage-pool + netapp-volume
Protect data from accidental blueprint teardown	Strong fit	Requires strict process or separate deployment groups
Need Flex / features not in pool module	Strong fit	May need external provisioning + pre-existing mount
Multi-IP large volume fan-out (many clients)	DNS + FQDN pattern	Limited with use / Slurm until improved
One-shot lab or integrated demo	Optional	Strong fit
Slurm + large volume performance scaling	DNS approach often better	Check current toolkit release notes for IP usage

Summary

Cluster Toolkit gives you two sound ways to use NetApp Volumes: reference existing exports with pre-existing-network-storage for lifecycle isolation, full feature flexibility, and safer teardown; or provision pools and volumes in the blueprint with netapp-storage-pool and netapp-volume when you want a single automated stack and can manage destroy risk deliberately.

Start from examples/netapp-volumes.yaml for integrated provisioning, or from community/examples/eda for Slurm and hybrid storage patterns. For product background, see the NetApp Volumes overview and the Cluster Toolkit repository: https://github.com/GoogleCloudPlatform/cluster-toolkit.