Google Cluster Toolkit is an open-source project, driven by the gcluster command-line tool, that helps you deploy repeatable AI/ML and HPC environments on Google Cloud (compute, networking, storage, and schedulers) using composable modules and blueprints. Google Cloud NetApp Volumes is a fully managed file and block storage service with multiprotocol access (NFS, SMB, iSCSI), snapshots, replication, FlexCache, and tiering, so you can run enterprise and technical workloads without redesigning how applications see storage.
Cluster Toolkit integrates NetApp Volumes through dedicated filesystem modules and through a generic “mount what already exists” module. This post introduces both patterns, when each fits, and the trade-offs that matter for production clusters—especially lifecycle safety, feature coverage, and performance on large capacity volumes (multiple access IPs).
What Cluster Toolkit provides for NetApp Volumes
The toolkit splits NetApp Volumes into two Terraform-backed modules:
- netapp-storage-pool — Creates a storage pool (region, VPC attachment via Private Service Access, service level, capacity, optional CMEK, Active Directory, LDAP, auto-tiering flags, and so on). Pools are the capacity container; volumes live inside pools.
- netapp-volume — Creates one or more NFS volumes inside a pool referenced by the blueprint, with export policy, protocols, tiering, and client mount wiring for modules that use the volume.
NetApp Volumes requires Private Service Access (PSA) between your VPC and the NetApp Volumes service network. Blueprints typically pair the pool module with the toolkit’s private service access module; see the examples below.
For a broader comparison of network storage options in the toolkit, see the project’s network storage documentation in the repository.
Example blueprints to start from
- examples/netapp-volumes.yaml — End-to-end sample: VPC, PSA to netapp.servicenetworking.goog, a storage pool, a volume, and VMs that mount the share. Good minimal reference for “NetApp Volumes + Cluster Toolkit” in one deployment.
- community/examples/eda — Electronic Design Automation reference architectures: shared NetApp Volumes NFS, Slurm, and optional hybrid patterns with FlexCache and pre-existing volumes. The README explains deployment groups (base, optional software_installation, cluster) so you can tear down compute without destroying shared storage—an important operational pattern.
Approach 1: Mount existing NetApp Volumes with pre-existing-network-storage
The pre-existing-network-storage module does not create a NetApp pool or volume. It describes storage that already exists and produces the client install and mount runners other modules expect, so compute nodes (or Slurm images) mount an NFS export the same way they would for Filestore or another NFS server.
You (or another team) provision pools and volumes in the Google Cloud console, gcloud, Terraform outside the toolkit, or your standard automation. The blueprint only needs the server identity (IP or DNS name), export path, local mount path, and NFS options.
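As a rough sketch, those four inputs map onto the module's settings as shown here. The fragment belongs in a deployment group's modules list. The module id, IP address, export path, mount point, and NFS options are hypothetical placeholders, not values from a real deployment.

```yaml
# Sketch only: the id, address, paths, and options are hypothetical placeholders.
- id: eda_data
  source: modules/file-system/pre-existing-network-storage
  settings:
    server_ip: 10.0.0.4                      # IP (or DNS name) of the existing volume
    remote_mount: /eda-data                  # export path of the volume
    local_mount: /mnt/eda-data               # where clients should mount it
    fs_type: nfs
    mount_options: "rw,hard,vers=3,_netdev"  # illustrative NFS options; tune for your workload
```

Any module that lists eda_data in its use field then receives the install and mount runners for this export.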
Why choose this approach
- Independent lifecycle — Storage is owned by its own process (landing zone, storage team, or separate Terraform root module). Cluster deployments focus on compute and schedulers.
- Full product surface — You are not limited to what the netapp-storage-pool / netapp-volume modules expose today. For example, the netapp-storage-pool module documents support for Standard, Premium, and Extreme service levels; Flex tiers and other options may require provisioning outside the module and attaching here. Similarly, FlexCache, ONTAP-mode workflows, or SMB-heavy designs may be easier to manage outside toolkit-native NetApp modules (the netapp-volume module is oriented toward Linux NFS clients).
- Safer teardown — Destroying a Cluster Toolkit deployment does not delete volumes that were never created by that deployment’s Terraform state. Valuable datasets are far less likely to disappear because someone ran gcluster destroy on a cluster blueprint.
- Large volumes and multiple IPs — Large capacity volumes can expose six IP addresses for the same data. For very large client counts, you should spread mounts across those IPs to avoid hot-spotting a single endpoint. The EDA community documentation describes using a Cloud DNS record with all six IPs and mounting by FQDN so clients resolve to different addresses (round-robin). You can point pre-existing-network-storage at that DNS name instead of a single IP. (The netapp-volume module notes that pre-existing-network-storage does not accept a list of IPs; DNS is the practical way to represent multi-IP exports in one field.)
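The DNS pattern from the last bullet might look like this in a blueprint. This is a sketch under assumptions: the FQDN, export path, module ids, and VM sizing are hypothetical, the Cloud DNS record holding all six IPs is assumed to exist already, and network stands for a VPC module defined elsewhere in the same blueprint.

```yaml
# Sketch only: names and sizes are hypothetical.
- id: eda_scratch
  source: modules/file-system/pre-existing-network-storage
  settings:
    server_ip: scratch.eda.example.internal  # hypothetical FQDN with one A record per volume IP
    remote_mount: /eda-scratch
    local_mount: /mnt/scratch
    fs_type: nfs

- id: workers
  source: modules/compute/vm-instance
  use: [network, eda_scratch]                # each VM mounts by FQDN, not by a fixed IP
  settings:
    instance_count: 4
    machine_type: c2-standard-8
```

Each client resolves the name when it mounts, so the spread across addresses depends on resolver behavior. Checking the mounted server address on a few nodes (for example with findmnt) is a quick way to confirm that clients are not all landing on one IP.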
Trade-offs
- Two planes of management — Storage and compute are coordinated by convention (naming, networking, exports), not a single Terraform graph. Teams that want one blueprint to own everything may find this split less convenient—though many enterprises prefer it for blast-radius and ownership reasons.
Approach 2: Provision pools and volumes with netapp-storage-pool + netapp-volume
Here, Terraform in the Cluster Toolkit deployment creates the pool and volumes and wires exports to clients through module use relationships—similar to the netapp-volumes.yaml example.
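A condensed sketch of that layout, loosely modeled on examples/netapp-volumes.yaml, is shown below. The module sources follow the toolkit's directory layout. The setting names on the PSA, pool, and volume modules, and every value, are assumptions for illustration, so check them against each module's README and the example file before use.

```yaml
# Sketch only: setting names and values are assumptions to verify against the module READMEs.
blueprint_name: netapp-sketch          # hypothetical

vars:
  project_id: my-project               # hypothetical
  deployment_name: netapp-sketch
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  # Private Service Access peering to the NetApp Volumes service network
  - id: private_service_access
    source: modules/network/private-service-access
    use: [network]
    settings:
      service_name: netapp.servicenetworking.goog
      prefix_length: 24                # assumed setting and value

  # Capacity container; volumes are created inside it
  - id: netapp_pool
    source: modules/file-system/netapp-storage-pool
    use: [network, private_service_access]
    settings:
      pool_name: sketch-pool           # assumed setting names below
      service_level: PREMIUM
      capacity_gib: 2048

  - id: netapp_volume
    source: modules/file-system/netapp-volume
    use: [netapp_pool]
    settings:
      volume_name: sketch-volume       # assumed setting names below
      capacity_gib: 1024
      local_mount: /netapp
      protocols: [NFSV3]

  # Client VM; the use edge to the volume supplies the mount wiring
  - id: workstation
    source: modules/compute/vm-instance
    use: [network, netapp_volume]
    settings:
      instance_count: 1
```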
Why choose this approach
- Single blueprint — One gcluster create / deploy flow can stand up VPC, PSA, pool, volumes, and VMs or Slurm partitions that consume those volumes, which simplifies onboarding and CI-driven environments.
- Repeatable dev/test — Ephemeral projects benefit from defining storage and compute together for reproducible performance tests.
Trade-offs
- Destroy = delete data risk — The netapp-volume module documentation states clearly that it does not implement deletion protection: running destroy on the deployment that created the volume deletes the volume and its data. Operational discipline (separate deployment groups, destroy only the cluster group, backups, snapshots) is essential. For long-lived data stores, many teams still prefer Approach 1.
- Large volumes and Slurm — For large capacity volumes, the module exposes multiple server_ips, but when clients attach via the toolkit’s use directive, only the first IP is used today; spreading across all six IPs is listed as a future improvement. The EDA reference notes the same limitation for Slurm-managed VMs: Slurm currently uses only the first IP of a large volume. For maximum throughput across all IPs on large volumes with Slurm, you may need the DNS-based pattern from Approach 1 or other client-side distribution until toolkit support evolves.
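One way to apply the deployment-group discipline mentioned above is to keep storage and compute in separate groups of the same blueprint, in the spirit of the EDA example's base and cluster groups. The sketch below shows structure only: module settings are omitted, the ids are hypothetical, and the use edge from the cluster group into the base group assumes cross-group references work as they do in the EDA reference.

```yaml
# Structural sketch only: settings omitted, ids hypothetical.
deployment_groups:
# Long-lived group: network, PSA, pool, and volume. Leave it deployed.
- group: base
  modules:
  - id: network
    source: modules/network/vpc
  - id: private_service_access
    source: modules/network/private-service-access
    use: [network]
  - id: netapp_pool
    source: modules/file-system/netapp-storage-pool
    use: [network, private_service_access]
  - id: netapp_volume
    source: modules/file-system/netapp-volume
    use: [netapp_pool]

# Disposable group: compute that mounts the volume.
- group: cluster
  modules:
  - id: workers
    source: modules/compute/vm-instance
    use: [network, netapp_volume]
```

With this split, teardown can target the cluster group alone and leave the base group, and the data in it, in place. The EDA README explains the workflow for its own groups.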
Choosing a path
| Concern | Prefer pre-existing-network-storage | Prefer netapp-storage-pool + netapp-volume |
| --- | --- | --- |
| Protect data from accidental blueprint teardown | Strong fit | Requires strict process or separate deployment groups |
| Need Flex / features not in pool module | Strong fit | May need external provisioning + pre-existing mount |
| Multi-IP large volume fan-out (many clients) | DNS + FQDN pattern | Limited with use / Slurm until improved |
| One-shot lab or integrated demo | Optional | Strong fit |
| Slurm + large volume performance scaling | DNS approach often better | Check current toolkit release notes for IP usage |
Summary
Cluster Toolkit gives you two sound ways to use NetApp Volumes: reference existing exports with pre-existing-network-storage for lifecycle isolation, full feature flexibility, and safer teardown; or provision pools and volumes in the blueprint with netapp-storage-pool and netapp-volume when you want a single automated stack and can manage destroy risk deliberately.
Start from examples/netapp-volumes.yaml for integrated provisioning, or from community/examples/eda for Slurm and hybrid storage patterns. For product background, see the NetApp Volumes overview and the Cluster Toolkit repository: https://github.com/GoogleCloudPlatform/cluster-toolkit.