Tech ONTAP Blogs

NetApp Volumes and Google Cluster Toolkit: Two Ways to Integrate Shared Storage

okrause
NetApp

Google Cluster Toolkit is an open-source tool (CLI: gcluster) that helps you deploy repeatable AI/ML and HPC environments on Google Cloud, including compute, networking, storage, and schedulers, using composable modules and blueprints. Google Cloud NetApp Volumes is a fully managed file and block storage service with multiprotocol access (NFS, SMB, iSCSI), snapshots, replication, FlexCache, and tiering, so you can run enterprise and technical workloads without redesigning how applications see storage.

 

Cluster Toolkit integrates NetApp Volumes in two ways: through dedicated filesystem modules and through a generic “mount what already exists” module. This post introduces both patterns, when each fits, and the trade-offs that matter for production clusters, especially lifecycle safety, feature coverage, and performance on large-capacity volumes, which expose multiple access IPs.

 

What Cluster Toolkit provides for NetApp Volumes

The toolkit splits NetApp Volumes provisioning across two Terraform-backed modules:

 

  • netapp-storage-pool — Creates a storage pool (region, VPC attachment via Private Service Access, service level, capacity, optional CMEK, Active Directory, LDAP, auto-tiering flags, and so on). Pools are the capacity container; volumes live inside pools.
  • netapp-volume — Creates one or more NFS volumes inside a pool referenced by the blueprint, with export policy, protocols, tiering, and client mount wiring for modules that use the volume.

 

NetApp Volumes requires Private Service Access (PSA) between your VPC and the NetApp Volumes service network. Blueprints typically pair the pool module with the toolkit’s private service access module; see the examples below.
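
As a rough sketch of how these pieces chain together, the fragment below wires a VPC, PSA, and a storage pool. The module paths and every setting name here are assumptions modeled on the toolkit's usual conventions, not copied from the module READMEs, and the pool name and size are placeholders. Treat examples/netapp-volumes.yaml as the authoritative version.

```yaml
# Sketch only: module paths and setting names are assumptions.
# Verify each one against the module READMEs before use.
deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  # PSA peering between the VPC and the NetApp Volumes service network
  - id: private_service_access
    source: community/modules/network/private-service-access
    use: [network]
    settings:
      service_name: netapp.servicenetworking.goog  # assumed setting name

  # The pool is the capacity container; volumes are created inside it
  - id: netapp_pool
    source: modules/file-system/netapp-storage-pool
    use: [network, private_service_access]
    settings:
      pool_name: demo-pool      # hypothetical name
      service_level: PREMIUM    # module documents Standard, Premium, Extreme
      capacity_gib: 2048        # hypothetical size
```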

 

For a broader comparison of network storage options in the toolkit, see the project’s network storage documentation in the repository.

 

Example blueprints to start from

  • examples/netapp-volumes.yaml — End-to-end sample: VPC, PSA to netapp.servicenetworking.goog, a storage pool, a volume, and VMs that mount the share. Good minimal reference for “NetApp Volumes + Cluster Toolkit” in one deployment.
  • community/examples/eda — Electronic Design Automation reference architectures: shared NetApp Volumes NFS, Slurm, and optional hybrid patterns with FlexCache and pre-existing volumes. The README explains deployment groups (base, optional software_installation, cluster) so you can tear down compute without destroying shared storage—an important operational pattern.

 

 

Approach 1: Mount existing NetApp Volumes with pre-existing-network-storage

The pre-existing-network-storage module does not create a NetApp pool or volume. It describes storage that already exists and produces the client install and mount runners other modules expect, so compute nodes (or Slurm images) mount an NFS export the same way they would for Filestore or another NFS server.

 

You (or another team) provision pools and volumes in the Google Cloud console, gcloud, Terraform outside the toolkit, or your standard automation. The blueprint only needs the server identity (IP or DNS name), export path, local mount path, and NFS options.
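
A minimal sketch of such a blueprint fragment follows. The IP address, export path, mount path, and NFS options are hypothetical placeholders; substitute the values shown in the mount instructions for your own volume. The module ids (network, datafs, workstation) are also illustrative.

```yaml
# Sketch: mount an existing NetApp Volumes NFS export on client VMs.
# All addresses, paths, and options below are placeholders.
deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/pre-existing-vpc

  - id: datafs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: 10.0.0.4           # hypothetical volume IP
      remote_mount: /projects       # export path of the existing volume
      local_mount: /mnt/projects    # where clients mount it
      fs_type: nfs
      mount_options: rw,hard,vers=3,tcp  # illustrative NFS options

  # Compute modules that list datafs under "use" receive the mount runners
  - id: workstation
    source: modules/compute/vm-instance
    use: [network, datafs]
    settings:
      machine_type: n2-standard-4
```

Because the storage module here only describes a mount, nothing in this deployment's Terraform state owns the volume.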

 

Why choose this approach

  • Independent lifecycle — Storage is owned by its own process (landing zone, storage team, or separate Terraform root module). Cluster deployments focus on compute and schedulers.
  • Full product surface — You are not limited to what the netapp-storage-pool / netapp-volume modules expose today. For example, the netapp-storage-pool module documents support for Standard, Premium, and Extreme service levels; Flex tiers and other options may require provisioning outside the module and attaching here. Similarly, FlexCache, ONTAP-mode workflows, or SMB-heavy designs may be easier to manage outside toolkit-native NetApp modules (the netapp-volume module is oriented toward Linux NFS clients).
  • Safer teardown — Destroying a Cluster Toolkit deployment does not delete volumes that were never created by that deployment’s Terraform state. Valuable datasets are far less likely to disappear because someone ran gcluster destroy on a cluster blueprint.
  • Large volumes and multiple IPs — Large capacity volumes can expose six IP addresses for the same data. For very large client counts, you should spread mounts across those IPs to avoid hot-spotting a single endpoint. The EDA community documentation describes using a Cloud DNS record with all six IPs and mounting by FQDN so clients resolve to different addresses (round-robin). You can point pre-existing-network-storage at that DNS name instead of a single IP. (The netapp-volume module notes that pre-existing-network-storage does not accept a list of IPs; DNS is the practical way to represent multi-IP exports in one field.)
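
The DNS-based pattern for large volumes might look like the sketch below. It assumes a Cloud DNS A record, created outside the blueprint, that lists all six volume IPs. The FQDN, export path, and mount options are hypothetical.

```yaml
# Sketch: mount a large-capacity volume by FQDN so clients resolve to
# different IPs (round-robin). The DNS record itself is created elsewhere.
  - id: bigfs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: bigvol.storage.example.internal  # hypothetical FQDN, not a single IP
      remote_mount: /bigvol                       # hypothetical export path
      local_mount: /mnt/bigvol
      fs_type: nfs
      mount_options: rw,hard,vers=3,tcp           # illustrative NFS options
```

Each client picks whichever address its resolver returns at mount time, so the spread across IPs is statistical rather than guaranteed to be even.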

 

Trade-offs

  • Two planes of management — Storage and compute are coordinated by convention (naming, networking, exports), not a single Terraform graph. Teams that want one blueprint to own everything may find this split less convenient—though many enterprises prefer it for blast-radius and ownership reasons.

 

 

Approach 2: Provision pools and volumes with netapp-storage-pool + netapp-volume

Here, Terraform in the Cluster Toolkit deployment creates the pool and volumes and wires exports to clients through module use relationships—similar to the netapp-volumes.yaml example.
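
The wiring can be sketched as follows. The fragment assumes that a pool module with the id netapp_pool and a network module with the id network are defined earlier in the same blueprint. The netapp-volume setting names are assumptions to check against the module README, and all names and sizes are placeholders.

```yaml
# Sketch only: netapp-volume setting names are assumptions.
  - id: homefs
    source: modules/file-system/netapp-volume
    use: [netapp_pool]            # volume is created inside this pool
    settings:
      volume_name: demo-home      # hypothetical name
      capacity_gib: 1024          # hypothetical size
      local_mount: /home          # mount point on clients
      protocols: [NFSV3]          # assumed value format

  # Listing homefs under "use" makes the VMs mount the volume at boot
  - id: workstation
    source: modules/compute/vm-instance
    use: [network, homefs]
    settings:
      machine_type: n2-standard-4
      instance_count: 2
```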

 

Why choose this approach

  • Single blueprint — One gcluster create / deploy flow can stand up VPC, PSA, pool, volumes, and VMs or Slurm partitions that consume those volumes, which simplifies onboarding and CI-driven environments.
  • Repeatable dev/test — Ephemeral projects benefit from defining storage and compute together for reproducible performance tests.

 

Trade-offs

  • Destroy = delete data risk — The netapp-volume module documentation states clearly that it does not implement deletion protection: running destroy on the deployment that created the volume deletes the volume and its data. Operational discipline (separate deployment groups, destroy only the cluster group, backups, snapshots) is essential. For long-lived data stores, many teams still prefer Approach 1.
  • Large volumes and Slurm — For large capacity volumes, the module exposes multiple server_ips, but when clients attach via the toolkit’s use directive, only the first IP is used today; spreading across all six IPs is listed as a future improvement. The EDA reference notes the same limitation for Slurm-managed VMs: Slurm currently uses only the first IP of a large volume. For maximum throughput across all IPs on large volumes with Slurm, you may need the DNS-based pattern from Approach 1 or other client-side distribution until toolkit support evolves.
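
One way to apply the deployment-group discipline mentioned above is to keep storage and compute in separate groups, as the EDA example does with its base and cluster groups. The skeleton below is a sketch with settings omitted. The module ids are hypothetical, and the netapp module paths are assumptions.

```yaml
# Sketch: storage lives in "base", compute in "cluster", so the cluster
# group can be destroyed without touching the pool or volume.
deployment_groups:
- group: base
  modules:
  - id: network
    source: modules/network/vpc
  - id: netapp_pool
    source: modules/file-system/netapp-storage-pool   # assumed path
    use: [network]
  - id: homefs
    source: modules/file-system/netapp-volume         # assumed path
    use: [netapp_pool]

- group: cluster
  modules:
  # References modules from the earlier group
  - id: workstation
    source: modules/compute/vm-instance
    use: [network, homefs]
```

To tear down only the compute side, restrict the destroy operation to the cluster group. See gcluster destroy --help in your release for the group-selection flags.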

 

 

Choosing a path

| Concern | Prefer pre-existing-network-storage | Prefer netapp-storage-pool + netapp-volume |
|---|---|---|
| Protect data from accidental blueprint teardown | Strong fit | Requires strict process or separate deployment groups |
| Need Flex / features not in pool module | Strong fit | May need external provisioning + pre-existing mount |
| Multi-IP large volume fan-out (many clients) | DNS + FQDN pattern | Limited with use / Slurm until improved |
| One-shot lab or integrated demo | Optional | Strong fit |
| Slurm + large volume performance scaling | DNS approach often better | Check current toolkit release notes for IP usage |

 

 

Summary

Cluster Toolkit gives you two sound ways to use NetApp Volumes: reference existing exports with pre-existing-network-storage for lifecycle isolation, full feature flexibility, and safer teardown; or provision pools and volumes in the blueprint with netapp-storage-pool and netapp-volume when you want a single automated stack and can manage destroy risk deliberately.

 

Start from examples/netapp-volumes.yaml for integrated provisioning, or from community/examples/eda for Slurm and hybrid storage patterns. For product background, see the NetApp Volumes overview and the Cluster Toolkit repository: https://github.com/GoogleCloudPlatform/cluster-toolkit.

 
