Modernizing SQL Server Failover Cluster Instances (FCI) with Google Cloud NetApp Volumes

sajith
NetApp

Are you moving your self-managed SQL Server instances to Google Cloud but concerned about maintaining high availability, controlling licensing costs, and ensuring consistent backup and recovery practices? Google Cloud NetApp Volumes (GCNV) now supports iSCSI shared storage with the Flex Unified service level, so you can deploy the same reliable, cost-saving model in the cloud that you trust on-premises.

 

Google Cloud NetApp Volumes (GCNV) Flex Unified, with iSCSI block storage, allows you to maintain the same on-premises FCI model by delivering cloud‑native shared iSCSI storage. GCNV provides enterprise‑grade performance with ONTAP data‑management capabilities, including snapshots, thin clones, backups, and cross‑region replication, all in a fully managed service!

 

This blog walks through the SQL Server HA architectures, how FCI is deployed using GCNV Regional iSCSI volumes, how application‑consistent snapshots and clones work, and how to build cross‑region DR for SQL Server.

 

SQL Server High Availability Architectures: FCI vs Always On Availability Groups

Most organizations use one of two architectures to deploy an MSSQL database solution: the Failover Cluster Instance (FCI) architecture or the Always On Availability Groups (AOAG) architecture. Both provide high availability, but FCI is generally considered the more efficient of the two in storage and licensing. The architectures are described below.

 

The Failover Cluster Instance Architecture (FCI)

An FCI architecture is a single SQL Server instance installed across multiple Windows Server Failover Cluster (WSFC) nodes. It uses shared storage, and only the active node accesses the database files at any time. If the active node fails, WSFC automatically fails the instance over to another node with zero data loss because both nodes share the same underlying LUNs. FCI provides instance‑level protection, covering logins, jobs, SQL Agent metadata, and system databases.

sajith_0-1771580487989.png

 

 

The Always On Availability Groups (AOAG) Architecture

The AOAG architecture consists of multiple SQL Server instances, each of which keeps its own copy of the data on separate LUNs to maintain high availability. One instance handles reads and writes, and the other instances are read-only. When data changes on the primary instance, MSSQL must replicate the change to all the read-only instances. If the primary read/write instance fails, one of the read-only instances becomes the primary.

sajith_1-1771580487996.png

 

 

Why organizations choose FCI on GCNV

While it maintains high availability, the AOAG architecture requires several copies of the same volume, which increases storage costs. MSSQL replication must also be configured to all the read-only instances, which requires expensive MSSQL Enterprise licensing. Additionally, to achieve the required database performance, compute instances larger than MSSQL itself needs must often be deployed to avoid a hypervisor bottleneck. Because SQL Server is licensed per core, the larger compute instances then increase the MSSQL Enterprise license costs even further.

 

However, enterprises adopting FCI with iSCSI shared storage on Google Cloud NetApp Volumes obtain high availability while simplifying the architecture and drastically reducing costs. Since the databases do not need to be replicated among several MSSQL instances, enterprises using the FCI architecture save on storage costs and can deploy the less expensive SQL Server Standard licenses. They can also right-size their VMs to exactly what MSSQL needs, knowing that network performance is not affected by VM size. This results in large cost savings, as described in "Cut SQL Server Costs by up to 50% with Google Cloud NetApp Volumes". As a bonus, the storage layer can handle backups and disaster recovery, lowering CPU usage on the VMs.
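
As a rough illustration of the storage side of this argument, the sketch below compares the database capacity you would provision under each design. The function name, the 2 TiB database size, and the three-instance AOAG layout are hypothetical values chosen for the example, and pricing is deliberately not modeled.

```python
def provisioned_copies_gib(db_size_gib: int, copies: int) -> int:
    """Capacity needed when the database is stored `copies` times."""
    if db_size_gib < 0 or copies < 1:
        raise ValueError("db_size_gib must be >= 0 and copies must be >= 1")
    return db_size_gib * copies


db_size_gib = 2048  # hypothetical 2 TiB database

# AOAG: every instance keeps its own copy (here: 1 primary + 2 read-only).
aoag_gib = provisioned_copies_gib(db_size_gib, copies=3)

# FCI: all nodes share one set of LUNs.
fci_gib = provisioned_copies_gib(db_size_gib, copies=1)

print(aoag_gib, fci_gib)  # 6144 2048
```

The gap grows linearly with every additional AOAG instance, while the FCI figure stays fixed no matter how many WSFC nodes are added.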

 

Deploying SQL Server FCI with GCNV Regional iSCSI Volumes

 

The following steps are required to deploy the FCI architecture in Google Cloud with Google Cloud NetApp Volumes:

  1. Create a GCNV Flex Unified Regional storage pool.
  2. Provision multiple LUNs for Data, Log, TempDB, and Quorum.
  3. Attach the LUNs to both SQL nodes using the Windows iSCSI initiator.
  4. Create a Windows Failover Cluster with the required Windows Servers.
  5. Add the disks to WSFC.
  6. Install SQL Server FCI and point storage paths to the shared GCNV LUNs.

These steps mirror traditional on‑prem SAN design, minimizing change and easing migrations of existing SQL deployments.

 

Protect SQL Server FCI from Zonal Failures on Google Cloud with Regional Pools

 

One of the most powerful high‑availability features for SQL Server FCI deployments on Google Cloud NetApp Volumes is the use of Regional storage pools. These pools replicate data synchronously across two independent zones within a region, so shared block storage remains available even if an entire zone goes down. They deliver 99.99% availability for shared LUNs, which matches the expectations for mission‑critical applications requiring high availability. Regional pools can be provisioned with independent capacity, throughput (up to 5,120 MiB/s), and IOPS (up to 160k IOPS). Volumes used for databases are deployed within the pools.
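
To put the 99.99% figure in perspective, the short sketch below converts an availability percentage into a yearly downtime budget. It is plain arithmetic, not a statement of the service's SLA terms, and the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


print(round(allowed_downtime_minutes(99.99), 2))  # 52.56
```

In other words, 99.99% availability leaves a budget of under an hour of unavailability per year.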

Since Regional GCNV pools protect the database against a zonal failure, the WSFC nodes are intentionally deployed so that the primary WSFC node resides in the same zone as the primary GCNV pool, while the standby WSFC node is placed in the replica zone. This ensures that SQL Server FCI has consistent shared block storage accessible from both nodes across the two zones.

This design directly complements Windows Server Failover Clustering, enabling fully automated failover with no manual intervention, even during zonal outages.

 

Deploying MSSQL across zones 

Let’s take a concrete example:

  • Regional storage pool:
    Replicated across us‑east1‑b and us‑east1‑c
  • WSFC cluster configuration:
    • SQL Node 1 in us‑east1‑b
    • SQL Node 2 in us‑east1‑c

GCNV synchronously mirrors storage blocks between these two zones, so both nodes see the same shared iSCSI LUNs at all times.
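
A layout like this is easy to sanity-check before you build it. The sketch below is a minimal, hypothetical helper (the class, function, and node names are invented for illustration) that flags WSFC nodes placed outside the pool's zones and pool zones that have no node.

```python
from dataclasses import dataclass


@dataclass
class ClusterLayout:
    pool_zones: set[str]        # zones the Regional pool spans
    node_zones: dict[str, str]  # WSFC node name -> zone the VM runs in


def placement_problems(layout: ClusterLayout) -> list[str]:
    """Return readable problems; an empty list means the layout looks fine."""
    problems = []
    for node, zone in sorted(layout.node_zones.items()):
        if zone not in layout.pool_zones:
            problems.append(f"{node} runs in {zone}, outside the pool zones")
    for zone in sorted(layout.pool_zones - set(layout.node_zones.values())):
        problems.append(f"no WSFC node in pool zone {zone}")
    return problems


layout = ClusterLayout(
    pool_zones={"us-east1-b", "us-east1-c"},
    node_zones={"sql-node-1": "us-east1-b", "sql-node-2": "us-east1-c"},
)
print(placement_problems(layout))  # []
```

If both nodes were placed in us-east1-b, the check would report that us-east1-c has no node, which is exactly the layout that would leave SQL Server without compute after a zonal outage.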

 

sajith_2-1771580487998.png

 

 

Normal Operation (Both Zones Healthy)

When both zones are available:

  1. The WSFC SQL node in us‑east1‑b typically acts as the active node, hosting the SQL Server FCI instance.
  2. All read/write I/O operations are served from the Regional storage pool, with synchronous writes mirrored to the secondary zone.
  3. Latency remains low because both zones are in the same region and are engineered for synchronous durability.

This matches the expected behavior of on‑prem SANs providing dual‑controller synchronous mirroring, except now it’s cloud‑native.

 

Zonal Failure Scenario and Automatic Failover

If us‑east1‑b experiences a zonal outage due to infrastructure, networking, or power disruptions, the following will occur on the storage layer:

  • The Regional pool continues operating from the surviving zone (us‑east1‑c).
  • Because the storage is synchronously replicated, there is no data loss.
  • LUNs remain fully available to SQL Node 2.

Google Cloud NetApp Volumes Regional pools are specifically built to ensure availability during zonal failure events. But what happens to the compute (Windows Server Failover Cluster) layer?

  • The WSFC cluster detects that Node 1 in us‑east1‑b is unreachable.
  • WSFC automatically triggers a failover to SQL Node 2 in us‑east1‑c, bringing the SQL Server FCI instance online.
  • No storage remapping or manual recovery is required — the LUNs are already visible and consistent.
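
Why can a two-node cluster keep running after losing a node? The sketch below shows the simplified majority rule behind WSFC quorum, assuming the Quorum LUN from the deployment steps is configured as a disk witness and remains reachable from the surviving zone. Windows Server can also adjust votes dynamically, which this static model ignores; the function name is illustrative.

```python
def has_quorum(votes_online: int, votes_total: int) -> bool:
    """Static majority rule: strictly more than half of all votes are online."""
    if not 0 <= votes_online <= votes_total:
        raise ValueError("votes_online must be between 0 and votes_total")
    return 2 * votes_online > votes_total


# Two nodes plus a disk witness on the Quorum LUN: 3 votes in total.
# A zonal outage takes Node 1 offline, leaving Node 2 and the witness.
print(has_quorum(votes_online=2, votes_total=3))  # True

# Without a witness, the surviving node holds only 1 of 2 votes.
print(has_quorum(votes_online=1, votes_total=2))  # False
```

Under these assumptions, the witness vote is what lets Node 2 retain a majority, which is why the Quorum LUN benefits from living on the same zone-resilient storage as the data.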

Access to the database is restored automatically at both the compute and storage layers, with only a brief interruption while the SQL Server FCI instance starts on the surviving node. This is exactly how customers expect SQL FCI to behave on‑prem, and it now works the same way in Google Cloud.

 

Application‑Consistent Snapshots on Google Cloud NetApp Volumes

One of the biggest advantages of running SQL Server on Google Cloud NetApp Volumes is the ability to take instantaneous storage-level snapshots for backup, rapid restore, cloning, and Dev/Test.

By default, GCNV creates crash‑consistent snapshots, which are suitable for many workloads but may miss data that has not yet been flushed, because client caching is not synchronized with the snapshot. For workloads like SQL Server that maintain their own sophisticated buffer pool and transaction logging, crash consistency is safe, but a restore has to go through SQL Server's crash‑recovery roll‑forward/roll‑back logic.

However, for mission‑critical SQL Server databases and FCI deployments, organizations often require application-consistent snapshots, which ensure a clean recovery point that aligns with how SQL Server manages in‑flight I/O and page flushing.

GCNV fully supports application-consistent snapshots through a simple workflow that combines SQL Server’s native T‑SQL quiesce mechanisms with ONTAP’s instantaneous snapshot engine. This workflow provides a true application‑consistent recovery point without the overhead of a streaming full backup. Unlike a backup, an app‑consistent snapshot requires no streaming I/O, has negligible performance impact, and initially uses almost no additional storage capacity (copy-on-write).

 

Achieving True Application‑Consistent Snapshots for SQL Server

To reach full application consistency, use the following workflow:

  1. Quiesce SQL Server using T‑SQL
    SQL Server 2022 provides T‑SQL mechanisms (e.g., ALTER DATABASE <db_name> SET SUSPEND_FOR_SNAPSHOT_BACKUP = ON) to flush dirty pages and temporarily freeze writes.
    This ensures all unwritten data is committed to disk.
  2. Take a GCNV Snapshot
    Once the database is quiesced, you trigger a NetApp Volumes snapshot at the storage layer.
    • ONTAP‑based snapshots complete within seconds, regardless of the database size.
    • They are instant copy‑on‑write metadata operations, so the SQL freeze window is extremely short.
      GCNV snapshots are “instant captures of data within a volume” with no impact on performance even during snapshot creation.
  3. Unquiesce SQL Server
    Immediately resume database operations (e.g., ALTER DATABASE <db_name> SET SUSPEND_FOR_SNAPSHOT_BACKUP = OFF).
    Because the freeze period is very small, application impact is minimal.
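
The three steps above can be sketched as a small orchestration function. This is a minimal sketch, not a NetApp or Microsoft tool: the three callables are hypothetical placeholders that you would implement yourself, with suspend_db and resume_db running the T‑SQL statements from steps 1 and 3, and take_snapshot creating the NetApp Volumes snapshot by whatever means you use.

```python
from typing import Callable


def app_consistent_snapshot(
    suspend_db: Callable[[], None],
    take_snapshot: Callable[[], str],
    resume_db: Callable[[], None],
) -> str:
    """Suspend writes, snapshot the volume, and always resume writes."""
    suspend_db()
    try:
        return take_snapshot()
    finally:
        # Runs even if the snapshot call raises, so the database
        # is never left frozen by a failed snapshot attempt.
        resume_db()


# Demo with stand-in callables that only record the call order.
calls = []
name = app_consistent_snapshot(
    suspend_db=lambda: calls.append("suspend"),
    take_snapshot=lambda: calls.append("snapshot") or "snap-demo",
    resume_db=lambda: calls.append("resume"),
)
print(calls)  # ['suspend', 'snapshot', 'resume']
```

Wrapping the snapshot call in try/finally is the important design choice here: the freeze window stays short on success, and a storage-side error cannot leave production writes suspended.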


 

Restoring from App‑Consistent Snapshots

Restoring from an application‑consistent snapshot on GCNV is nearly instantaneous:

  • Snapshots allow volume reversion back to that exact consistent state within minutes.
  • Snapshots also allow creating a clone and attaching it to a SQL host to extract database files, enabling granular restore.

Because the snapshot was taken after SQL Server had flushed all buffers, recovery time inside SQL Server is minimized, with no crash‑recovery roll‑back/roll‑forward operations.

This enables:

  • Clean “point‑in‑time” restores for production
  • Rapid rollbacks during patching or deployment
  • Consistent DR seeds before replication to another region

 

Thin Clones from Application‑Consistent Snapshots

An additional benefit is that any application‑consistent snapshot can be used to create thin clones, giving immediate, writable copies of the database for:

  • Dev/Test
  • Analytics
  • QA/UAT environments
  • Reporting

You can now spin up a full environment with production‑grade data—without duplicating terabytes of storage.

 

Cross‑Region DR for SQL Server: Replicating to a Standalone SQL VM

GCNV supports SnapMirror‑based cross‑region replication, allowing SQL Server databases stored on iSCSI LUNs to be replicated asynchronously to another Google Cloud region. Replication schedules include an every-10-minutes option for low‑RPO DR.
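
Because replication is asynchronous, the schedule interval is only part of the recovery point picture. The sketch below uses a simplified model in which the worst-case data loss window is the interval plus the time one transfer takes to complete. The function name and the 2-minute transfer time are assumptions for illustration; real transfer times depend on change rate and network throughput.

```python
def worst_case_rpo_minutes(interval_minutes: float, transfer_minutes: float) -> float:
    """Simplified worst-case RPO: schedule interval plus one transfer duration."""
    if interval_minutes <= 0 or transfer_minutes < 0:
        raise ValueError("interval must be > 0 and transfer time must be >= 0")
    return interval_minutes + transfer_minutes


# 10-minute schedule with a hypothetical 2-minute transfer.
print(worst_case_rpo_minutes(10, 2))  # 12
```

Keeping the per-interval change set small is what keeps the effective RPO close to the 10-minute schedule.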

How DR Works

  1. Primary region hosts SQL FCI on Regional GCNV iSCSI volumes.
  2. SnapMirror replicates the storage volumes to a secondary region. (Destination volumes are read‑only during replication.)
  3. A standalone SQL Server instance in the DR region mounts the replicated LUNs post‑failover.
  4. For DR testing, the read‑only DR copy can be used safely without affecting the source.

sajith_3-1771580488004.png

 

This architecture provides:

  • Low RPO (replication as often as every 10 minutes)
  • Fast RTO with pre‑provisioned SQL instance
  • No need for AGs or extra SQL licensing
  • Multi‑region failover capability

 

Conclusion

Google Cloud NetApp Volumes Flex Unified, with iSCSI block storage, brings enterprise‑grade SAN capabilities to Google Cloud—critical for organizations running SQL Server workloads that demand predictable performance and consistent shared storage. Running SQL Server FCI on Google Cloud NetApp Volumes iSCSI gives organizations the best of both worlds: the robustness and familiarity of an on‑prem SAN architecture combined with the agility and cost efficiency of the cloud.

 

Ready to get started? Head to the Google Cloud NetApp Volumes console and experience the power of NetApp Volumes block storage today! Contact a specialist for more information.
