Tech ONTAP Blogs

Google Cloud NetApp Volumes: Volume replication and Terraform

okrause
NetApp

Volume replication is an easy-to-use, cross-region replication feature of Google Cloud NetApp Volumes. Although it leverages powerful NetApp® SnapMirror® technology, its operational model has subtle differences that make it more user-friendly and less prone to administrative errors.

 

This article dives into the differences and discusses the implications for Terraform-based management.

 

SnapMirror in ONTAP

 

If you have used SnapMirror in NetApp ONTAP® before, you know that it is a powerful, robust, and efficient replication technology. It’s used to solve all kinds of data replication problems, like building disaster recovery concepts, distributing data globally, or migrating data from one ONTAP system to another without having to worry about files, permissions, file locks, and more. Everyone who knows it loves it.

 

But one aspect can be a bit annoying. SnapMirror takes care of everything within the volume, but it doesn’t manage the settings of the volume itself. Simple tasks like resizing the source volume or changing volume settings require an administrator to manually make the same changes on the destination volume. If the changes are not made thoroughly, the settings of the source and destination volumes diverge and can cause problems in operation, or at the moment you switch your workload over to the destination after a disaster takes out the source. Really, that’s the worst time to discover a configuration drift.

 

Volume replication on NetApp Volumes

 

When building NetApp Volumes, we wondered how we could simplify an operator's life and reduce configuration drift. We came up with an approach that replicates the data of a volume, and also “replicates” the settings of a source volume to the destination. Here’s how it works.

 

Volumes joined by volume replication form a relationship. The relationship can be in one of two modes:

 

  1. While the mirror state is MIRRORED or TRANSFERRING, the relationship is active. Updates from the source volume are shipped to the destination on the defined replication schedule. While updates are shipped, the mirror state is TRANSFERRING. When a transfer is finished and the replication waits for the next scheduled transfer, the mirror state is MIRRORED.
    The content of the source volume is accessible read-write; the content of the destination volume can only be accessed read-only and is an asynchronous copy of the source. Additionally, volume settings are kept in sync. Any setting change done to the source or destination volume is also done to the other volume. This synchronization eliminates configuration drift.
  2. When the mirror state is STOPPED, the relationship is inactive. Both volumes are read-writable and volume content can be changed independently. The settings of both volumes can also be changed independently. Doing a RESUME or REVERSE AND RESUME action on source or destination makes the relationship active again.
    The destination volume (note that REVERSE AND RESUME swaps source and destination roles) becomes a mirror of the source again. This means that the content and settings of the destination volume are overwritten with those of the source volume.
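You can check which mode a relationship is currently in by querying its mirror state outside of Terraform. A sketch using the gcloud CLI follows; the replication, volume, and location names are placeholders:

```shell
# Inspect the mirror state of a replication (names are placeholders).
# Returns TRANSFERRING, MIRRORED, or STOPPED.
gcloud netapp volumes replications describe my-replication \
  --volume=source-vol --location=us-east4 \
  --format="value(mirrorState)"
```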

 

This simple but powerful approach eliminates configuration drift. We went even further: In ONTAP, you must create a destination volume manually before setting up a replication. In NetApp Volumes, we wrapped the creation of the destination volume into the replication setup process. All settings for the destination volume are inherited from the source. Just specify a destination storage pool, replication details, destination share and volume name, and NetApp Volumes takes care of all the other volume settings for you. This approach simplifies creating a replication considerably.

 

Volume replication and Terraform

 

NetApp Volumes simplifies volume replication lifecycle management, but it is still a powerful and complex feature. When building the netapp_volume_replication resource for the google Terraform provider, we had to add some additional controls. In addition to the obvious input parameters like name, volume_name, location, replication_schedule, description, and labels, the resource includes a few other input parameters that are worth discussing.

 

replication_enabled

This parameter controls the mode of the relationship.

 If it is set to true, the desired state of the relationship is active. If the relationship is inactive, a RESUME operation is triggered. Note that a RESUME operation overwrites all changes made to the destination volume with source volume information. Be sure that this is your intention before enabling the replication.

If it is set to false, the desired state of the relationship is inactive. If the relationship is active, a STOP operation is triggered.

 

wait_for_mirror

When set to true, the provider waits for ongoing transfers to finish before stopping a replication. This is desirable, but it can take a long time for large transfers.

When set to false, the provider does not wait for transfers to finish.

 

force_stopping

An active relationship can have one of two mirror_states: a mirror is either TRANSFERRING an update or waiting for the next scheduled transfer to start (mirror_state == MIRRORED).

Ongoing transfers cannot be stopped except by using a force stop.

Set this parameter to true if you can’t wait for a long-running replication transfer to finish. The default is false.

 

delete_destination_volume

Setting this parameter to true deletes the destination volume automatically if a replication relationship is deleted/destroyed. Stopping or resuming a mirror doesn’t delete the relationship. Take care: It’s great for testing but using it in production might lead to unintended loss of the destination volume.

 

destination_volume_parameters

This parameter block is used to specify the destination storage_pool, the name of the destination volume (volume_id), the share_name, and an optional description. This block is used only while creating the resource. It is ignored for all other operations. This fact has multiple implications:

 

  • Don’t try to use it to update any of these parameters. Attempting to update them either doesn’t yield the desired result (for example, changing description), or it triggers a re-creation of the replication resource, which re-creates the destination volume and starts a new baseline transfer.
  • Because the API won’t return the content of this block, a replication resource imported into Terraform does not contain it. Terraform happily manages the replication without this block.
  • The destination volume is created as part of the replication creation workflow. The destination volume is not managed by Terraform.
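Putting these parameters together, a replication resource might look like the following sketch. The resource name in the google provider is google_netapp_volume_replication; the volume, pool, and region references are illustrative and assume the source volume and destination storage pool are managed elsewhere in your configuration:

```hcl
resource "google_netapp_volume_replication" "dr" {
  name                 = "source-to-dr"
  location             = "us-east4"          # region of the source volume
  volume_name          = google_netapp_volume.source.name
  replication_schedule = "EVERY_10_MINUTES"  # or HOURLY, DAILY
  description          = "DR copy of the source volume"

  # Used only at create time; ignored for all later operations.
  destination_volume_parameters {
    storage_pool = google_netapp_storage_pool.dr_pool.id
    volume_id    = "source-dr"
    share_name   = "source-dr"
  }

  replication_enabled       = true
  wait_for_mirror           = true
  force_stopping            = false
  delete_destination_volume = false
}
```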

 

Best practices

 

Recommended settings for replication parameters

 

For normal operation without time pressure, NetApp recommends letting ongoing transfers finish before stopping a replication. This is done by setting the parameters to:

 

 

force_stopping            = false
wait_for_mirror           = true
delete_destination_volume = false

 

 

With this setting, the provider waits for an ongoing transfer to finish before stopping the replication when you set replication_enabled = false.

 

When your priority is to bring the destination volume into production as fast as possible, change the parameters to:

 

 

force_stopping            = true
wait_for_mirror           = false
delete_destination_volume = false

 

 

This setting stops the replication quickly and makes the destination volume read-write. Any ongoing transfer is aborted, and your destination has the content of the latest successful transfer.

 

How to handle the destination volume

 

A common question is how to handle the destination volume, which gets created automatically by the replication. Should you import it into Terraform to manage it?

 

The answer depends on whether the replication is active. In an active replication, any change done to one volume is done to both, which confuses Terraform’s state tracking. It’s better not to put the destination volume under Terraform management while the replication is active.

 

When the replication is inactive, the destination volume becomes independent and you can manage it using Terraform by importing it. The drawback is that if you enable the replication again, you may need to drop the destination volume from your HCL code and the Terraform state manually.
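For example, once the replication is stopped, the destination volume can be imported using its full resource ID. The project, location, volume name, and resource address below are placeholders:

```shell
# Import the now-independent destination volume into Terraform state
# (identifiers are placeholders; match them to your environment).
terraform import google_netapp_volume.dr_volume \
  projects/my-project/locations/us-east4/volumes/source-dr
```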

 

How to handle REVERSE AND RESUME

 

Reverse and resume allows you to swap the source and destination volume roles for an existing replication relationship and activates the replication. All data and settings of the former source volume (now the new destination) are overwritten by the new source. Make sure that this is what you intend before triggering it.

 

The provider doesn’t support this operation. It needs to be triggered manually by using Cloud Console, gcloud, or the API. In addition, running this operation “confuses” the existing Terraform state. After running a reverse and resume, NetApp recommends manually dropping Terraform HCL code and state for the replication and the former source volume and reimporting the replication and the new source volume.
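The state cleanup after a reverse and resume can be sketched as follows. The resource addresses and import IDs are placeholders, and this assumes the replication import ID follows the provider’s projects/…/locations/…/volumes/…/replications/… format; remember to also remove the corresponding blocks from your HCL code:

```shell
# Drop the stale replication and former source volume from state
# (addresses are placeholders).
terraform state rm google_netapp_volume_replication.dr
terraform state rm google_netapp_volume.source

# Re-import the replication and the new source volume under their new roles.
terraform import google_netapp_volume_replication.dr \
  projects/my-project/locations/us-east4/volumes/source-dr/replications/source-to-dr
terraform import google_netapp_volume.new_source \
  projects/my-project/locations/us-east4/volumes/source-dr
```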

 

If you reverse and resume twice to restore the initial replication direction, you can leave the Terraform code and state untouched. State problems resolve themselves after the second reverse and resume.

 

Happy terraforming

 

Volume replication is a powerful feature that is easy to use. The google Terraform provider allows you to manage all NetApp Volumes resources, including volume replication. Day 1 operations like setting up a replication are very simple. Day 2 operations like changing the properties of the replication are also easy. Day X operations like stopping, resyncing, and reversing replications can cause data loss if not done carefully. Before applying your Terraform execution plans, make sure that they contain the results that you expect.

 

This blog evolved into a series. Here are links to all the blogs:

 
