Trident Protect: Power Up Kubernetes Replication for Protection & Disaster Recovery

Rahul-Rana

Exploring Trident Protect: Application Mirror Relationship (AMR) for High Availability and Disaster Recovery

In today's fast-paced digital landscape, businesses rely heavily on applications to drive innovation, growth, and customer engagement. However, application downtime or data loss can have devastating consequences, including lost revenue, damaged reputation, and compromised customer trust. To mitigate these risks, organizations need a robust disaster recovery strategy.

In this blog, we'll explain failover and failback and explore how NetApp Trident Protect Application Mirror, built on top of NetApp ONTAP SnapMirror, can help businesses ensure seamless application mobility and disaster recovery.

Benefits of Application Protection with Trident Protect Application Mirror

Minimized Downtime: Failover and Failback capabilities minimize downtime, ensuring that applications remain accessible and operational.
Improved Business Continuity: AppMirror ensures business continuity by providing failover and failback capabilities, reducing the risk of data loss and downtime.
Reduced Costs: The solution reduces costs associated with disaster recovery and application mobility, including infrastructure, personnel, and downtime costs.
Simplified Disaster Recovery: Trident Protect Application Mirror simplifies disaster recovery by providing failover and failback capabilities, reducing complexity and minimizing downtime.
Automated Replication: Snapshots are automatically replicated to a target environment, ensuring data consistency and minimizing downtime.
Reduced RTO and RPO: Reduce your RPO to as low as 5 minutes.
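
To make the RPO figure concrete, here is a minimal sketch of how a worst-case data-loss window can be estimated for scheduled, asynchronous replication. It is a simplified model, not a Trident Protect calculation: the function name and the 45-second transfer time are illustrative assumptions, and only the 5-minute interval comes from the list above.

```python
from datetime import timedelta


def worst_case_rpo(snapshot_interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Estimate the upper bound on data loss for scheduled replication.

    A write that lands just after a snapshot is taken is not captured until
    the next snapshot, one interval later, and is not safe on the destination
    until that snapshot has finished transferring.
    """
    if snapshot_interval <= timedelta(0):
        raise ValueError("snapshot_interval must be positive")
    if transfer_time < timedelta(0):
        raise ValueError("transfer_time cannot be negative")
    return snapshot_interval + transfer_time


# Illustrative numbers: a 5-minute schedule and an assumed 45-second transfer.
rpo = worst_case_rpo(timedelta(minutes=5), timedelta(seconds=45))
print(rpo)  # 0:05:45
```

In this model the schedule interval sets the floor for the RPO, and the time each transfer takes adds to it.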

Steps to configure and use Trident Protect AppMirror:

Prerequisites for AMR - Trident Protect setup configurations for AMR
Source Cluster Requirements
Destination Cluster Requirements
Primary Site Outage - Execute Failover to Restore Application Operations
Recovery scenarios - Restoring Replication Relationships
Resync a failed over replication relationship
Reverse resync a failed over replication relationship
Failback applications to the original source cluster

Prerequisites for AMR - Trident Protect setup configurations for AMR

ONTAP - The source and destination storage backends must be peered, as described in our documentation.

AppVault - An AppVault stores the application's metadata (Kubernetes resources), which is used during failover operations. We recommend creating two separate AppVault configurations for your source and destination sites.

Source Cluster Requirements

Source Cluster AppVault: Ensure that the AppVault (bucket) CR common to the source and destination has been created.
Source Application CR: A Custom Resource (CR) for your source application.
Source Snapshot CR: A Custom Resource (CR) for your source snapshot.
Source Snapshot Schedule: A schedule for snapshots (CR).

Destination Cluster Requirements

Destination Cluster AppVault: Ensure that the AppVault (bucket) CR common to the source and destination has been created.
AppMirrorRelationship CR: A Custom Resource (CR) defining the application mirror relationship, including a replication schedule.

With these resources in place, the AMR is established and ready to protect your Kubernetes applications.

Primary Site Outage - Execute Failover to Restore Application Operations

Failover is the process of switching to a standby system or environment when the primary one fails or becomes unavailable. The standby takes over the primary's responsibilities, ensuring business continuity. Failover is not automatic: the user must trigger it in response to events such as hardware failures, software crashes, network outages, or natural disasters.

Fail over the AppMirrorRelationship to bring up your application in Region B, the destination site.
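
As a rough sketch of what triggering a failover could look like, the snippet below builds a JSON merge patch that sets the relationship's desired state to "Promoted" and assembles a kubectl patch command around it. The state names "Established" and "Promoted" come from this post; the field path spec.desiredState, the resource name appmirrorrelationship, and the object and namespace names are assumptions to verify against your installed CRD.

```python
import json
import shlex
from typing import List

# Desired-state values named in this post.
ALLOWED_STATES = {"Established", "Promoted"}


def desired_state_patch(state: str) -> str:
    """Return a JSON merge patch that sets the desired state.

    Assumption: the field lives at spec.desiredState.
    """
    if state not in ALLOWED_STATES:
        raise ValueError(f"unknown desired state: {state!r}")
    return json.dumps({"spec": {"desiredState": state}})


def kubectl_patch_command(name: str, namespace: str, state: str) -> List[str]:
    """Build the argument list for kubectl; nothing is executed here."""
    return [
        "kubectl", "patch",
        "appmirrorrelationship",  # assumed resource name
        name,
        "-n", namespace,
        "--type", "merge",
        "-p", desired_state_patch(state),
    ]


# Hypothetical object and namespace names, for illustration only.
cmd = kubectl_patch_command("my-amr", "my-app-ns", "Promoted")
print(shlex.join(cmd))
```

Building the command as a list keeps the JSON payload intact when it is later passed to subprocess.run or printed for review.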

Recovery scenarios - Restoring Replication Relationships

From the failed-over state, choose one of three scenarios based on your needs:

1. Resync - Discard any changes made on the destination site and resynchronize from the source, for example after disaster recovery (DR) testing.

2. Reverse resync - Swap the roles of the source and destination sites.

3. Failback - Restore the initial replication direction. Any application changes are first reverse resynchronized back to the original source application before the replication direction is switched.
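
The three scenarios differ in two respects: whether changes made on the destination are kept, and which way replication runs afterwards. The following sketch encodes that as a small lookup. The class and function names are illustrative and are not part of Trident Protect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    keeps_destination_changes: bool
    final_direction: str  # "original" or "reversed"


# Outcomes as described in the three scenarios above.
SCENARIOS = {
    "resync": Outcome(keeps_destination_changes=False, final_direction="original"),
    "reverse resync": Outcome(keeps_destination_changes=True, final_direction="reversed"),
    "failback": Outcome(keeps_destination_changes=True, final_direction="original"),
}


def choose_scenario(keep_destination_changes: bool, restore_original_direction: bool) -> str:
    """Pick the scenario whose outcome matches the two requirements."""
    direction = "original" if restore_original_direction else "reversed"
    wanted = Outcome(keep_destination_changes, direction)
    for name, outcome in SCENARIOS.items():
        if outcome == wanted:
            return name
    raise ValueError("no scenario discards destination changes and reverses direction")


print(choose_scenario(keep_destination_changes=True, restore_original_direction=True))  # failback
```

The one combination without a match, discarding destination changes while also reversing direction, is not one of the scenarios described here.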

Resync a failed over replication relationship

Goal: The original source application becomes the running application, and any changes made to the running application on the destination cluster are discarded.

Create a source snapshot: Establish a new snapshot on the source.
Re-establish AppMirrorRelationship: On the destination cluster, update the AppMirrorRelationship desired state from "Promoted" to "Established".
Remove schedules on destination: Delete any schedules that were copied to the destination volume during the failover process.

Reverse resync a failed over replication relationship

Goal: The destination application becomes the source application, and the source becomes the destination. Changes made to the destination application while failed over are kept.

Syncing the changes back from Region B to Region A

Delete existing AMR CR: Remove the AppMirrorRelationship CR on Region B.
Capture changes since failover: Create a new base snapshot on Region B.
Create snapshot schedule: Create a new snapshot schedule CR on Region B.
Create new AMR CR: Establish a new AppMirrorRelationship CR on Region A.

Ensure namespace mapping is accurate
Ensure the AppVaults have been swapped if you use separate source and destination AppVaults (the destination becomes the source and the source becomes the destination)
Ensure srcApplicationName matches the name of the Application CR created on the secondary instance
Ensure srcApplicationUID matches the .metadata.uid from the Application CR created on the secondary instance

Wait for AMR establishment: Wait for the AppMirrorRelationship to reach the "Established" state in Region A.
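
The checks on the new AppMirrorRelationship can be scripted before the CR is applied. The sketch below is one way to do that under stated assumptions: ApplicationCR and MirrorSpec are hypothetical in-memory views rather than the CRD schema, and "accurate" namespace mapping is interpreted as the original mapping with each pair reversed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ApplicationCR:
    """The Application CR fields that the checklist refers to."""
    name: str
    uid: str  # .metadata.uid


@dataclass
class MirrorSpec:
    """Hypothetical view of an AppMirrorRelationship, not the CRD schema."""
    src_application_name: str
    src_application_uid: str
    source_app_vault: str
    destination_app_vault: str
    namespace_mapping: Dict[str, str]  # source namespace -> destination namespace


def check_reversed_mirror(old: MirrorSpec, new: MirrorSpec,
                          new_source_app: ApplicationCR) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems: List[str] = []
    if new.src_application_name != new_source_app.name:
        problems.append("srcApplicationName does not match the Application CR name")
    if new.src_application_uid != new_source_app.uid:
        problems.append("srcApplicationUID does not match the Application CR .metadata.uid")
    # The swap only applies when separate source and destination AppVaults are used.
    if old.source_app_vault != old.destination_app_vault:
        swapped = (old.destination_app_vault, old.source_app_vault)
        if (new.source_app_vault, new.destination_app_vault) != swapped:
            problems.append("AppVaults have not been swapped")
    # Assumption: the accurate mapping is the original one with each pair reversed.
    reversed_mapping = {dst: src for src, dst in old.namespace_mapping.items()}
    if new.namespace_mapping != reversed_mapping:
        problems.append("namespace mapping is not the reverse of the original")
    return problems


# Hypothetical names and UIDs, for illustration only.
old = MirrorSpec("shop", "uid-a", "vault-a", "vault-b", {"shop": "shop-dr"})
new = MirrorSpec("shop", "uid-b", "vault-b", "vault-a", {"shop-dr": "shop"})
print(check_reversed_mirror(old, new, ApplicationCR("shop", "uid-b")))  # []
```

Returning every problem at once, rather than stopping at the first, lets the whole checklist be fixed in a single pass.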

Note: If you want to keep this state, with the replication direction swapped, you can stop here. Alternatively, continue to the next section to fail the application back to the original source cluster.

Failback applications to the original source cluster

Goal: Revert to the original replication direction and state. Any application changes are first replicated (resynchronized) to the original source application before the replication direction is reversed.

Syncing changes back to original Region A and bringing the App down on Region B

Reversing the replication direction back from Region B to Region A

Prerequisite: Complete the steps in "Reverse resync a failed over replication relationship" above.

Disable Schedules on Region A - Delete any snapshot schedules in Region A.
ShutdownSnapshot CR: Create a ShutdownSnapshot CR on Region B to take a final snapshot and gracefully shut down your application.
After the ShutdownSnapshot has completed, get the name of the snapshot from the CR status as mentioned in our documentation.
Perform a failover using the snapshot basename from the app archive path retrieved in the previous step.
Follow Reverse Resync steps from Region A to Region B.
Enable schedules on your original site Region A.
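
The "snapshot basename" used in the failover step is the final component of the app archive path reported in the ShutdownSnapshot status. A minimal sketch of extracting it follows. The example path is hypothetical and only illustrates the shape of such a value; read the real one from the CR status in your cluster.

```python
import posixpath


def snapshot_basename(app_archive_path: str) -> str:
    """Return the last path component of an app archive path."""
    base = posixpath.basename(app_archive_path.rstrip("/"))
    if not base:
        raise ValueError("app archive path has no final path component")
    return base


# Hypothetical path, for illustration only.
example = "my-app_1234/snapshots/20250101120000_shutdown-snap_abcd"
print(snapshot_basename(example))  # 20250101120000_shutdown-snap_abcd
```

Using posixpath rather than os.path keeps the behavior the same on every operating system, since object-store paths always use forward slashes.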

Note: This workflow is expected to incur application downtime.

Conclusion
Failover and failback are critical components of a disaster recovery strategy, ensuring business continuity and minimizing downtime. NetApp Trident Protect Application Mirror provides both capabilities, supporting seamless application mobility and disaster recovery. By using Trident Protect Application Mirror, businesses can minimize downtime, improve business continuity, and reduce costs.