NetApp Announces Exciting Enhancements to the BlueXP Digital Wallet
We’re thrilled to share some exciting news with you. We’ve rolled out a series of enhancements to the BlueXP digital wallet.
Google Cloud NetApp Volumes is a fully managed file storage service that reaches customers across all regions in Google Cloud through the Flex service level.
Exploring Trident Protect: Application Mirror Relationship (AMR) for High Availability and Disaster Recovery
In today's fast-paced digital landscape, businesses rely heavily on applications to drive innovation, growth, and customer engagement. However, application downtime or data loss can have devastating consequences, including lost revenue, damaged reputation, and compromised customer trust. To mitigate these risks, organizations need a robust disaster recovery strategy.
In this blog, we'll delve into the concepts of failover/failback and explore how NetApp Trident Protect Application Mirror, built on top of NetApp ONTAP SnapMirror, can help businesses ensure seamless application mobility and disaster recovery.
Benefits of Application Protection with Trident Protect Application Mirror
Minimized Downtime: Failover and Failback capabilities minimize downtime, ensuring that applications remain accessible and operational.
Improved Business Continuity: AppMirror ensures business continuity by providing failover and failback capabilities, reducing the risk of data loss and downtime.
Reduced Costs: The solution reduces costs associated with disaster recovery and application mobility, including infrastructure, personnel, and downtime costs.
Simplified Disaster Recovery: Trident Protect Application Mirror simplifies disaster recovery by providing failover and failback capabilities, reducing complexity and minimizing downtime.
Automated Replication: Snapshots are automatically replicated to a target environment, ensuring data consistency and minimizing downtime.
Reduced RTO and RPO: Reduce your RPO to as low as 5 minutes.
Steps to configure and use Trident Protect AppMirror:
Prerequisites for AMR - Trident Protect setup and configuration for AMR
ONTAP - The source and destination storage backends must be peered, as described in our documentation.
AppVault - AppVault stores the metadata (Kubernetes resources) for the application that is used during failover operations. We recommend creating two separate AppVault configurations for your source and destination sites.
Source Cluster Requirements
Source Cluster AppVault: Ensure that the AppVault (bucket) CR shared between the source and destination clusters has been created.
Source Application CR: A Custom Resource (CR) for your source application.
Source Snapshot CR: A Custom Resource (CR) for your source snapshot.
Source Snapshot Schedule: A schedule for snapshots (CR). A sketch of these source-side CRs follows this list.
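For reference, here's a minimal sketch of what the source-side objects could look like, assuming AppVault CRs named amr-appvault-source and amr-appvault-destination already exist. All names are placeholders, and the protect.trident.netapp.io/v1 API group plus field names such as applicationRef, appVaultRef, and granularity are assumptions to check against the Trident Protect CRD reference.

# Source cluster: Application, Snapshot, and Schedule CRs (illustrative sketch).
kubectl apply -f - <<'EOF'
apiVersion: protect.trident.netapp.io/v1
kind: Application
metadata:
  name: my-app
  namespace: my-app-ns
spec:
  includedNamespaces:
    - namespace: my-app-ns          # namespaces that make up the application
---
apiVersion: protect.trident.netapp.io/v1
kind: Snapshot
metadata:
  name: my-app-initial-snap
  namespace: my-app-ns
spec:
  applicationRef: my-app            # the Application CR above
  appVaultRef: amr-appvault-source  # AppVault (bucket) CR for the source site
---
apiVersion: protect.trident.netapp.io/v1
kind: Schedule
metadata:
  name: my-app-hourly
  namespace: my-app-ns
spec:
  applicationRef: my-app
  appVaultRef: amr-appvault-source
  granularity: Hourly               # how often snapshots are taken (assumed field)
EOF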
Destination Cluster Requirements
Destination Cluster AppVault: Ensure that the AppVault (bucket) CR shared between the source and destination clusters has been created.
AppMirrorRelationship CR: A Custom Resource (CR) defining the application mirror relationship, including a replication schedule; a sketch follows below.
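Here's a hedged sketch of the AppMirrorRelationship CR on the destination cluster. The desiredState, namespace mapping, srcApplicationName, and srcApplicationUID fields mirror what's referenced later in this post; the remaining field names (recurrenceRule, the AppVault references, storageClassName) and all resource names are assumptions, so confirm them against the Trident Protect documentation.

# Destination cluster: AppMirrorRelationship CR (illustrative sketch).
kubectl apply -f - <<'EOF'
apiVersion: protect.trident.netapp.io/v1
kind: AppMirrorRelationship
metadata:
  name: my-app-amr
  namespace: my-app-ns
spec:
  desiredState: Established               # Established = replicating; Promoted = failed over
  recurrenceRule: |-                      # replication schedule, e.g. every 5 minutes
    DTSTART:20250101T000000Z
    RRULE:FREQ=MINUTELY;INTERVAL=5
  namespaceMapping:
    - source: my-app-ns
      destination: my-app-ns
  sourceAppVaultRef: amr-appvault-source
  destinationAppVaultRef: amr-appvault-destination
  srcApplicationName: my-app              # name of the source Application CR
  srcApplicationUID: <source-app-uid>     # .metadata.uid of the source Application CR
  storageClassName: trident-csi           # storage class used for the replicated volumes
EOF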
AMR is established and ready to protect your K8s applications.
Primary Site Outage - Execute Failover to Restore Application Operations
Failover is the process of switching to a standby system or environment when the primary system or environment fails or becomes unavailable. In a failover scenario, the standby system or environment takes over the responsibilities of the primary, ensuring business continuity. Failover must be triggered by the user in response to events such as hardware failures, software crashes, network outages, or natural disasters.
Fail over the AppMirrorRelationship to bring up your application in Region B, as shown in the sketch below.
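Operationally, a failover is a change to the relationship's desired state on the destination cluster. A minimal sketch, assuming the spec.desiredState field and the resource names used in the earlier sketches:

# On the destination cluster (Region B): promote the relationship to fail over.
kubectl -n my-app-ns patch appmirrorrelationship my-app-amr \
  --type merge -p '{"spec":{"desiredState":"Promoted"}}'

# Watch the relationship until it reports the promoted (failed-over) state.
kubectl -n my-app-ns get appmirrorrelationship my-app-amr -w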
Recovery scenarios - Restoring Replication Relationships
From the failed-over state, you can select one of three scenarios based on your needs:
1. Resync - Conduct disaster recovery (DR) testing by disregarding changes on the destination site while resynchronizing.
2. Reverse resync - Swap the roles of the source and destination sites.
3. Failback - Restore the initial replication direction: any application changes are first reverse resynchronized back to the original source application before the replication direction is switched.
Resync a failed over replication relationship
Goal: The original source application becomes the running application, and any changes made to the running application on the destination cluster are discarded.
Create a source snapshot: Establish a new snapshot on the source.
Re-establish AppMirrorRelationship - On the destination cluster, update the AppMirrorRelationship desired state from "Promoted" to "Established".
Remove Schedules on Destination - Delete any schedules that were copied to the destination during the failover process. A combined sketch of these resync steps follows.
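Put together, the resync might look like the following sketch (resource and field names are carried over from the earlier examples and are therefore assumptions):

# Source cluster: take a fresh snapshot of the original source application.
kubectl apply -f - <<'EOF'
apiVersion: protect.trident.netapp.io/v1
kind: Snapshot
metadata:
  name: my-app-resync-snap
  namespace: my-app-ns
spec:
  applicationRef: my-app
  appVaultRef: amr-appvault-source
EOF

# Destination cluster: resume replication, discarding local changes.
kubectl -n my-app-ns patch appmirrorrelationship my-app-amr \
  --type merge -p '{"spec":{"desiredState":"Established"}}'

# Destination cluster: remove schedules that were carried over during failover.
kubectl -n my-app-ns delete schedules.protect.trident.netapp.io my-app-hourly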
Reverse resync a failed over replication relationship
Goal: The destination application becomes the source, and the source becomes the destination. Changes made to the destination application during failover are kept.
Syncing the changes back from Region B to Region A
Delete existing AMR CR: Remove the AppMirrorRelationship CR on Region B.
Capture changes since failover: Create a new base snapshot on Region B.
Create snapshot schedule: Create a new snapshot schedule CR on Region B.
Create new AMR CR: Establish a new AppMirrorRelationship CR on Region A (a sketch follows this checklist):
Ensure that the namespace mapping is accurate.
Ensure that the AppVaults have been swapped if you're using separate source and destination AppVaults (the destination becomes the source, and the source becomes the destination).
Ensure that srcApplicationName matches the name of the Application CR created on the secondary instance.
Ensure that srcApplicationUID matches the .metadata.uid of the Application CR created on the secondary instance.
Wait for AMR establishment: Wait for the AppMirrorRelationship to reach the "Established" state in Region A.
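The reversed relationship created on Region A could look like this sketch (same assumed schema as before, with the AppVault references swapped and the source fields pointing at the Region B application):

# Region A (new destination): AppMirrorRelationship replicating from Region B.
kubectl apply -f - <<'EOF'
apiVersion: protect.trident.netapp.io/v1
kind: AppMirrorRelationship
metadata:
  name: my-app-amr-reverse
  namespace: my-app-ns
spec:
  desiredState: Established
  recurrenceRule: |-
    DTSTART:20250101T000000Z
    RRULE:FREQ=MINUTELY;INTERVAL=5
  namespaceMapping:
    - source: my-app-ns                       # namespace on Region B (now the source)
      destination: my-app-ns                  # namespace on Region A (now the destination)
  sourceAppVaultRef: amr-appvault-destination # AppVaults swapped: the old destination vault is now the source
  destinationAppVaultRef: amr-appvault-source
  srcApplicationName: my-app                  # name of the Application CR on Region B
  srcApplicationUID: <region-b-app-uid>       # .metadata.uid of the Application CR on Region B
  storageClassName: trident-csi
EOF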
Note: If you want to keep the replication direction swapped in this state, you can stop here. Otherwise, continue to the next section to fail the application back to the original source cluster.
Failback applications to the original source cluster
Goal: Revert to the original replication direction and state. Any application changes are first replicated (resynchronized) to the original source application before the replication direction is reversed.
Syncing changes back to the original Region A and bringing the app down on Region B
Reversing the replication direction back from Region B to Region A
Prerequisite: Complete the "Reverse resync a failed over replication relationship" procedure outlined above.
Disable Schedules on Region A - Delete any snapshot schedules in Region A.
ShutdownSnapshot CR: Create a ShutdownSnapshot CR on Region B to take a final snapshot and gracefully shut down your application (see the sketch after these steps).
After the ShutdownSnapshot has completed, get the name of the snapshot from the CR status, as mentioned in our documentation.
Perform a failover using the snapshot base name from the appArchivePath retrieved in the previous step.
Follow Reverse Resync steps from Region A to Region B.
Enable schedules on your original site Region A.
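As a rough sketch, the final snapshot and the lookup of its archive path could be done like this (the ShutdownSnapshot spec fields and the status.appArchivePath path are assumptions based on the steps above; verify them in the Trident Protect documentation):

# Region B: take a final snapshot and gracefully shut the application down.
kubectl apply -f - <<'EOF'
apiVersion: protect.trident.netapp.io/v1
kind: ShutdownSnapshot
metadata:
  name: my-app-final-snap
  namespace: my-app-ns
spec:
  applicationRef: my-app
  appVaultRef: amr-appvault-destination
EOF

# After it completes, read the archive path of the final snapshot from its status;
# the snapshot base name from this path is used for the failover back to Region A.
kubectl -n my-app-ns get shutdownsnapshot my-app-final-snap \
  -o jsonpath='{.status.appArchivePath}'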
Note: This workflow is expected to incur application downtime.
Conclusion
Failover and failback are critical components of a disaster recovery strategy, ensuring business continuity and minimizing downtime. NetApp Trident Protect Application Mirror provides failover and failback capabilities, supporting seamless application mobility and disaster recovery. By leveraging Trident Protect Application Mirror, businesses can ensure minimal downtime, improved business continuity, and reduced costs.
In the world of AI and machine learning (ML), the scalability and efficiency of the backbone infrastructure play a critical role in the success of an AI project. MLOps solutions combined with the back-end infrastructure must have an inherent ability to optimize costs at every step of an AI project so that it is sustainable in the long run.
The combination of Google Cloud NetApp Volumes and Google Kubernetes Engine has proven to be a game changer in enhancing ML workflows. In this blog post, we will cover the steps that are involved in training a model. Along the way, we’ll explore how this integration can improve data handling, performance, resource usage, and model training while it optimizes costs.
And don’t miss the demonstration that showcases all the capabilities that are presented in this blog post!
Solution overview
The following graphic shows the basic components of the solution.
At the foundation of this solution is Google Cloud NetApp Volumes, which serves as a high-performance, scalable file storage solution. The dataset for training an ML model is hosted and served out of this layer.
Moving into the compute layer, there is Google Kubernetes Engine (GKE), with the NetApp® Trident™ Container Storage Interface (CSI) driver to orchestrate and to present the dataset to the compute layer. To add more power to the compute, NVIDIA GPU accelerators are added as worker nodes in GKE, which are essential for speeding up the model training process.
At the top of the stack, the ML framework is deployed in a containerized form, using JupyterLab for code execution and experimentation. This setup gives you the flexibility to consume resources on demand and to fully use them when they’re provisioned.
Workflow
Training a model is a multistep process, and following is the series of operations that you need to carry out. Throughout this flow, we discuss how the capabilities of NetApp Volumes and GKE combined deliver a highly efficient and feature-rich infrastructure back end for MLOps.
Step 1: Data preparation with Google Cloud NetApp Volumes
The first step in any ML workflow is to prepare the data. In this context, the dataset typically resides within a volume in NetApp Volumes and, ideally, with the Extreme service level in place. The Extreme service level of NetApp Volumes delivers up to 30GiBps of throughput and supports volumes as large as 1PiB, which is optimal for data-heavy and performance-demanding AI/ML workloads.
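As a rough, hedged sketch, provisioning the underlying pool and volume with the gcloud CLI could look like the following; the pool, volume, VPC, and region names are placeholders, and the exact flag names, enum values, and capacity units should be verified against the current gcloud netapp reference.

# Create a storage pool at the Extreme service level, then a volume in it.
# Capacities are assumed to be specified in GiB.
gcloud netapp storage-pools create ml-pool \
  --location=us-central1 \
  --service-level=extreme \
  --capacity=102400 \
  --network=name=my-vpc

gcloud netapp volumes create ml-dataset \
  --location=us-central1 \
  --storage-pool=ml-pool \
  --capacity=51200 \
  --protocols=nfsv3 \
  --share-name=ml-dataset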
Step 2: Kubernetes setup with GKE
Next comes a GKE cluster, where the NetApp Trident CSI driver is configured to manage NetApp Volumes. This driver enables GKE to use Google Cloud NetApp Volumes as the storage backend, making it easy to integrate the dataset into the Kubernetes environment.
By using storage classes in Kubernetes, the performance profile for the volume that will host the dataset can be easily mapped. Another advantage of running ML workflows in Google Cloud Platform with GKE is the ability to scale up the infrastructure only when you need to. For general workloads, the standard compute instances are an excellent fit, but for model training, GPU-powered nodes can be used when the workload demands it.
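For example, a StorageClass along these lines maps the performance profile to volumes provisioned through Trident (a sketch: the provisioner name is the standard Trident CSI driver, while the backendType value and any service-level parameters are assumptions tied to your Trident backend configuration):

# StorageClass backed by the Trident CSI driver for NetApp Volumes.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netapp-volumes-extreme
provisioner: csi.trident.netapp.io
parameters:
  backendType: google-cloud-netapp-volumes   # assumed Trident backend driver name
allowVolumeExpansion: true
EOF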
Step 3: Data presentation to Kubernetes
With GKE set up and the dataset stored in Google Cloud NetApp Volumes, the data can now be presented to the model.
By using the Trident volume import feature, the volume in NetApp Volumes that contains the dataset can be presented to the GKE cluster through a PersistentVolumeClaim (PVC) reference:
tridentctl import volume <backend_name> <volume_name> -f pvc.yaml
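The pvc.yaml referenced by that command could be as simple as the following sketch (names and sizes are placeholders; the StorageClass matches the example from the previous step):

# pvc.yaml used by the tridentctl volume import above.
cat > pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-dataset
  namespace: ml
spec:
  accessModes:
    - ReadWriteMany                  # NFS volume can be shared across pods
  storageClassName: netapp-volumes-extreme
  resources:
    requests:
      storage: 10Ti                  # nominal request; the imported volume keeps its actual size
EOF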
After import, by using the PVC, the data can be presented to the ML framework through a deployment spec.
Step 4: Running the ML framework with JupyterLab
After the data has been presented to Kubernetes, the ML framework can be deployed on GKE by using the container image that you choose. The use of images with a Jupyter implementation provides an interactive notebook interface for running and testing ML models.
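A minimal deployment sketch that mounts the imported dataset into a Jupyter container might look like this (the image, namespace, and mount path are placeholders):

# JupyterLab deployment with the dataset PVC mounted into the notebook environment.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterlab
  namespace: ml
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterlab
  template:
    metadata:
      labels:
        app: jupyterlab
    spec:
      containers:
        - name: jupyterlab
          image: jupyter/tensorflow-notebook:latest   # example Jupyter image
          ports:
            - containerPort: 8888
          resources:
            limits:
              nvidia.com/gpu: 1                       # lands on a GPU node once the pool from step 5 exists
          volumeMounts:
            - name: dataset
              mountPath: /home/jovyan/data            # dataset visible inside the notebook
      volumes:
        - name: dataset
          persistentVolumeClaim:
            claimName: ml-dataset
EOF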
Step 5: Scaling with GPU-powered compute
Training ML models, especially with large datasets, can require significant computational power. This need is where the integration with NVIDIA GPUs comes into play. A GPU-powered worker node can be added to the GKE cluster through a node pool, and GKE can also automatically install the requisite drivers for the GPUs and deliver them ready for immediate use.
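For instance, adding and later removing a GPU node pool could look like this sketch (cluster, region, machine, and accelerator types are examples; the gpu-driver-version setting for automatic driver installation is an assumption to verify against the GKE documentation):

# Add a GPU-backed node pool to the GKE cluster on demand.
gcloud container node-pools create gpu-pool \
  --cluster=ml-cluster \
  --region=us-central1 \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
  --num-nodes=1

# Release the node pool after training so that you stop paying for the GPUs.
gcloud container node-pools delete gpu-pool \
  --cluster=ml-cluster \
  --region=us-central1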
Step 6: Model training and fine-tuning
This step involves loading the dataset into the ML framework and performing the training.
ML is an iterative process, and the first round of training may not always yield the best results. As the model evolves, its state (weights, biases, and other parameters) can be saved to another volume in NetApp Volumes that is designated for storing model artifacts. This approach helps in maintaining separate and efficient storage layers that are based on your storage needs—training or artifacts.
Step 7: Versioning and cost-optimization
The state of both the dataset and the model after training can be captured by using NetApp Snapshot™ technology. This step creates point-in-time Snapshot copies of the volumes that host the dataset and the model, both of which are vital for versioning and reproducibility in your ML projects. These Snapshots help you build a cost-optimized data lineage for your AI projects.
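Using the standard CSI snapshot API with Trident, those point-in-time copies can be created declaratively; in this sketch, the snapshot class and PVC names are placeholders:

# VolumeSnapshotClass for the Trident CSI driver, plus a snapshot of the dataset PVC.
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: trident-snapshotclass
driver: csi.trident.netapp.io
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ml-dataset-v1
  namespace: ml
spec:
  volumeSnapshotClassName: trident-snapshotclass
  source:
    persistentVolumeClaimName: ml-dataset
EOF

# Repeat with persistentVolumeClaimName pointing at the model-artifacts PVC to version the model as well.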
Now see the steps in action
Those steps make up one complete round of model training. At this point, the GPU-powered worker node can be released because it is no longer needed. The focus can then switch back to model optimization and dataset hydration for the next round of training.
For a full-on view of all these activities with an example use case for model training, watch this demonstration.
Optimize your ML workflows today
Through Google Kubernetes Engine and Google Cloud NetApp Volumes, you gain significant benefits for running your ML workflows. With on-demand access to high-performance compute resources like GPUs combined with scalable, efficient, and high-performance storage, you can be confident that your ML workflows are fast, reusable, and highly cost-effective.
Get started with Google Cloud NetApp Volumes for MLOps.
Happy training!
Google Cloud NetApp Volumes introduces cross-region backup to enhance data protection. This feature allows users to back up their primary volumes to a different region, providing disaster recovery in case of regional failure and helping meet compliance requirements.