
SnapMirror active sync and the active-active data center

steiner
NetApp


 

I've been beating up on SnapMirror active sync in the lab for about a year now, and I have to say this is the coolest feature I’ve seen added to ONTAP in years. In particular, I'm talking about SnapMirror active sync running in symmetric active-active mode. The value isn’t in the feature itself, it’s in the sorts of solutions it allows you to create. The foundation of SnapMirror active sync is SnapMirror synchronous, which has been around much longer and is well-proven, but I only started using the active-active capabilities a year ago.

 

There are a number of ways to configure SnapMirror active sync, but the one I want to focus on here is an active-active application environment. You can build clustered environments including Oracle RAC, VMware, and Windows Server Failover Clusters (WSFC)  where the application cluster is distributed across two sites and IO response times are symmetric. You get RPO=0 through synchronous cross-site mirroring and RTO=0 through built-in ONTAP automation and the fact that storage resources are available on both sites, all the time.

 

This post should be useful for anyone looking at synchronous replication solutions, but I’m going to use Oracle RAC to illustrate SnapMirror active sync functionality because you can examine the RTO=0 and RPO=0 characteristics of the overall solution from the application layer all the way to the storage layer. The same database is operational at both sites, all the time. A disaster requires breaking a mirror, but you don’t really fail anything over because the services were running at the surviving site all along.

 

The examples used might be Oracle RAC, but the principles are universal. This includes concepts like "preferred site", heartbeats, tiebreaking, and timeouts. Even if you know next to nothing about Oracle RAC, read on.

 

Before diving into the details of how SnapMirror active sync (SM-as) works, I’d like to explain what SnapMirror active sync (SM-as) is for.

 

RPO=0

 

If you work in IT, you’ve definitely heard about the Recovery Point Objective. How much data can you afford to lose if certain events happen? If you're looking at synchronous replication, that means RPO=0, but that's more nuanced than it seems. When I see a requirement for RPO=0, I usually divide it into two categories:

 

  • RPO for common problems
  • RPO for disaster scenarios

 

RPO=0 for common problems is fairly easy to accomplish. For example, if you have a database you should expect to need to recover it occasionally. Maybe a patch went wrong and caused corruption or someone deleted the wrong file. All you usually need is RAID-protected storage and a basic backup plan and you can restore the database, replay logs, and fix the problem with no loss of data. You should expect to have RPO=0 recoverability for these types of common problems, and if you choose ONTAP we can make the procedures easier, faster, and less expensive.

 

If you want RPO=0 for disaster scenarios, things get more complicated. As an example, what if you have a malicious insider who decided to destroy your data and its backups? That requires a very different approach using the right technology with the right solution. We have some cool things to offer in that space.

 

SnapMirror active sync is about RPO=0 for disaster scenarios that lead to site loss. It might be permanent site loss because of a fire or maybe it's a temporary loss because of a power outage. Perhaps it's more isolated, like a power surge that destroys a storage array. SM-as addresses these scenarios with synchronous mirroring. It's not just RPO=0 replication either, it's a self-healing, integrated, intelligent RPO=0 replication solution.

 

RTO=0

 

A requirement for RPO=0 often comes with a requirement for Recovery Time Objective (RTO) of zero as well. The RTO is the amount of time you can be without a service, and overall it's an easily understood SLA, with one exception, which is RTO=0. What does that mean?

 

The term "RTO=0" is used across the whole IT industry, but I continue to insist there’s no such thing as RTO=0. That would require instantaneously recognizing a problem exists and then instantly correcting it to restore service. That isn’t possible. From a storage perspective, there’s no way to know whether a 10 millisecond wait for a LUN to respond is because the storage system is merely busy with other IO or the data center itself fell into a black hole and no longer exists.

 

You can’t have RTO=0, but you can have a near-zero RTO from the perspective of IT operations. That exact definition of “near-zero” will depend on business needs. If an RTO is low enough that operations aren’t significantly affected, that’s what the industry calls RTO=0. It means “nondisruptive”.

 

Not all of the RTO is determined by storage availability. For example, a typical ONTAP controller failover completes in around 2-4 seconds, and during this time there will be a pause in SAN IO. Host operating systems, however, often have much higher timeouts. ONTAP can fail over in 2 seconds and be fully ready to serve data again, but sometimes the host will wait 15 seconds before retrying an IO, and you can’t change that. OS vendors have optimized SAN behavior over the years, but it's not perfect. Those rare but real and occasionally lengthy IO timeouts are probably happening from time to time in your SAN already. Sometimes an FC frame goes missing because a cable is just slightly damaged and there's a brief performance blip. It shouldn’t cause real-world problems, though. Nothing should crash.

 

Since SM-as failover operations should be nondisruptive to ongoing IT operations, including disasters, I’ll call SM-as an RTO=0 storage solution since that's the way the RTO term is used in the industry. 

 

You still have to think about the application layer, though. For example, you could distribute a VMware cluster across sites and replicate the underlying storage with SM-as, but failure of a given site would require VMware HA to detect the loss of a given VM and restart it on the surviving site. That takes some time. Does that requirement meet your RTO? If you can’t start a VM quickly enough, you can consider containerization. One of its benefits is a container can be activated nearly instantaneously. There’s no OS boot sequence. That might be an option to improve the big-picture RTO. Finally, if you’re using a truly active-active application like Oracle RAC, you can proactively have the same database running at both sites all the time. The right solution depends on the RTO and the behavior of the entire solution during failover events.

 

From a storage perspective, achieving that near-zero RTO is a lot more complicated than you’d think. How does one site know the other site is alive? What should happen if a write cannot be replicated? How does a storage system know the difference between a replication partner that is permanently offline versus a partner that is only temporarily unreachable? What should happen to LUNs that are being accessed by a host but are no longer guaranteed to be able to serve the most recent copy of data? What happens when replication is restored?

 

On to the diagrams…

 

SnapMirror active sync architecture

 

Let’s start with the basics of how SM-as replication works.

 

basic.png


Here’s what this shows:

 

  • These are two different storage clusters. One of them is Cluster1, and the other is Cluster2. (I'll explain the jfs_as1 and jfs_as2 labels a little later…)
  • It might look like there are six LUNs, but logically there are only three.
  • There are six LUN images, but LUN1 on jfs_as1 is functionally the same LUN as LUN1 on jfs_as2, the LUN2s are the same, and the LUN3s are the same.
  • From a host point of view, this looks like ordinary SAN multipathing. LUN1 and its data is available and accessible on either cluster.
  • IO behavior and IO performance are symmetric.
    • If you perform a read of LUN1 from the cluster on site A, the read will be serviced by the local copy of the data on site A
    • If you perform a read of LUN1 from the cluster on site B, the read will be serviced by the local copy of the data on site B
    • Performing a write of LUN1 will require replication of the write to the opposite site before the write is acknowledged.

 

Now let's look at failover. There’s another component involved – the mediator.

 

 

mediator.png

 

The mediator is required for safely automating failover. Ideally, it would be placed on an independent 3rd site, but it can still function for most needs if it’s placed on site A or site B. The mediator is not really a tiebreaker, although that’s effectively the function it provides. It's not taking any actions; it’s providing an alternate communication channel to the storage systems.

 

The #1 challenge with automated failover is the split-brain problem, and that problem arises if your two sites lose connectivity with each other. What should happen? You don’t want to have two different sites activate themselves as the surviving copy of the data, but how can a single site tell the difference between actual loss of the opposite site and an inability to communicate with the opposite site?

 

This is where the mediator enters the picture. If placed on a 3rd site, and each site has a separate network connection to that site, then you have an additional path for each site to validate the health of the other. Look at the picture above again and consider the following scenarios.

 

  • What happens if the mediator fails or is unreachable from one or both sites?
    • The two clusters can still communicate with each other over the same link used for replication services.
    • Data is still served with RPO=0 protection
  • What happens if Site A fails?
    • Site B will see both of the communication channels to the partner storage system go down.
    • Site B will take over data services, but without RPO=0 mirroring
  • What happens if Site B fails?
    • Site A will see both of the communication channels to the partner storage system go down.
    • Site A will take over data services, but without RPO=0 mirroring

 

There is one other scenario to consider: Loss of the data replication link. If the replication link between sites is lost, RPO=0 mirroring will obviously be impossible. What should happen then?

 

This is controlled by the preferred site status. In an SM-as relationship, one of the sites is secondary to the other. This makes no difference in normal operations, and all data access is symmetric, but if replication is interrupted then the tie will have to be broken to resume operations. The result is that the preferred site will continue operations without mirroring and the secondary site will halt IO processing until replication is reestablished. More on this topic below…

 

SAN design with SnapMirror active sync

 

As explained above, SM-as functionally provides the same LUNs on two different clusters. That doesn’t mean you necessarily want your hosts to access their LUNs across all available paths. You might, or you might not. There are two basic options called uniform and non-uniform access.

 

Uniform access

 

Let’s say you had a single host, and you wanted RPO=0, RTO=0 data protection. You’d probably want to configure SM-as with uniform access, which means the LUNs would be visible to your host from both clusters.

 

onehost.png

 

Obviously, this will mean that complete loss of site A would also result in the loss of your host and its applications, but you would have extremely high availability storage services for your data. This is probably a less common use case for SM-as, but it does happen. Not all applications can be clustered. Sometimes all you can do is provide ultra-available storage and accept that you’ll need to find alternate servers on a remote site in the event of disaster.

 

Another option is uniform access in a full mesh configuration, where every host is connected to both clusters.

 

fullmesh.png

 

 

 

If you use SM-as with uniform access in a cluster as shown above, any host should always have access to a usable copy of the data. Even if one of the storage systems failed or storage replication connectivity was lost, the hosts would continue working. From a host multipath point of view, it all looks like a single storage system. A failure of any one component would result in some of the SAN paths disappearing off the network, but otherwise everything would continue working as usual.

 

Uniform access with local proximity

 

Another feature of SM-as is the option to tell the storage systems where the hosts are located. When you map the LUNs to a given host, you can indicate whether or not they are local to the storage system.

 

fullmesh.paths.png

 

 

In normal operation, all IO is local IO. Reads and writes are serviced from the local storage array. Write IO will, of course, need to be replicated by the local controller to the remote system before being acknowledged, but all read IO will be serviced locally and will not incur extra latency by traversing the SAN.

 

The only time the nonoptimized paths will be used is when all active/optimized paths are lost. For example, if the entire array on site A lost power, the hosts at site A would still be operational, although they would be experiencing higher read latency.
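If you want to verify how this looks from a host, the standard Linux multipath tooling shows which paths are currently reported as active/optimized and which are non-optimized. The commands below are just a sketch with the output omitted; sanlun is part of the NetApp Host Utilities kit and may not be installed in your environment, and the device name is one of the multipath aliases used later in this post.

[root@jfs12 ~]# multipath -ll oradata0
[root@jfs12 ~]# sanlun lun show -p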

 

Note: There are redundant paths through the local cluster that are not shown on these diagrams for the sake of simplicity. ONTAP storage systems are HA themselves, so a controller failure should not result in site failure. It should merely result in a change in which local paths are used by an affected host.

 

Nonuniform access

 

Nonuniform access means each host has access to only a subset of available ports.

 

nonuniform.png

 

 

 

The primary benefit to this approach is SAN simplicity: you remove the need to stretch a SAN over the network. Some users don't have dark fiber between sites or lack the infrastructure to tunnel FC SAN traffic over an IP network. In some cases, the additional latency overhead of an application accessing storage across sites on a regular basis would be unacceptable, rendering the improved availability of uniform access unnecessary.

 

The disadvantage to nonuniform access is that certain failure scenarios, including loss of the replication link, will result in half of your hosts losing access to storage. Applications that run as single instances, such as a non-clustered database that is inherently only running on a single host at any given moment, would fail if local storage connectivity were lost. The data would still be protected, but the database server would no longer have access. It would need to be restarted on a remote site, preferably through an automated process. For example, VMware HA can detect an all-paths-down situation and restart a VM on another server where paths are available. In contrast, a clustered application such as Oracle RAC can deliver the same services that are constantly running at both sites. Losing a site with Oracle RAC doesn’t mean loss of the application service as a whole. Instances are still available and running at the surviving site.

 

Uniform access with ASA

 

The diagrams above showed path prioritization with AFF storage systems. NetApp ASA systems provide active-active multipathing, including with the use of SM-as. Consider the following diagram:

 

fullmesh.asa.png

 

An ASA configuration with non-uniform access would work largely the same as it would with AFF. With uniform access, IO would be crossing the WAN. This may or may not be desirable. If the two sites were 100 meters apart with fiber connectivity there should be no detectable additional latency crossing the WAN, but if the sites were a long distance apart then read performance would suffer on both sites. In contrast, with AFF those WAN-crossing paths would only be used if there were no local paths available and read performance would be better.

 

ASA with SM-as in a low-latency configuration offers two interesting benefits. First, it essentially doubles the performance for any single host because IO can be serviced by twice as many controllers using twice as many paths. Second, it offers extreme availability because an entire storage system could be lost without interrupting host access.

 

SnapMirror active sync vs MetroCluster

 

Those of you that know NetApp probably know SM-as' cousin, MetroCluster. SM-as is similar to NetApp MetroCluster in overall functionality, but there are important differences in the way in which RPO=0 replication is implemented and how it is managed.

 

  • A MetroCluster configuration is more like one integrated cluster with nodes distributed across sites. SM-as behaves like two otherwise independent clusters that are cooperating in serving data from specified RPO=0 synchronously replicated LUNs.
  • The data in a MetroCluster configuration is only accessible from one particular site at any given time. A second copy of the data is present on the opposite site, but the data is passive. It cannot be accessed without a storage system failover.
  • MetroCluster and SM-as mirroring occur at different levels. MetroCluster mirroring is performed at the RAID layer. The low-level data is stored in a mirrored format using SyncMirror. The use of mirroring is virtually invisible up at the LUN, volume, and protocol layers.
  • In contrast, SM-as mirroring occurs at the protocol layer. The two clusters are overall independent clusters. Once the two copies of data are in sync, the two clusters only need to mirror writes. When a write occurs on one cluster, it is replicated to the other cluster. The write is only acknowledged to the host when the write has completed on both sites. Other than this protocol splitting behavior, the two clusters are otherwise normal ONTAP clusters.
  • The sweet spot for MetroCluster is large-scale replication. You can replicate an entire array with RPO=0 and near-zero RTO. This simplifies the failover process because there is only one "thing" to fail over, and it scales extremely well in terms of capacity and IOPS.
  • The first sweet spot for SM-as is granular replication. Sometimes you don’t want to replicate all data on a storage system as a single unit, or you need to be able to selectively fail over certain workloads.
  • The second sweet spot for SM-as is for active-active operations, where you want fully usable copies of data to be available on two different clusters located in two different locations with identical performance characteristics and, if desired, no requirement to stretch the SAN across sites. You can have your applications already running on both sites, which reduces the overall RTO during failover operations.

 

SnapMirror active sync setup

 

Although the following sections focus on an Oracle RAC configuration, the concepts are applicable to most any SnapMirror active sync deployment. Using Oracle RAC as an example is especially helpful because Oracle RAC is inherently complicated. If you understand how to configure and manage Oracle RAC on SM-as, including all the nuances, requirements, and especially the benefits ONTAP delivers for such a configuration, you’ll be able to use SM-as for any workload.

 

Additionally, I’ll be using the CLI to configure ONTAP, but you can also use the SystemManager GUI, and I want to say that I think it's outstanding. The UI team did great work with SnapMirror active sync. I’m using the CLI mostly because I personally prefer a CLI, but also because it's easier to explain what's happening with ONTAP if I can go step-by-step with the CLI. The GUI automates multiple operations in a single step.

 

Finally, I'll assume a mediator is already properly configured.

 

Storage Virtual Machines (SVM)

 

One item of ONTAP terminology you need to understand is the SVM, also called a vserver. ONTAP storage clusters are built for multitenancy. When you first install a system, it doesn't yet provide storage services, in the same way that a newly installed VMware hypervisor can't run applications until you create a VM.

 

An ONTAP SVM (again, also called a vserver, especially at the CLI) is not an actual VM running under a hypervisor, but conceptually it's the same thing. It's a virtual storage system with its own network addresses, security configuration, user accounts, and so forth. Once you create an SVM, you then provision your file shares, LUNs, S3 buckets, and other storage resources under the control of that SVM.

 

You'll see the use of -vserver in the examples below. I have two clusters, Cluster1 and Cluster2. Cluster1 includes an SVM called jfs_as1, which will be participating in a SnapMirror active sync relationship with SVM jfs_as2, located on Cluster2.
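If you need to create the SVMs themselves, it's a quick operation. The commands below are a hedged sketch rather than my exact lab steps: the aggregate name (aggr1), the root volume name, and the security style are illustrative assumptions, and the second command simply enables the iSCSI service on the new SVM.

Cluster1::> vserver create -vserver jfs_as1 -aggregate aggr1 -rootvolume jfs_as1_root -rootvolume-security-style unix
Cluster1::> vserver iscsi create -vserver jfs_as1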

 

Provision storage

 

Once the SVM is defined, the next step is to create some volumes and LUNs. Remember – with ONTAP, a volume is not a LUN. A volume is a management object for storing data, including LUNs. We usually combine related LUNs in a single volume. The volume layout is usually designed to simplify management of those LUNs.

 

Confused? Here’s a diagram of my basic volume and LUN layout.

 

 

pre-mirroring layout.png

 

 

 

There is no true best practice for any database layout. The best option depends on what you want to do with the data. In this case, there are three volumes:

 

  • One volume for the Oracle Grid management and quorum LUNs.
  • One volume for the Oracle datafiles.
  • One volume for the Oracle logs, and this includes redo logs, archive logs, and the controlfiles.
  • A LUN count of 8 for the datafiles. This isn't an ONTAP limitation, it's about the host OS. You need multiple LUNs to maximize SAN performance through the host IO stack; 4-8 LUNs are usually required. The LUN count for logs and other sequentially accessed files is less important.

 

This design is all about manageability. For example, if you need to restore a database, the quickest way to do it would be to revert the state of the datafiles and then replay logs to the desired restore point. If you store the datafiles in a single volume, you can revert the state of that one volume to an earlier snapshot, but you need to ensure the log files are separated from datafiles. Otherwise, reverting the state of the datafiles would result in loss of archive log data that is critical for RPO=0 data recovery.

 

Likewise, this design allows you to manage performance more easily. You simply place a QoS limit on the datafile volume, rather than individual LUNs. As an aside, never place a QoS limit on a database transaction log because of the bursty nature of log IO. The average IO for transaction logs is usually fairly low, but the IO occurs in short bursts. If QoS engages during those bursts, the result will be significant performance damage. QoS is for datafiles only.
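As a hedged illustration of what that looks like (the 20,000 IOPS ceiling is an arbitrary number for the example, not a recommendation, and the volume referenced is the datafile volume created in the next section), you define a QoS policy group and attach it to the datafile volume only:

Cluster1::> qos policy-group create -policy-group jfsAA_oradata_qos -vserver jfs_as1 -max-throughput 20000iops
Cluster1::> volume modify -vserver jfs_as1 -volume jfsAA_oradata_siteA -qos-policy-group jfsAA_oradata_qos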

 

Create the volumes

 

I created my three volumes as follows.

 

Cluster1::> vol create -vserver jfs_as1 -volume jfsAA_grid_siteA -snapshot-policy none -percent-snapshot-space 0 -size 256g -space-guarantee none
Cluster1::> vol create -vserver jfs_as1 -volume jfsAA_oradata_siteA -snapshot-policy none -percent-snapshot-space 0 -size 1t -space-guarantee none
Cluster1::> vol create -vserver jfs_as1 -volume jfsAA_logs_siteA -snapshot-policy none -percent-snapshot-space 0 -size 500g -space-guarantee none

 

There’s a lot of different options when creating a volume. If you use SystemManager, you’ll get volumes with default behavior that is close to universally appropriate, but when using the CLI you might need to look at all the available options.

 

In my case, I wanted to create volumes for the grid, datafile, and log LUNs that include the following attributes:

 

  • Disable scheduled snapshots. Scheduled snapshots can provide powerful on-box data protection, but the schedule, retention policy, and naming conventions need to be based on the SLAs. For now, I’d rather just disable snapshots to ensure none are unknowingly created.
  • Set the snapshot reserve to 0%. There is no reason to reserve snapshot space in a SAN environment.
  • Set space guarantees to none. Space guarantees would reserve the volume’s full capacity on the system, which is almost always wasteful with a database. Most databases compress well, so reserving the full size of the volume would be unnecessary.
  • Leave space efficiency settings enabled. This is the default behavior and does not require special arguments. You can confirm all of these settings after creation with the command shown below this list.
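A hedged example of that check (the field list is a sketch; your release may expose more fields worth verifying):

Cluster1::> volume show -vserver jfs_as1 -volume jfsAA_* -fields snapshot-policy,percent-snapshot-space,space-guarantee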

 

Create the LUNs

 

The next step is to create the required LUNs inside the volumes. I’ll be using the defaults, plus disabling space reservations. This will result in LUNs only consuming the space actually required by that LUN within the volume.

 

Cluster1::> lun create -vserver jfs_as1 /vol/jfsAA_grid_siteA/lun0 -size 64g -ostype linux -space-reserve disabled
Cluster1::> lun create -vserver jfs_as1 /vol/jfsAA_grid_siteA/lun1 -size 64g -ostype linux -space-reserve disabled

Cluster1::> lun create -vserver jfs_as1 /vol/jfsAA_logs_siteA/lun0 -size 64g -ostype linux -space-reserve disabled
Cluster1::> lun create -vserver jfs_as1 /vol/jfsAA_logs_siteA/lun1 -size 64g -ostype linux -space-reserve disabled

 

Define the CG

 

There are now three volumes of data: Oracle datafile LUNs, Oracle log LUNs, and RAC cluster resource LUNs. While three separate volumes deliver optimal manageability, failover needs to be handled as a single unit. This RAC environment needs to be replicated and kept in sync as a unified whole. If site failure occurs, all resources at the surviving site need to be consistent with one another. We need a consistency group.

 

As mentioned above, with ONTAP, a volume is not a LUN, it’s just a management container. If all the LUNs in a given dataset are placed in a single volume, you can snapshot, clone, restore, or replicate that single volume as a unit. In other words, a volume in ONTAP is natively a consistency group.

 

In many SM-as use cases, placing all the LUNs of a given application in a single volume might be all you need to meet your data protection requirements. Sometimes your requirements are more complicated. You might need to separate an application into multiple volumes based on manageability requirements, but also want to manage the application as a unit. That’s why we created ONTAP Consistency Groups; you can read more about them in the ONTAP documentation.

 

I can define a consistency group for my three volumes at the CLI by providing a CG name and the current volumes.

 

Cluster1::> consistency-group create -vserver jfs_as1 -consistency-group jfsAA -volumes jfsAA_oradata_siteA,jfsAA_logs_siteA,jfsAA_grid_siteA

 

The result is that I've created a CG called jfsAA (I used the letters AA to denote the active-active configuration I’m building) based on the current grid, datafile, and log volumes and the LUNs they contain. Also, note that I used a suffix of _siteA so I can more easily keep track of which cluster hosts which volumes and the data within them. More on that below…
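Depending on your ONTAP release, you should be able to confirm the new CG and its member volumes from the same CLI command directory. Treat this as a sketch; the exact fields displayed vary by version.

Cluster1::> consistency-group show -vserver jfs_as1 -consistency-group jfsAA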

 

Establish replication

 

There are now volumes of LUNs on site A, but replication isn’t operational yet. The first step to prepare for replication is to create volumes on site B. As mentioned previously, the SystemManager UI automates all of the setup work, but it’s easier to explain each step in the automated sequence of events using the CLI.

 

Create destination volumes

 

Before I can replicate data, I need a place to host the replicas.

 

Cluster2::> vol create -vserver jfs_as2 -volume jfsAA_grid_siteB -snapshot-policy none -percent-snapshot-space 0 -size 256g -type DP -space-guarantee none
Cluster2::> vol create -vserver jfs_as2 -volume jfsAA_oradata_siteB -snapshot-policy none -percent-snapshot-space 0 -size 1t -type DP -space-guarantee none
Cluster2::> vol create -vserver jfs_as2 -volume jfsAA_logs_siteB -snapshot-policy none -percent-snapshot-space 0 -size 500g -type DP -space-guarantee none

 

The commands above created three volumes on site B using the same settings as on site A, except they are of type “DP”, which means a data protection volume. This identifies a volume that will be joined to a replication relationship. No LUNs are being provisioned yet. They will be automatically created once replication is initialized and the content of the volumes on Cluster2 is synchronized from the source volumes on Cluster1.

 

Initialize replication

 

The following command creates a SnapMirror active sync relationship in active-active mode. This command is run on the cluster that hosts the uninitialized destination volumes. SnapMirror is designed as a pull technology. For example, asynchronous SnapMirror updates pull new data from the source; it is not pushed from the source to a destination.

 

Cluster2::> snapmirror create -source-path jfs_as1:/cg/jfsAA -destination-path jfs_as2:/cg/jfsAA -cg-item-mappings jfsAA_grid_siteA:@jfsAA_grid_siteB,jfsAA_oradata_siteA:@jfsAA_oradata_siteB,jfsAA_logs_siteA:@jfsAA_logs_siteB -policy AutomatedFailOverDuplex
Operation succeeded: SnapMirror create for the relationship with destination "jfs_as2:/cg/jfsAA".

 

This operation makes a lot more sense if you see it in the GUI, but I can break down the command. Here’s what it does:

 

snapmirror create

 

That's self-evident. It's creating a snapmirror relationship.

 

-source-path jfs_as1:/cg/jfsAA

 

Use a source of the SVM called jfs_as1 and the consistency group called jfsAA. You'll see this syntax elsewhere in the ONTAP CLI. A consistency group is denoted as [svm name]:/cg/[cg name].

 

-destination-path jfs_as2:/cg/jfsAA

 

Replicate that source CG called jfsAA to the SVM called jfs_as2 and use the same CG name of jfsAA.

 

-cg-item-mappings \
jfsAA_grid_siteA:@jfsAA_grid_siteB, \
jfsAA_oradata_siteA:@jfsAA_oradata_siteB,\
jfsAA_logs_siteA:@jfsAA_logs_siteB

 

This section controls the mapping of volumes to volumes. The syntax is [source volume]:@[destination volume]. I've mapped the source grid volume to the destination grid volume, oradata to oradata, and logs to logs. The only difference in the name is the suffix. I used a siteA and siteB suffix for the volumes to avoid user errors. If someone is performing management activities on the UI, using either SystemManager or the CLI, they should be readily able to tell whether they're working on the site A or site B system based on the suffix of the volume names.

 

-policy AutomatedFailOverDuplex

 

The final argument specifies a SnapMirror relationship with the policy AutomatedFailOverDuplex, which means bidirectional synchronous replication with automated failure detection. The relationship now exists, but it's not yet initialized. That requires the following command on the second cluster.

 

Cluster2::> snapmirror initialize -destination-path jfs_as2:/cg/jfsAA
Operation is queued: SnapMirror initialize of destination "jfs_as2:/cg/jfsAA".

 

I can check the status as follows, and the key is looking for a Relationship Status of InSync and Healthy being true.

 

Cluster2::> snapmirror show -vserver jfs_as2 -destination-path jfs_as2:/cg/jfsAA
                            Source Path: jfs_as1:/cg/jfsAA
                       Destination Path: jfs_as2:/cg/jfsAA
                      Relationship Type: XDP
                Relationship Group Type: consistencygroup
                      SnapMirror Policy: AutomatedFailOverDuplex
                           Mirror State: Snapmirrored
                    Relationship Status: InSync
                                Healthy: true

Define the igroup

 

Before I make the LUNs available, I need to define the initiator group (igroup). I’m building an Oracle RAC cluster, which means two different hosts will be accessing the same LUNs. I’m using iSCSI, but FC works the same. It just uses WWNs rather than iSCSI initiator IDs.

 

First, I'll create the igroup on the site A system. Any LUNs mapped to this igroup will be available via iSCSI to hosts using the specified initiator.

 

Cluster1::> igroup create -vserver jfs_as1 -igroup jfsAA -ostype linux -initiator iqn.1994-05.com.redhat:a8ee93358a32

 

Next, I'll enter Advanced mode and associate the local initiator name with the local SVM. This is how ONTAP controls path priority. Any host with a WWN or iSCSI initiator listed in the igroup will be able to access LUNs mapped to that igroup, but the path priorities would not be optimal. I want paths originating on site A to be advertised as optimized paths only to the hosts located on site A.

 

Cluster1::> set advanced
Cluster1::*> igroup initiator add-proximal-vserver -vserver jfs_as1 iqn.1994-05.com.redhat:a8ee93358a32 -proximal-vservers jfs_as1

 

I then repeat the process on site B, using the WWN or iSCSI initiator of the host on site B.

 

Cluster2::> igroup create -vserver jfs_as2  -igroup jfsAA -ostype linux -initiator iqn.1994-05.com.redhat:5214562dfc56
Cluster2::> set advanced
Cluster2::*> igroup initiator add-proximal-vserver -vserver jfs_as2 iqn.1994-05.com.redhat:5214562dfc56 -proximal-vservers jfs_as2

 

I didn't really need to add the proximal-vserver information. Read on to understand why not.

 

Map the LUNs

 

Next, I map the LUNs to the igroup so the hosts will be able to access them. Although I used the same igroup name on each site, these igroups are located on different SVMs.

 

Cluster1::> lun map -vserver jfs_as1 /vol/jfsAA_grid_siteA/lun0  -igroup jfsAA
Cluster1::> lun map -vserver jfs_as1 /vol/jfsAA_grid_siteA/lun1  -igroup jfsAA

Cluster1::> lun map -vserver jfs_as1 /vol/jfsAA_oradata_siteA/lun6 -igroup jfsAA
Cluster1::> lun map -vserver jfs_as1 /vol/jfsAA_oradata_siteA/lun7 -igroup jfsAA

 

Cluster2::> lun map -vserver jfs_as2 /vol/jfsAA_grid_siteB/lun0  -igroup jfsAA
Cluster2::> lun map -vserver jfs_as2 /vol/jfsAA_grid_siteB/lun1  -igroup jfsAA

Cluster2::> lun map -vserver jfs_as2 /vol/jfsAA_oradata_siteB/lun6 -igroup jfsAA
Cluster2::> lun map -vserver jfs_as2 /vol/jfsAA_oradata_siteB/lun7 -igroup jfsAA
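
A quick way to double-check the results is to list the mappings on each SVM. This is a routine verification step, not something SM-as specifically requires.

Cluster1::> lun mapping show -vserver jfs_as1
Cluster2::> lun mapping show -vserver jfs_as2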

 

Oracle Configuration

 

From this point, setup is exactly like any other Oracle RAC server. Functionally, this is like a 2-site Oracle Extended RAC cluster, except there's no need to configure ASM failgroups. The replication services are built into the storage system.

 

Device Names

 

There are multiple ways to control device names with Oracle, but my personal preference is using udev rules and multipath aliases. It takes more up-front work, but I have more control over the exact naming conventions to be used.

 

The multipath.conf file looks like this on each RAC node:

 

[root@jfs12 ~]# cat /etc/multipath.conf
multipaths {
    multipath {
        wwid  3600a0980383041327a2b55676c547247
        alias grid0
    }
    multipath {
        wwid  3600a0980383041327a2b55676c547248
        alias grid1
    }
    multipath {
        wwid  3600a0980383041327a2b55676c547249
        alias grid2
    }
    …
    }
    multipath {
        wwid  3600a0980383041334a3f55676c69734d
        alias logs5
    }
    multipath {
        wwid  3600a0980383041334a3f55676c69734e
        alias logs6
    }
    multipath {
        wwid  3600a0980383041334a3f55676c69734f
        alias logs7
    }
}

 

Then I have the following udev rule:

 

[root@jfs12 ~]# cat /etc/udev/rules.d/99-asm.rules
ENV{DM_NAME}=="grid*", GROUP:="asmadmin", OWNER:="grid", MODE:="660"
ENV{DM_NAME}=="oradata*", GROUP:="asmadmin", OWNER:="grid", MODE:="660"
ENV{DM_NAME}=="logs*", GROUP:="asmadmin", OWNER:="grid", MODE:="660"

 

The result is clear device names that are automatically assigned the correct user and group permissions. You can even see the devices in /dev/mapper.

 

[root@jfs12 ~]# ls /dev/mapper
control  grid1  logs0  logs2  logs4  logs6  oradata0  oradata2  oradata4  oradata6  vg00-root

 

ASM Configuration

 

Unlike typical Extended RAC, I created the ASM diskgroups with external redundancy, which means no mirroring by ASM itself. Cross-site data protection is provided by the storage system, not by ASM failgroups. I created the following ASM diskgroups:

 

 

[root@jfs12 ~]# /grid/bin/asmcmd ls
DBF/
GRID/
LOGS/
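
For reference, each of these diskgroups was created with a single SQL statement against the ASM instance. The example below is a hedged sketch for the DBF diskgroup; it assumes the ASM discovery string already includes /dev/mapper and it uses the multipath aliases shown earlier.

[grid@jfs12 ~]$ sqlplus / as sysasm
SQL> CREATE DISKGROUP DBF EXTERNAL REDUNDANCY DISK '/dev/mapper/oradata*';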


From this point on, the installation process is exactly like any other Oracle RAC installation.

 

Database Creation

 

I used the following database layout:

 

[root@jfs12 ~]# /grid/bin/asmcmd ls DBF/NTAP
124D13D9FE3FFF29E06370AAC00A260E/
124DA9A2A3BB954AE06370AAC00A7624/
DATAFILE/
NTAPpdb1/
PARAMETERFILE/
PASSWORD/
TEMPFILE/
pdbseed/
sysaux01.dbf
system01.dbf
temp01.dbf
undotbs01.dbf
undotbs02.dbf
users01.dbf

[root@jfs12 ~]# /grid/bin/asmcmd ls LOGS/NTAP
ARCHIVELOG/
CONTROLFILE/
ONLINELOG/
control01.ctl
control02.ctl
redo01.log
redo02.log
redo03.log
redo04.log

 

There is a reason for this, but it's not connected to the use of SnapMirror active sync. See the section above called "Provision storage" for an explanation.

 

Failure scenarios

 

The following diagram shows the Oracle database and storage configuration as it exists at the hardware level (mediator not shown), and it is using a non-uniform network configuration. There is no SAN connectivity across sites.

 

logical.v1.png

 

 

 

I wrote above that I didn't need to configure host proximity. The reason is that I'm using a nonuniform configuration. I have not stretched the SAN across sites. The only paths available to hosts are local paths. There's no reason to add host proximity settings because hosts will never be able to see storage paths to the opposite site. I included the instructions for configuring host proximity for readers who may be using uniform configurations and need to know how proximity settings are controlled.

 

Several of the scenarios described below that resulted in loss of database services would not have happened with a uniform network configuration. With no cross-site SAN connectivity, anything that results in the loss of active paths on a given site means there are no paths remaining at all. In a uniform network configuration, each site would be able to use alternate paths on the opposite site. The reason a non-uniform configuration was chosen for these tests was to illustrate that it is possible to have active-active RPO=0 replication without extending the SAN. It's a more demanding use case, and while it does have some limitations, it also has benefits, such as a simpler SAN architecture.

 

There's another way to look at the storage architecture. The existence of replication is essentially invisible to the Oracle database and the RAC cluster. Various failure scenarios might result in disruption to certain database servers or the loss of certain paths, but as far as the database is concerned, there's just one set of LUNs. Logically, it looks like this:

 

logical1.png

 

 

Preferred sites

 

The configuration is symmetric, with one exception that is connected to split-brain management.

 

The question you need to ask is this - what happens if the replication link is lost and neither site has quorum? What do you want to happen? This question applies to both the Oracle RAC and the ONTAP behavior. If changes cannot be replicated across sites, and you want to resume operations, one of the sites will have to survive and the other site will have to become unavailable.

 

Oracle and css_critical

 

In the case of Oracle RAC, the default behavior is that one of the nodes in a RAC cluster that consists of an even number of servers will be deemed more important than the other nodes. The site with that higher priority node will survive site isolation while the nodes on the other site will evict. The prioritization is based on multiple factors, but you can also control this behavior using the css_critical setting.

 

My environment has two nodes, jfs12 and jfs13. The current settings for css_critical are as follows:

 

[root@jfs12 ~]# /grid/bin/crsctl get server css_critical
CRS-5092: Current value of the server attribute CSS_CRITICAL is no.

 

[root@jfs13 trace]# /grid/bin/crsctl get server css_critical
CRS-5092: Current value of the server attribute CSS_CRITICAL is no.

 

I want the site with jfs12 to be the preferred site, so I changed this value to yes on the site A node and restarted services.

 

[root@jfs12 ~]# /grid/bin/crsctl set server css_critical yes
CRS-4416: Server attribute 'CSS_CRITICAL' successfully changed. Restart Oracle High Availability Services for new value to take effect.

[root@jfs12 ~]# /grid/bin/crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'jfs12'
CRS-2673: Attempting to stop 'ora.crsd' on 'jfs12'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on server 'jfs12'
CRS-2673: Attempting to stop 'ora.ntap.ntappdb1.pdb' on 'jfs12'


CRS-2673: Attempting to stop 'ora.gipcd' on 'jfs12'
CRS-2677: Stop of 'ora.gipcd' on 'jfs12' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'jfs12' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@jfs12 ~]# /grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

 

SnapMirror active sync preferred site

 

At any given moment, SnapMirror active sync will consider one site the "source" and the other the "destination". This implies a one-way replication relationship, but that's not what's happening. This is how the preferred site is determined. If the replication link is lost, the LUN paths on the source will continue to serve data while the LUN paths on the destination will become unavailable until replication is reestablished and enters a synchronous state. The paths will then resume serving data.

 

My current configuration has site A as the preferred site for both Oracle and ONTAP. This can be viewed via SystemManager:

 

sm.1.png

 

 

 

or at the CLI:

 

Cluster2::> snapmirror show -destination-path jfs_as2:/cg/jfsAA
                            Source Path: jfs_as1:/cg/jfsAA
                       Destination Path: jfs_as2:/cg/jfsAA
                      Relationship Type: XDP
                Relationship Group Type: consistencygroup
                    SnapMirror Schedule: -
                 SnapMirror Policy Type: automated-failover-duplex
                      SnapMirror Policy: AutomatedFailOverDuplex
                           Mirror State: Snapmirrored
                    Relationship Status: InSync

 

The key is that the source is the SVM on cluster1. As mentioned above, the terms "source" and "destination" don't describe the flow of replicated data. Both sites can process a write and replicate it to the opposite site. In effect, both clusters are sources and destinations. The effect of designating one cluster as a source simply controls which cluster survives as a read-write storage system if the replication link is lost.

 

Loss of SnapMirror replication connectivity

 

If I cut the SM-as replication link, write IO cannot be completed because it would be impossible for a cluster to replicate changes to the opposite site. Here's what will happen with an Oracle RAC environment.

 

Site A

 

The result on site A of a replication link failure will be an approximately 15 second pause in write IO processing as ONTAP attempts to replicate writes before it determines that the replication link is genuinely inoperable. After the 15 seconds elapses, the ONTAP cluster on site A resumes read and write IO processing. The SAN paths will not change, and the LUNs will remain online.

 

Site B

 

Since site B is not the SnapMirror active sync preferred site, its LUN paths will become unavailable after about 15 seconds.

 

The replication link was cut at the timestamp 15:19:44. The first warning from Oracle RAC arrives 100 seconds later as the 200 second timeout (controlled by the Oracle RAC parameter disktimeout) approaches.

 

2024-09-10 15:21:24.702 [ONMD(2792)]CRS-1615: No I/O has completed after 50% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 99340 milliseconds.
2024-09-10 15:22:14.706 [ONMD(2792)]CRS-1614: No I/O has completed after 75% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 49330 milliseconds.
2024-09-10 15:22:44.708 [ONMD(2792)]CRS-1613: No I/O has completed after 90% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 19330 milliseconds.
2024-09-10 15:23:04.710 [ONMD(2792)]CRS-1604: CSSD voting file is offline: /dev/mapper/grid2; details at (:CSSNM00058:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc.
2024-09-10 15:23:04.710 [ONMD(2792)]CRS-1606: The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc
2024-09-10 15:23:04.716 [ONMD(2792)]CRS-1699: The CSS daemon is terminating due to a fatal error from thread: clssnmvDiskPingMonitorThread; Details at (:CSSSC00012:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc
2024-09-10 15:23:04.731 [OCSSD(2794)]CRS-1652: Starting clean up of CRSD resources.

Once the 200 second voting disk timeout has been reached, this Oracle RAC node will evict itself from the cluster and reboot.

 

Loss of Oracle RAC replication

 

Loss of the Oracle RAC replication link will produce a similar result, except the timeouts will be shorter by default. An Oracle RAC node will wait 200 seconds after loss of storage connectivity before evicting, but it will only wait 30 seconds after loss of the RAC network heartbeat.

 

The CRS messages are similar to those shown below. You can see the 30 second timeout elapse. Since css_critical was set on jfs12, located on site A, that is the site that survives, and jfs13 on site B is evicted.

 

2024-09-12 10:56:44.047 [ONMD(3528)]CRS-1611: Network communication with node jfs13 (2) has been missing for 75% of the timeout interval.  If this persists, removal of this node from cluster will occur in 6.980 seconds
2024-09-12 10:56:48.048 [ONMD(3528)]CRS-1610: Network communication with node jfs13 (2) has been missing for 90% of the timeout interval.  If this persists, removal of this node from cluster will occur in 2.980 seconds
2024-09-12 10:56:51.031 [ONMD(3528)]CRS-1607: Node jfs13 is being evicted in cluster incarnation 621599354; details at (:CSSNM00007:) in /gridbase/diag/crs/jfs12/crs/trace/onmd.trc.
2024-09-12 10:56:52.390 [CRSD(6668)]CRS-7503: The Oracle Grid Infrastructure process 'crsd' observed communication issues between node 'jfs12' and node 'jfs13', interface list of local node 'jfs12' is '192.168.30.1:33194;', interface list of remote node 'jfs13' is '192.168.30.2:33621;'.
2024-09-12 10:56:55.683 [ONMD(3528)]CRS-1601: CSSD Reconfiguration complete. Active nodes are jfs12 .
2024-09-12 10:56:55.722 [CRSD(6668)]CRS-5504: Node down event reported for node 'jfs13'.
2024-09-12 10:56:57.222 [CRSD(6668)]CRS-2773: Server 'jfs13' has been removed from pool 'Generic'.
2024-09-12 10:56:57.224 [CRSD(6668)]CRS-2773: Server 'jfs13' has been removed from pool 'ora.NTAP'.

 

Complete loss of replication network

 

Oracle RAC split-brain detection has a dependency on the Oracle RAC storage heartbeat. If loss of site-to-site connectivity results in simultaneous loss of both the RAC network heartbeat and storage replication services, the RAC sites will not be able to communicate cross-site via either the RAC interconnect or the RAC voting disks. With an even number of nodes, the result under default settings may be eviction of both sites. The exact behavior will depend on the sequence of events and the timing of the RAC network and disk heartbeat polls.

 

The risk of a 2-site outage can be addressed in two ways. The first is to use an odd number of RAC nodes, preferably by placing a tiebreaker instance on a 3rd site. If a 3rd site is unavailable, the tiebreaker instance could be placed on one of the two main sites. This means that loss of site-to-site connectivity will cause LUN paths to go down on one site, but one of the RAC sites will still have quorum and will not evict.

 

The second option, if a 3rd site is not available, is to adjust the misscount parameter on the RAC cluster. Under the defaults, the RAC network heartbeat timeout is 30 seconds. This is normally used by RAC to identify failed RAC nodes and remove them from the cluster, but it also has a connection to the voting disk heartbeat.

 

If, for example, the conduit carrying intersite traffic for both Oracle RAC and storage replication services is cut by a backhoe, the 30 second misscount countdown will begin. If the RAC preferred site node cannot reestablish contact with the opposite site within 30 seconds, and it also cannot use the voting disks to confirm the opposite site is down within that same 30 second window, then the preferred site nodes will also evict. The result is a full database outage.

 

Depending on when the misscount polling occurs, 30 seconds may not be enough time for SnapMirror active sync to time out and allow storage on the preferred site to resume services before the 30 second window expires. This 30 second window can be increased.
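Before changing anything, it's worth checking the current values. These are standard crsctl queries run on any RAC node (output omitted here):

[root@jfs12 ~]# /grid/bin/crsctl get css misscount
[root@jfs12 ~]# /grid/bin/crsctl get css disktimeout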

 

[root@jfs12 ~]# /grid/bin/crsctl set css misscount 100
CRS-4684: Successful set of parameter misscount to 100 for Cluster Synchronization Services.

 

This value allows the storage system on the preferred site to resume operations before the misscount timeout expires. The result will then be eviction only of the nodes at the site where the LUN paths were removed. Example below:

 

2024-09-12 09:50:59.352 [ONMD(681360)]CRS-1612: Network communication with node jfs13 (2) has been missing for 50% of the timeout interval.  If this persists, removal of this node from cluster will occur in 49.570 seconds
2024-09-12 09:51:10.082 [CRSD(682669)]CRS-7503: The Oracle Grid Infrastructure process 'crsd' observed communication issues between node 'jfs12' and node 'jfs13', interface list of local node 'jfs12' is '192.168.30.1:46039;', interface list of remote node 'jfs13' is '192.168.30.2:42037;'.
2024-09-12 09:51:24.356 [ONMD(681360)]CRS-1611: Network communication with node jfs13 (2) has been missing for 75% of the timeout interval.  If this persists, removal of this node from cluster will occur in 24.560 seconds
2024-09-12 09:51:39.359 [ONMD(681360)]CRS-1610: Network communication with node jfs13 (2) has been missing for 90% of the timeout interval.  If this persists, removal of this node from cluster will occur in 9.560 seconds
2024-09-12 09:51:47.527 [OHASD(680884)]CRS-8011: reboot advisory message from host: jfs13, component: cssagent, with time stamp: L-2024-09-12-09:51:47.451
2024-09-12 09:51:47.527 [OHASD(680884)]CRS-8013: reboot advisory message text: oracssdagent is about to reboot this node due to unknown reason as it did not receive local heartbeats for 10470 ms amount of time
2024-09-12 09:51:48.925 [ONMD(681360)]CRS-1632: Node jfs13 is being removed from the cluster in cluster incarnation 621596607

 

Note: Oracle Support strongly discourages altering the misscount or disktimeout parameters to solve configuration problems. Changing these parameters can, however, be warranted and unavoidable in many cases, including SAN booting, virtualized, and storage replication configurations. If, for example, you had stability problems with a SAN or IP network that were resulting in RAC evictions, you should fix the underlying problem and not change the values of misscount or disktimeout. Changing timeouts to address configuration errors is masking a problem, not solving a problem. Changing these parameters to properly configure a RAC environment based on design aspects of the underlying infrastructure is different and is consistent with Oracle support statements. With SAN booting, it is common to adjust misscount all the way up to 200 to match disktimeout.

 

Storage system failure

 

The result of a storage system failure is nearly identical to the result of losing the replication link. The surviving site should experience a roughly 15 second IO pause on writes. Once that 15 second period elapses, IO will resume on that site as usual.

 

The Oracle RAC node on the failed site will lose storage services and enter the same 200 second disktimeout countdown before eviction and subsequent reboot.

 

2024-09-11 13:44:38.613 [ONMD(3629)]CRS-1615: No I/O has completed after 50% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 99750 milliseconds.
2024-09-11 13:44:51.202 [ORAAGENT(5437)]CRS-5011: Check of resource "NTAP" failed: details at "(:CLSN00007:)" in "/gridbase/diag/crs/jfs13/crs/trace/crsd_oraagent_oracle.trc"
2024-09-11 13:44:51.798 [ORAAGENT(75914)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 75914
2024-09-11 13:45:28.626 [ONMD(3629)]CRS-1614: No I/O has completed after 75% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 49730 milliseconds.
2024-09-11 13:45:33.339 [ORAAGENT(76328)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 76328
2024-09-11 13:45:58.629 [ONMD(3629)]CRS-1613: No I/O has completed after 90% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 19730 milliseconds.
2024-09-11 13:46:18.630 [ONMD(3629)]CRS-1604: CSSD voting file is offline: /dev/mapper/grid2; details at (:CSSNM00058:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc.
2024-09-11 13:46:18.631 [ONMD(3629)]CRS-1606: The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc
2024-09-11 13:46:18.638 [ONMD(3629)]CRS-1699: The CSS daemon is terminating due to a fatal error from thread: clssnmvDiskPingMonitorThread; Details at (:CSSSC00012:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc
2024-09-11 13:46:18.651 [OCSSD(3631)]CRS-1652: Starting clean up of CRSD resources.

 

The SAN path state on the RAC node that has lost storage services looks like this:

 

oradata7 (3600a0980383041334a3f55676c697347) dm-20 NETAPP,LUN C-Mode
size=128G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 34:0:0:18 sdam 66:96  failed faulty running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 33:0:0:18 sdaj 66:48  failed faulty running

 

The Linux host detected the loss of the paths much more quickly than 200 seconds, but from a database perspective the client connections to the host on the failed site will still be frozen for 200 seconds under the default Oracle RAC settings. Full database operations will only resume after the eviction is completed.

 

Meanwhile, the Oracle RAC node on the opposite site will record the loss of the other RAC node. It otherwise continues to operate as usual.

 

2024-09-11 13:46:34.152 [ONMD(3547)]CRS-1612: Network communication with node jfs13 (2) has been missing for 50% of the timeout interval.  If this persists, removal of this node from cluster will occur in 14.020 seconds
2024-09-11 13:46:41.154 [ONMD(3547)]CRS-1611: Network communication with node jfs13 (2) has been missing for 75% of the timeout interval.  If this persists, removal of this node from cluster will occur in 7.010 seconds
2024-09-11 13:46:46.155 [ONMD(3547)]CRS-1610: Network communication with node jfs13 (2) has been missing for 90% of the timeout interval.  If this persists, removal of this node from cluster will occur in 2.010 seconds
2024-09-11 13:46:46.470 [OHASD(1705)]CRS-8011: reboot advisory message from host: jfs13, component: cssmonit, with time stamp: L-2024-09-11-13:46:46.404
2024-09-11 13:46:46.471 [OHASD(1705)]CRS-8013: reboot advisory message text: At this point node has lost voting file majority access and oracssdmonitor is rebooting the node due to unknown reason as it did not receive local hearbeats for 28180 ms amount of time
2024-09-11 13:46:48.173 [ONMD(3547)]CRS-1632: Node jfs13 is being removed from the cluster in cluster incarnation 621516934

 

Cut the power to the mediator

 

The mediator service does not directly control storage operations. It functions as an alternate control path between clusters. It exists primarily to automate failover without the risk of a split-brain scenario. In normal operation, each cluster is replicating changes to its partner, and each cluster therefore can verify that the partner cluster is online and serving data. If the replication link failed, replication would cease.

 

The reason a mediator is required for safe automated operations is that it would otherwise be impossible for a storage cluster to determine whether loss of bidirectional communication was the result of a network outage or an actual storage failure.

 

The mediator provides an alternate path for each cluster to verify the health of its partner. The scenarios are as follows:

 

  • If a cluster can contact its partner directly, replication services are operational. No action required.
  • If a preferred site cannot contact its partner directly or via the mediator, it will assume the partner is either actually unavailable or was isolated and has taken its LUN paths offline. The preferred site will then proceed to release the RPO=0 state and continue processing both read and write IO.
  • If a non-preferred site cannot contact its partner directly, but can contact it via the mediator, it will take its paths offline and await the return of the replication connection.
  • If a non-preferred site cannot contact its partner directly or via an operational mediator, it will assume the partner is either actually unavailable or was isolated and has taken its LUN paths offline. The non-preferred site will then proceed to release the RPO=0 state and continue processing both read and write IO. It will assume the role of the replication source and will become the new preferred site.

 

If the mediator is wholly unavailable:

 

  • Failure of replication services for any reason will result in the preferred site releasing the RPO=0 state and resuming read and write IO processing. The non-preferred site will take its paths offline.
  • Failure of the preferred site will result in an outage because the non-preferred site will be unable to verify that the opposite site is truly offline and therefore it would not be safe for the nonpreferred site to resume services.

 

Restoring services

 

It's difficult to demonstrate what happens when power is restored after a failure or a replication link failure is repaired. SnapMirror active sync will automatically detect the presence of a faulty replication relationship and bring it back to an RPO=0 state. Once synchronous replication is reestablished, the paths will come online again.

 

In many cases, clustered applications will automatically detect the return of failed paths, and those applications will also come back online. In other cases, a host-level SAN scan may be required, or applications may need to be brought back online manually. It depends on the application and how it's configured, and in general such tasks can be easily automated. ONTAP itself is self-healing and should not require any user intervention to resume RPO=0 storage operations.
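If a host does need a manual nudge after the paths return, the usual Linux sequence is a SCSI bus rescan followed by a multipath reload. This is a generic sketch; rescan-scsi-bus.sh ships with the sg3_utils package.

[root@jfs12 ~]# rescan-scsi-bus.sh -a
[root@jfs12 ~]# multipath -r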

 

Another feature of SnapMirror active sync that is difficult to demonstrate is the speed of recovery. Unlike many competitors, ONTAP replication uses a pointer-based mechanism to track changes. The result is resynchronization after replication interruption is extremely fast. ONTAP is able to identify changed blocks and efficiently and quickly ship them to the remote system, even if replication was interrupted for days, weeks, or months. Many competitors use a slower and less efficient extent-based approach which requires reading and transferring much more data than merely those blocks which have changed. In some cases, the entire mirror must be rebuilt.

 

In addition, ONTAP's unique pointer-based technology preserves an intact copy of your data during resynchronization. The copy would be out of date, but it would exist. Resynchronization with many competing technologies results in a copy that is corrupt until the resynchronization is complete. This leaves a customer exposed to significant data loss if the last remaining copy of the data is damaged. 

 

This is one of many examples of how ONTAP features don't just get the job done, they deliver a superior end result because ONTAP just works better.

SnapMirror active sync failover

 

Changing the preferred site requires a simple operation. IO will pause for a second or two as authority over replication behavior switches between clusters, but IO is otherwise unaffected.

 

GUI example:

 

sm.2.png

 

 

 

Example of changing it back via the CLI:

 

Cluster2::> snapmirror failover start -destination-path jfs_as2:/cg/jfsAA
[Job 9575] Job is queued: SnapMirror failover for destination "jfs_as2:/cg/jfsAA                      ".

Cluster2::> snapmirror failover show
Source    Destination                                          Error
Path      Path        Type     Status    start-time end-time   Reason
-------- -----------  -------- --------- ---------- ---------- ----------
jfs_as1:/cg/jfsAA
         jfs_as2:/cg/jfsAA
                      planned  completed 9/11/2024  9/11/2024
                                         09:29:22   09:29:32

 

The new destination path can be verified as follows:

 

 

Cluster1::> snapmirror show -destination-path jfs_as1:/cg/jfsAA
                            Source Path: jfs_as2:/cg/jfsAA
                       Destination Path: jfs_as1:/cg/jfsAA
                      Relationship Type: XDP
                Relationship Group Type: consistencygroup
                 SnapMirror Policy Type: automated-failover-duplex
                      SnapMirror Policy: AutomatedFailOverDuplex
                           Mirror State: Snapmirrored
                    Relationship Status: InSync

 

Summary

 

The examples above are designed to illustrate SnapMirror active sync functionality. As with most enterprise IT projects, there are few true best practices. The right configuration depends on business needs including RPO, RTO, and SLAs. With respect to disaster recovery, it also depends on which disaster scenarios are more likely than others to occur.

 

The next posts planned will include use of an Oracle RAC tiebreaker, use of uniform networking, and quantifying application IO resumption times. The foundation remains the same - SnapMirror active sync operating in active-active mode, an integrated RPO=0/RTO=0 SAN replication technology that is reliable, self-healing, easily configured, and easily managed.

 

If you want to learn more, you can visit the official ONTAP documentation for SnapMirror active sync, and if you'd really like to see it in action I recommend getting in touch with Neto and the rest of the CPOC (Customer Proof Of Concept) team. They've got all the cool hardware to simulate your real-world workloads. 

 
