Active-active data center (with Oracle!)
I've worked with Oracle customers on DR solutions for 15+ years. The perfect solution would, of course, be RPO=0 and RTO=0*, but not all applications can tolerate the write latency involved in an RPO=0 synchronous solution. Sometimes you have to settle for an RPO of 15 minutes or a slightly longer RTO.
Sometimes, however, RPO=0 and RTO=0 are required because the data is really that critical.
We've been able to do this with SnapMirror active sync (formerly known as SnapMirror Business Continuity) for a while, but now we can do it in symmetric active-active mode. You can now have two clusters in two completely different sites, each serving data, with identical performance characteristics, and you don't even need to extend the SAN across sites.
This is the foundation of what customers call the "active-active data center." There is no primary site and no DR site; there are just two sites. Half your database is running on site A, and the other half is running on site B. Each local storage system services all read IO from its local copy of the data. Write IO will, of course, be replicated to the opposite site before being acknowledged, because that's how synchronous mirroring works. Symmetric storage IO means symmetric database responses and symmetric application behavior.
SnapMirror active sync in active-active mode is in tech preview now with select customers. Oracle RAC is not yet a supported configuration, but there's no technical reason it shouldn't work, and I wanted to be ready for this feature to become generally available. I've been cutting power and network links for the past couple weeks, and I haven't managed to crash my database yet.
*Note: There's really no such thing as RTO=0, because it takes a certain amount of time to know whether recovery procedures are even warranted. You don't want a total disaster failover triggered just because a single IO operation didn't complete within one second. I still consider SnapMirror active sync an RTO=0 solution because the environment is already running at the opposite site. The lag in resuming operations isn't caused by the failover itself; it's that it sometimes takes at least 15-30 seconds, even under automation, to be sure that failover is actually required.
I'm developing reference architectures with and without a 3rd site Oracle RAC tiebreaker and plan to release some accompanying videos, but here's an overview of how it works. Take a look at the diagram and then continue reading to understand the value.
Architecture
This is a typical Oracle RAC configuration with a database called NTAP and two instances, NTAP1 and NTAP2. The diagram might look complicated at first, but here's the key to understanding it:
SnapMirror active sync is invisible
From an Oracle and host point of view, this is just one set of LUNs on a single cluster. The replication is invisible. It's the same set of LUNs at both sites. I haven't even stretched the SAN across sites, although I could have done that if I wanted to. I'd rather not create a cross-site ISL if I don't have to.
When I installed RAC, I had a couple of hosts that each had a set of 3 LUNs to be used for quorum management. These hosts, jfs12 and jfs13, each see the same LUNs with the same serial numbers and the same data.
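If you want to verify this yourself, the LUN serial numbers are visible from the host. Here's a quick check; the device name is a placeholder from my lab, so substitute one of your own multipath devices:

```bash
# The WWIDs printed by multipath and the serial reported by scsi_id
# should be identical on jfs12 and jfs13 if both hosts really do see
# the same LUNs.
multipath -ll
/lib/udev/scsi_id -g -u /dev/sdb
```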
When I created the database, I created an 8-LUN ASM diskgroup for the datafiles and an 8-LUN ASM diskgroup for logs. It doesn't matter which host I use to make the database. They're both using the same LUNs.
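For illustration, here's roughly what that looks like. The diskgroup names and device paths are placeholders, and external redundancy is the right choice because the mirroring is handled by the storage layer, not by ASM:

```bash
# Sketch: create the datafile and log diskgroups from either node.
# The DISK clause accepts a wildcard discovery pattern.
sqlplus / as sysasm <<'EOF'
CREATE DISKGROUP DATA EXTERNAL REDUNDANCY DISK '/dev/mapper/ntap-data-*';
CREATE DISKGROUP LOGS EXTERNAL REDUNDANCY DISK '/dev/mapper/ntap-logs-*';
EOF
```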
Think of it as one single system with paths that happen to exist at two different sites. Any path on either cluster leads to the same LUN.
SnapMirror active sync is symmetric
Database connections can now be made to either instance. If that instance needs to perform a read, the data will be retrieved from the local drives. Writes will be replicated to the opposite site before being acknowledged, of course, so site-to-site latency needs to be as low as possible.
It doesn't matter which site you're using. Database performance is the same, unless you intentionally chose different controller models with different performance limits, which is a valid choice. Maybe you want RPO=0/RTO=0, but one of your sites is designed to be just a temporary site and doesn't require the same storage horsepower as the other.
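You can see this behavior from the database itself by comparing read waits, which are served locally, against redo write waits, which include the replication round trip. This is just a standard Oracle wait-statistics query, nothing specific to this feature:

```bash
# Average waits in microseconds: 'db file sequential read' reflects
# local reads, while 'log file parallel write' includes the cross-site
# replication round trip.
sqlplus -s / as sysdba <<'EOF'
SELECT event,
       total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0)) AS avg_wait_us
FROM   v$system_event
WHERE  event IN ('db file sequential read', 'log file parallel write');
EOF
```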
SnapMirror active sync is resilient
This is the part I'm still working on documenting. There's a mediator service that acts as a heartbeat to detect controller failures. The mediator isn't an active tiebreaker, but it's a similar idea: it provides an alternate communication channel each cluster can use to check the health of the opposite cluster. For example, if the cluster on site B suddenly fails, the cluster on site A will lose the ability to contact cluster B either directly or via the mediator. That allows cluster A to break the mirroring and resume operations.
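Setting up the mediator is only a couple of commands. This is a sketch based on the ONTAP SnapMirror mediator CLI; the address, peer cluster name, and username are placeholders:

```
cluster-a::> snapmirror mediator add -mediator-address 203.0.113.10 -peer-cluster cluster-b -username mediatoradmin
cluster-a::> snapmirror mediator show
```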
Overall, "it just works". For example, my initial tests involved simply cutting the power at one site. Here's what happened:
- One set of paths ceased responding, while the other set of paths remained available
- All write IO paused because it was no longer possible to replicate the writes
- After about 30 seconds, the surviving site considered the site with the power failure truly dead and broke the mirroring so the surviving site could resume operations
- The Oracle instance on the failed site continued trying to contact storage for a full 200 seconds. This is the default timeout setting with RAC; you can change it if required (see the sketch after this list).
- After the 200-second timeout expired, the Oracle instance performed a self-reboot. It does this to help protect data from corruption due to a lingering IO operation stuck in a retry loop on the host.
- This also means that stalled transactions on the failed node were held for 200 seconds before being replayed on the surviving node. This is a good example of how the RTO of a storage system is not the only factor affecting failover times.
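As far as I can tell, the 200-second window is Oracle Clusterware's CSS disktimeout, which controls how long a node tolerates unresponsive voting disks before evicting itself. Assuming that's the parameter you want to tune, it's read and set with crsctl as root on a RAC node:

```bash
# Show the current voting-disk IO timeout (defaults to 200 seconds),
# then set it explicitly. A Clusterware restart may be needed for the
# change to take effect.
crsctl get css disktimeout
crsctl set css disktimeout 200
```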
The recovery process was unexpectedly seamless:
- Power was restored to the failed storage system
- It took about 8 minutes to fully power up, self-test, boot, and resume clustered operations
- The surviving site detected the return of the other site.
- The mirror was asynchronously resynchronized to bring the states of site A and site B close together. This took about 5 minutes.
- The mirror then transitioned back to a synchronized state (you can watch this happen from the CLI, as shown after this list)
- The Oracle server detected the presence of SAN paths
- The Oracle RAC process, which had been delaying the boot process, found usable RAC quorum devices
- The database instance came up again
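If you want to watch that resynchronization happen, the relationship state is visible from the ONTAP CLI. A sketch, using standard snapmirror show fields:

```
cluster-b::> snapmirror show -fields state,status,healthy
```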
That result surprised me. I expected more recovery work would be required, but it really was just "turn the power back on" and everything went back to normal.
I've got more to do, including getting timings of various operations, collecting logs, tuning RAC, and especially writing up Oracle RAC quorum behavior. It's not complicated, but it's not well documented by Oracle.
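In the meantime, if you're curious which devices RAC is actually using for quorum, Clusterware will list its voting files directly:

```bash
# Lists the voting files (the quorum LUNs seen by jfs12 and jfs13)
# along with their file universal IDs and locations.
crsctl query css votedisk
```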
Look for a lot more when the next version of ONTAP ships.