How would one automatically fail over to a SnapMirror volume when the primary volume fails?

eveares · ‎2018-05-10

Hi everyone, I am new here so please go easy on me.

At my workplace we use NetApp OnCommand for our unified storage solution/SAN. We also have all our critical volumes mirrored to our DR facility using SnapMirror.

The link between the main and DR facilities is 100 Mbps. The main site and the DR site use different subnets and IP ranges, but they are routable to each other under normal conditions.

i.e. main site = 10.10.1.xxx/255.255.248.0, DR site = 10.10.2.xxx/255.255.255.0
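A quick way to see how those two ranges relate is Python's standard `ipaddress` module. This is only a sketch: it assumes `.0` in place of the `xxx` host part. With the masks exactly as quoted, the main-site range works out to 10.10.0.0/21, which also covers 10.10.2.x, so the masks may be worth double-checking.

```python
import ipaddress

# Addresses as quoted above, with .0 standing in for the "xxx" host part
# (an assumption made only so the values parse).
main = ipaddress.ip_network("10.10.1.0/255.255.248.0", strict=False)
dr = ipaddress.ip_network("10.10.2.0/255.255.255.0", strict=False)

print("main site network:", main)
print("DR site network:  ", dr)
print("ranges overlap:   ", main.overlaps(dr))
```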

As for Microsoft Hyper-V, which we use: each of our Hyper-V hosts (2012 R2) currently has an iSCSI drive attached that holds that host's individual VMs. The hosts' iSCSI drives are connected to volumes on our main NetApp SAN using SnapDrive.

During DR, including DR practice drills, we quiesce and break the mirrored volumes from our main site and then mount the iSCSI drives on the DR Hyper-V hosts from the mirrored volumes at the DR site, which of course become writable once the mirror has been broken.

Earlier today I was casually researching Microsoft clustering services and their integration with Hyper-V, but I could not quite work out how you would integrate them with NetApp. Specifically, if our main site (including the NetApp system) went offline, how would one automatically fail over the Hyper-V hosts to DR without manually breaking the mirror volumes and then manually mounting the iSCSI drives on the DR Hyper-V hosts?

In other simplified words:

1) How would one do live and automatic Hyper-V clustering between two physical sites where the iSCSI disks are stored and cross-synced on the NetApp SAN at both sites...

1a) ...and where the backup DR site with its Hyper-V hosts would need to take over automatically and remain up should the main site and its NetApp SAN go completely down?

Essentially a two-way cross-sync between NetApp volumes at both sites, with data on volumes at the main site taking precedence over newer data on volumes at the DR site.

Consistency issues are possible, I fear?

Regards, Elliott.

aborzenkov · ‎2018-05-10

SnapMirror is asynchronous, so any failover will mean data loss (and with a 100 Mb/s line it will mean significant data loss). This must be a conscious decision by the administrator (actually, such a decision should normally be taken at a higher business level) after evaluating the impact of the data loss. If you want automated failover, you need to ensure synchronous replication, and this is what MetroCluster does.
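To put a rough number on "significant", here is a back-of-the-envelope sketch of how long a 100 Mb/s link needs to ship a given amount of changed data. The change volumes and the 80% usable-bandwidth figure are illustrative assumptions, not measurements from this environment. Any changes that have not yet been transferred when the primary fails are what would be lost.

```python
def transfer_seconds(changed_gib: float,
                     link_mbps: float = 100.0,
                     efficiency: float = 0.8) -> float:
    """Seconds needed to replicate `changed_gib` GiB over the link.

    `efficiency` is an assumed fraction of the nominal line rate that is
    actually usable for replication (protocol overhead, other traffic).
    """
    bits = changed_gib * 1024**3 * 8
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits / usable_bits_per_second


# Hypothetical amounts of changed data between two mirror updates.
for gib in (1, 10, 50):
    minutes = transfer_seconds(gib) / 60
    print(f"{gib:>3} GiB of changes: about {minutes:.1f} min to replicate")
```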

That said, ONTAP does not provide any means to automate SnapMirror failover; you will need to use a host-based solution that monitors the systems and initiates it. I remember some support for SnapMirror in Veritas cluster, but this was long ago for 7-Mode, and I am not sure what the current state is.
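As a minimal sketch of what such a host-based monitor could look like: the class below probes the primary and invokes a callback after several consecutive failures. The class name, threshold and callback are hypothetical and not part of any NetApp or Veritas product. The callback is deliberately left open: given the data-loss point above, it might only page an administrator rather than break the mirror by itself.

```python
import socket
import time
from typing import Callable


def tcp_reachable(host: str, port: int = 3260, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection succeeds (3260 is the standard iSCSI port)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverMonitor:
    """Calls `on_failover` once after `threshold` consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool],
                 on_failover: Callable[[], None], threshold: int = 5) -> None:
        self.probe = probe
        self.on_failover = on_failover
        self.threshold = threshold
        self.failures = 0
        self.triggered = False

    def tick(self) -> None:
        """Run one probe and update the consecutive-failure counter."""
        if self.triggered:
            return
        if self.probe():
            self.failures = 0  # any success resets the count
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.triggered = True
            self.on_failover()

    def run(self, interval: float = 30.0) -> None:
        """Probe every `interval` seconds until failover has been triggered."""
        while not self.triggered:
            self.tick()
            time.sleep(interval)


# Example wiring (hypothetical address and runbook function):
# monitor = FailoverMonitor(lambda: tcp_reachable("10.10.1.50"), start_dr_runbook)
# monitor.run()
```

Requiring several consecutive failures keeps a brief network blip from triggering a failover. A probe run from the DR site still cannot tell a dead primary from a dead inter-site link, which is one more reason to keep a human in the loop.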