How does OCSR prevent split-brain scenarios?

timothyn · ‎2011-08-29

A user asked this question in response to Glenn's awesome introduction video, and I thought it was interesting enough for its own conversation. Well, that and you can't format comments in response to a video!

So how does OCSR prevent split-brain scenarios?

The short answer is that OCSR will only perform a failover when it detects a disaster and the node it is running on is part of the Windows cluster quorum.

For a more detailed answer, let's take a look at a couple of scenarios in the context of a sample configuration that is pretty typical of how we anticipate OCSR being deployed:

2 Sites
2 Windows Nodes per Site
1 File Share Witness (FSW) at a third site
1 MetroCluster controller at each site

Cluster quorum is achieved through votes. Each node gets one vote and the witness gets one vote, so this configuration has N = 5 votes in total. If a group of nodes is separated, the group must hold a majority of floor(N/2)+1 votes (3 in this example) to meet quorum; otherwise its cluster services are stopped.
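
The vote arithmetic above can be checked with a short sketch. This is only an illustration of the majority rule for the sample configuration; the function names are made up for this example and are not part of OCSR or WSFC.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority of the total votes: floor(N/2) + 1."""
    return total_votes // 2 + 1


def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """True if a partition holding `reachable_votes` keeps quorum."""
    return reachable_votes >= votes_needed(total_votes)


# Sample configuration: 4 Windows nodes plus 1 File Share Witness.
TOTAL_VOTES = 4 + 1

print(votes_needed(TOTAL_VOTES))   # 3
print(has_quorum(2, TOTAL_VOTES))  # False: two nodes, no witness
print(has_quorum(3, TOTAL_VOTES))  # True: two nodes plus the witness
```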

So if site A loses connectivity, it will only have 2 votes, and the cluster applications there will be shut down. Site B will still have access to the FSW and have 3 votes, so it stays up. In that case OCSR can safely fail over storage to site B, allowing WSFC to bring the failed applications online.

If site A and site B lose connectivity to each other and to the FSW, BOTH sites will lose quorum and the cluster will shut down completely. In that case an administrator can manually "force quorum" at site A to bring the clustered services up, allowing OCSR to fail over storage to site A.

This blog post has some good details on quorum models. Keep in mind that for multi-site clusters only "Node Majority" and "Node and File Share Majority" are supported by Microsoft (and thus OCSR).
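
To make the two scenarios concrete, here is a rough model of the rule from the short answer (a disaster is detected and the node is in quorum). It is a sketch under the assumptions of the sample configuration; `SiteView`, `may_fail_over_storage` and `quorum_forced` are hypothetical names, not OCSR's actual logic or API.

```python
from dataclasses import dataclass

NODES_PER_SITE = 2
WITNESS_VOTES = 1
TOTAL_VOTES = 2 * NODES_PER_SITE + WITNESS_VOTES  # 5 votes in total


@dataclass
class SiteView:
    """What one site can still reach (illustrative model only)."""
    sees_other_site: bool
    sees_witness: bool

    def votes(self) -> int:
        votes = NODES_PER_SITE  # a site always counts its own nodes
        if self.sees_other_site:
            votes += NODES_PER_SITE
        if self.sees_witness:
            votes += WITNESS_VOTES
        return votes

    def has_quorum(self) -> bool:
        return self.votes() >= TOTAL_VOTES // 2 + 1


def may_fail_over_storage(view: SiteView, disaster_detected: bool,
                          quorum_forced: bool = False) -> bool:
    """Storage failover needs a detected disaster AND quorum (or forced quorum)."""
    return disaster_detected and (view.has_quorum() or quorum_forced)


# Scenario 1: site A loses connectivity, site B still reaches the FSW.
site_a = SiteView(sees_other_site=False, sees_witness=False)
site_b = SiteView(sees_other_site=False, sees_witness=True)
print(may_fail_over_storage(site_a, disaster_detected=True))  # False
print(may_fail_over_storage(site_b, disaster_detected=True))  # True

# Scenario 2: both sites are cut off from each other and from the FSW.
isolated = SiteView(sees_other_site=False, sees_witness=False)
print(may_fail_over_storage(isolated, disaster_detected=True))  # False
# An administrator forces quorum at site A.
print(may_fail_over_storage(isolated, disaster_detected=True,
                            quorum_forced=True))                # True
```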

aborzenkov · ‎2011-08-29

It was me who asked this question. Unfortunately, I am still confused.

Let's start with your configuration above. We have a MetroCluster split brain: both FC links between the sites are broken. At the WSFC level everything is green: the network works, heartbeats are exchanged, and all 5 voters (4 nodes plus the FSW) can see each other. All applications continue to run (because both heads are still serving data, each from its own plexes).

Q1. How do OCSR/WSFC react in this case? Does any automatic failover happen?

1 hour later site A goes down. Site B has node majority together with FSW.

Q2. How do OCSR/WSFC react in this case? Does any automatic failover happen?

timothyn · ‎2011-08-29

Another great question!

Q1. How do OCSR/WSFC react in this case? Does any automatic failover happen?

No. Assuming the storage is accessible from the Windows nodes, OCSR will not initiate an automatic failover because both controllers are still online. If the storage is not available, then there isn't anything OCSR can do.

1 hour later site A goes down. Site B has node majority together with FSW.

Q2. How do OCSR/WSFC react in this case? Does any automatic failover happen?

In this case Windows will fail over applications to site B, and OCSR will fail over storage to the controller at site B. Note that OCSR will not automatically heal the MetroCluster once connectivity is re-established, so if there is any data that needs to be retrieved from site A, an administrator has an opportunity to do that before manually re-establishing the mirrors (possibly using the MetroCluster Recovery Assistant in OCPM).

Regards,

Eric

aborzenkov · ‎2011-08-29

1 hour later site A goes down. Site B has node majority together with FSW.

Q2. How do OCSR/WSFC react in this case? Does any automatic failover happen?

In this case Windows will fail over applications to site B, and OCSR will fail over storage to the controller at site B.

That is exactly what I do not buy. Effectively this means that the customer's applications have suddenly lost 1 hour's worth of data, without the customer even knowing it or having a chance to intervene. Having the data available at the other site is of little help once even one transaction based on stale information has taken place.

timothyn · ‎2011-08-29

It’s important to consider that the data is not lost unless the storage at site A is ultimately unrecoverable. The automated recovery doesn’t make that problem worse. But you are correct that in situations where any risk of running with stale data is unacceptable, automated recovery is probably not appropriate.

It is early in the product's history and we are already hard at work on 2.0, so what would you like to see in the scenario where you experience multiple sequential failures? There are some things that are outside our control, but perhaps we can alleviate the problem if not completely eliminate it.

Thanks for the feedback and discerning review!

Eric

aborzenkov · ‎2011-09-05

Sorry for the delay.

No, it is not about multiple failures. Please understand: as soon as you lose access to one site, you have no way to know what is going on there. Was it a storage, cluster node, or communication line failure? Are applications still running? How much data has been processed? Even if the Microsoft cluster will hopefully stop the nodes in the minority, split-brain detection is not instantaneous, and today even several seconds is quite a long time when a lot of requests are being processed.

So you have to assume that the described situation happens every time a split brain occurs. There is no "multiple" and "single" failure. There is just failure.

what would you like to see in the scenario where you experience multiple sequential failures?

Honestly? Nothing. There is no way to automatically respond to a site failure without risking data corruption. So here is what I'd expect in this case:

- an option to enable or disable the automatic behaviour

- this option should default to disabled

- the documentation should explain the possible consequences of enabling it

Look at EMC Cluster Enabler as an example; it does a similar thing for SRDF or MV. This is exactly what they do.
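
As a rough illustration of the requested behaviour, the sketch below gates automatic failover behind an option that is off by default. Everything here is hypothetical: `FailoverPolicy`, `automatic_failover` and `next_action` are invented for this example and do not correspond to any setting in OCSR or EMC Cluster Enabler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPolicy:
    # Hypothetical setting. Defaults to disabled; enabling it accepts the
    # risk of bringing applications online on stale data.
    automatic_failover: bool = False


def next_action(policy: FailoverPolicy, disaster_detected: bool,
                node_in_quorum: bool) -> str:
    """Decide what to do when the failover conditions are evaluated."""
    if not (disaster_detected and node_in_quorum):
        return "none"
    if policy.automatic_failover:
        return "fail_over_storage"
    # Default: leave the decision to a human.
    return "wait_for_administrator"


print(next_action(FailoverPolicy(), True, True))
# wait_for_administrator
print(next_action(FailoverPolicy(automatic_failover=True), True, True))
# fail_over_storage
```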