2009-12-28 08:24 AM
In a MetroCluster configuration, both sites have a UPS and an emergency power supply. If a site loses power and its emergency power supply also fails, the controller and disk shelves can run on the UPS for a couple of minutes before the controller starts shutting down. Will the other controller in the MetroCluster take over the controller that is shutting down?
2011-05-22 11:35 PM
I would like to bump this thread.
My NetApp integrator recommends powering my 7 shelves through a UPS. In case of a power outage, the controller will shut down immediately and the shelves 5-10 minutes later, giving the other controller (the other one in our MetroCluster) time to notice the loss of its partner and perform the takeover by itself. What do you think about this?
2011-05-23 02:45 AM
Basically, you want to avoid having both the disks and the controller stop at the same time. This makes it easier (possible) for the OS to make decisions concerning a fail-over. When everything goes at the same time, you get a "split-brain" situation. This basically means that the partner disappears and the remaining controller has no information about what happened, so it waits for human intervention before taking over the partner's disks, to avoid corrupting data or ending up with two mirror copies that can't be reunited without loss of data.
The setup that latecoere suggests is closer to ideal. Make sure that either the controller or the disks fail first (with perhaps a 30-second head start) before the remaining element. This way, a fail-over will occur as expected, because the surviving head will still have enough information to make such decisions. Of the two, I guess I would prefer that the controller fail first.
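To make the timing argument concrete, here is a minimal Python sketch of the rule of thumb above. It is only a toy model of the reasoning in this post, not how ONTAP actually decides: the function name, the outcome strings, and the 30-second default (taken from the "perhaps 30 seconds" guess above, not from any documented threshold) are all assumptions.

```python
def expected_outcome(controller_lost_at: float,
                     disks_lost_at: float,
                     min_gap_s: float = 30.0) -> str:
    """Toy model: what the surviving controller is expected to do.

    Times are seconds since the power event (hypothetical inputs).
    min_gap_s is an assumed head start, not a documented ONTAP value.
    """
    # How far apart the two failures are, regardless of which came first.
    gap = abs(disks_lost_at - controller_lost_at)
    if gap >= min_gap_s:
        # One element failed clearly before the other, so the survivor
        # still had information to base a fail-over decision on.
        return "automatic takeover"
    # Everything vanished (almost) at once: the survivor cannot tell a
    # dead site from a severed link, so it waits for an operator.
    return "wait for operator"


# Controller drops at once, shelves ride the UPS for 5 minutes.
print(expected_outcome(controller_lost_at=0, disks_lost_at=300))
# Whole site goes dark in the same instant.
print(expected_outcome(controller_lost_at=0, disks_lost_at=0))
```

The sketch treats both orderings the same way (controller first or disks first), which matches the advice above that either element may go first as long as they do not go together.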
2011-05-23 03:56 AM
There is UPS support in Ontap, but only for APC models: https://kb.netapp.com/support/index?page=content&id=3011918
If your UPS is not supported, you can also write your own scripts to initiate a cluster failover based on an SNMP trap sent out by the UPS.
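As a rough illustration of such a script, the Python sketch below shows only the decision step. Everything in it is hypothetical: the `UpsEvent` fields stand in for whatever your UPS actually reports in its trap, the 300-second threshold is an arbitrary example, and the returned action label is a placeholder for whatever failover command you would run in your environment. Receiving and parsing the SNMP trap itself is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpsEvent:
    """Already-parsed UPS status (hypothetical fields, not a real MIB)."""
    on_battery: bool
    runtime_remaining_s: int


def plan_action(event: UpsEvent,
                failover_threshold_s: int = 300) -> Optional[str]:
    """Return a placeholder action label, or None if nothing should happen.

    failover_threshold_s is an assumed example value: act once the UPS
    reports this much battery runtime or less.
    """
    if not event.on_battery:
        # Mains power is present; nothing to do.
        return None
    if event.runtime_remaining_s > failover_threshold_s:
        # On battery, but with plenty of runtime left; keep waiting.
        return None
    # Battery is running low: trigger the failover while the site
    # is still up and can hand over cleanly.
    return "initiate_failover"


print(plan_action(UpsEvent(on_battery=True, runtime_remaining_s=120)))
```

Keeping the decision separate from the trap handling and from the command that performs the failover makes it easy to test the thresholds without touching a production filer.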
2011-06-08 07:33 AM
Basically, you want to avoid having both the disks and the controller stop at the same time. This makes it easier (possible) for the OS to make decisions concerning a fail-over. When everything goes at the same time, you get a "split-brain" situation.
Just some nit-picking: No, you won't get into a split-brain situation. You never get into that with NetApp. You get into a situation where the filer cannot be 100% sure that a takeover will NOT result in a split brain.
"Split brain" means you have two copies of your data active, one at each site. This *could* happen if the controller took over automatically, but since it won't (you have to run "cf forcetakeover -d"), you don't get into the split-brain situation.
2011-06-08 10:08 AM
Now, if you take this out of context, then yeah, most of what you say is correct.
The point is, if you have the total sudden loss of one site (controller, disks, interconnect), the situation is, for the surviving controller, indiscernible from a split-brain situation. The surviving controller does not have, nor can it get, any information about whether its partner is still running. Whether it is running or dead is not relevant here, because the surviving controller makes the same decisions as it would if the partner were still running and all interconnects were severed.
Since the goal of the UPS setup was to transition to a working failover, the advice was to get one element, either the controller or the disks, to fail first. Otherwise the surviving controller would follow its "split-brain" (software) procedure and no failover would occur.
You have to have a lot of spare time to dig into a 17-day-old post to "nit-pick". "Never" is, by the way, a very long time...