MetroCluster failover during power maintenance

alex_kcube · 2011-01-24

Hi All,

We have a MetroCluster implemented across two sites. The remote site will be shut down during power maintenance. Could somebody provide some guidelines for the takeover at the primary site and the giveback after the power maintenance?

Also, I have a question about MetroCluster: if the remote site is powered down for more than 24 hours, the data at the primary site will still be accessible to users. Will the remote site resync with the primary site automatically after the giveback command is issued from the primary site? If yes, how long does the resync take?

Thanks

Alex

thomas_glodde · 2011-01-24

Issue a "cf takeover" on the node that will stay up through the power outage. Then, at the site losing power, switch off the head first, then the disks.

Once maintenance is done, switch on the shelves, wait a minute, then switch on the head. After a minute or two, "cf status" will show that the node is ready for giveback. You could already give back at this point, but you should check "sysconfig -r" first to see whether the aggregates have resynced. As soon as you issue a takeover, the aggregate snapshot reserve is raised to the maximum; only if the changes no longer fit in that reserve after a certain amount of time do you need a full resync. A full resync takes a while, depending on the amount of data to resync.
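The order of operations above can be written down as a small checklist. This is only a sketch for printing a runbook, not something that talks to a filer: the `Step` class, `RUNBOOK` list and `render` function are hypothetical names, and the only console commands included are the ones quoted in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    where: str         # which site or node the action applies to
    action: str        # what the operator does
    command: str = ""  # console command quoted in the thread, if any


# Order follows the post above; wording of the actions is illustrative.
RUNBOOK = [
    Step("surviving node", "take over the partner", "cf takeover"),
    Step("site losing power", "power off the head"),
    Step("site losing power", "power off the disk shelves"),
    Step("site losing power", "after maintenance, power on the shelves and wait about a minute"),
    Step("site losing power", "power on the head"),
    Step("surviving node", "confirm the partner is ready for giveback", "cf status"),
    Step("surviving node", "confirm the aggregates have resynced", "sysconfig -r"),
    Step("surviving node", "give back", "cf giveback"),
]


def render(runbook):
    """Return the runbook as numbered, human-readable lines."""
    lines = []
    for number, step in enumerate(runbook, start=1):
        suffix = f" [run: {step.command}]" if step.command else ""
        lines.append(f"{number}. {step.where}: {step.action}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(render(RUNBOOK)))
```

Keeping the steps as data makes it easy to attach the list to a support case and have it reviewed before the maintenance window.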

Please get in touch with NetApp Global Support to get a valid action plan for your situation!

mheimberg · 2011-01-24

Hi Alex

Are there any services at the remote site (I will call it "site B") that users from site A need access to? Then those services need to be redundant too, don't they?

In such a case I would simply issue a "cf takeover" from site A, so node A will provide all the services from node B. It is then safe to power down node B. As soon as you switch off the shelves at site B, node A should switch to the remaining plexes at site A and serve data from the mirror plexes.

After maintenance, power up the shelves first; resyncing should occur automatically. How long it takes depends on the amount of data written. You may observe an increasing aggr snap reserve.
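Since the resync time depends on the amount of data written during the outage, a rough back-of-the-envelope estimate is possible. The sketch below is plain arithmetic, with a hypothetical function name, and the throughput figure in the usage line is a placeholder assumption, not a NetApp number; measure or ask support for a realistic rate on your inter-site links.

```python
def estimate_resync_hours(changed_gib: float, throughput_mib_s: float) -> float:
    """Estimate resync duration in hours.

    changed_gib      -- data written while the remote plexes were offline (GiB)
    throughput_mib_s -- assumed sustained resync rate (MiB/s), a guess to be
                        replaced with a measured value
    """
    if changed_gib < 0:
        raise ValueError("changed_gib must not be negative")
    if throughput_mib_s <= 0:
        raise ValueError("throughput_mib_s must be positive")
    seconds = changed_gib * 1024 / throughput_mib_s  # GiB -> MiB, then divide by rate
    return seconds / 3600


if __name__ == "__main__":
    # Example: 500 GiB changed, assumed 100 MiB/s sustained rate.
    print(f"{estimate_resync_hours(500, 100):.2f} hours")
```

If a full resync is needed instead of an incremental one, the same arithmetic applies with the total amount of data in the aggregate in place of the changed data.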

Personally, I would wait for the resync to finish before powering up node B, but I have no reason for that, just a feeling.

The giveback will then probably have to be forced with "cf giveback -f".

If there are no services depending on site B, then it is even easier. I would simply halt node B with "halt -f" (thus preventing a takeover during the halt) and shut down the site. When site A has no access to the mailbox disks of node B and no cluster interconnect, MetroCluster does nothing but serve data from its local node / services.
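The two cases in this post boil down to one question: does anything at site A depend on services hosted at site B? A minimal sketch of that decision follows. The function name and step wording are hypothetical, the commands are only those quoted in this thread, and the result is a list to review with support rather than something to execute.

```python
from typing import List


def plan_for_site_b_shutdown(site_a_needs_site_b_services: bool) -> List[str]:
    """Return the ordered steps suggested in the post for shutting down site B."""
    if site_a_needs_site_b_services:
        return [
            'node A: "cf takeover" so node A serves node B\'s services',
            "site B: power down node B",
            "site B: switch off the shelves (node A continues on the plexes at site A)",
            "site B: after maintenance, power up the shelves first",
            "wait for the automatic resync to finish (poster's preference)",
            "site B: power up node B",
            'node A: giveback, probably forced with "cf giveback -f"',
        ]
    return [
        'node B: "halt -f" to halt without triggering a takeover',
        "site B: shut down the site; node A keeps serving its own data",
    ]


if __name__ == "__main__":
    for step in plan_for_site_b_shutdown(site_a_needs_site_b_services=True):
        print("-", step)
```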

Does anyone agree or disagree with these remarks?

Mark

scottgelb · 2011-01-25

I would open a case or even consider a PS engagement to make the task seamless. A power shutdown lasting 24 hours could be considered a disaster scenario, and in a disaster the procedure is "cf forcetakeover -d", which forces the primary node to take over the remote site and also cut over to the plexes on the primary side. forcetakeover -d is brute force: it takes over the remote partner node and breaks the SyncMirror so that everything runs at the prod site. However, the remote site must be shut down and fenced off to prevent split brain, and the procedure might cause client issues. The recommendation already posted, to perform a regular takeover, then power off the remote site and let the mirror plex pick up after the remote site powers down, seems a less brute-force solution and won't cause split brain, since the remote site will stop serving data. That might be the better option, but I would run your plan past support to determine which procedure to use.