Solved: Re: Powering up a failed controller on its partner under hardware failure

lmunro_hug · ‎2012-05-30

Hi,

Can anyone confirm how you would bring up a controller on its partner node if a graceful takeover did not take place? The scenario is described below.

You have two controllers in an HA pair, controller_A and controller_B. You need to do some sort of maintenance or system move that requires both controllers to be shut down. While working on controller_B you damage the controller hardware so that it no longer POSTs when powered on. Controller_A works fine and boots successfully, so the question is: how do you bring up controller_A's failed partner (controller_B) on controller_A? Does this happen automatically once controller_A stops receiving a heartbeat from controller_B?

If not, can you run partner and then boot_ontap from controller_A, or something similar?

Many Thanks
Luke

scottgelb · ‎2012-05-31

Really good question. I don't know of a way to take over a node that isn't already up. You would need to RMA the bad controller or use a spare controller to bring the partner up.

Anyone know a workaround? There may be a diag method to boot the partner for takeover, but I haven't seen one. I did have a use case for this when a controller failed after maintenance, as you described. We waited for the RMA controller to show up and had no other workaround from support back then.

aborzenkov · ‎2012-05-31

“takeover -f” or “forcetakeover”?

Part of the problem is that it is not known whether NVRAM is clean, so the user is responsible for any potential data loss …

radek_kubka · ‎2012-05-31

Interesting one - we had this discussion with Luke the other day in a pub over "Storage Beers".

So you are saying a forced takeover should do the trick? Does the mailbox on the disks play any role in it?

Regards,

Radek

aborzenkov · ‎2012-05-31

Well, I have not tried it myself and I do not have systems to test on. But I expect that if the previous state was a clean shutdown of both partners, it should work. I would appreciate it if someone with hardware available could test and report.

scottgelb · ‎2012-05-31

I'll have to wrangle some hardware later. With the partner not running, there is nothing to take over unless the command boots the partner in takeover mode, which I haven't seen. But thinking about it more... MetroCluster with a site failure does sort of do this: if the partner is not accessible, takeover -d works off the SyncMirror plex at the live site, so it might work for non-MetroCluster too.

lmunro_hug · ‎2012-05-31

aborzenkov,

If the controllers were cleanly shut down, surely NVRAM would have been flushed to disk? In that case, would this still be an issue?

Luke

aborzenkov · ‎2012-05-31

Yes, after clean shutdown it should be OK.

scottgelb · ‎2012-05-31

I was going to try on the FAS3240AE in our lab, except it has UCS boot LUNs on both nodes... our other SEs wouldn't like me halting a node... but I was able to get an old FAS2020A with ONTAP 7.3 and try it out. Andrey was right... forcetakeover does bring the partner node up. See the console output below. I did a halt -f on the partner node to simulate a node that didn't come up. Plain cf takeover and cf takeover -f were both refused, but cf forcetakeover worked. I then booted node2, it came up waiting for giveback, and I was able to cf giveback.

I wasn't sure about this until testing it... glad we have this community to learn and relearn what we forget.

node1> cf status

node2 may be down, takeover disabled because of reason (partner halted in notakeover mode)

node1 has disabled takeover by node2 (interconnect error)

VIA Interconnect is down (link down).

node1> cf takeover

cf: takeover cannot be performed because of reason (partner halted in notakeover mode)

node1> cf takeover -f

cf: takeover cannot be performed because of reason (partner halted in notakeover mode)

node1> cf forcetakeover

cf forcetakeover may lead to data corruption; really force a takeover? y

cf: forcetakeover initiated by operator

node1(takeover)>
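
For anyone scripting around this, here is a minimal Python sketch of how the reason string in the cf status output above could be pulled out before deciding what to try next. It only parses the exact wording seen in this one test on ONTAP 7.3; the function names and the idea of keying on the "partner halted in notakeover mode" reason are my own assumptions for illustration, not a NetApp API, and other releases may word the output differently.

```python
import re

# Sample text copied from the `cf status` output in the console transcript above.
SAMPLE_STATUS = """\
node2 may be down, takeover disabled because of reason (partner halted in notakeover mode)
node1 has disabled takeover by node2 (interconnect error)
VIA Interconnect is down (link down).
"""

# Matches the "takeover disabled because of reason (...)" phrase seen in that output.
_REASON = re.compile(r"takeover disabled because of reason \(([^)]*)\)")


def takeover_disabled_reason(status_text):
    """Return the reason takeover is disabled, or None if the phrase is absent."""
    match = _REASON.search(status_text)
    return match.group(1) if match else None


def plain_takeover_refused_in_test(status_text):
    """Hypothetical helper: True when the reason is the one for which, in this
    single test, `cf takeover` and `cf takeover -f` were both refused."""
    return takeover_disabled_reason(status_text) == "partner halted in notakeover mode"


if __name__ == "__main__":
    print(takeover_disabled_reason(SAMPLE_STATUS))
    print(plain_takeover_refused_in_test(SAMPLE_STATUS))
```

The sketch deliberately does not issue any commands. As noted earlier in the thread, cf forcetakeover carries a data corruption warning, so that step should stay a manual decision.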

lmunro_hug · ‎2012-05-31

Scott,

Many thanks for trying this out; great news that you had a filer to try it on. Out of interest, has anyone seen this scenario described in any NetApp documentation?

Luke

scottgelb · ‎2012-05-31

I have not seen this specific scenario documented, but there is something close for MetroCluster. I skimmed through the cluster guide (now the High Availability guide), and it documents the "cf forcetakeover -d" method for MetroCluster. That is similar: the guide describes a down node, and -d uses the SyncMirror aggregates.

radek_kubka · ‎2012-05-31

Awesome stuff - thanks for testing this! 🙂

scottgelb · ‎2012-05-31

Where is the pub where you guys had the discussion? I could use a drink right now.

radek_kubka · ‎2012-05-31

Scott, I would be more than happy to be your pub tour guide when you're on our side of the Pond! 😉

scottgelb · ‎2012-05-31

Hopefully Dublin at Insight in November.
