Powering up a failed controller on its partner under hardware failure

Hi,

Can anyone confirm how you would power up a controller on its partner node if a graceful takeover did not take place? The scenario is described below.

You have 2 controllers in an HA pair, controller_A and controller_B. You need to do some maintenance or a system move that requires both controllers to be shut down. While working on controller_B you damage the controller hardware so that it cannot POST when powered on. Controller_A works fine and boots successfully, so the question is: how do you power on controller_A’s partner (controller_B), which has failed? Does this happen automatically once controller_A stops receiving a heartbeat from controller_B?

If not, can you run partner and then boot_ontap from controller_A, or something similar?

Many Thanks
Luke

Re: Powering up a failed controller on its partner under hardware failure

Really good question. I don't know of a way to take over a node that isn't already up. You would need to RMA the bad controller or use a spare controller to bring the partner up.

Does anyone know a workaround? There may be a diag method to boot the partner for takeover, but I haven't seen one. I did have a use case for this when a controller failed after maintenance, as you described; we waited for the RMA controller to show up and had no other workaround from support back then.

Re: Powering up a failed controller on its partner under hardware failure

“takeover -f” or “forcetakeover”?

Part of the problem is that it is not known whether NVRAM is clean, so the user is responsible for any potential data loss …

Re: Powering up a failed controller on its partner under hardware failure

Interesting one - we had this discussion with Luke the other day in a pub over "Storage Beers".

So you are saying a forced takeover should do the trick? Does the mailbox on the disks play any role in it?

Regards,

Radek

Re: Powering up a failed controller on its partner under hardware failure

aborzenkov,

If the controllers were shut down cleanly, surely NVRAM would have been flushed to disk? In that case, would this still be an issue?

Luke

Re: Powering up a failed controller on its partner under hardware failure

Yes, after a clean shutdown it should be OK.

Re: Powering up a failed controller on its partner under hardware failure

Well, I have not tried it myself and I do not have systems to test on. But I expect that if the previous state was a clean shutdown of both partners, it should work. I would appreciate it if someone with hardware available could test and report back.

Re: Powering up a failed controller on its partner under hardware failure

I'll have to wrangle some hardware later. With the partner not running, there is nothing to take over unless it boots the partner in takeover mode, which I haven't seen. But thinking about it more... MetroCluster with a site failure does sort of do this: if the partner is not accessible, takeover -d works off the SyncMirror plex at the live site, so it might work for non-MetroCluster too.

Re: Powering up a failed controller on its partner under hardware failure

I was going to try on the FAS3240AE in our lab, except it has UCS boot LUNs on both nodes... our other SEs wouldn't like me halting a node... but I was able to get an old FAS2020A with ONTAP 7.3 and try it out. Andrey was right... forcetakeover does bring the partner node up. See the console output below... I did a halt -f on the partner node, simulating a node that didn't come up. Then cf forcetakeover worked. I then booted node2, it came up waiting for giveback, and I was able to cf giveback.

I wasn't sure about this until testing it... glad we have this community to learn and relearn what we forget.

node1> cf status
node2 may be down, takeover disabled because of reason (partner halted in notakeover mode)
node1 has disabled takeover by node2 (interconnect error)
VIA Interconnect is down (link down).

node1> cf takeover
cf: takeover cannot be performed because of reason (partner halted in notakeover mode)

node1> cf takeover -f
cf: takeover cannot be performed because of reason (partner halted in notakeover mode)

node1> cf forcetakeover
cf forcetakeover may lead to data corruption; really force a takeover? y
cf: forcetakeover initiated by operator

node1(takeover)>
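
The escalation order in the console output above (cf takeover, then cf takeover -f, then cf forcetakeover) could be modelled roughly as follows. This is only an illustrative sketch: the function names, the `run` callable and the `allow_force` flag are hypothetical and not part of any NetApp tooling, the filer here is a simulation that replays the responses from the transcript, and the interactive confirmation prompt that cf forcetakeover shows is not handled.

```python
# Hypothetical sketch: models the escalation order seen in the transcript.
# It does not talk to a real filer; `run` is a stand-in the caller supplies.

# Refusal text as it appears in the transcript.
REFUSAL = "takeover cannot be performed"

# Commands in the order they were tried in the transcript.
ESCALATION = ["cf takeover", "cf takeover -f", "cf forcetakeover"]


def first_accepted_command(run, allow_force=False):
    """Return the first command the filer accepts, or None.

    run: callable taking a command string and returning its console output.
    allow_force: must be True before `cf forcetakeover` is attempted, since
    the filer itself warns that it may lead to data corruption.
    """
    for command in ESCALATION:
        if command == "cf forcetakeover" and not allow_force:
            return None
        output = run(command)
        if REFUSAL not in output:
            return command
    return None


def simulated_filer(command):
    """Replay the transcript's responses (partner halted in notakeover mode)."""
    if command in ("cf takeover", "cf takeover -f"):
        return ("cf: takeover cannot be performed because of reason "
                "(partner halted in notakeover mode)")
    return "cf: forcetakeover initiated by operator"


if __name__ == "__main__":
    print(first_accepted_command(simulated_filer, allow_force=True))
```

Keeping the forced step behind an explicit flag mirrors the point made earlier in the thread: if NVRAM may not be clean, the operator has to accept the risk of data loss deliberately.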

Re: Powering up a failed controller on its partner under hardware failure

Scott,

Many thanks for trying this out; great news that you had a filer to try it on. Out of interest, has anyone seen this scenario described in any NetApp documentation?

Luke