ONTAP Hardware

Powering up a failed controller on its partner under hardware failure

lmunro_hug
14,382 Views

Hi,

Can anyone confirm how you would bring up a controller on its partner node if a graceful takeover did not take place? The scenario is as follows.

You have 2 controllers in an HA pair, controller_A and controller_B. You need to do some sort of maintenance or system move that requires both controllers to be shut down. While working on controller_B you damage the controller hardware so that it cannot POST when powered on. Controller_A works fine and boots successfully, so the question is: how do you bring up controller_A’s failed partner (controller_B) on controller_A? Does this happen automatically once controller_A stops receiving a heartbeat from controller_B?

If not, can you run partner and then boot_ontap from controller_A, or something similar?

Many Thanks
Luke

1 ACCEPTED SOLUTION

scottgelb
14,380 Views

I was going to try on the FAS3240AE in our lab, except it has UCS boot luns on both nodes... our other SEs wouldn't like me halting a node...but I was able to get an old FAS2020A with ONTAP 7.3 and try it out.  Andrey was right...forcetakeover does bring the partner node up.  See console below... I did a halt -f on the partner node...simulating a node that didn't come up.  Then cf forcetakeover worked.  I then booted node2 and it came up waiting for giveback and I was able to cf giveback.

I wasn't sure about this until testing it...glad we have this community to learn and relearn what we forget  

node1> cf status

node2 may be down, takeover disabled because of reason (partner halted in notakeover mode)

node1 has disabled takeover by node2 (interconnect error)

VIA Interconnect is down (link down).

node1> cf takeover

cf: takeover cannot be performed because of reason (partner halted in notakeover mode)

node1> cf takeover -f

cf: takeover cannot be performed because of reason (partner halted in notakeover mode)

node1> cf forcetakeover

cf forcetakeover may lead to data corruption; really force a takeover? y

cf: forcetakeover initiated by operator

node1(takeover)>
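
For anyone scripting around captured console text, here is a minimal sketch of the decision the transcript above illustrates. It is not a NetApp tool or API: the function name suggest_takeover_command and the "unknown" return value are hypothetical, and the only rule it encodes is what was observed in this single test on a FAS2020A with ONTAP 7.3. Keep in mind that cf forcetakeover prompts about possible data corruption, so the choice to run it stays with the operator.

```python
# Hypothetical helper, not part of Data ONTAP or any NetApp library.
# It only pattern-matches the console text quoted in this thread.

NOTAKEOVER_REASON = "partner halted in notakeover mode"


def suggest_takeover_command(cf_status_output: str) -> str:
    """Map captured 'cf status' text to the command that worked in this test.

    Returns "cf forcetakeover" when the status text contains the reason seen
    in the transcript, and "unknown" for anything else, because no other
    state was tested in this thread.
    """
    text = cf_status_output.lower()
    if NOTAKEOVER_REASON in text:
        # In the transcript, "cf takeover" and "cf takeover -f" were both
        # refused with this reason; only "cf forcetakeover" proceeded.
        return "cf forcetakeover"
    # Assumption: any other status is outside what was tested here.
    return "unknown"


if __name__ == "__main__":
    sample = (
        "node2 may be down, takeover disabled because of reason "
        "(partner halted in notakeover mode)\n"
        "node1 has disabled takeover by node2 (interconnect error)\n"
        "VIA Interconnect is down (link down).\n"
    )
    print(suggest_takeover_command(sample))
```

The sketch only suggests a command and never runs one, since a forced takeover is safe only if the partner's NVRAM was clean, for example after a clean shutdown of both nodes.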


14 REPLIES

scottgelb
14,355 Views

Really good question. I don't know of a way to take over a node that isn't already up. You would need to RMA the bad controller or use a spare controller to bring the partner up.

Does anyone know a workaround? There may be a diag method to boot the partner for takeover, but I haven't seen one. I had a use case for this when a controller failed after maintenance, as you described. We waited for the RMA controller to show up, and support had no other workaround back then.

aborzenkov
14,355 Views

“takeover -f” or “forcetakeover”?

Part of the problem is that it is not known whether NVRAM is clean, so the user is responsible for any potential data loss …

radek_kubka
14,355 Views

Interesting one - we had this discussion with Luke the other day in a pub over "Storage Beers".

So you are saying a forced takeover should do the trick? Does the mailbox on disk play any role in it?

Regards,

Radek

aborzenkov
14,355 Views

Well, I have not tried it myself and I do not have systems to test on. But I expect that if the previous state was a clean shutdown of both partners, it should work. I would appreciate it if someone with hardware available could test and report.

scottgelb
14,355 Views

I'll have to wrangle some hardware later. With the partner not running, there is nothing to take over unless it boots the partner in takeover mode, which I haven't seen. But thinking about it more... MetroCluster with a site failure does sort of do this if the partner is not accessible, with takeover -d working off the SyncMirror plex at the live site, so it might work for non-MetroCluster too.

lmunro_hug
14,355 Views

aborzenkov,

If the controllers were cleanly shut down, surely NVRAM would have been flushed to disk? In that case, would this still be an issue?

Luke

aborzenkov
14,355 Views

Yes, after clean shutdown it should be OK.

lmunro_hug
14,355 Views

Scott,

Many thanks for trying this out. Great news that you had a filer to try it on. Out of interest, has anyone seen this scenario described in any NetApp documentation?

Luke

scottgelb
8,501 Views

I have not seen this exact scenario documented, but there is something close for MetroCluster. I skimmed through the cluster guide (now the High Availability guide), and it documents the "cf forcetakeover -d" method for MetroCluster. It is similar in that the guide describes a down node, and -d uses the SyncMirror aggregates.

radek_kubka
8,500 Views

Awesome stuff - thanks for testing this! 🙂

scottgelb
8,499 Views

Where is the pub where you guys had the discussion? I could use a drink right now.

radek_kubka
8,499 Views

Scott, I would be more than happy to be your pub tour guide when you're on our side of the Pond! 😉

scottgelb
8,501 Views

Hopefully Dublin at Insight in November.

