ONTAP Hardware
Hi,
Can anyone confirm how you would bring up a controller on its partner node if a graceful takeover did not take place? The scenario is as follows.
You have two controllers in an HA pair, controller_A and controller_B. You need to do some maintenance or a system move that requires both controllers to be shut down. While working on controller_B you damage the controller hardware so that it no longer POSTs when powered on. Controller_A works fine and boots successfully, so the question is: how do you bring up controller_A's failed partner (controller_B) on controller_A? Does this happen automatically once controller_A stops receiving a heartbeat from controller_B?
If not, can you run partner and then boot_ontap from controller_A, or something similar?
Many Thanks
Luke
Really good question. I don't know of a way to take over a node that isn't already up. You would need to RMA the bad controller or use a spare controller to bring the partner up.
Anyone know a workaround? There may be a diag method to boot the partner for takeover, but I haven't seen one. I did have a use case for this when a controller failed after maintenance, as you described; we waited for the RMA controller to show up and had no other workaround from support back then.
“takeover -f” or “forcetakeover”?
Part of the problem is that it is not known whether NVRAM is clean, so the user is responsible for any potential data loss…
Interesting one - we had this discussion with Luke the other day in a pub over "Storage Beers" .
So you are saying a forced takeover should do the trick? Does the mailbox on the disks play any role in it?
Regards,
Radek
Well, I have not tried it myself and I do not have systems to test on. But I expect that if the previous state was a clean shutdown of both partners, it should work. I would appreciate it if someone with hardware available could test and report back.
I'll have to wrangle some hardware later. With the partner not running there is nothing to take over, unless the surviving node boots the partner in takeover mode, which I haven't seen. But thinking about it more... MetroCluster with a site failure does sort of do this: if the partner is not accessible, takeover -d works off the SyncMirror plex at the live site, so it might work for non-MetroCluster too.
aborzenkov,
If the controllers were cleanly shut down, surely NVRAM would have been flushed to disk? In that case, would this still be an issue?
Luke
Yes, after clean shutdown it should be OK.
I was going to try on the FAS3240AE in our lab, except it has UCS boot luns on both nodes... our other SEs wouldn't like me halting a node...but I was able to get an old FAS2020A with ONTAP 7.3 and try it out. Andrey was right...forcetakeover does bring the partner node up. See console below... I did a halt -f on the partner node...simulating a node that didn't come up. Then cf forcetakeover worked. I then booted node2 and it came up waiting for giveback and I was able to cf giveback.
I wasn't sure about this until testing it...glad we have this community to learn and relearn what we forget
node1> cf status
node2 may be down, takeover disabled because of reason (partner halted in notakeover mode)
node1 has disabled takeover by node2 (interconnect error)
VIA Interconnect is down (link down).
node1> cf takeover
cf: takeover cannot be performed because of reason (partner halted in notakeover mode)
node1> cf takeover -f
cf: takeover cannot be performed because of reason (partner halted in notakeover mode)
node1> cf forcetakeover
cf forcetakeover may lead to data corruption; really force a takeover? y
cf: forcetakeover initiated by operator
node1(takeover)>
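For anyone scripting around this, here is a minimal sketch in Python of how the cf status text above could be checked to spot the case where only cf forcetakeover got through. The helper name classify_cf_status and its return labels are hypothetical, and the message wording is taken only from the 7.3 console output in this post; other ONTAP releases may phrase it differently.

```python
import re

# Message wording copied from the console output in this thread (ONTAP 7.3).
# Assumption: other releases may word these messages differently.
REASON_PATTERN = re.compile(r"takeover disabled because of reason \(([^)]*)\)")
NOTAKEOVER_REASON = "partner halted in notakeover mode"


def classify_cf_status(output: str) -> str:
    """Hypothetical helper: label captured `cf status` text.

    Returns one of three illustrative labels; it does not run any
    ONTAP command and makes no decision on the operator's behalf.
    """
    match = REASON_PATTERN.search(output)
    if match is None:
        # No "takeover disabled" reason found in the text.
        return "no-block-detected"
    if match.group(1).strip() == NOTAKEOVER_REASON:
        # The case tested in this thread: cf takeover and cf takeover -f
        # were refused, and only cf forcetakeover proceeded.
        return "forcetakeover-needed"
    return "takeover-disabled-other"


SAMPLE = """node2 may be down, takeover disabled because of reason (partner halted in notakeover mode)
node1 has disabled takeover by node2 (interconnect error)
VIA Interconnect is down (link down)."""

if __name__ == "__main__":
    print(classify_cf_status(SAMPLE))
```

The sketch only classifies text. Whether a forced takeover is actually safe still depends on the NVRAM state discussed earlier in the thread, so that call stays with the operator.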
Scott,
Many thanks for trying this out; great news that you had a filer to try it on. Out of interest, has anyone seen this scenario described in any NetApp documentation?
Luke
I have not seen this exact scenario documented, but something close is covered for MetroCluster. I skimmed through the cluster guide (now the High Availability guide), and it documents the "cf forcetakeover -d" method for MetroCluster. That case is similar: the guide describes a down node, and -d uses the SyncMirror aggregates.
Awesome stuff - thanks for testing this! 🙂
Where is the pub you guys had the discussion at? I could use a drink right now
Scott, I would be more than happy to be your pub tour guide when you're on our side of the Pond! 😉
Hopefully Dublin at Insight in November.