ONTAP Discussions

The takeover cannot be initiated because the storage failover is disabled.

Mashk
553 Views

Apologies if I get the terminology all wrong. I've got two NetApp filers configured as an HA pair. It's ancient hardware; I migrated most of the servers hosted on it to Azure, so we didn't bother renewing the support contract. One of the disks died a while ago, but we had a spare which kicked in, and it ran without redundancy. As luck would have it, just before the migration project was finished, another disk died and that brought everything down. I'm assuming the dead disk brought it down, and that no other hardware has failed. I would have thought the second node would have taken over, but it didn't.

The status on the node says:

The takeover cannot be initiated because the storage failover is disabled.

 

Mashk_0-1731339999210.png

 

I'm guessing that the disk dying in the first node is what brought it down. So I thought that if I replaced the disk and assigned it to the downed node, it would come back to life. However, I can only assign the unowned disk to the node it can see; I can't get to node 1 at all.

So, I think my only option is to try to force a takeover to the second node in the cluster.


FASCLUS1::> cf status
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
FASCLUS1-01    ARBFASCLUS1-02 -        Unknown
FASCLUS1-02    ARBFASCLUS1-01 false    Waiting for FASCLUS1-01, Takeover
                                       is not possible: Partner node halted
                                       after disabling takeover


I think my only option would be to run cf forcetakeover from the 02 node.

 

Is this a good idea? Anything else I can try?
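For reference, if this is clustered ONTAP (the FASCLUS1::> prompt suggests it is), the cf commands above map onto the storage failover command set, roughly like this; the node name comes from my output above:

FASCLUS1::> storage failover show
FASCLUS1::> storage failover takeover -ofnode FASCLUS1-01 -option force

The second command is the rough equivalent of cf forcetakeover and carries the same data-loss risk, so I'd treat it as a last resort.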

6 REPLIES

GM
501 Views

Hi

 

While you could potentially force a failover or try to recover the aggregate (or whatever the cause is), the go-to approach would be to try to recover the failed node first.

 

To troubleshoot the down node, you need to access it using a console cable, or via SSH to the Service Processor (according to your screenshot, the IP is 192.168.99.205; I also suggest setting the SSH session to record output and increasing the console line buffer). The user should be any privileged local account you have (admin?). Once you're in, you can type "system log" to try to see what issues led up to the outage, and perhaps the reason failover was disabled. You can also type "system console" to access the live console, see the current error or boot-process issues, and continue troubleshooting from there. If you find the node sitting in the loader (very common - the prompt will say LOADER >), you can simply type boot_ontap.

From there, it's a bit dependent on what issues you see; if it's a disk/aggregate thing, you can maybe unfail it.
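As a rough sketch of that session (the SP IP is from the screenshot; the admin user is an assumption - use whatever privileged local account you have):

ssh admin@192.168.99.205
SP FASCLUS1-01> system log          (review for the shutdown reason)
SP FASCLUS1-01> system console      (attach to the live console)
LOADER> boot_ontap                  (only if the node is sitting at the loader)

Ctrl-D should drop you back from the console session to the SP prompt.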

 

If this attempt is not successful, you can then try to recover it from the surviving node. However, be warned that down this path you can easily end up taking that node down too, or end up with some data loss when fiddling with the disks.

Mashk
469 Views

Many thanks for the response. I'm guessing this is the culprit: "two data disks in RAID group "/aggr1/plex0/rg0" are broken. Halting system now." I did replace one of the disks, but I was unable to assign it to the node as it was showing as down. I'm guessing that when I assign the disk it will rebuild the RAID? Is there a way of doing this via the command line? Or can I just bring up the cluster in a broken state and then assign the replacement disk via the GUI?

GM
452 Views

Hi

 

Once again, do not go down the disk-replacement route. Many times disks aren't really failed, and the go-to is first to try to unfail as many of them as possible, and only replace them one by one once the system is back to full HA (not under takeover).
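A sketch of what the unfail path can look like, assuming clustered ONTAP (exact syntax varies by version; on older systems the equivalent nodeshell command is disk unfail):

FASCLUS1::> set -privilege advanced
FASCLUS1::*> storage disk show -broken
FASCLUS1::*> storage disk unfail -disk 1.0.5

The disk name 1.0.5 is just an example - use the names reported by the show command, and unfail one disk at a time.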

 

Note that NetApp has a clear warning that replacing disks in a multi-disk-failure scenario can lead to permanent data loss.

 

https://mysupport.netapp.com/site/article?lang=en&page=%2Fon-prem%2Fontap%2FOHW%2FOHW-KBs%2FAggregate_is_offline_after_a_multi-disk_panic&type=solutio...

 

GM_0-1731413466078.png

 

Mashk
450 Views

Apologies, this is all new to me. So I've put a replacement disk in there. I'm presuming I'd need to take the replacement disk out, put the failed disk back in, and then unfail it? How would I do this from the SP? And if that doesn't work, how do I get it to accept the new disk?

 

Thanks

GM
418 Views

Hi

 

So, I can only reiterate that you need to verify that this article applies to you (by logging in to the console and reviewing the log), and then progress with the instructions. I'm also inclined to suggest that before you take any disk out, you triple-check that it has not already been put into a RAID group (e.g., in your case, that it is still unowned).
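A hedged example of that check and, if the disk really is still unowned, the assignment (the disk name and owner here are illustrative):

FASCLUS1::> storage disk show -container-type unassigned
FASCLUS1::> storage disk assign -disk 1.0.5 -owner FASCLUS1-01

If the first command no longer lists the disk, it has already been assigned (and may be in use), so it should not be pulled.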

 

I'll also add that bay numbering can be confusing - you can (sometimes) use a set-led command to help identify the disk, or the ones around it.
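On clustered ONTAP that command looks roughly like the following (flags vary by version - check storage disk set-led ? on your system; the disk name is illustrative):

FASCLUS1::> storage disk set-led -disk 1.0.5 -action blink

This blinks the fault LED on that bay so you can confirm you're pulling the right disk.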

 

If you still struggle, I suggest either paying NetApp for one-off support, or contacting a 3rd party who can put the right contract/insurance in place before giving advice. I'm afraid I don't have that, nor enough visibility of the issue, to give direct recommendations.

Mashk
317 Views

Thanks GM. I unfailed the disk and it's now working. Everything is up and running, with, it appears, no data loss. I still have one broken disk, so I don't have a spare; I'm going to order a second one. Fingers crossed it keeps going until I can complete all the decommission and migration work. Once again, thank you for your invaluable help.
