Re: FAS2650 refusing failover

Jelle · 2023-07-11

Hello,

I recently got a FAS2650 system. I'm not sure what the previous configuration was, but the two controllers just refuse to talk to each other.

It might have been part of an HA system, as it listed 4 nodes when showing failover, two of which it couldn't connect to.

The problem, of course, is that I don't have the admin password, so first I just wanted to reset it.

I was unable to, because I kept getting "waiting for giveback", but I found a guide here: https://community.netapp.com/t5/ONTAP-Discussions/Reset-admin-Password-on-Clustered-ONTAP/m-p/125508

So basically: take one controller offline, reset the password, and repeat for the other controller.

However, the second controller still booted up without the password reset, which is not what the guide says it should do.

Because I had changed the password on one controller, I wanted to complete the failover using "storage failover giveback", but it kept failing, telling me it couldn't communicate with the node. The node does show up when I list the nodes, but giveback still fails.

So basically I gave up on the whole thing. This controller is not the most important thing in the world, I have a few of them, and I'd like to learn NetApp. I have dealt with SANs from many different brands many times, just not a whole lot of NetApp.

I figured I would just run a wipeconfig, but again both controllers were just hanging at "waiting for giveback".

I took one controller out and ran wipeconfig on the remaining one, then swapped them and ran wipeconfig on the other as well. I also ran set-defaults at the LOADER prompt.

Now they are both wiped and both boot up as clean new controllers, but I keep getting errors relating to communication between the two controllers:

Jul 11 11:33:59 [localhost:fmmb.disk.notAccsble:notice]: All Local mailbox disks are inaccessible.
Restoring parity from NVRAM
Jul 11 11:33:59 [localhost:cf.fm.notkoverClusterDisable:error]: Failover monitor: takeover disabled (restart)
Replaying WAFL log
Jul 11 11:33:59 [localhost:wafl.transition.cp.completed:notice]: Transition CP with reason replay, 00000000 for replaying=1,1 unmounting=0,0 total=2,1 volumes with a total of total=96 incoming=8 dirty buffers took 66ms with longest CP phases being CP_P3A_VOLINFO=40, CP_P2_FLUSH=16, CP_P1_CLEAN=4 on aggregate aggr0.
Jul 11 11:34:00 [localhost:wafl.transition.cp.completed:notice]: Transition CP with reason none, 00000000 for replaying=0,0 unmounting=0,0 total=2,1 volumes with a total of total=83 incoming=0 dirty buffers took 63ms with longest CP phases being CP_P3A_VOLINFO=43, CP_P2_FLUSH=16, CP_P2_FINISH=0 on aggregate aggr0.
Jul 11 11:34:00 [localhost:cf.fsm.backupMailboxError:error]: Failover monitor: partner mailbox error detected.
Jul 11 11:34:00 [localhost:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of partner disabled (Controller Failover takeover disabled).
Jul 11 11:34:00 [localhost:cf.fm.notkoverBadMbox:notice]: Failover monitor: uninitialized backup mailbox data detected
Jul 11 11:34:00 [localhost:kern.syslog.msg:notice]: domain xing mode: off, domain xing interrupt: false
Jul 11 11:34:00 [localhost:extCache.rw.log.open:notice]: WAFL external cache log could not be opened: aggregate aggr0, log ec_tagstore.
Jul 11 11:34:00 [localhost:extCache.rw.canceled:notice]: WAFL external cache reconstruct was canceled.
Jul 11 11:34:00 [localhost:clam.invalid.config:error]: Local node (name=unknown, id=0) is in an invalid configuration for providing CLAM functionality. CLAM cannot determine the identity of the HA partner.
Jul 11 11:34:01 [localhost:extCache.rw.terminated:notice]: WAFL external cache warming process terminated.
Jul 11 11:34:01 [localhost:extCache.rw.replay.canceled:notice]: WAFL external cache replay canceled for aggregate aggr0: Aggregate came online after timeout.
wrote key file "/tmp/rndc.key"
Jul 11 11:35:00 [localhost:monitor.globalStatus.critical:EMERGENCY]: Controller failover partner unknown. Controller failover not possible.
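
To separate the failover-related errors from the rest of the boot output, console lines like the ones above can be filtered by severity. The following is a minimal Python sketch, assuming every event line follows the `[node:event.name:severity]: message` shape seen in this log. The names `parse_ems` and `serious` are made up for illustration and are not part of any NetApp tool.

```python
import re

# Matches lines shaped like:
#   "Jul 11 11:33:59 [node:event.name:severity]: message"
# This shape is inferred from the console output in this post (an assumption).
EMS_LINE = re.compile(
    r"^(?P<time>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<node>[^:\]]+):(?P<event>[^:\]]+):(?P<severity>[^\]]+)\]: "
    r"(?P<message>.*)$"
)


def parse_ems(text):
    """Return one dict per event line; lines that do not match are skipped."""
    events = []
    for line in text.splitlines():
        match = EMS_LINE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def serious(events):
    """Keep only events logged at error or emergency severity."""
    return [e for e in events if e["severity"].lower() in {"error", "emergency"}]


# A few lines copied from the boot output in this post.
SAMPLE = """\
Jul 11 11:33:59 [localhost:fmmb.disk.notAccsble:notice]: All Local mailbox disks are inaccessible.
Restoring parity from NVRAM
Jul 11 11:33:59 [localhost:cf.fm.notkoverClusterDisable:error]: Failover monitor: takeover disabled (restart)
Jul 11 11:34:00 [localhost:cf.fsm.backupMailboxError:error]: Failover monitor: partner mailbox error detected.
Jul 11 11:35:00 [localhost:monitor.globalStatus.critical:EMERGENCY]: Controller failover partner unknown. Controller failover not possible.
"""

if __name__ == "__main__":
    for event in serious(parse_ems(SAMPLE)):
        print(event["time"], event["event"], "-", event["message"])
```

On this sample the filter keeps the two `cf.*` failover-monitor errors and the final `monitor.globalStatus.critical` line. All three concern the HA partner or its mailbox.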

Of course both controllers start up with the setup wizard, but if they can't communicate with each other then I guess I'd just be setting up two individual controllers that fight over the drives in the box. I have seven 900GB drives in this one.

After resetting the password I was able to show all the licenses on the system, but I guess those are gone now. It's not the biggest of deals, as I have other controllers. The purpose here is simply to learn.

The controllers are running ONTAP 9.9.1P12.

So my questions are:

What did I do wrong?

Why won't they communicate?

If this was HA, should I have cabled the whole thing up first before starting?

Is it even possible to find the original configuration? How it was cabled before?

Did I in fact lose the licenses?

Can I sanitize the hard drives with this?

Thanks for reading!

NetApp_SR · 2023-07-11

If you wiped the configuration then most likely the license keys have been lost. Below is a video link that you may find helpful. Make sure the cluster connections are cabled.

Installing a FAS2552 system with no attached storage
https://www.youtube.com/watch?v=j-3owEDZ-EQ

How to sanitize all disks in ONTAP 9.6+
https://kb.netapp.com/onprem/ontap/hardware/How_to_sanitize_all_disks_in_ONTAP_9.6

Amador · 2023-08-17

If the current information and configuration stored in the cluster is not important or relevant, I think the best way to "start from scratch" is to follow the NetApp Knowledge Base article "How to repurpose an HA Pair with ONTAP 9".