Solved: FAS2552 issue after NVRAM replacement

ITLuke · ‎2024-03-26

Hi all,

After an NVRAM module ECC error/failure on one controller of a FAS2552 HA pair running 7-Mode, the module was replaced with one taken from another controller (which was supposedly shut down cleanly). The repaired controller now boots (its partner is in takeover) but will not reach "waiting for giveback", and it complains about the ownership of a couple of disks that are reserved by the HA partner:

Mar 26 17:11:06 [localhost:diskown.isEnabled:info]: software ownership has been enabled for this system

Reservation conflict found on this node's disks!

Local System ID: 537074349
Mar 26 17:11:06 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.00.22 that is owned by xxxxxxxxx and reserved by yyyyyyyyy.
WAFL CPLEDGER is enabled. Checklist = 0x7ff841ff

Press Ctrl-C for Maintenance menu to release disks.

add host 127.0.10.1: gateway 127.0.20.1
Mar 26 17:11:09 [localhost:wafl.memory.status:info]: 2004MB of memory is currently available for the WAFL file system.

NOTE: You have chosen to boot the diagnostics kernel.
Use the 'sldiag' command in order to run diagnostics on the system.

Mar 26 17:11:09 [localhost:dcs.framework.enabled:info]: The DCS framework is enabled on this node.

The system has booted in maintenance mode allowing the
Mar 26 17:11:09 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.00.12 that is owned by xxxxxxxxx and reserved by yyyyyyyyy.

following operations to be performed:
Mar 26 17:11:09 [localhost:snmp.link.up:info]: Interface 8 is up

Mar 26 17:11:09 [localhost:netif.linkUp:info]: Ethernet e0P: Link up.

? acorn
Mar 26 17:11:09 [localhost:snmp.link.up:info]: Interface 1 is up

acpadmin aggr
Mar 26 17:11:09 [localhost:netif.linkUp:info]: Ethernet e0a: Link up.

cna_flash disk
Mar 26 17:11:09 [localhost:snmp.link.up:info]: Interface 2 is up

disk_list disk_mung
Mar 26 17:11:09 [localhost:netif.linkUp:info]: Ethernet e0b: Link up.

disk_qual disk_shelf
diskcopy disktest
dumpblock environment
fcadmin fcstat
fctest fru_led
ha-config halt
help ifconfig
key_manager led_off
led_on nv8
raid_config sasadmin
sasstat scsi
sesdiag sldiag
storage stsb
sysconfig systemshell
ucadmin version
vmservices vol
vol_db vsa
xortest

Type "help <command>" for more details.

In a High Availablity configuration, you MUST ensure that the
partner node is (and remains) down, or that takeover is manually
disabled on the partner node, because High Availability
software is not started or fully enabled in Maintenance mode.

FAILURE TO DO SO CAN RESULT IN YOUR FILESYSTEMS BEING DESTROYED

NOTE: It is okay to use 'show/status' sub-commands such as
'disk show or aggr status' in Maintenance mode while the partner is up
Mar 26 17:11:14 [localhost:snmp.link.down:info]: Interface 7 is down.

Mar 26 17:11:14 [localhost:netif.linkDown:info]: Ethernet e0M: Link down, check cable.

Continue with boot?

After continuing the boot, it enters maintenance mode. Checking the various statuses, I get this:

*> aggr status
Aggr State Status Options
aggr0 online raid_dp, aggr root
degraded
64-bit

*> aggr show
Mar 26 17:59:18 [localhost:fmmb.current.lock.disk:info]: Disk ?.? is a local HA mailbox disk.

aggr: No such command "show".
Mar 26 17:59:18 [localhost:fmmb.current.lock.disk:info]: Disk 0a.00.2 is a local HA mailbox disk.

The following commands are available; for more information
Mar 26 17:59:18 [localhost:fmmb.instStat.change:info]: missing lock disks, possibly stale mailbox instance on local side.

type "aggr help <command>"
Mar 26 17:59:18 [localhost:raid.mirror.vote.versionZero:debug]: raid: mirror info empty

clear_rpbits options rename snaprestore_cancel
Mar 26 17:59:18 [localhost:coredump.host.spare.none:info]: No sparecore disk was found for host 0.

After halting and rebooting with boot_diags, I get this:

LOADER-A> boot_diags
Loading X86_64/freebsd/image2/kernel:0x100000/9578776 0xa22918/4044416 Entry at 0x8016e880
Loading X86_64/freebsd/image2/platform.ko:0xdfe000/786856 0xf9bea0/724152 0xebe1c0/45064 0x104cb58/49752 0xec91c8/110791 0xee428f/80654 0xef7da0/172160 0x1058db0/195312 0xf21e20/16 0xf21e30/2448 0x10888a0/7344 0xf22800/0 0xf22800/344 0x108a550/1032 0xf22958/1952 0x108a958/5856 0xf230f8/1648 0x108c038/4944 0xf23768/240 0x108d388/720 0xf23860/448 0xf5e860/14942 0xf9bda2/253 0xf622c0/136824 0xf83938/99434
Starting program at 0x8016e880
NetApp Data ONTAP 8.2.4P6 7-Mode
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Copyright (C) 1992-2017 NetApp.
All rights reserved.
md1.uzip: 39168 x 16384 blocks
md2.uzip: 16640 x 16384 blocks
*******************************
* *
* Press Ctrl-C for Boot Menu. *
* *
*******************************
Error burning Mellanox sinai chipset firmware.
^Cqla_init_hw: CRBinit running ok: 8c633f
NIC FW version in flash: 5.4.9
Mar 26 17:53:56 [localhost:sasmon.disable.module:info]: SAS domain is not monitoring transient errors.

Mar 26 17:53:58 [localhost:cf.nm.nicTransitionUp:info]: HA interconnect: Link up on NIC 0.

qla_init_hw: CRBinit running ok: 8c633f
NIC FW version bundled: 5.4.56
qla_init_hw: CRBinit running ok: 8c633f
NIC FW version in flash: 5.4.9
Mar 26 17:54:03 [localhost:cf.rv.flush.handleExchange:info]: HA interconnect: Flushing is active.

qla_init_hw: CRBinit running ok: 8c633f
NIC FW version bundled: 5.4.56
Mar 26 17:54:05 [localhost:netif.linkDown:info]: Ethernet Wrench Port: Link down, check cable.

Mar 26 17:54:07 [localhost:snmp.link.down:info]: Interface 3 is down.

Mar 26 17:54:07 [localhost:netif.linkDown:info]: Ethernet e0c: Link down, check cable.

Mar 26 17:54:07 [localhost:snmp.link.down:info]: Interface 4 is down.

Mar 26 17:54:07 [localhost:netif.linkDown:info]: Ethernet e0d: Link down, check cable.

Mar 26 17:54:08 [localhost:snmp.link.down:info]: Interface 5 is down.

Mar 26 17:54:08 [localhost:netif.linkDown:info]: Ethernet e0e: Link down, check cable.

Mar 26 17:54:08 [localhost:snmp.link.down:info]: Interface 6 is down.

Mar 26 17:54:08 [localhost:netif.linkDown:info]: Ethernet e0f: Link down, check cable.

Mar 26 17:54:08 [localhost:diskown.isEnabled:info]: software ownership has been enabled for this system

Reservation conflict found on this node's disks!

Local System ID: xxxxxxxxx
Mar 26 17:54:08 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.00.0 that is owned by xxxxxxxxx and reserved by yyyyyyyyy.
WAFL CPLEDGER is enabled. Checklist = 0x7ff841ff

Press Ctrl-C for Maintenance menu to release disks.

add host 127.0.10.1: gateway 127.0.20.1
Mar 26 17:54:11 [localhost:wafl.memory.status:info]: 2004MB of memory is currently available for the WAFL file system.

NOTE: You have chosen to boot the diagnostics kernel.
Use the 'sldiag' command in order to run diagnostics on the system.

Mar 26 17:54:11 [localhost:dcs.framework.enabled:info]: The DCS framework is enabled on this node.

The system has booted in maintenance mode allowing the
following operations to be performed:

? acorn
acpadmin aggr
Mar 26 17:54:12 [localhost:snmp.link.up:info]: Interface 8 is up

cna_flash disk
Mar 26 17:54:12 [localhost:netif.linkUp:info]: Ethernet e0P: Link up.

disk_list disk_mung
Mar 26 17:54:12 [localhost:snmp.link.up:info]: Interface 1 is up

disk_qual disk_shelf
Mar 26 17:54:12 [localhost:netif.linkUp:info]: Ethernet e0a: Link up.

diskcopy disktest
Mar 26 17:54:12 [localhost:snmp.link.up:info]: Interface 2 is up

dumpblock environment
Mar 26 17:54:12 [localhost:netif.linkUp:info]: Ethernet e0b: Link up.

fcadmin fcstat
fctest fru_led
ha-config halt
help ifconfig
key_manager led_off
led_on nv8
raid_config sasadmin
sasstat scsi
sesdiag sldiag
storage stsb
sysconfig systemshell
ucadmin version
vmservices vol
vol_db vsa
xortest

Type "help <command>" for more details.

In a High Availablity configuration, you MUST ensure that the
partner node is (and remains) down, or that takeover is manually
disabled on the partner node, because High Availability
software is not started or fully enabled in Maintenance mode.

FAILURE TO DO SO CAN RESULT IN YOUR FILESYSTEMS BEING DESTROYED

NOTE: It is okay to use 'show/status' sub-commands such as
'disk show or aggr status' in Maintenance mode while the partner is up
Mar 26 17:54:17 [localhost:snmp.link.down:info]: Interface 7 is down.

Mar 26 17:54:17 [localhost:netif.linkDown:info]: Ethernet e0M: Link down, check cable.

Continue with boot?

So it now complains about a different reserved disk, 0a.00.0, and no longer about 0a.00.22 and 0a.00.12. It also seems that no mailbox (mbox) disk is defined, which I suppose is not good. The partner is in takeover and is working fine at the moment. Issuing cf giveback doesn't work because the down controller is not yet waiting for giveback (I still need to get there, hopefully).
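
To compare which disks are flagged from one boot to the next, the cf.fmns.skipped.disk notices can be pulled out of a saved console capture. The following is only a rough Python sketch: the function name skipped_disks is made up for illustration, and the pattern is assumed from the log lines pasted above, so it may need adjusting for other ONTAP versions.

```python
import re

# Pattern assumed from the cf.fmns.skipped.disk lines quoted in this post.
SKIPPED = re.compile(
    r"cf\.fmns\.skipped\.disk:notice\]: .*?skipped the disk (\S+) "
    r"that is owned by (\S+) and reserved by (\S+?)\.?$"
)


def skipped_disks(console_text):
    """Return unique (disk, owner, reserver) tuples in order of first appearance."""
    seen = []
    for line in console_text.splitlines():
        match = SKIPPED.search(line.rstrip())
        if match and match.groups() not in seen:
            seen.append(match.groups())
    return seen


if __name__ == "__main__":
    sample = (
        'Mar 26 17:11:06 [localhost:cf.fmns.skipped.disk:notice]: While releasing '
        'the reservations in "Waiting For Giveback" state Failover Monitor Node '
        'State(fmns) module skipped the disk 0a.00.22 that is owned by xxxxxxxxx '
        'and reserved by yyyyyyyyy.\n'
    )
    for disk, owner, reserver in skipped_disks(sample):
        print(disk, "owned by", owner, "reserved by", reserver)
```

Running it over the captures from two different boots and comparing the lists makes it easier to see whether the flagged disks really change each time.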

Any advice is appreciated - thanks!

Gmox · ‎2024-03-27

Hello,

On the down controller, in maintenance mode, I would try destroying the local mailbox and then booting:

*> mailbox destroy local

*> halt

LOADER> boot_ontap

Then check how the boot goes.

Gmox · ‎2024-03-27

If that does not work, press Ctrl-C for the boot menu and select option 6 (Update flash from backup config).

ITLuke · ‎2024-03-27

Hi, thanks for the suggestions. I did try pressing Ctrl+C at boot a couple of times, but I was never able to get into that menu; it would always proceed as per the log. Also, would the lack of a mailbox disk explain the apparently random disk reservation/ownership warnings at every boot of the controller?

Anyhow, I will try and let you know!

Gmox · ‎2024-03-27

Hi,

Sometimes I need to use Tera Term instead of PuTTY to send the Ctrl-C (I don't know why).

When you replace an NVRAM module, you should get a "system ID mismatch" message and need to override it (answering yes).

Is that something you have done?

ITLuke · ‎2024-03-27

I read about that in a PDF I found describing the procedure, but I never got that warning, only the ones you see in the log (the prompt about keeping the HA partner down to avoid corruption, where it is safe to answer Y). I think that is because only the NVRAM was replaced on the faulty controller, whereas the PDF describes transferring the NVRAM, RAM and boot media, which did not happen in my case. The system ID is tied to the hardware/mainboard, and since that was not changed and the boot media stayed the same, I figure the warning is not triggered. At least, this is my understanding.

Gmox · ‎2024-03-27

Yes, I had this issue once, and after using option 6 it rebooted and asked for the system ID override.

Let me know if it worked.

ITLuke · ‎2024-03-27

OK, so I will try Tera Term next time instead of PuTTY and see whether it catches the Ctrl+C when requested. I do see it appear on the terminal (in fact I press it several times), but it's as if the controller ignores it. Maybe the control code is not correct.
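
For reference, in ASCII a Ctrl+letter combination is conventionally sent as a single byte obtained by keeping the low five bits of the letter, so Ctrl-C should arrive as 0x03 (ETX). The Python sketch below only illustrates that mapping; ctrl_byte is a made-up helper name, and whether a given terminal client or console server passes the byte through unchanged is a separate question that this does not answer.

```python
def ctrl_byte(letter):
    """Return the single byte conventionally sent for Ctrl+<letter> in ASCII."""
    if len(letter) != 1 or not letter.isascii() or not letter.isalpha():
        raise ValueError("expected exactly one ASCII letter")
    # Masking with 0x1F keeps the low five bits: 'C' (0x43) becomes 0x03.
    return bytes([ord(letter.upper()) & 0x1F])


if __name__ == "__main__":
    print(ctrl_byte("c"))  # prints b'\x03'
```

If a serial capture on the console side shows something other than 0x03 when Ctrl+C is pressed, that would point at the client or an intermediate device rather than the controller.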

ITLuke · ‎2024-04-01

Hi Gmox, I connected remotely using another serial client (a good old Cyclades serial terminal, and from that through PuTTY) and pressed Ctrl+C after issuing boot_ontap. It confirmed and entered the maintenance menu. However, I did not do anything there and just chose option 1 to boot normally. During the boot it did mention a disk not being released, but it managed to continue by itself, finding the relevant mailbox disks and recovering from the takeover. Here is the relevant part:

...

Reservation conflict found on this node's disks!

Local System ID: xxxxxxxxx
Apr 01 14:56:36 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.00.0 that is owned by xxxxxxxxx and reserved by yyyyyyyyy.
WAFL CPLEDGER is enabled. Checklist = 0x7ff841ff

Press Ctrl-C for Maintenance menu to release disks.

add host 127.0.10.1: gateway 127.0.20.1
Apr 01 14:56:38 [localhost:wafl.memory.status:info]: 11055MB of memory is currently available for the WAFL file system.

Apr 01 14:56:38 [localhost:dcs.framework.enabled:info]: The DCS framework is enabled on this node.

Apr 01 14:56:39 [localhost:snmp.link.up:info]: Interface 8 is up

Apr 01 14:56:39 [localhost:netif.linkUp:info]: Ethernet e0P: Link up.

Apr 01 14:56:39 [localhost:snmp.link.up:info]: Interface 1 is up

Apr 01 14:56:39 [localhost:netif.linkUp:info]: Ethernet e0a: Link up.

Apr 01 14:56:39 [localhost:snmp.link.up:info]: Interface 2 is up

Apr 01 14:56:39 [localhost:netif.linkUp:info]: Ethernet e0b: Link up.

Disk reservations have been released

Apr 01 14:56:41 [localhost:cf.nm.nicReset:warning]: HA interconnect: Initiating soft reset on card 0 due to rendezvous reset.

Apr 01 14:56:41 [localhost:cf.rv.notConnected:error]: HA interconnect: Connection for 'cfo_rv' failed.

Apr 01 14:56:41 [localhost:fmmb.current.lock.disk:info]: Disk 0a.00.0 is a local HA mailbox disk.

Apr 01 14:56:41 [localhost:fmmb.current.lock.disk:info]: Disk 0a.00.2 is a local HA mailbox disk.

Apr 01 14:56:41 [localhost:fmmb.instStat.change:info]: normal mailbox instance on local side.

Apr 01 14:56:41 [localhost:fmmb.current.lock.disk:info]: Disk 0b.00.1 is a partner HA mailbox disk.

Apr 01 14:56:41 [localhost:fmmb.current.lock.disk:info]: Disk 0b.00.3 is a partner HA mailbox disk.

Apr 01 14:56:41 [localhost:fmmb.instStat.change:info]: normal mailbox instance on partner side.

Apr 01 14:56:41 [localhost:cf.fm.partner:info]: Failover monitor: partner 'netapp02'

Apr 01 14:56:44 [localhost:snmp.link.down:info]: Interface 7 is down.

Apr 01 14:56:44 [localhost:netif.linkDown:info]: Ethernet e0M: Link down, check cable.

Waiting for cluster network link..(Press Ctrl-C to abort wait)Apr 01 14:57:06 [localhost:zapi.sf.up.ready:info]: ZAPI: system node stable after startup.

Received cluster network link status, proceeding...Waiting for giveback...(Press Ctrl-C to abort wait)Continuing boot...

Apr 01 14:57:14 [localhost:coredump.host.spare.none:info]: No sparecore disk was found for host 0.

Apr 01 14:57:14 [localhost:raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.

Apr 01 14:57:14 [localhost:raid.stripe.replay.summary:info]: Replayed 0 stripes.

Apr 01 14:57:14 [localhost:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID 'ba28b404-5156-4b7f-b9d2-8a7b3a29da40' was built in 0 msec, after scanning 0 inodes and restarting -1 times with a final result of starting.

Apr 01 14:57:15 [localhost:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID 'ba28b404-5156-4b7f-b9d2-8a7b3a29da40' was built in 49 msec, after scanning 69 inodes and restarting 14 times with a final result of success.

Apr 01 14:57:15 [localhost:cf.fm.launch:info]: Launching failover monitor

Apr 01 14:57:15 [localhost:cf.fm.partner:info]: Failover monitor: partner 'netapp02'

Apr 01 14:57:15 [localhost:cf.fm.discardNvram:notice]: Failover monitor: node was previously taken over, nvram may be discarded

Mon Apr 1 12:57:15 GMT [localhost:rc:notice]: The system was down for 608299 seconds
Apr 01 14:57:15 [localhost:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of netapp02 disabled (interconnect error).

cf.takeover.on_panic is already on
Apr 01 14:57:15 [netapp01:cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of by netapp02 disabled (interconnect error).

The boot continued and giveback was completed successfully. I am not sure why this time, without my doing anything, it freed the disk reservations, discarded the NVRAM (as it should have) and proceeded to giveback automatically. Maybe it was simply because I pressed Ctrl+C and then exited the menu, but that doesn't really add up. Possibly it was using boot_ontap instead of boot_primary, but that shouldn't make a difference either. Anyhow, it's up and running - cheers!
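
For anyone comparing a failed boot with a good one, the mailbox state can be summarised from the fmmb messages in a saved console capture. This is a minimal Python sketch, assuming the message wording seen in the logs in this thread; mailbox_summary is a made-up name and the patterns may not match other releases.

```python
import re

# Patterns assumed from the fmmb.* lines pasted in this thread.
LOCK = re.compile(
    r"fmmb\.current\.lock\.disk:info\]: Disk (\S+) is a (local|partner) HA mailbox disk"
)
STATE = re.compile(r"fmmb\.instStat\.change:info\]: (.+?) on (local|partner) side")


def mailbox_summary(console_text):
    """Collect mailbox disks and the last reported instance state per side."""
    summary = {side: {"disks": [], "state": None} for side in ("local", "partner")}
    for line in console_text.splitlines():
        lock = LOCK.search(line)
        if lock:
            disk, side = lock.groups()
            if disk not in summary[side]["disks"]:
                summary[side]["disks"].append(disk)
            continue
        state = STATE.search(line)
        if state:
            text, side = state.groups()
            summary[side]["state"] = text
    return summary


if __name__ == "__main__":
    good_boot = (
        "Apr 01 14:56:41 [localhost:fmmb.current.lock.disk:info]: "
        "Disk 0a.00.0 is a local HA mailbox disk.\n"
        "Apr 01 14:56:41 [localhost:fmmb.instStat.change:info]: "
        "normal mailbox instance on local side.\n"
    )
    print(mailbox_summary(good_boot))
```

On the capture from the failed boot this would be expected to show a "?.?" disk and a "missing lock disks, possibly stale mailbox instance" state on the local side, against two named disks and a "normal mailbox instance" state per side on the boot that succeeded.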