Issues with takeover - 7Mode

DANTAYLOR80 · ‎2017-03-23

Hello,

Just throwing out an issue to the community to see if anyone else has seen this before. We have two FAS systems running 8.1.2.

When attempting to issue a takeover of a node (NODE-B), it fails and effectively just reboots the node being taken over, as it can't release disk reservations. There are lots of messages in the messages log; I have picked out the ones I think are most relevant and removed a lot of duplicated rows to make this post easier to read:

[NODE-A:disk.reserveFailed:error]: Disk reservation failed on 3a.25.10 CDB 0x5f:0001 - SCSI:illegal request (5 55 4)

WARNING!! fmdisk_reserve_disks unable to reserve any disks.

[NODE-B:raid.assim.rg.missingChild:error]: Aggregate partner:aggr0, rgobj_verify: RAID object 3 has only 8 valid children, expected 17.

[NODE-A:raid.label.io.readError:error]: Label read on Disk 3a.23.4 Shelf 23 Bay 4 [NETAPP X306_WKOJN02TSSM NA00] S/N [WD-WCC1P1173571] failed with storage error disk does not exist. The system will stop using the disk for I/O operations.

[NODE-A:raid.assim.mirror.noChild:ALERT]: Aggregate partner:aggr0, mirrorobj_verify: No operable plexes found.

[NODE-B:raid.fm.takeoverFail:error]: RAID takeover failed: Can't find partner root volume.
[NODE-A:monitor.globalStatus.ok:info]: This node is attempting to takeover NODE-B.

[NODE-A:cf.rsrc.takeoverFail:ALERT]: Failover monitor: takeover during raid failed; takeover cancelled

[NODE-A:cf.fm.takeoverFailed:error]: Failover monitor: takeover failed 'NODE-A_14:24:19_2017:02:11'

[NODE-A:cf.fm.givebackStarted:notice]: Failover monitor: giveback started.

[NODE-A:callhome.sfo.takeover.failed:ALERT]: Call home for CONTROLLER TAKEOVER FAILED

We have run aggr scrub manually on the affected aggregate, but this hasn't found an issue. It would be reassuring to know how we could verify that the RAID objects are complete and valid, and that everything the aggregate expects is present.

Any help appreciated.

andris · ‎2017-03-24

What FAS platform?

The first SCSI reservation issue is indicative of downrev disk FW.

8.1.2 is very old. The things I would do before testing again are:

- upgrade to 8.1.4P10 (at least).

- download disk and shelf firmware packages and install/update

- download latest compatible SP FW and update

- download and install the latest DQP file.

GidonMarcus · ‎2017-03-24

Hi

I somewhat doubt it's a software issue, and I would avoid upgrading ONTAP and the shelf modules until it's solved, as you might end up in an unstable state.

Is it cabled correctly? Can you download and run ConfigAdvisor and see if there are cabling/module issues?

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

AlexDawson · ‎2017-03-28

As the saying goes, por que no los dos? (why not both?)

ConfigAdvisor is a great tool to run as a first step.
Doing a disk/shelf firmware upgrade run is important too.
ONTAP upgrades on systems in inconsistent states are not recommended, except when they are, and then only through an Engineering escalation.

Can you halt both nodes in an outage? I assume the systems aren't under support anymore; otherwise I would suggest working through the case with the technical support centre.

xandervanegmond · ‎2017-04-16

One of the errors you are seeing is:

[NODE-B:raid.assim.rg.missingChild:error]: Aggregate partner:aggr0, rgobj_verify: RAID object 3 has only 8 valid children, expected 17.

What this means is that only 8 disks are seen out of the 17 expected for that aggregate.
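If there are many of these messages, the arithmetic can be pulled out of the log text with a few lines of Python. This is only a rough sketch: the regular expression is an assumption based on the wording of the single message quoted above, and `missing_children` is a made-up helper name, not anything that ships with ONTAP.

```python
import re

# Assumption: the message wording matches the one line quoted in this thread.
MISSING_CHILD = re.compile(
    r"RAID object (\d+) has only (\d+) valid children, expected (\d+)"
)

def missing_children(line):
    """Return (raid_object, missing_count) for a missingChild message, else None."""
    match = MISSING_CHILD.search(line)
    if match is None:
        return None
    raid_object, valid, expected = (int(g) for g in match.groups())
    return raid_object, expected - valid

# The message quoted above, copied from the original post.
sample = ("[NODE-B:raid.assim.rg.missingChild:error]: Aggregate partner:aggr0, "
          "rgobj_verify: RAID object 3 has only 8 valid children, expected 17.")
result = missing_children(sample)
print(result)
```

For the quoted line this reports RAID object 3 with 9 children unaccounted for (17 expected minus 8 seen).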

It seems that your partner node does not have access to all disks that it requires.

You should start by comparing the output of the sysconfig -a command on both nodes.

Most likely some cable is not connected correctly.
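If the two outputs are long, one way to do the comparison is to save each node's sysconfig -a output to a text file and diff the disk names found in them. The following is a minimal sketch, not a NetApp tool. It assumes disk names look like 3a.25.10 or 3a.23.4 (as in the log lines in this thread) and does not rely on any other detail of the output format. The function names and the sample strings are invented for illustration.

```python
import re

# Assumption: disk names have the form <adapter>.<shelf>.<bay>, e.g. "3a.25.10",
# as seen in the log lines quoted in this thread.
DISK_NAME = re.compile(r"\b\d+[a-z]\.\d+\.\d+\b")

def disk_names(text):
    """Return the set of disk-name-like tokens found in captured command output."""
    return set(DISK_NAME.findall(text))

def compare_disks(output_a, output_b):
    """Return (only_in_a, only_in_b) as sorted lists of disk names."""
    names_a, names_b = disk_names(output_a), disk_names(output_b)
    return sorted(names_a - names_b), sorted(names_b - names_a)

# Invented sample text standing in for the two saved captures;
# it is NOT real sysconfig output.
capture_node_a = "disk 3a.25.10\ndisk 3a.23.4\n"
capture_node_b = "disk 3a.25.10\n"

only_a, only_b = compare_disks(capture_node_a, capture_node_b)
print("seen only by first node:", only_a)
print("seen only by second node:", only_b)
```

Any disk that shows up in only one of the two lists is a candidate for a path or cabling problem and worth checking physically.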

/Xander