ONTAP Hardware
I have a two-node FAS2554 running cDOT 8.3.1. The HA status shows everything is good. Interface groups, failover targets, etc. all check out. I have a root aggregate on each node (partitioned, as 8.3.1 is wont to do) and an aggregate for an SVM owned by node1. When I initiate a takeover of node1 by node2, everything works fine: node1 reboots normally and giveback works as expected.
However, when I do a takeover of node2 by node1, node2 hangs on reboot. Even the SP goes offline. The only thing I can do is yank the power cable at the datacenter. Only then does it boot into "waiting for giveback" mode and accept a giveback.
Has anyone ever seen this behavior? We even wiped the config and started over from scratch, but I get the same issue when I do failover testing from node2 to node1. I don't see any config or cabling issues, so I'm wondering if there's a problem w/ the node or the interconnect hardware.
Here's some info, but I'm happy to provide anything else that might help someone troubleshoot.
Thanks,
Steve
cluster1::> storage failover show-takeover
Node Node Status Aggregate Takeover Status
---------- --------------------- -------------- -------------------------------
node1
In takeover.
- -
Warning: Unable to list entries on node node2. RPC: Port mapper
failure - RPC: Timed out
cluster1::> system node run -node node1 -command storage show fault
Enclosure Status: critical
Channel: 0a
Shelf: 0
Shelf Type: DS4246
Product Serial Number: SHJSG1504000090
Module Type: IOM6E
Disk Elements:
Element Status Status Bytes Status Descriptions
0 [Bay 0]: OK 01,00,00,00
1 [Bay 1]: OK 01,01,00,00
2 [Bay 2]: OK 01,02,00,00
3 [Bay 3]: OK 01,03,00,00
4 [Bay 4]: OK 01,04,00,00
5 [Bay 5]: OK 01,05,00,00
6 [Bay 6]: OK 01,06,00,00
7 [Bay 7]: OK 01,07,00,00
8 [Bay 8]: OK 01,08,00,00
9 [Bay 9]: OK 01,09,00,00
10 [Bay 10]: OK 01,0A,00,00
11 [Bay 11]: OK 01,0B,00,00
12 [Bay 12]: OK 01,0C,00,00
13 [Bay 13]: OK 01,0D,00,00
14 [Bay 14]: OK 01,0E,00,00
15 [Bay 15]: OK 01,0F,00,00
16 [Bay 16]: OK 01,10,00,00
17 [Bay 17]: OK 01,11,00,00
18 [Bay 18]: OK 01,12,00,00
19 [Bay 19]: OK 01,13,00,00
20 [Bay 20]: OK 01,14,00,00
21 [Bay 21]: OK 01,15,00,00
22 [Bay 22]: OK 01,16,00,00
23 [Bay 23]: OK 01,17,00,00
Power Supplies:
Element Status Status Bytes Status Descriptions
1: OK 01,00,00,A0 RQSTED ON
2: OK 01,00,00,A0 RQSTED ON
3: OK 01,00,00,A0 RQSTED ON
4: OK 01,00,00,A0 RQSTED ON
Fans:
Element Status Status Bytes Status Descriptions
1: OK 01,03,AC,A7
2: OK 01,03,66,A6
3: OK 01,03,AC,A7
4: OK 01,03,66,A6
5: OK 01,03,AC,A7
6: OK 01,03,66,A6
7: OK 01,03,AC,A7
8: OK 01,03,66,A6
Temperature Sensors:
Element Status Status Bytes Status Descriptions
1: OK 01,00,21,00
2: OK 01,00,2B,00
3: OK 01,00,2A,00
4: OK 01,00,38,00
5: OK 01,00,2A,00
6: OK 01,00,36,00
7: OK 01,00,2A,00
8: OK 01,00,36,00
9: OK 01,00,2B,00
10: OK 01,00,39,00
11: OK 01,00,30,00
12: OK 01,00,30,00
Enclosure Electronics:
Element Status Status Bytes Status Descriptions
1 [IOM6E A] : OK 01,00,01,80 REPORT
2 [IOM6E B] : OK 01,00,00,80
OPS Panel:
Element Status Status Bytes Status Descriptions
1: OK 01,00,00,00
Enclosure:
Element Status Status Bytes Status Descriptions
1: OK 01,00,02,00 FAIL
Voltage Sensors:
Element Status Status Bytes Status Descriptions
1: OK 01,00,FF,01
2: OK 01,00,BA,04
3: OK 01,00,FF,01
4: OK 01,00,BE,04
5: OK 01,00,FF,01
6: OK 01,00,BE,04
7: OK 01,00,FF,01
8: OK 01,00,BE,04
Current Sensors:
Element Status Status Bytes Status Descriptions
1: OK 01,00,82,01
2: OK 01,00,B7,02
3: OK 01,00,B7,00
4: OK 01,00,03,02
5: OK 01,00,11,01
6: OK 01,00,E8,01
7: OK 01,00,21,01
8: OK 01,00,3E,02
SAS Connectors:
Element Status Status Bytes Status Descriptions
1: OK 01,3F,FF,00
2: OK 01,03,FF,00
3: OK 01,3F,FF,00
4: OK 01,03,FF,00
Vendor Unique Element 83-IOM6E: (SAS)
Element Status Status Bytes Status Descriptions
1 [IOM6E A] : OK 01,08,00,00 MASTER
2 [IOM6E B] : OK 01,00,00,00
Vendor Unique Element 85-IOM6E: (ACP)
Element Status Status Bytes Status Descriptions
1 [IOM6E A] : OK 01,00,00,00
2 [IOM6E B] : CRITICAL 02,00,00,40 FAIL
Vendor Unique Element 88-IOM6E: (PCM)
Element Status Status Bytes Status Descriptions
1: OK 01,01,00,00
2: OK 01,01,07,80 PC SHELF FAULT RQSTD
Vendor Unique Element 8B-IOM6E: (ETHERNET)
Element Status Status Bytes Status Descriptions
1 [IOM6E A] : OK 01,01,00,00
2 [IOM6E B] : OK 01,01,00,00
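The fault listing above is long, and the handful of rows that are not clean are easy to miss. A minimal Python sketch for picking them out of a saved copy of the output follows. The function name, the regular expressions, and the choice of FAIL/FAULT as keywords worth flagging are assumptions made for illustration, based only on the row layout visible in this post; they are not a NetApp-defined parser or an official interpretation of the status bytes.

```python
import re

# Element rows in the pasted output look like:
#   "2 [IOM6E B] : CRITICAL 02,00,00,40 FAIL"
#   "1: OK 01,00,00,A0 RQSTED ON"
ELEMENT_RE = re.compile(
    r"^(\d+)\s*(\[[^\]]*\])?\s*:\s*(\S+)\s+"
    r"([0-9A-Fa-f]{2}(?:,[0-9A-Fa-f]{2}){3})\s*(.*)$"
)
# Section headings look like "Enclosure:" or
# "Vendor Unique Element 85-IOM6E: (ACP)".
SECTION_RE = re.compile(r"^([^:]+):(\s*\([^)]*\))?$")

# Assumption: these description words are worth a second look.
SUSPECT_WORDS = {"FAIL", "FAULT"}


def find_suspect_elements(text):
    """Return (section, element, status, status_bytes, description) for
    rows whose status is not OK or whose description has a suspect word."""
    section = None
    hits = []
    for raw in text.splitlines():
        line = raw.strip()
        m = ELEMENT_RE.match(line)
        if m:
            number, label, status, status_bytes, desc = m.groups()
            words = set(desc.upper().split())
            if status.upper() != "OK" or words & SUSPECT_WORDS:
                name = f"{number} {label}" if label else number
                hits.append((section, name, status, status_bytes, desc))
            continue
        s = SECTION_RE.match(line)
        if s:
            # Remember the most recent heading so each hit has context.
            section = (s.group(1) + (s.group(2) or "")).strip()
    return hits


# A few rows copied from the output above.
SAMPLE = """\
Enclosure:
Element Status Status Bytes Status Descriptions
1: OK 01,00,02,00 FAIL
Vendor Unique Element 85-IOM6E: (ACP)
Element Status Status Bytes Status Descriptions
1 [IOM6E A] : OK 01,00,00,00
2 [IOM6E B] : CRITICAL 02,00,00,40 FAIL
Vendor Unique Element 88-IOM6E: (PCM)
Element Status Status Bytes Status Descriptions
1: OK 01,01,00,00
2: OK 01,01,07,80 PC SHELF FAULT RQSTD
"""

for hit in find_suspect_elements(SAMPLE):
    print(hit)
```

Run against the full listing, this should surface the Enclosure element, the ACP entry for IOM6E B, and the PCM "PC SHELF FAULT RQSTD" row. Whether any of those relate to the node2 hang is a separate question for support.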
Hi,
There can be various reasons for a giveback to fail. Refer to the KB article "Why is the giveback of a clustered Data ONTAP aggregate vetoed?"
Hello,
When node2 hangs, have you tried connecting a console cable to the management port and connecting to the CLI using PuTTY? I'm not 100% sure this will work with the SP offline; however, when I have done this in the past I have come across issues with the boot loader, which loads the ONTAP OS.
You can find more info on the boot loader commands in the NetApp Knowledge Base, but running boot_ontap should attempt to load the OS and throw an error right away if no OS image is visible.
All the best,
Ryan
Thanks for the responses. A directly-connected serial session to the SP fails, too. The output is frozen until the node is powered off and restarted.
As far as the giveback failing, there is nothing to give back to until the node gets rebooted. Once that happens, the giveback functions as expected.
At this point, if the linked document above does not help, I would recommend raising a support case with NetApp directly to determine the cause of this issue.
All the best,
Ryan
Thanks, Ryan. I've got a case open w/ NetApp. They've narrowed it down to either a hardware or a software issue.