Failover issue on FAS2554

I have a two-node FAS2554 running cDOT 8.3.1. The HA status shows everything is good. Interface groups, failover targets, etc. all check out. I have a root aggregate on each node (partitioned, as 8.3.1 is wont to do) and an aggregate for an SVM owned by node1. When I initiate a takeover of node1 by node2, everything works fine: node1 reboots normally and giveback works as expected.

However, when I do a takeover of node2 by node1, node2 hangs on reboot. Even the SP goes offline. The only thing I can do is yank the power cable at the datacenter. After that, it finally boots into "Waiting for giveback" mode and accepts a giveback.

 

Has anyone ever seen this behavior? We even wiped the config and started over from scratch, but we get the same issue when I do failover testing from node2 to node1. I don't see any config or cabling issues, so I'm wondering if there's a problem with the node or the interconnect hardware.

 

Here's some info, but I'm happy to provide anything else that might help someone troubleshoot.

 

Thanks,

Steve

 

 

cluster1::> storage failover show-takeover
Node       Node Status           Aggregate      Takeover Status
---------- --------------------- -------------- -------------------------------
node1
           In takeover.
                                 -              -

Warning: Unable to list entries on node node2. RPC: Port mapper
         failure - RPC: Timed out

 

 

cluster1::> system node run -node node1 -command storage show fault

Enclosure Status: critical
Channel: 0a
Shelf: 0
Shelf Type: DS4246
Product Serial Number: SHJSG1504000090
Module Type: IOM6E

Disk Elements:
Element Status                  Status Bytes  Status Descriptions
  0 [Bay  0]: OK                01,00,00,00   
  1 [Bay  1]: OK                01,01,00,00   
  2 [Bay  2]: OK                01,02,00,00   
  3 [Bay  3]: OK                01,03,00,00   
  4 [Bay  4]: OK                01,04,00,00   
  5 [Bay  5]: OK                01,05,00,00   
  6 [Bay  6]: OK                01,06,00,00   
  7 [Bay  7]: OK                01,07,00,00   
  8 [Bay  8]: OK                01,08,00,00   
  9 [Bay  9]: OK                01,09,00,00   
 10 [Bay 10]: OK                01,0A,00,00   
 11 [Bay 11]: OK                01,0B,00,00   
 12 [Bay 12]: OK                01,0C,00,00   
 13 [Bay 13]: OK                01,0D,00,00   
 14 [Bay 14]: OK                01,0E,00,00   
 15 [Bay 15]: OK                01,0F,00,00   
 16 [Bay 16]: OK                01,10,00,00   
 17 [Bay 17]: OK                01,11,00,00   
 18 [Bay 18]: OK                01,12,00,00   
 19 [Bay 19]: OK                01,13,00,00   
 20 [Bay 20]: OK                01,14,00,00   
 21 [Bay 21]: OK                01,15,00,00   
 22 [Bay 22]: OK                01,16,00,00   
 23 [Bay 23]: OK                01,17,00,00   

Power Supplies:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,00,00,A0   RQSTED ON
  2: OK                01,00,00,A0   RQSTED ON
  3: OK                01,00,00,A0   RQSTED ON
  4: OK                01,00,00,A0   RQSTED ON

Fans:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,03,AC,A7   
  2: OK                01,03,66,A6   
  3: OK                01,03,AC,A7   
  4: OK                01,03,66,A6   
  5: OK                01,03,AC,A7   
  6: OK                01,03,66,A6   
  7: OK                01,03,AC,A7   
  8: OK                01,03,66,A6   

Temperature Sensors:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,00,21,00   
  2: OK                01,00,2B,00   
  3: OK                01,00,2A,00   
  4: OK                01,00,38,00   
  5: OK                01,00,2A,00   
  6: OK                01,00,36,00   
  7: OK                01,00,2A,00   
  8: OK                01,00,36,00   
  9: OK                01,00,2B,00   
 10: OK                01,00,39,00   
 11: OK                01,00,30,00   
 12: OK                01,00,30,00   

Enclosure Electronics:
Element Status         Status Bytes  Status Descriptions
  1 [IOM6E A]    : OK                01,00,01,80   REPORT
  2 [IOM6E B]    : OK                01,00,00,80   

OPS Panel:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,00,00,00   

Enclosure:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,00,02,00   FAIL

Voltage Sensors:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,00,FF,01   
  2: OK                01,00,BA,04   
  3: OK                01,00,FF,01   
  4: OK                01,00,BE,04   
  5: OK                01,00,FF,01   
  6: OK                01,00,BE,04   
  7: OK                01,00,FF,01   
  8: OK                01,00,BE,04   

Current Sensors:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,00,82,01   
  2: OK                01,00,B7,02   
  3: OK                01,00,B7,00   
  4: OK                01,00,03,02   
  5: OK                01,00,11,01   
  6: OK                01,00,E8,01   
  7: OK                01,00,21,01   
  8: OK                01,00,3E,02   

SAS Connectors:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,3F,FF,00   
  2: OK                01,03,FF,00   
  3: OK                01,3F,FF,00   
  4: OK                01,03,FF,00   

Vendor Unique Element 83-IOM6E: (SAS)
Element Status         Status Bytes  Status Descriptions
  1 [IOM6E A]    : OK                01,08,00,00   MASTER
  2 [IOM6E B]    : OK                01,00,00,00   

Vendor Unique Element 85-IOM6E: (ACP)
Element Status         Status Bytes  Status Descriptions
  1 [IOM6E A]    : OK                01,00,00,00   
  2 [IOM6E B]    : CRITICAL          02,00,00,40   FAIL

Vendor Unique Element 88-IOM6E: (PCM)
Element Status         Status Bytes  Status Descriptions
  1: OK                01,01,00,00   
  2: OK                01,01,07,80   PC SHELF FAULT RQSTD

Vendor Unique Element 8B-IOM6E: (ETHERNET)
Element Status         Status Bytes  Status Descriptions
  1 [IOM6E A]           : OK                01,01,00,00   
  2 [IOM6E B]           : OK                01,01,00,00 
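
For anyone skimming the long output above, here is a minimal Python sketch that pulls out only the elements worth a second look. It is a rough heuristic, not a NetApp tool: the function name `suspicious_elements`, the regular expressions, and the rule "status is not OK, or the description contains FAIL or FAULT" are all my own assumptions about this text layout. `SAMPLE` is an excerpt of the output pasted above.

```python
import re

# Assumed layout of an element row, e.g.
#   "  2 [IOM6E B]    : CRITICAL          02,00,00,40   FAIL"
ELEMENT = re.compile(
    r"^\s*(\d+)(?:\s*\[([^\]]+)\])?\s*:\s*(\S+)\s+"
    r"([0-9A-Fa-f]{2}(?:,[0-9A-Fa-f]{2}){3})\s*(.*)$"
)
# Assumed layout of a section title, e.g. "Enclosure:" or
# "Vendor Unique Element 85-IOM6E: (ACP)"
SECTION = re.compile(r"^([A-Za-z].*?):(?:\s*\((\w+)\))?\s*$")

SAMPLE = """\
Enclosure Status: critical
Enclosure Electronics:
Element Status         Status Bytes  Status Descriptions
  1 [IOM6E A]    : OK                01,00,01,80   REPORT
  2 [IOM6E B]    : OK                01,00,00,80
Enclosure:
Element Status         Status Bytes  Status Descriptions
  1: OK                01,00,02,00   FAIL
Vendor Unique Element 85-IOM6E: (ACP)
Element Status         Status Bytes  Status Descriptions
  1 [IOM6E A]    : OK                01,00,00,00
  2 [IOM6E B]    : CRITICAL          02,00,00,40   FAIL
Vendor Unique Element 88-IOM6E: (PCM)
Element Status         Status Bytes  Status Descriptions
  1: OK                01,01,00,00
  2: OK                01,01,07,80   PC SHELF FAULT RQSTD
"""


def suspicious_elements(text):
    """Return (section, element, status, description) for flagged rows.

    Heuristic (an assumption, not an official interpretation): flag a row
    when its status is not "OK" or its description mentions FAIL or FAULT.
    """
    section = None
    hits = []
    for line in text.splitlines():
        row = ELEMENT.match(line)
        if row:
            number, label, status, _status_bytes, desc = row.groups()
            words = desc.split()
            if status != "OK" or "FAIL" in words or "FAULT" in words:
                hits.append((section, label or number, status, desc.strip()))
            continue
        title = SECTION.match(line)
        if title:
            name, tag = title.groups()
            section = f"{name} ({tag})" if tag else name
    return hits


if __name__ == "__main__":
    for hit in suspicious_elements(SAMPLE):
        print(" | ".join(hit))
```

On the excerpt, this flags the Enclosure element (FAIL), the ACP element on IOM6E B (CRITICAL, FAIL), and the second PCM element (PC SHELF FAULT RQSTD). Whether any of these relate to the node2 hang is not something the output alone can tell you.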

Re: Failover issue on FAS2554

Hi,

 

There can be various reasons for a giveback to fail. Refer to the KB article "Why is the giveback of a clustered Data ONTAP aggregate vetoed?"

Re: Failover issue on FAS2554

Hello, 

 

When node2 hangs, have you tried connecting a console cable to the console port and reaching the CLI with PuTTY? I'm not 100% sure this will work with the SP offline, but doing this in the past I have come across issues with the boot loader, which loads the ONTAP OS.

You can find more info on the boot loader commands in the NetApp Knowledge Base, but trying boot_ontap should attempt to load the OS and throw an error if none is visible right away.

 

 

All the best,

Ryan

Re: Failover issue on FAS2554

Thanks for the responses. A directly connected serial session to the SP fails, too. The output is frozen until the node is powered off and restarted.

As for the giveback failing, there is nothing to give back to until the node gets rebooted. Once that happens, the giveback functions as expected.

Re: Failover issue on FAS2554

At this point, if the linked document above does not help, I would recommend raising a support case with NetApp directly to determine the cause of this issue.

 

 

All the best,

Ryan 

Re: Failover issue on FAS2554

Thanks, Ryan. I've got a case open with NetApp. They've narrowed it down to either a hardware or a software issue.