Controller Giveback issues

scott_mcisaac

We are working on a FAS3240 (running 8.1.1P1 in 7-Mode) in which a controller failed due to a bad cache battery. Currently node1 has taken over for the failed node. The cache battery issue has been resolved, but we are having some odd issues trying to do the giveback. When we attempted the giveback the other night it appeared to succeed, but the LUNs were not accessible on node2. From there the controller was halted and everything came back online on node1 in a failed-over state.
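For reference, the giveback itself was attempted with the standard 7-Mode sequence from the surviving node, nothing exotic (output omitted here):

node01(takeover)*> cf status
node01(takeover)*> cf giveback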

 

It seems there has been an "unsynchronized log" event going on for some time. Assuming this had to be related to an interconnect issue, we swapped out the twinax interconnect cables and verified c0a --> c0a and c0b --> c0b, but we are still seeing the same odd behavior we saw with the previous cables. It doesn't appear the nodes can communicate:

 

1: Error logs show the RDMA (interconnect) connection failing repeatedly:

 

Wed Apr 27 00:16:07 PDT [node01:ems.engine.suppressed:debug]: Event 'ctrl.rdma.failConnect' suppressed 3807 times in last 601 seconds.
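(3807 suppressions in 601 seconds works out to roughly 6 failed RDMA connection attempts per second, so the interconnect is retrying essentially nonstop.)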

 

2: `ic status` shows link 0 flapping (and sometimes link 1), and the RDMA connection never establishes even when both links are up:

 

node01(takeover)*> ic status
        Link 0: down
        Link 1: up
        IC RDMA connection : down

node01(takeover)*> ic status
        Link 0: up
        Link 1: up
        IC RDMA connection : down

node01(takeover)*> ic status
        Link 0: down
        Link 1: up
        IC RDMA connection : down

node01(takeover)*> ic status
        Link 0: up
        Link 1: up
        IC RDMA connection : down

node01(takeover)*> ic status
        Link 0: down
        Link 1: up
        IC RDMA connection : down

 

3: `ic stats error` shows the connection down, with as many software resets as connections made:

 

        Connection state :                  Disconnected
        Connections made :                      2
        Software resets :                       2

 

4: Dumping the interconnect hardware stats shows the connection error counter climbing continuously:

 

node01(takeover)*> ic dump hw_stats
    Connection Error :               64761483
    SQ Overrun Error :               70494545368
[snip]

[5 seconds later]

node01(takeover)*> ic dump hw_stats
    Connection Error :               64761514
    SQ Overrun Error :               70494545368
[snip]
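(The Connection Error counter went from 64761483 to 64761514, i.e. 31 new errors in about 5 seconds, roughly 6 per second, which lines up with the EMS suppression rate in item 1. The SQ Overrun counter did not move.)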

 

5: Cannot send any data over the interconnect:

 

node01(takeover)*> icbulk send 30 20 200
        0 messages sent,   0 MB/s
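For completeness, two other standard 7-Mode checks that can be run from the surviving node (outputs omitted here): `sysconfig -a` to confirm the NVRAM/interconnect adapter is still detected, and `rdfile /etc/messages` to see the raw "unsynchronized log" and ctrl.rdma events:

node01(takeover)*> sysconfig -a
node01(takeover)*> rdfile /etc/messages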

 

 

Other errors we were seeing on this system prior to the failure (note the transfer timeouts and send descriptor errors dating back to the end of November, and the enormous connection-down count):

 

[interconnect stats]

        Connection state :                  Established
        Connection attempts :                   2
        Connections made :                      2
        Connections lost:                       0
        Software resets :                       0
        Max connection time :                   0 msec

        --- EMS Error Stats: ---
        NIC transition for Port 0             Count:  0   Last Updated: <No Update>
        NIC transition for Port 1             Count:  0   Last Updated: <No Update>
        VI error for NIC 0                    Count:  0   Last Updated: <No Update>
        VI error for NIC 1                    Count:  0   Last Updated: <No Update>
        NIC reset                             Count:  0   Last Updated: <No Update>
        IC vi_if init failure                 Count:  0   Last Updated: <No Update>
        IC heartbeat failure                  Count:  0   Last Updated: <No Update>
        IC transfer timed out                 Count:1962   Last Updated: Mon Nov 30 01:18:20 PST 2015
        IC init failure                       Count:  0   Last Updated: <No Update>
        IC client init failure                Count:  0   Last Updated: <No Update>
        IC disabled                           Count:  0   Last Updated: <No Update>
        IC invalid source address             Count:  0   Last Updated: <No Update>
        IC misconfigured                      Count:  0   Last Updated: <No Update>
        IC ports cross connected              Count:  0   Last Updated: <No Update>
        IC ports loopback                     Count:  0   Last Updated: <No Update>
        RV version mismatch                   Count:  0   Last Updated: <No Update>
        RV partner not connected              Count:  0   Last Updated: <No Update>
        RV local not connected                Count:  0   Last Updated: <No Update>
        RV not connected                      Count:  0   Last Updated: <No Update>
        vi_if descriptor allocation failure   Count:  0   Last Updated: <No Update>

        Max_xig threshold:                    12
        Enable OFW status reads :             TRUE TRUE

        --- General Error Stats: ---
        VI Fatal error                     Count:  0   Last Updated: <No Update>
        Memory registration failure        Count:  0   Last Updated: <No Update>
        Connection down error              Count:25981756   Last Updated: Mon Dec 28 05:41:21 PST 2015
        Notify timeout error               Count:  0   Last Updated: <No Update>
        Bad descriptor error               Count:  0   Last Updated: <No Update>
        Bad descriptor id                  Count:  0   Last Updated: <No Update>
        No descriptor error                Count:  0   Last Updated: <No Update>
        Recv descriptor error              Count:  0   Last Updated: <No Update>
        Send descriptor error              Count:1213206   Last Updated: Mon Nov 30 01:18:20 PST 2015
        Send descriptor timeout            Count:3222   Last Updated: Mon Nov 30 01:18:20 PST 2015
        vi_if invalid packet               Count:  0   Last Updated: <No Update>
        vi_if invalid data                 Count:  0   Last Updated: <No Update>
        Kstat recv timeout                 Count:  0   Last Updated: <No Update>
        IC NV unsync                       Count:  0   Last Updated: <No Update>

 

 

Looking for any thoughts here ... 

1 REPLY

Jeff_Yao

It looks like a hardware issue/bug. Try rebooting the whole cluster; if the problem persists, you might need to replace the motherboard, so you may need to open a support case.
