Controller Giveback issues

scott_mcisaac

We are working on a FAS3240 (running 8.1.1P1 in 7-Mode) in which a controller failed due to a bad cache battery. Currently node1 has taken over for the failed node. The cache battery issue has been resolved, but we are having some odd issues trying to do the giveback. When we attempted the giveback the other night it appeared to succeed, but the LUNs were not accessible on node2. From there the controller was halted and everything came back online on node1 in a failed-over state.
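For reference, the giveback itself was attempted with the standard 7-Mode sequence from the surviving node, nothing exotic (output omitted here):

node01(takeover)*> cf status
node01(takeover)*> cf giveback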

 

It seems there has been an "unsynchronized log" event going on for some time. Assuming this had to be related to an interconnect issue, we swapped out the twinax interconnect cables and verified c0a --> c0a and c0b --> c0b, but we are still seeing the same odd behavior we saw with the previous cables. It doesn't appear the nodes can communicate:

 

1: Error logs show the RDMA (interconnect) connection failing repeatedly:

 

Wed Apr 27 00:16:07 PDT [node01:ems.engine.suppressed:debug]: Event 'ctrl.rdma.failConnect' suppressed 3807 times in last 601 seconds.
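(3807 suppressions in 601 seconds works out to roughly 6 failed RDMA connection attempts per second, so the interconnect is retrying essentially nonstop.)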

 

2: `ic status` shows link 0 flapping (and sometimes link 1), and the RDMA connection never establishes even when both links are up:

 

node01(takeover)*> ic status
        Link 0: down
        Link 1: up
        IC RDMA connection : down

node01(takeover)*> ic status
        Link 0: up
        Link 1: up
        IC RDMA connection : down

node01(takeover)*> ic status
        Link 0: down
        Link 1: up
        IC RDMA connection : down

node01(takeover)*> ic status
        Link 0: up
        Link 1: up
        IC RDMA connection : down

node01(takeover)*> ic status
        Link 0: down
        Link 1: up
        IC RDMA connection : down

 

3: `ic stats error` shows the connection down, with as many software resets as connections made:

 

        Connection state :                  Disconnected
        Connections made :                      2
        Software resets :                       2

 

4: Dumping the interconnect hardware stats shows the connection error counter climbing continuously:

 

node01(takeover)*> ic dump hw_stats
    Connection Error :               64761483
    SQ Overrun Error :               70494545368
[snip]

[5 seconds later]

node01(takeover)*> ic dump hw_stats
    Connection Error :               64761514
    SQ Overrun Error :               70494545368
[snip]
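(The Connection Error counter went from 64761483 to 64761514, i.e. 31 new errors in about 5 seconds, roughly 6 per second, which lines up with the EMS suppression rate in item 1. The SQ Overrun counter did not move.)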

 

5: Cannot send any data over the interconnect:

 

node01(takeover)*> icbulk send 30 20 200
        0 messages sent,   0 MB/s
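For completeness, two other standard 7-Mode checks that can be run from the surviving node (outputs omitted here): `sysconfig -a` to confirm the NVRAM/interconnect adapter is still detected, and `rdfile /etc/messages` to see the raw "unsynchronized log" and ctrl.rdma events:

node01(takeover)*> sysconfig -a
node01(takeover)*> rdfile /etc/messages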

 

 

Other errors we were seeing on this system prior to the failure (note the transfer timeouts and send descriptor errors dating back to the end of November, and the enormous connection-down count):

 

[interconnect stats]

        Connection state :                  Established
        Connection attempts :                   2
        Connections made :                      2
        Connections lost:                       0
        Software resets :                       0
        Max connection time :                   0 msec

        --- EMS Error Stats: ---
        NIC transition for Port 0             Count:  0   Last Updated: <No Update>
        NIC transition for Port 1             Count:  0   Last Updated: <No Update>
        VI error for NIC 0                    Count:  0   Last Updated: <No Update>
        VI error for NIC 1                    Count:  0   Last Updated: <No Update>
        NIC reset                             Count:  0   Last Updated: <No Update>
        IC vi_if init failure                 Count:  0   Last Updated: <No Update>
        IC heartbeat failure                  Count:  0   Last Updated: <No Update>
        IC transfer timed out                 Count:1962   Last Updated: Mon Nov 30 01:18:20 PST 2015
        IC init failure                       Count:  0   Last Updated: <No Update>
        IC client init failure                Count:  0   Last Updated: <No Update>
        IC disabled                           Count:  0   Last Updated: <No Update>
        IC invalid source address             Count:  0   Last Updated: <No Update>
        IC misconfigured                      Count:  0   Last Updated: <No Update>
        IC ports cross connected              Count:  0   Last Updated: <No Update>
        IC ports loopback                     Count:  0   Last Updated: <No Update>
        RV version mismatch                   Count:  0   Last Updated: <No Update>
        RV partner not connected              Count:  0   Last Updated: <No Update>
        RV local not connected                Count:  0   Last Updated: <No Update>
        RV not connected                      Count:  0   Last Updated: <No Update>
        vi_if descriptor allocation failure   Count:  0   Last Updated: <No Update>

        Max_xig threshold:                    12
        Enable OFW status reads :             TRUE TRUE

        --- General Error Stats: ---
        VI Fatal error                     Count:  0   Last Updated: <No Update>
        Memory registration failure        Count:  0   Last Updated: <No Update>
        Connection down error              Count:25981756   Last Updated: Mon Dec 28 05:41:21 PST 2015
        Notify timeout error               Count:  0   Last Updated: <No Update>
        Bad descriptor error               Count:  0   Last Updated: <No Update>
        Bad descriptor id                  Count:  0   Last Updated: <No Update>
        No descriptor error                Count:  0   Last Updated: <No Update>
        Recv descriptor error              Count:  0   Last Updated: <No Update>
        Send descriptor error              Count:1213206   Last Updated: Mon Nov 30 01:18:20 PST 2015
        Send descriptor timeout            Count:3222   Last Updated: Mon Nov 30 01:18:20 PST 2015
        vi_if invalid packet               Count:  0   Last Updated: <No Update>
        vi_if invalid data                 Count:  0   Last Updated: <No Update>
        Kstat recv timeout                 Count:  0   Last Updated: <No Update>
        IC NV unsync                       Count:  0   Last Updated: <No Update>

 

 

Looking for any thoughts here ... 

1 REPLY

Jeff_Yao

It looks like a hardware issue/bug. Try rebooting the whole cluster; if the problem persists, you might need to replace the motherboard, so you may need to open a support case.
