We are working on a FAS3240 (running 8.1.1P1 in 7-Mode) in which a controller failed due to a cache battery issue. node1 has currently taken over for the failed node. The cache battery issue has been resolved, but we are having some odd problems with the giveback. The giveback appeared to succeed the other night, but the LUNs were not accessible on node2. The controller was then halted, and everything came back online on node1 in a failed-over state.
It seems there has been an "unsynchronized log" event going on for some time. Assuming this was related to an interconnect issue, we swapped out the twinax interconnect cables and verified c0a --> c0a and c0b --> c0b, but we are still seeing some odd behavior (same as with the previous cables). It doesn't appear the nodes can communicate:
1: Error logs are showing a failed RDMA (cluster) connection repeatedly:
Wed Apr 27 00:16:07 PDT [node01:ems.engine.suppressed:debug]: Event 'ctrl.rdma.failConnect' suppressed 3807 times in last 601 seconds.
2: `ic status` is showing a flapping link 0 (and sometimes link 1):
node01(takeover)*> ic status
Link 0: down
Link 1: up
IC RDMA connection : down
node01(takeover)*> ic status
Link 0: up
Link 1: up
IC RDMA connection : down
node01(takeover)*> ic status
Link 0: down
Link 1: up
IC RDMA connection : down
node01(takeover)*> ic status
Link 0: up
Link 1: up
IC RDMA connection : down
node01(takeover)*> ic status
Link 0: down
Link 1: up
IC RDMA connection : down
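To put a number on the flapping, here is a rough Python sketch that counts link state changes across the five `ic status` samples pasted above. The function name `count_transitions` and the regex are just my own illustration, not a NetApp tool, and it assumes the output format stays exactly as shown.

```python
import re

# The five `ic status` samples from above, pasted as-is.
SAMPLES = """\
Link 0: down
Link 1: up
IC RDMA connection : down
Link 0: up
Link 1: up
IC RDMA connection : down
Link 0: down
Link 1: up
IC RDMA connection : down
Link 0: up
Link 1: up
IC RDMA connection : down
Link 0: down
Link 1: up
IC RDMA connection : down
"""

def count_transitions(text):
    """Return {link number: count of up/down state changes} across samples."""
    last_state = {}   # most recent state seen per link
    flips = {}        # number of state changes per link
    for link, state in re.findall(r"Link (\d+): (up|down)", text):
        link = int(link)
        flips.setdefault(link, 0)
        if link in last_state and last_state[link] != state:
            flips[link] += 1
        last_state[link] = state
    return flips

if __name__ == "__main__":
    for link, n in sorted(count_transitions(SAMPLES).items()):
        print(f"Link {link}: {n} state change(s)")
```

In this capture link 0 changes state on every sample while link 1 stays up, and the RDMA connection is down in all five.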
3: `ic stats error` shows disconnects:
Connection state : Disconnected
Connections made : 2
Software resets : 2
4: Dumping interconnect stats shows a large increment in connection errors:
node01(takeover)*> ic dump hw_stats
Connection Error : 64761483
SQ Overrun Error : 70494545368
[snip]
[5 seconds later]
node01(takeover)*> ic dump hw_stats
Connection Error : 64761514
SQ Overrun Error : 70494545368
[snip]
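For reference, the counter deltas work out as in the small Python sketch below. The 5-second interval is my rough estimate between the two dumps rather than a measured value, and `errors_per_second` is just a helper name made up for this calculation. The last figure uses the EMS line from item 1 (3807 suppressed events in 601 seconds) for comparison; whether those are the same underlying events is only a guess.

```python
def errors_per_second(before, after, interval_s):
    """Average counter increase per second between two samples."""
    return (after - before) / interval_s

# Values copied from the two `ic dump hw_stats` outputs above.
INTERVAL_S = 5  # approximate gap between the two dumps (assumption)
conn_rate = errors_per_second(64761483, 64761514, INTERVAL_S)
sq_rate = errors_per_second(70494545368, 70494545368, INTERVAL_S)

# From the EMS message: 'ctrl.rdma.failConnect' suppressed 3807 times in 601 s.
ems_rate = 3807 / 601

if __name__ == "__main__":
    print(f"Connection Error rate : {conn_rate:.2f}/s")
    print(f"SQ Overrun Error rate : {sq_rate:.2f}/s")
    print(f"EMS failConnect rate  : {ems_rate:.2f}/s")
```

Both the connection error counter and the suppressed EMS event come out at roughly 6 per second, while the SQ overrun counter did not move between the two dumps.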
5: Cannot send any data over the interconnect:
node01(takeover)*> icbulk send 30 20 200
0 messages sent, 0 MB/s
Other errors we were seeing on this system prior to the failure:
[interconnect stats]
Connection state : Established
Connection attempts : 2
Connections made : 2
Connections lost: 0
Software resets : 0
Max connection time : 0 msec
--- EMS Error Stats: ---
NIC transition for Port 0 Count: 0 Last Updated: <No Update>
NIC transition for Port 1 Count: 0 Last Updated: <No Update>
VI error for NIC 0 Count: 0 Last Updated: <No Update>
VI error for NIC 1 Count: 0 Last Updated: <No Update>
NIC reset Count: 0 Last Updated: <No Update>
IC vi_if init failure Count: 0 Last Updated: <No Update>
IC hearbeat failure Count: 0 Last Updated: <No Update>
IC transfer timed out Count:1962 Last Updated: Mon Nov 30 01:18:20 PST 2015
IC init failure Count: 0 Last Updated: <No Update>
IC client init failure Count: 0 Last Updated: <No Update>
IC disabled Count: 0 Last Updated: <No Update>
IC invalid source address Count: 0 Last Updated: <No Update>
IC misconfigured Count: 0 Last Updated: <No Update>
IC ports corss connected Count: 0 Last Updated: <No Update>
IC ports loopback Count: 0 Last Updated: <No Update>
RV version mismatch Count: 0 Last Updated: <No Update>
RV partner not connected Count: 0 Last Updated: <No Update>
RV local not connected Count: 0 Last Updated: <No Update>
RV not connected Count: 0 Last Updated: <No Update>
vi_if descriptor allocation failure Count: 0 Last Updated: <No Update>
Max_xig threshold: 12
Enable OFW status reads : TRUE TRUE
--- General Error Stats: ---
VI Fatal error Count: 0 Last Updated: <No Update>
Memory registration failure Count: 0 Last Updated: <No Update>
Connection down error Count:25981756 Last Updated: Mon Dec 28 05:41:21 PST 2015
Notify timeout error Count: 0 Last Updated: <No Update>
Bad descriptor error Count: 0 Last Updated: <No Update>
Bad descriptor id Count: 0 Last Updated: <No Update>
No descriptor error Count: 0 Last Updated: <No Update>
Recv descriptor error Count: 0 Last Updated: <No Update>
Send descriptor error Count:1213206 Last Updated: Mon Nov 30 01:18:20 PST 2015
Send descriptor timeout Count:3222 Last Updated: Mon Nov 30 01:18:20 PST 2015
vi_if invalid packet Count: 0 Last Updated: <No Update>
vi_if invalid data Count: 0 Last Updated: <No Update>
Kstat recv timeout Count: 0 Last Updated: <No Update>
IC NV unsync Count: 0 Last Updated: <No Update>
Looking for any thoughts here ...