ONTAP Hardware

RPC: Couldn't make connection

Simeonof

Need some help with a FAS2520 two-node switchless cluster. For some time now, one of the nodes has been shown in red (node down) in OnCommand System Manager. Here is the output of some commands I've managed to run (I'm a NetApp newbie, I hope these make sense):

 

ntapcl-bul-sf::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
ntapcl-bul-sf-01     false   true          false
ntapcl-bul-sf-02     true    true          false
2 entries were displayed.

 

 

ntapcl-bul-sf::*> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
ntapcl-bul-sf-01
               ntapcl-bul-sf- -        Unknown
               02
ntapcl-bul-sf-02
               ntapcl-bul-sf- false    In takeover, Auto giveback deferred
               01
2 entries were displayed.

ntapcl-bul-sf::*>

 

ntapcl-bul-sf::*> network port show                      

Warning: Unable to list entries for vifmgr on node "ntapcl-bul-sf-01": RPC: Couldn't make connection.

Node: ntapcl-bul-sf-02
                                                                       Ignore
                                                  Speed(Mbps) Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper  Status   Status
--------- ------------ ---------------- ---- ---- ----------- -------- ------
a0a       Default      Default          up   1500  auto/1000  healthy  false
e0M       Default      Default          up   1500  auto/1000  healthy  false
e0a       Default      -                up   1500  auto/1000  healthy  false
e0b       Default      -                up   1500  auto/1000  healthy  false
e0c       Default      Default          down 1500  auto/10    -        false
e0d       Cluster      Cluster          up   9000  auto/10000 healthy  false
e0e       Default      Default          down 1500  auto/10    -        false
e0f       Cluster      Cluster          up   9000  auto/10000 healthy  false
8 entries were displayed.

 

ntapcl-bul-sf::*> network interface show
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
Cluster
            ntapcl-bul-sf-01_clus1
                         up/-     169.254.49.169/16  ntapcl-bul-sf-01
                                                                   e0f     false
            ntapcl-bul-sf-01_clus2
                         up/-     169.254.163.32/16  ntapcl-bul-sf-01
                                                                   e0f     true
            ntapcl-bul-sf-02_clus1
                         up/up    169.254.87.210/16  ntapcl-bul-sf-02
                                                                   e0d     true
            ntapcl-bul-sf-02_clus2
                         up/up    169.254.190.147/16 ntapcl-bul-sf-02
                                                                   e0f     true
SVM_AL
            SVM_AL-ISCSI-1
                         up/-     172.16.6.32/16     ntapcl-bul-sf-01
                                                                   a0a     true
            SVM_AL-ISCSI-2
                         up/up    172.16.6.33/16     ntapcl-bul-sf-02
                                                                   a0a     true
            SVM_AL-MGMT  up/up    172.16.6.29/16     ntapcl-bul-sf-02
                                                                   a0a     true
                                                                               
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
SVM_AL
            SVM_AL-NFS-1 up/up    172.16.6.30/16     ntapcl-bul-sf-02
                                                                   a0a     false
            SVM_AL-NFS-2 up/up    172.16.6.31/16     ntapcl-bul-sf-02
                                                                   a0a     true
SVM_BL
            SVM_BL-ISCSI-1
                         up/-     172.16.6.27/16     ntapcl-bul-sf-01
                                                                   a0a     true
            SVM_BL-ISCSI-2
                         up/up    172.16.6.28/16     ntapcl-bul-sf-02
                                                                   a0a     true
            SVM_BL-MGMT  up/up    172.16.6.24/16     ntapcl-bul-sf-02
                                                                   a0a     false
            SVM_BL-NFS-1 up/up    172.16.6.25/16     ntapcl-bul-sf-02
                                                                   a0a     false
            SVM_BL-NFS-2 up/up    172.16.6.26/16     ntapcl-bul-sf-02
                                                                   a0a     true
ntapcl-bul-sf
            ntapcl-bul-sf-01_mgmt1
                         up/-     172.16.6.20/16     ntapcl-bul-sf-01
                                                                   e0M     true

            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
ntapcl-bul-sf
            ntapcl-bul-sf-02_cluster_mgmt
                         up/up    172.16.6.19/16     ntapcl-bul-sf-02
                                                                   e0M     false
            ntapcl-bul-sf-02_mgmt1
                         up/up    172.16.6.21/16     ntapcl-bul-sf-02
                                                                   e0M     true
17 entries were displayed.

ntapcl-bul-sf::*>

 

 

I've tried to reboot the failed node from the command line, but:

ntapcl-bul-sf::*> system node reboot -node ntapcl-bul-sf-01

Warning: Rebooting or halting node "ntapcl-bul-sf-01" in an HA-enabled cluster may result in client disruption or data access failure. To ensure continuity of service, use
         the "storage failover takeover" command.
         Are you sure you want to reboot node "ntapcl-bul-sf-01"? {y|n}: y

Warning: Unable to list entries on node ntapcl-bul-sf-01. RPC: Couldn't make connection
Error: command failed: RPC: Couldn't make connection

 

It looks like a communication issue: none of the failed node's IPs respond to ping, neither from a workstation nor from the second node.

 

What would be the right thing to do next? Is it safe to power the node off and on, or to reconnect all the network cables one by one? Or something else?

 

Thanks in advance for the help!

 

6 REPLIES

andris

Open a case with NetApp Support.

 

The node appears to be down (not running ONTAP). Connect to its serial console port directly, or SSH to its Service Processor (SP), to reach it. Try to boot ONTAP from there and check the messages that result.
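For example, a rough sketch of reaching node 01 through its SP (the <sp-ip-of-node-01> placeholder, the admin login and the exact prompts are illustrative; use whatever address system service-processor show reports, and an ONTAP account that is allowed to access the SP):

# from the healthy node, find the SP IP address of node 01
ntapcl-bul-sf::*> system service-processor show

# SSH to that SP address (not to the node-mgmt LIF, which is unreachable here)
workstation$ ssh admin@<sp-ip-of-node-01>

# check power, then attach to the serial console (Ctrl-D detaches)
SP ntapcl-bul-sf-01> system power status
SP ntapcl-bul-sf-01> system console

# if the node sits at the boot loader, try to boot ONTAP and watch the messages
LOADER-A> boot_ontap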

Simeonof

Yep, I connected via the serial port, then issued the boot_ontap command, and the result is not very good:

 

PANIC  : ECC error at DIMM-NV1: 94-04-1715-00000076,ADDR 0x4f0000000,(Node(0), CH(1), DIMM(0), Rank(0), Bank(0x0),Row(0x3000), Col(0x0)
 Uncorrectable Machine Check Error at CPU0. NB Error: STATUS(Val,UnCor,Enable,MiscV,AddrV,PCC,CErrCnt(0),RdECC,ErrCode(Channel Unkn, Read)ErrCode(0x9f))MISC(Synd(0xffffffff),Chan(0x1),DIMM(0),RTID(0xc0)), ADDR(0x4f0000000).
version: 9.1P2: Tue Feb 28 05:55:17 PST 2017
conf   : x86_64.optimize.nodar
cpuid = 0
Uptime: 6s
coredump: primary dumper unavailable.
coredump: secondary dumper unavailable.
System halting...

 

 

Looks like a memory fault... I'll contact support.

Miranto

Hello Simeonof,

 

We have the same issue. Do you mind sharing how you fixed it?

 

Regards,

CHRISMAKI

This thread is 6 years old; it looks like they opened a support ticket. You should probably do the same if you currently have a node down.

torres91

Hello

This is the problem: DIMM-NV1: 94-04-1715-00000076.

 

There is a failed DIMM in this node.

 

In your case, if you hit the same problem, you need to identify which DIMM has failed.
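For anyone landing here later: the panic string usually names the slot directly (DIMM-NV1 above, which looks like an NVMEM DIMM slot), and the SP event log can back that up. A rough sketch, assuming your SP firmware provides these commands:

# scan the SP event log for DIMM/ECC entries around the time of the panic
SP ntapcl-bul-sf-01> events all

# hardware sensor readings; a faulted or missing DIMM may be flagged here
SP ntapcl-bul-sf-01> system sensors

Support will likely still want the full panic string when they dispatch a replacement part.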

Miranto

Hello,

 

The issue was fixed by removing the node, pulling the memory, and putting everything back (i.e. reseating the DIMMs); see the verification sketch after this reply.

 

Regards,

Miranto
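Verification sketch referenced above: once the repaired node boots cleanly, the partner is still holding its storage (the "In takeover, Auto giveback deferred" state earlier in the thread), so a manual giveback plus a health check is the usual wrap-up. These are standard ONTAP 9 commands; adjust the node name to your own cluster:

# confirm node 01 is back and waiting for giveback
ntapcl-bul-sf::*> storage failover show

# hand its aggregates back from the partner
ntapcl-bul-sf::*> storage failover giveback -ofnode ntapcl-bul-sf-01

# both nodes should now report Health true
ntapcl-bul-sf::*> cluster show

# optionally send any displaced LIFs back to their home ports
ntapcl-bul-sf::*> network interface revert *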
