ONTAP Hardware
Need some help with a FAS2520 2-node switchless cluster. For some time now, one of the nodes has been showing red (node down) in OnCommand System Manager. Here's the output of some commands I've managed to run (I'm a NetApp newbie, hope these make sense):
ntapcl-bul-sf::*> cluster show
Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
ntapcl-bul-sf-01     false   true         false
ntapcl-bul-sf-02     true    true         false
2 entries were displayed.
ntapcl-bul-sf::*> storage failover show
                                  Takeover
Node             Partner          Possible State Description
---------------- ---------------- -------- -------------------------------------
ntapcl-bul-sf-01 ntapcl-bul-sf-02 -        Unknown
ntapcl-bul-sf-02 ntapcl-bul-sf-01 false    In takeover, Auto giveback deferred
2 entries were displayed.
ntapcl-bul-sf::*>
ntapcl-bul-sf::*> network port show
Warning: Unable to list entries for vifmgr on node "ntapcl-bul-sf-01": RPC: Couldn't make connection.
Node: ntapcl-bul-sf-02
                                                                       Ignore
                                                  Speed(Mbps) Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper  Status   Status
--------- ------------ ---------------- ---- ---- ----------- -------- ------
a0a       Default      Default          up   1500 auto/1000   healthy  false
e0M       Default      Default          up   1500 auto/1000   healthy  false
e0a       Default      -                up   1500 auto/1000   healthy  false
e0b       Default      -                up   1500 auto/1000   healthy  false
e0c       Default      Default          down 1500 auto/10     -        false
e0d       Cluster      Cluster          up   9000 auto/10000  healthy  false
e0e       Default      Default          down 1500 auto/10     -        false
e0f       Cluster      Cluster         up   9000 auto/10000  healthy  false
8 entries were displayed.
ntapcl-bul-sf::*> network interface show
              Logical                 Status     Network            Current          Current Is
Vserver       Interface               Admin/Oper Address/Mask       Node             Port    Home
------------- ----------------------- ---------- ------------------ ---------------- ------- ----
Cluster       ntapcl-bul-sf-01_clus1  up/-       169.254.49.169/16  ntapcl-bul-sf-01 e0f     false
              ntapcl-bul-sf-01_clus2  up/-       169.254.163.32/16  ntapcl-bul-sf-01 e0f     true
              ntapcl-bul-sf-02_clus1  up/up      169.254.87.210/16  ntapcl-bul-sf-02 e0d     true
              ntapcl-bul-sf-02_clus2  up/up      169.254.190.147/16 ntapcl-bul-sf-02 e0f     true
SVM_AL        SVM_AL-ISCSI-1          up/-       172.16.6.32/16     ntapcl-bul-sf-01 a0a     true
              SVM_AL-ISCSI-2          up/up      172.16.6.33/16     ntapcl-bul-sf-02 a0a     true
              SVM_AL-MGMT             up/up      172.16.6.29/16     ntapcl-bul-sf-02 a0a     true
              SVM_AL-NFS-1            up/up      172.16.6.30/16     ntapcl-bul-sf-02 a0a     false
              SVM_AL-NFS-2            up/up      172.16.6.31/16     ntapcl-bul-sf-02 a0a     true
SVM_BL        SVM_BL-ISCSI-1          up/-       172.16.6.27/16     ntapcl-bul-sf-01 a0a     true
              SVM_BL-ISCSI-2          up/up      172.16.6.28/16     ntapcl-bul-sf-02 a0a     true
              SVM_BL-MGMT             up/up      172.16.6.24/16     ntapcl-bul-sf-02 a0a     false
              SVM_BL-NFS-1            up/up      172.16.6.25/16     ntapcl-bul-sf-02 a0a     false
              SVM_BL-NFS-2            up/up      172.16.6.26/16     ntapcl-bul-sf-02 a0a     true
ntapcl-bul-sf ntapcl-bul-sf-01_mgmt1  up/-       172.16.6.20/16     ntapcl-bul-sf-01 e0M     true
              ntapcl-bul-sf-02_cluster_mgmt
                                      up/up      172.16.6.19/16     ntapcl-bul-sf-02 e0M     false
              ntapcl-bul-sf-02_mgmt1  up/up      172.16.6.21/16     ntapcl-bul-sf-02 e0M     true
17 entries were displayed.
ntapcl-bul-sf::*>
I've tried to reboot the failed node from the command line, but:
ntapcl-bul-sf::*> system node reboot -node ntapcl-bul-sf-01
Warning: Rebooting or halting node "ntapcl-bul-sf-01" in an HA-enabled cluster may result in client disruption or data access failure. To ensure continuity of service, use
the "storage failover takeover" command.
Are you sure you want to reboot node "ntapcl-bul-sf-01"? {y|n}: y
Warning: Unable to list entries on node ntapcl-bul-sf-01. RPC: Couldn't make connection
Error: command failed: RPC: Couldn't make connection
Looks like a communication issue: none of the failed node's IPs respond to ping, neither from a workstation nor from the second node.
What would be the right thing to do next? Is it safe to power the node off and on, or to reconnect the network cables one by one? Or something else?
Thanks in advance for the help!
Open a case with NetApp Support.
The node appears to be down (not running ONTAP). Connect to its serial console port directly, or reach it over SSH via the Service Processor (SP). Try to boot ONTAP from there and check the messages that result.
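The SP usually stays reachable even when ONTAP on its node is down. A rough sketch of that path, assuming defaults (the SP IP placeholder below is whatever `system service-processor show` reports for node 01, and exact prompts vary by ONTAP version):

```
ntapcl-bul-sf::> system service-processor show   # from the healthy node: find node 01's SP IP
$ ssh admin@<sp-ip-of-node-01>                   # log in to the SP itself, not to ONTAP
SP ntapcl-bul-sf-01> system power status         # confirm the node actually has power
SP ntapcl-bul-sf-01> system console              # attach to the node's serial console
LOADER> boot_ontap                               # if the node sits at LOADER, try booting ONTAP
```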
Yep, connected via the serial port, then issued the boot_ontap command, and the result is not very good:
PANIC : ECC error at DIMM-NV1: 94-04-1715-00000076,ADDR 0x4f0000000,(Node(0), CH(1), DIMM(0), Rank(0), Bank(0x0),Row(0x3000), Col(0x0)
Uncorrectable Machine Check Error at CPU0. NB Error: STATUS(Val,UnCor,Enable,MiscV,AddrV,PCC,CErrCnt(0),RdECC,ErrCode(Channel Unkn, Read)ErrCode(0x9f))MISC(Synd(0xffffffff),Chan(0x1),DIMM(0),RTID(0xc0)), ADDR(0x4f0000000).
version: 9.1P2: Tue Feb 28 05:55:17 PST 2017
conf : x86_64.optimize.nodar
cpuid = 0
Uptime: 6s
coredump: primary dumper unavailable.
coredump: secondary dumper unavailable.
System halting...
Looks like memory fault... Will contact support.
Hello Simeonof,
we have the same issue. Do you mind sharing how you fixed it?
Regards,
This thread is 6 years old; it looks like they opened a support ticket. You should probably do the same if you currently have a node down.
Hello
This is the problem: DIMM-NV1: 94-04-1715-00000076
There is a failed DIMM in that node.
If you hit the same panic, you first need to identify which DIMM has failed (the panic message names the slot).
Hello,
the issue was fixed by removing the node from the chassis, reseating the memory modules, and putting everything back.
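In case it helps others hitting the same panic: once the repaired node boots ONTAP again, the partner is typically still in takeover (as in the `storage failover show` output earlier in this thread), so a manual giveback is needed. A hedged sketch, reusing the node names from this thread:

```
ntapcl-bul-sf::> storage failover show
ntapcl-bul-sf::> storage failover giveback -ofnode ntapcl-bul-sf-01   # return the aggregates to the repaired node
ntapcl-bul-sf::> cluster show                                         # both nodes should now show Health: true
```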
Regards,
Miranto