ONTAP Hardware
Need some help with a FAS2520 2-node switchless cluster. For some time now, one of the nodes has been showing red (node down) in OnCommand System Manager. Here's the output of some commands I've managed to run (I'm a NetApp newbie, hope these make sense):
ntapcl-bul-sf::*> cluster show
Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
ntapcl-bul-sf-01     false   true         false
ntapcl-bul-sf-02     true    true         false
2 entries were displayed.
ntapcl-bul-sf::*> storage failover show
                                  Takeover
Node             Partner          Possible State Description
---------------- ---------------- -------- -------------------------------------
ntapcl-bul-sf-01 ntapcl-bul-sf-02 -        Unknown
ntapcl-bul-sf-02 ntapcl-bul-sf-01 false    In takeover, Auto giveback deferred
2 entries were displayed.
ntapcl-bul-sf::*>
ntapcl-bul-sf::*> network port show
Warning: Unable to list entries for vifmgr on node "ntapcl-bul-sf-01": RPC: Couldn't make connection.
Node: ntapcl-bul-sf-02
                                                                       Ignore
                                                  Speed(Mbps) Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper  Status   Status
--------- ------------ ---------------- ---- ---- ----------- -------- ------
a0a       Default      Default          up   1500 auto/1000   healthy  false
e0M       Default      Default          up   1500 auto/1000   healthy  false
e0a       Default      -                up   1500 auto/1000   healthy  false
e0b       Default      -                up   1500 auto/1000   healthy  false
e0c       Default      Default          down 1500 auto/10     -        false
e0d       Cluster      Cluster          up   9000 auto/10000  healthy  false
e0e       Default      Default          down 1500 auto/10     -        false
e0f       Cluster      Cluster         up   9000 auto/10000  healthy  false
8 entries were displayed.
ntapcl-bul-sf::*> network interface show
              Logical                 Status     Network            Current          Current Is
Vserver       Interface               Admin/Oper Address/Mask       Node             Port    Home
------------- ----------------------- ---------- ------------------ ---------------- ------- ----
Cluster       ntapcl-bul-sf-01_clus1  up/-       169.254.49.169/16  ntapcl-bul-sf-01 e0f     false
              ntapcl-bul-sf-01_clus2  up/-       169.254.163.32/16  ntapcl-bul-sf-01 e0f     true
              ntapcl-bul-sf-02_clus1  up/up      169.254.87.210/16  ntapcl-bul-sf-02 e0d     true
              ntapcl-bul-sf-02_clus2  up/up      169.254.190.147/16 ntapcl-bul-sf-02 e0f     true
SVM_AL        SVM_AL-ISCSI-1          up/-       172.16.6.32/16     ntapcl-bul-sf-01 a0a     true
              SVM_AL-ISCSI-2          up/up      172.16.6.33/16     ntapcl-bul-sf-02 a0a     true
              SVM_AL-MGMT             up/up      172.16.6.29/16     ntapcl-bul-sf-02 a0a     true
              SVM_AL-NFS-1            up/up      172.16.6.30/16     ntapcl-bul-sf-02 a0a     false
              SVM_AL-NFS-2            up/up      172.16.6.31/16     ntapcl-bul-sf-02 a0a     true
SVM_BL        SVM_BL-ISCSI-1          up/-       172.16.6.27/16     ntapcl-bul-sf-01 a0a     true
              SVM_BL-ISCSI-2          up/up      172.16.6.28/16     ntapcl-bul-sf-02 a0a     true
              SVM_BL-MGMT             up/up      172.16.6.24/16     ntapcl-bul-sf-02 a0a     false
              SVM_BL-NFS-1            up/up      172.16.6.25/16     ntapcl-bul-sf-02 a0a     false
              SVM_BL-NFS-2            up/up      172.16.6.26/16     ntapcl-bul-sf-02 a0a     true
ntapcl-bul-sf ntapcl-bul-sf-01_mgmt1  up/-       172.16.6.20/16     ntapcl-bul-sf-01 e0M     true
              ntapcl-bul-sf-02_cluster_mgmt
                                      up/up      172.16.6.19/16     ntapcl-bul-sf-02 e0M     false
              ntapcl-bul-sf-02_mgmt1  up/up      172.16.6.21/16     ntapcl-bul-sf-02 e0M     true
17 entries were displayed.
ntapcl-bul-sf::*>
I've tried to reboot the failed node from the command line, but:
ntapcl-bul-sf::*> system node reboot -node ntapcl-bul-sf-01
Warning: Rebooting or halting node "ntapcl-bul-sf-01" in an HA-enabled cluster may result in client disruption or data access failure. To ensure continuity of service, use
the "storage failover takeover" command.
Are you sure you want to reboot node "ntapcl-bul-sf-01"? {y|n}: y
Warning: Unable to list entries on node ntapcl-bul-sf-01. RPC: Couldn't make connection
Error: command failed: RPC: Couldn't make connection
Looks like a communication issue: none of the failed node's IPs respond to ping, neither from a workstation nor from the second node.
What would be the right thing to do next? Is it safe to power the node off and on, or to reconnect the network cables one by one? Or something else?
Thanks in advance for the help!
Open a case with NetApp Support.
The node appears to be down (not running ONTAP). Connect to its serial console port directly, or reach it over SSH via the Service Processor (SP). Try to boot ONTAP from there and check the messages that result.
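The SP usually stays reachable even when ONTAP on its node is down. A rough sketch of that path, assuming defaults (the SP IP placeholder below is whatever `system service-processor show` reports for node 01, and exact prompts vary by ONTAP version):

```
ntapcl-bul-sf::> system service-processor show   # from the healthy node: find node 01's SP IP
$ ssh admin@<sp-ip-of-node-01>                   # log in to the SP itself, not to ONTAP
SP ntapcl-bul-sf-01> system power status         # confirm the node actually has power
SP ntapcl-bul-sf-01> system console              # attach to the node's serial console
LOADER> boot_ontap                               # if the node sits at LOADER, try booting ONTAP
```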
Yep, connected via the serial port, then issued the boot_ontap command, and the result is not very good:
PANIC : ECC error at DIMM-NV1: 94-04-1715-00000076,ADDR 0x4f0000000,(Node(0), CH(1), DIMM(0), Rank(0), Bank(0x0),Row(0x3000), Col(0x0)
Uncorrectable Machine Check Error at CPU0. NB Error: STATUS(Val,UnCor,Enable,MiscV,AddrV,PCC,CErrCnt(0),RdECC,ErrCode(Channel Unkn, Read)ErrCode(0x9f))MISC(Synd(0xffffffff),Chan(0x1),DIMM(0),RTID(0xc0)), ADDR(0x4f0000000).
version: 9.1P2: Tue Feb 28 05:55:17 PST 2017
conf : x86_64.optimize.nodar
cpuid = 0
Uptime: 6s
coredump: primary dumper unavailable.
coredump: secondary dumper unavailable.
System halting...
Looks like memory fault... Will contact support.
Hello Simeonof,
we have the same issue. Do you mind sharing how you fixed it?
Regards,
This thread is 6 years old; it looks like they opened a support ticket. You should probably do the same if you currently have a node down.
Hello
This is the problem: DIMM-NV1: 94-04-1715-00000076
There is a failed DIMM in that node.
If you hit the same panic, you first need to identify which DIMM has failed (the panic message names the slot).
Hello,
the issue was fixed by removing the node from the chassis, reseating the memory modules, and putting everything back.
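In case it helps others hitting the same panic: once the repaired node boots ONTAP again, the partner is typically still in takeover (as in the `storage failover show` output earlier in this thread), so a manual giveback is needed. A hedged sketch, reusing the node names from this thread:

```
ntapcl-bul-sf::> storage failover show
ntapcl-bul-sf::> storage failover giveback -ofnode ntapcl-bul-sf-01   # return the aggregates to the repaired node
ntapcl-bul-sf::> cluster show                                         # both nodes should now show Health: true
```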
Regards,
Miranto