Hello,
We have a FAS2520 with two nodes, and unfortunately one node is down. Below is the result of the cluster show -node command:
Node: XXXX-NODE01
Eligibility: true
Health: false
All shares and datastores are still available for now.
How can we bring this node back up, and how can we identify the root cause of this issue?
We received this alert when the issue happened: System Alert from SP of XXXX-NODE01 (REBOOT (watchdog reset)) CRITICAL
Thanks for your answers and help.
Solved! See the accepted solution below.
10 REPLIES
Hello Miranto,
I recommend opening a support case so that support can analyze all the logs and find the root cause.
Meanwhile, you can connect to the SP/BMC of the down node to view the system console. You may be able to see the error there before analyzing the logs.
Kind regards,
Albert
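For anyone following along, connecting to the SP and opening the system console can look roughly like this. This is a sketch: the SP address is a placeholder, and the exact prompts and commands can vary by platform and firmware version.

```
# SSH to the Service Processor of the down node (address is an example)
ssh admin@<sp-ip-of-node01>

# From the SP prompt, check power state, then attach to the system console:
SP XXXX-NODE01> system power status
SP XXXX-NODE01> system console
# (Ctrl-D typically returns from the console to the SP prompt)
```

Watching the console during a boot attempt is often the fastest way to catch a panic message before digging through log files.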
Hello Albert, thanks for the reply.
Unfortunately, our hardware model is no longer supported by NetApp, so we cannot open a support case.
We found the following in sp_console_logs and are wondering whether it is a hardware issue:
PANIC : ECC error at DIMM-NV1: 94-04-1533-00001357,ADDR 0x4f0000000,(Node(0), CH(1), DIMM(0), Rank(0), Bank(0x0),Row(0x3000), Col(0x0)
Uncorrectable Machine Check Error at CPU0. NB Error: STATUS(Val,OverF,UnCor,Enable,MiscV,AddrV,PCC,CErrCnt(0),RdECC,ErrCode(Channel Unkn, Read)ErrCode(0x9f))MISC(Synd(0x84cdc00),Chan(0x1),DIMM(0),RTID(0xc0)), ADDR(0x4f0000000).
version: 9.7P16: Fri Sep 10 18:35:49 EDT 2021
conf : x86_64.optimize.nodar
cpuid = 0
Uptime: 12s
coredump: primary dumper is not yet registered this early during system initialization. A coredump will not be available at this time.
coredump: secondary dumper is not yet registered this early during system initialization. A coredump will not be available at this time.
System halting...
Best Regards,
Miranto
Hi Miranto,
It seems to be a problem with one of the memory DIMMs. You can try reseating the node in the chassis; this performs a full power-off of the node.
You can also try reseating the memory modules.
Sometimes this works because the memory error counters are reset as well.
The proper fix, though, would be to replace the failed DIMM: if you only reseat it now, the problem will likely occur again.
Kind regards,
Albert
Hi Albert,
What exactly did you mean by reseating the node on the chassis, and by reseating the memory modules?
Best Regards,
Miranto
Miranto has accepted the solution
Hi Miranto,
I mean physically removing the node from the shelf, opening it, removing and reinserting the memory modules, and then inserting the node back into the shelf.
Be careful and note the cabling before you extract the node, because you will need to re-cable it when you reinsert it.
Regards,
Albert
Hi Albert,
I appreciate your help.
Does this process (physically removing the node) require downtime for the NetApp cluster, or can it be done while the second node is running?
FILER::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
FILER-NODE01 false true false
FILER-NODE02 true true false
2 entries were displayed.
Best Regards,
Miranto
Hello Miranto,
The other node should have taken over, so there should be no disruption.
Check: storage failover show
And since your failed node is powered off, the reseat can be done safely.
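Once the reseated node boots cleanly, the partner will typically still hold its storage, so a giveback is needed before the cluster is back to normal. A rough sketch using the node names from this thread (verify the takeover state first; this assumes a standard two-node HA configuration):

```
FILER::> storage failover show
FILER::> storage failover giveback -ofnode FILER-NODE01
FILER::> cluster show
```

If the giveback is vetoed, storage failover show-giveback should report the reason, and cluster show should eventually report Health: true on both nodes.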
Hello Gmox,
The issue was fixed by removing the memory modules and putting them back.
I appreciate your help.
Best Regards,
Good to hear.
Don't forget to monitor and check whether the errors reoccur:
sensors show -node * -name *ECC* -hidden true
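If that sensor query returns nothing on your platform (sensor names differ by model), the EMS log is another place to watch for ECC events. A hedged sketch; the message-name pattern here is an assumption and may need adjusting:

```
FILER::> event log show -node FILER-NODE01 -message-name *ecc*
FILER::> system health alert show
```

Recurring correctable ECC events on the same DIMM are usually the early warning that it should be replaced rather than reseated again.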
Below is the output of the command:
FILER::> sensors show -node * -name *ECC* -hidden true
(system node environment sensors show)
There are no entries matching your query.
FILER::>
Regards,
Miranto
