Hello,
We have a FAS2520 with two nodes, and unfortunately one node is down. Below is the result of the cluster show -node command:
Node: XXXX-NODE01
Eligibility: true
Health: false
All shares and datastores are still available for now.
How can we bring this node back up, and how can we identify the root cause of this issue?
We received this alert when the issue happened: System Alert from SP of XXXX-NODE01 (REBOOT (watchdog reset)) CRITICAL
Thanks for your answers and help.
Solved! See the accepted solution below.
10 REPLIES
Hello Miranto,
I recommend opening a support case so that support can analyze all the logs and find the root cause.
Meanwhile, you can connect to the SP/BMC of the down node to view the system console. You may be able to see the error there before analyzing the logs.
Kind regards,
Albert
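For anyone following along, connecting to the SP and opening the system console can look roughly like this. This is a sketch: the SP address is a placeholder, and the exact prompts and commands can vary by platform and firmware version.

```
# SSH to the Service Processor of the down node (address is an example)
ssh admin@<sp-ip-of-node01>

# From the SP prompt, check power state, then attach to the system console:
SP XXXX-NODE01> system power status
SP XXXX-NODE01> system console
# (Ctrl-D typically returns from the console to the SP prompt)
```

Watching the console during a boot attempt is often the fastest way to catch a panic message before digging through log files.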
Hello Albert, thanks for the reply.
Unfortunately, our hardware model is no longer supported by NetApp, so we cannot open a support case.
We found the following in sp_console_logs and are wondering whether it is a hardware issue:
PANIC : ECC error at DIMM-NV1: 94-04-1533-00001357,ADDR 0x4f0000000,(Node(0), CH(1), DIMM(0), Rank(0), Bank(0x0),Row(0x3000), Col(0x0)
Uncorrectable Machine Check Error at CPU0. NB Error: STATUS(Val,OverF,UnCor,Enable,MiscV,AddrV,PCC,CErrCnt(0),RdECC,ErrCode(Channel Unkn, Read)ErrCode(0x9f))MISC(Synd(0x84cdc00),Chan(0x1),DIMM(0),RTID(0xc0)), ADDR(0x4f0000000).
version: 9.7P16: Fri Sep 10 18:35:49 EDT 2021
conf : x86_64.optimize.nodar
cpuid = 0
Uptime: 12s
coredump: primary dumper is not yet registered this early during system initialization. A coredump will not be available at this time.
coredump: secondary dumper is not yet registered this early during system initialization. A coredump will not be available at this time.
System halting...
Best Regards,
Miranto
Hi Miranto,
It seems to be a problem with one of the memory DIMMs. You can try reseating the node in the chassis; this performs a full power-off of the node.
You can also try reseating the memory modules.
Sometimes this works because the memory error counters are reset as well.
The proper fix, though, would be to replace the failed DIMM: if you only reseat it now, the problem will likely occur again.
Kind regards,
Albert
Hi Albert,
What exactly did you mean by reseating the node on the chassis, and by reseating the memory modules?
Best Regards,
Miranto
Miranto has accepted the solution
Hi Miranto,
I mean physically removing the node from the shelf, opening it, removing and reinserting the memory modules, and then inserting the node back into the shelf.
Be careful and note the cabling before you extract the node, because you will need to re-cable it when you reinsert it.
Regards,
Albert
Hi Albert,
I appreciate your help.
Does this process (physically removing the node) require downtime for the NetApp cluster, or can it be done while the second node is running?
FILER::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
FILER-NODE01 false true false
FILER-NODE02 true true false
2 entries were displayed.
Best Regards,
Miranto
Hello Miranto,
The other node should have taken over, so there should be no disruption.
Check: storage failover show
And since your failed node is powered off, the reseat can be done safely.
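Once the reseated node boots cleanly, the partner will typically still hold its storage, so a giveback is needed before the cluster is back to normal. A rough sketch using the node names from this thread (verify the takeover state first; this assumes a standard two-node HA configuration):

```
FILER::> storage failover show
FILER::> storage failover giveback -ofnode FILER-NODE01
FILER::> cluster show
```

If the giveback is vetoed, storage failover show-giveback should report the reason, and cluster show should eventually report Health: true on both nodes.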
Hello Gmox,
The issue was fixed by removing the memory modules and putting them back.
I appreciate your help.
Best Regards,
Good to hear.
Don't forget to monitor and check whether the errors reoccur:
sensors show -node * -name *ECC* -hidden true
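If that sensor query returns nothing on your platform (sensor names differ by model), the EMS log is another place to watch for ECC events. A hedged sketch; the message-name pattern here is an assumption and may need adjusting:

```
FILER::> event log show -node FILER-NODE01 -message-name *ecc*
FILER::> system health alert show
```

Recurring correctable ECC events on the same DIMM are usually the early warning that it should be replaced rather than reseated again.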
Below is the output of the command:
FILER::> sensors show -node * -name *ECC* -hidden true
(system node environment sensors show)
There are no entries matching your query.
FILER::>
Regards,
Miranto
