ONTAP Discussions

FAS2520 node down

Miranto

Hello,

 

We have a FAS2520 with two nodes; unfortunately, one node is down. Below is the output of the cluster show -node command:

Node: XXXX-NODE01
Eligibility: true
Health: false

 

All shares and datastores are still available for now.

How can we bring this node back up, and how can we identify the root cause of this issue?

We received this alert when the issue happened: System Alert from SP of XXXX-NODE01 (REBOOT (watchdog reset)) CRITICAL

 

Thanks for your answer and help

 

10 REPLIES

Abeltran

Hello Miranto,

 

I recommend opening a support case so that support can analyze all the logs and find the root cause.

 

Meanwhile, you can connect to the SP/BMC of the down node to view the system console. You may be able to see the error there before analyzing the logs.

 

Kind regards,

 

Albert

Miranto

Hello Albert, thanks for the reply.

Unfortunately, our hardware model is no longer supported by NetApp, so we could not create a support case.

We found the following in sp_console_logs and are wondering whether it is a hardware issue:

 

PANIC : ECC error at DIMM-NV1: 94-04-1533-00001357,ADDR 0x4f0000000,(Node(0), CH(1), DIMM(0), Rank(0), Bank(0x0),Row(0x3000), Col(0x0)
Uncorrectable Machine Check Error at CPU0. NB Error: STATUS(Val,OverF,UnCor,Enable,MiscV,AddrV,PCC,CErrCnt(0),RdECC,ErrCode(Channel Unkn, Read)ErrCode(0x9f))MISC(Synd(0x84cdc00),Chan(0x1),DIMM(0),RTID(0xc0)), ADDR(0x4f0000000).
version: 9.7P16: Fri Sep 10 18:35:49 EDT 2021
conf : x86_64.optimize.nodar
cpuid = 0
Uptime: 12s
coredump: primary dumper is not yet registered this early during system initialization. A coredump will not be available at this time.
coredump: secondary dumper is not yet registered this early during system initialization. A coredump will not be available at this time.
System halting...

 

Best Regards,

Miranto
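
For anyone who wants to pull the DIMM location out of a console log like the one above without reading it by eye, here is a minimal Python sketch. It assumes the PANIC line always has the same shape as this single sample; the regular expressions and the function name parse_ecc_panic are illustrative, not part of any NetApp tool, and other ONTAP releases may word the message differently.

```python
import re

# Sample line copied from the sp_console_logs excerpt in this thread.
PANIC_LINE = (
    "PANIC : ECC error at DIMM-NV1: 94-04-1533-00001357,ADDR 0x4f0000000,"
    "(Node(0), CH(1), DIMM(0), Rank(0), Bank(0x0),Row(0x3000), Col(0x0)"
)

# Assumption: both patterns are inferred from that one sample only.
ECC_PANIC_RE = re.compile(
    r"PANIC\s*:\s*ECC error at (?P<slot>[\w-]+):\s*(?P<ident>[\w-]+),\s*"
    r"ADDR\s+(?P<addr>0x[0-9a-fA-F]+)"
)
FIELD_RE = re.compile(
    r"(?P<key>Node|CH|DIMM|Rank|Bank|Row|Col)\((?P<val>0x[0-9a-fA-F]+|\d+)\)"
)


def to_int(text):
    """Convert '0x3000' (hex) or '12' (decimal) to an int."""
    return int(text, 16) if text.lower().startswith("0x") else int(text, 10)


def parse_ecc_panic(line):
    """Return a dict describing the ECC panic, or None if the line does not match."""
    m = ECC_PANIC_RE.search(line)
    if m is None:
        return None
    info = {
        "slot": m.group("slot"),    # e.g. DIMM-NV1
        "ident": m.group("ident"),  # identifier string printed after the slot
        "addr": to_int(m.group("addr")),
    }
    # The location fields (Node, CH, DIMM, ...) follow the address.
    for f in FIELD_RE.finditer(line, m.end()):
        info[f.group("key").lower()] = to_int(f.group("val"))
    return info


if __name__ == "__main__":
    print(parse_ecc_panic(PANIC_LINE))
```

Run against the sample line, this reports slot DIMM-NV1 on channel 1, which is the module to reseat or replace.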

Abeltran

Hi Miranto,

 

It seems to be a problem with one of the memory DIMMs. You can try to reseat the node in the chassis. This will do a full power-off of the node.

 

You can also try to reseat the memory modules.

 

Sometimes this works because the memory errors are reset too.

 

The best option would be to replace the failed DIMM, because even if this solves the problem now, it will occur again.

 

Kind regards,

 

Albert

Miranto

Hi Albert,

 

What exactly did you mean by reseating the node in the chassis, or reseating the memory modules?

 

Best Regards,

Miranto

Abeltran

Hi Miranto,

 

I mean physically remove the node from the chassis, open the node, and remove and reinsert the memory modules. Then insert the node back into the chassis.

 

Be careful and take note of the cabling before you pull the node out, because you will need to recable it when you reinsert it.

 

Regards,

 

Albert

Miranto

Hi Albert,

 

I appreciate your help.

Does this process (physically removing the node) require downtime for the NetApp cluster, or can it be done while the second node is running?

 

FILER::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
FILER-NODE01         false   true          false
FILER-NODE02         true    true          false
2 entries were displayed.

 

Best Regards,

Miranto
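
If you want to check the cluster show output from a script rather than by eye, the following Python sketch lists the nodes whose Health column is false. It assumes the four-column layout shown in this thread (Node, Health, Eligibility, Epsilon) and that the text has already been captured somehow; how you collect it (SSH, a saved log, etc.) is left out, and the function name unhealthy_nodes is only illustrative.

```python
# Sample output taken from the cluster show listing in this thread.
CLUSTER_SHOW = """\
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
FILER-NODE01      false   true          false
FILER-NODE02      true    true          false
2 entries were displayed.
"""


def unhealthy_nodes(output):
    """Return the names of nodes whose Health column reads 'false'."""
    nodes = []
    for line in output.splitlines():
        parts = line.split()
        # Assumption: data rows have exactly four columns and the last three
        # are 'true'/'false'. This skips the header, the dashed separator,
        # and the "entries were displayed" footer.
        if len(parts) == 4 and all(p in ("true", "false") for p in parts[1:]):
            if parts[1] == "false":
                nodes.append(parts[0])
    return nodes


if __name__ == "__main__":
    for name in unhealthy_nodes(CLUSTER_SHOW):
        print("Node reported unhealthy:", name)
```

With the sample above, only FILER-NODE01 is reported.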

Gmox

Hello Miranto,

 

The other node should have taken over, so there should be no disruption.

Check with: storage failover show

 

And since your failed node is powered off, the reseat can be done safely.

 

Miranto

Hello Gmox,

 

The issue was fixed by removing the memory and putting it back.

I appreciate your help.

 

Best Regards,

Gmox

Good to hear.

 

Don't forget to keep monitoring and check whether errors are occurring:

 

sensors show -node * -name *ECC* -hidden true

Miranto

Below is the output of the command:

 


FILER::> sensors show -node * -name *ECC* -hidden true
(system node environment sensors show)
There are no entries matching your query.

FILER::>

 

Regards,

Miranto
