
NetApp FAS2552: one of the controllers keeps faulting every 2 days

anismerlin

Hi Team,

We have an issue on one of our NetApp appliances.

One of the controllers keeps faulting every 2 days. Only a hard reset seems to solve the issue temporarily.

Some of the resources stay online, but the rest do not fail over to the other controller.

Kindly advise.


cedric_renauld

Oops!

Careful! You have at least 2 failed hard drives.

You are close to losing your data.

Call NetApp to have the failed disks replaced.

 

Beyond that, you have many other issues caused by mistakes during installation; we can look at those after your disks have been replaced.

The client confirmed that the disks were replaced 2 weeks ago; however, the log shows that the disks are still broken.

Attached is the output of:

storage disk show -broken

JonathanGaudette

Here's what Cedric was referring to:

4/30/2024 09:00:00  MOCO-STR-BKP2    EMERGENCY     monitor.shutdown.brokenDisk: two data disks in RAID group "/Aggr01_FSAS/plex0/rg0" are broken. Halting system now.
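If you want to scan an exported event log for this condition, here is a rough Python sketch that pulls the node and RAID group out of a line like the one above. This is not a NetApp tool: the regular expressions and the helper name `parse_broken_disk_event` are my own assumptions, inferred only from the layout of this single line, so other log exports may need a different pattern.

```python
import re

# Layout inferred from the one log line quoted in this thread (assumption):
# <date time>  <node>  <SEVERITY>  <event.name>: <message>
EVENT_RE = re.compile(
    r'^(?P<time>\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2})\s+'
    r'(?P<node>\S+)\s+'
    r'(?P<severity>[A-Z]+)\s+'
    r'(?P<event>[\w.]+):\s+'
    r'(?P<message>.*)$'
)
RAID_GROUP_RE = re.compile(r'RAID group "(?P<rg>[^"]+)"')


def parse_broken_disk_event(line):
    """Return the fields of a monitor.shutdown.brokenDisk line, or None."""
    m = EVENT_RE.match(line.strip())
    if m is None or m.group("event") != "monitor.shutdown.brokenDisk":
        return None
    fields = m.groupdict()
    rg = RAID_GROUP_RE.search(fields["message"])
    # raid_group stays None if the message has no quoted RAID group path.
    fields["raid_group"] = rg.group("rg") if rg else None
    return fields


line = ('4/30/2024 09:00:00  MOCO-STR-BKP2    EMERGENCY     '
        'monitor.shutdown.brokenDisk: two data disks in RAID group '
        '"/Aggr01_FSAS/plex0/rg0" are broken. Halting system now.')
info = parse_broken_disk_event(line)
print(info["node"], info["severity"], info["raid_group"])
```

Running this over every line of a saved log and counting the non-None results per RAID group would show whether the shutdowns always point at the same group.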

 


torres91

I had something similar with a customer.

I have some questions: do you have spare disks? And what firmware version is your SP running?

 

In my incident with the FAS2552, one of my aggregates had two disk failures and one controller went down. We had 3 spare disks, but cDOT (ONTAP 9.x) didn't pick up any of them, due to a bug in the Service Processor firmware version. While we waited for the replacement disks to arrive, we needed to change the RAID timeout from 24 to 72. Check these commands:

storage raid-options show

storage raid-options modify -node node1 -name raid.timeout 48 

link: https://kb.netapp.com/on-prem/ontap/hardware-KBs/Node_shutdown_with_monitor_shutdown_brokenDisk%3AEMERGENCY_error 

 

Then, if your controller keeps shutting down, it is possible that you have more failed disks.

The Service Processor has no control over RAID or spare disks.

If you do not have any spare disks, ONTAP will not be able to start the reconstruct and will shut down until spare disks are added.

If you have unassigned disks, you need to assign them to the node.

To check for unassigned disks run "run -node * disk show -n"

If you have ADP (disk partitioning), and disk autoassign isn't working, you'll also need to assign the partitions created by ONTAP, as they will also be unassigned.

Unassigned partitions will also show up in "disk show -n"

Agreed, this sounds like the issue to me too.

If you review the disk show.txt output, you can see one spare disk:

 

1.1.17 3.63TB 1 17 FSAS spare Pool0 MOCO-STR-BKP2 

 

I think the failed aggregate has not started rebuilding yet.
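To check a longer disk listing for spares without reading it by eye, a small Python sketch like the one below could help. It assumes each row has exactly eight whitespace-separated fields in the order disk, usable size, shelf, bay, type, container type, container name, owner, which is how I read the row quoted above; the `DiskRow` tuple and both function names are hypothetical, and rows in any other layout are simply skipped.

```python
from collections import namedtuple

# Column order is an assumption based on the single row quoted in this thread.
DiskRow = namedtuple(
    "DiskRow",
    "disk usable_size shelf bay disk_type container_type container_name owner",
)


def parse_disk_row(line):
    """Split one listing row into a DiskRow, or return None if it does not fit."""
    parts = line.split()
    if len(parts) != 8:
        return None  # header, separator, blank line, or a different layout
    return DiskRow(*parts)


def find_spares(lines):
    """Return the rows whose container type is 'spare'."""
    rows = (parse_disk_row(line) for line in lines)
    return [row for row in rows if row is not None and row.container_type == "spare"]


sample = [
    "1.1.17 3.63TB 1 17 FSAS spare Pool0 MOCO-STR-BKP2",
    "",
]
found = find_spares(sample)
for row in found:
    print(row.disk, row.disk_type, row.owner)
```

Comparing the owner of each spare against the node that owns the degraded aggregate is the useful part, since a spare owned by the partner node would not be used for the rebuild.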

 

Need more data.
full sysconfig -r from both nodeshells
full disk show -n from either nodeshell
