NetApp FAS2552: one of the controllers keeps faulting every 2 days

anismerlin

Hi Team

We have an issue on one of our NetApp appliances.

One of the controllers keeps faulting every 2 days. Only a hard reset seems to solve the issue temporarily.

Some of the resources stay online, but the rest do not fail over to the other controller.

Kindly advise.

cedric_renauld

Oops!

Careful! You have at least 2 failed hard drives.

You are close to losing your data.

Call NetApp to have these failed disks replaced.

Beyond that, you have many other issues caused by mistakes during the install; we can look at those once your disks have been replaced.

anismerlin

The client confirmed that the disks were replaced 2 weeks ago; however, the log shows that the disks are still out of order.

Attached is the output of:

storage disk show -broken

JonathanGaudette

Here's what Cedric was referring to:

4/30/2024 09:00:00  MOCO-STR-BKP2    EMERGENCY     monitor.shutdown.brokenDisk: two data disks in RAID group "/Aggr01_FSAS/plex0/rg0" are broken. Halting system now.
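
If you have a lot of these lines to go through, a small script can pull out which node halted and which aggregate and RAID group were affected. This is only a sketch: the field layout (date, time, node, severity, event name, message) is assumed from the single line quoted above, and `parse_ems_line` is a hypothetical helper name, not a NetApp tool.

```python
import re

# Sample line copied from the log excerpt quoted above.
line = ('4/30/2024 09:00:00  MOCO-STR-BKP2    EMERGENCY     '
        'monitor.shutdown.brokenDisk: two data disks in RAID group '
        '"/Aggr01_FSAS/plex0/rg0" are broken. Halting system now.')

# Assumed layout: date, time, node, severity, event name, then free text.
EMS_RE = re.compile(
    r'^(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<node>\S+)\s+(?P<severity>\S+)\s+'
    r'(?P<event>[\w.]+):\s+(?P<message>.*)$'
)


def parse_ems_line(text):
    """Split one log line into fields; return None if it does not match."""
    match = EMS_RE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The RAID group path looks like /<aggregate>/<plex>/<raid group>.
    rg = re.search(r'RAID group "(/[^"]+)"', fields['message'])
    fields['raid_group'] = rg.group(1) if rg else None
    fields['aggregate'] = rg.group(1).split('/')[1] if rg else None
    return fields


info = parse_ems_line(line)
print(info['node'], info['severity'], info['aggregate'], info['raid_group'])
```

Lines that do not follow the assumed layout simply return `None`, so the helper can be run over a whole log export and the non-matching lines skipped.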

torres91

I had something similar with a customer.

I have some questions: Do you have spare disks? And what firmware version is your SP running?

In my incident with the FAS2552, one of my aggregates had two disk failures and one controller went down. We had 3 spare disks, but clustered ONTAP (ONTAP 9.x) did not take any of them, due to a bug in the Service Processor firmware version. While we waited for the replacement disks to arrive, we had to raise the RAID timeout from 24 to 72 hours. Check these commands:

storage raid-options show

storage raid-options modify -node node1 -name raid.timeout 48

link: https://kb.netapp.com/on-prem/ontap/hardware-KBs/Node_shutdown_with_monitor_shutdown_brokenDisk%3AEMERGENCY_error

If your controller still keeps shutting down, it is possible that you have more failed disks.

JonathanGaudette

The Service Processor has no control over RAID or spare disks.

If you do not have any spare disks, ONTAP will not be able to start the reconstruct and will shut down until spare disks are added.

If you have unassigned disks, you need to assign them to the node.

To check for unassigned disks run "run -node * disk show -n"

If you have ADP (disk partitioning), and disk autoassign isn't working, you'll also need to assign the partitions created by ONTAP, as they will also be unassigned.

Unassigned partitions will also show up in "disk show -n"

AlexDawson

Agreed, this sounds like the issue to me too.

torres91

If you review the disk show.txt output, you can see one spare disk:

1.1.17 3.63TB 1 17 FSAS spare Pool0 MOCO-STR-BKP2

I think the failed aggregate has not started rebuilding yet.
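
To check a longer listing for spares without reading it by eye, something like the following could work. It is a rough sketch that assumes each disk row has the same eight whitespace-separated columns as the line above (disk, usable size, shelf, bay, type, container type, container name, owner); `parse_disk_rows` and `spares_by_owner` are made-up helper names, and the sample text contains only the one row quoted in this thread.

```python
from collections import namedtuple

# Column names are an assumption based on the single row quoted above.
DiskRow = namedtuple(
    'DiskRow',
    'disk size shelf bay disk_type container_type container_name owner',
)

sample = """
1.1.17 3.63TB 1 17 FSAS spare Pool0 MOCO-STR-BKP2
"""


def parse_disk_rows(text):
    """Keep only lines that look like disk rows (8 columns, name like 1.1.17)."""
    rows = []
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) != 8 or parts[0].count('.') < 2:
            continue  # header, separator, or a row in a different layout
        rows.append(DiskRow(*parts))
    return rows


def spares_by_owner(rows):
    """Group the names of spare disks by the node that owns them."""
    result = {}
    for row in rows:
        if row.container_type == 'spare':
            result.setdefault(row.owner, []).append(row.disk)
    return result


print(spares_by_owner(parse_disk_rows(sample)))
```

A node that owns the degraded aggregate but has no entry in the result would have no spare available to start a reconstruction, which is worth comparing against the type and size of the failed disks.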

JonathanGaudette

Need more data.
full sysconfig -r from both nodeshells
full disk show -n from either nodeshell