Cluster lacks spare disks

KernelPanicfbsd · ‎2022-10-10

Hello, we have a local failover cluster containing two nodes:

DC1-NAPP-C1-N1

DC1-NAPP-C1-N2

OnCommand / OCUM is complaining that DC1-NAPP-C1-N2 does not have enough spare disks (I have attached the disk details of both nodes to this post). I think it's because DC1-NAPP-C1-N2 uses 1.63 TB SAS drives in its root aggregate and does not have a drive of this type in its hot-spare pool. Happy to be corrected on this, as I'm still not sure what the spare disk logic is.

I'm unsure how we've ended up in this situation, as all failed disks are replaced with the exact same drive type and presumably the replacement disk goes into the spare pool. In any case, how can I fix this? From what I can tell, I would need to reassign a disk from the partner, DC1-NAPP-C1-N1, but there are no available spares of that type to give, so I'd need to convert a data disk to a spare. Is that possible without losing data?

Thanks for any help.

Ontapforrum · ‎2022-10-10

It is recommended to have at least one matching or appropriate hot spare available for each kind of disk on each node.
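To make that recommendation concrete, here is a minimal Python sketch of the check. It is only an illustration, not the logic OCUM actually uses: the function name, the tuple layout, and the sample inventory are all made up, and it only looks for an exact type match (it does not model "appropriate" substitutes).

```python
from collections import Counter


def types_without_spares(disks):
    """Return (node, disk_type) pairs that are in use but have no spare.

    `disks` is an iterable of (node, disk_type, role) tuples, where role is
    "spare" for hot spares and anything else (e.g. "data", "parity") for
    disks that belong to an aggregate. This layout is an assumption made
    for the sketch, not an ONTAP output format.
    """
    in_use = set()
    spare_count = Counter()
    for node, disk_type, role in disks:
        if role == "spare":
            spare_count[(node, disk_type)] += 1
        else:
            in_use.add((node, disk_type))
    # A disk type needs attention if the node uses it but holds no spare of it.
    return sorted(key for key in in_use if spare_count[key] == 0)


# Made-up inventory for illustration only (not taken from the attached details).
example = [
    ("node-a", "SAS-10K", "data"),
    ("node-a", "SAS-10K", "spare"),
    ("node-b", "SAS-10K", "data"),
    ("node-b", "SAS-15K", "spare"),
]
print(types_without_spares(example))
```

In this made-up inventory, node-b is flagged for SAS-10K: it uses that type but its only spare is a different type.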

In your case, it appears that this particular type, 10K 1.6TB, has no spares. Unfortunately, this disk type has been used in the large aggregate "Aggr1_SAS_N1" (which is composed of 4 RAID groups). I see 3 RAID groups of 23 disks, and the 4th RAID group has 19, so the 4th was probably added later to increase the size of the aggregate. I guess that before this last RAID group (rg3) was added everything was OK, and that this is where the spare requirement was missed. Is that right?

Honestly, it's not a good situation to be in. You cannot just take a data disk away from a RAID group, unless you reduce the RAID protection from RAID-DP to RAID4, which reduces redundancy. I don't know whether you can afford to move the volumes from a smaller aggregate to another aggregate, then destroy it and rebuild it with a smaller RAID group size. Also, just glancing through the list, I don't find any disk of appropriate capacity/performance that could be used for 'mixing'. I really don't know what to suggest here; let's see if anyone has some advice for you. In the meantime, you can read the following KBs.

What sparing criteria does the 'Low number of spare disks' healthcheck use?
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Systems/FAS_Systems/What_sparing_criteria_does_the_'Low_number_of_spare_disks'_healthche...

How to reduce the number of data disks in an Aggregate / Volume / RAID group
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/How_to_reduce_the_number_of_data_disks_in_an_Aggregate_%2F%2F_Volume_%...

KernelPanicfbsd · ‎2022-10-12

Hello, thanks for getting back to me. Yes, we have had extra disk shelves added in the past, but that was approx. 4 years ago; I think this lack of spare disks has only been occurring in the last 3 months. As I say, I've no idea how we've managed to get ourselves into this situation.

This has already caused a problem for us: two disks in the same RAID group failed (actually one didn't fail, it was just put into maintenance for a very long time), and because there were no spare disks of that type, the node shut itself down after 24 hours!

Ontapforrum · ‎2022-10-12

I can think of this plan.

(Though it's just a theory, have a read through and our experts here will be able to correct me if there is anything that is a showstopper or something I overlooked)

Plan A: Add another shelf (half-loaded).

Or,

Plan B: (Workaround)

1) First, transfer 1x15K disk from N-1 to N-2.

2) Now N-2 has 3x15K disks as spares.

3) Observation: N-2 has no RAID groups of 15K disks, so it doesn't require any of those 3 disks.

4) Now N-1 has 1x10K spare and 1x15K spare (after step 1), which satisfies the 'one spare per disk type per node' criterion for N-1.

5) Now that N-2 has 3x15K disks as spares, move its root aggregate (10K disks: 0d.03.0, 0a.00.0, 0d.03.1) to the 3x15K disks.
How to non-disruptively migrate a node's root volume/aggregate onto new disks in ONTAP 9?
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/How_to_non-disruptively_migrate_a_node's_root_volume%2F%2Faggregate_on...

6) After the move, N-2 gets 3x10K spares. There will be no 15K spare for the newly created root aggregate, but you will have higher-capacity 10K drives available in case a drive fails in the root aggregate (you need to make sure the disk RPM mixing option is enabled). The rule for spares is: hot spare drives should be equal to or larger than the capacities of the drives that they are protecting.

7) Next, move 1x10K to N-1.

8) Now both N-1 and N-2 have 2x10K spares. I think this should make things safer at least, and may stop the alerts.
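As a sanity check on the bookkeeping in Plan B, the spare counts can be walked through in a small Python sketch. This is only a model of the steps above, not ONTAP commands: the helper names are hypothetical, and the starting counts (N-1 with 1x10K and 2x15K spares, N-2 with 0x10K and 2x15K) are inferred from the steps rather than read from the attached disk details.

```python
# Starting spare counts per node, inferred from the Plan B steps (assumption).
START = {
    "N1": {"10K": 1, "15K": 2},
    "N2": {"10K": 0, "15K": 2},
}


def move_spare(state, src, dst, disk_type, count=1):
    """Reassign `count` spare disks of one type from node src to node dst."""
    if state[src][disk_type] < count:
        raise ValueError(f"{src} has too few {disk_type} spares")
    state[src][disk_type] -= count
    state[dst][disk_type] += count


def migrate_root(state, node, old_type, new_type, count=3):
    """Model a root aggregate move: it consumes `count` spares of new_type
    and frees the `count` old root disks of old_type as spares."""
    if state[node][new_type] < count:
        raise ValueError(f"{node} has too few {new_type} spares")
    state[node][new_type] -= count
    state[node][old_type] += count


def plan_b(start):
    """Apply the Plan B steps to a copy of the starting counts."""
    state = {node: dict(counts) for node, counts in start.items()}
    move_spare(state, "N1", "N2", "15K")     # step 1
    migrate_root(state, "N2", "10K", "15K")  # steps 5 and 6
    move_spare(state, "N2", "N1", "10K")     # step 7
    return state


print(plan_b(START))
```

Under those assumed starting counts, both nodes end with 2x10K spares, N-1 keeps 1x15K spare, and N-2 has no 15K spare left, which matches steps 6 and 8.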

KernelPanicfbsd · ‎2022-10-17

Hello, that's great, thanks for the reply. I think if we have to, we'd go with Plan B. However, there might be a reason why we're down a spare disk: I noticed the following message in the syslogs:

adapterName="0c", debug_string="One or more (1) PHYs on expander 5:00a098:000e7ee:3f are in a bad state."

After further investigation, I was able to see that the system can't see the disk in shelf 13, bay 21, despite the fact that it is physically present (but with no lights on it); the disk listing goes from 4.13.20 straight to 4.13.22.
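For anyone wanting to spot the same kind of gap in a longer disk listing, here is a rough Python sketch. It assumes disk names of the form stack.shelf.bay, as in the names above, and the function name is made up. It only finds gaps between the lowest and highest bay it sees, so a missing first or last bay would not be reported.

```python
def missing_bays(disk_names, shelf_prefix):
    """Return bay numbers absent between the lowest and highest bay seen.

    `disk_names` are strings such as "4.13.20" (assumed stack.shelf.bay);
    `shelf_prefix` is the stack.shelf part, e.g. "4.13".
    """
    prefix = shelf_prefix + "."
    bays = sorted(
        int(name.rsplit(".", 1)[1])
        for name in disk_names
        if name.startswith(prefix)
    )
    if not bays:
        return []
    present = set(bays)
    return [bay for bay in range(bays[0], bays[-1] + 1) if bay not in present]


# The two neighbouring names from the listing described above.
print(missing_bays(["4.13.20", "4.13.22"], "4.13"))
```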

I've logged it with our support, will update here once I have a resolution.