I should clarify that the state information has been lost, as I backtracked to restore the original state in an effort to understand what was going on. Interestingly, destroying the aggregates also triggered an inconsistent-inventory alert as the data partitions were deleted.
This is now with our vendor-partner support, to understand how the situation arose or to explain any shortcomings in my understanding of what is/was happening. They are suggesting that a data aggregate was pulling the disk in at the time (which would have been a user error, if so), but that doesn't explain why the spare root partition was affected - unless it also gets marked as unavailable while the data partition is being added to an aggregate?
I noted something I found interesting (unrelated to this, but educational): you must leave a disk with both the root and data partitions available as spares on every node.
Original Owner: c1-01
 Pool0
  Shared HDD Spares
                                                          Local    Local
                                                           Data     Root Physical
 Disk                        Type   RPM Checksum         Usable   Usable     Size
 --------------------------- ----- ----- --------       -------- -------- --------
 1.0.1                       BSAS   7200 block          753.8GB  73.89GB  828.0GB
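For anyone wanting to check the same thing on their own system, output like the above comes from the clustered ONTAP spare listing command; a minimal sketch (the node name c1-01 is taken from the output above, substitute your own):

```shell
# List spare disks/partitions owned by a node. For partitioned (ADP) disks,
# the Local Data Usable and Local Root Usable columns show what each
# partition can still contribute as a spare.
storage aggregate show-spare-disks -original-owner c1-01
```

A spare whose data partition has been consumed but whose root partition remains shows up here with 0B in one of the Usable columns, which is exactly the symptom discussed in this thread.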
The theory support gave you is believable:
This is the spare disk available on my cluster node, to be added to data_aggr:
Basically, root usable is zero, as expected. Probably, when the disk was pulled in to be added to data_aggr, it was in that state where it had been assigned and showed zero usable space for both data and root at the same time. Makes sense... :)
Original Owner:
 Pool0
  Partitioned Spares
                                                               Local    Local
                                                                Data     Root Physical
 Disk             Type   Class          RPM Checksum          Usable   Usable     Size Status
 ---------------- ------ ----------- ------ --------        -------- -------- -------- --------
 3.1.22           SSD    solid-state      - block             1.72TB       0B   3.49TB zeroed
[raid.disk.replace.job.start:notice]: Starting disk replacement of disk 3.1.33 with disk 3.0.11 - this is also a function of Rapid RAID Recovery (RRR) in ONTAP. By default the controlling option, raid.disk.copy.auto.enable, is set to 'on' on ONTAP systems.
This option determines the action taken when a disk reports a predictive failure: it is possible to predict that a disk will fail soon based on a pattern of recovered errors that have occurred on the disk. In such cases, the disk reports a predictive failure to Data ONTAP. If this option is set to 'on', Data ONTAP initiates Rapid RAID Recovery to copy data from the failing disk to a spare disk.
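For reference, the option described above can be inspected per node in clustered ONTAP; a sketch, with <nodename> as a placeholder for your node:

```shell
# Check whether copy-on-predictive-failure (Rapid RAID Recovery) is enabled
storage raid-options show -node <nodename> -name raid.disk.copy.auto.enable

# It can be turned off, though then a predicted-failure disk is failed
# outright instead of being copied first:
# storage raid-options modify -node <nodename> -name raid.disk.copy.auto.enable -value off
```

Leaving it on is generally the right choice: copying from a still-readable disk avoids a full RAID reconstruction.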
Thanks for the pointer to the pre-emptive failure detection. I hadn't considered that at all but it does seem to fit the evidence.
I'll put the two offending disks into a maintenance test and see what happens. If that doesn't pick up anything, I'll create an aggregate with those disks to see if that triggers their replacement again.
Just thought I would provide some additional context. The system was new, so disk failures were very far down the consideration list - this was a mistake, of course. When the system alerted that there was a shortage of spares almost immediately after some aggregates were created, the focus was on the spare drive and why it was not available. It mysteriously showed 0B/0B as usable.
What we now know happened was that the system was pre-emptively swapping out suspect disks and had thus silently started to use the spare drives. It would have helped had ONTAP removed each disk from the spares list, but this is not done until the copy completes. Each disk still showed up as a spare, just with no usable space.
An "aggr status -r" would have quickly shown what was going on, but because disk failure was not even contemplated, this easy diagnostic avenue wasn't even tried. This was some kind of "broken spare" situation.
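For anyone on clustered ONTAP rather than 7-Mode, the rough equivalent of "aggr status -r" is the per-RAID-group status view; a sketch, with <aggr_name> as a placeholder:

```shell
# Show the per-RAID-group layout and per-disk status of an aggregate.
# A disk being pre-emptively copied out stands out here, rather than
# masquerading as a healthy spare in the spares listing.
storage aggregate show-status -aggregate <aggr_name>
```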
At that point, the newly created aggregates using partitioned drives were destroyed in an attempt to revert the changes. Two things then happened: the disk replacements stopped because the aggregates were gone (so the disks never got marked as failed), and the HA partner node alerted that there was an inconsistent inventory as a result of many partitioned drives being unpartitioned. The latter was a red herring and just added noise to what was really a simple situation.
Undoing the aggregates mysteriously restored the spares, and once zeroed all was well again, but it also removed any "live" diagnostic information. The lesson: be prepared for disk failure even on new systems, and always perform a broad information-gathering exercise first, lest you funnel yourself down a dead end.
The NetApp partner has agreed to provide replacement disks in advance so that we don't have to wait for the devices to fail. The copying process takes 17 hours and disks are replaced sequentially within an aggregate, so it was going to take a couple of days for all the automatic failures to kick in.