ONTAP Discussions

Unusable Spare - ONTAP 9.6P2

acuk
6,261 Views

Has anyone seen something like this before? It is a shared (partitioned) disk, which is fine, but there is no usable space on it.

 

storage aggregate show-spare-disks -owner-name mynode

                                                             Local    Local
                                                              Data     Root Physical
Disk             Type   Class          RPM Checksum         Usable   Usable     Size Status
---------------- ------ ----------- ------ -------------- -------- -------- -------- --------
3.0.11           FSAS   capacity      7200 block                0B       0B   8.91TB zeroed

 

(other correct spares removed from output)

7 REPLIES

Ontapforrum
6,213 Views

Hi, 

 

Could you give us this output:

::> storage aggregate show-spare-disks -original-owner <node> -is-disk-shared true
::> storage disk show
::> node run -node <node>
> sysconfig -a
> disk show -v

acuk
6,204 Views

 

I should clarify that the state information has been lost, as I backtracked to restore the original state in an effort to understand what was going on. Interestingly, destroying the aggregates also triggered an inconsistent-inventory alert because the data partitions were deleted.

 

This is now with our vendor-partner support to understand how the situation arose, or to explain any shortcomings in my understanding of what is/was happening. They are suggesting that a data aggregate was pulling it in at the time (which would have been a user error, if so), but that doesn't explain why the spare root partition was affected - unless it also gets marked as unavailable while the data partition is being added to an aggregate?
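
If it recurs, I plan to capture the per-partition view of the disk before touching anything, along these lines (just a sketch; the disk name is the one from my output above, and I am assuming the -partition-ownership view and these fields are available on 9.6):

::> storage disk show -disk 3.0.11 -partition-ownership
::> storage disk show -disk 3.0.11 -fields container-type,owner,aggregate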

 

Ontapforrum
6,185 Views

Hi,

 

Thanks for this update.

 

I am still trying to get my head around this issue you raised, but it's interesting.

 

I am sure there must be some events in the event logs about whatever transitional state made that disk look like that.

 

I was reading about spare partitions:
https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-psmg%2FGUID-1C0DF65F-4EB1-4729-A0FC-A2A8A6278664.html

 

I noted something which I found interesting (unrelated to this, but educational):
You must leave a disk with both the root and data partition available as spare for every node.

Original Owner: c1-01
 Pool0
  Shared HDD Spares
                                                            Local    Local
                                                             Data     Root Physical
  Disk                        Type   RPM Checksum          Usable   Usable     Size
  --------------------------- ----- ------ -------------- -------- -------- --------
  1.0.1                       BSAS   7200 block           753.8GB  73.89GB  828.0GB



The theory you were told by support is believable:

 

For example:

This is the spare disk available on my cluster-node, to be added to data_aggr:


Basically, in my output below, root-usable is 'zero' as expected. Probably, when your disk was pulled in to be added to the data_aggr, it was in that transitional state where it had already been assigned, so it showed no usable data and root space at the same time. Makes sense. :)

 

Original Owner:
 Pool0
  Partitioned Spares
                                                              Local    Local
                                                               Data     Root Physical
 Disk             Type   Class          RPM Checksum         Usable   Usable     Size Status
 ---------------- ------ ----------- ------ -------------- -------- -------- -------- --------
 3.1.22           SSD    solid-state      - block             1.72TB       0B   3.49TB zeroed

 

acuk
6,165 Views

@Ontapforrum wrote:

You must leave a disk with both the root and data partition available as spare for every node.


Yes. That disk was configured to be the nominated spare but it seemed to be unexpectedly unavailable (0B/0B).

This is what gave me cause to look at it - why wasn't the spare available for use?

 

Having now had time to look at the logs, unknown to me the following happened:

 

[raid.disk.replace.job.start:notice]: Starting disk replacement of disk 3.1.33 with disk 3.0.11.

 

So while I was busy creating some aggregates, the spare disk got brought into play. It remains to be seen why disk 3.1.33 needed replacing and why there seemingly wasn't any notification about it.
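
For anyone chasing the same thing, the relevant events can be pulled from the cluster shell; the message-name pattern here is my assumption based on the entry above:

::> event log show -node <node> -message-name raid.disk.replace*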

 

Ontapforrum
6,097 Views

"[raid.disk.replace.job.start:notice]: Starting disk replacement of disk 3.1.33 with disk 3.0.11" is also a function of Rapid RAID Recovery (RRR) in ONTAP. By default, the controlling option [raid.disk.copy.auto.enable] is set to 'on' on ONTAP systems.


This option determines the action taken when a disk reports a predictive failure: it is possible to predict that a disk will fail soon based on a pattern of recovered errors that have occurred on the disk. In such cases, the disk reports a predictive failure to Data ONTAP. If this option is set to 'on', Data ONTAP initiates Rapid RAID Recovery to copy data from the failing disk to a spare disk.
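
If you want to confirm the setting on your system, something along these lines should show it (from memory, so treat the exact syntax as an assumption; the option is exposed as a RAID option in the clustershell and as a nodeshell option):

::> storage raid-options show -node <node> -name raid.disk.copy.auto.enable
::> node run -node <node> options raid.disk.copy.auto.enable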

 

acuk
6,085 Views

 

Thanks for the pointer to the pre-emptive failure detection.  I hadn't considered that at all but it does seem to fit the evidence.

I'll put the two offending disks into a maintenance test and see what happens. If that doesn't pick up anything, I'll create an aggregate with those disks to see whether that triggers their replacement again.
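
For the maintenance test itself, I expect to use the nodeshell Maintenance Center commands, roughly as below (advanced privilege; syntax from memory, so treat it as a sketch rather than the exact procedure):

::> node run -node <node>
> priv set advanced
> disk maint start -d 3.1.33
> disk maint status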

 

Thank you.

 

acuk
5,966 Views

 

Just thought I would provide some additional context. The system was new, so disk failures were very far down the consideration list - this was a mistake, of course. When the system alerted that there was a shortage of spares almost immediately after some aggregates were created, the focus was on the spare drive and why it was not available. It mysteriously showed 0B/0B as usable.

 

What we now know happened is that the system was pre-emptively swapping out suspect disks and had thus silently started to use the spare drives. It would have helped had ONTAP removed each disk from the spares list, but this is not done until the copy completes. It just showed up as a spare, but with no usable space.

 

An "aggr status -r" would have quickly showed what was going on but because disk failure was not even contemplated, this easy diagnostic avenue wasn't even tried.   This was some kind of "broken spare" situation.

 

At that point, the newly created aggregates using partitioned drives were destroyed in an attempt to revert the changes. Two things then happened: the disk replacements stopped because the aggregates were gone (so the disks never got marked as failed), and the HA partner node alerted that there was an inconsistent inventory as a result of many partitioned drives being unpartitioned. The latter was a red herring and just added noise to what was really a simple situation.

 

Undoing the aggregates mysteriously restored the spares, and once they were zeroed all was well again, but it also removed any "live" diagnostic information. The lesson is to be prepared for disk failure even on new systems, and always perform a broad information-gathering exercise first lest you funnel yourself down a dead end.

 

The NetApp partner has agreed to provide replacement disks in advance so that we don't have to wait for the devices to fail. The copying process takes about 17 hours and the disks are replaced sequentially within an aggregate, so it was going to take a couple of days for all the automatic failures to kick in.
