ONTAP Hardware

Alert SPARES_LOW after media error threshold but no disk in broken list

alessice

Hi,

 

Tonight one of our FAS systems (FAS2554, 20x 4TB FSAS + 4x 400GB SSD) sent us a SPARES_LOW alert.

 

Investigating, we found that a disk (0b.00.13) had accumulated media errors until it crossed the threshold, so ONTAP (9.3) started a copy/recovery to spare disk 0b.00.17.

 

Digging deeper, we found that no disk is marked as broken and that, apparently, only the data partition of disk 13 was "failed" and copied to disk 17:

 

cluster3::> storage disk show -broken
There are no entries matching your query.

 

cluster3::> storage aggregate show-spare-disks

Original Owner: cluster3-node01
 Pool0
  Root-Data Partitioned Spares
                                                             Local    Local
                                                              Data     Root Physical
Disk             Type   Class          RPM Checksum         Usable   Usable     Size Status
---------------- ------ ----------- ------ -------------- -------- -------- -------- --------
1.0.17           FSAS   capacity      7200 block                0B  61.58GB   3.64TB zeroed


cluster3::> storage disk show
                     Usable           Disk    Container   Container
Disk                   Size Shelf Bay Type    Type        Name                  Owner
---------------- ---------- ----- --- ------- ----------- --------------------- ---------------

Info: This cluster has partitioned disks. To get a complete list of spare disk
      capacity use "storage aggregate show-spare-disks".
      This cluster has storage pools. To view available capacity in the
      storage pools use "storage pool show-available-capacity".

1.0.0               372.4GB     0   0 SSD     shared      sp_NFS                cluster3-node02
1.0.1               372.4GB     0   1 SSD     shared      sp_NFS                cluster3-node01
1.0.2               372.4GB     0   2 SSD     shared      sp_NFS                cluster3-node02
1.0.3               372.4GB     0   3 SSD     spare       Pool0                 cluster3-node01
1.0.4                3.63TB     0   4 FSAS    shared      -                     cluster3-node02
1.0.5                3.63TB     0   5 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
1.0.6                3.63TB     0   6 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.7                3.63TB     0   7 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
1.0.8                3.63TB     0   8 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.9                3.63TB     0   9 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
1.0.10               3.63TB     0  10 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.11               3.63TB     0  11 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
1.0.12               3.63TB     0  12 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.13               3.63TB     0  13 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
1.0.14               3.63TB     0  14 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.15               3.63TB     0  15 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
1.0.16               3.63TB     0  16 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.17               3.63TB     0  17 FSAS    shared      sata_data_1           cluster3-node01
1.0.18               3.63TB     0  18 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.19               3.63TB     0  19 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
1.0.20               3.63TB     0  20 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.21               3.63TB     0  21 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
1.0.22               3.63TB     0  22 FSAS    shared      aggr0_02, sata_data_2 cluster3-node02
1.0.23               3.63TB     0  23 FSAS    shared      aggr0_01, sata_data_1 cluster3-node01
24 entries were displayed.

 

So disk 13 is not failed and is still in the RAID group, disk 17 is still a spare but only for its root partition, and no disk is "broken".
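
(Assuming the data-partition copy is still running, something like the following should show it in the RAID group layout; "sata_data_1" is our aggregate name:)

cluster3::> storage aggregate show-status -aggregate sata_data_1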

 

But almost every 2 minutes we see a raid.shared.disk.exchange event for disk 13 in the event log.
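
(These can be filtered out of everything else with:

cluster3::> event log show -message-name raid.shared.disk.exchange

if anyone wants to follow along.)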

 

Support will probably send us a new disk to swap with disk 13, but how do we mark it as failed before pulling it from the shelf?

Thanks

 

Here are some logs from Events:

 

01:20:16 disk.ioMediumError: Medium error on disk 0b.00.13: op 0x28:070b6800:0200 sector 118188507 SCSI:medium error - Unrecovered read error - If the disk is in a RAID group, the subsystem will attempt to reconstruct unreadable data (3 11 0 0) (377) Disk 0b.00.13 Shelf 0 Bay 13 [NETAPP X477_WVRDX04TA07 NA02] S/N [ XXXXXX ] UID [50000C0F:01FEFD04:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]

 

01:27:56 sas.adapter.debug: adapterName="0a", debug_string="Starting powercycle on device 0b.00.13"

 

01:28:35 raid.disk.timeout.recovery.read.err: Read error on Disk /sata_data_1/plex0/rg0/0b.00.13P1 Shelf 0 Bay 13 [NETAPP X477_WVRDX04TA07 NA02] S/N [ XXXXXX ] UID [60000C0F:01FEFD04:500A0981:00000001:00000000:00000000:00000000:00000000:00000000:00000000], block #19214411 during aggressive timeout recovery

 

01:28:35 shm.threshold.allMediaErrors: shm: Disk 0b.00.13 has crossed the combination media error threshold in a 10 minute window.

 

01:29:21 raid.rg.diskcopy.start: /sata_data_1/plex0/rg0: starting disk copy from 0b.00.13P1 to 0b.00.17P1. Reason: Disk replace was started..

 

01:31:00 raid.rg.spares.low: /sata_data_1/plex0/rg0
01:31:00 callhome.spares.low: Call home for SPARES_LOW

 

01:31:01 monitor.globalStatus.nonCritical: There are not enough spare disks.


====== Events seen about every 2 minutes ======

raid.shared.disk.exchange: Received shared disk state exchange Disk 0b.00.13 Shelf 0 Bay 13 [NETAPP X477_WVRDX04TA07 NA02] S/N [ XXXXXX ] UID [50000C0F:01FEFD04:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000], event NONE, state prefailing, substate 0x8000, partner state prefailing, partner substate 0x8000, failure reason testing, sick reason RAID_FAIL, offline reason NONE, online reason NONE, partner dblade ID 81c49d52-fa30-11e4-a69c-951e08312b64, host 1 persistent 0, spare on unfail 0, awaiting done 0, awaiting prefail abort 0, awaiting offline abort 0, pool partitioning 0


raid.shared.disk.exchange: Received shared disk state exchange Disk 0b.00.13 Shelf 0 Bay 13 [NETAPP X477_WVRDX04TA07 NA02] S/N [ XXXXXX] UID [50000C0F:01FEFD04:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000], event PREFAIL_DONE, state prefailing, substate 0x8000, partner state prefailing, partner substate 0x10000, failure reason testing, sick reason RAID_FAIL, offline reason NONE, online reason NONE, partner dblade ID 81c49d52-fa30-11e4-a69c-951e08312b64, host 1 persistent 0, spare on unfail 0, awaiting done 1, awaiting prefail abort 0, awaiting offline abort 0, pool partitioning 0


2 REPLIES

andris (Accepted Solution)

The disk copy of the 0.13 P1 data partition will probably take a while...

Then ONTAP will copy the sibling 0.13 P2 root partition (probably to your .17 P2 spare) to completely fail the disk.

At that point, disk 0.13 can be safely replaced.
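
If you ever need to push a disk into this prefail-and-copy state manually, "storage disk fail" should do it; a sketch using your disk name:

cluster3::> storage disk fail -disk 1.0.13

In your case ONTAP has already started that for you, so just wait: once both partition copies complete, the disk is failed completely and will finally show up in "storage disk show -broken".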

alessice

Yes, the disk copy was still in progress: a few hours later the disk failed completely and NetApp support called me to arrange delivery of a new disk.

 

I forgot to run "sysconfig -r" to check the disk copy status.
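
(For the archives: from the clustershell that should be roughly

cluster3::> system node run -node cluster3-node01 -command "sysconfig -r"

against the node that owns the aggregate.)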

Thanks
