Multiple "Block recommended for reassignment on Disk"

tomasz_golebiewski · ‎2014-02-16

Hello,

Since our support is closed and I cannot even open a new low-priority case with NetApp, I want to ask: is it OK to have multiple errors like this one?

Sun Feb 16 19:22:15 CET [filer: raid.rg.scrub.media.recommend.reassign.err:info]: Block recommended for reassignment on Disk /aggr1/plex0/rg0/0d.01.2 Shelf 1 Bay 2 [NETAPP X410_HVIPC288A15 NA02] S/N [XXXXXXXX], block #2280307 during scrub

I believe ONTAP should predict the failure and mark the drive offline, but it still tries to save the drive every week during scrub.

drive: X410_HVIPC288A15, fw: NA02

OS: ONTAP 7.3.6

inside shelf DS4243: IOM3, fw: 0172

When I send a manual AutoSupport, the section "Disk defect list for disks that have reported errors" says "Disk 0d.01.2 grown defect list has 396 entries", so we are aware of the data consistency risk.

In a single scrub, anywhere from 4 to 39 media errors are reported on this one drive.
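For anyone who wants to tally these messages per disk, here is a minimal sketch in Python. It assumes the messages log has been copied off the filer and that every line follows the format of the single message quoted above; the pattern and the function name `count_scrub_media_errors` are illustrative, not part of any NetApp tool.

```python
import re
from collections import Counter

# Pattern modeled on the one message quoted in this post. Other ONTAP
# releases may format the line differently (assumption).
SCRUB_MSG = re.compile(
    r"raid\.rg\.scrub\.media\.recommend\.reassign\.err:info\]: "
    r"Block recommended for reassignment on Disk (?P<disk>\S+) .*?"
    r"block #(?P<block>\d+) during scrub"
)


def count_scrub_media_errors(lines):
    """Return a Counter mapping disk path -> number of matching messages."""
    counts = Counter()
    for line in lines:
        match = SCRUB_MSG.search(line)
        if match:
            counts[match.group("disk")] += 1
    return counts


# The log line from this post, split across string literals for readability.
sample = [
    "Sun Feb 16 19:22:15 CET [filer: "
    "raid.rg.scrub.media.recommend.reassign.err:info]: "
    "Block recommended for reassignment on Disk /aggr1/plex0/rg0/0d.01.2 "
    "Shelf 1 Bay 2 [NETAPP X410_HVIPC288A15 NA02] S/N [XXXXXXXX], "
    "block #2280307 during scrub",
]

for disk, count in count_scrub_media_errors(sample).items():
    print(disk, count)
```

Feeding it the lines of one weekly scrub would give the per-drive count directly, which makes it easier to see whether the number is growing from scrub to scrub.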

scottgelb · ‎2014-02-16

You can manually fail the drive or use disk replace on it. Also, from priv set advanced, check disk shm_stats for the storage health monitor counters. Or call in to support for their recommendation (probably a disk replace and an RMA of the drive).

tomasz_golebiewski · ‎2014-02-17

Yes, we know about manually setting the drive as failed.

However, this should be done by ONTAP anyway.

Are these values correct? They are pretty... unbelievable.

scottgelb · ‎2014-02-17

Sometimes we have to fail disks manually, but I agree that ONTAP should catch these when errors can't be corrected. It may be a threshold that this ONTAP release does not catch but a newer release does. It is definitely worth calling in a case to see whether a higher release will fail this drive, or why it isn't being failed.

tomasz_golebiewski · ‎2014-02-17

There is nothing new related to media errors detection in ONTAP version 7.3.7.

https://library.netapp.com/ecm/ecm_download_file/ECMP1134331

But there is a little information about failure thresholds:

https://library.netapp.com/ecmdocs/ECMM1278110/html/mgmtsag/4raid22.htm

However, it only says:

More than twenty-five media errors (that are not related to disk scrub activity) occurring on a disk within a ten-minute period

and THESE ARE media errors during scrub activity.
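To make the distinction concrete, here is a rough Python sketch of the rule as quoted. It is only an illustration of the documented wording, not how ONTAP actually implements it; the function name `exceeds_threshold` and the `(timestamp, during_scrub)` event shape are made up for the example.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # "within a ten-minute period"
THRESHOLD = 25                  # "more than twenty-five media errors"


def exceeds_threshold(events):
    """Check the quoted rule against a list of media error events.

    events: iterable of (timestamp, during_scrub) tuples sorted by timestamp.
    Errors flagged as scrub-related are ignored, as the quoted rule states.
    """
    window = deque()
    for ts, during_scrub in events:
        if during_scrub:
            continue  # scrub-related errors do not count toward the rule
        window.append(ts)
        # Drop errors that are more than ten minutes older than this one.
        while ts - window[0] > WINDOW:
            window.popleft()
        if len(window) > THRESHOLD:
            return True
    return False


start = datetime(2014, 2, 16, 19, 0, 0)
# 39 errors in one scrub, the worst case seen on this drive.
scrub_errors = [(start + timedelta(seconds=i), True) for i in range(39)]
# 26 hypothetical errors outside of scrub, one second apart.
other_errors = [(start + timedelta(seconds=i), False) for i in range(26)]

print(exceeds_threshold(scrub_errors))  # False
print(exceeds_threshold(other_errors))  # True
```

Read this way, even 39 media errors in one scrub never count toward the threshold, which would explain why the drive is not failed automatically.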

scottgelb · ‎2014-02-17

Agreed. Based on the counters, it is a candidate for failure. It is definitely worth having support look into it; this could be an existing BURT fixed in a newer release of ONTAP or drive firmware.

tomasz_golebiewski · ‎2014-02-27

The drive has failed.