2014-03-14 10:35 AM
I have a disk that is periodically throwing "not ready" errors, and it threw a SAS bus error yesterday. The filer has not failed the disk yet, but it logs a clump of "not ready" errors every few hours. I can live with the occasional "not ready" error, but not with a SAS error (it also triggered an autosupport):
Mar 13 16:02:16 [esd-filer-1b:callhome.hm.sas.alert.major:CRITICAL
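(For anyone chasing the same symptoms: the full error history is in /etc/messages. There is no grep on the 7-Mode console, so I just dump the log and search it in my terminal; the file is also reachable over the etc$ share.)
> rdfile /etc/messages
Both the "not ready" events and the callhome.hm.sas.alert.major entry should show up there with timestamps, so you can see how often the clumps hit.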
I've tried "disk maint start", but nothing happens:
> disk maint start -d 1c.02.17
*** You are about to mark the following file system disk(s) for copy, ***
*** which will eventually result in them being removed from service ***
RAID Disk Device   HA SHELF BAY CHAN Pool Type RPM  Used (MB/blks)   Phys (MB/blks)
--------- -------- -- ----- --- ---- ---- ---- ---- ---------------- ----------------
data      1c.02.17 1c 2     17  SA:B -    BSAS 7200 423111/866531584 423946/868242816
Do you want to continue? y
disk maint: Maximum number of disks testing 1c.02.17
> disk maint status
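For reference, the sequence I was expecting to work is below (the abort form is from the na_disk man page as I remember it, so double-check it before relying on it). In my case "disk maint status" just comes back empty:
> disk maint start -d 1c.02.17
> disk maint status          (should list the disk and the test it is running)
> disk maint abort 1c.02.17          (back out if it never makes progress)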
I have 5 spares and my options appear to be set correctly:
disk.maint_center.allowed_entries       1    (value might be overwritten in takeover)
disk.maint_center.enable                on   (value might be overwritten in takeover)
disk.maint_center.max_disks             84   (value might be overwritten in takeover)
disk.maint_center.rec_allowed_entries   5    (value might be overwritten in takeover)
disk.maint_center.spares_check          on   (value might be overwritten in takeover)
disk.recovery_needed.count              5    (value might be overwritten in takeover)
(I don't know what this is but I think it's a cluster param)
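In case it helps, this is roughly how I'm sanity-checking the spares and the maint center settings ("options" with just a prefix should list everything under it):
> aggr status -s          (lists the hot spares)
> options disk.maint_center          (dumps the maint_center settings shown above)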
Meanwhile I will try to fail the disk to a spare and swap it out. It would still be nice to maint-test the disk, since that includes a power cycle and might either mark the disk as truly bad or clear the problem (or maybe crash the SAS bus, who knows...). I don't have any remote hands to physically pull and reseat it, or I'd just do that.
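For completeness, the swap route I have in mind looks roughly like this; 1c.02.20 is a made-up spare name, and "disk fail" without -i prefails the disk and copies its contents to a spare before failing it:
> disk replace start 1c.02.17 1c.02.20          (copy the flaky disk onto a named spare, then swap roles)
> disk fail 1c.02.17          (or prefail it and let RAID pick the spare)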
Does anyone know what the "disk maint: Maximum number of disks testing" message means?
2014-03-17 01:41 PM
To follow up: I swapped the disk out to a spare with "disk replace". When I then tried to zero the flaky disk, now a spare, the NetApp failed it.
So zeroing the disk serves the same purpose as running the maintenance center checks, provided the read errors are persistent enough.
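If anyone wants to repeat this: spares get zeroed with the stock command below, and the now-failed disk should then show up in the broken-disk list:
> disk zero spares          (kicks off background zeroing of any non-zeroed spares)
> aggr status -r          (the flaky disk should now be listed under the broken disks)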