I have a disk that is periodically throwing not ready errors and threw a SAS bus error yesterday. The filer has not failed the disk yet, but it throws a clump of not ready errors every few hours. I can live with occasional not ready errors but not a SAS error (it also triggered an autosupport):
2>Mar 13 16:02:16 [esd-filer-1b:callhome.hm.sas.alert.major:CRITICAL]: Call home for SAS Connectivity Monitor: DualPathToDiskShelf_Alert[50:05:0c:c1:02:
I tried "disk maint start", but nothing happens:
> disk maint start -d 1c.02.17
*** You are about to mark the following file system disk(s) for copy, ***
*** which will eventually result in them being removed from service ***
Disk /aggr1/plex0/rg2/1c.02.17
RAID Disk Device   HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)   Phys (MB/blks)
--------- ------   ------------- ---- ---- ---- ----- --------------   --------------
data      1c.02.17 1c    2   17  SA:B -    BSAS 7200  423111/866531584 423946/868242816
***
Do you want to continue? y
disk maint: Maximum number of disks testing 1c.02.17
> disk maint status
[nothing]
I have 5 spares and my options appear to be set correctly:
disk.maint_center.allowed_entries      1    (value might be overwritten in takeover)
disk.maint_center.enable               on   (value might be overwritten in takeover)
disk.maint_center.max_disks            84   (value might be overwritten in takeover)
disk.maint_center.rec_allowed_entries  5    (value might be overwritten in takeover)
disk.maint_center.spares_check         on   (value might be overwritten in takeover)
disk.recovery_needed.count             5    (value might be overwritten in takeover)
(I don't know what disk.recovery_needed.count is, but I think it's a cluster param.)
Meanwhile, I will try to fail the disk to a spare and swap it out. Still, it would be nice to maint test the disk, which includes a power cycle and might either mark the disk as truly bad or clear the problem (or maybe crash the SAS bus, who knows...). I don't have any remote hands to physically pull and reseat it, or I'd just do that.
Does anyone know what the "disk maint: Maximum number of disks testing" message means?
Thanks,w