"Disk maint start" fails with "disk maint: Maximum number of disks testing ...."

WSANDERSATFLEXERA · 2014-03-14

I have a disk that periodically throws "not ready" errors and threw a SAS bus error yesterday. The filer has not failed the disk yet, but it throws a clump of "not ready" errors every few hours. I can live with occasional "not ready" errors, but not with a SAS error (which also triggered an AutoSupport):

2>Mar 13 16:02:16 [esd-filer-1b:callhome.hm.sas.alert.major:CRITICAL]: Call home for SAS Connectivity Monitor: DualPathToDiskShelf_Alert[50:05:0c:c1:02:

I tried "disk maint start", but nothing happens:

> disk maint start -d 1c.02.17
*** You are about to mark the following file system disk(s) for copy, ***
*** which will eventually result in them being removed from service ***
Disk /aggr1/plex0/rg2/1c.02.17

      RAID Disk Device          HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------          ------------- ---- ---- ---- ----- --------------    --------------
      data      1c.02.17        1c    2   17  SA:B   -  BSAS  7200 423111/866531584  423946/868242816
***
Do you want to continue? y
disk maint: Maximum number of disks testing 1c.02.17

> disk maint status

[nothing]

I have 5 spares and my options appear to be set correctly:

disk.maint_center.allowed_entries 1	(value might be overwritten in takeover)
disk.maint_center.enable	on	(value might be overwritten in takeover)
disk.maint_center.max_disks 84	(value might be overwritten in takeover)
disk.maint_center.rec_allowed_entries 5	(value might be overwritten in takeover)
disk.maint_center.spares_check on	(value might be overwritten in takeover)
disk.recovery_needed.count 5	(value might be overwritten in takeover)	(I don't know what this is but I think it's a cluster param)

Meanwhile, I will try to fail the disk to a spare and swap it out, but it would be nice to run the maintenance test on the disk. The test includes a power cycle and might either mark the disk as truly bad or clear the problem (or maybe crash the SAS bus, who knows...). I don't have any remote hands to physically pull and reseat it, or I'd just do that.

Does anyone know what the "disk maint: Maximum number of disks testing" message means?

Thanks, w

WSANDERSATFLEXERA · ‎2014-03-17

To follow up: I swapped the disk with a spare using "disk replace". When I then tried to zero the flaky disk, which was now a spare, the NetApp failed it.

So, zeroing the disk will serve the same purpose as running maintenance checks, if the read errors are persistent enough.
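
For anyone landing here later, the workaround above would look roughly like this on a 7-Mode console. This is a sketch, not a captured session: <spare_disk> is a placeholder for one of your own spares, the comment lines are annotations rather than something to type, and the exact prompts and output will vary by Data ONTAP release.

```
# List the available spares and pick one as the replacement target
> aggr status -s

# Copy the flaky disk's contents to the spare; the old disk becomes a spare afterwards
> disk replace start 1c.02.17 <spare_disk>

# Check the RAID layout; the copy should show up here until it completes
> aggr status -r

# Zero all non-zeroed spares, which now include the flaky disk
> disk zero spares

# See whether the filer has failed the disk
> aggr status -f
```

Zeroing writes the whole disk, so a drive with persistent media or connectivity problems is likely to error out and be failed during this step, as happened here.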