"Disk maint start" fails with "disk maint: Maximum number of disks testing ...."

I have a disk that is periodically throwing not-ready errors, and yesterday it threw a SAS bus error. The filer has not failed the disk yet, but it throws a clump of not-ready errors every few hours. I can live with occasional not-ready errors, but not a SAS error (it also triggered an AutoSupport):

2>Mar 13 16:02:16  [esd-filer-1b:callhome.hm.sas.alert.major:CRITICAL]: Call home for SAS  Connectivity Monitor: DualPathToDiskShelf_Alert[50:05:0c:c1:02:
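
(Side note for anyone hitting the same alert: the dual-path monitor fires when one of the two SAS paths to a shelf drops. If I remember the 7-mode syntax right, "storage show disk -p" lists the primary and secondary path for each disk, which is a quick way to confirm whether a path is actually gone:)

> storage show disk -p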
I tried "disk maint start", but nothing happens:

> disk maint start -d 1c.02.17
*** You are about to mark the following file system disk(s) for copy,  ***
*** which will eventually result in them being removed from service    ***
  Disk /aggr1/plex0/rg2/1c.02.17
      RAID Disk Device          HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------          ------------- ---- ---- ---- ----- --------------    --------------
      data      1c.02.17        1c    2   17  SA:B   -  BSAS  7200 423111/866531584  423946/868242816
***
Do you want to continue? y
disk maint: Maximum number of disks testing 1c.02.17

> disk maint status

[nothing]
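
If the maint center already believes some disk is under test, that stuck slot may be what the "Maximum number of disks testing" message is complaining about (my guess, not confirmed). "disk maint abort" is the documented counterpart to "disk maint start", so clearing and re-checking would look like:

> disk maint abort 1c.02.17
> disk maint status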
I have 5 spares and my options appear to be set correctly:

disk.maint_center.allowed_entries 1      (value might be overwritten in takeover)  
disk.maint_center.enable on     (value might be overwritten in takeover)
disk.maint_center.max_disks  84     (value might be overwritten in takeover)  
disk.maint_center.rec_allowed_entries 5      (value might be overwritten in takeover)  
disk.maint_center.spares_check on     (value might be overwritten in takeover)  
disk.recovery_needed.count   5      (value might be overwritten in takeover)

(I don't know what disk.recovery_needed.count is, but I think it's a cluster param.)
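
If the limit is coming from one of these options, bumping the knob would look something like the following. (This is guesswork on my part; I'm assuming allowed_entries is the per-disk entry limit that applies here, and that "options" with a prefix lists every matching option.)

> options disk.maint_center
> options disk.maint_center.allowed_entries 2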
Meanwhile, I will try to fail the disk to a spare and swap it out. It would still be nice to maint-test the disk, since the test includes a power cycle and might either mark the disk as truly bad or clear the problem (or maybe crash the SAS bus, who knows...). I don't have any remote hands to physically pull and reseat it, or I'd just do that.
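
For the record, these are the two 7-mode routes I know for getting data off a suspect disk before pulling it. 1c.02.23 is a made-up spare name; "aggr status -s" lists the real ones:

> aggr status -s
> disk fail 1c.02.17
> disk replace start 1c.02.17 1c.02.23

"disk fail" prefails the disk (Rapid RAID Recovery copies it off to any available spare, then fails it), while "disk replace start" copies it to the specific spare you name and swaps their roles.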
Does anyone know what the "disk maint: Maximum number of disks testing" message means?
Thanks,
w

Re: "Disk maint start" fails with "disk maint: Maximum number of disks testing ...."

To follow up: I swapped the flaky disk out for a spare with "disk replace". When I tried to zero the flaky disk, which was now a spare, the NetApp failed it.

So if the read errors are persistent enough, zeroing the disk serves the same purpose as running the maintenance checks.
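
For anyone landing here later, the zero step that flushed the disk out was just the standard spare zeroing (7-mode), and if I recall correctly "aggr status -s" shows per-spare zeroing progress:

> disk zero spares
> aggr status -s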