
"Disk maint start" fails with "disk maint: Maximum number of disks testing ...."


I have a disk that is periodically throwing not-ready errors and threw a SAS bus error yesterday. The filer has not failed the disk yet, but it throws a clump of not-ready errors every few hours. I can live with occasional not-ready errors, but not a SAS error (which also triggered an AutoSupport):


Mar 13 16:02:16 [esd-filer-1b:callhome.hm.sas.alert.major:CRITICAL]: Call home for SAS Connectivity Monitor: DualPathToDiskShelf_Alert[50:05:0c:c1:02:


I tried "disk maint start", but nothing happens:


> disk maint start -d 1c.02.17
*** You are about to mark the following file system disk(s) for copy,  ***
*** which will eventually result in them being removed from service    ***
  Disk /aggr1/plex0/rg2/1c.02.17


      RAID Disk Device          HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------          ------------- ---- ---- ---- ----- --------------    --------------
      data      1c.02.17        1c    2   17  SA:B   -  BSAS  7200 423111/866531584  423946/868242816
Do you want to continue? y
disk maint: Maximum number of disks testing 1c.02.17

> disk maint status



I have 5 spares and my options appear to be set correctly:

disk.maint_center.allowed_entries 1      (value might be overwritten in takeover)  
disk.maint_center.enable on     (value might be overwritten in takeover)
disk.maint_center.max_disks  84     (value might be overwritten in takeover)  
disk.maint_center.rec_allowed_entries 5      (value might be overwritten in takeover)  
disk.maint_center.spares_check on     (value might be overwritten in takeover)  
disk.recovery_needed.count   5      (value might be overwritten in takeover)
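One hedged observation about the output above: disk.maint_center.allowed_entries is set to 1, and if that option caps how many disks the Maintenance Center will test concurrently, then a disk already under test (or a stale entry) could explain the "Maximum number of disks testing" message. If that reading is right, the standard 7-mode options syntax would show and raise the limit (the value 2 here is purely illustrative):

```
> options disk.maint_center.allowed_entries
> options disk.maint_center.allowed_entries 2
> disk maint status
```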

(I don't know what disk.recovery_needed.count is, but I think it's a cluster parameter.)


Meanwhile, I will try to fail the disk to a spare and swap it out. It would be nice to maint-test the disk, which includes a power cycle and might either mark the disk as truly bad or clear the problem (or maybe crash the SAS bus, who knows...). I don't have any remote hands to physically pull and reseat it, or I'd just do that.
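For reference, the copy-to-spare approach sketched above would look something like this in 7-mode: list the spares, then start a disk replace onto one of them. The spare name 0b.03.05 is a made-up example, not from my system:

```
> aggr status -s
> disk replace start 1c.02.17 0b.03.05
```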


Does anyone know what the "disk maint: Maximum number of disks testing" message means?





To follow up: I swapped the disk with a spare using the "disk replace" command. When I tried to zero the flaky disk, now a spare, the NetApp failed it.

So zeroing the disk serves the same purpose as running maintenance checks, provided the errors are persistent enough.
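Assuming 7-mode, the zeroing step above would have been the standard command that zeroes all non-zeroed spares; a full-disk write pass like this is what surfaced the persistent errors and got the disk failed. Progress can be watched from the spares listing:

```
> disk zero spares
> aggr status -s
```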
