Aggregate Showing Failed Disks

jlavetan

Just replaced a drive, but one of our aggregates is still showing failed disks. How can we get the status back to normal? We have plenty of spares.

RAID Group /aggr2_sas_clp_lcl_fas8020b/plex0/rg1 (double degraded, block checksums, raid_dp)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- ----------
     dparity  3.33.7                       0   SAS    15000  546.9GB  547.7GB (normal)
     parity   3.32.8                       0   SAS    15000  546.9GB  547.1GB (normal)
     data     3.33.8                       0   SAS    15000  546.9GB  547.7GB (normal)
     data     3.32.9                       0   SAS    15000  546.9GB  547.1GB (normal)
     data     3.33.9                       0   SAS    15000  546.9GB  547.7GB (normal)
     data     FAILED                       -   -          -  546.9GB        - (failed)
     data     3.33.10                      0   SAS    15000  546.9GB  547.7GB (normal)
     data     3.32.11                      0   SAS    15000  546.9GB  547.1GB (normal)
     data     3.33.11                      0   SAS    15000  546.9GB  547.7GB (normal)
     data     FAILED                       -   -          -  546.9GB        - (failed)
     data     3.33.12                      0   SAS    15000  546.9GB  547.7GB (normal)
     data     3.32.13                      0   SAS    15000  546.9GB  547.1GB (normal)
     data     3.33.13                      0   SAS    15000  546.9GB  547.7GB (normal)
     data     3.32.14                      0   SAS    15000  546.9GB  547.1GB (normal)
     data     3.33.14                      0   SAS    15000  546.9GB  547.7GB (normal)
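
For reference, the two failed slots above report no disk name. A minimal sketch of commands to list the broken disks and check the aggregate's RAID state, assuming clustered ONTAP (aggregate name taken from the output above; field names can vary slightly by release):

storage disk show -container-type broken
storage aggregate show -aggregate aggr2_sas_clp_lcl_fas8020b -fields raidstatus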

 Pool0
  Spare Pool

                                                             Usable Physical
 Disk             Type   Class          RPM Checksum           Size     Size Status
 ---------------- ------ ----------- ------ -------------- -------- -------- --------
 2.22.17          SAS    performance  10000 block           836.9GB  838.4GB zeroed
 2.22.19          SAS    performance  10000 block           836.9GB  838.4GB zeroed
 2.23.9           SAS    performance  10000 block           836.9GB  838.4GB zeroed
 3.30.22          SAS    performance  15000 block           546.9GB  547.1GB zeroed
 3.31.2           SAS    performance  15000 block           546.9GB  547.7GB zeroed
 3.32.12          SAS    performance  15000 block           546.9GB  547.7GB zeroed

Original Owner: clp-lcl-fas8020b
 Pool0
  Spare Pool

                                                             Usable Physical
 Disk             Type   Class          RPM Checksum           Size     Size Status
 ---------------- ------ ----------- ------ -------------- -------- -------- --------
 2.20.18          SAS    performance  10000 block           836.9GB  838.4GB zeroed
 2.20.23          SAS    performance  10000 block           836.9GB  838.4GB zeroed
 2.21.17          SAS    performance  10000 block           836.9GB  838.4GB zeroed
 3.32.5           SAS    performance  15000 block           546.9GB  547.1GB zeroed
 3.32.7           SAS    performance  15000 block           546.9GB  547.1GB zeroed
 3.32.10          SAS    performance  15000 block           546.9GB  547.7GB zeroed
 3.33.23          SAS    performance  15000 block           546.9GB  547.7GB zeroed
 1.10.9           SSD    solid-state      - block           186.1GB  186.3GB zeroed
14 entries were displayed.
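
To double-check that matching spares (546.9GB, 15K SAS, block checksum, Pool0) are visible to the aggregate's owner, a sketch assuming this listing came from storage aggregate show-spare-disks and that the -original-owner filter is available in this release:

storage aggregate show-spare-disks -original-owner clp-lcl-fas8020b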


Noorain02

Have you tried manually unfailing the disk? The storage disk unfail command can be used to unfail it.

The following command (at the advanced privilege level) should unfail the disk and make it a spare.

cluster1::*> storage disk unfail -disk <disk name> -spare true
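
A minimal end-to-end sketch of that sequence, with a placeholder for the failed disk's name (the degraded RAID group only shows FAILED, so look the name up first):

cluster1::> set -privilege advanced
cluster1::*> storage disk show -container-type broken
cluster1::*> storage disk unfail -disk <failed disk name> -spare true
cluster1::*> set -privilege admin

Afterwards the disk should appear under storage disk show -container-type spare.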

jlavetan

After the case was escalated with NetApp, this was resolved.

It ended up being CIFS locks preventing the giveback.

storage failover show-giveback
               Partner
Node           Aggregate         Giveback Status
-------------- ----------------- ---------------------------------------------
<node>
               CFO Aggregates    Done
               aggr2_sas_fas8020b
                                 Failed: Operation was vetoed by
                                 lock_manager. Giveback vetoed: Giveback
                                 cannot proceed because non-continuously
                                 available (non-CA) CIFS locks are present on
                                 the volume. Gracefully close the CIFS
                                 sessions over which non-CA locks are
                                 established. Use the "vserver cifs session
                                 file show -hosting-aggregate <aggregate
                                 list> -continuously-available No" command to
                                 view the open files that have CIFS sessions
                                 with non-CA locks established.  <aggregate
                                 list> is the list of aggregates sent home as
                                 a result of the giveback operation. If lock
                                 state disruption for all existing non-CA
                                 locks is acceptable, retry the giveback
                                 operation by specifying "-override-vetoes
                                 true".  Warning: Overriding vetoes to
                                 perform a giveback can be disruptive.

Once I overrode the vetoes, the aggregate started rebuilding.

storage failover giveback -ofnode <node> -override-vetoes true
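
Before overriding the vetoes, the open files holding the non-CA locks can be reviewed with the command the veto message itself suggests; a sketch using the aggregate names shown earlier in the thread:

vserver cifs session file show -hosting-aggregate aggr2_sas_fas8020b -continuously-available No

Reconstruction progress can then be followed with:

storage failover show-giveback
storage aggregate show-status -aggregate aggr2_sas_clp_lcl_fas8020b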