How to replace multiple disks in the same RAID?

alessice

Hi,

I have a FAS2554 (cDOT 8.3.2P12) with 20x4TBx4 + 400SSD. Three of the SATA disks (X477_HMKPX04TA07) log too many errors every day, so NetApp Support has shipped me three new disks. These disks are also affected by the issue described here:

 

https://kb.netapp.com/app/answers/answer_view/a_id/1091466

 

Since these disks are in the same RAID-DP group but are not failed or prefailed, what is the right procedure to replace them?

 

NetApp Support told me via email to remove one disk, wait at least 1 minute but no more than 2 minutes, and insert the new one. But with this method the RAID group will be degraded and I must wait for the rebuild, which I think will take a long time.

 

The NetApp onsite support engineer told me that yes, it is possible to remove a disk and insert the new one, even more than one disk at the same time, but if I have time to wait, the best procedure is to prefail a single disk and wait for it to become "failed", with the command:

 

storage disk fail 1.0.13

then wait for the RAID rebuild to complete, replace the disk, and start with the next one. What is the command to monitor the status of the RAID rebuild?

 

Since we have to replace three disks and I want to follow the safest procedure, what do you suggest?

 

Thanks

1 ACCEPTED SOLUTION

andris

The best way to think about it...

"replace" allows you to control which spare disk is used when replacing the active (failing) disk.

"fail" selects any available matching spare (we're assuming you have one available).

 

If you don't mind using an existing spare to start replacing the first failing disk, then just start serially "replacing": remove the bad drive, insert the new drive, and make sure it gets assigned and becomes a spare.

If you really want to replace all three bad drives with your new replacements, you'll need to swap an existing spare with one of your shiny new drives, first - then use "replace".

9 REPLIES

aborzenkov

Use "disk replace" to copy the contents of each disk to a spare without degrading the RAID group. Then physically replace the disk, which is no longer part of the RAID group.
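
As a rough sketch of this copy-then-swap approach, the clustered ONTAP commands might look like the following. The disk names are assumptions: 1.0.13 is the suspect disk from the original question, and 1.0.20 stands in for whichever matching spare is chosen. Lines starting with # are annotations for the reader, not input. Check the "storage disk replace" man page for your release before running anything, especially on a system with partitioned disks.

```
# List the available spares and pick one of matching type and size
storage aggregate show-spare-disks

# Start copying the suspect disk to the chosen spare
# (1.0.20 is a hypothetical spare name, not taken from this thread)
storage disk replace -disk 1.0.13 -replacement 1.0.20 -action start
```

Once the copy finishes, the original disk should no longer be part of the RAID group and can be pulled and swapped for the new drive.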

alessice

Thanks,

 

I understand that I have two possibilities:

"storage disk fail" and "storage disk replace" do almost the same thing: both use Rapid RAID Recovery to copy data to a spare disk. At the end, the first sets the source disk to "failed" and the second makes it a "spare".

Is the time required to copy the data to the spare the same for both?

At the end of either one, is it enough to remove the disk and insert the new one, with ONTAP taking the necessary actions automatically? Or, since with "replace" the old disk becomes a spare, must I remove ownership or do something similar?

Sorry, but this is the first time I have replaced disks on a NetApp system.

Thanks

aborzenkov

@alessice wrote:

"storage disk fail" and "storage disk replace" do almost the same thing


No, they do not (at least they did not in the past). "Fail" actually fails the disk, and its contents must be rebuilt from the other disks in the RAID group. "Replace" copies the contents of the disk to another disk. That means less load, it goes faster, and you do not lose redundancy during the process.

 

Spare or failed disks can simply be pulled out (for rotational disks, leave them in the enclosure for half a minute to spin down before removing them completely). You may need to assign the replacement disk manually, depending on your current settings.
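
A minimal sketch of how the ownership check and manual assignment could be done in clustered ONTAP. The disk name 1.0.13 and node name cluster1-01 are placeholders rather than values confirmed in this thread, and lines starting with # are annotations only.

```
# Check whether automatic ownership assignment is enabled on each node
storage disk option show

# List disks that do not have an owner yet
storage disk show -container-type unassigned

# Assign the new disk manually if it was not picked up automatically
# (disk and node names are hypothetical)
storage disk assign -disk 1.0.13 -owner cluster1-01

# Confirm the new disk now shows up as a spare
storage aggregate show-spare-disks
```

If auto-assignment is enabled, the assign step is usually unnecessary and the new disk should appear as a spare on its own after a few minutes.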

alessice

@aborzenkov wrote:

@alessice wrote:

"storage disk fail" and "storage disk replace" do almost the same thing


No, they do not (at least they did not in the past). "Fail" actually fails the disk, and its contents must be rebuilt from the other disks in the RAID group. "Replace" copies the contents of the disk to another disk. That means less load, it goes faster, and you do not lose redundancy during the process.


Reading the NetApp KB, I think that "fail" and "replace" now do the same thing: copy data from the source disk to a spare without putting the RAID group in a degraded state:

 

The procedure to be performed to pre-fail a drive covers the following steps:

  • Copying the contents to an available spare
  • Failing the questionable drive administratively, without putting the RAID group in a degraded state

 

https://kb.netapp.com/app/answers/answer_view/a_id/1035874/~/how-to-remove-and-replace-a-data-disk-that-is-about-to-fail-

 

https://kb.netapp.com/app/answers/answer_view/a_id/1032432/~/how-to-pre-fail-a-disk-in-data-ontap-8.1.x-cluster-mode-
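
For comparison, here is a sketch of the two forms of the fail command as documented in the clustered ONTAP command reference. This is an illustration of the distinction discussed above, not a statement about what the KB articles recommend; verify the -immediate parameter against the man page for your release. Lines starting with # are annotations only.

```
# Default form: prefail the disk, copy its contents to a spare,
# then mark it failed once the copy is done
storage disk fail -disk 1.0.13

# Immediate form: fail the disk right away; the RAID group is
# degraded and the data is reconstructed from parity
storage disk fail -disk 1.0.13 -immediate true
```

Under that reading, the older "fail rebuilds from parity" behavior corresponds to the immediate form, while the default form behaves much like "replace" except that ONTAP picks the spare and the source disk ends up failed.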

paul_stejskal

This is kind of a complicated one. I would do the following:

1) Check which disk is worst by looking at either the event logs or statit. By statit I mean use that command to check the performance of the drives and see which is the slowest of the three.
2) Do one disk at a time. It may take quite a few hours.

 

https://kb.netapp.com/app/answers/answer_view/a_id/1029897/~/how-to-assess-disk-level-response-times-

 

If you have to do two at once then do it, but I would see if you can do one at a time.

alessice

@paul_stejskal wrote:

This is kind of a complicated one. I would do the following:

1) Check which disk is worst by looking at either the event logs or statit. By statit I mean use that command to check the performance of the drives and see which is the slowest of the three.
2) Do one disk at a time. It may take quite a few hours.

 

https://kb.netapp.com/app/answers/answer_view/a_id/1029897/~/how-to-assess-disk-level-response-times-

 

If you have to do two at once then do it, but I would see if you can do one at a time.


Thanks @paul_stejskal 

 

For sure I will swap one disk at a time. I have already found the "worst" disk, and it will be the first.

 

Do you suggest using "fail" or "replace"?

The official NetApp KB always suggests "fail":

https://kb.netapp.com/app/answers/answer_view/a_id/1035874/~/how-to-remove-and-replace-a-data-disk-that-is-about-to-fail-

 

Thanks

alessice

I chose to fail the disks with:

 

storage disk fail 1.0.13

 

and the first disk was copied to a spare in 16 hours. When the process was complete, the disk had its orange LED on and I swapped the disk. After a few minutes the new disk was visible as a partitioned spare with:

 

storage aggregate show-spare-disks

 

So I started a new "disk fail", which is now in progress.

Thanks to all.
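
On the earlier question of how to monitor progress, one possible approach is sketched below. The aggregate name aggr1 is a placeholder, and the exact wording of the status output varies by release, so none is shown here. Lines starting with # are annotations only.

```
# Show the RAID layout and per-disk status of the aggregate;
# a disk being copied or reconstructed should be flagged here
storage aggregate show-status -aggregate aggr1

# Once the copy has finished, the prefailed disk should be listed as broken
storage disk show -container-type broken

# After swapping the drive, confirm the new disk is available as a spare
storage aggregate show-spare-disks
```

Waiting until the old disk appears as broken and the new one as a spare before starting the next "disk fail" keeps the procedure strictly one disk at a time.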

alessice

I have replaced all three disks, and the last disk I swapped in is now the new spare.

Everything works fine.

Thanks