Please let me ask a question here. A disk failed in the filer and spares are low. The reconstruction took a long time, about 7 hours, with a performance impact during the rebuild. The option raid.reconstruct.perf_impact is set to medium. There is a single RAID group, configured as RAID-DP with 27 x 144GB disks and a RAID group size of 27.
There is a NetApp KB article (Solution ID# 18300) that gives a rough rebuild time of about 10 hours for 500GB or 1TB ATA disks.
May I know how long RAID reconstruction should take for 144GB disks with an RG size of 27, and why performance was so seriously impacted?
Please kindly help me with an answer; your help is really appreciated.
The back-end FC-AL loops play an important role in both the reconstruction time and application performance during the rebuild. Let's assume all 27 disks are on one 4Gbps loop (unlikely, but useful as an example). They have to share the loop's bandwidth of about 400MB/s, which works out to roughly 14MB/s per disk on average. Rebuilding a 144GB disk at that rate takes about 10,000 seconds, or roughly 2.8 hours. But that's at full speed: with applications running and the rebuild throttled down, it will probably take much longer. Adding another loop speeds things up significantly, though I wouldn't expect it to double the speed.
Actually, you can only get about 400MB per second because 4Gb/s FC uses 8b10b encoding, which means you divide by 10 instead of 8 when converting bits to bytes. So the throughput of 4Gb/s FC (which actually runs at a 4.25Gb/s line rate) is closer to 400MB/s than 500MB/s.
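The arithmetic in the two replies above can be sketched in a few lines (assumptions: a single shared 4Gbps FC-AL loop, bandwidth split evenly across all 27 disks, rebuild running at full speed):

```python
# Back-of-envelope rebuild estimate for one shared 4Gbps FC-AL loop.
# 4GFC runs at a 4.25Gb/s line rate with 8b10b encoding, so usable
# bandwidth is about 4.25e9 / 10 bytes/s, i.e. roughly 400MB/s.
loop_mb_s = 400.0
disks = 27
per_disk_mb_s = loop_mb_s / disks        # ~14.8 MB/s per disk
disk_mb = 144 * 1000                     # 144GB disk
rebuild_s = disk_mb / per_disk_mb_s      # ~9,700 seconds
print(f"{per_disk_mb_s:.1f} MB/s per disk -> ~{rebuild_s / 3600:.1f} h best case")
```

This lands around 2.7 hours, matching the "roughly 2.8 hours" figure above once the per-disk rate is rounded down to 14MB/s, and it is a best-case floor: any throttling for application load stretches it out.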
One thing that can affect this: with RAID-DP (which you must be running to have 27 disks in an RG), I have been told that the first disk failure/rebuild runs at low priority regardless of the raid.reconstruct.perf_impact setting. That would explain some behavior I have seen during rebuilds of RAID-DP configurations. On a second disk failure in the same RG, the priority of both rebuilds gets bumped up to the configured setting.
The type of disk failure, the storage controller model, and the version of Data ONTAP will also impact rebuild times.
Most disk failures are 'soft' failures, where too many blocks have been flagged as bad. In Data ONTAP 7.1 and newer, such failing drives use Rapid RAID Recovery, which copies the good blocks directly to a spare drive. This significantly speeds up recovery time (up to 4x faster in some cases).
A truly dead drive from a hardware failure requires reconstructing all of its data from parity. On smaller systems, especially busy ones, this takes more time because the reconstruction process has to wait for the system load to drop to low-to-medium so as not to impose additional performance overhead. Large RAID groups also take longer because each stripe has more blocks to read when recomputing parity.
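To see why a Rapid RAID Recovery copy is so much cheaper than a full parity reconstruction, here is a toy comparison of how much data has to cross the back-end loop in each case. This is illustrative arithmetic only, not ONTAP internals; in practice the disks themselves, not the loop, are often the bottleneck, which is why the observed speedup is "up to 4x" rather than the raw ratio below:

```python
# Toy model: data moved over the back-end loop for each recovery type.
# (Illustrative only; real speedups depend on disk and CPU limits.)
disk_gb = 144
group_size = 27  # single RAID-DP group of 27 disks

# Rapid RAID Recovery: read the failing disk once, write the spare once.
copy_gb = 2 * disk_gb

# Full reconstruction: read all 26 surviving disks, write the spare.
recon_gb = (group_size - 1) * disk_gb + disk_gb

print(f"copy moves {copy_gb} GB, reconstruction moves {recon_gb} GB")
```

The copy touches 288GB versus 3,888GB for the reconstruction, which also shows why larger RAID groups make a full rebuild take longer: every extra disk in the group is another 144GB that must be read to recompute each stripe.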