Please let me ask a question here. A disk failed in the filer and spares are low. The reconstruction took a long time, about 7 hours, with a performance impact during the rebuild. The option raid.reconstruct.perf_impact is set to medium. There is a single RAID group, configured as RAID-DP with 27 x 144GB disks and a RAID group size of 27.
There is a NetApp KB article (Solution ID# 18300) that gives a rough rebuild time of about 10 hours for 500GB or 1TB ATA disks.
May I know how long RAID reconstruction should take for 144GB disks with an RG size of 27, and why performance was so seriously impacted?
Please kindly help me with an answer; your help is really appreciated.
The back-end FC-AL loops play an important role in both the reconstruction time and application performance during the rebuild. Let's assume all 27 disks are on one 4Gbps loop (unlikely, but useful as an example). They have to share the loop's bandwidth of about 400MB/s, which works out to roughly 14MB/s per disk on average. Rebuilding a 144GB disk at that rate takes about 10,000 seconds, or roughly 2.8 hours. But that's at full speed: with applications running and the rebuild throttled down, it will probably take much longer. Adding another loop speeds things up significantly, though I wouldn't expect it to double the speed.
Actually, you can only get about 400MB per second because 4Gb/s FC uses 8b10b encoding, which means you divide by 10 instead of 8 when converting bits to bytes. So the throughput of 4Gb/s FC (which actually runs at a 4.25Gb/s line rate) is closer to 400MB/s than 500MB/s.
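The arithmetic in the two replies above can be sketched in a few lines (assumptions: a single shared 4Gbps FC-AL loop, bandwidth split evenly across all 27 disks, rebuild running at full speed):

```python
# Back-of-envelope rebuild estimate for one shared 4Gbps FC-AL loop.
# 4GFC runs at a 4.25Gb/s line rate with 8b10b encoding, so usable
# bandwidth is about 4.25e9 / 10 bytes/s, i.e. roughly 400MB/s.
loop_mb_s = 400.0
disks = 27
per_disk_mb_s = loop_mb_s / disks        # ~14.8 MB/s per disk
disk_mb = 144 * 1000                     # 144GB disk
rebuild_s = disk_mb / per_disk_mb_s      # ~9,700 seconds
print(f"{per_disk_mb_s:.1f} MB/s per disk -> ~{rebuild_s / 3600:.1f} h best case")
```

This lands around 2.7 hours, matching the "roughly 2.8 hours" figure above once the per-disk rate is rounded down to 14MB/s, and it is a best-case floor: any throttling for application load stretches it out.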
One thing that can affect this: with RAID-DP (which you must be running to have 27 disks in an RG), I have been told that the first disk failure/rebuild runs at low priority regardless of the raid.reconstruct.perf_impact setting. That would explain some behavior I have seen during rebuilds of RAID-DP configurations. On a second disk failure in the same RG, the priority of both rebuilds gets bumped up to the configured setting.
The type of disk failure, the storage controller model, and the version of Data ONTAP will also impact rebuild times.
Most disk failures are 'soft' failures, where too many blocks have been flagged as bad. In Data ONTAP 7.1 and newer, such failing drives use Rapid RAID Recovery, which copies the good blocks directly to a spare drive. This significantly speeds up recovery time (up to 4x faster in some cases).
A truly dead drive from a hardware failure requires reconstructing all of its data from parity. On smaller systems, especially busy ones, this takes more time because the reconstruction process has to wait for the system load to drop to low-to-medium so as not to impose additional performance overhead. Large RAID groups also take longer because each stripe has more blocks to read when recomputing parity.
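To see why a Rapid RAID Recovery copy is so much cheaper than a full parity reconstruction, here is a toy comparison of how much data has to cross the back-end loop in each case. This is illustrative arithmetic only, not ONTAP internals; in practice the disks themselves, not the loop, are often the bottleneck, which is why the observed speedup is "up to 4x" rather than the raw ratio below:

```python
# Toy model: data moved over the back-end loop for each recovery type.
# (Illustrative only; real speedups depend on disk and CPU limits.)
disk_gb = 144
group_size = 27  # single RAID-DP group of 27 disks

# Rapid RAID Recovery: read the failing disk once, write the spare once.
copy_gb = 2 * disk_gb

# Full reconstruction: read all 26 surviving disks, write the spare.
recon_gb = (group_size - 1) * disk_gb + disk_gb

print(f"copy moves {copy_gb} GB, reconstruction moves {recon_gb} GB")
```

The copy touches 288GB versus 3,888GB for the reconstruction, which also shows why larger RAID groups make a full rebuild take longer: every extra disk in the group is another 144GB that must be read to recompute each stripe.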