ONTAP Hardware

Raid group size recommendation

janjacobbakker
73,008 Views

Hi Folks,

I'm trying to design a storage solution.

A FAS3020 with 3 shelves full with 42x 300GB FC disks.

By default the RAID group size is 16 (14 + 2 spare); the max RAID group size is 28.

Does anyone have some best practice information? The NOW site hasn't much info on that.

Kind regards

71 REPLIES

paulstringfellow
7,311 Views

Hi Eric,

You’re spot on…it’s a balance between needs of capacity/performance versus risk aversion…

And as you say for the cost of one disk it may not be worth the risk…however for some people especially in the 2020 kind of end of the scale, they need the capacity equally…

But you’re right it’s a matter of opinion really and as you said based on risk assessment…

The lovely flexibility of NetApp ☺

Sadly, Paul is also based in the UK…it’s the other guy who’s in Cyprus!!!

jwhite
7,311 Views

Hello,

NetApp revised the best practice sparing policy last year to make it more logical and applicable to the range of configurations we see in our customer base.  There is only a single configuration in which we recommend using only a single spare drive - and that is a FAS2000 series (aka "entry" level systems) that is using only internal drives (no external storage attached).  As some have pointed out already, the number of spares to keep on hand varies depending on what you are concerned about with the configuration.  Here is the updated spares policy - the "official" NetApp best practice sparing policy:

------------------------------------------------------------------------------------------------------------------

HOW MANY HOT SPARES SHOULD I KEEP IN MY STORAGE CONFIGURATION?

Recommendations for spares vary by configuration and situation. In the past, NetApp has based spares recommendations strictly on the number of drives attached to a system. This is certainly an important factor, but it's not the only consideration. NetApp storage systems are deployed in a wide range of configurations. This warrants defining more than a single approach to determining the appropriate number of spares to maintain in your storage configuration.

Depending on the requirements of your storage configuration, you can choose to tune your spares policy toward one of the following approaches:

  • Minimum spares: In configurations where drive capacity utilization is a key concern, you might want to use only the minimum number of spares. This option allows you to survive the most basic failures. If multiple failures occur, it might be necessary to manually intervene to make sure of continued data integrity.
  • Balanced spares: This configuration is the middle ground between minimum and maximum. It assumes that you will not encounter the worst-case scenario and provides sufficient spares to handle most failure scenarios.
  • Maximum spares: This option makes sure that enough spares are on hand to handle a failure situation that demands the maximum number of spares that could be consumed by the system at a single time. The term maximum doesn't mean that the system might not operate with more than this recommended number of spares. You can always add additional hot spares within the spindle limits as you deem appropriate.

In the table below, consider each "approach" as the starting number of spares that is then modified by the "special considerations" as appropriate.

For RAID-DP configurations, consult the following table for the recommended number of spares.

Recommended Number of Spares

  Minimum               Balanced               Maximum
  Two per Controller    Four per Controller    Six per Controller

Special Considerations

  Entry platforms       Entry-level platforms using only internal drives can be reduced to using a minimum of one hot spare.
  RAID groups           Systems containing only a single RAID group do not warrant maintaining more than two hot spares for the system.
  Maintenance Center    Maintenance Center requires a minimum of two spares to be present in the system.
  >48-hour lead time    For remotely located systems, there is an increased chance that they might encounter multiple failures and completed reconstructions before manual intervention can occur. Spares recommendations should be doubled for these systems.
  >1,200 drives         For systems using more than 1,200 drives, an additional two hot spares should be added to the recommendations for all three approaches.
  <300 drives           For systems using less than 300 drives, you can reduce spares recommendations for a balanced or maximum approach by two.

Selecting any one of the three approaches (minimum, balanced, or maximum) is considered to be the best practice recommendation within the scope of your system requirements. The majority of storage architects will probably choose the balanced approach, although customers who are extremely sensitive to data integrity might warrant taking a maximum spares approach. Given that entry platforms use small numbers of drives, a minimum spares approach would be reasonable for those configurations.

Additional notes about hot spares:

  • Spares recommendations are for each drive type installed in the system.
  • Larger capacity drives can serve as spares for smaller capacity drives (they will be downsized).
  • Slower drives replacing faster drives of the same type affect RAID group and aggregate performance. For example, if a 10k rpm SAS drive (DS2246) replaces a 15k rpm SAS drive (DS4243), this results in a suboptimal configuration.
  • Although FC and SAS drives are equivalent from a performance perspective, the resiliency features of the storage shelves in which they are offered are very different. By default, Data ONTAP uses FC and SAS drives interchangeably. This can be prevented by setting the RAID option raid.disktype.enable on.

NetApp does not discourage administrators from keeping cold spares on hand. NetApp recommends removing a failed drive from a system as soon as possible, and keeping cold spares on hand can speed the replacement process for those failed drives. However, cold spares are not a replacement for keeping hot spares installed in a system.

Cold spares can replace a failed part (speeding the return/replace process), but hot spares serve a different purpose: to respond in real time to drive failures by providing a target drive for RAID reconstruction or rapid RAID recovery actions. It's hard to imagine an administrator running into a lab to plug in a cold spare when a drive fails. Cold spares are also at greater risk of being “dead on replacement,” as drives are subjected to the increased possibility of physical damage when not installed in a system. For example, handling damage from electrostatic discharge can occur when retrieving a drive to install in a system.

Given the different purpose of cold spares versus hot spares, you should never consider cold spares as a substitute for maintaining hot spares in your storage configuration.

The RAID option raid.min_spare_count can be used to specify the minimum number of spares that should be available in the system. This is effective for Maintenance Center users, because when set to the value 2 it notifies the administrator if the system falls out of Maintenance Center compliance. NetApp recommends setting this value to the resulting number of spares that you should maintain for your system (based on this spares policy) so that the system notifies you when you have fallen below the recommended number of spares.
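
To make the table above easier to apply, here is a rough Python sketch of the arithmetic. It is not a NetApp tool. The names `SystemProfile` and `recommended_spares` are made up for illustration, and the order in which the special considerations are applied is my own assumption, since the policy text does not spell one out. The result is per controller and per drive type.

```python
from dataclasses import dataclass

# Starting points per controller, taken from the policy table above.
BASE_SPARES = {"minimum": 2, "balanced": 4, "maximum": 6}


@dataclass
class SystemProfile:
    """Hypothetical description of one controller's drives of a single type."""
    total_drives: int
    raid_groups: int
    entry_internal_only: bool = False   # entry platform using internal drives only
    maintenance_center: bool = False    # Maintenance Center is in use
    lead_time_hours: float = 0.0        # time before someone can intervene on site


def recommended_spares(approach: str, system: SystemProfile) -> int:
    """Apply the special considerations to the base number (assumed order)."""
    spares = BASE_SPARES[approach]

    # Drive-count adjustments.
    if system.total_drives > 1200:
        spares += 2
    elif system.total_drives < 300 and approach in ("balanced", "maximum"):
        spares -= 2

    # Remote systems with more than 48 hours of lead time: double.
    if system.lead_time_hours > 48:
        spares *= 2

    # Caps and floors (assumption: applied last, floor wins over cap).
    if system.raid_groups == 1:
        spares = min(spares, 2)
    if system.entry_internal_only:
        spares = min(spares, 1)
    if system.maintenance_center:
        spares = max(spares, 2)
    return spares


# The 42-drive system from the original post; three RAID groups is an assumption.
fas3020 = SystemProfile(total_drives=42, raid_groups=3)
for approach in BASE_SPARES:
    print(approach, recommended_spares(approach, fas3020))
```

For that example the sketch gives 2 (minimum), 2 (balanced) and 4 (maximum). Whatever number you settle on is the value the policy suggests using for raid.min_spare_count.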

aflores_ibm
7,311 Views

I would add some details to the information provided below. You may set "raid.min_spare_count" to 0, 1, 2 or more, but if you do so, I'd recommend changing the following as well: "raid.timeout". This option is usually set to 24, which represents the number of hours before the system preemptively auto-shuts down once it no longer meets the RAID/disk options set.

In other words, if your number of available spares [aggr status -s | vol status -s] goes below the number of required spares, then your system will run in degraded mode until you meet the stated requirements. If you are unable to satisfy those requirements before the time limit has passed, the system will auto-shutdown to prevent any potential data loss.

That being said, you should calculate your own requirements based on:

  • Type of disks
  • Type of RAID
  • Size of RAID
  • Data risk assessment: can the system suffer a shutdown without impact to the business?
      Yes: for how long? At what time of the day/night?
      No: critical system
  • What type of hardware warranty and support exists or needs to be set up: 24/7 - 4 hours, or 8am - 5pm, business days only [might still be critical]

This is only a high-level overview. At this point, the risk(s) would have to be identified and a series of contingencies provided for review and approval by the stakeholders, based on the initial requirements they stated.

Hope this helps as well.

Regards,
Allain Flores
Storage Consultant
Enterprise Storage Management - CDC
IBM Global Services

(Sent by email in reply to jwhite's message dated 09/28/2011 12:54 PM.)

aborzenkov
6,652 Views

"In other words, if your number of available spares [aggr status -s | vol status -s] goes below the number of required spares, then your system will run in degraded mode until you meet the stated requirements. If you are unable to satisfy those requirements before the time limit has passed, the system will auto-shutdown to prevent any potential data loss."

Sorry, but this is incorrect. Degraded mode means a RAID group without protection (i.e. a single disk missing in RAID4 or two disks missing in RAID-DP). The number of spare disks does not contribute to degraded status, and the system will not shut down if the number of spares is low.

jwhite
6,653 Views

That is very true. Although the system will nag you about being below the minimum spare count, it will not shut down the system because you don't have enough spares. Degraded Mode describes a system that has one or more failed drives and describes the fact that system resources are being used to repair the drive (be it a Rapid RAID Recovery or RAID reconstruction). Degraded Aggregate describes an aggregate that contains one or more failed drives. Degraded RAID group describes a RAID group that contains one or more failed drives. That is the common usage of "Degraded" as it pertains to the storage subsystem today.

aflores_ibm
6,652 Views

Sorry,

Got sidetracked on projects.

Clarification on raid.timeout from the command manual:

raid.timeout

Sets the time, in hours, that the system will run after a single disk failure in a RAID4 group or a two disk failure in a RAID-DP group has caused the system to go into degraded mode or double degraded mode respectively. The default is 24, the minimum acceptable value is 0 and the largest acceptable value is 4,294,967,295. If the raid.timeout option is specified when the system is in degraded mode or in double degraded mode, the timeout is set to the value specified and the timeout is restarted. If the value specified is 0, automatic system shutdown is disabled.

I'd bring attention to the last sentence in regards to the automatic system shutdown...

Regards,
Allain Flores
Storage Consultant
Enterprise Storage Management - CDC
IBM Global Services
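
As a reading aid for the manual excerpt above, here is a minimal Python sketch of when the raid.timeout countdown would be running. It only encodes what the excerpt states; the function name and the `raid4` / `raid_dp` labels are made up for illustration, and this is not how Data ONTAP implements it internally.

```python
# Failed disks at which the manual excerpt says the countdown starts:
# one in a RAID4 group (degraded mode), two in a RAID-DP group (double degraded mode).
COUNTDOWN_AT = {"raid4": 1, "raid_dp": 2}


def shutdown_countdown_active(raid_type: str, failed_disks: int,
                              raid_timeout_hours: int = 24) -> bool:
    """Return True if, per the excerpt, the automatic shutdown timer is running."""
    if raid_timeout_hours == 0:
        # A value of 0 disables automatic system shutdown.
        return False
    return failed_disks == COUNTDOWN_AT[raid_type]


print(shutdown_countdown_active("raid4", 1))       # single failure in RAID4
print(shutdown_countdown_active("raid_dp", 1))     # single failure in RAID-DP
print(shutdown_countdown_active("raid_dp", 2))     # double failure in RAID-DP
print(shutdown_countdown_active("raid_dp", 2, 0))  # timeout disabled
```

Note that the number of available spares is not an input at all, which matches the correction made earlier in the thread.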

(Sent by email in reply to aborzenkov's message dated 09/29/2011 12:18 AM.)

ventap1111
6,651 Views

I'll try to do your question justice! I was looking at this from two aspects: performance and long-term capacity. While the system does indeed have 42 disks today, tomorrow it may need additional capacity. So, by choosing a 15-disk RAID group, I'm assuring myself not only of the most efficient RG design, I'm also committing to the maximum amount of space.

JDMARINOALR
6,651 Views

Hi There...

Wondering if anyone has updated data for 900GB SAS drives. I am looking to create a 23-disk aggregate on a 3210 running 8.0.3. Going by the aggregate max (50 TB), this should be a good config. I don't really want to waste 4 disks in this config on parity.

thanks!

radek_kubka
6,651 Views

Hi John,

There are no massive changes re RG size recommendations:

Theoretically it is possible to have a RAID-DP aggregate with 23x (or even 28x) 900GB drives in one RG. However, best practice suggests keeping the RG size no bigger than 20.

Regards,
Radek
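
To put numbers on the parity trade-off discussed above, here is a small Python sketch. It assumes RAID-DP with two parity disks per RAID group and that the aggregate uses as few RAID groups as the chosen RG size allows; the function name `raid_dp_layout` is made up for illustration, and spares are not counted.

```python
import math


def raid_dp_layout(disks: int, max_rg_size: int, parity_per_group: int = 2) -> dict:
    """Count RAID groups, parity disks and data disks for an aggregate."""
    groups = math.ceil(disks / max_rg_size)   # fewest groups that fit all disks
    parity = groups * parity_per_group        # RAID-DP: two parity disks per group
    return {
        "raid_groups": groups,
        "parity_disks": parity,
        "data_disks": disks - parity,
    }


# 23 disks in a single RG versus an RG size capped at 20.
for rg_size in (23, 20):
    print(rg_size, raid_dp_layout(23, rg_size))
```

With an RG size of 23 the aggregate is one RAID group with 2 parity and 21 data disks; capping the RG size at 20 forces a second RAID group, so 4 parity and 19 data disks. That is the "4 disks to parity" cost mentioned in the question.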
