Aggregate Best Practices

(Cross-posted on the Toasters list as well).

We're revisiting how we set up our aggregates and I want to see how others out there do it.  Specifically, what strategies do you use to ensure that key applications or environments get the performance they need in a shared environment?

Typically we'll create large aggregates based on a homogeneous disk type: 15K SAS disks in one aggregate, SATA in another.  In some cases, when there's only a single type of disk, we'd have 60 15K disks in one aggregate and 60 in the other (one aggregate assigned to each controller).

The idea here is that more spindles give us more performance.  However, some applications/workloads are more important than others, and some can be "bullies" that impact the important stuff.  Ideally we'd try to keep our OLTP random workloads on one filer and heavy sequential workloads on another (maybe dedicated).

We've also been discussing creating multiple, smaller aggregates that we then assign to specific workloads, guaranteeing those spindles for those workloads.  Lower possible maximum performance, but better protection against "bullies"[1].
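To make that trade-off concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the per-disk IOPS figure is a generic rule of thumb rather than a measurement, the 40/20 split and the bully's demand are hypothetical, and parity disks, caching and the shared controller are all ignored.

```python
# Rough spindle arithmetic. Every number here is an assumption for illustration.
IOPS_PER_15K_DISK = 175  # assumed rule-of-thumb figure, not a measured value


def spindle_iops(disks: int, iops_per_disk: int = IOPS_PER_15K_DISK) -> int:
    """Random IOPS a group of disks could sustain under this simple linear model."""
    return disks * iops_per_disk


def left_for_others(pool_iops: int, bully_iops: int) -> int:
    """IOPS remaining in a shared pool once a bully takes its demand (never below zero)."""
    return max(pool_iops - bully_iops, 0)


bully_demand = 6000  # hypothetical sequential workload

# Layout A: one 60-disk aggregate shared by every workload.
shared_peak = spindle_iops(60)
shared_floor = left_for_others(shared_peak, bully_demand)

# Layout B: the same 60 disks split 40/20 between OLTP and the bulk workload.
# The bully can only saturate its own 20 disks, so OLTP keeps all 40 spindles.
oltp_peak = spindle_iops(40)
oltp_floor = oltp_peak

print(f"shared:    OLTP peak {shared_peak}, floor while bully runs {shared_floor}")
print(f"dedicated: OLTP peak {oltp_peak}, floor while bully runs {oltp_floor}")
```

Under these assumed numbers the shared layout has the higher ceiling and the dedicated layout has the higher floor, which is the trade-off described above. It says nothing about contention on the controller itself.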

I also know ONTAP has some I/O QoS options.  I'm less inclined to go that direction however.

Our workloads tend to be ESX VMs using the filers as NFS datastores.

We have the usual budgetary / purchasing cycle constraints, so trying to minimize pain for as long as possible until we can add resources.

How do folks out there handle this?



[1] Controller is obviously still a shared resource.

Re: Aggregate Best Practices

Hi Ray,

You will find that NetApp controllers do a good job of handling mixed workloads. Certainly, having dedicated aggregates can help keep I/O segregated and reserve spindles for specific applications. However, the test results I have read show that doing this didn't necessarily improve performance. In some cases it actually hurt performance.

QoS with FlexShare is a good solution for solving a specific problem. Using QoS as a general rule tends to create more problems than it solves. As your environment grows, you may find that QoS actually hinders performance. The recommendation is to use QoS only to solve a specific issue, and only when NetApp support recommends it.

The thing that will help performance the most is increasing the amount of FlashCache in the controller. This allows all volumes and aggregates to read from "memory" most of the time. If you identify issues down the line, you can migrate volumes to other controllers to balance the workload and get them away from "bullies", or just move the offending volume. Clustered ONTAP allows you to do this non-disruptively and would be something to consider.
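As a rough illustration of why read caching takes pressure off shared spindles, the sketch below works out how many reads still reach disk at a given cache hit rate. The read load and the hit percentages are hypothetical, and the model ignores writes and everything else the controller does.

```python
def disk_reads(total_reads: int, cache_hit_pct: int) -> int:
    """Reads per second that miss the cache and land on the spindles."""
    if not 0 <= cache_hit_pct <= 100:
        raise ValueError("cache_hit_pct must be between 0 and 100")
    return total_reads * (100 - cache_hit_pct) // 100


total = 5000  # hypothetical read IOPS across all volumes
for hit_pct in (0, 50, 80, 95):
    print(f"{hit_pct:3d}% cache hits -> {disk_reads(total, hit_pct)} reads/s reach disk")
```

The more reads the cache absorbs, the less any one volume's read traffic competes for the disks behind the aggregate.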

Another good method to meet the performance needs of active volumes is FlashPools. These are made up of flash drives and traditional spinning disks. Moving those "bullies" onto a FlashPool will alleviate the impact of those volumes on other workloads that share the aggregate.

The most important message here is not to overthink things. Use our recommendations (I hate the phrase "Best Practice") to configure the storage to begin with, then address issues as they arise. We spend a lot of time testing systems with a variety of workloads and base our recommendations on that experience as well as the experience of our customers. Doing too much to try to avoid a problem may actually create a problem down the road.