Aggregate size and "overhead" and % free rules of thumb.

What is the best rule of thumb for the total usage of your aggregates?

I know you need a % free for upgrading to DOT 7.3+ but there used to also be a rule of thumb for not using 100% of your aggregate.

Is this still true?

What is a good % free for your aggregates?

With large aggregates (16TB+) if you keep 5% free that's a huge amount of stranded storage.  Is this on the radar to "clean up"?

Re: Aggregate size and "overhead" and % free rules of thumb.

This is what I've heard:

- less than 80% full recommended

- never go above 90% full

At the end of the day an aggregate is a file system, so filling it up totally is not a good thing.


Re: Aggregate size and "overhead" and % free rules of thumb.

For a rule of thumb I agree.

For the 7.3 upgrade issues, as long as your aggregate snap reserve is at 3% or higher (default is 5% aggregate snap reserve but I usually set to 3%) you'll be fine.

It is almost more of an art than a science -- as the aggregate gets fuller you'll have performance degradation. On a SATA aggregate where performance may not be a big deal to begin with, you could probably run it pretty full.

Re: Aggregate size and "overhead" and % free rules of thumb.

Some things to keep in mind on this topic.

From what I understand, aggregate snap reserve does *not* count as free space for the purposes of the 7.3 upgrade.  The free space needs to be in the active space of the aggregate.  The 3% rule is just a rule of thumb; I believe the actual calculation is more like 0.5% of the total volume space before the upgrade.  If you send AutoSupports, the Upgrade Advisor from Tech Support (or the NOW site if you are a Premium Customer) does the real calculation for you.

As far as aggregate space, yes, the 80% / 90% rule is pretty good, but that's not as easy to calculate as you might think unless you are entirely thin provisioned, which most customers are not (most for perfectly good reasons).  But let's assume you are not thin provisioned.  Just looking at df -A doesn't really tell you the whole story and may be way too conservative, since just by creating an empty volume, that space will move to the "used" column on df.  Yet, for the purposes of performance, that space is not used since those blocks are not actually allocated yet.  So a more accurate way to compute it, in my mind, would be to add up the used space for all of the volumes of that aggregate, then compare *that* # to the total space of the aggregate given in df -A, then do the math to figure out the %.  I suppose as a worst case, you could leave 10-20% of your aggregate unallocated, but as stated, as aggregates get bigger (and 8.0 will allow them to be much bigger), that stings and may not be necessary.  This is why some customers are taking on the burden of monitoring their aggregate space and are moving to thin provisioned volumes, because of the efficiencies that can be gained here.  Now you know exactly how much space you are really using in your aggregate and can grow it accordingly.  The caveat is that it's up to you to ensure that it doesn't fill up and stop writes to all of the volumes.
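
The arithmetic described above (sum the used space of every volume, then divide by the aggregate's total size) can be sketched in a few lines of Python. This is only a rough illustration: the volume names and KB figures are made up, and the function name aggregate_fill_percent is hypothetical. On a real system the inputs would be read from df for each volume and df -A for the aggregate.

```python
# Hypothetical figures in KB. None of these numbers come from a real system;
# in practice they would be read from `df` (volumes) and `df -A` (aggregate).

def aggregate_fill_percent(volume_used_kb, aggregate_total_kb):
    """Percentage of the aggregate holding data actually written in its volumes."""
    if aggregate_total_kb <= 0:
        raise ValueError("aggregate_total_kb must be positive")
    return 100.0 * sum(volume_used_kb) / aggregate_total_kb


# Used space per volume (illustrative). An empty volume contributes nothing,
# even though its full size may show up as "used" at the aggregate level.
volume_used_kb = {"vol_home": 300_000_000, "vol_db": 450_000_000, "vol_empty": 0}
aggregate_total_kb = 1_000_000_000

fill = aggregate_fill_percent(volume_used_kb.values(), aggregate_total_kb)
print(f"{fill:.1f}% of the aggregate holds written data")
```

Comparing this figure, rather than the "used" column of df -A, against the 80% / 90% guideline avoids counting space that has been handed to volumes but never written.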

Just some thoughts.

Re: Aggregate size and "overhead" and % free rules of thumb.

I guess that's my biggest issue with the dark magic that is aggregate management.  Aggregates are getting bigger, but we're not seeing much more efficiency in how they are managed or utilized.  If you've got a 16TB aggregate today, you're losing between 1.6 and 3.2 TB if you go by the NetApp rule of thumb.  NetApp is a very conservative company, and I run hot aggregates up to 95% usage.  That's still 800GB of, for all intents and purposes, wasted space.  I'll shelve the DOT 8.0 discussion as it's an entirely different animal on our march to GX.
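
For anyone who wants to redo that math for other aggregate sizes, here is a small Python sketch. The helper name stranded_tb is hypothetical, and it assumes the fill ceiling is simply a percentage of the aggregate's usable size, with 1 TB counted as 1000 GB as in the figures above.

```python
def stranded_tb(aggregate_tb, max_fill_percent):
    """Space (in TB) left unused if the aggregate is never filled past max_fill_percent."""
    if not 0 <= max_fill_percent <= 100:
        raise ValueError("max_fill_percent must be between 0 and 100")
    return aggregate_tb * (100 - max_fill_percent) / 100


aggregate_tb = 16  # aggregate size used in this post

# 80% and 90% are the rule-of-thumb ceilings from earlier in the thread,
# 95% is the ceiling mentioned in this post.
for ceiling in (80, 90, 95):
    print(f"{ceiling}% ceiling: {stranded_tb(aggregate_tb, ceiling):.1f} TB left free")
```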

I guess this question sprang up because storage reporting is a huge PITA when you need to look at aggregate allocated size (the space given to volumes), aggregate actual size (how big your aggregate is after carving up RAID groups, etc.), and the amount of space actually used in each volume.  You need to subtract the used volume space from the allocated aggregate space to get actual frame utilization.  Then you need to take into account the multiple TB of wasted space on top of your aggregate reserve.  Don't even get me started on DFM's reporting when it comes to mirrored disk.

I think there must be a better way to carve out the storage.

I like how they took the spares out of the aggregate size. Maybe they can take the disks required for the reserve and the "overhead" space, re-designate that as true overhead, and allow aggregates to fill to 100%.


Re: Aggregate size and "overhead" and % free rules of thumb.

No one is saying you can't run your aggregates at 100%.  In fact, I would imagine many people who don't thin provision do exactly that.  If you do, then you may want to watch your volume usage with respect to performance.

And that's the fundamental issue.  If your application requires performance (and not all do; think archives, etc.) then you want to have free blocks.  This isn't a NetApp thing, it's a storage thing.  I don't care whose storage you buy: as the # of free blocks drops, new writes will begin to fragment, which slows them down and thus tends to slow down the subsequent reading of those same fragmented blocks.  It's a matter of physics.  As the price of technologies like SSDs comes down, this may become less of an issue as seek speeds greatly increase, but right now the price point of those is pretty high, so most customers I've seen who are even willing to consider going that route reserve it for specific apps that need it and aren't deploying it widely.

But like I alluded to earlier, if your application can live with low performance, then you can worry less about this issue.  This is one of the bad things about the use of so-called "Best Practices".  You typically don't, and in some cases can't, deploy them everywhere.  It sounds like a good, simple idea, but there are cases where it makes sense to have a set of practices based on your needs rather than optimal performance, optimal utilization, etc.

Overall, I believe NetApp has a pretty good utilization story with lots of functionality that can help drive it up.  From RAID-DP, FlexVols, and snapshots to the newer things such as thin provisioning, PAM cards, and deduplication, most customers get more out of their NetApp storage than other solutions even with the overhead associated with making those features work.  Are there improvements that can be made?  Sure.  But even today, if one takes advantage of the features in place, it's a pretty good deal, IMHO.

Re: Aggregate size and "overhead" and % free rules of thumb.

I totally agree with your comments, hence my statement that NetApp is often very conservative and that I run my aggregates at 95% as a rule (never had performance issues).

</shifting gears>

I guess I'd just like to see this reported in a better way.  When you're coming up a TB short on all your aggregates x many dozens of aggregates x 8+PB of allocated storage, your management doesn't care about block overhead; they just say give me all that free space and give it to this dept.

Re: Aggregate size and "overhead" and % free rules of thumb.

I think the information you're looking for can be found using

aggr show_space, which shows how much has been allocated and how much has actually been used on a volume-by-volume basis, including a handy summary at the end, viz.:

Aggregate                       Allocated            Used           Avail
Total space                   729722752KB     320823404KB    1171989080KB
Snap reserve                  100262520KB      60113024KB      40149496KB
WAFL reserve                  222805604KB      23863076KB     198942528KB
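
If you want to feed that summary into your own reporting, a minimal Python sketch along these lines could parse it. It only assumes the column layout shown above (a row name followed by three values in KB); the names parse_summary and ROW are made up for illustration, and the "used as a share of allocated" ratio is just one possible way to read the columns, not an official metric.

```python
import re

# The summary rows exactly as pasted above.
SUMMARY = """\
Aggregate                       Allocated            Used           Avail
Total space                   729722752KB     320823404KB    1171989080KB
Snap reserve                  100262520KB      60113024KB      40149496KB
WAFL reserve                  222805604KB      23863076KB     198942528KB
"""

# A row is a name followed by three integer values suffixed with "KB".
# The header line has no such values, so it is skipped automatically.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<alloc>\d+)KB\s+(?P<used>\d+)KB\s+(?P<avail>\d+)KB\s*$"
)


def parse_summary(text):
    """Return {row name: {"alloc": KB, "used": KB, "avail": KB}} for each data row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows[match.group("name")] = {
                key: int(match.group(key)) for key in ("alloc", "used", "avail")
            }
    return rows


rows = parse_summary(SUMMARY)
for name, values in rows.items():
    used_share = 100.0 * values["used"] / values["alloc"]
    print(f"{name}: {used_share:.1f}% of allocated space is actually used")
```

For the "Total space" row this works out to roughly 44% of the allocated space being in use, which is the kind of gap between allocated and used space discussed earlier in the thread.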