VMware Solutions Discussions

Reallocation: aggr vs volume.

jasonczerak

I'm a little confused here about where it's appropriate to use this.

When running a reallocation on an aggr, I've noticed that it goes through each volume and does what I believe to be the volume version of the reallocation command. Is this true? If I set a schedule to kick off a weekly reallocation on an aggr, will it blanket the relayout of blocks, first for the free space at the aggr level, and then block optimize each volume?

Or should I be setting up a schedule at the per-volume level? There are a handful of volumes that I think could use some daily optimization. Some may even work well with read_realloc as well.

Where I officially got confused was with this note from the reallocate command. This "note" doesn't seem to be in any of the PDF/online docs and contradicts them.


reallocate start -A [-o] [-i interval] <aggr_name>
  NOTE: -A is for aggregate (freespace) reallocation.
        Do NOT use -A after growing an aggregate if you wish to
        optimize the layout of existing data; instead use
            reallocate start -f /vol/<volname>
        for each volume in the aggregate.

Looking at the volume statuses, I see "online,raid_dp,redirect,active_redirect" for each volume. This would lead me to believe that it's doing "everything".

Data ONTAP 7.3.1.1 (and 7.3.3RC1 on 2 arrays soon to be deployed, around when 7.3.3 is hopefully GA)


martinj

Based on conversations I've had with people way smarter than I am, the general rule of thumb is:

Only run reallocate -A when you value write performance above all else, and even then you'd be well advised to contact the Global Support Center or your local NetApp performance SE first for advice.

There are edge cases where it can be used effectively for other workloads, however you're almost always better off running reallocate -p. Even though this does not address free space fragmentation directly, for most situations it does a fair to good job of leaving behind lots of good areas for the write allocator to work with.

If you're seeing suboptimal write performance, you're better off checking some other things first (like misaligned I/O) before you start running reallocate -A.

Regards

John

erick_moore

Also it is important to note the differences between aggregate reallocation and volume reallocation.  All aggregate reallocation does is make free space in the aggregate contiguous.  This is different from volume level reallocation where data blocks are optimized across an aggregate.  For example when you add new disk to an aggregate you generally should run volume level reallocations against all the volumes in that aggregate.  Doing that will distribute your existing data evenly across the new disk, sort of a leveling process so you don't end up with hot disks.  Doing an aggregate reallocate after adding new disk would basically do nothing since there is already new contiguous free space in your aggregate on the newly added disks.  Does that make sense?

erick.moore@sxc.com

"Doing an aggregate rescan after adding new disk would basically do nothing since there is already new contiguous free space in your aggregate on the newly added disks."

I don't think that's correct. It's important to reallocate after adding disks to make sure that your contiguous space is spread evenly across spindles. The last thing you want is 10 disks full and 4 disks empty. The data needs to be contiguous when looked at from the aggr level, not on the disk itself: you don't want to read 10 blocks off 4 disks when you could be reading 1 block off 40. Which of course brings up the question of statit results again: are chain reads counted per spindle? (I suspect they are, since it's in the per-disk data output.)

Also I was reading the 7.3x upgrade guide and it mentions that deduped blocks are not reallocated. Bummer, I can see why (can you imagine the overhead if you had to look at an extra 16 dimensions when doing a reallocate) but it's still a shame. I wonder if it will be part of OT8.

Actually it is correct.  An aggregate reallocation is different from a volume reallocation.  Aggregate level reallocate will not optimize placement of the blocks in the volume, or level the data in volumes across the aggregate. This is from the man page for reallocation concerning aggregate level reallocation: "Do not use -A after growing an aggregate if you wish to optimize the layout of existing data; instead use `reallocate start -f /vol/<volname>' for each volume in the aggregate."  Reallocating your aggregate only results in the creation of contiguous free space in the aggregate.
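
To make the "for each volume in the aggregate" part concrete, here is a minimal Python sketch that just prints one command line per volume. The volume names are made up for illustration and the script does not talk to a filer; the only thing taken from the man page is the `reallocate start -f /vol/<volname>` form quoted above.

```python
def volume_realloc_commands(volume_names):
    """Build one forced volume-level reallocate command per volume name."""
    return [f"reallocate start -f /vol/{name}" for name in volume_names]


# Hypothetical volume names, for illustration only.
volumes = ["vol_esx01", "vol_esx02", "vol_home"]

for command in volume_realloc_commands(volumes):
    print(command)
```

You would still want to stagger these rather than paste them all in at once, since each scan competes with production I/O.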

Thanks,

E-

anthonyfeigl

Let’s try to understand this.

I can see between Erick and Jeremy that you guys are both headed in the right direction.

My understanding of basic RAID is that data needs to be laid out one time in a contiguous fashion.

Doing so will create better performance for data that is accessed from said RAID set.

An AGGR is a really big flat file system, and as a volume is created, it takes a chunk of the file system.

So the first basic question is this.

Do a volume's blocks span an AGGR from beginning to end, like HDS and EVA SAN (virtualized across), or do they have a selected "chunk" of space that is fixed in the AGGR configuration?  As I am writing this, I am considering the fact that FlexVols can be resized on the fly, and that seems to be counterintuitive to what I stated.

Does anyone know how the Flex Vol is actually laid out as data on an AGGR?

Is anyone else wondering this?

Anthony

I think flexvols are just another level of virtualization on top of the aggr - initially aggrs were volumes, and flexvols are just a 2nd layer on top of that. Remember qtrees? They were ugly and not really that useful, so everything got bumped a bit.

I think

I remember someone talking about this at NetApp back in the day - also it would make sense, since there is obviously a difference between the real block layout and what it looks like from a snapshot point of view (-p), which is why you can reallocate but still use VSM without resyncing everything.

Let's say you add 4 new disks to a 60 disk aggregate.  Let's also say you have 1 volume sitting on that aggregate.  Now if you do an aggregate reallocate it will level out free space in the aggregate, but all you have is free space on those new disks!  Aggregate reallocation will not touch the existing data blocks of the volume that sits on the aggregate.  Now when you go to update that volume, where do those new writes go?  The answer is to the 4 new disks.  You have just created a hot spot, since writes for that volume will go to the new disks because those disks are the areas with the most contiguous free space.  In order to level performance you need to reallocate your volume.  Doing that will now distribute the data across all 64 disks.
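
A toy Python model of that 60 + 4 disk example may help. It is only a sketch: the capacity, fill level, and the "always write to the disk with the most free space" rule are assumptions made up for illustration, not how the WAFL write allocator actually works.

```python
def write_blocks(used, capacity, count):
    """Place each new block on the disk with the most free space (toy rule)."""
    for _ in range(count):
        target = max(range(len(used)), key=lambda d: capacity - used[d])
        used[target] += 1
    return used


CAPACITY = 1000            # blocks per disk (made-up number)
old_disks = [800] * 60     # 60 existing disks, each 80% full
new_disks = [0] * 4        # 4 freshly added, empty disks

# Case 1: no volume reallocate, so new writes simply follow the free space.
used = write_blocks(old_disks + new_disks, CAPACITY, 2000)
writes_to_new = sum(used[60:])
print("writes landing on the 4 new disks:", writes_to_new, "of 2000")

# Case 2: existing data leveled across all 64 disks first (idealized).
total = sum(old_disks)
leveled = [total // 64] * 64
print("blocks per disk after leveling:", leveled[0])
```

In this model every one of the 2000 new writes lands on the 4 new disks, while leveling the existing data first leaves all 64 disks equally full, so later writes have no reason to favor any of them.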

An aggregate reallocate is different from volume reallocate.  This is what I have been trying (poorly it seems) to explain.  I hope this helps clarify.

"Aggregate reallocation will not touch the existing data blocks of the volume that sits on the aggregate."

I am missing something: if an aggr reallocate does not change any existing blocks, then what does it do?

I think it does move existing blocks, basically filling in any empty areas (similar to a database shrink). It does not lay out the file systems in any optimized way since it has no knowledge of what a file system is but it does move the data from the full disks to the empty ones. Isn't that the point of the command?

I agree, the aggregate reallocate is confusing.  The purpose is to make free space contiguous in the aggregate for writes.  Imagine that you have an aggregate that has 2 volumes on it.  These volumes are heavily utilized, and you fill your aggregate up to 88% full.  Now though after years of reading and writing there isn't a lot of contiguous free space in the aggregate.  The NetApp now has to work a lot harder to find areas in the aggregate to write data.  By performing an aggregate reallocation you will help your write performance by moving all the aggregate free space to a contiguous location.
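
As a rough picture of what "contiguous free space" means, here is a small Python sketch using a made-up block map ('#' used, '.' free). The compaction step is a deliberately naive stand-in, similar to the database shrink analogy, and is not a description of what reallocate -A does internally.

```python
from itertools import groupby


def longest_free_run(bitmap):
    """Length of the longest run of free ('.') blocks in a toy block map."""
    runs = [len(list(group)) for key, group in groupby(bitmap) if key == "."]
    return max(runs, default=0)


def compact_free_space(bitmap):
    """Toy 'shrink': pack used ('#') blocks together, free space at the end."""
    used = bitmap.count("#")
    return "#" * used + "." * (len(bitmap) - used)


fragmented = "##.#.##.#..#.##.#.#."   # made-up layout after years of churn
compacted = compact_free_space(fragmented)

print("longest free run before:", longest_free_run(fragmented))
print("longest free run after: ", longest_free_run(compacted))
```

The amount of free space is the same in both maps; only the size of the largest contiguous free region changes, which is the part that matters for writes.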

This is directly from the manual: "Do not use -A after growing an aggregate if you wish to optimize the layout of existing data; instead use `reallocate start -f /vol/<volname>' for each volume in the aggregate. "

Do NOT use "reallocate -A" (aggregate level reallocation) if you want to optimize existing data.  I think I need to do a blog post on this topic as it is easily one of the most misunderstood aspects of WAFL.

erick_moore

jasonczerak

So, what exactly is reallocate -A doing when it is showing the volumes in the status? First it seems it is processing the aggr, then every volume. All volumes get the active_redirect status, then the redirect status. Apparently this is the same status as the "wafl scan redirect /vol/volname" command. Some of our aggrs have 30 volumes, some less. I'm looking for a way to keep a schedule that keeps the data optimized as much as possible, since we have a rather 9-5 peak for our main workloads. What I've found is that aggr and volume wafl scan redirect consume one core very well, which then artificially adds latency filer wide. Even on a 6080 the reallocate command can cause serious problems. I have an issue open with support on this.

erick_moore

There really is no need to perform an aggregate reallocate unless you have run into some very specific performance issues.  In fact you are probably introducing problems into your environment by frequently running aggregate reallocations.  The active_redirect you see on volumes in the aggregate means that your reallocation has not finished.  You will see reduced read performance to volumes in an aggregate that have this status.  Once the reallocation is finished the volumes will display "redirect".  This is normal and does not indicate any downgraded performance issues.  How frequently are you running aggregate reallocate?  I suspect you may be stepping on the aggregate reallocate.

If you go into priv set advanced you can run the command: wafl scan status.  This will show you which volumes are actively redirecting blocks.  You will see something like this on volumes that are actively re-organizing free space.

11058                       redirect     13 of 17 volume(s) processed.

11248    container block reclamation     block 1508 of 5355

You should not run another aggregate reallocate until these processes have stopped.  You should not have any volumes that display with the active_redirect status.  Additionally you should not run an aggregate reallocate if you have aggregate snapshots.  I always recommend turning off aggregate snapshots anyway.
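
If you want to check this from a script, here is a minimal Python sketch that parses the two sample lines above. The column layout (scan id, scan type, progress text, separated by runs of spaces) is inferred only from that sample, so treat the pattern as an assumption; real `wafl scan status` output may contain other line shapes that this simply skips.

```python
import re

# The two sample lines quoted above from `wafl scan status`.
sample = """\
11058                       redirect     13 of 17 volume(s) processed.
11248    container block reclamation     block 1508 of 5355
"""

# Assumed layout: numeric id, scan type, free-form progress text,
# with two or more spaces between the columns.
LINE = re.compile(r"^\s*(\d+)\s{2,}(.+?)\s{2,}(.+?)\s*$")

# Scan types from the sample that indicate the reallocate is still working.
BUSY_TYPES = {"redirect", "container block reclamation"}


def parse_scans(text):
    """Return (scan_id, scan_type, progress) tuples for lines that match."""
    scans = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            scans.append((int(match.group(1)), match.group(2), match.group(3)))
    return scans


def reallocate_still_running(text):
    """True if any scan of a type listed in BUSY_TYPES is present."""
    return any(scan_type in BUSY_TYPES for _, scan_type, _ in parse_scans(text))


for scan in parse_scans(sample):
    print(scan)
print("hold off on another aggregate reallocate:", reallocate_still_running(sample))
```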

Very few types of volumes suffer from heavy fragmentation.  We had an ESX LUN that we did not run a reallocate against for almost 2 years.  Once we did a reallocate measure, it reported an optimization level of 3.  If you are very concerned about this type of thing you can easily set thresholds and schedule reallocations to occur once a volume is over the threshold you set.  NetApp recommends doing this for LUNs.  I think people have a tendency to become worried when they see optimization levels, but it is your application you should be concerned with and not this number you see from the CLI.
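
The threshold idea boils down to a simple filter. Here is a hedged Python sketch of it: the paths, the measured levels, and the threshold value are all invented for illustration (only the level 3 ESX LUN mirrors the example above), and in practice the filer applies the threshold itself when you schedule the scan.

```python
def paths_over_threshold(measured_levels, threshold):
    """Return the paths whose measured optimization level exceeds the threshold."""
    return sorted(
        path for path, level in measured_levels.items() if level > threshold
    )


# Hypothetical results of earlier "reallocate measure" runs.
measured = {
    "/vol/esx_lun01": 3,   # like the ESX LUN above: fine after 2 years
    "/vol/sql_data": 6,
    "/vol/home": 4,
}
THRESHOLD = 4              # arbitrary value chosen for this example

for path in paths_over_threshold(measured, THRESHOLD):
    print("candidate for reallocation:", path)
```

With these numbers only /vol/sql_data is flagged, which is the point: most volumes never cross the line, so there is no reason to reallocate them on a fixed schedule.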

jeremypage

Sorry to cause confusion, I was making the assumption it was looking at contiguous from a RAID level, not just for the spindles individually. If I add spindles to an existing aggregate for read performance, I don't get anything for existing data unless I do a volume reallocate for each volume? That's pretty disappointing, I thought ONTAP would be smart enough to use the new drives.

radek_kubka

Guys,

This is great - a few more posts & we will have almost a reallocate TR in this single thread!

Kindest regards,

Radek

I think I am not explaining this well. For example, if you have a mostly full aggr with 2 RGs of 13 drives each and add a drive to each RG, the space from the new drives is 100% contiguous. But while the free space is contiguous on the disk, it's extremely inefficient in terms of reads/writes, since most of the time you'll only be hitting the new disks. If you reallocate the aggr's blocks across all of your spindles then you're less likely to have to read more than once per spindle.

Ideally your file systems are laid out so that on the RG & aggr level you do contiguous reads, and vary depending on the needs of the volumes from there. If you really want to make it complicated, you add the fact that you've got volumes on that, and possibly LUNs on that, and maybe VMDK files inside those LUNs with files inside of them as well. Remember the goal from a performance standpoint is to 1) maximize the number of spindles involved in a read/write and 2) reduce the seeks.

Reallocating an aggr should do this by leveling the data across all the disks in the aggr but does not address #2 because at the aggr level you can't really predict what actual files are where (which is why you need to do a volume reallocate). Basically there are two things that can be contiguous - blocks (any blocks) on disk and the blocks for a specific file system. While the aggr level reallocate does not help the files be contiguous it does help performance when you add new drives because otherwise you'll be producing uneven work across your spindles.

Ideally, if you have a 48k file on a 14 disk DP RG, you'd do one read per spindle to retrieve it, although that does not take the initial seek time into account (which, assuming non-sequential reads, is almost always going to hit you anyway), but even if you do, that's one seek per spindle. I don't know how expensive the seeks are though - reading 8 blocks from a single spindle may be faster than 1 block from 8 disks, and in that case you'd be right.  It would be interesting to test.
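
The arithmetic behind "one read per spindle" can be spelled out in a few lines of Python. This assumes the usual 4 KB WAFL block size and that 2 of the 14 disks in a RAID-DP group hold parity; it says nothing about seek cost, which is the open question above.

```python
WAFL_BLOCK_KB = 4     # WAFL block size in KB
PARITY_DISKS = 2      # RAID-DP: one parity disk plus one diagonal-parity disk

file_kb = 48
rg_disks = 14

data_disks = rg_disks - PARITY_DISKS        # disks that actually hold data
blocks = file_kb // WAFL_BLOCK_KB           # blocks needed for the file
blocks_per_data_disk = blocks / data_disks  # ideal, perfectly even spread

print("data disks:", data_disks)
print("blocks in file:", blocks)
print("blocks per data disk:", blocks_per_data_disk)
```

So a 48k file is 12 blocks, and a 14 disk RAID-DP group has 12 data disks, which is where the one block per spindle figure comes from in the best case.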

jasonczerak

Well, after some more reading I've concluded that a volume reallocate with -p is necessary. I'm still confirming this, but if you do it on the parent of a flex clone, you should also do it on the clones.

Still no word on whether a volume reallocate will "undo" the read_realloc volume option. It is a lot of work to test this.

Does anyone have any DFM counters at the volume level that could be graphed to show an improvement in IO or anything after these operations are performed?

Hi

One thing to watch out for.  We tried to do an aggregate reallocation, but because the aggregate was created with DoT 7.2 and the filers now run 7.3, the process failed and would not start.  So we had to make do with volume reallocation.

Hope it helps

Bren

That's interesting. We started at 7.2.4 and upgraded to 7.3.1 but didn't have any issues. Were you using traditional volumes? We're 100% USDA Flex here.

We started at 7.2.6 I think and moved up to 7.3.1.1, and the 4 original aggrs created with 7.2 run the process without error. It's just when they get to the "volume" part that one core gets chewed up and causes latency issues filer wide. I have an issue up with support for this.

My bad. Just looked up the error and it was due to ours being a 7.1 aggregate. Error was:

Unable to start reallocation scan on 'aggrX': Aggregate created on older version of ONTAP
