VMware Solutions Discussions

Defragment VMs, should it be done?

HendersonD
13,217 Views

I am aware of at least two disk defragmenting solutions aimed directly at VMs (Diskeeper and PerfectDisk). Both claim increased IO and reduced latency for VMs that are defragmented. Has anyone tried these products? Is it important to defrag a VM's drives? Does running defrag on a VM affect its alignment? All of my VMs are aligned and I certainly do not want to change this.

radek_kubka
13,159 Views

Hi,

The short answer is: no.

Plenty of detailed info can be found here: https://communities.netapp.com/message/27821.

This quote is probably key:

"Virtual machines stored on NetApp storage arrays should not use disk defragmentation utilities as the WAFL file system is designed to optimally place and access data at level below the GOS file system. Should you be advised by a software vendor to run disk defragmentation utilities inside of a VM, please contact the NetApp Global Support Center prior to initiating this activity."

Regards,

Radek

BILLRTECHSPEC
13,157 Views

I have seen V-Locity (Diskeeper's VM Optimizing Product) make a very big difference on VMs.

It is true that you should not run a traditional defrag program on a VM, as it is not designed to address the virtual environment.

There's a lot of information on this.

Here's an Osterman Research White Paper on The Importance of Defragmentation in Virtualized Environments:

http://downloads.diskeeper.com/pdf/The_Importance_of_Defragmentation_in_Virtualized_Enviroments.pdf

And here's another White Paper by Topix Media on Best Practices for Optimising Virtual Disks: http://downloads.diskeeper.com/pdf/Topix_WP_en_DOLLAR.pdf

And here's a bunch of testimonials from IT managers running VMs of various sizes and configurations at their sites who have also actually used V-Locity:

http://downloads.diskeeper.com/pdf/V-locity_SuccessStories.pdf

So the more accurate answer would be Yes, one should defragment VMs, but only if one uses the right tool to get it done properly.

Check out these white papers and see what you think.

radek_kubka
13,157 Views

I have seen V-Locity (Diskeeper's VM Optimizing Product) make a very big difference on VMs.

It is true that you should not run a traditional defrag program on a VM, as it is not designed to address the virtual environment.

Okay, the point is not whether a defrag program understands VMs - the point is that there is no defrag program (yet) which understands the NetApp WAFL file system, and that is also true of Diskeeper.

The thread I am referring to in my previous post gives more details on why running OS/VM level defrag on NetApp is a bad idea.

In particular, this is NetApp's official best practices document:

http://www.netapp.com/us/library/technical-reports/tr-3749.html

You will find the relevant paragraph on page 78.

Regards,
Radek

keitha
13,157 Views

Radek is 100% correct here! Regardless of the tool, a defrag of the VMs will result in temporary loss of dedupe on the NetApp, huge snapshots and likely a decrease in VM performance.

Keith

brendanheading
13,157 Views

Running a defragmentation job on a production server would be tantamount to an act of suicide.

And, leaving aside the fact that defragmenting a logical volume on WAFL is a waste of time, I'm not convinced that file fragmentation is the issue that it once was. Fragmentation isn't an issue on SSDs (there is no distinction in average access times between random or sequential I/O) and even on traditional disks, with all the clever caching that modern OS's do, any frequently-accessed parts of a volume are going to be kept in the OS page cache anyway - provided there is sufficient RAM - irrespective of whether they are fragmented or not. RAM is cheap these days so it's a lot easier to add more rather than take massive risks with data.

The whole notion of defragmentation strikes me as a bit like colonic irrigation. The concept makes sense in theory but in practice there is little or no justification for it.

The claims that VMs in particular "need" defragmentation are very suspicious, and the erroneous statement "VMs usually run multiple OS's simultaneously" suggests that we are benefitting from the contributions of an individual who lacks technical background.

radek_kubka
13,157 Views

The whole notion of defragmentation strikes me as a bit like colonic irrigation. The concept makes sense in theory but in practice there is little or no justification for it.

It starts to get interesting. Whilst we both agree that using VM/OS level defrag tools doesn't make sense at all, fragmentation as such on a NetApp system can impact performance and can be rectified, albeit using storage-level tools (namely reallocate). There are a lot of discussions around that on Communities, with the most notable thread being: http://communities.netapp.com/thread/6530.

As for the practical impact of reallocation, this is a nice example: http://communities.netapp.com/message/26889#26889.

Regards,
Radek

HendersonD
13,157 Views

Radek,

I read through both of the links you provided about Netapp's reallocate. It was good reading but generated more questions for me than answers:

- What volumes should I run reallocate on? I have an NFS volume that holds my VMs, I have NFS volumes that hold my SQL databases and log files, I have LUNs for my Exchange database and logs. Are they all candidates?

- What if the volume is deduped already, should reallocate be run? What is the effect?

I agree with the sentiments of people in both threads that it would be nice to have a TR on this subject. Is there any documentation or white papers on reallocate?

Thanks to everyone who responded to my post. I certainly will not be using any type of defragment utility on my VMs, but my curiosity is piqued about possible performance improvements with reallocate.

Dave

radek_kubka
13,157 Views

Hi Dave,

Reallocation seems to be a prime example of a dark art in the NetApp world, indeed!

- What volumes should I run reallocate on? I have an NFS volume that holds my VMs, I have NFS volumes that hold my SQL databases and log files, I have LUNs for my Exchange database and logs. Are they all candidates?

The only answer can be - it depends! The old rule applies here: if it ain't broke, don't fix it. Is there a performance problem? Can it be related to fragmentation? The common suspects are slow sequential reads, as this is when fragmentation bites. Reallocate can be first run to measure the level of fragmentation, then the actual reallocation can be done if really needed.

- What if the volume is deduped already, should reallocate be run? What is the effect?

We have a neighbouring, fresh thread (http://communities.netapp.com/message/51653#51653) with the firm answer - no. Common sense says that if a block is shared between multiple files, it is really impossible to tell where its optimal location is (you can't please them all).

Regards,
Radek

HendersonD
13,157 Views

I have 4 NFS volumes for my SQL Server. Below are these volumes along with information I gathered after running "reallocate measure /vol/volname" on each one

SQLSystemDB, Optimization: 4 [hot-spots: 44]

SQLUserDB, Optimization: 4

SQLLogs, Optimization: 7

SQLSnapInfo, Optimization: 1
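
The measurements above can be tabulated with a small script. This is only a sketch: the line format is the one pasted in this post (not the filer's raw `reallocate measure` output), the names `parse_report` and `candidates` are made up for illustration, and the threshold of 4 is the value assumed in the first question of this post, not a confirmed rule.

```python
import re

# Lines as summarised in this post; this is NOT the filer's raw output format.
REPORT = """\
SQLSystemDB, Optimization: 4 [hot-spots: 44]
SQLUserDB, Optimization: 4
SQLLogs, Optimization: 7
SQLSnapInfo, Optimization: 1
"""

# volume name, optimization level, optional hot-spot count
LINE_RE = re.compile(
    r"^(?P<vol>\w+), Optimization: (?P<opt>\d+)"
    r"(?: \[hot-spots: (?P<hot>\d+)\])?$"
)

THRESHOLD = 4  # assumption taken from the question in this post


def parse_report(text):
    """Return {volume: (optimization, hot_spots)} for every recognised line."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            hot = int(match.group("hot")) if match.group("hot") else 0
            results[match.group("vol")] = (int(match.group("opt")), hot)
    return results


def candidates(results, threshold=THRESHOLD):
    """Volumes whose optimization level is at or above the threshold."""
    return sorted(vol for vol, (opt, _) in results.items() if opt >= threshold)


volumes = parse_report(REPORT)
flagged = candidates(volumes)
print(flagged)  # ['SQLLogs', 'SQLSystemDB', 'SQLUserDB']
```

A flagged volume is only a candidate for a closer look; whether reallocation is actually worthwhile is a separate question.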

Four quick questions:

1. It appears that I should run a reallocate on three of these volumes since their optimization level is 4 or above, is this correct?

2. I am going to run "reallocate start -f -p /vol/volname" to actually carry out the reallocation. Is this the correct command? It appears to be after reading the man pages

3. Should I run this off hours?

4. And the most important question - this is a production SQL 2008 R2 server, is this safe to run? The first rule of medicine and IT, do no harm!

Dave

brendanheading
9,513 Views

I'd suggest not doing anything of this kind unless it is in response to a user-reported problem.

radek_kubka
7,708 Views

This is my take on these questions:

1. It appears that I should run a reallocate on three of these volumes since their optimization level is 4 or above, is this correct?

Question no. 0. - are these volumes deduped? If they are - don't bother, otherwise it depends (is anyone actually complaining about read performance?)

2. I am going to run "reallocate start -f -p /vol/volname" to actually carry out the reallocation. Is this the correct command?

Yes, it looks good to me. The -p option is critical, as without it existing snapshots would grow massively.

3. Should I run this off hours?

In my opinion definitely outside of peak hours - the scan will quite likely hammer the performance whilst being run.

4. And the most important question - this is a production SQL 2008 R2 server, is this safe to run? The first rule of medicine and IT, do no harm!

It is deemed to be safe. I know people running reallocate on a regular weekly schedule on production volumes. That said, always plan for the unplanned (a new, undiscovered bug?) and have your backups done upfront.
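
The answers above boil down to a short decision sequence, which can be sketched as follows. The function name `reallocate_advice` and its two inputs are hypothetical, chosen only to mirror the questions asked in this post; this is a summary of the reasoning, not a NetApp tool or an official procedure.

```python
def reallocate_advice(deduped: bool, read_complaints: bool) -> str:
    """Mirror the order of the questions in this post (illustrative only)."""
    # "Question no. 0": deduped volumes are not worth reallocating.
    if deduped:
        return "skip: volume is deduped"
    # If nobody is complaining about read performance, leave it alone.
    if not read_complaints:
        return "leave alone: no reported read-performance problem"
    # Otherwise reallocate, with the precautions from questions 2 to 4.
    return "consider: run outside peak hours with -p, after taking backups"


print(reallocate_advice(deduped=False, read_complaints=True))
```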

Regards,

Radek

mmaterie1
9,513 Views

Hi Dave,

I'm the Director of Product Management at Diskeeper Corporation, the vendor that makes the V-locity product for VMware. I hope what I’m writing is helpful and does not come across as a sales pitch, that is not my intent. I feel it’s important to substantiate statements (for or against) with data.

As you noted the purpose of defrag is to increase IO throughput / reduce latency – at the GOS. Windows-based file optimization tools are not going to affect physical placement in a NetApp FAS. So, the comment in the NetApp paper is correct in this regard. However, it does not speak to the fact that the IO overhead in Windows is negatively impacted. Defragmentation does not alter alignment whatsoever, but the unwanted effect of misaligned partitions is the same as from NTFS fragmentation – “…results in a dramatic increase of I/O load on a storage array and negatively affects the performance of all virtual machines being served on the array”; pulled from page 79 of the NetApp Best Practices paper referenced earlier in this thread.

We recently completed tests on a NetApp FAS3140 at a Microsoft lab in Washington. Needless to say we didn’t use ESX, but the results would be similar. Testing included Jetstress, Iometer, SQLIO and SQLIOsim, as well as Perfmon to collect additional data. Specifically, benchmarks were run on fragmented test files (starting at around 60,000 fragments and up) that are used by those apps (e.g. iobw.tst, Jetstress*.edb, Sqliosim.mdx, etc…) and non-fragmented test files. In every case, fragmentation in the GOS resulted in more IOs and latencies for those benchmark apps. Defragmented files resulted in less total IO required and faster response times. As one example with SQLIO, 8KB Random Writes IOs/Sec improved from 486.15 to 1274.78 with the Max_Latency(ms) dropping from 4490 to 161.
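
To put the SQLIO example in relative terms, the quick calculation below uses only the four figures quoted above; the variable names are illustrative.

```python
# Figures quoted in this post for the SQLIO 8KB random write test.
before_iops, after_iops = 486.15, 1274.78
before_max_latency_ms, after_max_latency_ms = 4490, 161

speedup = after_iops / before_iops
latency_drop_pct = (
    (before_max_latency_ms - after_max_latency_ms) / before_max_latency_ms * 100
)

print(f"{speedup:.2f}x IOs/sec, max latency down {latency_drop_pct:.1f}%")
# prints: 2.62x IOs/sec, max latency down 96.4%
```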

Perfmon also indicated IO issues (the Perfmon screenshot in this post is from those FAS tests: http://www.diskeeper.com/blog/post/2011/03/29/Best-Practices-SAN-defrag.aspx). The split I/Os reported in Perfmon are indicative of fragmentation and/or misaligned partitions.

We ran the reallocate command as part of the tests, which also helped in most cases.

I understand the “do no harm” viewpoint for any evaluation, and there are caveats; which is why we have special technology for VMs/SANs. If you can test on non-production LUNs, that’s a great start. Perhaps the most readily available option is to monitor Perfmon stats in the VMs for the presence of potential logical disk bottlenecks. Ultimately theory always has to be validated with scientific process, and we have support engineers, methodology, and software tools available to assist anyone with testing.

As to the comment on caching, yes if you can cache everything you don’t have to be as involved in disk bottlenecks, but where it is cached is relevant. If you can cache in Windows, you can leverage Fast IO, and don’t even need to generate IRPs to pass into the storage subsystem. I've also tested on many SSDs (http://www.diskeeper.com/blog/post/2010/04/02/Do-you-still-need-HyperFast-if-you-have-TRIM.aspx), which mitigate the impact of fragmentation on reads (Intel X25-E being the best I tested – had to get to 200k free space fragments before write performance dropped), but can still be impeded. Consolidating free space, not file fragments, is the principal need on SSDs. That said, better controllers and more channels (and PCM memory in the future) continue to improve SSD overall performance, so it's reasonable to expect fragmentation to be a relatively moot issue in the future.

Michael Materie

Diskeeper Corporation

keitha
9,513 Views

Michael,

I appreciate the info but I am still going to have to disagree with you. First, to say that fragmentation is the same as or similar to misalignment on the storage array is grossly untrue. They are radically different use cases on the disk controller.

Second, did you do your testing with a single VM or a single physical server on the NetApp controller? I hope not, since most of the controllers in customer sites are handling streams of data from dozens of sources and laying that data out on the disks in a particular way. To say that a tool, any tool, can from within a guest OS determine where those blocks are on a NetApp controller and move those blocks to a better location is wrong.

Keith

mmaterie1
9,513 Views

Hi Keith,

Apologies if I was unclear, I should have been more specific to note "GOS" alignment - e.g. the flawed default alignment of Win2k3. And, my point is the effect in the GOS of "split I/Os". Technically the fragmentation issue is one that is ideally handled and coordinated at every layer as well, as it is with proper alignment. I do agree that mis-alignment is a more severe issue, but is one that proper config can address.

And yes, the testing was with multiple concurrent operations to try and get close to real-world, though I should note that the tests we did are proof-of-concept. I've attached a pic that shows the general setup used. The tests we did validate that performance can be gained from GOS defrag. They do not pretend to make assumptions as to benefits anyone else would achieve. The only point I take away from the results is "it's worth looking into"... that's all.

The data below is from a V-locity user on a production NetApp SAN (1-week to 1-week comparison). It is an in-progress case study, so I don't have any info about the model or config yet, but that will all be published when ready.

Counter                        Before    After     Improvement %
Split IO/sec                   2.858     0.251     91.22
% Disk Time                    67.074    21.579    67.83
Avg. Disk sec/Transfer         0.059     0.027     54.24
Avg. Disk sec/Write            0.02      0.01      50.00
Avg. Disk sec/Read             0.118     0.097     17.80
Avg. Disk Queue Length         6.037     1.942     67.83
Avg. Disk Read Queue Length    5.096     1.477     71.02
Avg. Disk Write Queue Length   0.941     0.465     50.58
% Idle Time                    94.291    96.689    -2.54

I agree that Windows-based tools cannot control physical placement (true even for DAS - e.g. any SSD with wear leveling, HDD with bad sector remapping, etc...), and made a specific statement about that in the original post.

keitha
9,513 Views

Sorry, it looks like I misunderstood you, as I thought we were talking about defragmentation. Yes, GOS alignment makes a huge difference and should absolutely be done. I'm wondering, what is the downtime to the VM during the alignment process with your product?

Keith

mmaterie1
7,708 Views

Hi Keith,

The results I'm presenting are from defrag.

The correlation I was trying to make on alignment is that fragmentation and a misaligned GOS basically both require additional unnecessary I/Os (in NTFS) to get the same amount of data. I restated this because you referred to “misalignment on the storage array”, and as you know, alignment needs to be addressed at multiple levels.

V-locity currently does fragmentation prevention, defrag and virtual disk compaction (offline, but live if you have VMotion with the new release in June). Alignment is being looked at for a future version of V-locity.

I just noted you are a NetApp employee. We are under NDA, so I can discuss more if you contact me directly. I can also provide NFRs.

-Michael

brendanheading
9,513 Views

Radek, absolutely, but you could have a volume in need of reallocation and not have an observed performance issue. It all depends on what the entity using the data is doing.
