Misaligned VM performance impact analysis on NFS datastores ~ 18%

fletch2007 · ‎2009-09-21

Hi,

We are in the process of aligning the VMs on our NFS datastores. I read with interest the NetApp doc

http://media.netapp.com/documents/tr-3747.pdf

which outlines the performance impact from the NetApp point of view, but I did not see the impact quantified.

Since alignment currently requires the VM to be down (integrate with Storage vMotion, please?!), I decided to design a test that measures the impact of aligned vs. misaligned from the VM's point of view.

The setup, on an otherwise quiesced lab system (a cluster of two Dell 1950 hosts running vSphere, with a NetApp 2020 serving the NFS datastores):

1) Take a misaligned Linux VM (as checked by mbrscan)

2) clone the VM

3) align the clone with mbralign

Now we have two Linux VMs: M (misaligned) and A (aligned).
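
For readers unfamiliar with what "aligned" means here, a minimal Python sketch of the check. It assumes 512-byte sectors and the 4 KB block size used by WAFL; `start_offset_bytes` and `is_aligned` are hypothetical helpers for illustration and are not how mbrscan or mbralign are implemented.

```python
SECTOR_BYTES = 512        # assumed sector size
WAFL_BLOCK_BYTES = 4096   # 4 KB storage block size


def start_offset_bytes(start_sector: int) -> int:
    """Byte offset of a partition that begins at the given sector."""
    return start_sector * SECTOR_BYTES


def is_aligned(start_sector: int) -> bool:
    """True when the partition starts on a 4 KB boundary."""
    return start_offset_bytes(start_sector) % WAFL_BLOCK_BYTES == 0


if __name__ == "__main__":
    # Sector 63 is the traditional MBR default; 64 and 2048 are aligned starts.
    for sector in (63, 64, 2048):
        print(sector, start_offset_bytes(sector), is_aligned(sector))
```

A partition starting at sector 63 begins 32256 bytes into the disk, which is not a multiple of 4096, so guest blocks straddle storage block boundaries.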

I wanted a way to generate IO of varying sizes - I used this script:

[fcocquyt@lab-vm-01 ~]$ more generateIO.csh

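#!/bin/csh

set x=1
set bs=1024

# Step the dd block size from 1024 to 8192 bytes in 1 KB increments
while ( $bs < 9000 )
    echo $bs
    # Write and checksum 19 files of 10240 blocks each at this block size
    while ( $x < 20 )
        dd if=/dev/zero of=tstfile$x bs=$bs count=10240
        sum tstfile$x
        @ x++
    end
    # Clean up and reset the file counter before the next block size
    rm tstfile*
    @ bs += 1024
    set x = 1
end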

What I found from repeated runs of this script on both the M and A VMs was that the misaligned VM took an average of 18% longer to complete the same IO.
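
A figure like this can be derived from the run times along the following lines. This is only a sketch: `percent_slower` is a hypothetical helper, and the timings in the demo are made-up placeholders, not the measurements from these runs.

```python
from statistics import mean


def percent_slower(misaligned_secs, aligned_secs):
    """How much longer the misaligned runs took on average, in percent."""
    aligned_avg = mean(aligned_secs)
    misaligned_avg = mean(misaligned_secs)
    return (misaligned_avg - aligned_avg) / aligned_avg * 100.0


if __name__ == "__main__":
    # Placeholder wall-clock times in seconds (NOT data from this thread)
    aligned_runs = [10.0, 10.0]
    misaligned_runs = [12.0, 13.0]
    print(f"{percent_slower(misaligned_runs, aligned_runs):.1f}% slower")
```

Averaging over several runs per VM, as done here, helps smooth out run-to-run noise before comparing.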

I also captured /usr/lib/vmware/bin/vscsiStats output, but interestingly those numbers (latency and outStandingIOs, for example) did not show the same result: average latency was about the same for the M and A VMs.

I welcome any and all comments on this analysis

One open area is block size: I suspect it has a big effect on latency. While the script was stepping through the block sizes, I observed the throughput varying quite a bit.
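
One way to reason about why request size and alignment interact is to count how many 4 KB storage blocks a request touches and how many of them are only partially covered. The sketch below assumes a 4 KB block and uses a hypothetical helper, `blocks_touched`. It describes requests as they arrive at the storage; the guest filesystem and page cache may merge or resize what dd issues, so it is not a model of what the script above actually generated.

```python
BLOCK_BYTES = 4096  # assumed storage block size


def blocks_touched(offset, length, block=BLOCK_BYTES):
    """Return (total_blocks, partial_blocks) for a request of `length`
    bytes starting at byte `offset` on the underlying storage."""
    if length <= 0:
        return (0, 0)
    end = offset + length
    first = offset // block
    last = (end - 1) // block
    partial = 0
    for b in range(first, last + 1):
        b_start = b * block
        covered = min(end, b_start + block) - max(offset, b_start)
        if covered < block:
            partial += 1  # block is only partly overwritten
    return (last - first + 1, partial)


if __name__ == "__main__":
    shift = 63 * 512  # partition starting at sector 63 (misaligned)
    for size in range(1024, 9000, 1024):
        print(size, blocks_touched(0, size), blocks_touched(shift, size))
```

With no shift, a 4096-byte request covers exactly one full block; with the 63-sector shift the same request spans two blocks and covers neither completely.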

But the finding of 18% impact is in line with my expectation for NFS datastores...

thanks

amiller_1 · ‎2009-09-23

Very handy info -- thanks much for posting.

I generally tell people that you have a certain performance "ceiling" based on the filer head and/or number of spindles (usually driven more by spindle count). Misalignment just means that you'll hit that "ceiling" sooner than you would otherwise. Until you hit that ceiling you won't see a huge difference (although 18% is higher than I would have thought; very good to know). Once you do hit that ceiling, it's the same as when you max out your backend disk I/O under any circumstances (i.e. things get very slow). You're just going to get there faster than otherwise due to the impact of misalignment.

fletch2007 · ‎2010-07-23

Turns out the impact of misaligned VMs can be more dramatic if the level of unaligned IO tips ONTAP into synchronous mode.

One indicator is the pw.over_limit stat -

ref: TR-3593.pdf

1.4 Counters that indicate Improper Alignment
There are various ways of determining if you do not have proper alignment. Using perfstat counters, under the wafl_susp section, "wp.partial_writes", "pw.over_limit", and "pw.async_read" are indicators of improper alignment. The "wp.partial_writes" counter is the block counter of unaligned I/O. If more than a small number of partial writes happen, then WAFL® will launch a background read. These are counted in "pw.async_read"; "pw.over_limit" is the block counter of the writes waiting on disk reads.

This counter is not exposed via SNMP as standard, but can be trended as outlined here:

http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html
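
As a rough illustration of the trending step (not the method from the linked post), the sketch below turns successive counter samples into a per-second rate. It assumes the counter is cumulative and that you already collect (timestamp, value) pairs by some means; `counter_rate` is a hypothetical helper and no perfstat output format is assumed.

```python
def counter_rate(samples):
    """samples: list of (unix_seconds, counter_value) in time order.
    Returns a list of (unix_seconds, increase_per_second) per interval."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or v1 < v0:
            continue  # skip bad timestamps and counter resets
        rates.append((t1, (v1 - v0) / dt))
    return rates


if __name__ == "__main__":
    # Placeholder samples taken one minute apart (NOT real filer data)
    demo = [(0, 100), (60, 160), (120, 160)]
    for when, per_sec in counter_rate(demo):
        print(when, per_sec)
```

A rate that stays at zero suggests writes are not waiting on disk reads; a sustained non-zero rate is the pattern to watch for.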

thanks