2011-09-02 01:21 PM
We have been dealing with a nasty I/O performance issue with VMware on our FAS2020 for a number of weeks.
Any time a VM did any significant amount of write access to disk, the VM would slow to a crawl, to the point of being unusable. When the write operation had completed, the VM would return to normal. Using vmstat/iostat under Linux we could see that iowait% would rise significantly, with spikes in the number of runnable processes (suggesting the kernel was too busy to schedule them). We raised a ticket with VMware; they took a look at the logs from esxtop and pointed at storage latency. We raised a ticket with NetApp and sent several perfstats. We had been doing some heavy snapmirroring, which we were encouraged to turn off, but the problem remained.
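For anyone who wants to track the iowait% symptom inside a Linux guest without watching vmstat by hand, here is a minimal sketch. It assumes a Linux guest that exposes /proc/stat; the function names are made up for illustration and are not part of any VMware or NetApp tool. It takes two samples of the aggregate "cpu" line and works out what share of the elapsed CPU time was spent in iowait.

```python
def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples.

    Field order: user nice system idle iowait irq softirq steal [guest ...].
    Guest time is already included in user, so only the first eight
    fields are summed to avoid counting it twice.
    """
    total = sum(after[:8]) - sum(before[:8])
    if total <= 0:
        return 0.0
    return 100.0 * (after[4] - before[4]) / total


def read_cpu_counters(path="/proc/stat"):
    """Read the current aggregate CPU counters (Linux only)."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())
```

Calling read_cpu_counters() twice a few seconds apart and passing both results to iowait_percent() gives a figure comparable to the "wa" column in vmstat. A sustained high value during writes matches the behaviour described above, though it does not by itself say whether the cause is the array, the network, or a leftover snapshot.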
Eventually, taking a second look, we noticed that there were a few VMware snapshots present on the VMs showing slowness.
It's worth adding at this juncture that we use Virtual Storage Console (formerly SMVI) to keep nightly backups. VSC works by creating a VMware snapshot to quiesce the VM, and then creating a NetApp snapshot to lock the quiesced VM in place. When the NetApp snapshot is safely created, VSC is supposed to delete the VMware snapshot. On some occasions, for reasons that are presently unknown, this doesn't happen, which means that the VMs carry on running with a VMware snapshot present.
VMware snapshots are notoriously performance-hungry (there's some great analysis here: http://www.vmdamentals.com/?p=332). When we deleted those old VMware snapshots the problems went away, and our VMs and storage are now humming along nicely.
So if you're having weirdly consistent performance issues with VMware, check your VMs and make sure that any VMware snapshots are deleted - look for *.vmsn files in the datastore. Be careful, as deleting the snapshots will cause the VMware host to rebuild the primary vmdk, which may take quite some time to complete - it took an hour in our case. It should be a simple matter to write a quick shell script and run a cron job to catch any cases where this happens in the future.
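As a rough sketch of the kind of check that could run from cron, here it is in Python rather than shell. It assumes the datastore is NFS-mounted on an admin host (the mount point /mnt/vmware_datastore used below is hypothetical) and that a *.vmsn file's modification time is a good enough stand-in for when the snapshot was taken. The 24-hour threshold and the function name are also just illustrative choices.

```python
import os
import time


def find_stale_snapshots(datastore_root, max_age_hours=24.0, now=None):
    """Return sorted (path, age_in_hours) pairs for *.vmsn files older than the threshold."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(datastore_root):
        for name in filenames:
            if not name.endswith(".vmsn"):
                continue
            path = os.path.join(dirpath, name)
            try:
                age_hours = (now - os.path.getmtime(path)) / 3600.0
            except OSError:
                # The file disappeared mid-scan, e.g. the snapshot was just deleted.
                continue
            if age_hours > max_age_hours:
                stale.append((path, age_hours))
    return sorted(stale)


def report(datastore_root, max_age_hours=24.0):
    """Print any stale snapshot files; return 1 if some were found, else 0."""
    stale = find_stale_snapshots(datastore_root, max_age_hours)
    for path, age_hours in stale:
        print(f"{path}: {age_hours:.1f} hours old")
    return 1 if stale else 0
```

A wrapper that ends with sys.exit(report("/mnt/vmware_datastore")) could be run from cron after the nightly backup window. Since cron mails any output by default, a leftover snapshot would show up in the morning's mail. The script only reports; deleting the snapshot is still best done through the VMware client so the host can consolidate the disks properly.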
Hopefully this will save time and trouble for some other folks out there.
2011-09-03 06:04 AM
> few VMware snapshots present on the VMs showing slowness
Just for clarification: did you mean 'VMware snapshots' taken by VMware (native to VMware), or 'VMware snapshots' taken by NetApp (native to NetApp, i.e. Snapshots (tm))?
2011-09-05 01:52 AM
Personally I would steer clear of VMware snapshots - unless they are really necessary.
OS consistency is not a massive issue in most cases. At the end of the day, taking an inconsistent snapshot equates to suddenly pulling the plug from a server - nine times out of ten (if not more) the OS will come back up with only a slight complaint.
I know, however, that some people are using VMware snapshots for achieving application consistency (via VSS integration).
2011-09-05 08:31 AM
I like having my backups consistent.
Given that a VMware snapshot also snapshots the memory and machine state, the machine will simply resume where it left off when it recovers (although obviously any open sockets/network connections will be broken). Looked at that way, you should never get data loss. When you're not using VMware snapshots you are taking a chance, even if it's a small one.
2011-09-05 08:40 AM
Yeah, I see the point - it's nice to have a better / consistent backup.
That being said, it is always benefit vs. cost: if we are talking about just an OS image, without any user data, what's the actual risk of having even a number of unusable backups? Going two weeks back to find a good one and then re-applying two weeks' worth of patches?
VMware snapshots can help with OS consistency, indeed - yet they have proved cumbersome in many situations (e.g. when using SMVI).
2011-09-05 02:23 PM
Radek, yeah, in a lot of cases there won't be user data (for example, VMs where the critical data is attached to the VM via NFS or iSCSI; keeping the VM and the data it operates on separate is definitely a good move in most cases), but sometimes there will be. We are a fairly small shop with a small support team, and the VMware cluster is there for use by various people with various disparate needs.
Journalling filesystems have of course been standard for some time now so, as you said, the probability of loss is minor. But there'd have to be something cool about performing a major disaster recovery and, afterwards, the users logging in and finding all their stuff is still running just as it was before. Not necessarily strictly justified, but certainly cool as