I have been having an issue with the VMware snapshots generated by SMVI. On certain backup attempts by SMVI the VMware snapshots are never removed. It doesn't seem to matter whether the SMVI job is successful or not, as I have seen it fail to clear snapshots in both cases. The attached .bmp file shows the three uncleared snapshots from three different dates. The logs are from two different servers that had uncleared snaps that happened to be on the same day. I couldn't go any earlier as the SMVI logs "could not find the backup id."
What seems to be the recurring issue:
2009-08-09 08:02:28,244 INFO - FLOW-11012: Operation requested retry
2009-08-09 08:02:38,306 INFO - FLOW-11006: Starting operation in threadpool backup
This repeats for several minutes and then it looks like it just gives up. I have most smvi backups kicking off at the same time - not sure if that could be a relevant factor.
I saw this once before where we were snapshotting about 75 servers on 4 Dell 2950s (2x quad 2.5, 24 GB RAM, NFS vols). We were doing snaps with scripts (I think it was from the TR article on best practices for ESX on NetApp) instead of SMVI, but functionally you could have the same issue. Our script and SMVI (in one way or another) call ESX to list the VMs it is controlling at the moment, take a snap of each VM one by one (repeat for all servers), tell the filer to snap the volume, then tell all the ESX servers to unsnap their VMs one by one. It takes a few seconds per VM to snap them and more time to unsnap them. The amount of time to unsnap them depends on how many changes were racked up in the .000001.vmdk snapshot files. It may only take a few seconds to apply the first snapshot file to the first VM, but the deltas in the other VMs were growing the whole time, so by the time ESX got around to applying the snap in the 75th VM, it had been hours... and we were running the script every 4 hours.
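To put rough numbers on that pile-up, here is a minimal Python sketch of the sequence described above: snap every VM in turn, snap the volume, then commit the VM snapshots one by one while the remaining deltas keep growing. Every parameter and rate here is a made-up placeholder, the volume snapshot is assumed to be instantaneous, and growth of a delta during its own commit is ignored, so treat it as a back-of-the-envelope model rather than how ESX actually behaves.

```python
def commit_schedule(num_vms, snap_secs, base_commit_secs,
                    change_mb_per_sec, commit_mb_per_sec):
    """Return (total_seconds, delta_mb_per_vm) for one snap/unsnap cycle.

    All arguments are hypothetical tuning knobs, not measured values.
    """
    # Phase 1: VM snapshots are taken one after another.
    snap_taken = [i * snap_secs for i in range(num_vms)]
    clock = num_vms * snap_secs  # volume snapshot assumed instantaneous

    # Phase 2: commit each VM snapshot in the same order. A VM's delta has
    # been growing since its snapshot was taken, so later VMs have more to
    # roll up and take longer to commit.
    deltas = []
    for taken_at in snap_taken:
        delta_mb = (clock - taken_at) * change_mb_per_sec
        deltas.append(delta_mb)
        clock += base_commit_secs + delta_mb / commit_mb_per_sec
    return clock, deltas


# Example with invented numbers: 75 VMs, 5 s to snap each, 10 s fixed
# commit overhead, 0.5 MB/s of changes per VM, 20 MB/s commit throughput.
total, deltas = commit_schedule(75, 5, 10, 0.5, 20)
print(f"cycle takes about {total / 60:.0f} minutes")
print(f"first delta {deltas[0]:.0f} MB, last delta {deltas[-1]:.0f} MB")
```

If the total cycle time from a model like this gets anywhere near the interval between runs, the next run starts stacking snapshots on top of the ones still being committed.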
So one day the NFS volume went offline. A quick glance showed that it had filled up. Grew the flexvol, got the volume back in ESX, powered on the VMs, no problem. So what was taking the space? Oh, the 50 snaps of each VM. Wish I had saved that screenshot.
Solution (for us): run the script every 12 hours. We could have kept an eye on how long it takes to unsnap the last VM (a moving target as you deploy more), moved page files to another volume that's not being snapped (since they get the most changes), etc. They (the customer) didn't care enough to do that.
Sorry for the long story, but if you happen to be snapping a bunch of VMs on relatively few ESX servers, that will happen to anyone. I hear there are some fixes to how fast a delta can roll up in ESX4 but I haven't played with it yet. Good thing about SMVI: you can choose which VMs you want quiesced and don't have to do them all each rotation. Of course you want them all, but you can eliminate at least a few from each snap cycle. SMVI snaps the volume anyway, so any VM in the volume will be in the .snapshot. Only the ones you chose to quiesce will have the .000001.vmdk there. What server doesn't come back from being powered off or having its plug pulled nowadays anyway? You can still recover unquiesced VMs from the snapshot dir. A db server or Exchange or something like that might be different, but most servers should be fine.
That's some great info and some good tips! I have also had success breaking the VMs in a datastore into separate jobs. For example, if you had 90 VMs in a datastore, schedule 3 jobs, one at 1:00 AM, one at 1:20 and one at 1:40, and quiesce 30 VMs at a time. This will limit how many snapshots have to be created on each backup run. Alternatively you could spread the jobs out evenly through the 24 hour period. What this gives you is still a 24 hour quiesced RPO for each VM in the datastore, but a much shorter crash consistent RPO. This of course is not necessary if you do an hourly crash consistent backup anyhow, which is very tempting since the crash consistent backups take mere seconds to complete and put no load on VMware at all. One caution though: if you break up the VMs like this, new VMs will only be backed up crash consistent by default. You will have to go into one of the quiesced jobs and add the new VM to it to get a quiesced backup. Not a big deal, you just have to be sure you don't forget to do it!
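As a rough illustration of that split, here is a small Python sketch that chops a VM list into fixed-size batches with staggered start times, plus a check for VMs that are not in any quiesced job yet (the "don't forget the new VM" caution). The VM names and helper functions are invented for the example and nothing here talks to SMVI; you would still create the jobs by hand.

```python
from datetime import datetime, timedelta


def stagger_jobs(vm_names, batch_size, first_start, gap_minutes):
    """Split vm_names into batches and give each batch a start time."""
    jobs = []
    for n, i in enumerate(range(0, len(vm_names), batch_size)):
        start = first_start + timedelta(minutes=n * gap_minutes)
        jobs.append((start.strftime("%H:%M"), vm_names[i:i + batch_size]))
    return jobs


def not_in_any_job(all_vms, jobs):
    """Return VMs that no quiesced job covers (crash consistent only)."""
    covered = {vm for _, batch in jobs for vm in batch}
    return [vm for vm in all_vms if vm not in covered]


# Hypothetical datastore with 90 VMs; the date part is arbitrary and only
# needed so the start times can be added up.
vms = [f"vm{n:02d}" for n in range(1, 91)]
plan = stagger_jobs(vms, 30, datetime(2009, 1, 1, 1, 0), 20)
for start, batch in plan:
    print(start, len(batch), "VMs")

# A VM deployed later shows up as uncovered until someone adds it to a job.
print(not_in_any_job(vms + ["vm-new"], plan))
```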
Yeah, separate volumes would speed up the unsnapping operation since the deltas wouldn't pile up so much. You'd lose out a little on deduping, but when you have 2.x T usable from a 300k fiber shelf at a minimum, it's not really that big a deal. Could you snap a qtree while deduping the volume for the best of both worlds? (NetApp's not my day job, if you couldn't tell.) At the end of the day, ESX is the limiting factor (no, that's not a slam on them.. it's not magic, stuff needs to do stuff to work, as long as we are being technical). Breaking out page files to another volume sounds great in theory, but to recover you have to make sure that <pagefile>.vmdk is there, and that's just more staging you have to do at the DR site if you're snapmirroring it. Like you said, all that tweaking turns what could be a right click clone into a right click clone, check multiple volumes at the primary and DR site for page file vmdks, and change your backup schedule. Not really that bad (but still more work) when in the physical world you have to deploy your backup agent and put it on a schedule... or cut a ticket to the 'backup team' with the new server name if you're so fortunate to have one of those : )
What I wound up doing is spanning the schedule out during the day although I just had the same thing happen this morning where smvi failed.
2009-08-28 08:15:49,782 INFO - FLOW-11012: Operation requested retry
2009-08-28 08:16:04,668 INFO - FLOW-11006: Starting operation in threadpool backup
2009-08-28 08:16:04,699 WARN - VMware Task "CreateSnapshot_Task" for entity "VMFINANCE" failed with the following error: Operation timed out.
2009-08-28 08:16:04,699 ERROR - VM "VMFINANCE" will not be backed up since vmware snapshot create operation failed.
I haven't received nearly as many errors as I was getting before, so it appears to have helped. Also, I do not have a large number of VMs - SMVI backs up about 21 of our VMs.
Interesting, since it helped we might be on the right path. How are your ESX hosts configured? How much memory did you assign to the service console? Anything else running on the service console (HP SIM, backup agents, etc.)? Any chance the failures happen on VMs that happen to be on a particular host?
Two ESX hosts (herc and apollo) on FC infrastructure connected to two 4.4 TB SANs with clustered controller FAS2020. Herc has 800 MB for the service console, Apollo has 272 MB. Nothing else running on the service console that I know of. Doesn't seem to be isolated to a single ESX host as it happens on guests on both hosts.
Yes VMware tools all up to date. I was thinking the same thing in regards to host/guest but it does not seem to be esx dependent. I am reinstalling vmware tools on that finance server that failed earlier this morning but can't reboot it just yet to install again.
Same issue continuing to happen on SMVI backups. The virtual machine appears to have uncommitted VMware snapshots already sitting in snapshot manager and the recurring error is:
2009-09-10 01:32:00,814 INFO - FLOW-11012: Operation requested retry
2009-09-10 01:32:15,798 INFO - FLOW-11006: Starting operation in threadpool backup
2009-09-10 01:45:11,992 WARN - VMware Task "CreateSnapshot_Task" for entity "VMSolClient" failed with the following error: Operation timed out.
2009-09-10 01:45:11,992 ERROR - VM "VMSolClient" will not be backed up since vmware snapshot create operation failed.
I can create a manual snapshot from VC. I can also manually create VM or datastore snapshots successfully from within SMVI and see that the VM snapshots get removed as they should.
Main issue is that SMVI claims the VMware task to create a snapshot failed when in fact it did not fail, as you can see the snapshot sitting under the VM in VC. Since SMVI doesn't think it created a VMware snapshot, it never attempts to commit it, which means it sits there until manually removed.
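Until the root cause is found, one stopgap is to pull the names of the affected VMs out of the SMVI log so you know which ones to check in snapshot manager. The Python sketch below matches the WARN line format quoted in this thread; the helper name is made up, and it assumes the log wording is always the same as in these samples, which may not hold for other SMVI versions.

```python
import re

# Matches the WARN line SMVI writes when the snapshot task is reported
# as failed; the VM name is the quoted entity.
FAILED = re.compile(
    r'VMware Task "CreateSnapshot_Task" for entity "([^"]+)" failed'
)


def vms_with_failed_snapshot(log_text):
    """Return VM names with a failed CreateSnapshot_Task, in log order."""
    seen = []
    for name in FAILED.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Sample taken from the log lines posted above.
sample = '''2009-09-10 01:32:00,814 INFO - FLOW-11012: Operation requested retry
2009-09-10 01:32:15,798 INFO - FLOW-11006: Starting operation in threadpool backup
2009-09-10 01:45:11,992 WARN - VMware Task "CreateSnapshot_Task" for entity "VMSolClient" failed with the following error: Operation timed out.
2009-09-10 01:45:11,992 ERROR - VM "VMSolClient" will not be backed up since vmware snapshot create operation failed.
'''

for vm in vms_with_failed_snapshot(sample):
    print("check for a leftover snapshot on", vm)
```

This only tells you which VMs SMVI gave up on. Whether a snapshot was actually left behind still has to be confirmed in VC.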
The link below describes the symptoms but not the actual cause of the issue. A snapshot that should take seconds but doesn't complete within 15 minutes is not likely to complete with a longer timeout either. My environment is small enough (2 ESX hosts, 23 VMs, and a 1:1 relationship between datastore and virtual machine) to safely assume that increasing the timeout will not correct the issue.
We have this problem also. "Operation Timed Out" errors come in 2 forms - one that fails quickly (within a couple of minutes) that doesn't create a snap and another that fails after 15 minutes, after which it creates the snap. If you look at the vmware.log file for the VM concerned, you'll find that in the 15 minute case there is nothing logged during the 15 minutes SMVI/VC is waiting and only after the timeout occurs is a snap created (in a few seconds). It's as if the timeout is giving the hostd a kick, so increasing the timeout is unlikely to help (as you say).
We don't have a solution. Our suspicion is that it is due to the parallelism of the snapshot activity, but from the sounds of it you may not have a large number of simultaneous VMsnaps occurring... curious.