Re: SMVI VMware snapshots not committing

ttrulis01 · ‎2009-08-14

I have been having an issue with the VMware snapshots generated by SMVI. On certain backup attempts, the VMware snapshots are never removed. It doesn't seem to matter whether the SMVI job is successful or not, as I have seen it fail to clear snapshots in both cases. The attached .bmp file shows three uncleared snapshots from three different dates. The logs are from two different servers that happened to have uncleared snaps on the same day. I couldn't go back any earlier, as the SMVI logs "could not find the backup id."

What seems to be the recurring issue:

2009-08-09 08:02:28,244 INFO - FLOW-11012: Operation requested retry
2009-08-09 08:02:38,306 INFO - FLOW-11006: Starting operation in threadpool backup

This repeats for several minutes and then it looks like it just gives up. I have most smvi backups kicking off at the same time - not sure if that could be a relevant factor.

Anyone else experiencing similar issues?

markcarnes · ‎2009-08-21

I saw this once before, when we were snapshotting about 75 servers on 4 Dell 2950s (2x quad 2.5, 24 GB RAM, NFS vols). We were doing snaps with scripts (I think it was from the TR article on best practices for ESX on NetApp) instead of SMVI, but functionally you could have the same issue. Our script and SMVI (in one way or another) call ESX to list the VMs it is controlling at the moment, take a snap of each VM one by one (repeat for all servers), tell the filer to snap the volume, then tell all the ESX servers to unsnap their VMs one by one. It takes a few seconds per VM to snap them and more time to unsnap them. The time to unsnap depends on how many changes were racked up in the .000001.vmdk snapshot files. It may only take a few seconds to apply the first snapshot file to the first VM, but the deltas in the other VMs were growing the whole time, so by the time ESX got around to applying the snap in the 75th VM, it had been hours... and we were running the script every 4 hours.
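To make the snowball effect above concrete, here is a rough toy model in Python. It is only a sketch: every number in it (seconds per snapshot, delta growth rate, commit throughput) is a made-up illustration value, not a measurement from this environment, and it ignores the fact that a delta keeps growing while it is being committed.

```python
def commit_schedule(num_vms, snap_seconds, delta_mb_per_min, commit_mb_per_sec):
    """Model sequential VM snapshot creation followed by sequential commit.

    All parameters are hypothetical illustration values.
    Returns a list of (vm_index, delta_mb_at_commit, commit_seconds).
    """
    # VM i gets its snapshot at time i * snap_seconds.
    created_at = [i * snap_seconds for i in range(num_vms)]
    # All VM snapshots are taken; the volume snapshot is assumed to be instant.
    clock = num_vms * snap_seconds
    results = []
    for i in range(num_vms):
        # The delta file has been growing since this VM's snapshot was created.
        age_min = (clock - created_at[i]) / 60.0
        delta_mb = age_min * delta_mb_per_min
        commit_seconds = delta_mb / commit_mb_per_sec
        results.append((i, delta_mb, commit_seconds))
        # The next VM has to wait for this commit to finish.
        clock += commit_seconds
    return results


runs = commit_schedule(num_vms=75, snap_seconds=5,
                       delta_mb_per_min=10, commit_mb_per_sec=5)
first, last = runs[0], runs[-1]
print("first VM: %.0f MB delta, %.0f s to commit" % (first[1], first[2]))
print("last VM:  %.0f MB delta, %.0f s to commit" % (last[1], last[2]))
```

In this toy model the backlog only snowballs when committing a VM takes longer than snapshotting one, which matches the "more time to unsnap them" observation above.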

So one day the NFS volume went offline. A quick glance showed that it had filled up. Grew the flexvol, got the volume back in ESX, powered on the VMs, no problem. So what was taking the space? Oh, the 50 snaps of each VM. Wish I had saved that screenshot.

Solution (for us): run the script every 12 hours. We could have kept an eye on how long it takes to unsnap the last VM (a moving target as you deploy more), moved page files to another volume that's not being snapped (since they get the most changes), etc. They (the customer) didn't care enough to do that.

Sorry for the long story, but if you happen to be snapping a bunch of VMs on relatively few ESX servers, that will happen to anyone. I hear there are some fixes to how fast a delta can roll up in ESX4, but I haven't played with it yet. Good thing about SMVI: you can choose which VMs you want quiesced and don't have to do them all each rotation. Of course you want them all, but you can eliminate at least a few from each snap cycle. SMVI snaps the volume anyway, so any VM in the volume will be in the .snapshot. Just the ones you chose to quiesce will have the .000001.vmdk there. What server doesn't come back from being powered off or having its plug pulled nowadays anyway? You can still recover unquiesced VMs from the snapshot dir. A DB server or Exchange or something like that might be different, but most servers should be fine.

For what it's worth...

keitha · ‎2009-08-21

Mark,

That's some great info and some good tips! I have also had success breaking the VMs in a datastore into separate jobs. For example, if you had 90 VMs in a datastore, schedule 3 jobs, one at 1:00 AM, one at 1:20, and one at 1:40, and quiesce 30 VMs at a time. This limits how many snapshots have to be created on each backup run. Alternatively, you could spread the jobs out evenly through the 24-hour period. What this gives you is still a 24-hour quiesced RPO for each VM in the datastore, but a much shorter crash-consistent RPO. This of course is not necessary if you do an hourly crash-consistent backup anyhow, which is very tempting since the crash-consistent backups take mere seconds to complete and put no load on VMware at all. Only one caution though: if you break up the VMs like this, new VMs will only be backed up crash consistent by default. You will have to add each new VM to one of the consistent jobs to get a quiesced backup. Not a big deal, you just have to be sure you don't forget to do it!
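As a rough sketch of the staggering idea above, the Python below splits a list of VM names into batches and assigns each batch a start time. It only plans the schedule; it does not talk to SMVI, and the VM names, the helper names, and the batch size are all placeholders.

```python
from datetime import datetime, timedelta


def stagger_jobs(vm_names, batch_size, first_start, gap_minutes):
    """Split VMs into quiesced-backup batches with staggered start times."""
    jobs = []
    for n, i in enumerate(range(0, len(vm_names), batch_size)):
        start = first_start + timedelta(minutes=n * gap_minutes)
        jobs.append((start.strftime("%H:%M"), vm_names[i:i + batch_size]))
    return jobs


def unassigned(all_vms, jobs):
    """Return VMs that are in no quiesced job (e.g. newly deployed ones)."""
    covered = {vm for _, batch in jobs for vm in batch}
    return [vm for vm in all_vms if vm not in covered]


vms = ["vm%02d" % i for i in range(90)]  # placeholder names
jobs = stagger_jobs(vms, 30, datetime(2009, 8, 21, 1, 0), 20)
for start, batch in jobs:
    print(start, len(batch))

# A VM added later shows up as unassigned until someone adds it to a job.
print(unassigned(vms + ["vm-new"], jobs))
```

The second helper covers the caution about new VMs: anything it returns would only be getting crash-consistent backups.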

Keith

markcarnes · ‎2009-08-21

Yeah, separate volumes would speed up the unsnapping operation since the deltas wouldn't pile up so much. You'd lose out a little on deduping, but when you have 2.x T usable from a 300k fiber shelf at a minimum, it's not really that big a deal. Could you snap a qtree while deduping the volume for the best of both worlds? (NetApp's not my day job, if you couldn't tell.) At the end of the day, ESX is the limiting factor (no, that's not a slam on them... it's not magic, stuff needs to do stuff to work, as long as we are being technical). Breaking out page files to another volume sounds great in theory, but to recover, you have to make sure that <pagefile>.vmdk is there, and that's just more staging you have to do at the DR site if you're snapmirroring it. Like you said, all that tweaking turns what could be a right-click clone into a right-click clone, check multiple volumes at the primary and DR site for page file vmdks, and change your backup schedule. Not really that bad (but still more work) when in the physical world you have to deploy your backup agent and put it on a schedule... or cut a ticket to the 'backup team' with the new server name if you're so fortunate to have one of those : )

keitha · ‎2009-08-21

Mark,

I wasn't suggesting separate volumes, just separate jobs for the same volume. I agree, keep them all in one NFS datastore to keep the dedupe up.

Keith

markcarnes · ‎2009-08-21

Oh right. Sorry, these forums aren't like emails where you can see the post you're replying to. Hey, emoticons!

Yup, multiple snapshots not volumes does all that. Woohoo, my post count just went to 3. I could do this all day.

ttrulis01 · ‎2009-08-28

What I wound up doing is spanning the schedule out during the day although I just had the same thing happen this morning where smvi failed.

2009-08-28 08:15:49,782         INFO - FLOW-11012: Operation requested retry
2009-08-28 08:16:04,668         INFO - FLOW-11006: Starting operation in threadpool backup
2009-08-28 08:16:04,699         WARN - VMware Task "CreateSnapshot_Task" for entity "VMFINANCE" failed with the following error: Operation timed out.
2009-08-28 08:16:04,699         ERROR - VM "VMFINANCE" will not be backed up since vmware snapshot create operation failed.

I haven't received nearly as many errors as I was before, so it appears to have helped. Also, I do not have a large number of VMs: SMVI backs up about 21 of our VMs.

keitha · ‎2009-08-28

Interesting. Since it helped, we might be on the right path. How are your ESX hosts configured? How much memory did you assign to the service console? Anything else running on the service console (HP SIM, backup agents, etc.)? Any chance the failures happen on VMs that happen to be on a particular host?

ttrulis01 · ‎2009-08-28

Two ESX hosts (herc and apollo) on FC infrastructure connected to two 4.4 TB SANs with clustered controller FAS2020. Herc has 800 MB for the service console, apollo has 272 MB. Nothing else running on the service console that I know of. Doesn't seem to be isolated to a single ESX host, as it happens on guests on both hosts.

keitha · ‎2009-08-28

I would have said that you try to up the memory in the SC for apollo but if it happens on both hosts... Are the VMware tools in the VMs up to date?

ttrulis01 · ‎2009-08-28

Yes VMware tools all up to date. I was thinking the same thing in regards to host/guest but it does not seem to be esx dependent. I am reinstalling vmware tools on that finance server that failed earlier this morning but can't reboot it just yet to install again.

joshb · ‎2009-08-28

By chance, do you have a rough idea how long it took before the snapshot failed for your finance VM?

If it fails after 15 minutes, then you likely have high load or I/O on either the ESX server or filer and the ESX server cannot finish the VMware snapshot in time.

If it fails in under 15 minutes, then the issue is elsewhere. Either VSS or VMware Tools in general are usually places to check next.
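One way to check which side of the 15-minute mark a failure falls on is to measure it from the SMVI log. The sketch below is a guess at a workable approach, not an official tool: the regular expression is inferred from the log lines quoted in this thread, and treating the first parsed line as the start of the attempt is an assumption, so it needs the log from the very start of the backup operation to give a meaningful number.

```python
import re
from datetime import datetime, timedelta

# Line layout inferred from the excerpts in this thread:
# "2009-08-28 08:16:04,699   WARN - message"
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\s+(\w+)\s+-\s+(.*)$")


def parse(line):
    """Return (timestamp, level, message), or None for unrecognized lines."""
    m = LINE.match(line.strip())
    if not m:
        return None
    stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    stamp += timedelta(milliseconds=int(m.group(2)))
    return stamp, m.group(3), m.group(4)


def time_to_snapshot_failure(lines):
    """Elapsed time from the first parsed line to the CreateSnapshot timeout."""
    first = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        stamp, level, message = parsed
        if first is None:
            first = stamp  # assumption: first line marks the start
        if (level == "WARN" and "CreateSnapshot_Task" in message
                and "timed out" in message):
            return stamp - first
    return None


excerpt = [
    '2009-08-28 08:15:49,782         INFO - FLOW-11012: Operation requested retry',
    '2009-08-28 08:16:04,668         INFO - FLOW-11006: Starting operation in threadpool backup',
    '2009-08-28 08:16:04,699         WARN - VMware Task "CreateSnapshot_Task" for entity "VMFINANCE" failed with the following error: Operation timed out.',
]
elapsed = time_to_snapshot_failure(excerpt)
print(elapsed, elapsed < timedelta(minutes=15))
```

The excerpt here only covers the last retry before the failure, so the elapsed time it reports is far shorter than the real attempt; feed it the whole job's log lines instead.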

ttrulis01 · ‎2009-08-28

If I can judge from the log, it looks like it failed in under 15 mins. That is why I was looking at reinstalling VMware Tools.

ianaforbes · ‎2009-10-23

ttrulis01 - The error message you received:

2009-08-28 08:15:49,782         INFO - FLOW-11012: Operation requested retry
2009-08-28 08:16:04,668         INFO - FLOW-11006: Starting operation in threadpool backup
2009-08-28 08:16:04,699         WARN - VMware Task "CreateSnapshot_Task" for entity "VMFINANCE" failed with the following error: Operation timed out.
2009-08-28 08:16:04,699         ERROR - VM "VMFINANCE" will not be backed up since vmware snapshot create operation failed

Is exactly my issue. This only happens intermittently. Did you find a resolution?

Thanks

ttrulis01 · ‎2009-09-10

Same issue continuing to happen on SMVI backups. The virtual machine appears to have uncommitted VMware snapshots already sitting in Snapshot Manager, and the recurring error is:

2009-09-10 01:32:00,814 INFO - FLOW-11012: Operation requested retry

2009-09-10 01:32:15,798 INFO - FLOW-11006: Starting operation in threadpool backup

...

2009-09-10 01:45:11,992 WARN - VMware Task "CreateSnapshot_Task" for entity "VMSolClient" failed with the following error: Operation timed out.

2009-09-10 01:45:11,992 ERROR - VM "VMSolClient" will not be backed up since vmware snapshot create operation failed.

I can create a manual snapshot from VC. I can also create VM or datastore snapshots successfully from within SMVI manually and see that the VM snapshots get removed as they should.

The main issue is that SMVI claims the VMware task to create a snapshot failed when in fact it did not fail, as you can see the snapshot sitting under the VM in VC. Since SMVI doesn't think it created a VMware snapshot, it never attempts to commit it, which means it sits there until manually removed.

The link below describes the symptoms but not the actual cause of the issue. A snapshot that should take seconds and doesn't complete within 15 minutes is not likely to complete given more than 15 minutes. My environment is small enough (2 ESX hosts, 23 VMs, and a 1:1 relationship between datastore and virtual machine) to safely assume that increasing the timeout will not correct the issue.

http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=324112

Thought I'd update for anyone else who has come across this. Hopefully a resolution will follow.

ktim · ‎2009-09-27

We have this problem also. "Operation Timed Out" errors come in 2 forms - one that fails quickly (within a couple of minutes) that doesn't create a snap and another that fails after 15 minutes, after which it creates the snap. If you look at the vmware.log file for the VM concerned, you'll find that in the 15 minute case there is nothing logged during the 15 minutes SMVI/VC is waiting and only after the timeout occurs is a snap created (in a few seconds). It's as if the timeout is giving the hostd a kick, so increasing the timeout is unlikely to help (as you say).

We don't have a solution. Our suspicion is that it is due to the parallelism of the snapshot activity, but from the sounds of it you may not have a large number of simultaneous VMsnaps occurring... curious.

ianaforbes · ‎2009-10-23

I'm having the same timeout issues (same errors) that you've reported.

Have you resolved the issue?

ttrulis01 · ‎2009-10-23

Unfortunately, I don't have a concrete resolution for this one. What I had noticed about the SMVI setup was that all backups were starting at the same time. I spread them out through the day so they were not all starting around the same time. I also increased the timeout as per this article: http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=324112 but as I stated above, I don't think it was due to this. Perhaps it will help solve someone's problems though.

nmslabsit · ‎2009-10-26

We just began getting the same errors. I noticed at first there were snapshots left after SMVI did its thing. Now we are seeing two errors when snapshotting all 21 of our VMs in one job.

Error # 1

Create virtual machine snapshot
eTrust v8.1 (HPDL140)
Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.
10/26/2009 2:40:45 PM
10/26/2009 2:40:45 PM
10/26/2009 2:42:11 PM

Error # 2

Create virtual machine snapshot
vCenter Server (VSS001)
Operation timed out.

10/26/2009 2:40:45 PM
10/26/2009 2:40:45 PM
10/26/2009 2:41:38 PM

When I break the SMVI backup job in half these errors go away. I saw the bug ID 324112 but it specifically states that "The error '....exceeded the time limit for holding off I/O in the frozen.....' is not related with this issue. This is related with the VMware environment and mentioned workarounds may not help in resolving this."

Is anyone else seeing these errors or have a potential solution?

ttrulis01 · ‎2010-02-08

This is still happening. If SMVI gets interrupted before it cleans up the snapshot, it just leaves it there. Over time the snapshots layer up, and if I am not checking the servers for snapshots every week I risk running out of room on the partition. This product, as well as SME, is causing more pain than its functionality is helping alleviate.
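The weekly manual check could in principle be scripted. The sketch below only shows the tree walk: it assumes snapshot nodes shaped like the vSphere API's VirtualMachineSnapshotTree (name, createTime, childSnapshotList), which is what pyVmomi exposes under vm.snapshot.rootSnapshotList. Connecting to vCenter and fetching those objects is left out, and the fake nodes and snapshot names below are made up for the example.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def stale_snapshots(root_list, now, max_age):
    """Yield (path, age) for every snapshot in the tree older than max_age."""
    stack = [(node, node.name) for node in root_list]
    while stack:
        node, path = stack.pop()
        age = now - node.createTime
        if age > max_age:
            yield path, age
        # Layered snapshots hang off their parent as children.
        for child in node.childSnapshotList or []:
            stack.append((child, path + "/" + child.name))


# Fake tree standing in for one VM's snapshots (hypothetical names).
now = datetime(2010, 2, 8, tzinfo=timezone.utc)
newer = SimpleNamespace(name="leftover-2", createTime=now - timedelta(days=1),
                        childSnapshotList=[])
older = SimpleNamespace(name="leftover-1", createTime=now - timedelta(days=10),
                        childSnapshotList=[newer])

for path, age in stale_snapshots([older], now, timedelta(days=7)):
    print(path, "is", age.days, "days old")
```

Run against real snapshot trees on a schedule, something like this would flag leftovers before they fill the partition; removing them would still be a separate, deliberate step.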

keitha · ‎2010-02-08

I'm sorry to hear about the problems. I did just post a blog about a tool we developed that might help ensure the SMVI snaps are cleaned up. The link to it is here: http://blogs.netapp.com/virtualization/2010/02/cleaning-up-vmware-snapshots.html

You could run it after the last SMVI job (assuming you are using SMVI 2.0) and it will ensure all the SMVI snaps are cleaned up. Let me know what you think.

Keith