SMVI: Cannot quiesce VM after SFR restore unless it is restarted

thomasdorsch · ‎2012-08-24

For the past few days I have been experiencing problems when taking snapshots of virtual machines. The error is reproducible on W2K8 R2 machines. I have not tested other operating systems yet.

Environment: VSC 4, vSphere ESXi 5 U1 cluster, NetApp FAS 3140 (ONTAP 8.0.1 P5), shared storage (NFS)

A quiesced snapshot of a vm fails with the following error details:

The guest OS has reported an error during quiescing. The error code was: 5 The error message was: Asynchronous operation failed: VssSyncStart

The guest OS has reported an error during quiescing. The error code was: 3 The error message was: Snapshot operation aborted

The guest OS has reported an error during quiescing. The error code was: 3 The error message was: Error when notifying the sync provider.
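When these errors have to be collected from many task logs, the code and message can be pulled out of the text mechanically. This is a minimal sketch, assuming the message always follows the wording quoted above; `parse_quiesce_error` is a hypothetical helper, not part of VSC or vSphere.

```python
import re

# Pattern built from the error wording quoted in this thread.
QUIESCE_ERROR = re.compile(
    r"The error code was: (\d+) The error message was: (.+)"
)

def parse_quiesce_error(text):
    """Return (code, message) from a quiescing error line, or None if it does not match."""
    match = QUIESCE_ERROR.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()

line = ("The guest OS has reported an error during quiescing. "
        "The error code was: 5 The error message was: "
        "Asynchronous operation failed: VssSyncStart")
print(parse_quiesce_error(line))
```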

The snapshot is not deleted afterwards, so I have to consolidate the snapshot VMDKs by hand.

Scenario:

Ran a scheduled backup with the quiescing option enabled (this ran well for almost a year with previous VSC versions, and SFR never caused problems). I recently updated to VSC 4.0 because I also updated my vSphere environment from 4.1 to 5.0 Update 1.
Mounted the backup using SFR and could access it without a problem.
Dismounted the backup and deleted the SFR session.
Tried to take a snapshot again with the quiescing option enabled, which fails with the above error.
Restarted the VM and took a snapshot again, which works without a problem (on another W2K8 server, restarting didn't fix it!)

The only additional information I see is an event in the Windows system log: The disk signature of disk 1 is equal to the disk signature of disk 0.

Another thing is that the *.vmx file was modified and the disk mode changed: scsi0:1.mode = "independent-persistent" ???

What's going on here? I have never seen these issues before!

Any help is welcome!

BR

Thomas

DOMINIC_WYSS · ‎2012-09-05

It looks like the snapshot is still mounted and the VMDK of the snapshot is still attached to the VM (on scsi0:1).

Double-check this and manually remove the disk and the mounted snapshot, if required.

thomasdorsch · ‎2012-09-05

That's right. The snapshot still exists and is in use. Even if you consolidate the snapshot, the entry for the previously mounted backup remains.

As a workaround I did the following:

Consolidate the snapshot (vCenter --> Snapshots --> Consolidate)
Shut down the VM
Edit the .vmx file and delete the obsolete entries
Start the VM again and everything is fine (... until you mount a backup again)
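Before editing the .vmx by hand, a read-only check can list which disk entries are candidates for cleanup. The following is a minimal sketch, assuming the .vmx has been copied off the datastore and read as text; `find_independent_disks` and the sample file names are hypothetical, and the script only reports entries, it does not change anything.

```python
import re

# Matches lines such as: scsi0:1.mode = "independent-persistent"
VMX_DISK_LINE = re.compile(r'^\s*(scsi\d+:\d+)\.(\w+)\s*=\s*"(.*)"\s*$')

def find_independent_disks(vmx_text):
    """Return {device: fileName or None} for disks in independent-persistent mode."""
    disks = {}
    for line in vmx_text.splitlines():
        match = VMX_DISK_LINE.match(line)
        if match:
            device, key, value = match.groups()
            disks.setdefault(device, {})[key] = value
    return {
        device: props.get("fileName")
        for device, props in disks.items()
        if props.get("mode") == "independent-persistent"
    }

# Hypothetical sample; only the scsi0:1.mode line is taken from this thread.
sample_vmx = '''scsi0:0.present = "TRUE"
scsi0:0.fileName = "vm.vmdk"
scsi0:1.present = "TRUE"
scsi0:1.fileName = "restored.vmdk"
scsi0:1.mode = "independent-persistent"
'''
print(find_independent_disks(sample_vmx))
```

Whether a reported entry is really obsolete still has to be decided by hand, since a disk may be configured as independent on purpose.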

Meanwhile I have opened a case.

I will post the result 😉

BR

Thomas

DOMINIC_WYSS · ‎2012-09-05

So the main problem is that the unmount does not work in the first place.

Have a look at the VMware logs when you unmount the snapshot in SFR; maybe they give some hints.

I think it cannot remove the VMDK from the running VM, and this leads to a chain reaction:

-vmdk cannot be removed from VM

-NetApp snapshot cannot be unmounted

-the following SMVI job fails because of a VSS error

-VMware snapshots are left over

Maybe in your environment you need to shut down your VM before dismounting the SFR snapshot, and everything else may then work as expected...

thomasdorsch · ‎2012-09-05

Shutting down the VM to avoid this cannot be the intended solution 😉

I never had this issue with VSC 2.0 or 2.1, so this is probably a problem specific to 4.0.

Our environment is set up following the best-practice guides from NetApp/VMware, so nothing exotic.

Anyway, I'll wait for the support engineers' answers.

DOMINIC_WYSS · ‎2012-09-05

Sure, shutting down cannot be the final solution.

But it would be a workaround and may help in troubleshooting the issue.

keitha · ‎2012-09-05

As Dominic said, it seems SFR isn't cleaning up correctly, which confuses the VSS writer in VMware Tools when it tries to quiesce. I suspect that if you did a disk rescan after an SFR session, you would be fine: Windows would see that disk 1 no longer exists and everything would work. It should be doing that scan after we remove the disk, though... odd.

Keith

thomasdorsch · ‎2012-09-06

Hello Guys,

You are right. SFR isn't cleaning up correctly, and after unmounting the backup the remaining entries confuse the VSS writer even though there are no more mounted disks (and yes, I did rescan the disks in Windows).

Last night the NetApp SE sent me a link suggesting that I reinstall VMware Tools without the VSS component, which I did today. Now SFR and SMVI are working as expected again, but the VMX still contains the entries for the SFR-mounted disk!

To me this issue looks W2K8-specific, because under W2K3 it has no effect.

Thomas

pdilernia · ‎2012-09-18

Hi Thomas,

Any luck? We're experiencing the same issues, in a very similar situation:

Ran scheduled backups with quiescing option enabled on a 2008R2 SQL server, ran well for at least 6 months. We recently updated the environment from 4.1 to 5.0 Update 1 and now server backup fails with "The guest OS has reported an error during quiescing. The error code was: 3 The error message was: Error when notifying the sync provider. " We temporarily "fix" the issue by creating a snapshot, then going into the Snapshot manager and performing a "Delete All" clearing all the extra disks... until the next backup and then we start over again with the failure.

We're opening a case in the next day or two with VMware & NetApp at the same time. Will post with any progress.

Thanks for starting this discussion

Regards,

Paolo

thomasdorsch · ‎2012-09-19

Hi Paolo,

The latest information from NetApp was that they suggest I open a case with VMware.

I have already declined and asked them to investigate further!

I still think it is a VSC-related issue and am waiting for an update from NetApp.

As soon as I have news I'll post it!

BR

Thomas

pdilernia · ‎2012-09-25

Hello Thomas,

VMware support is reviewing logs from the server. In the meantime they referenced the link below. It didn't help me, because the configuration already had the entry. Here is what they suggested:

“Can you add the Advanced configuration parameter disk.EnableUUID = TRUE as described in the KB: 1028881?

https://portal.mxlogic.com/redir/?5d55N5wQszC1NJeXPPZSbI3zo0cGgUOGoSfYLxkxydQTDDAjt-hojuv78I9CzATsS09SIlgkplcr7-ndFKefCQmm7Am1PBSjoVcS6hINIvd40MSIJmcP...

The above procedure requires downtime.

You will need to do this after consolidating all snapshots and powering off the virtual machine.”
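To see whether a VM already carries this parameter before scheduling downtime, the .vmx text can be checked offline. This is a rough sketch, assuming the file has been copied off the datastore; `get_vmx_option` is a hypothetical helper, and treating the key as case-insensitive is an assumption on my part.

```python
import re

def get_vmx_option(vmx_text, key):
    """Return the quoted value of `key` in .vmx text, or None if the key is absent."""
    # Assumption: key names are compared case-insensitively.
    pattern = re.compile(
        r'^\s*' + re.escape(key) + r'\s*=\s*"(.*)"\s*$', re.IGNORECASE
    )
    for line in vmx_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None

# Hypothetical sample content.
sample_vmx = '''scsi0:0.present = "TRUE"
disk.EnableUUID = "TRUE"
'''
print(get_vmx_option(sample_vmx, "disk.EnableUUID"))
```

A result of None means the parameter is not set in the file at all, which is different from it being set to "FALSE".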

Hopefully this information will help someone.

Regards,

Paolo

GWILMINGTON · ‎2012-09-25

I just ran into this as well while testing SFR. I noticed in the datastore that after the disks are attached and then removed, they are still in the VM's folder. I checked the VMX file and my VM is now referencing different sets of disks than it should be. It should be referencing RestoreTest_1.vmdk and RestoreTest.vmdk, but it's now referencing RestoreTest-000002.vmdk and RestoreTest_1-000002.vmdk. As you can tell from the screenshot, I've been doing some testing today. The disk files are from both the same VM and from another VM. My thought was to use a dedicated VM for restores to avoid this issue, but perhaps there's another issue with doing that. However, I don't see it actually consuming space on the LUN. I assume that's because they were vClones.

scsi0:0.present = "TRUE"

scsi0:0.fileName = "RestoreTest-000002.vmdk"

scsi0:0.deviceType = "scsi-hardDisk"

scsi0:1.present = "TRUE"

scsi0:1.fileName = "RestoreTest_1-000002.vmdk"

scsi0:1.deviceType = "scsi-hardDisk"
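The same check can be scripted for several VMs. Below is a minimal sketch that flags disk entries pointing at snapshot delta files, assuming the delta files follow the -00000x.vmdk naming seen above (a six-digit suffix); `delta_disk_references` is a hypothetical helper and only reads the text.

```python
import re

FILENAME_LINE = re.compile(r'^\s*(scsi\d+:\d+)\.fileName\s*=\s*"(.*)"\s*$')
# Assumption: snapshot delta disks end in a dash plus six digits, e.g. -000002.vmdk
DELTA_SUFFIX = re.compile(r'-\d{6}\.vmdk$')

def delta_disk_references(vmx_text):
    """Return {device: fileName} for disks whose fileName looks like a snapshot delta."""
    found = {}
    for line in vmx_text.splitlines():
        match = FILENAME_LINE.match(line)
        if match and DELTA_SUFFIX.search(match.group(2)):
            found[match.group(1)] = match.group(2)
    return found

# Entries copied from the post above.
vmx_text = '''scsi0:0.present = "TRUE"
scsi0:0.fileName = "RestoreTest-000002.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:1.present = "TRUE"
scsi0:1.fileName = "RestoreTest_1-000002.vmdk"
scsi0:1.deviceType = "scsi-hardDisk"
'''
print(delta_disk_references(vmx_text))
```

A non-empty result on a VM for which vCenter shows no snapshots is the mismatch described in this thread.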

thomasdorsch · ‎2012-09-26

Hi gwilmington,

Using a dedicated VM for restores is a possible workaround, so the original VM isn't involved or changed during SFR.

But this can only be a temporary solution.

Sure, your LUN doesn't grow because you are using snapshots 😉

I'm just surprised there is no response from other users, because this problem is definitely reproducible!

Could you please consolidate your snapshots, even if they aren't displayed in vCenter, and post the result? I mean, after that are you able to do another SFR?

BR

Thomas

GWILMINGTON · ‎2012-09-26

Hi thomasdorsch,

I will check them when I get in and get back to you. Yes I agree this is a workaround and not a solution.

GWILMINGTON · ‎2012-09-26

So here's the process that I had to perform to correct the issue:

I logged into the ESXi host on which the VM in question resides and ran the following commands. You will notice that even though I had several -00000x.vmdk files for this VM (see above), no snapshots show up for it when you run the vim-cmd vmsvc/snapshot.get command. So instead, I created a snapshot for the VM and then removed all snapshots, and the files are now gone.

~ # vim-cmd vmsvc/getallvms

Vmid Name File Guest OS Version Annotation

36 RestoreTest [ntap_sas_mgmt_lun1] RestoreTest/RestoreTest.vmx windows7Server64Guest vmx-08

~ # vim-cmd vmsvc/snapshot.get 36

Get Snapshot:

~ # vim-cmd /vmsvc/snapshot.create 36 snapshot1 snapshot 0 0

Create Snapshot:

~ # vim-cmd vmsvc/snapshot.get 36

Get Snapshot:

|-ROOT

--Snapshot Name : snapshot1

--Snapshot Id : 15

--Snapshot Desciption : snapshot

--Snapshot Created On : 9/26/2012 14:14:44

--Snapshot State : powered off

~ # vim-cmd /vmsvc/snapshot.removeall 36

Remove All Snapshots:

~ #
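For anyone checking many VMs, the snapshot.get output can be reduced to a list of snapshot names. This is a rough sketch, assuming the output looks like the transcript above; `snapshot_names` is a hypothetical helper that parses captured text and does not call vim-cmd itself.

```python
def snapshot_names(snapshot_get_output):
    """Extract snapshot names from captured `vim-cmd vmsvc/snapshot.get` output."""
    names = []
    for line in snapshot_get_output.splitlines():
        # Drop the tree decoration ("|-", "--") in front of each field.
        field = line.strip().lstrip("-|")
        if field.startswith("Snapshot Name"):
            _, _, value = field.partition(":")
            names.append(value.strip())
    return names

# Output copied from the transcript above.
no_snapshots = "Get Snapshot:\n"
one_snapshot = """Get Snapshot:
|-ROOT
--Snapshot Name : snapshot1
--Snapshot Id : 15
--Snapshot Desciption : snapshot
--Snapshot Created On : 9/26/2012 14:14:44
--Snapshot State : powered off
"""
print(snapshot_names(no_snapshots))
print(snapshot_names(one_snapshot))
```

An empty list for a VM whose folder still holds -00000x.vmdk files matches the situation described here, where the delta files exist but no snapshot is registered.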

I then did an SFR for the VM and I was able to mount the snapshot and recover a file. I then went into VSC and deleted the SFR. This removed the disk from the VM. I checked the data store and no -00000x.vmdk files exist either.

Let me clarify how I found this thread in the first place. I was setting up and testing hourly snapshots through VSC for my management VMs. When the process performed the snapshot, I would receive the quiesce warning for that one VM. In the VM's event log there was a partmgr error with the following information: "The disk signature of disk 2 is equal to the disk signature of disk 0." I normally get that when doing an SFR, and I got the same error this morning when I retried for thomasdorsch, but when I was snapping last night I was not performing an SFR at all. This led me to check the datastore, where I found all the -00000x.vmdk files, and the rest of the story you now know.

Edit: So I retried adding that machine to the hourly snapshot job and when I did, I got a quiesce error again. Now the -00000x.vmdk files are back.

thomasdorsch · ‎2012-10-02

Sorry for the late reply!

I have already updated the case and referred to this thread. I hope they can provide a solution soon!

Thomas

thomasdorsch · ‎2012-10-04

Hello everybody,

Yesterday I received a phone call from a NetApp SE.

Now it is official: It's a bug in VSC 4.0 which is already known!

Because it is non-public you cannot find it in the bug database.

It will be fixed in one of the next releases of VSC.

Stay tuned 😉

Thomas

pdilernia · ‎2012-10-05

Hello Everyone,

Our NetApp Engineers have heard rumors of a fix in the next release too. We haven’t heard of a release date as of yesterday.

These are the steps we went through last evening.

… I was able to take a quiesced snapshot of this server after much agony.

Tried …

Disabled SQL VSS

Checked all SQL permissions for service accounts for VSS

Checked for broken shadow copies - none

Reinstalled SQL VSS

Removed all NetApp software

Reinstalled tools for VMware 5.1

Reregistered VSS

Disabled Symantec

Changed the SQL VSS writer service account to a local admin

Multiple reboots in between to get the VSS writers into a stable state.

None of that worked. I found this kb article

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1031298

The VM had to be powered off to get to the setting. I am not sure whether disabling it causes any backup issues. All I know is that I checked the box to quiesce and it did not give me an error.

GWILMINGTON · ‎2012-10-05

Thanks, thomasdorsch, for updating the thread with the outcome of your case. I'm glad that we've received the correct information on this issue. At this point, it looks like my workaround is to build a VM for SFR restores for now. Appreciate everyone's hard work!