VSC - Backups Hanging (In progress)

penningtonkr

Hi,

I have VSC set to back up at datastore level, with a mixture of virtual machines that use both RDMs and VMDK disks.

All RDM mappings are set to physical, so VSC won't try to back these up. That is expected; we use SnapManager for the data drives.

For some reason vCenter is "hung" on the following tasks:

Create virtual machine snapshot | TMLIVEDWHSRV-01 | 95% | Administrator | TMVCentreSrv01 | 01/04/2011 00:02:41 | 01/04/2011 00:02:41

NetApp Create Backup | VMFS5_OS | 45% | Administrator | TMVCentreSrv01 | 01/04/2011 00:00:18 | 01/04/2011 00:00:18

Reconfigure virtual machine | TMLIVEDWHSRV-01 | In Progress | Administrator | TMVCentreSrv01 | 01/04/2011 00:15:01 | 01/04/2011 00:15:01

* Does anyone know why there is a reconfigure event in this backup?


So I am in a situation where the datastore VMFS5_OS still has snapshots for every server that resides on it (20 VMs). There is 462.18GB free on that datastore (total size: 900GB LUN / 1.14TB volume), so if this task doesn't clear before the weekend, the datastore could fill up.
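To put a rough number on that risk, here is a minimal sketch of the headroom arithmetic. The free space and LUN size come from the post; the growth rate is purely a hypothetical placeholder, since the thread does not say how fast the open snapshot deltas are growing.

```python
def percent_free(free_gb: float, total_gb: float) -> float:
    """Share of the datastore that is still free, as a percentage."""
    return 100.0 * free_gb / total_gb


def days_until_full(free_gb: float, growth_gb_per_day: float) -> float:
    """Days of headroom left if snapshot deltas grow at a constant rate."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return free_gb / growth_gb_per_day


FREE_GB = 462.18   # free space quoted in the post
LUN_GB = 900.0     # LUN size quoted in the post
ASSUMED_GROWTH_GB_PER_DAY = 50.0  # hypothetical; measure your own rate

print(f"{percent_free(FREE_GB, LUN_GB):.1f}% free")
print(f"{days_until_full(FREE_GB, ASSUMED_GROWTH_GB_PER_DAY):.1f} days of headroom")
```

Swap in a growth rate measured from the datastore (for example, the change in free space over a few hours) to get a usable estimate.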

Bit of info about the backup job:

Enabled options: Initiate SnapMirror update, Perform VMware consistency snapshot

Bit of info about DWH:

Server 2008

18GB RAM

SnapDrive 6.3 x64

SnapManager 5.1 SQL

Bit of Info on storage system:

FAS2040 Active-Active configuration

3 x DS4243 shelves SAS 600

I can see a link between large RAM size and quiesced snapshots, but why would this cause this issue in this case when previous backups have always worked?

Has anyone experienced these hanging jobs before, or does anyone know a fix or a manual override to stop the job? I'm concerned that, with no abort option, the job could sit waiting to complete while the datastores fill.

Any help is appreciated!

1 ACCEPTED SOLUTION

keitha

There is a known issue with the VSS writer of SnapDrive colliding with the VSS writer of VMware Tools. You will want to remove the option for quiesced snapshots.

Keith


keitha

I'm not sure why it would be reconfiguring a VM, very odd. What is in the SMVI.log file located in the install directory?

To do a manual override, you can stop and then restart the SMVI service on the SMVI server and clean up the SMVI snapshots using the tool located here: http://blogs.netapp.com/virtualization/2010/02/cleaning-up-vmware-snapshots.html

However, if the task is stuck in vCenter you may have to restart the vCenter service to "unstick" that one.

The SMVI log may indicate what is happening and why a reconfigure was launched.

Keith

penningtonkr

Hi Keith,

Thanks for the reply,

I would not want to stop the snapshot operation, as it still has its .00001 files etc. for the host in question and I cannot afford to corrupt this server at any cost. Also, all of the other servers are currently locked, i.e. I can't manage the snapshots on them, as it looks like everything is waiting on the NetApp job to complete.

Here are some key moments in the server.log file:

2011-04-01 00:00:10,835 [backup:aaa62c31788ca8929ec1be562048b95b:] INFO  - A VMware consistency snapshot will be performed on Virtual Machine TMLIVEDWHSRV-01.

2011-04-01 00:00:18,097 [backup:aaa62c31788ca8929ec1be562048b95b:] WARN  - Virtual Machine TMLIVEDWHSRV-01 has disks attached via raw device mapping. These disks will not be backed up.

2011-04-01 00:00:18,097 [backup:aaa62c31788ca8929ec1be562048b95b:] WARN  - Virtual Machine TMLIVEDWHSRV-01 has disks attached via raw device mapping. These disks will not be backed up.

2011-04-01 00:00:18,103 [backup:aaa62c31788ca8929ec1be562048b95b:] INFO  - Backing up the following virtual machine(s)

2011-04-01 00:00:18,105 [backup:aaa62c31788ca8929ec1be562048b95b:] INFO  - The following virtual machines will have a VMware snapshot taken for consistency

2011-04-01 00:12:02,800 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:02,820 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:02,820 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:12,779 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:12,872 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:12,872 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:22,753 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:22,776 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:22,776 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:32,790 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:32,803 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:32,803 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:42,821 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:42,835 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:42,835 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:52,850 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:52,871 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:52,871 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:13:02,854 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING

All the way to present time...

There are no entries in the smvi.log for the backup time frame.
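For anyone who wants to see how long a job has been stuck in that polling loop, here is a minimal Python sketch that counts the "Waiting on VMware snapshot" messages per VM and reports the first and last time each was seen. The line layout is inferred only from the excerpt above, so treat the regular expressions as assumptions that may need adjusting for other SMVI/VSC versions.

```python
import re
from datetime import datetime

# Layout assumed from the server.log excerpt: timestamp, [backup:<job id>:], level, message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[backup:(?P<job>[0-9a-f]+):\] "
    r"(?P<level>[A-Z]+)\s+- (?P<msg>.*)$"
)
WAIT_RE = re.compile(r"Waiting on VMware snapshot for VM (?P<vm>\S+)")


def summarize_waits(lines):
    """Return {vm: (poll_count, first_seen, last_seen)} for snapshot-wait messages."""
    waits = {}
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue  # skip lines that do not follow the assumed layout
        wait = WAIT_RE.search(parsed.group("msg"))
        if not wait:
            continue
        ts = datetime.strptime(parsed.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        count, first, _ = waits.get(wait.group("vm"), (0, ts, ts))
        waits[wait.group("vm")] = (count + 1, first, ts)
    return waits


SAMPLE = [
    "2011-04-01 00:12:02,820 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01",
    "2011-04-01 00:12:02,820 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry",
    "2011-04-01 00:12:52,871 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01",
]

for vm, (polls, first, last) in summarize_waits(SAMPLE).items():
    print(f"{vm}: {polls} polls over {(last - first).total_seconds():.0f}s")
```

To run it against a real log, pass an open file object instead of SAMPLE, for example `summarize_waits(open("server.log"))`.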

penningtonkr

OK, for future reference: if anyone gets VMware stuck at 95% on creating a snapshot, here is the non-destructive method of fixing it.

1. Reboot the VM the snapshot is stuck on from within the OS.

It's simple to justify to the system owner: if you can't reboot, it is going to go down anyway when the datastore fills up, taking all the other servers with it.

That's it. I did this and the NetApp job completed within 20 seconds. Very weird; there seems to be a problem creating quiesced snapshots on VMs with large amounts of RAM. In our case the page files and RAM sit on a separate transient datastore that has no backup job on it.

Is anyone else experiencing this error on VMs with large amounts of RAM?

STORAGESPECTOR

How is your vmdk configured?

STORAGESPECTOR

There seems to be an issue with the VMDK

penningtonkr

Storage Spector,

The VMDK is stored in a shared datastore, which is configured like so (from NetApp to VMware):

LUN mapped to ESXi host

LUN added to the ESXi host and formatted as a VMFS datastore

Datastore is then available on the ESXi cluster (we use vCenter to manage it)

VMFS05 Datastore is using a 1MB block size (256GB max filesize)

Server looks like:

Hard disk 1: Virtual (VMFS05 -  70GB - THIN)

Hard disk 2: Virtual (VMFS05 -  250GB - THIN)

Hard disk 3: Virtual (VMFS05 -  250GB - THIN)

Hard disk 4: Mapped Raw Lun (Pointer file on VMFS05)

Hard disk 5: Mapped Raw Lun (Pointer file on VMFS05)

Hard disk 6: Mapped Raw Lun (Pointer file on VMFS05)

VMDK is Thin provisioned, plenty of room (60%) in the datastore

keitha

Can I ask what is running in the VM? The VMware VSS writer appears to be having trouble quiescing the VM prior to creating the snapshot. This is not that uncommon, but usually VMware times this out and fails the snapshot, which we then record as a warning in VSC/SMVI. Odd.

The question, though, is whether that VMware snapshot is really getting you anything. More and more I discourage their use with VSC, as the quiescing it does gives you little or no extra data integrity and can cause odd problems like this. Without the VMware snapshots the VSC backups are nearly instant, with no load on the ESX servers and no performance impact on the storage. You can then take the backups more often.

I usually build customers an hourly backup job with a retention of 2. This snaps the VMs hourly but only ties those blocks up for 2 hours, which costs very little disk space. They now have 2 very recent recovery points.
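As a rough illustration of why an hourly job with a retention of 2 holds blocks for only about two hours, here is a small Python simulation. It is not how VSC implements retention; the function name and parameters are made up for the example.

```python
from collections import deque
from datetime import datetime, timedelta


def simulate_retention(start, runs, interval=timedelta(hours=1), retention=2):
    """Return the backup timestamps still retained after `runs` scheduled backups."""
    kept = deque(maxlen=retention)  # the oldest backup drops off automatically
    for i in range(runs):
        kept.append(start + i * interval)
    return list(kept)


# Five hourly backups starting at midnight: only the last two survive.
for ts in simulate_retention(datetime(2011, 4, 1, 0, 0), runs=5):
    print(ts.isoformat())
```

Just before the next run, the older of the two retained backups is a little under two hours old, which is the window during which its blocks stay pinned.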

Back to your problem: if your VM is very busy or the app is behaving badly, you could try to upgrade or refresh the VMware Tools, but if the problem continues you might want to either uninstall the VMware VSS writer or turn off VMware snapshots from the VSC console.

Any chance you have SnapDrive loaded in that VM for the RDMs?

Keith

penningtonkr

The server is running:

  • Windows 2008 R1 64Bit
  • SQL Server 2008
  • SQL Analysis Services
  • Log Rhythm
  • DB Protect
  • SnapDrive 6.3.0.4601
  • SnapManager for SQL Server 5.1

SnapDrive will be dealing with the backups of the specific user data in future. At present it has been configured correctly via its configuration wizard but does not have any scheduled jobs.

My understanding was that a quiesced snapshot ensures you can boot the VM with no data corruption. If this is not the case, then I will simply remove the option to use quiesced snapshots.

Cheers

keitha

There is a known issue with the VSS writer of SnapDrive colliding with the VSS writer of VMware Tools. You will want to remove the option for quiesced snapshots.

Keith

penningtonkr

Consider it done

support_2

I didn't get a chance to read all of your problem; I was looking for some help on another issue.

I came up with a workaround for this problem a while back that is better than taking crash-consistent snapshots (I mean, why bother to pay for SMVI if you are OK with that? Just schedule them through System Manager.)

VMware allows you to run scripts before and after a VMware snapshot, so the idea is to run a batch file that disables the Data ONTAP VSS provider --> take the VMware snapshot --> take the NetApp snapshot --> re-enable the VSS provider when VMware gets rid of its snapshot.

Look up pre-freeze and post-thaw scripts on VMware's site.

* You have to plan out the backups a little, as the Data ONTAP VSS provider needs to be enabled for SnapManager for SQL. This problem went away in later versions of VMware / SnapDrive.

Here are the batch files (they need to be modified for your environment):

::*************************************************************
:: This script disables the Data ONTAP VSS Service so
:: you can take a VMware snapshot. It should be run
:: as the pre-freeze script.
::*************************************************************

@echo off

:: Stop the following services: SnapDrive, SnapDrive Management Service, Data ONTAP VSS Hardware Provider
NET Stop SWSvc
NET Stop SDMgmtSvc
NET Stop navssprv

:: Unregister the Data ONTAP VSS Service (adjust the install path to your environment)
"D:\Program Files\NetApp\SnapDrive\navssprv.exe" -r service /u

******************next script*************************************

::*************************************************************
:: This script enables the Data ONTAP VSS Service so
:: you can take a SnapManager snapshot. It should be run
:: as the post-thaw script.
::*************************************************************

@echo off

:: Start the following services: SnapDrive Management Service, SnapDrive
NET Start SDMgmtSvc
NET Start SWSvc

:: Register and start the Data ONTAP VSS Service
:: (replace the placeholders with your own service account and password)
"D:\Program Files\NetApp\SnapDrive\navssprv.exe" -r service -a <service account> -p <service account's password>
NET Start navssprv

Hope this helps.

robertmidwest

We have the same problem and have been going back and forth with NetApp and VMware support. Very displeased with NetApp support on this. Their reference architecture document, TR-3785 (Microsoft Exchange Server, SQL Server, and SharePoint Server Mixed Workload on VMware vSphere 4, NetApp Unified Storage (FC, iSCSI, and NFS), and Cisco Nexus Unified Fabric), says they can use VSC with VMware snapshots on vSphere 4.0.0 with MS SQL 2008 running on Windows 2008 x64 Enterprise Edition SP2 (page 9). They say in the solution that they're using VSS, but only on the NFS solution (page 32).

For SMVI best practices, see NetApp TR-3737. SMVI leverages the VSS requestor in VMware Tools to create application-consistent backups. This is invoked as part of the VMware snapshot performed before creating the Snapshot copy on the NetApp array.

How reliable are those scripts for stopping the services and unregistering the VSS provider, and then reversing the process?

I might have to try this.

I've had a couple of techs say it's a VMware bug, but they haven't been able to provide any evidence of this. But here's the thing: I can take a VMware snapshot fine in the VM. It appears to fail at the same step using VSC if I specify VMware consistent snapshots. It works fine if I don't select that.

It is a huge pain to work with, since when it fails I have to do a several-step 'cleanup' that requires a reboot of vCenter, which ticks off the rest of the techs.

And don't get me started on response times, feedback and escalations......

robertmidwest

Is anyone having this problem on vSphere 4.1 or is it only vSphere 4?

support_2

The scripts work fine for me, but of course test them out, as they need to be modified. The idea is pretty simple: the first script stops the services and unregisters the provider, then the second script registers it and starts the services. You can run them manually on a server.

When you take a VM snapshot from vCenter, check the box to quiesce the file system and uncheck the box for the memory --> this is the type of snapshot SMVI asks vCenter to take, and it is different from the defaults in VMware. If you are having the problem with the VSS providers, then your snapshot should time out at 95% with these settings.

The VSS issue is a known bug, so don't imagine you will get much help. It is fixed in later versions of SnapDrive or VMware (not sure which, as I upgraded VMware and then SnapDrive and noticed that the problem was gone, but didn't bother to figure out which one fixed what). The problem also exists with some other backup software that lays down its own VSS provider (like Symantec's Backup Exec). This issue really falls into a no-man's-land type of scenario. In the end, without SnapDrive installed, VMware snapshots will work fine. This problem occurs because of the way Windows handles VSS providers, how VMware looks for its VSS provider, and how NetApp installs theirs. So there is a lot of finger pointing.

VMware also has a driver from before Windows used VSS providers (this driver only exists for 2k3 and below; again, it was really created for Windows 2K). I forget the exact details because it was over a year ago when I was working on this problem, but you can also use this on your 2k3 systems.

I would not suggest moving to 4.1 or going beyond SnapDrive 6.2.x; I had to downgrade my environment. There are some nasty bugs: your problem will be fixed, but SnapDrive is not able to see FC LUNs. They will still be connected to the VM/ESX and Windows will be fine, but SnapDrive will say something along the lines of not being able to enumerate them. There are also issues with later SnapDrive versions dealing with VMs that have more than 2 SCSI controllers. Again, that is a 4-6 month old problem. I think these issues only affect FC but can't be sure. It is on the forums, I will get that link, and it sucks.

we are running esx 4.0, vcenter 4.1, snapdrive 6.2.1, snapmanager for sql 5.0

Oh, and welcome to dealing with NetApp support. Ask to speak to a duty manager; they can usually provide some assistance in getting you some "better" help. Also tell them that you want the case to be a P2, but in my experience they will bust this down to a 3 the first chance they get. Feedback is never going to be there, at least with what I have dealt with; you have to call them up every day and force them to help you. If it makes you feel any better, the more times you have to call in to support, the better you will become with your environment; you will be forced to learn. NetApp is also a better product; its support is just really lacking, and if you pay for their upgraded support it prices them out of the competition (upgraded support will also take care of all your pain points, or so I have heard). I usually find quicker help on these forums, which is why I am making sure I start to contribute. If it makes you feel any better, you're not alone in your frustrations with support. We have seriously brought up the topic of going the EMC route with our next SANs because of support. That isn't much, but it might make you feel better to have that talk with your NetApp sales team; maybe if enough customers do it they will get the message, maybe not.

support_2

Same issue in another thread, with more detail on the scripts, posted under a different username, Kris (udf). It's at the very bottom, though.

http://communities.netapp.com/thread/3015?tstart=3


4.1 problem

http://communities.netapp.com/message/45241

robertmidwest

Wow and Yikes!   Thanks so much for this information.  I'm still working through VMware and NetApp support.  I was really looking at vSphere 4.1 as a resolution but with those other problems....  Those also appear to be hit or miss with SnapDrive 6.3P1.  They have released SnapDrive 6.3P2 that says it fixes it.

SnapDrive - 6.3P2

Date Posted: 03-FEB-2011
SnapDrive 6.3P2
Based on SnapDrive 6.3

Bugs fixed in this release:

6.3P2
===========
473592 SDW does not enumerate RDM disks if there are more than one scsi controller - exposed through an unsupported initiator error
464870 Hyperv restore failed with error "SDSnapshotRestore failed"
475282 SnapDrive ASUP string running in Windows Server 2008 R2 OS shows incorrect message.
475634 SnapVault update is not happening from MMC popup menu.

support_2

Thanks, I will have to check out the SnapDrive updates.

Just to reiterate: in this environment (below) I am not having the VSS conflict issue anymore. I am also not using it for Exchange, just VM SQL servers with the databases sitting on raw disks.

We are running esx 4.0, vcenter 4.1, snapdrive 6.2.1, snapmanager for sql 5.0, vsc 2.0. all running over FC

MGRIEVE77

keitha,

How/where do you go to disable the option for quiesced snapshots? I'm having a similar problem with only 1 of my VMs. This VM is the result of a P2V. I'm using Backup Exec 2010 with the VMware agent.

keitha

Wow, sorry for the long delay on this... not sure how that slid by me.

You disable the SMVI/VSC VMware snapshots by deselecting the "Use VMware quiesced snapshots" checkbox when you edit the backup job. However, if you only want to disable it on one VM, you would be better off using the scripts above or just uninstalling the VSS writer that is installed with VMware Tools.

A note on the scripts above. Although these are very cool and work great, I am not sure of their value versus just disabling the VMware snapshots in SMVI. Either way you are not going to leverage VSS to quiesce the VM prior to the NetApp Snapshot. If you don't have a VSS-aware app in the VM this is likely fine, since the VMware snapshot does not do any sort of application quiescing. The VMware snapshots that activate this VSS writer tend to be problematic and are a common failure in SMVI backups.

It is also common to have success with a manual snapshot while an API-called snapshot has problems. This is often because either the manual snapshot does not activate the VSS writer (you have to specifically request this) or the problem arises when many snapshots are trying to close at the same time. On the SMVI side of things, we call the open or close VMware snapshot API, but there is very little error reporting back from VMware. It isn't really a bug, but it does make it hard to handle situations when snapshots get stuck or have problems.

Without the VMware snapshot I think SMVI still has huge value by providing a GUI to restore VMs, mount snapshots and manage snapshot schedules from vCenter. Could you do it directly from the NetApp and not buy SMVI? Sure! But the GUI is sure nice to have, and without the VMware snapshot SMVI is wildly fast and robust.

I hope this helps.

Keith

support_2

On the scripts: I could be off on this.

SMVI

SMVI quiesces the file system of the VM and takes a snapshot of the VM. If you take the option of not quiescing the file systems before your snapshots then, as I understand it (this might be a bad analogy), it is like taking a traditional backup without using VSS: files are locked and may or may not be valid on a restore, so you get a crash-consistent backup. Nine times out of ten you won't have a problem. Take SQL Server as an example. With quiescing, the file systems, the OS and everything else get quiesced, but the database files will still be corrupt, because the VMware and Microsoft VSS providers don't flush SQL. SMVI doesn't do anything app-specific, like you said. That is why you use SnapManager for this piece.

SnapManager <SQL/Exchange>

The SnapManager products are used to quiesce the app. In the SQL example, you get quiesced snapshots of the databases (on their own LUNs).

The scripts address the problem of combining the two. Nine out of ten is great until you're hit with that one time. Honestly, it may never come up, and you should have multiple backups, which puts the odds in your favor. I just think I have bad luck and really don't want to put myself in that situation. From an IT manager's perspective, after they have spent the money, it is a hard thing to tell them.

The scripts turn off the SnapManager VSS provider and let you quiesce the file systems using the VMware VSS provider (a better backup of your VM).

They then turn it back on so you can get your quiesced snapshot of the app using SnapManager.

Unless I have this backwards, and I could, you are leveraging VSS to quiesce prior to the NetApp Snapshot for both the VM and the app when you use the two products together (SMVI and SnapManager).
