VMware Solutions Discussions
Hi,
I have VSC set to back up at datastore level, covering a mixture of virtual machines with both RDM and VMDK disks.
All RDM mappings are set to physical, so VSC won't try to back them up. That is expected; we use SnapManager for the data drives.
For some reason vCenter is "hung" on the following tasks:
Name                             Target           Status       Initiated by   vCenter Server  Requested Start Time  Start Time
Create virtual machine snapshot  TMLIVEDWHSRV-01  95%          Administrator  TMVCentreSrv01  01/04/2011 00:02:41   01/04/2011 00:02:41
NetApp Create Backup             VMFS5_OS         45%          Administrator  TMVCentreSrv01  01/04/2011 00:00:18   01/04/2011 00:00:18
Reconfigure virtual machine      TMLIVEDWHSRV-01  In Progress  Administrator  TMVCentreSrv01  01/04/2011 00:15:01   01/04/2011 00:15:01
* Does anyone know why there is a reconfigure event in this backup?
So I am in a situation where the datastore VMFS5_OS still has snapshots for every server that resides on it (20 VMs). There is 462.18GB free on that datastore (total size 900GB LUN / 1.14TB volume), so if this task doesn't clear before the weekend, the datastore could fill up.
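To put a rough number on that risk, here is a minimal sketch of the headroom arithmetic using the figures quoted above. The snapshot growth rate of 5 GB per VM per day is purely an assumed value for illustration; substitute whatever your own delta files actually show.

```python
# Figures quoted in the post.
LUN_SIZE_GB = 900.0
FREE_GB = 462.18
VM_COUNT = 20


def free_percent(free_gb, total_gb):
    """Free space as a percentage of the LUN size."""
    return 100.0 * free_gb / total_gb


def days_until_full(free_gb, vm_count, growth_gb_per_vm_per_day):
    """Days before the open snapshots consume the remaining free space.

    growth_gb_per_vm_per_day is an assumption, not a measured value.
    """
    total_growth = vm_count * growth_gb_per_vm_per_day
    if total_growth <= 0:
        raise ValueError("growth rate must be positive")
    return free_gb / total_growth


if __name__ == "__main__":
    print(f"free: {free_percent(FREE_GB, LUN_SIZE_GB):.1f}%")
    print(f"headroom per VM: {FREE_GB / VM_COUNT:.1f} GB")
    print(f"days until full: {days_until_full(FREE_GB, VM_COUNT, 5.0):.1f}")
```

With 20 open snapshots sharing the free space, the per-VM headroom is what matters, not the headline free figure.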
Bit of info about the backup job:
Enabled options: Initiate SnapMirror update, Perform VMware consistency snapshot
Bit of info about DWH:
Server 2008
18GB RAM
SnapDrive 6.3 x64
SnapManager 5.1 SQL
Bit of Info on storage system:
FAS2040 Active-Active configuration
3 x DS4243 shelves SAS 600
I can see a link between large RAM size and quiesced snapshots, but why would that cause the issue in this case when previous backups have always worked?
Has anyone experienced these hanging jobs before, or know of a fix or a manual override to stop the job? I'm concerned that, with no abort option, we could end up with the job waiting to complete while the datastores fill.
Any help is appreciated!
I'm not sure why it would be reconfiguring a VM; that is very odd. What is in the SMVI.log file located in the install directory?
To do a manual override, you can stop and then restart the SMVI service on the SMVI server, and clean up the SMVI snapshots using the tool located here: http://blogs.netapp.com/virtualization/2010/02/cleaning-up-vmware-snapshots.html
However, if the task is stuck in vCenter, you may have to restart the vCenter service to "unstick" that one.
The SMVI log may indicate what is happening and why a reconfigure was launched.
Keith
Hi Keith,
Thanks for the reply,
I would not want to stop the snapshot operation, as it still has its .00001 files etc. for the host in question, and I cannot afford to corrupt this server at any cost. Also, all of the other servers are currently locked, i.e. I can't manage the snapshots on them, as it looks like everything is waiting on the NetApp job to complete.
Here are some key moments in the server.log file:
2011-04-01 00:00:10,835 [backup:aaa62c31788ca8929ec1be562048b95b:] INFO - A VMware consistency snapshot will be performed on Virtual Machine TMLIVEDWHSRV-01.
2011-04-01 00:00:18,097 [backup:aaa62c31788ca8929ec1be562048b95b:] WARN - Virtual Machine TMLIVEDWHSRV-01 has disks attached via raw device mapping. These disks will not be backed up.
2011-04-01 00:00:18,097 [backup:aaa62c31788ca8929ec1be562048b95b:] WARN - Virtual Machine TMLIVEDWHSRV-01 has disks attached via raw device mapping. These disks will not be backed up.
2011-04-01 00:00:18,103 [backup:aaa62c31788ca8929ec1be562048b95b:] INFO - Backing up the following virtual machine(s)
2011-04-01 00:00:18,105 [backup:aaa62c31788ca8929ec1be562048b95b:] INFO - The following virtual machines will have a VMware snapshot taken for consistency
2011-04-01 00:12:02,800 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:02,820 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:02,820 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:12,779 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:12,872 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:12,872 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:22,753 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:22,776 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:22,776 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:32,790 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:32,803 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:32,803 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:42,821 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:42,835 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:42,835 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:12:52,850 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
2011-04-01 00:12:52,871 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Waiting on VMware snapshot for VM TMLIVEDWHSRV-01
2011-04-01 00:12:52,871 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry
2011-04-01 00:13:02,854 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - Task state was: RUNNING
...and so on, all the way to the present time.
There are no entries in the smvi.log for the backup time frame.
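For anyone who wants to measure how long a job has been sitting in this retry loop, here is a minimal sketch that parses lines in the format shown above. The regular expression is inferred only from the excerpt in this post, so treat the layout as an assumption and adjust it if your server.log differs.

```python
import re
from datetime import datetime

# Line layout inferred from the excerpt above (an assumption):
#   <date> <time>,<ms> [backup:<job id>:] <LEVEL> - <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[backup:(?P<job>[0-9a-f]+):\] "
    r"(?P<level>[A-Z]+) - (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, job id, level, message), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("job"), m.group("level"), m.group("msg")


def retry_summary(lines):
    """Count 'Operation requested retry' entries and the seconds they span."""
    stamps = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and "Operation requested retry" in parsed[3]:
            stamps.append(parsed[0])
    if not stamps:
        return 0, 0.0
    return len(stamps), (max(stamps) - min(stamps)).total_seconds()


if __name__ == "__main__":
    sample = [
        "2011-04-01 00:12:02,820 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry",
        "2011-04-01 00:12:12,872 [backup:aaa62c31788ca8929ec1be562048b95b:] DEBUG - FLOW-11012: Operation requested retry",
    ]
    count, seconds = retry_summary(sample)
    print(f"{count} retries over {seconds:.0f} s")
```

Feeding it the whole log gives the total time spent waiting on the VMware snapshot, which is useful evidence for a support case.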
OK, for future reference: if anyone gets VMware stuck at 95% on creating a snapshot, here is the non-destructive method of fixing it.
1. Reboot the VM the snapshot is stuck on, from within the OS.
It's simple to justify to the system owner: if you can't reboot, the VM is going to go down anyway when the datastore fills up, taking all the other servers with it.
That's it. I did this and the NetApp job completed within 20 seconds. Very weird; there seems to be a problem creating quiesced snapshots on VMs with large amounts of RAM. In our case the page files and RAM sit on a separate transient datastore that has no backup job on it.
Is anyone else experiencing this error on VMs with large amounts of RAM?
How is your vmdk configured?
There seems to be an issue with the VMDK
Storage Spector,
The VMDK is stored in a shared datastore which is configured like so (from NetApp to VMware)
LUN mapped to ESXi host
ESXi host added and formatted to VMFS datastore
Datastore is then available on the ESXi cluster (We use VCentre to manage)
VMFS05 Datastore is using a 1MB block size (256GB max filesize)
Server looks like:
Hard disk 1: Virtual (VMFS05 - 70GB - THIN)
Hard disk 2: Virtual (VMFS05 - 250GB - THIN)
Hard disk 3: Virtual (VMFS05 - 250GB - THIN)
Hard disk 4: Mapped Raw Lun (Pointer file on VMFS05)
Hard disk 5: Mapped Raw Lun (Pointer file on VMFS05)
Hard disk 6: Mapped Raw Lun (Pointer file on VMFS05)
VMDK is Thin provisioned, plenty of room (60%) in the datastore
Can I ask what is running in the VM? The VMware VSS writer appears to be having trouble quiescing the VM prior to creating the snapshot. This is not that uncommon, but usually VMware times this out and fails the snapshot, which we then record as a warning in VSC/SMVI. Odd.
The question, though, is whether that VMware snapshot is really getting you anything. More and more I discourage using them with VSC, as the quiescing gives you little or no extra data integrity and can cause odd problems like this. Without the VMware snapshots, VSC backups are nearly instant, with no load on the ESX servers and no performance impact on the storage. You can then take backups more often.
For customers I usually build an hourly backup job with a retention of 2. This snaps the VMs hourly but only ties those blocks up for 2 hours, which costs very little disk space. They now have 2 very short recovery points.
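As a rough illustration of what an hourly schedule with a retention of 2 keeps on disk, here is a small simulation. It is only a sketch of the retention idea, not how VSC/SMVI implements it; the function names are made up for this example.

```python
from datetime import datetime, timedelta


def prune(snapshot_times, retention=2):
    """Split snapshots into (kept, deleted), keeping the newest `retention`."""
    ordered = sorted(snapshot_times, reverse=True)
    return ordered[:retention], ordered[retention:]


def simulate(start, hours, retention=2):
    """Take one snapshot per hour and prune after each one."""
    kept = []
    for h in range(hours):
        kept.append(start + timedelta(hours=h))
        kept, _deleted = prune(kept, retention)
    return kept


if __name__ == "__main__":
    for t in simulate(datetime(2011, 4, 1, 0, 0), hours=5):
        print(t.isoformat())
```

However long the schedule runs, only the two most recent hourly snapshots hold blocks, which is why the space cost stays small.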
Back to your problem: if your VM is very busy or the app is behaving badly, you could try upgrading or refreshing VMware Tools. If the problem continues, you might want to either uninstall the VMware VSS writer or turn off VMware snapshots from the VSC console.
Any chance you have SnapDrive loaded in that VM for the RDMs?
Keith
The server is running:
SnapDrive will be dealing with the backups of the specific user data in future, at present it has been configured correctly via its configuration wizard but does not have any scheduled jobs.
My understanding was that a quiesced snapshot ensures you can boot the VM with no data corruption. If that is not the case, I will simply remove the option to use quiesced snapshots.
Cheers
There is a known issue with the VSS writer of SnapDrive colliding with the VSS writer of VMware Tools. You will want to remove the option for quiesced snapshots.
Keith
Consider it done
I didn't get a chance to read all of your problem, as I was looking for help on another issue.
A while ago I came up with a workaround for this problem that is better than taking crash-consistent snapshots (why bother paying for SMVI if you are OK with that, when you could just schedule them through System Manager?).
VMware allows you to run scripts before and after a VMware snapshot, so the idea is to run a batch file that disables the Data ONTAP VSS provider --> take the VMware snapshot --> take the NetApp snapshot --> re-enable the VSS provider once VMware removes its snapshot.
Look up pre-freeze and post-thaw scripts on VMware's site.
* You have to plan the backups a little, as the Data ONTAP VSS provider needs to be enabled for SnapManager for SQL. This problem went away with later versions of VMware / SnapDrive.
Here are the batch files (they need to be modified for your environment):
::*************************************************************
:: This script disables the Data ONTAP VSS Service so
:: you can take a VMware snapshot. It should be run
:: as the pre-freeze script.
::*************************************************************
@echo off
:: Stop the following services: SnapDrive, SnapDrive Management Service, Data ONTAP VSS Hardware Provider
NET Stop SWSvc
NET Stop SDMgmtSvc
NET Stop navssprv
:: Unregister the Data ONTAP VSS Service
"D:\Program Files\NetApp\SnapDrive\navssprv.exe" -r service /u
Next script (post-thaw):
::*************************************************************
:: This script enables the Data ONTAP VSS Service so
:: you can take a SnapManager snapshot. It should be run
:: as the post-thaw script.
::*************************************************************
@echo off
:: Start the following services: SnapDrive Management Service, SnapDrive (the Data ONTAP VSS Hardware Provider is registered and started below)
NET Start SDMgmtSvc
NET Start SWSvc
:: Register & Start the Data ONTAP VSS Service
"D:\Program Files\NetApp\SnapDrive\navssprv.exe" -r service -a <service account> -p <service account's password>
NET Start navssprv
Hope this helps.
We have the same problem and have been going back and forth with support with NetApp and VMware. Very displeased with NetApp support on this. Their Reference Architecture document: tr-3785 - Microsoft Exchange Server, SQL Server, and SharePoint Server Mixed Workload on VMware vSphere 4, NetApp Unified Storage (FC, iSCSI, and NFS), and Cisco Nexus Unified Fabric says they can use VSC with VMware Snapshots on vSphere 4.0.0, MS SQL 2008 running on Windows 2008 x64 Enterprise Edition SP2 (Page 9). They say in the solution that they're using VSS but only on the NFS solution (P32).
For SMVI best practices, see NetApp TR-3737. SMVI leverages the VSS requestor in VMware Tools to create application-consistent backups. This is invoked as part of the VMware snapshot performed before creating the Snapshot copy on the NetApp array.
How reliable are those scripts for stopping services and unregistering the VSS provider and then reversing the issue?
I might have to try this.
I've had a couple of techs say it's a VMware bug, but they haven't been able to provide any evidence of this. Here's the thing, though: I can take a VMware snapshot fine in the VM, but it appears to fail at the same step using VSC if I specify VMware consistent snapshots. It works fine if I don't select that.
It is a huge pain to work with, since when it fails I have to do a several-step 'cleanup' that requires a reboot of vCenter, which ticks off the rest of the techs.
And don't get me started on response times, feedback and escalations......
Is anyone having this problem on vSphere 4.1 or is it only vSphere 4?
The scripts work fine for me, but of course test them out, as they need to be modified. The idea is pretty simple: the first script stops the services and unregisters the provider, then the second script registers it and starts the services. You can run them manually on a server.
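If you want to try the sequence by hand before wiring it into VMware Tools, here is a minimal sketch that runs the same commands in the same order and stops at the first failure. The service names and the navssprv.exe path and arguments are copied from the batch files in this thread; the wrapper itself, including the `run_steps` and `post_thaw` names, is only an illustration and needs the same changes for your environment.

```python
import subprocess

# Path and service names are copied from the batch files in this thread.
NAVSSPRV = r"D:\Program Files\NetApp\SnapDrive\navssprv.exe"

# Same order as the pre-freeze batch file.
PRE_FREEZE = [
    ["net", "stop", "SWSvc"],
    ["net", "stop", "SDMgmtSvc"],
    ["net", "stop", "navssprv"],
    [NAVSSPRV, "-r", "service", "/u"],
]


def post_thaw(account, password):
    """Build the post-thaw command list for a given service account."""
    return [
        ["net", "start", "SDMgmtSvc"],
        ["net", "start", "SWSvc"],
        [NAVSSPRV, "-r", "service", "-a", account, "-p", password],
        ["net", "start", "navssprv"],
    ]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    Returns (ok, results), where results is a list of (command, returncode).
    """
    results = []
    for cmd in steps:
        rc = runner(cmd).returncode
        results.append((cmd, rc))
        if rc != 0:
            return False, results
    return True, results
```

Checking the exit code of every step matters most on the post-thaw side: if the provider is not registered and started again, the next SnapManager backup is the one that suffers.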
When you take a VM snapshot from vCenter, check the box to quiesce the file systems and uncheck the box for memory. This is the type of snapshot SMVI asks vCenter to take, and it differs from the VMware defaults. If you are having the problem with the VSS providers, your snapshot should time out at 95% with these settings.
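The same test can be scripted. The sketch below assumes `vm` is a pyVmomi `vim.VirtualMachine` object that you have already looked up through your own vCenter connection; it requests a snapshot with quiescing on and memory off, then polls the task with a bounded wait. The function name, snapshot description and default timeouts are made up for this example. Note that giving up after the timeout only stops the script from waiting; it does not cancel the task in vCenter.

```python
import time


def quiesced_snapshot(vm, name, timeout_s=600, poll_s=5, sleep=time.sleep):
    """Request a quiesced, memory-less snapshot and wait up to timeout_s.

    Returns "success", "error", or "timeout".
    """
    task = vm.CreateSnapshot_Task(
        name=name,
        description="quiesce test",
        memory=False,   # do not capture guest memory
        quiesce=True,   # quiesce the guest file systems via VMware Tools
    )
    waited = 0
    while waited < timeout_s:
        state = str(task.info.state)
        if state in ("success", "error"):
            return state
        sleep(poll_s)
        waited += poll_s
    return "timeout"
```

A "timeout" result here corresponds to the task sitting at 95% in the vSphere client.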
The VSS issue is a known bug, so I don't imagine you will get much help. It is fixed in later versions of SnapDrive or VMware (I'm not sure which: I upgraded VMware and then SnapDrive, noticed the problem was gone, and didn't bother to figure out which one fixed it). The problem also exists with other backup software that installs its own VSS provider, such as Symantec Backup Exec. This issue really falls into a no-man's-land. Without SnapDrive installed, VMware snapshots work fine. The problem comes from the way Windows handles VSS providers, how VMware looks for its VSS provider, and how NetApp installs theirs, so there is a lot of finger-pointing.
VMware also has a driver that predates Windows' use of VSS providers. It only exists for Windows 2003 and below, and was really created for Windows 2000. I forget the exact details because I worked on this problem over a year ago, but you can also use it on your 2003 systems.
I would not suggest moving to 4.1 or beyond SnapDrive 6.2.x. I had to downgrade my environment: there are some nasty bugs where your problem will be fixed but SnapDrive is not able to see FC LUNs. The LUNs are still connected to the VM/ESX host and Windows is fine, but SnapDrive reports something along the lines of not being able to enumerate them. Later SnapDrive versions also have issues with VMs that have more than 2 SCSI controllers. That problem is about 4-6 months old; I think it only affects FC but can't be sure. It is on the forums, I will get that link, and it sucks.
We are running ESX 4.0, vCenter 4.1, SnapDrive 6.2.1, SnapManager for SQL 5.0.
Oh, and welcome to dealing with NetApp support. Ask to speak to a duty manager; they can usually provide some assistance in getting you "better" help. Also tell them that you want the case to be a P2, although in my experience they will bust it down to a P3 the first chance they get. Feedback is never going to be there, at least from what I have dealt with; you have to call them every day and force them to help you. If it makes you feel any better, the more times you have to call support, the better you will know your environment, because you will be forced to learn. NetApp is also a better product; its support is just really lacking, and if you pay for their upgraded support it prices them out of the competition (upgraded support will also take care of all your pain points, or so I have heard). I usually find quicker help on these forums, which is why I am making sure I start to contribute. You are not alone in your frustrations with support. We have seriously discussed going the EMC route for our next SANs because of support. That isn't much, but it might make you feel better to have that talk with your NetApp sales team; maybe if enough customers do it they will get the message, maybe not.
The same issue is covered in another thread, with more detail on the scripts. It was posted under a different username, Kris (udf), and is at the very bottom of the thread.
http://communities.netapp.com/thread/3015?tstart=3
4.1 problem
Wow and Yikes! Thanks so much for this information. I'm still working through VMware and NetApp support. I was really looking at vSphere 4.1 as a resolution but with those other problems.... Those also appear to be hit or miss with SnapDrive 6.3P1. They have released SnapDrive 6.3P2 that says it fixes it.
Date Posted: 03-FEB-2011
SnapDrive 6.3P2, based on SnapDrive 6.3
Bugs fixed in this release (6.3P2):
473592 SDW does not enumerate RDM disks if there are more than one scsi controller - exposed through an unsupported initiator error
464870 Hyperv restore failed with error "SDSnapshotRestore failed"
475282 SnapDrive ASUP string running in Windows Server 2008 R2 OS shows incorrect message.
475634 SnapVault update is not happening from MMC popup menu.
Thanks, I will have to check out the SnapDrive updates.
Just to reiterate: in this environment (below) I am not having the VSS conflict issue anymore. I am also not using it for Exchange, just VM SQL servers with the databases sitting on raw disks.
We are running ESX 4.0, vCenter 4.1, SnapDrive 6.2.1, SnapManager for SQL 5.0, VSC 2.0, all running over FC.
keitha,
How/where do you go to disable the option for quiesced snapshots? I'm having a similar problem with only one of my VMs. This VM is the result of a P2V. I'm using Backup Exec 2010 with the VMware agent.
Wow, sorry for the long delay on this... not sure how that slid by me.
You disable the SMVI/VSC VMware snapshots by deselecting the "Use VMware quiesced snapshots" checkbox when you edit the backup job. However, if you only want to disable it on one VM, you would be better off using the scripts above or just uninstalling the VSS writer that is installed with VMware Tools.
A note on the scripts above: although they are very cool and work great, I am not sure of their value versus just disabling the VMware snapshots in SMVI. Either way you are not going to leverage VSS to quiesce the VM prior to the NetApp Snapshot. If you don't have a VSS-aware app in the VM this is likely fine, since the VMware snapshot does not do any sort of application quiescing. The VMware snapshots that activate this VSS writer tend to be problematic and are a common cause of failure in SMVI backups.
It is also common to have success with a manual snapshot while an API-called snapshot has problems. This is often because either the manual snapshot does not activate the VSS writer (you have to specifically select this), or the problem arises when many snapshots are trying to close at the same time. On the SMVI side of things, we call the open or close VMware snapshot API, but there is very little error reporting back from VMware. It isn't really a bug, but it does make it hard to handle situations where snapshots get stuck or have problems.
Without the VMware snapshot I think SMVI still has huge value by providing a GUI to restore VMs, mount snapshots and manage snapshot schedules from vCenter. Could you do it directly from the NetApp and not buy SMVI? Sure, but the GUI is nice to have, and without the VMware snapshot SMVI is wildly fast and robust.
I hope this helps.
Keith
On the scripts: I could be off on this.
SMVI
SMVI quiesces the file system of the VM and takes a snapshot of the VM. If you take the option of not quiescing the file systems before your snapshots, then my understanding (this might be a bad analogy) is that it is like taking a traditional backup without using VSS: files are locked, they may or may not be valid on a restore, and you get a crash-consistent backup. Nine times out of ten you won't have a problem. Take SQL Server as an example: with quiescing, the file systems, the OS and everything else are quiesced, but the database files will be corrupt, because the VMware and Microsoft VSS providers don't flush SQL. SMVI doesn't do application-specific quiescing, as you said, which is why you use SnapManager for that piece.
SnapManager <SQL/Exchange>
The SnapManager products are used to quiesce the app. In the example, SQL (on its own LUN) gets quiesced snapshots of the database.
The scripts address the problem of combining the two. Nine out of ten is great until you are hit with that one time. Honestly, it may never come up, and you should have multiple backups, which puts the odds in your favor. I just think I have bad luck and really don't want to put myself in that situation; from an IT manager's perspective, after they have spent the money, it is a hard thing to tell them.
The scripts turn off the SnapDrive VSS provider and let you quiesce the file systems using the VMware VSS provider (a better backup of your VM).
They then turn it back on so you can get your quiesced snapshot of the app using SnapManager.
Unless I have this backwards, and I could, you are leveraging VSS to quiesce prior to the NetApp Snapshot for both the VM and the app when the two products (SMVI and SnapManager) are used together.