We have our VMs running off a FAS3020 using Fibre Channel disks and an NFS datastore. All VMs have been aligned using mbralign, and we are running vSphere Update 1, fully patched. We have two SMVI jobs scheduled. The first job DOES NOT take a VMware snapshot; it only takes a NetApp snapshot. I have 4 VMs included in this job:
- Two domain controllers - from what I have read and based on some experience, taking a VMware snapshot of these is not a good idea
- SQL and Exchange - both of these have iSCSI LUNs presented to them via Microsoft's iSCSI initiator. VMware does not support taking VMware snapshots in this case.
This job runs fine all the time. My other job has my remaining VMs in it, a total of 21 VMs. These DO get a VMware snapshot before the NetApp snapshot is taken. Every few days this job will not complete. Typically, several VMware snapshots are taken and then it hangs while taking the VMware snapshot of one particular VM. I wish I could say it always hangs on the same VM, but that is not the case. What I do at this point is:
- Stop the NetApp SMVI service
- Manually delete whatever snapshots have been taken
- Remove the crash directory, which can be found at C:\Program Files\NetApp\SMVI\server
- Restart the NetApp SMVI service
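The manual recovery steps above could be sketched as a small Python script. This is only an illustration under stated assumptions: the service name and the crash directory path passed in are placeholders (the post gives only the parent folder C:\Program Files\NetApp\SMVI\server), `net stop` and `net start` are the standard Windows service commands, and deleting the leftover snapshots stays a manual step. By default nothing is executed; the plan is only listed.

```python
import shutil
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    description: str
    command: Optional[List[str]] = None  # external command to run, if any
    remove_dir: Optional[str] = None     # directory to delete, if any


def build_recovery_plan(service_name: str, crash_dir: str) -> List[Step]:
    """Return the four recovery steps in order.

    service_name and crash_dir are supplied by the caller; neither value
    is given in the original post, so both are assumptions.
    """
    return [
        Step("Stop the SMVI service", command=["net", "stop", service_name]),
        Step("Manually delete the leftover snapshots"),
        Step("Remove the crash directory", remove_dir=crash_dir),
        Step("Restart the SMVI service", command=["net", "start", service_name]),
    ]


def run_plan(plan: List[Step], dry_run: bool = True) -> List[str]:
    """List the steps; execute them only when dry_run is False."""
    log = []
    for number, step in enumerate(plan, start=1):
        log.append(f"{number}. {step.description}")
        if dry_run:
            continue
        if step.command is not None:
            subprocess.run(step.command, check=True)
        elif step.remove_dir is not None:
            shutil.rmtree(step.remove_dir, ignore_errors=True)
        else:
            # No automation for this step: wait for the operator.
            input("   Do this by hand, then press Enter to continue... ")
    return log


if __name__ == "__main__":
    # Placeholder values: replace with the real service name and crash path.
    plan = build_recovery_plan(
        "your-smvi-service-name",
        r"C:\Program Files\NetApp\SMVI\server\crash",
    )
    print("\n".join(run_plan(plan, dry_run=True)))
```

Keeping the snapshot cleanup as an explicit manual step avoids guessing at which snapshots are safe to delete.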
Why is this job not always working? I am at a loss trying to troubleshoot this. Are there log files or other methods I can use to track down this problem?
I did find a VSS error with event ID 8194 in Event Viewer.
I read this with interest as I am having similar issues.
I am using SMVI v2 with vSphere and a FAS2020 running 7.3.2P2.
I am performing hourly "NetApp" snaps of my NFS vol which stores all my VMs, and daily "VM quiesced snaps" every evening (I read somewhere that this was a good idea).
We also have Domain Controllers and Exchange & SQL servers with iSCSI LUNs (needed by SnapManager) included in these backups.
What I experience is that every evening at least 2-3 random VMs fail during the VM snapshots (timeouts when creating the snapshot). The VMs affected always seem to be those which have iSCSI LUNs mapped within them.
What I would like to know is:
1) You state in your post that making VM snapshots of VMs which have Microsoft iSCSI initiator LUNs is not supported. Can you tell me where you read this?
(in all cases, if I create a VM quiesced snapshot of the failed machines manually using the Infrastructure Client the following day, it works absolutely fine).
2) It seems there is no option in SMVI 2 to select which VMs should be snapped. Do I therefore simply ignore these random failures, or is there a way to exclude those VMs from the backup?
SMVI seems to back up all VMs in the datastore by default, and the only workaround I can see would be to create separate datastores for VMs which have iSCSI initiator LUNs.
In all cases though, with SMVI 2 I find that even if snapshots of a VM fail, the backup process completes and will continue to snap the remaining ones.
I did have a similar issue to yours last year with ESX 3.5 and SMVI 1, whereby VM snaps would time out on the "deleting snapshot" part of the process. This was due to an NFS lock timeout parameter (search for SMVI, timeout - the post exists somewhere on this forum). Since then we've upgraded to vSphere and SMVI 2 and followed the best practices for NFS vols, and we've never had the problem again. (We also suffered timeouts on the actual VMs during snapshots, whereby they would lose connectivity and pings to the servers would fail during the snapshot process. This was also solved by following the best-practice NFS tweaks.)
I hope you find something here useful, and if anyone else can provide some insight into my questions it would be great.
I run two different SMVI jobs: one containing the VMs for which I do not want a VMware snapshot, and one containing the VMs for which I do. In the SMVI GUI there is a checkbox under the Entities tab called "Perform VMware consistency snapshot". For one job I have this checked; for the other I do not.
All of my VMs are in the same NFS datastore, so with both jobs the NetApp snapshot captures ALL of my VMs. I suppose I could get by with just one job. This one job would include just the VMs for which I want a VMware snapshot taken first. In other words, in my case this job would exclude Exchange, SQL, and my DCs. Once the VMware snapshots of my other VMs have been taken, the entire NFS datastore (including Exchange, SQL, and my DCs) would be captured in my NetApp snapshot.
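To make the single-job idea concrete, here is a minimal sketch of which VMs would get a VMware snapshot and which would only be captured by the datastore-level NetApp snapshot. The VM names and the `plan_backup` helper are hypothetical and only model the reasoning; this is not how SMVI itself is configured.

```python
def plan_backup(datastore_vms, job_vms):
    """Split the datastore's VMs by how a single job would treat them.

    VMs listed in the job get a VMware snapshot first; every VM in the
    datastore, listed or not, ends up in the NetApp snapshot.
    """
    datastore = set(datastore_vms)
    quiesced = set(job_vms) & datastore
    storage_only = datastore - quiesced
    return {
        "vmware_snapshot": sorted(quiesced),
        "netapp_snapshot_only": sorted(storage_only),
    }


if __name__ == "__main__":
    # Hypothetical VM names for illustration.
    everything = ["dc1", "dc2", "sql", "exchange", "web1", "file1"]
    in_job = ["web1", "file1"]
    print(plan_backup(everything, in_job))
```

Whether VMs outside the job can be restored individually through SMVI is a separate question that this sketch does not answer.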
I am under the impression (I might be wrong) that if I did not have Exchange, SQL, and my DCs in another job, there would be no way to restore just these VMs in case of a problem. I do use SnapManager for SQL and SnapManager for Exchange to back up my SQL and Exchange databases. If a VMware snapshot fails, some of the time the product marches on, finishes the rest of the job, and the report I get says that one or two VMs failed to back up. Many times, though, when the VMware snapshot fails, everything comes to a grinding halt and the job never finishes.
I am up to date on my VMware Tools, and it appears that when a VMware snapshot fails, it times out, indicating that it could not quiesce that VM. Why this is happening is still a mystery to me.
1) If you're using vSphere you should now be able to install the NetApp plugin "Virtual Storage Console". Once you've installed this, it will compare your VMware/filer NFS settings against the best practices and tell you if any params are incorrectly set (it was really useful for us).
2) I found the old thread that I wrote when I was having the same problems as you. Check it out - maybe it helps (note that this change needs to be made in both the vCenter config and also on the ESX servers):
"You mentioned that you made the change to not disable the NFS locks, did you also then make the change to the /etc/vmware/config file by adding
prefvmx.consolidateDeleteNFSLocks = "TRUE"
It sounds like what is happening..."
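As a rough way to confirm that the setting is present on each host, the config file's text could be checked with a few lines of Python. This is a sketch under assumptions: it treats the file as simple `key = "value"` lines, skips lines starting with `#`, and lets the last occurrence of a key win. It also flags typographic (curly) quotes, which can slip in when a line is copied from a web page. The function names and result labels are illustrative.

```python
SETTING = "prefvmx.consolidateDeleteNFSLocks"


def read_setting(config_text, key):
    """Return the raw value assigned to `key`, or None if it is absent."""
    value = None
    for line in config_text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' lines carry no settings.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == key:
            value = raw.strip()  # assumption: the last occurrence wins
    return value


def check_nfs_lock_setting(config_text):
    """Classify the state of the NFS lock setting in the given text."""
    raw = read_setting(config_text, SETTING)
    if raw is None:
        return "missing"
    if "\u201c" in raw or "\u201d" in raw:
        return "curly-quotes"  # pasted typographic quotes, not plain ones
    if raw.strip('"').upper() == "TRUE":
        return "ok"
    return "unexpected-value"


if __name__ == "__main__":
    sample = 'prefvmx.consolidateDeleteNFSLocks = "TRUE"\n'
    print(check_nfs_lock_setting(sample))
```

On a host, the text would come from reading /etc/vmware/config; the check itself does not modify anything.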
PS: Keith, how do you remove the VSS writer service on VMs that have iSCSI LUNs? I have a similar issue making snaps of my Exchange/SQL servers, as they have iSCSI LUNs on my filer. At the moment I simply skip doing VM-consistent backups of these servers.
Shortly after upgrading to vSphere in December we installed the NetApp plugin and let it adjust our NFS preferences to best practice. When we had ESX 3.5 we took care of the NFS issues and had this parameter set on all of our ESX hosts:
prefvmx.consolidateDeleteNFSLocks = "TRUE"
SMVI has worked three days in a row without any problems; we will see what this week brings.
I believe this is likely caused by the VMware VSS writer conflicting with the VSS writer that SnapDrive installs (you indicated that you were using SMSQL, etc.). For those VMs, I believe you will need to remove the VMware VSS writer for the SMVI backups to complete error-free.