
SMVI - backing up VMs with iSCSI LUNs and SnapManager

marcconeley

Is it possible to use SMVI v2 to perform VM consistent snapshots of VMs that have iSCSI LUNs mounted?

I am using vSphere (latest v) and SMVI v2. All my VMs are mounted on an NFS volume on my FAS2020.

All my Exchange, SQL and Sharepoint servers are using SnapManager for backups, and therefore have their data mounted on iSCSI initiator LUNs within the VMs.

Until recently I was creating hourly "non VM consistent" snaps and daily "VM consistent" snaps (according to: http://blogs.netapp.com/virtualization/2009/07/scheduling-smvi.html).

I encountered 2 main problems with this:

1) The daily VM consistent snapshots would always fail on 4-5 of my servers, randomly it seemed. The rest would snapshot OK.

2) I noticed that many of the volumes containing my iSCSI LUNs were rapidly filling up.

It seems that the random server snapshot failures all had one thing in common: the affected servers all had iSCSI LUNs (though I have no idea why the snapshots sometimes worked and sometimes did not!).

It also seems that performing SMVI "VM consistent" snapshots of these servers causes a conflict with SnapManager, which results in the iSCSI LUN (mainly the SnapInfo one in the case of SQL!) also being snapped. This happens outside of the control of SnapManager. So in my case, after 1 week my SnapInfo vol reported full. When I checked the snapshots on this vol I could see 7 days' worth of SQL snaps (normal), but also 7 days of additional SMVI snapshots!! (very bad).
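One way to confirm this kind of overlap is to group the snapshot names on the affected volume by prefix. The following is only a minimal sketch: the prefixes (`smvi`, `sqlsnap`, `sqlinfo`) and the example snapshot names are assumptions for illustration, so substitute whatever names actually appear in your own `snap list` output.

```python
from collections import Counter

# Assumed name prefixes mapped to the tool presumed to have created the snapshot.
# These are illustrative; adjust them to match your own naming.
KNOWN_PREFIXES = {
    "smvi": "SMVI",
    "sqlsnap": "SnapManager for SQL",
    "sqlinfo": "SnapManager for SQL",
}


def classify(snapshot_name):
    """Return the tool assumed to own a snapshot, based on its name prefix."""
    lowered = snapshot_name.lower()
    for prefix, owner in KNOWN_PREFIXES.items():
        if lowered.startswith(prefix):
            return owner
    return "other"


def count_by_owner(snapshot_names):
    """Count snapshots per assumed owner."""
    return Counter(classify(name) for name in snapshot_names)


# Hypothetical snapshot names, e.g. copied from the snapshot list of one volume.
names = [
    "sqlinfo__sql01_01-10-2010_22.00.00",
    "sqlinfo__sql01_01-11-2010_22.00.00",
    "smvi_daily_20100110230000",
    "smvi_daily_20100111230000",
    "hourly.0",
]
counts = count_by_owner(names)
for owner, total in sorted(counts.items()):
    print(f"{owner}: {total}")
```

If SMVI snapshots show up on a volume that should only hold SnapManager data, that volume is probably being picked up by the SMVI job.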

This happened both on normal VM servers with iSCSI LUNs mapped and on my SQL servers.

Since then I read somewhere that VMware do not support VM snapshots of VMs with Microsoft initiated iSCSI LUNs (which rules out half my VMs!) and therefore I've removed these servers from my backup.

Is there any way around this problem?

Longer term, my NetApp partner is currently selling me a backup project whereby we will snapshot all VMs and then snapmirror them to a remote site for DR. The problem is that my most important VMs have iSCSI LUNs. My understanding, according to the above article, is that VM consistent snapshots are important because the VM is quiesced and the snapshots are clean. Therefore my backup idea is not good, as I'd basically be mirroring "dirty snapshots" to my remote DR site. Ideally these snapshots should be clean, VM consistent ones (or am I overvaluing the importance of this?).


Help appreciated.

//Marc

1 ACCEPTED SOLUTION

watan

Hi Marc,


1)

Steps to configure the ESX SW iSCSI Software Initiator using VI-Client

    * Enable ESX iSCSI SW initiator in ESX server.
    * Check License under Licensed Features of Configuration Tab.
    * Add the VMKernel Port under the Networking configuration option. Enable VMotion.
    * Under Security Profile select the port for the Software iSCSI client.
    * Select the ESX SW iSCSI initiator and click the "Enabled" checkbox.

Add ESX SW iSCSI targets to discovery list using VI-Client.

    * Navigate to the Properties of the iSCSI adapter and click on the Dynamic Discovery tab.
    * Choose to Add new Target.
    * Enter the IP address & port of the target server and click OK.

2) If you're not having any specific problems with using MS iSCSI then I personally don't see any reason to change it. If it's not broken...

Also refer to the following docs as they may have more on performance or general best practices.

The performance best practices :

For ESX 3.5: http://www.vmware.com/pdf/vi3_35/esx_3/r35u2/vi3_35_25_u2_iscsi_san_cfg.pdf

For ESX 4 : http://www.vmware.com/pdf/vsphere4/r40/vsp_40_iscsi_san_cfg.pdf


NetApp and VMware VI3 best practice Guide :

http://media.netapp.com/documents/tr-3428.pd


25 REPLIES

amritad

Hi,

This is a known VMware issue which is documented here http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009073. We are actively working with VMware in order to get this issue resolved.

VMware’s general recommendation is to disable both VSS components and the sync driver in VMware Tools (which translates to turning off VMware snapshots for any SMVI backup jobs that include virtual machines mapped with Microsoft iSCSI Software Initiator LUNs) in environments that include both Microsoft iSCSI Software Initiator LUNs in the VM and SMVI, thereby reducing the consistency level of a virtual machine backup to point-in-time consistency. However, by using SDW/SM to back up the application data on the Microsoft iSCSI Software Initiator LUNs mapped to the virtual machine, the reduction in the data consistency level of the SMVI backup has no effect on the application data.

Another recommendation for these environments is to use physical mode RDM LUNs, instead of Microsoft iSCSI Software Initiator LUNs, when provisioning storage in order to get the maximum protection level from the combined SMVI and SDW/SM solution: guest file system consistency for OS images using VSS-assisted SMVI backups, and application-consistent backups and fine-grained recovery for application data using the SnapManager applications. We understand the challenges here and continue to work with VMware to resolve this issue.

Regards

Amrita

thomas_glodde

Hi amritad,

well, thing is even IF we use physical mode iSCSI/FC RDMs, we still receive errors while backing up using SMVI and vmware consistency snapshots. We still get strange VSS errors. It seems that the Data ONTAP VSS Provider sorta messes up the VSS stack.

Well, let's hope you guys get it sorted out with VMware some day soon 🙂

Kind regards

Thomas

m_schuren

Hi Thomas and all,

disabling both sync driver and vss driver is a functioning workaround, but still only a workaround.

The problem has existed for a very long time, and has never been fixed completely. Results vary between different VMTools versions, guest OS versions, SnapDrive versions etc.

Disabling the drivers for the SQL/Exchange VM's OS drive is a good idea until this issue has fully been fixed by both NetApp (VSS provider component of SnapDrive) and VMware (VMTools VSS sync driver).

I personally wonder why it is so hard to fix these issues over years? They still exist with 4.0u1/SnapDrive 6.2P1 (and get worse with 4.0u2).

However, I think the main problem is not a technical one. The most annoying part of the problem is not to disable these two drivers - but to explain the "negligible" impact to a customer without inconvenience:

"Your application data is app consistent - guaranteed - because SME or SMSQL or our script or SnapCreator takes care of it. But your OS drive (SMVI), well... it is consistent enough... NTFS journal will take care of it... is not really guaranteed - but works most of the time..." I have been telling customers for years that "crash consistency" is not really harmful to an OS drive with no application data on it, but people simply do not FEEL well with this kind of workaround, and have been waiting for a fix/solution for years.

I think the world would appreciate a final and permanent fix, at least people would feel better with SMVI in combination with other SnapManagers.

This thread is very related by the way:

http://communities.netapp.com/message/30383#30383

Just my 2 cents,

Mark

HendersonD

Great reading, I have two thoughts:

1. We run two SMVI jobs. One job contains every VM except SQL and Exchange and the other job contains just SQL and Exchange. For the SQL and Exchange job I do not check the box that takes a VMware snapshot. For the other job, I do check that box. I also use SME and SMSQL to make sure I have backups of the data stored on these LUNs.

2. Take a look at this article, it shows how VMware has fallen behind Hyper-V in this regard:

http://www.backupcentral.com/content/view/287/47/

It would be nice if SMVI could be my only backup solution and I could get rid of SME and SMSQL entirely.
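The two-job split from point 1 could be sketched roughly as below. This is only an illustration of the grouping rule, not anything SMVI exposes: the `VM` class, the `has_guest_iscsi_luns` flag and the VM names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    # Hypothetical flag: True if the guest maps LUNs with the Microsoft iSCSI initiator.
    has_guest_iscsi_luns: bool


def split_into_jobs(vms):
    """Return (quiesced_job, crash_consistent_job) as lists of VM names.

    VMs without in-guest iSCSI LUNs go into the job that takes a VMware
    snapshot; VMs with them go into the job that skips it.
    """
    quiesced = [vm.name for vm in vms if not vm.has_guest_iscsi_luns]
    crash_consistent = [vm.name for vm in vms if vm.has_guest_iscsi_luns]
    return quiesced, crash_consistent


# Hypothetical inventory for illustration.
inventory = [
    VM("web01", False),
    VM("sql01", True),
    VM("exch01", True),
    VM("file01", False),
]
quiesced_job, crash_job = split_into_jobs(inventory)
print("Job with VMware snapshot:", quiesced_job)
print("Job without VMware snapshot:", crash_job)
```

The application data on the VMs in the second job still needs its own SnapManager backup, as described above.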

radnorsan

Just to report back after some months of using SMVI and tweaking our setup; here is what we settled on. 

We have one NFS volume with all VMs except for our Exchange server. This entire volume gets backed up at the volume level with SMVI. I don't have too many problems. Occasionally, one or two VMs fail to quiesce, so I only get a crash consistent snapshot of these. Other times all snapshot fine. I can't find any rhyme or reason as to which VMs fail or when; it seems to be completely random. I'm OK with this as it only happens to one or two VMs, once or twice per week. I snapshot my VMs every 6 hours, so I figure that as long as I have a crash consistent snapshot for point-in-time recovery and a successful VM consistent backup from 6 hours ago, just in case, I should be fine.
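The reasoning above (a crash consistent snapshot now, plus a recent VM consistent one as a fallback) can be checked mechanically. Here is a minimal sketch, assuming a hypothetical per-VM backup history of `(timestamp, quiesced)` pairs; the dates and the 12-hour tolerance are made up for the example.

```python
from datetime import datetime, timedelta


def latest_consistent(backups):
    """Return the timestamp of the newest VM consistent backup, or None."""
    consistent = [ts for ts, quiesced in backups if quiesced]
    return max(consistent) if consistent else None


def within_window(backups, now, max_age):
    """True if a VM consistent backup exists that is no older than max_age."""
    latest = latest_consistent(backups)
    return latest is not None and now - latest <= max_age


# Hypothetical history for one VM on a 6-hour schedule.
history = [
    (datetime(2010, 9, 1, 0, 0), True),
    (datetime(2010, 9, 1, 6, 0), True),
    (datetime(2010, 9, 1, 12, 0), False),  # quiesce failed: crash consistent only
]
now = datetime(2010, 9, 1, 12, 30)

print("Newest VM consistent backup:", latest_consistent(history))
print("Within 12 hours:", within_window(history, now, timedelta(hours=12)))
```

With a single failed run the fallback is one schedule interval older than usual, so the tolerance has to be longer than 6 hours for the check to pass.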

Anyway, the OS for my Exchange VM is in its own NFS volume. All partitions for Exchange databases and logs are on physical RDM LUNs connected via the ESXi iSCSI initiator. The RDM mapping file is in a separate iSCSI LUN shared between my hosts, so I can still vMotion Exchange and use all the other VMware supported features that you lose out on in going with the MS iSCSI initiator. The Exchange OS has its own backup job with the checkbox to quiesce the VM unchecked. I accept the crash consistent snapshots for the OS and I back up Exchange via SM for Exchange. This seems to work well, and I have had no problem with backup or recovery of mailboxes or emails using single mailbox recovery.

Just as an aside, I have recently upgraded from vSphere 4.0 and SnapDrive 6.2 to vSphere 4.1 and SnapDrive 6.3. I started with SMVI 2.0 and SM Exchange 6.0 and as of this post, these are the latest versions. Also, I am running ONTAP 7.3.2.

Chris
