Data Backup and Recovery

SMVI timeout errors and quiesce errors

jap
4,249 Views

Hi guys

I am quite new to the NetApp world, but still very happy with the gear 🙂

We have a MetroCluster with two FAS 3160 boxes, and on top of that 20 blades running VMware ESX, with a total of about 220 virtual machines.

We really like the SMVI product; it is very nice to do backups in under 1 hour.

BUT, we haven't yet made a complete error-free backup. Every day we see some machines not being backed up because of either an "Operation timed out" error or a "Creating a quiesced snapshot failed because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine" error.

Until now we have:

The latest host utils, 5_0R2 on all ESX boxes

The latest ESX version, update 4, 3.5.0 153875

The latest VMware Tools on all virtual machines

The latest vCenter, 2.5.0 u4

SMVI 1.0.1R1

Set the disk timeout registry key on all the VMs to 120 seconds, although I have seen some mention that it should be 190 seconds?

(HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk)

TimeOutValue = 120
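
For rolling the same timeout out to many guests, a small script can generate the registry import text instead of editing each VM by hand. The following is only a sketch: the helper name `build_disk_timeout_reg` and the idea of distributing a .reg file are assumptions, not something SMVI or the host utilities provide. It assumes the value is the REG_DWORD named TimeOutValue under the Disk service key, expressed in seconds.

```python
# Sketch: build the text of a .reg file that sets the guest disk timeout.
# The function name and the .reg-file approach are illustrative assumptions.

DISK_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk"


def build_disk_timeout_reg(seconds: int) -> str:
    """Return .reg file text setting TimeOutValue (REG_DWORD) to `seconds`."""
    if not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("timeout must fit in a positive 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{DISK_KEY}]",
        # .reg files store DWORDs as 8 hexadecimal digits
        f'"TimeOutValue"=dword:{seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # 120 seconds is the value used in this post; 190 is the alternative mentioned.
    print(build_disk_timeout_reg(120))
```

Generating the text for both 120 and 190 makes it easy to test either value on a few guests before changing all of them.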

We are only using VMFS datastores, and FC.

The strange thing is that the machines that failed in the last backup will probably succeed next time, but then others will fail.

Can anyone please help me with further tips on troubleshooting this, or is this the real world, and I must live with some machines failing?

Tonight we will try to do the backup without the "VMware consistency snapshot" option enabled, but we would very much like to keep that feature enabled.

Thanks.

Best regards

Jan Pedersen

5 REPLIES

avbohemen

Which guest OS(es) are you running? I have seen this issue many times on Windows 2003 (all versions, including R2, Standard, Enterprise, etc.). Even though you are only using VMDKs, and therefore not using SnapDrive, I still recommend you install several Microsoft hotfixes that are required for SnapDrive.

SMVI calls upon the VSS stack in Windows to create a quiesced snapshot. A lot of these hotfixes are meant to improve VSS. I found even newer hotfixes that helped reduce the number of quiesce errors further. However, the errors have not disappeared completely for me either. I've also opened several cases with NetApp, but they have not been able to resolve it fully. It seems to be a combined NetApp/Microsoft issue. Right now, I install these hotfixes on all of my Windows 2003 machines:

Hotfix 919117 (GPT disk support)
Hotfix 931300 (Snapshot mount hotfix)
Hotfix 937382 (LUN extend hotfix)
Hotfix 945119 (Storport.sys)
Hotfix 940349 (VSS update rollup)
Hotfix 949391 (VSS hardware snapshot delete hotfix)
Hotfix 951568 (VSS tracing hotfix)

Hotfix 932578 (ntfs.sys)
Hotfix 950974 (COM+ Security update)
Hotfix 934016 (COM+ 1.5 rollup 12)
Hotfix 954429 (hal.dll)
Hotfix 935926 (ntoskrnl.exe etc)
Hotfix 953323 (shell32.dll)

All of these hotfixes are available for x86 and x64 architectures. Let me know if this helps.

One more thing: are you using the StorPort driver for the LSI SCSI card in the VM? That should also help. The old type of driver is ScsiPort, and is called "symmpi.sys". The StorPort driver is "lsi_scsi.sys" (check in Device Manager).

Regards.

jap

Hi Anton,

Thanks for your reply.

I have done a little investigation based on your response.

We run about fifty/fifty Windows 2003 / Windows 2008, mostly 64 bit, in both versions.

They are mostly patched up to date, but without the extra hotfixes.

Out of 21 servers that failed to complete a successful backup, 3 were on Windows 2008 and the rest on Windows 2003.

Then I checked the SCSI driver on the Windows 2003 servers, and there we are using the symmpi.sys driver 😞

Do you know what happens if I change the driver? Will I lose access to the disk?

Of all the hotfixes you mention we only have these:

WindowsServer2003-KB931300-x86-ENU.exe
WindowsServer2003-KB932755-x86-ENU.exe
WindowsServer2003-KB937382-x86-ENU.exe
WindowsServer2003.WindowsXP-KB931300-x64-ENU.exe
WindowsServer2003.WindowsXP-KB932755-x64-ENU.exe
WindowsServer2003.WindowsXP-KB937382-x64-ENU.exe

So there are also some possibilities there...
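
To see exactly which of the recommended hotfixes are still missing, the installer file names can be compared against the KB numbers from the earlier reply. This is only a rough sketch: the function names are made up for illustration, and it assumes every installer file name contains its "KB" number, as the ones listed above do.

```python
import re

# KB numbers recommended earlier in this thread for Windows 2003 guests.
RECOMMENDED_KBS = {
    919117, 931300, 937382, 945119, 940349, 949391, 951568,
    932578, 950974, 934016, 954429, 935926, 953323,
}

# Installer files we currently have (x86 and x64 variants).
DOWNLOADED = [
    "WindowsServer2003-KB931300-x86-ENU.exe",
    "WindowsServer2003-KB932755-x86-ENU.exe",
    "WindowsServer2003-KB937382-x86-ENU.exe",
    "WindowsServer2003.WindowsXP-KB931300-x64-ENU.exe",
    "WindowsServer2003.WindowsXP-KB932755-x64-ENU.exe",
    "WindowsServer2003.WindowsXP-KB937382-x64-ENU.exe",
]

KB_PATTERN = re.compile(r"KB(\d+)", re.IGNORECASE)


def kb_numbers(filenames):
    """Extract the set of KB numbers found in a list of installer file names."""
    found = set()
    for name in filenames:
        match = KB_PATTERN.search(name)
        if match:
            found.add(int(match.group(1)))
    return found


def missing_kbs(recommended, filenames):
    """Return the recommended KB numbers that no installer file covers, sorted."""
    return sorted(recommended - kb_numbers(filenames))


if __name__ == "__main__":
    for kb in missing_kbs(RECOMMENDED_KBS, DOWNLOADED):
        print(f"Hotfix {kb} still to download")
```

Note that this only compares downloaded installers, not what is actually installed on each guest, and that KB932755 in our list is not among the recommended ones.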

I have spent the last few days searching NetApp's site and various blogs for help with this problem, but most of what I found seems to be based on marketing material (I think).

One thing I also found is this one:

http://www.veeammeup.com/2008/08/vss-and-vmware-esx-what-your-vmware.html

So maybe we have to use SMVI for most of the servers, and another solution for the more transaction critical servers.

Yes, I am aware of SnapManager for SQL and SharePoint, which we also have, but we have not yet had great success with them; we have asked our vendor to help solve that problem. It mainly concerns the RDM disks, in either virtual or physical mode, and problems with VMware's VMotion.

I will dig deeper into your information, do some reboots, and come back with the results.

Have you heard anything about what will be new in SMVI version 2.0, and when it will be released?

Best regards,

Jan

franciskim

Do you know where I might find Hotfix 931300?  I am running Win 2003 SP2 x86.  I understand Microsoft has the x64 version but not the x86?

Thanks,

Francis

aborzenkov

The Microsoft KB article lists both 32-bit and 64-bit versions of this hotfix.

franciskim

Thanks for that.  I checked again and was able to find it this time.  A little confusing that it lists Fix201376.
