I also wanted to let you know that NetApp is hosting an SMVI/SRM Webcast on February 19 focusing on data protection in a VMware environment: NetApp SMVI for backup/restore and VMware SRM for disaster recovery. SDDPC Sys Admin Rick Scherer - who designed and maintains a 25-host VMware ESX 3.5 farm with well over 300 virtual machines, and who writes a great blog (http://vmwaretips.com/wp/) - will be joining us to describe how his team uses SMVI. There will also be a panel of folks, including best practices authors and reference architects, to address questions submitted via chat.
We have been running SMVI 1.0.1 for several weeks now and it is pretty stable (compared to 1.0!). Last night, two of our daily backup jobs started - and they are still in the running state, producing event 4096 entries in the application log like this:
First, an error:
2936921 [backup2 6778732c8323002d813f32f6dab0368e] ERROR com.netapp.common.flow.JDBCPersistenceManager - FLOW-10110: Lock "50228a39-7eba-0c39-5d7c-bb5b34be305f" already held by backup-create operation 9a80d4f1bc7f39fa6e4cabb0559a097a 
Second, a warning:
2936921 [backup2 6778732c8323002d813f32f6dab0368e] WARN com.netapp.smvi.task.AcquireVirtualMachineLockWrapper - Could not lock all virtual machines. The lock process will continue to be retried until it succeeds.
Yesterday afternoon I was trying to restore a test VM, which failed! Could that perhaps have disturbed the backup jobs?
I cannot stop the two backup jobs - I have restarted the VC server (where SMVI is also installed) - nothing helps. I cannot start new ones either, so we are somewhat locked up!
I hope someone can give me a hint on how to stop this...
Oh - we found out why the scheduled jobs were running all night long until 30 minutes ago! One of the volumes on the DR filer was removed because the storage people thought we didn't need it anymore. But the SMVI jobs trigger a SnapMirror update, and that was failing the whole time.
Again, it would be very nice to know how we could stop these running jobs. As I said, we tried reinstalling SMVI, clearing all sorts of temp files and rebooting the VC server several times. Those jobs just cannot be killed!
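To illustrate why I think the jobs wedge (this is only my guess at the behaviour, pieced together from the FLOW-10110 and retry messages above, nothing documented): the backup takes a per-VM lock and then retries the failing step forever, so the lock is never released and every later job queues up behind it. A rough Python sketch of that pattern:

    import time

    def snapmirror_update():
        # Stand-in for the real SnapMirror update step; with the DR volume
        # removed it fails on every attempt.
        raise RuntimeError("destination volume does not exist")

    held_locks = set()

    def run_backup(vm_lock_id):
        held_locks.add(vm_lock_id)        # backup-create takes the VM lock
        while True:                       # "retried until it succeeds"
            try:
                snapmirror_update()
                break
            except RuntimeError:
                time.sleep(60)            # never succeeds, so it loops forever
        held_locks.discard(vm_lock_id)    # never reached - the lock stays held,
                                          # and the next job just logs FLOW-10110

If that is what is happening, the jobs will not finish until the SnapMirror destination exists again or the retry is interrupted on the SMVI side.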
=== DETAILS === Could not locate or create an initiator group on storage system "vanna01" for ESX server "192.168.16.65". Please ensure the ESX server has one or more initiators logged into the storage system.
=== CORRECTIVE ACTION === null
=== STACK TRACE ===
com.netapp.nmf.smvi.main.SmviErrorDetailException: Error restoring backed up entity
    at com.netapp.nmf.smvi.restore.RestoreGUICallBack.status(RestoreGUICallBack.java:52)
    at com.netapp.nmf.smvi.services.server.EventQueuedGUICallBack$2.run(EventQueuedGUICallBack.java:48)
    at java.awt.event.InvocationEvent.dispatch(Unknown Source)
    at java.awt.EventQueue.dispatchEvent(Unknown Source)
    at java.awt.EventDispatchThread.pumpOneEventForHierarchy(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.run(Unknown Source)
The ESX host in question is listed in VC as "192.168.16.65"
The Initiator group is called ISCSI-VANVS05
The initiator nodename is iqn.1998-01.com.vmware:vanvs05-5ae8d131
The initiator alias is vanvs05.van.davis.ca
The initiator is logged into the storage system
Is there something that is supposed to match up between the host/IP/name used to log in via SSH in SMVI and the igroup that I am not aware of?
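My working assumption (not confirmed anywhere, just how it looks from the error) is that SMVI asks the ESX host for its initiator node names and then searches the storage system for an igroup whose members include one of them - the hostname/IP used for the SSH login should not matter. Roughly this kind of check, using the values listed above:

    # Hypothetical illustration of the match SMVI appears to require:
    # an igroup on the filer must contain one of the ESX host's initiators.
    esx_initiators = {"iqn.1998-01.com.vmware:vanvs05-5ae8d131"}   # reported by the host

    igroups = {                                                    # as defined on vanna01
        "ISCSI-VANVS05": {"iqn.1998-01.com.vmware:vanvs05-5ae8d131"},
    }

    matches = [name for name, members in igroups.items() if members & esx_initiators]
    if matches:
        print("usable igroup(s):", matches)
    else:
        print("no igroup contains any of the host's initiators -> the SMVI error above")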
I have a customer running SMVI 1.0.1 on a 6080, with hundreds of VMs.
When we run a backup on a datastore holding 15 guests, SMVI creates one snapshot per machine, which makes it impossible for the customer to keep the snapshots on the main site for more than a few days (255 max snapshots per volume / 15 per day ≈ 17 days).
The customer got burned in the past by a virus attack that took a while to discover, and wants to retain the snapshots for more than 15 days. Is it possible to make SMVI take one snapshot of all 15 machines? (We are aware that all machines will need to be in hot backup mode and that this will affect performance for the duration of the backup window.)
It sounds like your customer is creating separate SMVI backups for each VM. For their case, I would suggest looking at creating a smaller number of backups with more VMs. They could try a single backup of just that datastore. SMVI will, by default, create VMware snapshots for every VM in the datastore, then it will create a single ONTAP snapshot for the datastore(s) involved.
Please be aware that if the VMs are experiencing heavy I/O, it is possible for one or more of the VMware snapshots to fail. In those cases, the VMs that did not complete their VMware snapshots will not be properly backed up. If this occurs, the two resolutions are to either a) disable VMware snapshots, or b) reduce the number of VMs per SMVI backup. If the choice is to reduce the number of VMs per SMVI backup, I would start by just dividing the datastore's VMs in half and testing.
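On the retention math: since each SMVI backup run leaves a single ONTAP snapshot on the volume, retention is simply the 255-snapshot-per-volume limit divided by the number of backup runs per day that hit that volume. A quick sketch (the job counts are only examples):

    MAX_SNAPSHOTS_PER_VOLUME = 255      # ONTAP per-volume snapshot limit

    def retention_days(backup_runs_per_day):
        # One ONTAP snapshot per SMVI backup run on the volume, so the
        # per-volume limit caps how many days of backups can be kept.
        return MAX_SNAPSHOTS_PER_VOLUME // backup_runs_per_day

    for runs_per_day in (15, 4, 1):     # many small jobs vs. a few consolidated ones
        print(runs_per_day, "runs/day ->", retention_days(runs_per_day), "days")
    # 15 runs/day -> 17 days, 4 -> 63 days, 1 -> 255 days

So consolidating the per-VM (or per-datastore) jobs into fewer SMVI backups directly stretches how long the snapshots can be kept.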
Appreciate the quick response. Apparently there was a misunderstanding between the customer and me:
He has 17 LUNs (300 GB each, 500 VMs total) inside one volume, and 17 backup jobs (at the datastore level) running one after the other. He doesn't want to back up all the machines in one snapshot, fearing that 400 quiesce requests will overload the servers, and that's why he's getting 17 snapshots per backup cycle.
He claims that our best practices for SMVI were the reason he chose a LUN size of 300 GB, which led to the need for 17 LUNs. I realise that we could ask them to move some of the LUNs to other volumes, but that would require downtime for some of the servers. Is there another way around this issue?
That is a lot of LUNs in a single volume. There are probably a few ways to reorganize the customer's data as long as they have the spare storage. Now, please be aware that I'm not the most qualified to give the exact steps on this. I would also suggest subscribing to dl-server-virtualization and asking this same basic question there if you do not hear anything else on this thread.
The root issue is that there are multiple VMFS datastores backed by LUNs all in the same volume. Finding the best solution to break this apart so that there is no downtime is the trick.
The basic way I can see to resolve this for this customer is by using VMotion and/or Storage VMotion. They should be able to bring up another VMFS datastore backed by a new LUN in another volume, then move several VMs over to the new VMFS datastore. I believe you can do this without powering off a VM, though I believe the VM cannot have any VMware snapshots at the time.
I personally am not well versed on Storage VMotion, so I cannot really give you any ideas on how well that would work for the customer.
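For completeness, if they do try the Storage VMotion route programmatically, the relocation itself is a single API call against VirtualCenter. Below is a very rough, untested sketch - pyVmomi, the hostname, credentials and object names are all placeholders of mine, not anything from this thread:

    # Hypothetical sketch: relocate (Storage VMotion) one running VM onto a new
    # VMFS datastore backed by a LUN in a different volume.
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vc.example.local", user="administrator", pwd="password")
    content = si.RetrieveContent()

    def find_by_name(vim_type, name):
        # Walk the inventory for an object of the given type with the given name.
        view = content.viewManager.CreateContainerView(content.rootFolder, [vim_type], True)
        for obj in view.view:
            if obj.name == name:
                return obj
        return None

    vm = find_by_name(vim.VirtualMachine, "some-vm")            # VM to move
    new_ds = find_by_name(vim.Datastore, "new_vmfs_datastore")  # datastore on the new LUN

    spec = vim.vm.RelocateSpec(datastore=new_ds)
    task = vm.RelocateVM_Task(spec)   # disks move to the new datastore, VM stays running
    # A real script should wait on task.info.state before moving the next VM.

    Disconnect(si)

Whether that is supported against the customer's VirtualCenter/ESX versions is something they would have to verify.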