Re: VSC restore crashes filer

dnewby · ‎2011-02-07

I have a 2020 AA cluster running 7.3.5, and I'm running vSphere 4.1 with VSC 2.0. Twice now, when I have attempted to restore one of my SMVI backups from VSC, it has frozen at 61%. Eventually it caused a panic on my filer, and my ESX servers lost connectivity to every datastore on that filer for around 30 minutes. I had NetApp on the phone during the last outage, and both they and VMware are baffled about what caused the problem.

What was even more concerning was what happened during the last restore. I had created a rapid clone of a production VM to test a new upgrade. The upgrade failed, so I went into SMVI and told it to restore the clone to the previous night's backup. I verified about 10 times that I was selecting the correct VM to restore. A few minutes later I started getting calls that the production server was not working. The server responded to pings, but I couldn't connect to it. I pulled up the console, but it was frozen. A couple of minutes later the filer crashed and I opened a ticket with NetApp.

When the filer switched controllers I looked at the datastore, and all of the snapshots were reporting 0 space used. Then suddenly vCenter and all the hosts lost connectivity to all the datastores on the first filer. All the datastores on the second filer were fine. After about 30 minutes, for no apparent reason, the datastores came back online and I booted the server up. However, my production server had been replaced by the backup from the previous night. I saw in the filer logs that the SMVI restore of the production server had failed, even though I never told it to restore that server.

Luckily, even though all the snapshots showed 0 space, I was able to browse to the latest snapshot and copy the files to a new folder, and I only lost about 5 minutes since I do hourly snapshots. If not, I would have lost a whole day's information. VMware says they don't see any problems at all in their logs, and NetApp said they don't know what happened either.

It's been several days now and all my snapshots still show 0 space in System Manager. I'm completely gun-shy about using VSC now. Has anyone had a similar problem?

forgette · ‎2011-02-16

Sounds like it's possibly a network issue. Have a look there.

I'm a bit confused about your comment that you "created a rapid clone of a production vm for testing" and then tried to restore something from SMVI. If you created a rapid clone, you should have had 2 VMs: the original and the clone. What were you trying to restore?

dnewby · ‎2011-02-16

I finally got a response back from NetApp today, and they said the filer crashed because the restore caused the volume to run out of space. I'm not sure why it would allow a restore that would run the storage system out of space and crash the filer. I believe the connectivity issue was due to a network problem. I didn't have single mode set in my rc file, so I was getting MAC flaps when the filer restarted, which caused the hosts to lose connectivity.

As for your question: I created a rapid clone, so I had the original VM named pmsi and a clone of it named pmsi-941. I selected the pmsi-941 VM to restore, and it restored over the pmsi VM.

Here are the entries from my SMVI.log:

2011-02-03 16:20:22,403 [INFO ] virtualMachineList succeeded. [PMSI (42252f4d-a5a6-b665-7310-d99ea7bc4b33), vcenter-64bit (42250612-ff13-b8f4-b118-575633e05fd2), Blackberry (42255a07-df2c-6a08-92fe-f843fb4ff56c), Citrix (422515eb-d57d-dc09-fc17-249cd5d9eb4e), pmsi-941 (4225625b-35a8-7629-fb0f-a9f51ce2b040)]

2011-02-03 16:20:35,923 [INFO ]restore entity 4225625b-35a8-7629-fb0f-a9f51ce2b040 using backup backup_na1_nf2_nightly_20110203013500 esx Host storage ID null datastore ID null include RDM false restart VM false

You can clearly see the ID is the correct one for pmsi-941.
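For anyone who wants to double-check this kind of thing without eyeballing UUIDs, here is a rough Python sketch that looks up the restore entity in the VM list. The two log lines are copied from the excerpt above; the function names and the regular expressions are my own guesses at the log layout, not anything SMVI provides.

```python
import re

# Both lines are copied verbatim from the SMVI.log excerpt above.
VM_LIST_LINE = (
    "2011-02-03 16:20:22,403 [INFO ] virtualMachineList succeeded. "
    "[PMSI (42252f4d-a5a6-b665-7310-d99ea7bc4b33), "
    "vcenter-64bit (42250612-ff13-b8f4-b118-575633e05fd2), "
    "Blackberry (42255a07-df2c-6a08-92fe-f843fb4ff56c), "
    "Citrix (422515eb-d57d-dc09-fc17-249cd5d9eb4e), "
    "pmsi-941 (4225625b-35a8-7629-fb0f-a9f51ce2b040)]"
)
RESTORE_LINE = (
    "2011-02-03 16:20:35,923 [INFO ]restore entity "
    "4225625b-35a8-7629-fb0f-a9f51ce2b040 using backup "
    "backup_na1_nf2_nightly_20110203013500 esx Host storage ID null "
    "datastore ID null include RDM false restart VM false"
)

# Assumed UUID shape (8-4-4-4-12 lowercase hex), based on the lines above.
UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"


def parse_vm_list(line):
    """Return {uuid: vm_name} from a 'virtualMachineList succeeded' line."""
    pairs = re.findall(r"([\w.-]+) \((" + UUID + r")\)", line)
    return {uuid: name for name, uuid in pairs}


def restore_entity(line):
    """Return the UUID named in a 'restore entity' line, or None."""
    match = re.search(r"restore entity (" + UUID + r")", line)
    return match.group(1) if match else None


vms = parse_vm_list(VM_LIST_LINE)
target = restore_entity(RESTORE_LINE)
# Falls back to a placeholder string if the UUID is not in the list.
print(target, "->", vms.get(target, "<not in VM list>"))
```

This only confirms what SMVI was asked to restore. It says nothing about what the filer actually did.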

The following error was logged on my NetApp filer once it finally came back up after the crash.

Thu Feb 3 18:36:10 CST [na1: app.log.crit:CRITICAL]: localhost: SMVI SnapManager for Virtual Infrastructure Server 3.0 (build date='100714_2200', version='1181'): (20105) Restore Failure, VC: Restore failure due to error in getting the completion status of the restore operation. Error Code = 20105 SMVI Server Error Messages = Failed to find progress of restore for lun /vol/vmware_nfs2/PMSI/vmware-19.log on storage system na1 Corrective Actions = This happens when VC no longer tracks a task (~15 minute

SMVI clearly says it's restoring pmsi-941, yet the filer says it was restoring PMSI, which unfortunately is what it actually restored. My other VMs on the same datastore were fine, apart from crashing when the hosts lost access to the datastores for about 30 minutes. NetApp is still investigating this issue.
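The mismatch can be spotted mechanically too. Below is a small Python sketch that pulls the path out of the filer message and compares its VM folder with the VM I asked for. The message fragment is copied from the filer log above. The helper names are made up, and it assumes each VM's files sit in a folder named after the VM directly under the volume, which may not hold on other setups.

```python
import re
from pathlib import PurePosixPath

# Fragment copied from the filer log entry above.
FILER_MESSAGE = (
    "Failed to find progress of restore for lun "
    "/vol/vmware_nfs2/PMSI/vmware-19.log on storage system na1"
)
# The VM that was selected for restore in SMVI.
REQUESTED_VM = "pmsi-941"


def restored_folder(message):
    """Return the VM folder from the path in the message, or None."""
    match = re.search(r"for lun (/vol/\S+)", message)
    if match is None:
        return None
    # Assumed layout: ('/', 'vol', <volume>, <vm folder>, <file>, ...)
    parts = PurePosixPath(match.group(1)).parts
    return parts[3] if len(parts) > 3 else None


def targets_match(requested_vm, folder):
    """Case-insensitive comparison of the requested VM and the folder."""
    return folder is not None and requested_vm.lower() == folder.lower()


folder = restored_folder(FILER_MESSAGE)
if not targets_match(REQUESTED_VM, folder):
    print("MISMATCH: asked for", REQUESTED_VM, "but filer touched", folder)
```

A check like this would only flag the problem after the fact, but it might be useful for scanning old logs to see whether earlier restores hit the wrong folder.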

Thanks,

Darkstar · ‎2011-02-21

Hmm... it probably works as designed (the restore from SMVI always overwrites the original VM IIRC).

But if you already did a FlexClone of the affected VM, there is actually no need to do a restore:

Just power off your original VM, start the cloned VM directly on the clone datastore, and then do a Storage VMotion back to the "correct" datastore (i.e. where the original VM was running). When this has finished, unmap/unmount and destroy the cloned volume and you're done.

-Michael

dnewby · ‎2011-02-21

I finally got an answer back from NetApp. "After some further research, I found the root cause of this. There is a bug in the version of the cloning portion of the VSC you are using that causes this behavior. The issue has been resolved in the current VSC release. Please upgrade to this version ASAP to prevent further occurrences."

I've asked for some documentation on the bug, but haven't received it yet. I'm glad it's not designed to overwrite the wrong VM, as anyone who would intentionally design software to overwrite a VM other than the one you specifically told it to needs to be shown the door.

I'm a bit confused by the second part of your post. I was trying to restore the clone to a previous point in time so I could run the test upgrade again. What would moving the clone do? Also, unfortunately, I don't have licensing for Storage VMotion.

Thanks,

Dustin

RSACCENTURE · ‎2012-04-18

Pardon the dumb question but - you said "Just power off your original VM, start the cloned VM directly on the clone datastore"

To do this, do you have to unregister your original vm, and reregister it on the new datastore, or is there something more obvious that I'm missing?

Thanks!

Christie

keitha · ‎2012-04-19

You can do either. If you unregister and reregister, you can keep the same VM name; if you just register the new one, the name will change. Registering the new one does, however, let you keep the original VM for root cause analysis. Be very careful, though, as you will now have 2 VMs with the same IP address (Danger!)

Keith