Re: SMSQL 5.2 Restore Fails

ITINFRASTRUCTURETEAM · ‎2013-03-12

I've been battling an issue with an SMSQL clone database restore operation for the better part of six months. Here's a snippet from today's restore log.

Snapshot Name: sqlsnap__TWSQL_03-12-2013_10.16.03

Successfully completed data transfer.

SnapManager timed out while waiting for the completion of VDI database restore!

[10:27:13.474] **** RESTORE RESULT SUMMARY ****

[10:29:40.660] Restore Database - #1 [Apex_Mirror__Temp]: Failed with error code 0xc00408ab

*** SNAPMANAGER MOUNT BY RESTORE (CLONE) JOB ENDED AT: [03-12-2013 10.27.19]

Error Code: 0xc00408ab

SnapManager timed out while waiting for the completion of VDI database restore.

Msg 3014, SevLevel 0, State 1, SQLState 01000

[10:29:50.855] [Microsoft][ODBC SQL Server Driver][SQL Server]RESTORE DATABASE successfully processed 0 pages in 688.282 seconds (0.000 MB/sec).

The SQL command executed.

When the process works, it completes in 20 minutes or less; when it fails, it takes upwards of 45 minutes to do so. I've noticed that on the occasions when it fails, the database that SMSQL mounts as <dbname>__temp lingers in an (in recovery) state for an abnormally long time. That looks like a likely cause of the SMSQL timeout errors above. My only question is why the database sticks in the (in recovery) state sometimes but not every time. The process completes successfully about 60% of the time and fails the other 40%. Is there some issue or bug that could make the backup on the source SQL server inconsistent, and thereby cause the database to enter the (in recovery) state when mounted on the destination SQL server?
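One way to keep track of how often a run fails this way is to pull the key fields out of each restore log. The following is only a minimal sketch, assuming log text shaped like the excerpt above; the function name and the returned field names are made up for illustration and are not part of SMSQL.

```python
import re

# Matches lines such as:
#   Restore Database - #1 [Apex_Mirror__Temp]: Failed with error code 0xc00408ab
FAILED_RE = re.compile(
    r"\[(?P<db>[^\]]+)\]: Failed with error code (?P<code>0x[0-9a-fA-F]+)")

# Matches the SQL Server message:
#   RESTORE DATABASE successfully processed 0 pages in 688.282 seconds
PAGES_RE = re.compile(
    r"successfully processed (?P<pages>\d+) pages in (?P<secs>[\d.]+) seconds")

TIMEOUT_TEXT = "timed out while waiting for the completion of VDI database restore"


def summarize_restore_log(text):
    """Extract failures, the VDI timeout flag, and the RESTORE duration."""
    failures = [(m.group("db"), m.group("code").lower())
                for m in FAILED_RE.finditer(text)]
    pages = PAGES_RE.search(text)
    return {
        "failures": failures,
        "vdi_timeout": TIMEOUT_TEXT in text,
        "restore_seconds": float(pages.group("secs")) if pages else None,
    }
```

Run against the excerpt above, this reports one failure for Apex_Mirror__Temp with code 0xc00408ab, the timeout flag set, and a RESTORE duration of 688.282 seconds.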

If anyone has any input, it would be greatly appreciated. This has caused endless hours of headaches over the last several months.

SDEGENHARDT · ‎2013-03-30

Hey. I've been working at a customer's site that has the same problem you have. We're still working through it as well. Here, the cause has always been a corrupt or inconsistent snapshot backup of the database. So, if you're troubleshooting, try creating a new snapshot backup and restoring it. The SMSQL services may need to be restarted on the destination server first. Two big things you'll want to check to prevent it from happening:

-Make sure that the volume/LUN/database layout is supported by SMSQL. Refer to the SMSQL documentation on that. It has several pages with descriptions and accompanying images.

-There shouldn't be any other SQL backups (native or third party) running while the snapshot backups are running.

Like I said, we're still working through it here too so I'll let you know if I come across any other useful info. Please let me know what you find too. Thanks.

ITINFRASTRUCTURETEAM · ‎2013-04-03

Hey Scott,

I believe the trouble was simply that our database has certain periods of heavy usage, which caused the SQL recovery process to take longer than SMSQL was willing to wait. On top of that, the occasional user would try to access the clone database while the resync process was running. The layout and architecture have been vetted by NetApp support, and we aren’t using any other backup software on that machine. After months of troubleshooting this issue with NetApp support, I finally proposed scripting the entire process using PowerShell with the DataOntap modules, SDCLI, and SQLCMD. NetApp support agreed that this might be the easiest fix for our predicament. I just finished the script earlier this week, and it has been in production for 3 days without issue. Essentially, this works because the script is not affected by how long our database takes to recover when mounted, so the job doesn’t fail.

If you are interested, I can provide the script to you. I’m sure it would require some changes to work in your environment, but it would get you headed in the right direction. I can help with it if you're unfamiliar with scripting. Just let me know whether that interests you.

Thanks,

William

SDEGENHARDT · ‎2013-04-03

Interesting. So, you had heavy usage on the destination server during the restore then? Or are the source and destination the same server? On a restore that failed, were you later able to restore that same snapshot?

Yeah, that would be awesome if you could send me the script you're using. We were using a Perl script before we switched to the clone commands. Using the clone commands solved some problems and created new problems. Most of the problems with the Perl script were because it was poorly written though. I'll change my profile so my e-mail address is viewable to registered users.

Also, can you send me the case number(s) that you opened for these problems? I might reference your cases when I open one. I've also got some contacts at NetApp that might be able to help to see if there is any yet to be published info on this. Thanks!

ITINFRASTRUCTURETEAM · ‎2013-04-04

The heavy usage is on the source server during the snapshot process. The log file alone is about 170 GB, so when the FlexClone is mounted on the destination server (a different machine), it takes the database a while to come fully online. The database sometimes stays in an "in recovery" state for 10 to 20 minutes. According to our DBA, this is normal behavior. And yes, when the restore process failed, we could use the same snapshot if we mounted it manually and gave it enough time to finish recovering.
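The "give it enough time" step can be sketched as a polling loop that waits for the mounted database to report ONLINE instead of failing at a fixed timeout. This is only an illustration, not the production script mentioned in this thread, and it is written in Python rather than PowerShell. It assumes sqlcmd is on the PATH and that Windows authentication works against the destination instance; the function names, the one-hour default timeout, and the 30-second poll interval are all hypothetical choices.

```python
import subprocess
import time

# sys.databases.state_desc values for which waiting longer will not help.
FATAL_STATES = {"SUSPECT", "RECOVERY_PENDING", "EMERGENCY", "OFFLINE"}


def query_state(server, database):
    """Return state_desc for `database` via sqlcmd, or None if no row exists."""
    # Double any single quote so the name stays a valid T-SQL string literal.
    safe_name = database.replace("'", "''")
    sql = ("SET NOCOUNT ON; SELECT state_desc FROM sys.databases "
           "WHERE name = N'" + safe_name + "'")
    # -E: trusted connection, -h -1: no column headers, -W: trim trailing spaces.
    result = subprocess.run(
        ["sqlcmd", "-S", server, "-E", "-h", "-1", "-W", "-Q", sql],
        capture_output=True, text=True, check=True)
    lines = [ln.strip() for ln in result.stdout.splitlines() if ln.strip()]
    return lines[0] if lines else None


def wait_until_online(get_state, timeout_s=3600, poll_s=30,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns ONLINE; return the seconds waited."""
    start = clock()
    while True:
        state = get_state()
        waited = clock() - start
        if state == "ONLINE":
            return waited
        if state in FATAL_STATES:
            raise RuntimeError(f"database entered state {state}")
        if waited >= timeout_s:
            raise TimeoutError(f"still {state} after {waited:.0f} s")
        # RESTORING, RECOVERING, or not yet visible: keep waiting.
        sleep(poll_s)
```

A call might look like `wait_until_online(lambda: query_state("DESTSQL01", "Apex_Mirror__Temp"))`, where DESTSQL01 is a placeholder server name. Passing the state lookup in as a function keeps the wait logic separate from sqlcmd, so the loop can be exercised without a SQL Server instance.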

I'll clean up the script a bit and get it over to you as soon as possible. Here is the case number that I opened: 2004035419