I recently upgraded SnapDrive/SnapManager on a Windows Server 2008 R2 SP1 host to SnapDrive 6.5 and SnapManager for SQL Server 6.x. At the time I upgraded, as far as I know no other changes were being made to the database server. The server is a VMware virtual machine on vSphere 5.5, and the NetApp back end is a FAS3220 HA pair running Data ONTAP 8.1.2 7-Mode. The server is running SQL Server 2008 R2 SP2. I believe this configuration is supported according to the Interoperability Matrix.
Prior to updating this software, the hourly full SMSQL backups we were taking would complete in 3-5 minutes. Post-upgrade, I'm seeing different behavior (most of the time, at least). I also had to change the syntax of my job from what it was prior to the upgrade to get it to correctly pick up both SQL instances. From reading the documentation, it seemed that "new-backup -svr 'SERVERNAME' -d 'SERVERNAME' ..." should back up all instances on the server. That was not the case post-upgrade: only the default instance was getting backed up. So, I changed the job to enumerate the instances explicitly.
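For illustration, the per-instance form looks roughly like this. This is a sketch only: the instance and database names and the retention value are placeholders, not my exact job, and my reading of the cmdlet reference is that the -d list takes each instance followed by a database count and then that many database names (worth verifying against the SMSQL 6.x docs):

```powershell
# Sketch of a per-instance SMSQL backup call -- placeholders, not the real job.
# -d takes: 'SERVER\INSTANCE', '<db count>', '<db name>', ..., repeated per instance.
new-backup -svr 'SERVERNAME' `
    -d 'SERVERNAME', '1', 'UserDB1', 'SERVERNAME\INSTANCE2', '1', 'UserDB2' `
    -RetainBackups 8
```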
This new syntax works and backs up the databases successfully. However, the time the job takes to complete (backing up the same databases as under the old SnapDrive/SnapManager versions) has increased significantly, though not on every run.
Now, when I run SMSQL backups via a scheduled task, jobs that run between 7 am and 1 am complete in 11-12 minutes, while jobs that run between 2 am and 6 am complete in 2-3 minutes. Yes, you'd be right to point out that the fast runs fall during off-peak hours. However, nothing I'm aware of is hitting the server any harder between 10 pm and 1 am (to pick a simple example) than between 2 am and 6 am. And again, I never once saw this behavior prior to upgrading the SnapDrive/SnapManager packages.
The excess time seems to be due to SMSQL not retrieving the SQL Server database information promptly. When the jobs take longer to run, the event timeline looks like the following:
10:00:01 am - Event 308 logged - "SnapManager for SQL Server per-server license is licensed on server SERVERNAME"
10:10:22 am (This is the next SMSQL event logged) - Event 368 logged - "SQL Server database information was retrieved successfully."
From that point on, things move in the expected time frame.
When the jobs run as I would have expected (from 2-6 am), the database information is retrieved much more quickly:
6:00:03 am - Event 308 logged - "SnapManager for SQL Server per-server license is licensed on server SERVERNAME"
6:00:21 am (This is the next SMSQL event logged) - Event 368 logged - "SQL Server database information was retrieved successfully."
The job then completes within a couple of minutes.
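To put a number on the 308-to-368 gap, something like the following against the Application log works (a rough sketch; I filter on event IDs only because I'm not certain of the exact SMSQL event source string):

```powershell
# Pull the last day's 308/368 events and compute the gap between
# the first 308 and the 368 that follows it.
$events = Get-EventLog -LogName Application -After (Get-Date).AddDays(-1) |
    Where-Object { $_.EventID -eq 308 -or $_.EventID -eq 368 } |
    Sort-Object TimeGenerated

$start = $events | Where-Object { $_.EventID -eq 308 } | Select-Object -First 1
$end   = $events | Where-Object { $_.EventID -eq 368 -and
                                  $_.TimeGenerated -gt $start.TimeGenerated } |
         Select-Object -First 1
if ($start -and $end) {
    "Gap: {0:N0} seconds" -f ($end.TimeGenerated - $start.TimeGenerated).TotalSeconds
}
```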
I'm trying to figure out why I'm getting this 10-minute delay on the backups between 7 am and 1 am. Someone from our DBA group also sees that the query (or queries) coming from SMSQL on the delayed jobs hit the server harder during this time frame, but they haven't been able to pinpoint anything beyond the fact that the queries take a long time to complete.
Given the nature and source of this issue, other SMSQL tasks are impacted as well (database cloning, for example, as one might expect), since the slow step is retrieving the database information.
Does anyone have ideas on how I might troubleshoot this further and get the SQL Server database information retrieval back to a reasonable range, so that I don't hit a 10-minute delay on every SMSQL task I run between 7 am and 1 am? Any suggestions would be greatly appreciated, thanks!
What version of SDW and SMSQL did you upgrade from? Starting in SMSQL 6.x, the SQL Server Management Objects (SMO) API framework is used, as opposed to the older Distributed Management Objects (DMO) API.
However, in SMSQL 6.x, you can force the application to use the older DMO libraries to see if using the new libraries is causing a problem. If this is a route you would like to explore, I would suggest contacting NetApp Support for a case creation, as it does require modifying the registry.
Additionally, I would be curious what else is occurring on the storage controller during that time. You mentioned this is a virtual machine running on vSphere 5.5. Are you also using NetApp's Virtual Storage Console for virtual machine/datastore backups? If so, are those backups running during the 10 pm to 1 am timeframe?
Lastly, are any other backups hitting this server during this timeframe? Other SQL backups, Exchange backups, Oracle backups, etc.? Are you utilizing SnapMirror or SnapVault with replication running during this time?
In terms of versioning, SnapDrive 6.5 has not been tested with vSphere 5.5. vSphere 5.5 and ESXi 5.5 have been verified with SDW 7.0 and higher.
There are a number of reasons why you could be experiencing slowdown. The more information we have to work with, the better we can determine where the probable issue resides.
Thanks for the good info, Andy. I was coming from SnapDrive 6.4.2 and SMSQL 5.2, so this is our first jump into SMSQL 6.x.
Your suggestion to go through support (in addition to the information you've provided) seems like a logical step to try to rein this in, so I will likely try that route.
To answer your other questions:
We do use VSC for VM/datastore backups, but not during the window in question, so I don't think that's a factor here.
It's only the SMSQL backups that are occurring on the server at this time, from everything I can tell. I took a look through the SQL Agent jobs on the box and didn't see anything that looked like it would get in the way.
We're utilizing SnapMirror replication upon SMSQL job completion.
I hadn't realized the lack of testing between SnapDrive 6.5 and vSphere 5.5. I can't move to SDW 7.0 yet, as I believe the matrix says I need to be on Data ONTAP 8.2 for that to be supported (we're currently on 8.1.2).