SnapCenter MSSQL backups fail intermittently

borismekler

Two Windows Server 2016 virtual machines run on ESXi 6.7 as a Microsoft failover cluster; the storage runs ONTAP 9.7P8. The servers use in-guest iSCSI initiators to mount a number of LUNs as CSV disks. SQL Server 2017 is installed as a failover cluster instance (not Always On availability groups) on these CSV disks. SnapCenter 4.4 manages both VMs with the plug-ins for Windows and MSSQL. The SnapCenter Plug-in for VMware vSphere virtual appliance is installed and configured but not linked to the SnapCenter server itself, so the C: drive VMDKs are not managed by SnapCenter. Policies and schedules are configured to back up the databases residing on the CSV disks.

Problem: about half of the backup jobs, both full and log backups, fail with the error message "Unable to find any healthy resource on NetApp storage".

This started when this Windows/SQL cluster was deployed several months ago to replace a previous Windows Server 2008 R2/SQL 2012 cluster, which used clustered LUNs (not CSVs) and SnapManager rather than SnapCenter and did not exhibit this behavior. At the time, SnapCenter was at version 4.3.1P2 and the filer was running ONTAP 9.4P5. Since then, the filer has been replaced by a new AFF A220 (new cluster, new LUNs; data was moved by SQL backup/restore) and SnapCenter has been upgraded to version 4.4, but the problem persists. Another SQL server in the same environment uses the same settings except that it isn't clustered, so I suspect the root of the problem is somewhere in the cluster settings, but I can't figure out what it is. Moving the instance between cluster nodes does not help. I tried digging through the SMCore and plug-in logs but couldn't find anything pertinent.

7 REPLIES

hmoubara

Hello,

Can you confirm whether the backup has ever worked in the new environment, or whether the issue is intermittent?

Have you tried restarting the services of the SnapCenter plug-ins for SQL and Windows using the SnapCenter local administrative account?

If the issue persists after restarting the services, we would recommend collecting the SnapGathers logs and opening a case with support so that they can dig deeper into the logs.

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Protection_and_Security/SnapCenter/How_to_run_the_SnapGathers_collection_tool

Thanks

borismekler

The backups work - about half the time. For example, I have a log backup configured to run every 30 minutes; it ran successfully at 02:00 and 02:30, then failed at 03:00 and 03:30, then ran successfully five times and failed again at 06:30, then ran successfully again, and so on.

I checked service accounts - SnapCenter SMCore Service and Plug-in for SQL Server Service were configured to run as local system, while Plug-in for Windows Service was configured to run as the DOMAIN\snapcenter account which is registered in SnapCenter. I set all the services on both nodes to use the DOMAIN\snapcenter account and restarted them, and the next four backups ran successfully, but the fifth failed with the same error as before.

I'm not sure I can open a case on this; the old filer on which this problem first manifested is out of support, while the new system was purchased with service from an ASP rather than direct from NetApp, and the ASP claims that SnapCenter is not covered by their contract.

hmoubara

Hello,

Without the logs, going by the error message and the fact that the issue is intermittent, I would double-check the following:

  • The account used for the SnapCenter services has admin rights.
  • The Vservers (SVMs) are added in SnapCenter.
  • Re-enter the credentials for the plug-in in SnapCenter (push the credentials).
  • Update the SnapCenter Plug-in for VMware interface in the vSphere Web Client plug-in with an account that has SnapCenter Admin privileges.

Below are also some Knowledge Base articles that were created for a similar issue on a previous version of SnapCenter and might help resolve yours:

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Protection_and_Security/SnapCenter/SCSQL_backup_fails_with_error%3A_Unable_to_find_storage_syste...

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Protection_and_Security/SnapCenter/Unable_to_find_any_healthy_resource_on_NetApp_storage

Thanks

borismekler

Thanks, I've seen those, but they don't apply as they relate to working with VMDKs through data broker rather than in-guest connected LUNs.

I have opened a case with our ASP, waiting for them to respond. Meanwhile, I found a relevant portion in the plug-in for Windows job logs. This is what it logs when a LUN discovery fails for a specific LUN:

 

2020-11-25T19:55:20.6124911+02:00 Verbose SDW PID=[12488] TID=[8584] ++HostDiscoveryHelper::GetHostFileSystem
2020-11-25T19:55:20.6124911+02:00 Verbose SDW PID=[12488] TID=[8584] GetHostFileSystem: volume.ObjectId = '\\?\Volume{98bfad9e-06df-4fc6-99af-841fce03e495}\'
2020-11-25T19:55:20.6124911+02:00 Verbose SDW PID=[12488] TID=[8584] ++HostDiscoveryHelper::GetPartitionForVolume
2020-11-25T19:55:20.6134910+02:00 Verbose SDW PID=[12488] TID=[8584] GetPartitionForVolume: return null for remote dedicated disk '\\?\Volume{98bfad9e-06df-4fc6-99af-841fce03e495}\'
2020-11-25T19:55:20.6134910+02:00 Verbose SDW PID=[12488] TID=[8584] --HostDiscoveryHelper::GetPartitionForVolume
2020-11-25T19:55:20.6134910+02:00 Verbose SDW PID=[12488] TID=[8584] --HostDiscoveryHelper::GetHostFileSystem
2020-11-25T19:55:20.6134910+02:00 Verbose SDW PID=[12488] TID=[8584] GetHostFileSystemList: no filesystem for \\?\Volume{98bfad9e-06df-4fc6-99af-841fce03e495}\
2020-11-25T19:55:20.6134910+02:00 Verbose SDW PID=[12488] TID=[8584] ++HostDiscoveryHelper::GetHostFileSystem

 

And this is what it logs when discovery of the same LUN (note the volume GUID) succeeds, only a few minutes later:

2020-11-25T20:03:42.7928464+02:00 Verbose SDW PID=[12488] TID=[7540] GetHostFileSystem: volume.ObjectId = '\\?\Volume{98bfad9e-06df-4fc6-99af-841fce03e495}\'
2020-11-25T20:03:42.7928464+02:00 Verbose SDW PID=[12488] TID=[7540] ++HostDiscoveryHelper::GetPartitionForVolume
2020-11-25T20:03:42.7928464+02:00 Verbose SDW PID=[12488] TID=[7540] resultsAccessPaths.Count = 1
2020-11-25T20:03:42.7928464+02:00 Verbose SDW PID=[12488] TID=[7540] GetPartitionForVolume: return local partition of 'C:\ClusterStorage\Logs\' w/c has only 1 partition
2020-11-25T20:03:42.7928464+02:00 Verbose SDW PID=[12488] TID=[7540] --HostDiscoveryHelper::GetPartitionForVolume
2020-11-25T20:03:42.7938494+02:00 Verbose SDW PID=[12488] TID=[7540] GetHostFileSystem: partition.DiskId = '\\?\Disk{a767fbad-edcd-4be6-ac45-58926e81e625}', partition.Server = 'SQLPRODNODE1'
2020-11-25T20:03:42.7938494+02:00 Verbose SDW PID=[12488] TID=[7540] ++HostDiscoveryHelper::GetDiskForPartition
2020-11-25T20:03:42.7938494+02:00 Verbose SDW PID=[12488] TID=[7540] partition.DiskId: \\?\Disk{a767fbad-edcd-4be6-ac45-58926e81e625}
2020-11-25T20:03:42.7938494+02:00 Verbose SDW PID=[12488] TID=[7540] ++ConfigManager::ShouldCompareDiskAndPartitionServer
2020-11-25T20:03:42.7948496+02:00 Verbose SDW PID=[12488] TID=[7540] --ConfigManager::ShouldCompareDiskAndPartitionServer
2020-11-25T20:03:42.7948496+02:00 Verbose SDW PID=[12488] TID=[7540] ++ConfigManager::GetRemoteDiskEnabled
2020-11-25T20:03:42.7948496+02:00 Verbose SDW PID=[12488] TID=[7540] --ConfigManager::GetRemoteDiskEnabled
2020-11-25T20:03:42.7958570+02:00 Verbose SDW PID=[12488] TID=[7540] --HostDiscoveryHelper::GetDiskForPartition
2020-11-25T20:03:42.7958570+02:00 Verbose SDW PID=[12488] TID=[7540] GetHostFileSystem: disk.ObjectId = '\\?\Disk{a767fbad-edcd-4be6-ac45-58926e81e625}'
2020-11-25T20:03:42.7958570+02:00 Verbose SDW PID=[12488] TID=[7540] GetHostFileSystem: fileSystem = (VolumeGuid='\\?\Volume{98bfad9e-06df-4fc6-99af-841fce03e495}\', AccessPaths='C:\ClusterStorage\Logs\,\\?\Volume{98bfad9e-06df-4fc6-99af-841fce03e495}\,\\?\Volume{98bfad9e-06df-4fc6-99af-841fce03e495}\', DiskID='\\?\Disk{a767fbad-edcd-4be6-ac45-58926e81e625}')
2020-11-25T20:03:42.7958570+02:00 Verbose SDW PID=[12488] TID=[7540] --HostDiscoveryHelper::GetHostFileSystem

 

Obviously, when it can't find the logs volume for databases, it cannot execute the backup. I'm at my wits' end, however, regarding what it is that causes this intermittent 'return null' error.
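
For anyone chasing the same symptom, here is a minimal Python sketch (not part of SnapCenter) that scans plug-in for Windows log lines for the "return null for remote dedicated disk" message quoted above and counts how often each volume is affected. The function names are made up for illustration, and the pattern is derived only from the excerpts in this post, so it may need adjusting for other log formats.

```python
import re
from collections import Counter

# Matches the failing line from the excerpt above and captures the timestamp
# and the volume path. The pattern is derived only from that excerpt.
NULL_PARTITION = re.compile(
    r"^(?P<ts>\S+) .*GetPartitionForVolume: return null for remote dedicated disk "
    r"'(?P<volume>[^']+)'"
)


def find_null_partitions(lines):
    """Return (timestamp, volume path) pairs for every failed partition lookup."""
    hits = []
    for line in lines:
        match = NULL_PARTITION.match(line)
        if match:
            hits.append((match.group("ts"), match.group("volume")))
    return hits


def count_by_volume(hits):
    """Count failed lookups per volume path."""
    return Counter(volume for _, volume in hits)


# Two lines copied from the excerpts above: one failure, one success.
SAMPLE_LINES = [
    r"2020-11-25T19:55:20.6134910+02:00 Verbose SDW PID=[12488] TID=[8584] GetPartitionForVolume: return null for remote dedicated disk '\\?\Volume{98bfad9e-06df-4fc6-99af-841fce03e495}\'",
    r"2020-11-25T20:03:42.7928464+02:00 Verbose SDW PID=[12488] TID=[7540] GetPartitionForVolume: return local partition of 'C:\ClusterStorage\Logs\' w/c has only 1 partition",
]

if __name__ == "__main__":
    for volume, count in count_by_volume(find_null_partitions(SAMPLE_LINES)).items():
        print(count, volume)
```

To run it against a real log, pass an open file object instead of SAMPLE_LINES, for example find_null_partitions(open(path, encoding="utf-8", errors="replace")), where path is wherever the plug-in logs live on your host.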

hmoubara

Hello,

Try the steps below:

1. On both SnapCenter plug-in host Windows cluster nodes, as well as on the SnapCenter server, locate the <appSettings> line in the "C:\Program Files\NetApp\SnapCenter\SMCore\SMCoreServiceHost.exe.Config" file.
2. Add the line below underneath it:
<add key="EnableWs2016NoRemoteCall" value="false" />
3. On the plug-in host(s), add the same line to the "C:\Program Files\NetApp\SnapCenter\SnapCenter Plug-in for Microsoft Windows\SnapDriveService.exe.config" file, restart the Plug-in for Windows service and the SMCore service, and retry the backup.
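
If you would rather script the edit than make it by hand, a minimal Python sketch is below. It is an illustration, not a NetApp tool: the function name is made up, and it assumes the file is a standard .NET configuration file with an appSettings element directly under the root. It works on the file contents as a string, so keep a backup copy of the original file; note also that the serialized output does not include an XML declaration line even if the input had one.

```python
import xml.etree.ElementTree as ET

KEY = "EnableWs2016NoRemoteCall"  # key name from the steps above
VALUE = "false"                   # value from the steps above


def ensure_app_setting(config_xml, key=KEY, value=VALUE):
    """Return config_xml with <add key=... value=... /> present under appSettings."""
    # insert_comments=True keeps existing XML comments in the tree (Python 3.8+).
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(config_xml, parser=parser)

    app_settings = root.find("appSettings")
    if app_settings is None:
        raise ValueError("no appSettings element found under the root element")

    for entry in app_settings.findall("add"):
        if entry.get("key") == key:
            entry.set("value", value)  # key already present: just set the value
            break
    else:
        ET.SubElement(app_settings, "add", {"key": key, "value": value})

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical, heavily shortened config used only to demonstrate the call.
    sample = (
        "<configuration><appSettings>"
        '<add key="Existing" value="1" />'
        "</appSettings></configuration>"
    )
    print(ensure_app_setting(sample))
```

The function is idempotent, so running it twice does not add a second copy of the key. The services still have to be restarted afterwards as described in step 3.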

Let me know if this works for you.

Thanks

borismekler

Thank you. I have made the change, and so far LUN discovery seems to be reliable. I've done a bunch of refreshes on the cluster (under the hosts -> disks view), and whereas before they would occasionally show me three LUNs instead of five, now I get all five every time. The last ten log backups also ran successfully - a longer streak than I ever got before.

Just out of curiosity, where is this documented? Is it something available to customers, or only internally to NetApp support?

hmoubara

Hello,

It is publicly available, but I was not able to confirm that it applied until you shared the logs in your previous reply. Here is the KB for your reference:

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Protection_and_Security/SnapCenter/SnapCenter_fails_to_discover_resources_or_update_resource_gro...

Thanks
