Two Windows Server 2016 virtual machines running on ESXi 6.7 in a Microsoft failover cluster. ONTAP 9.7P8. The servers use in-guest iSCSI initiators to mount a number of LUNs as CSV disks. SQL Server 2017 is installed as a failover cluster instance (not Always On availability groups) on these CSV disks. SnapCenter 4.4 manages both VMs with the plug-ins for Windows and MSSQL. The SnapCenter Plug-in for VMware vSphere virtual appliance is installed and configured, but not linked to the SnapCenter server itself, so the C: drive VMDKs are not managed by SnapCenter. Policies and schedules are configured to back up the databases residing on the CSV disks.
Problem: about half the time, backup jobs (both full and log backups) fail with the error message "Unable to find any healthy resource on NetApp storage".
This started when this Windows/SQL cluster was deployed several months ago to replace a previous Windows Server 2008 R2/SQL 2012 cluster, which used clustered LUNs (not CSVs) and SnapManager rather than SnapCenter and did not exhibit this behavior. At the time, SnapCenter was at version 4.3.1P2 and the filer was running ONTAP 9.4P5. Since then, the filer has been replaced by a new AFF A220 (new cluster, new LUNs, data moved by SQL backup/restore), and SnapCenter has been upgraded to version 4.4, but the problem persists. There is another SQL server in the same environment using the same settings except that it isn't clustered, so I suspect that the root of the problem is somewhere in the cluster settings, but I can't figure out what it is. Moving the instance between cluster nodes does not help. I tried digging through the SMCore and plug-in logs but couldn't find anything pertinent.
The backups do work about half the time. For example, I have a log backup configured to run every 30 minutes: it ran successfully at 02:00 and 02:30, failed at 03:00 and 03:30, then ran successfully five times, failed again at 06:30, ran successfully again, and so on.
I checked service accounts - SnapCenter SMCore Service and Plug-in for SQL Server Service were configured to run as local system, while Plug-in for Windows Service was configured to run as the DOMAIN\snapcenter account which is registered in SnapCenter. I set all the services on both nodes to use the DOMAIN\snapcenter account and restarted them, and the next four backups ran successfully, but the fifth failed with the same error as before.
I'm not sure I can open a case on this; the old filer on which this problem first manifested is out of support, while the new system was purchased with service from an ASP rather than direct from NetApp, and the ASP claims that SnapCenter is not covered by their contract.
Thanks, I've seen those, but they don't apply as they relate to working with VMDKs through data broker rather than in-guest connected LUNs.
I have opened a case with our ASP, waiting for them to respond. Meanwhile, I found a relevant portion in the plug-in for Windows job logs. This is what it logs when a LUN discovery fails for a specific LUN:
1. On both SC plug-in host Windows cluster nodes, as well as on the SC server, locate the <appSettings> line in the "C:\Program Files\NetApp\SnapCenter\SMCore\SMCoreServiceHost.exe.Config" file.
2. Add the line below underneath it: <add key="EnableWs2016NoRemoteCall" value="false" />
3. On the plug-in host(s), add the same line to the "C:\Program Files\NetApp\SnapCenter\SnapCenter Plug-in for Microsoft Windows\SnapDriveService.exe.config" file, restart the Plug-in for Windows service and the SMCore service, and retry the backup.
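If the same edit has to be repeated on several hosts, the config change described in the steps above could be scripted. The following is only a rough Python sketch, not a NetApp-provided tool: the function names `ensure_app_setting` and `patch_config_file` are made up for illustration, and it assumes each file is a standard .NET config with an `appSettings` section directly under the root element. Note that `xml.etree.ElementTree` drops comments and the XML declaration when rewriting, so the sketch keeps a `.bak` copy; editing by hand remains the safer option.

```python
import shutil
import xml.etree.ElementTree as ET

# Key and value taken from the steps above.
KEY = "EnableWs2016NoRemoteCall"
VALUE = "false"


def ensure_app_setting(config_xml: str, key: str = KEY, value: str = VALUE) -> str:
    """Return config_xml with <add key=... value=... /> present under <appSettings>.

    Hypothetical helper: updates the entry if the key already exists,
    otherwise inserts it as the first child of <appSettings>.
    """
    root = ET.fromstring(config_xml)
    app_settings = root.find("appSettings")
    if app_settings is None:
        raise ValueError("no <appSettings> section found in config")

    for entry in app_settings.findall("add"):
        if entry.get("key") == key:
            entry.set("value", value)  # key already present: just set the value
            break
    else:
        # Key not present: add it directly underneath the <appSettings> line.
        app_settings.insert(0, ET.Element("add", {"key": key, "value": value}))

    return ET.tostring(root, encoding="unicode")


def patch_config_file(path: str) -> None:
    """Back up the file at path to path + '.bak', then apply ensure_app_setting."""
    shutil.copy2(path, path + ".bak")
    with open(path, "r", encoding="utf-8-sig") as fh:  # tolerate a UTF-8 BOM
        original = fh.read()
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(ensure_app_setting(original))
```

`patch_config_file` would be called once per config file named in the steps (run elevated), and the Plug-in for Windows and SMCore services still need to be restarted afterwards, as in step 3.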
Thank you. I have made the change, and so far LUN discovery seems to be reliable. I've done a bunch of refreshes on the cluster (under hosts -> disks view), and whereas before they would occasionally show me three LUNs instead of five, now I get all five every time. The last ten log backups also ran successfully, a longer streak than I ever got before.
Just out of curiosity, where is this documented? Is it something available to customers, or only internally to NetApp support?