Network and Storage Protocols

Failover of CIFS - Issues with DFS

DinoBob

Perhaps those of you who use DFS can help.

 

I have a pretty typical setup with a cluster at my prod site and a cluster at my DR site, with SnapMirror protecting my CIFS data. The hardware is AFF8080s running cDOT 9.

 

I have a DNS A record for my production CIFS SVM: Let's call it

 

uscifsProd1.companyname.net

 

and a DNS A record for my DR SVM. Let's call it

 

uscifsDR1.companyname.net

 

I also have a CNAME record pointing to the production CIFS SVM:

 

usfs1.companyname.net > uscifsProd1.companyname.net

 

I use this name in my DFS namespace, so a typical target looks like:

 

\\usfs1.companyname.net\Project1\NYC

 

During a failover, I will "float" the CNAME record over to my DR CIFS SVM:

 

usfs1.companyname.net > uscifsDR1.companyname.net
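
The "float" is roughly something like this with the DnsServer PowerShell module on a DC (sketch only; the short TTL is illustrative, not necessarily what you'd run):

# Re-point the alias at the DR SVM (zone/record names as above)
Remove-DnsServerResourceRecord -ZoneName "companyname.net" -Name "usfs1" -RRType "CName" -Force
Add-DnsServerResourceRecordCName -ZoneName "companyname.net" -Name "usfs1" `
    -HostNameAlias "uscifsDR1.companyname.net" -TimeToLive ([TimeSpan]::FromMinutes(5))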

 

 

 

I also change the Service Principal Names for the CIFS service:

 

Old:

 

setspn.exe -D HOST/usfs1.companyname.net USCIFSProd1

 

setspn.exe -D HOST/usfs1 USCIFSProd1

 

New:

 

setspn.exe -A HOST/usfs1.companyname.net USCIFSDR1

 

setspn.exe -A HOST/usfs1 USCIFSDR1
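
(A quick way to sanity-check the swap, for anyone trying this: -Q searches the forest for the SPN, and -L lists what's registered on a computer object.)

# Confirm the SPN now lives on the DR machine account and is not duplicated
setspn.exe -Q HOST/usfs1.companyname.net
# List all SPNs registered on the DR CIFS server's computer object
setspn.exe -L USCIFSDR1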

 

 

I then force a replication in AD. Once the DNS change propagates, clients should be able to access the CIFS shares at the DR site.
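
("Force a replication" here is just the usual repadmin push; flushing the DNS caches speeds up propagation. Rough sketch, nothing NetApp-specific:)

# Push AD replication out from this DC to all partners, all partitions
repadmin /syncall /AdeP
# Clear the DNS server cache and a client's resolver cache so the new CNAME target is picked up
Clear-DnsServerCache -Force
ipconfig /flushdns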

 

Key facts:

 

The CNAME record is updated, resolves to the DR SVM, and responds correctly to pings.
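
(For reference, the check is just along these lines:)

# Confirm the alias now points at the DR SVM's A record and answers
Resolve-DnsName -Name usfs1.companyname.net -Type CNAME
ping usfs1.companyname.net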

 

The workstation can access the share if I browse directly via

 

\\usfs1.companyname.net\Project1\NYC

or

 

\\uscifsDR1.companyname.net\Project1\NYC

However, when I browse to the network locations via the drive letter assigned to the namespace, i.e.

 

N:\Project1\NYC

 

or the UNC which uses the namespace, i.e.

 

\\companyname.net\NDrive\ProjectVol\Project1\NYC

 

I receive an error: The network path cannot be found.

 

A Wireshark trace reveals a Kerberos mismatch, so it's not a network issue; Kerberos authentication is what's failing.
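
(One quick way to see the Kerberos side from a test client — just a diagnostic sketch, and the exact SPN the client asks for may differ — is to request a service ticket for the alias directly:)

# Request a CIFS service ticket for the alias and see whether the KDC can issue one
klist get cifs/usfs1.companyname.net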

 

I have tried using KLIST to purge every ticket I can think of, including those of the Network Service and Local System accounts. I have also purged the DFS caches using DfsUtil. All to no avail.
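
(For anyone following along, these are the kinds of purge commands I mean; 0x3e7 and 0x3e4 are the well-known logon IDs for Local System and Network Service, and DfsUtil flushes the referral and domain caches. All from an elevated prompt.)

# Purge Kerberos tickets for the current user, Local System (0x3e7) and Network Service (0x3e4)
klist purge
klist -li 0x3e7 purge
klist -li 0x3e4 purge
# Flush the client's DFS referral (PKT) and domain/namespace (SPC) caches
dfsutil /pktflush
dfsutil /spcflush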

 

I do know that the client gets its DFS referral information through the Workstation service, and restarting the Workstation service (or rebooting the client) clears the issue.
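
(For completeness, the restart that clears it is just the command below; the remote fan-out is only a sketch, and $clients is a hypothetical list of computer names — which is exactly the ~1500-client hammer I'm trying to avoid.)

# Restart the Workstation service (LanmanWorkstation) on the local client
Restart-Service -Name LanmanWorkstation -Force
# Hypothetical remote fan-out, assuming PowerShell remoting is enabled
Invoke-Command -ComputerName $clients -ScriptBlock { Restart-Service -Name LanmanWorkstation -Force }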

 

So, my questions:

 

1) Is there a way to remedy the issue without rebooting the clients (~1500) or restarting the Workstation service?

 

2) If not, is there another/better way to engineer the failover? I am NOT willing to move my CIFS service to a Windows environment as many have suggested, for many reasons.

 

I've considered instead modifying the folder targets in the namespace directly via a script, but that is obviously not preferred, as I'd much rather change one CNAME record than roughly 2,000 DFS target links.
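
If I did go that route, it would be something along these lines with the DFSN module; the paths below are made up for illustration, not my real namespace, and the match/replace pattern would need care in production.

# Sketch only: re-point every folder target that still uses the alias at the DR SVM instead
$folders = Get-DfsnFolder -Path "\\companyname.net\NDrive\*"
foreach ($f in $folders) {
    Get-DfsnFolderTarget -Path $f.Path |
        Where-Object { $_.TargetPath -like "\\usfs1.companyname.net\*" } |
        ForEach-Object {
            $newTarget = $_.TargetPath -replace "usfs1", "uscifsDR1"
            New-DfsnFolderTarget -Path $f.Path -TargetPath $newTarget
            Remove-DfsnFolderTarget -Path $f.Path -TargetPath $_.TargetPath -Force
        }
}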

 

Thanks.

3 Replies

MantyBlue
I'm having a similar issue. Did you determine a solution? Thanks!

DinoBob

I have not determined a solution as of yet. Is your problem identical to mine? Tell me more about your environment.

 

KieranMcKenna

We used to try the DNS CNAME failover method several years ago but always hit the issues you describe when testing failover: TTLs causing clients not to redirect to the DR CIFS vserver, and DFS caching/TTLs compounding the issue. Eventually some clients redirect and some don't... not good. You end up troubleshooting DNS/DFS, knowing that your DR shares are available via hard UNC links (\\servername\share), while the bosses are saying, "Great. We gave you all that money for NetApp and when you fail over, hundreds of clients can't connect!"

 

Also, we have hundreds of shares, so messing around with DFS targets isn't really an option.

 

At 10,000 ft, what we did to get around it in 7-Mode (and we've adapted the method for C-Mode) was:

1. Create an OTV VLAN between the production and DR site LANs, so that the IP addresses of the production and DR CIFS vservers are on the same subnet. This makes failover so much easier: it takes having to touch any other supporting infrastructure (DNS, DFS, etc.) out of the picture when performing the failover steps.

 

2. Create a 'dummy' vserver at the DR site, with a placeholder CIFS/AD name and IP address.

 

3. Production site goes down (test or real life). At the DR site:

Break your SnapMirrors!

Delete the dummy vserver/SVM's IP address and replace it with the production CIFS IP address (no duplicate-IP conflict, since the primary is down!).

Delete the dummy CIFS server config from the vserver/SVM and re-create the CIFS server config on the DR vserver/SVM so that the name matches the failed primary CIFS vserver's AD name. You need to join the AD domain with this name, and you can re-use the AD object of the failed primary CIFS server (you should probably reset the account in AD).

 

4. CIFS server should now start at the DR site and you should be able to ping the name of the failed primary CIFS server on its original IP!

 

5. Re-create your CIFS shares manually or import them via PowerShell (a rough CLI sketch of steps 3 and 5 follows the list).

 

6. Good to rock and roll!
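
Very roughly, steps 3 and 5 look like this in clustershell (the vserver, LIF, volume, IP and share names below are placeholders; your SnapMirror and LIF layout will differ):

Step 3 - break the mirror, take over the production IP, and re-create the CIFS server under the primary's name:
snapmirror break -destination-path drsvm:projectvol_dr
network interface modify -vserver drsvm -lif cifs_lif1 -address 10.1.1.50 -netmask 255.255.255.0
vserver cifs delete -vserver drsvm
vserver cifs create -vserver drsvm -cifs-server USCIFSPROD1 -domain companyname.net

Step 5 - re-create the shares (one per share, or scripted):
vserver cifs share create -vserver drsvm -share-name Project1 -path /projectvol_dr/Project1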

 

We've used this in real life and it was as good as gold. RTO was within 30 minutes for all clients, instead of hours with hundreds of clients working and hundreds of clients not working.

 

main caveat: USE OTV VLANS!!!
