Solved: SVM DR (-identity-preserve true): can it be implemented with (on-box) DNS load balancing?

GidonMarcus · ‎2016-10-25

Hi

This question is more Microsoft-oriented, but I couldn't find a Microsoft resource (except https://technet.microsoft.com/en-us/library/cc771640(v=ws.11).aspx - Understanding Zone Delegation).

The relevant NetApp TRs and guides don't address this specifically, and I want to make sure my implementation is right:

https://www.netapp.com/us/media/tr-4523.pdf - DNS Load Balancing in ONTAP Configuration and Best Practices

https://library.netapp.com/ecm/ecm_download_file/ECMP12462587 - Clustered Data ONTAP 8.3 SVM Disaster Recovery Preparation Express Guide

https://library.netapp.com/ecm/ecm_download_file/ECMP12454817 - Clustered Data ONTAP® 8.3 Data Protection Guide

Is it safe to add the DR LIF IP as a name server in the DNS delegation, or might that create performance or other intermittent issues? (I guess that depends on how Microsoft DNS spreads the load between these name servers.)

Or should I avoid on-box DNS load balancing on the SVM DR and just use classic dynamic DNS?

Thanks

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

parisi · ‎2016-10-25

If you add those LIFs to the name server delegation, you are looking at the potential for clients to attempt to use them in DNS queries.

The cluster has no visibility into the DNS zone, so it wouldn't know that you've configured something externally, thus would not be able to handle it appropriately.

For SVM DR, you have two options:

- A specific DR zone

- Preserving the LIF identities across the failover so that the IP addresses would remain the same on DR and the DNS zone would forward only to the IP that is up and available.

I'd suggest setting up a test SVM with DNS load balancing and see how it behaves.

GidonMarcus · ‎2016-10-28

Hi

I think we're misunderstanding each other. Just to be sure, I checked with a packet trace: the client never receives the list of delegated name servers. It's up to the DNS server to query the filer.

We don't have stretched VLANs to DR, so I can't take the IPs with me to the DR site. And for legacy reasons I cannot use a new separate zone with CNAME records; I must always use the SVM name. (In theory I could just add a step to the DR procedure to add and remove IPs in the delegation. But then there is a dependency on the actual person and on replication, and I will probably want to fall back to old-fashioned dynamic DNS just so the process is as short as possible.)

I've now set up a delegation with 20 fake addresses and one real one, with a 1-minute TTL everywhere. I'm going to test it through the weekend by running nslookup every X random minutes and measuring the response time.

Unfortunately I can't think of a better test for now.
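A test like the one described above could be scripted roughly as follows. This is only a minimal sketch, not the script actually used in this thread: it times lookups through the operating system resolver (`socket.getaddrinfo`) rather than nslookup, the host name `svm1.example.com` and the wait bounds are placeholders, and any local resolver cache on the client may hide delays that nslookup would show.

```python
import random
import socket
import statistics
import time


def time_lookup(name, resolve=socket.getaddrinfo):
    """Return (elapsed_seconds, succeeded) for a single name lookup."""
    start = time.perf_counter()
    try:
        resolve(name, None)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return time.perf_counter() - start, ok


def summarize(samples):
    """Summarize a list of (elapsed_seconds, succeeded) samples."""
    times = [t for t, _ in samples]
    return {
        "count": len(samples),
        "failures": sum(1 for _, ok in samples if not ok),
        "median_s": statistics.median(times),
        "max_s": max(times),
    }


def run(name, rounds, min_wait_s, max_wait_s):
    """Look up `name` repeatedly, sleeping a random interval between tries."""
    samples = []
    for _ in range(rounds):
        samples.append(time_lookup(name))
        time.sleep(random.uniform(min_wait_s, max_wait_s))
    return summarize(samples)


if __name__ == "__main__":
    # Placeholder name and intervals; waits longer than the 1-minute TTL
    # make it likely that each lookup hits an expired referral.
    print(run("svm1.example.com", rounds=5, min_wait_s=60, max_wait_s=300))
```

Waiting longer than the TTL between lookups is the important part, because only lookups that arrive after the cached referral has expired exercise the delegation list.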

Gidi.

GidonMarcus · ‎2016-10-29

OK, I have the results, and I'm not happy with them. (Again, DNS RFE/Microsoft is to blame, not NetApp.)

Every time the TTL of the referrals expires, the client sees a delay of around one second per dead referral. In the test I described earlier with 20 dead LIFs, I got hangs of around 7-13 seconds (the DNS server goes through the LIFs in random order until it finds the live one). Every time I removed a few dead entries the result got better, which shows a direct relation. (When only healthy LIFs are listed, the latency is around 25 ms, and the answer is then cached until the TTL expires.)
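The numbers above fit a simple model, sketched here under stated assumptions rather than taken from any Microsoft documentation: the DNS server tries the delegated name servers in a uniformly random order, each dead one costs a fixed timeout of about 1 second, and the live one answers in about 25 ms. The function names and defaults are illustrative only.

```python
import random


def expected_delay(n_dead, n_live=1, timeout_s=1.0, live_latency_s=0.025):
    """Average delay when servers are tried in uniformly random order.

    On average n_dead / (n_live + 1) dead servers are tried before the
    first live one answers.
    """
    return n_dead / (n_live + 1) * timeout_s + live_latency_s


def simulate_delay(n_dead, n_live=1, timeout_s=1.0, live_latency_s=0.025,
                   rng=random):
    """Draw one random try order and return the resulting delay."""
    servers = [False] * n_dead + [True] * n_live  # True marks a live server
    rng.shuffle(servers)
    dead_tried = servers.index(True)  # dead servers hit before the live one
    return dead_tried * timeout_s + live_latency_s


if __name__ == "__main__":
    for n_dead in (20, 15, 10, 5, 0):
        runs = [simulate_delay(n_dead) for _ in range(10_000)]
        print(n_dead, round(expected_delay(n_dead), 3),
              round(sum(runs) / len(runs), 3))
```

With 20 dead entries and one live one, the model gives an average of about 10 seconds, which is in line with the observed 7-13 seconds, and the delay shrinks linearly as dead entries are removed.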

I guess in most environments this might still be acceptable (only 1-2 seconds, every few minutes, for one random client). But here I don't want to limit scalability (adding more LIFs when the cluster expands) or introduce a known cause of latency (even a small one) when I can implement it differently and avoid it.

Thanks anyway to everyone who tried to help.

Gidi
