Changing NFS Export Policy causing outage?

belgianbiscuit · ‎2014-12-10

Hi,

I am not a storage engineer so excuse my ignorance.

We had an issue with ESX losing access to its volumes about 5-10 minutes after a change was made to the default export policy for the volumes.
I have no exact details of the change, but it basically involved adding a rule for a network range. I am not sure of the rule's individual settings.
The policy already contained specific individual IP addresses referring to the ESX NFS VMkernel ports. Could the change to the default policy have overwritten something that caused the disconnects?
I was always under the impression that export permissions were only checked at ESX boot, but if that assumption is correct then this should not have happened?

Is there something else that could cause NFS to recheck the export (e.g. VSC storage discovery)?
The datastores were reconnected upon removing the newly added rule.

Environment: clustered Data ONTAP, using an SVM for ESX.

parisi · ‎2015-01-14

When a change is made to an export policy in clustered Data ONTAP, the export policy access cache is flushed and needs to be re-populated.

If there are issues re-populating the cache, access could be denied in those scenarios. If there are a large number of policies/rules/clients, then repopulating the cache could cause an RPC storm to the cluster management gateway or to name services like DNS and cause a denial of service scenario.

If you were to remove a client from a policy, that client would no longer have access to the export. This is by design.

TR-4067 covers export policies and rules in detail and describes some best practices for exports in cDOT.

http://www.netapp.com/us/media/tr-4067.pdf

Additionally, I am working on a name services best practice guide to cover this.

If you are on 8.2.x now, I'd highly recommend moving to 8.2.2P2 in the immediate future, as there are a number of fixes to avoid these types of problems, such as this one. Then as soon as 8.2.3 is available, upgrade. That is the recommended 8.2.x release for NFS exports.

http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=843089

http://mysupport.netapp.com/NOW/download/software/ontap/8.2.2P2/

Also, if you are using IP addresses, I'd recommend ensuring those IP addresses have reverse lookup records (PTR) in DNS if possible. If that's not possible, then add the ESX servers to DNS hosts on the cluster.

::> dns hosts create -vserver [cluster vserver]

In 8.2 and earlier, this would be done at the cluster vserver. In 8.3 and later, do this at the data vserver.
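The reverse-lookup recommendation above can be spot-checked from any admin host before touching a policy. What follows is only a minimal Python sketch, not a NetApp tool: the function name `missing_ptr` and the sample addresses are hypothetical, and it assumes the admin host queries the same DNS servers as the cluster.

```python
import socket


def missing_ptr(addresses, lookup=socket.gethostbyaddr):
    """Return the addresses that have no reverse (PTR) record.

    `lookup` is injectable so the check can be exercised without DNS;
    by default it uses the resolver of the host running the script,
    which is assumed to match the resolver the cluster uses.
    """
    missing = []
    for addr in addresses:
        try:
            lookup(addr)
        except (socket.herror, socket.gaierror):
            # Reverse lookup failed: no PTR record was returned.
            missing.append(addr)
    return missing


if __name__ == "__main__":
    # Hypothetical ESX VMkernel addresses; replace with your own.
    for addr in missing_ptr(["192.0.2.11", "192.0.2.12"]):
        print(f"no PTR record for {addr}")
```

Any address it reports is a candidate for either a PTR record in DNS or a DNS hosts entry on the cluster.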

bsnyder27 · ‎2015-03-09

I had an experience like this. It was caused by export-policy rules that referenced hostnames that no longer existed. This caused the entire export policy to fail UNTIL the moment I deleted all the offending rules from it.

I somewhat understand this behavior, but it also occurs if you add a new rule to an export policy with a hostname that does not resolve. No warnings or anything. It allows you to create the rule and then the hosts defined in valid rules lose access as you described.

This is actually occurring for us on 8.2.3.

parisi · ‎2015-03-09

Hostnames not existing could cause this issue in 8.2.3 as well.

Depending on your DNS server, there are two bugs open for this.

892388 - Windows DNS

894336 - BIND DNS

If you hit this issue, you'd likely see the following in mgwd logs:

ERR: mgwd::exports: hostMatchUsingClusterVserver: addrerr=2, errno=0
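If you have a copy of the mgwd log on an admin host, a small script can pull out these entries. This is only a sketch in Python: the pattern is built from the single line quoted above, so other variants of the message may not match, and the function name `find_host_match_errors` is hypothetical.

```python
import re

# Pattern derived from the one log line quoted in this thread.
PATTERN = re.compile(
    r"mgwd::exports: hostMatchUsingClusterVserver: addrerr=(\d+), errno=(\d+)"
)


def find_host_match_errors(lines):
    """Return (line_number, addrerr, errno) for each matching log line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PATTERN.search(line)
        if match:
            hits.append((number, int(match.group(1)), int(match.group(2))))
    return hits
```

It takes any iterable of lines, so an open file object for a saved log works as well as a list.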

A patch release is in the works to address this problem.

In the meantime, ensure all your hosts can resolve and use FQDNs (or IP addresses) instead of short names in the policies/netgroups.
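As a rough way to audit a list of client entries before they go into a policy, something like the Python sketch below could be used. It is an illustration only: `audit_clientmatch` is a hypothetical name, it assumes each entry is a hostname, an IP address, or a CIDR range (netgroup and domain-style entries are not handled), and it assumes the host running it uses the same DNS as the cluster.

```python
import ipaddress
import socket


def audit_clientmatch(entries, resolve=socket.getaddrinfo):
    """Map each problematic entry to a short reason string.

    IP addresses and CIDR ranges need no DNS and are skipped.
    Names without a dot are flagged as short names; other names
    are flagged if forward resolution fails.
    """
    problems = {}
    for entry in entries:
        try:
            ipaddress.ip_network(entry, strict=False)
            continue  # literal IP or network range
        except ValueError:
            pass  # not an IP, treat it as a hostname
        if "." not in entry:
            problems[entry] = "short name"
            continue
        try:
            resolve(entry, None)
        except socket.gaierror:
            problems[entry] = "does not resolve"
    return problems
```

An empty result means every hostname was fully qualified and resolved from where the script ran; it does not prove the cluster itself can resolve them.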

bsnyder27 · ‎2015-03-09

Thanks for this information. Good to know that it's been identified already and hopefully being remedied.

Somewhat similarly, a quota resize will not succeed if there are user quotas defined for domain users that no longer exist in AD. This used to be just a warning in 7-Mode, but now the resize will not succeed until those user quotas are removed.

parisi · ‎2015-03-10

If you are seeing a hang on quota resize with missing users, and 7-Mode used to fail and move on, open a case and have a bug opened. The behavior should match.

bsnyder27 · ‎2015-03-11

Thanks! I will open a case as you mentioned.