Hi,
After a controller failover caused by a hardware issue, some of our virtual machines were powered off. The NetApp and VMware KBs both seem to steer away from NFS 4.1.
NetApp KB: VMware NFSv4.1 based virtual machines are powered off after ONTAP 9 storage failover events
VMware KB: 2089321
According to VMware, the problem is that the locks are released because the grace period established by the NFS server is shorter than the takeover duration (if I understand these bulletins correctly). That matches our case: the takeover took more than 96 seconds, while the lock is only maintained for up to 90 seconds (TR-4067).
Per the NetApp KB, if we need high availability we should avoid NFS 4.1. Per the VMware KB, it works as designed, so don't wait for a resolution.
Am I the only one using vSphere with NetApp/NFS 4.1? I have more than 400 VMs hosted on NFS 4.1 datastores. Rather than migrating to NFSv3 datastores, could I increase the lock grace period? To which value?
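For reference, the settings I am looking at live on the SVM's NFS server. This is roughly what I would try from the ONTAP CLI, assuming a hypothetical SVM named svm_esx; I left the target value as a placeholder since the supported range depends on the ONTAP release:

  # Show the current NFSv4.x lease and grace settings (SVM name is hypothetical):
  ::> vserver nfs show -vserver svm_esx -fields v4-lease-seconds,v4-grace-seconds

  # Raise the grace period so lock reclaim can outlast a worst-case takeover;
  # confirm the supported range and a safe value for your release with NetApp:
  ::> vserver nfs modify -vserver svm_esx -v4-grace-seconds <seconds>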
1 ACCEPTED SOLUTION
This is a known issue. Using NFSv4.1 with ESXi has really not been recommended, especially with versions before 7.0. Lots of bad/weird things happen due to incompatibilities between the implementations. ONTAP uses Parallel NFS (pNFS), and I think the term used by VMware is session trunking. The articles go into the nuances.
Anyway, two options:
(1) Upgrade ESXi to the latest 7.x release, *and* update to the latest NFS/VAAI VIB for ESXi, *and* upgrade ONTAP to 9.8. I'm not sure there is any painless way to get there with NFS 4.1, though, nor which order to recommend. Maybe ESXi, then the VIB, then ONTAP?
(2) Create a set of NFSv3 datastores, Storage vMotion everything off the v4 mounts, then unmount/delete the v4 volumes when complete (a CLI sketch follows below). Once all the volumes are evacuated and unmounted, you may even want to disable NFSv4 on the ONTAP ESXi NFS SVM.
I have seen way too many customers hit this, and the v3 migration is by far the easier fix.
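A rough sketch of option (2); all datastore, share, and SVM names below are hypothetical, and the Storage vMotion itself happens in vCenter:

  # 1) On each ESXi host, mount the new NFSv3 datastore:
  esxcli storage nfs add -H 192.168.10.11 -s /ds_nfs3_01 -v ds_nfs3_01

  # 2) Storage vMotion the VMs off the v4.1 datastore in vCenter, then drop the old mount:
  esxcli storage nfs41 remove -v ds_nfs41_01

  # 3) Once every v4.1 datastore is evacuated and unmounted, optionally disable
  #    NFSv4.1 on the SVM from the ONTAP CLI:
  ::> vserver nfs modify -vserver svm_esx -v4.1 disabled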
16 REPLIES
I will avoid vSphere 7/ONTAP 9.8/NFS 4.1; I don't want another migration if new issues show up, and the KB note applies to "VMware ESXi 6 and higher". So I will use NFSv3, even though there is still a big problem with backups (https://kb.vmware.com/s/article/2010953).
I came across a similar issue with NFSv4.1-mounted datastores in an ESXi 6.x environment last year. Folks at NetApp agreed this ESXi patch would resolve the issue:
https://docs.vmware.com/en/VMware-vSphere/6.7/rn/esxi670-202004002.html
After applying the patches, no issues through panics and ONTAP upgrades.
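For anyone repeating this, the offline-bundle flow looks roughly like the following; the depot path is hypothetical, and the exact image-profile name comes from the first command:

  # List the image profiles contained in the downloaded patch depot:
  esxcli software sources profile list -d /vmfs/volumes/datastore1/ESXi670-202004002.zip

  # With the host in maintenance mode, apply the chosen profile, then reboot:
  esxcli software profile update -d /vmfs/volumes/datastore1/ESXi670-202004002.zip -p <profile-name>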
Another NetApp KB (this issue has been resolved with the release of ESXi 6.7 P02 and ESXi 7.0 GA):
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/VMware_NFSv4.1_datastores_see_disruption_during_failover_events_for_ON...
As stated in the previous response, a lot of customers have seen this issue; you can probably find long forum threads about it, but it is described there as a VMware defect?
Thank you, I know this KB note; we updated last year as soon as a fix was published.
As the KB says, you do not have to use NFSv4.
Alternatively, you can use the LAN/NBD transport, which uses NFC (Network File Copy), in your backup solution, or disable SCSI hot-add through the backup software.
I have had customers use this method, and it seems to stop the stunning.
All,
I just came across this message thread.
1. Can this issue really be resolved by applying the patch below?
https://docs.vmware.com/en/VMware-vSphere/6.7/rn/esxi670-202004002.html
2. Regardless of whether the patch works, since NFSv4 is a stateful protocol, there would still be interruptions during failover/failback/ONTAP upgrades. Am I right?
It's probably best to go to NFSv3. CIFS is stateful too, and we tend not to see issues there for failover events, which complete in milliseconds. I've come to believe VMware's NFSv4 client implementation is problematic for ONTAP. VMware uses its own locking daemon for NFSv3, and they trust their own creation a little more than NFSv4's locking. The worst part about NFSv3 is how locked VMware becomes to the IP/DNS entry of the LIF: if you have to move a volume to an aggregate on a different node, you are basically forcing indirect data access over the cluster interconnect unless you also move the LIF.
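If you do end up in that situation, the usual ONTAP-side remedy is to move the LIF along with the volume. A sketch with hypothetical volume, aggregate, and LIF names:

  # Move the volume to an aggregate owned by the other node:
  ::> volume move start -vserver svm_esx -volume ds01 -destination-aggregate aggr1_node2

  # Then migrate the serving LIF to that node so the NFSv3 data path stays
  # direct instead of traversing the cluster interconnect:
  ::> network interface migrate -vserver svm_esx -lif nfs_lif1 -destination-node node2 -destination-port e0d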
Wouldn't that VMware patch solve the problem with NFSv4, as you said it did?
Yes, it should, as it did for us at the time. After conversations with VMware support on a separate datastore issue, they appeared to me to favor NFSv3, hence my recommendation. So technically VMware has a fix for NFSv4, and we've installed and used it, but my personal opinion is to use the version the vendor prefers, although VMware hasn't stated an official preference.
Also, another thing I've learned about VMware and ESXi: any time you update ONTAP while datastores are sourced from it, do a rolling reboot of your ESXi hosts. VMware's mount daemon is finicky, and a rolling reboot of ESXi clears all potential mount issues. Yes, that's a sad rule of thumb, especially with both NFS client versions allowing failover to specific LIFs, but if your environment allows for it, I suggest doing it.
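If you script the rolling reboot, the per-host portion is just a couple of esxcli calls; evacuate each host via DRS/vMotion first, and treat the reason string as illustrative:

  # Enter maintenance mode (VMs must be evacuated first), reboot, then exit:
  esxcli system maintenanceMode set --enable true
  esxcli system shutdown reboot --reason "Refresh NFS mounts after ONTAP upgrade"
  # ...after the host rejoins vCenter:
  esxcli system maintenanceMode set --enable false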
There is another reason to use NFSv3 instead of v4: Datastore Clusters. VMware does not (yet) support NFSv4.1 datastores in a Datastore Cluster; when building the DS cluster, it simply will not let you add one. You can add NFSv3 datastores easily.
When I have a multi-node cluster, I will typically:
- create a datastore on each controller
- create a Datastore Cluster containing the datastores from each controller
This distributes the load across the nodes (the mounts are sketched below).
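The per-controller mounts themselves are plain NFSv3 mounts. A minimal sketch, run on every host, with hypothetical LIF addresses and volume names:

  # One datastore per controller, each mounted through that controller's own LIF:
  esxcli storage nfs add -H 192.168.10.11 -s /ds_node1 -v ds_node1
  esxcli storage nfs add -H 192.168.10.12 -s /ds_node2 -v ds_node2
  # Then group ds_node1 and ds_node2 into the Datastore Cluster in vCenter.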
Another trick without Datastore Clusters (and using FlexGroups) would be like this (a consolidated CLI sketch follows the list):
- Use ESXi 7.0U3+
- Use the NetApp NFS VAAI 2.0 VIB
- Be sure vStorage is enabled on the SVM
- Be sure the export-policy rules are set (ro=sys, rw=sys, super=sys, proto=nfs)
- Place ONE private/non-routable data LIF on each controller in the SVM
- On your DNS server, create a reverse lookup zone for the private/non-routable network
- Create a single name for the 2 (or 4, or 6, or however many) IPs you are using
- Test the resolution (nslookup my.host.local should return ALL the IPs)
- Create a FlexGroup (take care to size each constituent member appropriately!)
- When mounting the FlexGroup, use the NAME, not an IP
- Each ESXi host will mount the same datastore but will likely land on a different controller
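Pulling those bullets together, an end-to-end sketch; every name, address, and size below is hypothetical, and the LIF-creation syntax varies a bit between ONTAP releases:

  # ONTAP side: enable vStorage (VAAI) and set the export-policy rule:
  ::> vserver nfs modify -vserver svm_esx -vstorage enabled
  ::> vserver export-policy rule create -vserver svm_esx -policyname esx -clientmatch 172.16.0.0/24 -rorule sys -rwrule sys -superuser sys -protocol nfs

  # One private/non-routable data LIF per controller:
  ::> network interface create -vserver svm_esx -lif nfs_priv1 -service-policy default-data-files -home-node node1 -home-port e0d -address 172.16.0.11 -netmask 255.255.255.0
  ::> network interface create -vserver svm_esx -lif nfs_priv2 -service-policy default-data-files -home-node node2 -home-port e0d -address 172.16.0.12 -netmask 255.255.255.0

  # FlexGroup spanning both nodes (size the constituents deliberately):
  ::> volume create -vserver svm_esx -volume fg_ds01 -aggr-list aggr1_node1,aggr1_node2 -aggr-list-multiplier 4 -size 40TB -junction-path /fg_ds01 -security-style unix -policy esx

  # DNS: one name with an A record per LIF IP plus the reverse zone, then verify:
  nslookup nfs-ds.private.local    # should return ALL the LIF addresses

  # ESXi side, on each host: mount by NAME so hosts spread across the LIFs:
  esxcli storage nfs add -H nfs-ds.private.local -s /fg_ds01 -v fg_ds01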
When you said "datastore cluster", did you mean using VSC to create the "datastore cluster"? In clustered ONTAP there is no concept of, or way to create, a "datastore cluster", as I understand it.
"Datastore Cluster" is a VMware thing (if your license supports it).
After you create the Datastore Cluster (in VMware), you can have OTV (You have updated VSC to OTV by now, right!?? -> VSC (9.7 and lower) Virtual Storage Console, OTV (9.8 and higher) ONTAP Tools for VMware) re-scan the infrastructure. Next Datastore you add with OTV should be able to see the Datastore Cluster and you can add it there. The tool ultimately creates the volume, sets up the export policy, mounts it to the cluster then moves it into the Datastore Cluster
Thanks!
Regarding "Another trick without Datastore Clusters (and using FlexGroups) would be like this":
Do all those bullets refer to steps for provisioning a datastore without OTV?
@heightsnj those are things that need to, or should, be in place for things (like OTV and VAAI) to work. I provision datastores all the time without OTV, but I have read the docs and know what should be going on (i.e., provisioning the volume as thin, no snapshot reserve, no snapshots, export-policy rules, etc.). I do not recommend creating the datastores on your own; it is easy to miss something. Using OTV typically applies all the best practices known at the time it was released.
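For the curious, the ONTAP side of a hand-built NFSv3 datastore volume looks roughly like this; all names are hypothetical, and you should check the current NetApp/VMware best-practice docs rather than trusting this sketch:

  # Thin volume, no snapshot reserve, no scheduled snapshots, ESXi export policy:
  ::> volume create -vserver svm_esx -volume ds02 -aggregate aggr1_node1 -size 4TB -space-guarantee none -percent-snapshot-space 0 -snapshot-policy none -junction-path /ds02 -security-style unix -policy esx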
One last follow-up:
Two reasons are often given for possible disruptions when using NFSv4:
- NFSv4 is a stateful protocol
- NFSv4's locks are maintained for up to 90 seconds
Are these the same thing or two different things? How are they related, if at all?
