ONTAP Discussions

Using NFS 4.1 and VMware vSphere - lock issue

ARUM

Hi,

After a controller failover due to a hardware issue, some of our virtual machines were powered off. The NetApp and VMware KBs seem to recommend avoiding NFS 4.1.

NetApp KB : VMware NFSv4.1 based virtual machines are powered off after ONTAP 9 storage failover events

VMware KB : 2089321

According to VMware, the problem occurs because the lock is released when the grace period established by the NFS server is shorter than the takeover duration (if I understand these bulletins correctly). That is our case: the takeover took more than 96 seconds, while the lock is only maintained for up to 90 seconds (TR-4067).

According to the NetApp KB, if we need high availability we should avoid NFS 4.1. According to the VMware KB, it works as designed, so there is no resolution to wait for.

 

Am I the only one using vSphere with NetApp/NFS 4.1? I have more than 400 VMs hosted on NFS 4.1 datastores. Rather than migrating to NFSv3 datastores, could I increase the lock grace period? And to which value?
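For reference, the timers I am talking about are exposed per SVM in the ONTAP CLI (option names as listed in TR-4067; "svm_esx" is just a placeholder, the modify may require advanced privilege, and I do not know what maximum value ONTAP accepts here):

vserver nfs show -vserver svm_esx -fields v4-grace-seconds,v4-lease-seconds

vserver nfs modify -vserver svm_esx -v4-grace-seconds <seconds>

Whatever the value, I assume it would have to cover the worst observed takeover time (ours was over 96 seconds), but whether that is safe or even supported with ESXi is exactly my question.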

1 ACCEPTED SOLUTION

TMACMD

This is a known issue. Using NFSv4.1 with ESXi has really not been recommended, especially with versions before 7.0. Lots of bad/weird things happen due to incompatibilities between the implementations: ONTAP uses Parallel NFS (pNFS), while I think the term VMware uses is session trunking. The articles go into detail about the nuances.
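If you want to confirm which datastores are actually mounted over 4.1 on a given host, esxcli keeps the two protocol versions in separate namespaces (standard commands, run per host):

esxcli storage nfs list   (NFSv3 mounts)

esxcli storage nfs41 list   (NFSv4.1 mounts)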

 

Anyway, two options....

(1) Upgrade ESXi to the latest version of 7.x *and* update to the latest version of the NFS/VAAI VIB for ESXi *and* upgrade ONTAP to 9.8. Not sure there would be any painless way to get there while staying on NFS v4.1, though. I would not be sure which order to recommend. Maybe ESXi, then the VIB, then ONTAP?

(2) Create a bunch of NFSv3 datastores and Storage vMotion everything off the v4.1 mounts, then unmount/delete the v4.1 volumes when complete. Once all the volumes are evacuated and unmounted, you may even want to disable NFSv4.1 on the ONTAP SVM that serves ESXi.
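Roughly, the option (2) flow looks like this (the LIF name, export path, datastore labels and SVM name below are placeholders, and Move-VM is the PowerCLI way to kick off a Storage vMotion; adjust for your environment):

esxcli storage nfs add -H nfs-lif.example.com -s /vol_nfs3_ds01 -v nfs3_ds01   (ESXi: mount the new v3 datastore on each host)

Move-VM -VM "my-vm" -Datastore "nfs3_ds01"   (PowerCLI: Storage vMotion a VM onto it)

esxcli storage nfs41 remove -v old_v41_ds   (ESXi: unmount the emptied v4.1 datastore)

vserver nfs modify -vserver svm_esx -v4.1 disabled   (ONTAP: optionally turn off NFSv4.1 on the SVM afterwards)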

 

I have seen way too many customers hit this and the v3 migration is by far the easier fix.




ARUM

I will avoid the vSphere 7 / ONTAP 9.8 / NFS 4.1 combination; I don't want another migration if there are new issues, and the KB note applies to "VMware ESXi 6 and higher". So I will use NFSv3, even though there is still a big problem with backups (https://kb.vmware.com/s/article/2010953).

FED_ESB

I came across a similar issue with NFSv4.1-mounted datastores in an ESXi 6.x environment last year. The folks at NetApp agreed this patch for ESXi would resolve the issue.

 

https://docs.vmware.com/en/VMware-vSphere/6.7/rn/esxi670-202004002.html

 

After applying the patches, we have had no issues through panics and ONTAP upgrades.
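If it helps anyone, you can check what a host is running before and after patching with the standard esxcli commands, and compare the build number against the one listed in the patch release notes:

esxcli system version get

esxcli software profile get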

Ontapforrum

Another NetApp KB: (This issue has been resolved with the release of ESXi 6.7P02 and ESXi 7.0 GA)
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/VMware_NFSv4.1_datastores_see_disruption_during_failover_events_for_ON...

 

As stated in the previous response, a lot of customers have seen this issue; I believe you might find long forum discussion threads about it, but the KB says it's a VMware defect?

ARUM

Thank you, I know this KB article; we updated last year as soon as the fix was published.

TMACMD

Like the KB says, you do not have to use NFSv4.

 

Alternatively, you can use the LAN/NBD transport, which uses NFC (Network File Copy), in your backup solution, or disable SCSI hot-add through the backup software.

 

I have had customers use this method, and it seems to stop the stunning.
