Hi,
After a controller failover caused by a hardware issue, some of our virtual machines were powered off. The NetApp and VMware KBs both seem to steer away from NFS 4.1.
NetApp KB: VMware NFSv4.1 based virtual machines are powered off after ONTAP 9 storage failover events
VMware KB: 2089321
According to VMware, the problem is that the locks are released because the grace period established by the NFS server is shorter than the takeover duration (if I understand these bulletins correctly). That matches our case: the takeover took more than 96 seconds, while the lock is only maintained for up to 90 seconds (TR-4067).
Per the NetApp KB, if we need high availability we should avoid NFS 4.1. Per the VMware KB, it works as designed, so don't wait for a resolution.
Am I the only one using vSphere with NetApp/NFS 4.1? I have more than 400 VMs hosted on NFS 4.1 datastores. Rather than migrating to NFSv3 datastores, could I increase the lock grace period? To which value?
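For reference, the settings I am looking at live on the SVM's NFS server. This is roughly what I would try from the ONTAP CLI, assuming a hypothetical SVM named svm_esx; I left the target value as a placeholder since the supported range depends on the ONTAP release:

  # Show the current NFSv4.x lease and grace settings (SVM name is hypothetical):
  ::> vserver nfs show -vserver svm_esx -fields v4-lease-seconds,v4-grace-seconds

  # Raise the grace period so lock reclaim can outlast a worst-case takeover;
  # confirm the supported range and a safe value for your release with NetApp:
  ::> vserver nfs modify -vserver svm_esx -v4-grace-seconds <seconds>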
1 ACCEPTED SOLUTION
This is a known issue. Using NFSv4.1 with ESXi has really not been recommended, especially with versions before 7.0. Lots of bad/weird things happen due to incompatibilities between the implementations. ONTAP uses Parallel NFS (pNFS), and I think the term used by VMware is session trunking. The articles go into the nuances.
Anyway, two options:
(1) Upgrade ESXi to the latest 7.x release, *and* update to the latest NFS/VAAI VIB for ESXi, *and* upgrade ONTAP to 9.8. I'm not sure there is any painless way to get there with NFS 4.1, though, nor which order to recommend. Maybe ESXi, then the VIB, then ONTAP?
(2) Create a set of NFSv3 datastores, Storage vMotion everything off the v4 mounts, then unmount/delete the v4 volumes when complete (a CLI sketch follows below). Once all the volumes are evacuated and unmounted, you may even want to disable NFSv4 on the ONTAP ESXi NFS SVM.
I have seen way too many customers hit this, and the v3 migration is by far the easier fix.
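A rough sketch of option (2); all datastore, share, and SVM names below are hypothetical, and the Storage vMotion itself happens in vCenter:

  # 1) On each ESXi host, mount the new NFSv3 datastore:
  esxcli storage nfs add -H 192.168.10.11 -s /ds_nfs3_01 -v ds_nfs3_01

  # 2) Storage vMotion the VMs off the v4.1 datastore in vCenter, then drop the old mount:
  esxcli storage nfs41 remove -v ds_nfs41_01

  # 3) Once every v4.1 datastore is evacuated and unmounted, optionally disable
  #    NFSv4.1 on the SVM from the ONTAP CLI:
  ::> vserver nfs modify -vserver svm_esx -v4.1 disabled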
16 REPLIES
I will avoid vSphere 7/ONTAP 9.8/NFS 4.1; I don't want another migration if new issues show up, and the KB note applies to "VMware ESXi 6 and higher". So I will use NFSv3, even though there is still a big problem with backups (https://kb.vmware.com/s/article/2010953).
I came across a similar issue with NFSv4.1-mounted datastores in an ESXi 6.x environment last year. Folks at NetApp agreed this ESXi patch would resolve the issue:
https://docs.vmware.com/en/VMware-vSphere/6.7/rn/esxi670-202004002.html
After applying the patches, no issues through panics and ONTAP upgrades.
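For anyone repeating this, the offline-bundle flow looks roughly like the following; the depot path is hypothetical, and the exact image-profile name comes from the first command:

  # List the image profiles contained in the downloaded patch depot:
  esxcli software sources profile list -d /vmfs/volumes/datastore1/ESXi670-202004002.zip

  # With the host in maintenance mode, apply the chosen profile, then reboot:
  esxcli software profile update -d /vmfs/volumes/datastore1/ESXi670-202004002.zip -p <profile-name>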
Another NetApp KB (this issue has been resolved with the release of ESXi 6.7 P02 and ESXi 7.0 GA):
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/VMware_NFSv4.1_datastores_see_disruption_during_failover_events_for_ON...
As stated in the previous response, a lot of customers have seen this issue; you can probably find long forum threads about it, but it is described there as a VMware defect?
Thank you, I know this KB note; we updated last year as soon as a fix was published.
As the KB says, you do not have to use NFSv4.
Alternatively, you can use the LAN/NBD transport, which uses NFC (Network File Copy), in your backup solution, or disable SCSI hot-add through the backup software.
I have had customers use this method, and it seems to stop the stunning.
All,
I just came across this message thread.
1. Can this issue really be resolved by applying the patch below?
https://docs.vmware.com/en/VMware-vSphere/6.7/rn/esxi670-202004002.html
2. Regardless of whether the patch works, since NFSv4 is a stateful protocol, there would still be interruptions during failover/failback/ONTAP upgrades. Am I right?
It's probably best to go to NFSv3. CIFS is stateful too, and we tend not to see issues there for failover events, which complete in milliseconds. I've come to believe VMware's NFSv4 client implementation is problematic for ONTAP. VMware uses its own locking daemon for NFSv3, and they trust their own creation a little more than NFSv4's locking. The worst part about NFSv3 is how locked VMware becomes to the IP/DNS entry of the LIF: if you have to move a volume to an aggregate on a different node, you are basically forcing indirect data access over the cluster interconnect unless you also move the LIF.
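If you do end up in that situation, the usual ONTAP-side remedy is to move the LIF along with the volume. A sketch with hypothetical volume, aggregate, and LIF names:

  # Move the volume to an aggregate owned by the other node:
  ::> volume move start -vserver svm_esx -volume ds01 -destination-aggregate aggr1_node2

  # Then migrate the serving LIF to that node so the NFSv3 data path stays
  # direct instead of traversing the cluster interconnect:
  ::> network interface migrate -vserver svm_esx -lif nfs_lif1 -destination-node node2 -destination-port e0d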
Wouldn't that VMware patch solve the problem with NFSv4, as you said it did?
Yes, it should, as it did for us at the time. After conversations with VMware support on a separate datastore issue, they appeared to me to favor NFSv3, hence my recommendation. So technically VMware has a fix for NFSv4, and we've installed and used it, but my personal opinion is to use the version the vendor prefers, although VMware hasn't stated an official preference.
Also, another thing I've learned about VMware and ESXi: any time you update ONTAP while datastores are sourced from it, do a rolling reboot of your ESXi hosts. VMware's mount daemon is finicky, and a rolling reboot of ESXi clears all potential mount issues. Yes, that's a sad rule of thumb, especially with both NFS client versions allowing failover to specific LIFs, but if your environment allows for it, I suggest doing it.
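If you script the rolling reboot, the per-host portion is just a couple of esxcli calls; evacuate each host via DRS/vMotion first, and treat the reason string as illustrative:

  # Enter maintenance mode (VMs must be evacuated first), reboot, then exit:
  esxcli system maintenanceMode set --enable true
  esxcli system shutdown reboot --reason "Refresh NFS mounts after ONTAP upgrade"
  # ...after the host rejoins vCenter:
  esxcli system maintenanceMode set --enable false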
There is another reason to use NFSv3 instead of v4: Datastore Clusters. VMware does not (yet) support NFSv4.1 datastores in a Datastore Cluster; when building the DS cluster, it simply will not let you add one. You can add NFSv3 datastores easily.
When I have a multi-node cluster, I will typically:
- create a datastore on each controller
- create a Datastore Cluster containing the datastores from each controller
This distributes the load across the nodes (the mounts are sketched below).
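The per-controller mounts themselves are plain NFSv3 mounts. A minimal sketch, run on every host, with hypothetical LIF addresses and volume names:

  # One datastore per controller, each mounted through that controller's own LIF:
  esxcli storage nfs add -H 192.168.10.11 -s /ds_node1 -v ds_node1
  esxcli storage nfs add -H 192.168.10.12 -s /ds_node2 -v ds_node2
  # Then group ds_node1 and ds_node2 into the Datastore Cluster in vCenter.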
Another trick without Datastore Clusters (and using FlexGroups) would be like this (a consolidated CLI sketch follows the list):
- Use ESXi 7.0U3+
- Use the NetApp NFS VAAI 2.0 VIB
- Be sure vStorage is enabled on the SVM
- Be sure the export-policy rules are set (ro=sys, rw=sys, super=sys, proto=nfs)
- Place ONE private/non-routable data LIF on each controller in the SVM
- On your DNS server, create a reverse lookup zone for the private/non-routable network
- Create a single name for the 2 (or 4, or 6, or however many) IPs you are using
- Test the resolution (nslookup my.host.local should return ALL the IPs)
- Create a FlexGroup (take care to size each constituent member appropriately!)
- When mounting the FlexGroup, use the NAME, not an IP
- Each ESXi host will mount the same datastore but will likely land on a different controller
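Pulling those bullets together, an end-to-end sketch; every name, address, and size below is hypothetical, and the LIF-creation syntax varies a bit between ONTAP releases:

  # ONTAP side: enable vStorage (VAAI) and set the export-policy rule:
  ::> vserver nfs modify -vserver svm_esx -vstorage enabled
  ::> vserver export-policy rule create -vserver svm_esx -policyname esx -clientmatch 172.16.0.0/24 -rorule sys -rwrule sys -superuser sys -protocol nfs

  # One private/non-routable data LIF per controller:
  ::> network interface create -vserver svm_esx -lif nfs_priv1 -service-policy default-data-files -home-node node1 -home-port e0d -address 172.16.0.11 -netmask 255.255.255.0
  ::> network interface create -vserver svm_esx -lif nfs_priv2 -service-policy default-data-files -home-node node2 -home-port e0d -address 172.16.0.12 -netmask 255.255.255.0

  # FlexGroup spanning both nodes (size the constituents deliberately):
  ::> volume create -vserver svm_esx -volume fg_ds01 -aggr-list aggr1_node1,aggr1_node2 -aggr-list-multiplier 4 -size 40TB -junction-path /fg_ds01 -security-style unix -policy esx

  # DNS: one name with an A record per LIF IP plus the reverse zone, then verify:
  nslookup nfs-ds.private.local    # should return ALL the LIF addresses

  # ESXi side, on each host: mount by NAME so hosts spread across the LIFs:
  esxcli storage nfs add -H nfs-ds.private.local -s /fg_ds01 -v fg_ds01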
When you said "datastore cluster", did you mean using VSC to create the "datastore cluster"? In clustered ONTAP there is no concept of, or way to create, a "datastore cluster", as I understand it.
"Datastore Cluster" is a VMware thing (if your license supports it).
After you create the Datastore Cluster (in VMware), you can have OTV (You have updated VSC to OTV by now, right!?? -> VSC (9.7 and lower) Virtual Storage Console, OTV (9.8 and higher) ONTAP Tools for VMware) re-scan the infrastructure. Next Datastore you add with OTV should be able to see the Datastore Cluster and you can add it there. The tool ultimately creates the volume, sets up the export policy, mounts it to the cluster then moves it into the Datastore Cluster
Thanks!
Regarding "Another trick without Datastore Clusters (and using FlexGroups) would be like this":
Do all those bullets refer to steps for provisioning a datastore without OTV?
@heightsnj those are things that need to, or should, be in place for things (like OTV and VAAI) to work. I provision datastores all the time without OTV, but I have read the docs and know what should be going on (i.e., provisioning the volume as thin, no snapshot reserve, no snapshots, export-policy rules, etc.). I do not recommend creating the datastores on your own; it is easy to miss something. Using OTV typically applies all the best practices known at the time it was released.
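For the curious, the ONTAP side of a hand-built NFSv3 datastore volume looks roughly like this; all names are hypothetical, and you should check the current NetApp/VMware best-practice docs rather than trusting this sketch:

  # Thin volume, no snapshot reserve, no scheduled snapshots, ESXi export policy:
  ::> volume create -vserver svm_esx -volume ds02 -aggregate aggr1_node1 -size 4TB -space-guarantee none -percent-snapshot-space 0 -snapshot-policy none -junction-path /ds02 -security-style unix -policy esx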
One last follow-up:
Two reasons are often given for possible disruptions when using NFSv4:
- NFSv4 is a stateful protocol
- NFSv4's locks are maintained for up to 90 seconds
Are these the same thing or two different things? How are they related, if at all?
