Loss of connection in vSphere 6.7 during ONTAP upgrade with NFS 4.1 datastores

AlexMM
8,273 Views

We have 3 hosts running vSphere 6.7 with many datastores mounted over NFS 4.1, and a FAS2750 running ONTAP 9.6. I have followed this guide https://www.netapp.com/us/media/tr-4597.pdf, but on every ONTAP upgrade the virtual machines disconnect. The datastores lose connection for a few seconds, and when they become browsable again the VMs show as "disconnected"; the only solution is to reboot the hosts. Has this ever happened to anyone? With NFS 3 it works perfectly!

 

1 ACCEPTED SOLUTION

Ontapforrum
7,968 Views

Here is the NetApp KB, which states: Do not use NFSv4.1 if high availability is required.

 

VMWare NFSv4.1 datastores see disruption during failover events for ONTAP 9:
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/VMWare_NFSv4.1_datastores_see_disruption_during_failover_events_for_ON...

 

Cause: VMWare defect PR 2180116.


7 REPLIES 7

Ontapforrum
8,239 Views

Hi,

 

We don't use NFSv4 yet, so I am speaking purely from the standpoint of the protocol's features and capabilities.

 

From NFS version 4 onwards, NFS is stateful, just like SMB 1/2/2.1 (CIFS), so by its very nature clients will be disconnected when that state is lost. NFSv3, however, is stateless, which means the client simply keeps retrying; that makes it the handier and more reliable protocol for transparent failover (where a node undergoes a reboot). Interestingly, you can compare NFSv3 with SMB 3, because SMB 3 supports transparent failover through its 'continuous availability' (CA) feature, which is a recommended setting for Hyper-V environments. It suits Hyper-V because CA maintains persistent handle lock state across a node reboot. I believe NFSv4 will do a good job as long as the node is not rebooted.
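The stateless versus stateful distinction above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names, the `fh1` handle, and the integer stateids are hypothetical and do not reflect ONTAP, ESXi, or the real NFS wire protocol. It shows only why a request that carries everything it needs survives a server reboot, while one that depends on server-held state does not.

```python
class StaleStateError(Exception):
    """Toy stand-in for an NFSv4-style 'state lost' error (hypothetical)."""


class StatelessServer:
    """NFSv3-like toy: each request carries the file handle, no per-client state."""

    def __init__(self):
        self.files = {"fh1": b"hello world"}

    def reboot(self):
        # Nothing to lose: the server keeps no open/lock state in memory.
        pass

    def read(self, handle, offset, count):
        return self.files[handle][offset:offset + count]


class StatefulServer:
    """NFSv4-like toy: reads must reference state created by a prior open."""

    def __init__(self):
        self.files = {"fh1": b"hello world"}
        self.open_state = {}      # stateid -> file handle
        self.next_stateid = 1

    def open(self, handle):
        stateid = self.next_stateid
        self.next_stateid += 1
        self.open_state[stateid] = handle
        return stateid

    def reboot(self):
        # In-memory open state is lost when the node restarts.
        self.open_state.clear()

    def read(self, stateid, offset, count):
        if stateid not in self.open_state:
            raise StaleStateError("state lost across reboot")
        handle = self.open_state[stateid]
        return self.files[handle][offset:offset + count]


def survives_reboot(server, token):
    """Reboot the server, then retry a read with the token held before the reboot."""
    server.reboot()
    try:
        server.read(token, 0, 5)
        return True
    except StaleStateError:
        return False
```

In this model the stateless client can simply retry with the same handle, whereas the stateful client has to open the file again (re-establish state) before reads succeed.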

 

I think the NFSv3 protocol is better suited than NFSv4 for transparent planned failovers (where a node is undergoing a reboot), with more availability and reduced downtime.

 

Thanks!

AlexMM
8,183 Views

Hi,

 

Thank you for your answer. I don't understand why NetApp doesn't say not to use NFS 4.1 in production environments. It is unthinkable to lose the link during an upgrade...

tahmad
8,147 Views

Hopefully the guide below is helpful; please refer to page 66 (7.3 NFSv4.1 Sessions), which gives more information about the disruption and the behavior when using NFSv4.1.

TR-4067: NFS Best Practice and Implementation Guide 

AlexMM
8,143 Views

Thanks tahmad,

 

Very interesting guide. I therefore remain of the opinion that in production environments it is better not to use NFS 4.1. Everything works fine until you have to do a cluster upgrade...

Ontapforrum
8,133 Views

I totally understand your point; what I mentioned was based purely on NFSv3 and the number of controller upgrades I have done over the years. This does not mean that NFSv4 is unfit for production use; that would be completely false. NFSv4 is much more optimized and efficient than NFSv3, and there is no doubt about that. We just need a solid understanding and some testing to determine the recommended settings, and there is very little documentation on this, especially around node failover and giveback.

 

I think it's not just about NFSv4/4.1 on the server side (NetApp/ONTAP); the onus is also on the client side (*nix/VMware) to re-establish the stateid.

 

According to RFC 7530 (NFSv4):

If the server loses locking state (usually as a result of a restart or reboot), it must allow clients time to discover this fact and re-establish the lost locking state. A client can determine that server failure (and thus loss of locking state) has occurred, when it receives one of two errors. The NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a client ID invalidated by reboot or restart.
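The recovery behavior described in the RFC excerpt can be sketched as a small simulation. This is a toy under stated assumptions, not how ESXi or ONTAP implement it: `ToyServer`, `ToyClient`, `set_clientid`, and the `(boot, name)` client ID are hypothetical names, and the real protocol also involves a grace period and stateid handling that are omitted here. Only the error name `NFS4ERR_STALE_CLIENTID` comes from the RFC text.

```python
class NfsError(Exception):
    """Carries an NFSv4 error name such as NFS4ERR_STALE_CLIENTID."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


class ToyServer:
    """Keeps lock state in memory only, so a restart loses it."""

    def __init__(self):
        self.boot = 0
        self.clients = {}          # client ID -> set of locked paths

    def set_clientid(self, name):
        cid = (self.boot, name)    # the ID is tied to the current boot instance
        self.clients[cid] = set()
        return cid

    def lock(self, cid, path):
        if cid not in self.clients:
            # The client ID was issued before the restart: tell the client.
            raise NfsError("NFS4ERR_STALE_CLIENTID")
        self.clients[cid].add(path)

    def restart(self):
        self.boot += 1
        self.clients.clear()


class ToyClient:
    """Remembers its locks so it can re-establish them after a server restart."""

    def __init__(self, server, name):
        self.server = server
        self.name = name
        self.cid = server.set_clientid(name)
        self.held = []

    def recover(self):
        # Obtain a fresh client ID, then reclaim every lock held before.
        self.cid = self.server.set_clientid(self.name)
        for path in self.held:
            self.server.lock(self.cid, path)

    def lock(self, path):
        try:
            self.server.lock(self.cid, path)
        except NfsError as err:
            if err.code != "NFS4ERR_STALE_CLIENTID":
                raise
            self.recover()
            self.server.lock(self.cid, path)   # retry the original request
        self.held.append(path)
```

The point of the sketch is that recovery is driven by the client: the server only reports that the state is stale, and the client must notice the error, re-establish its identity, and reclaim what it held.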

 

I believe more testing is required to determine the recommended settings on the ESX side, alongside the best practices guides already published by NetApp and other vendors for NFSv4 adoption. Until then, I would say we must find out the reasons for the failure. As you have already experienced the time-outs and disconnections first-hand, can we look at the logs on the ESXi side to see which errors were reported, so that we can perhaps tweak the NFS advanced settings? There are a number of these; one of them is 'NFS.DiskFileLockUpdateFreq'. I am not an expert on the NFS protocol, but it is very neatly documented in a public RFC, which helps in understanding the basic nature of the communication between client and server.

 

The best case would be to get VMware and NetApp to jointly look at the case (which is doable) and investigate the reasons for the lost connections during a controller node reboot.

mattjudson
3,788 Views

As far as I'm concerned, this issue still exists with the latest ESXi and ONTAP versions with NFS 4.1 datastores. I'm running ESXi 7.0U3i, and during the upgrade from ONTAP 9.12.1RC1 to 9.12.1 my NFS 4.1 datastore entered the APD state while the NFSv3 datastore was unaffected. In my opinion, the associated VMware and NetApp KBs need to be updated to reflect this and provide more information.

 
