We have 3 hosts with vSphere 6.7, with many datastores mounted over NFS 4.1, and a FAS2750 running ONTAP 9.6. I have followed this guide https://www.netapp.com/us/media/tr-4597.pdf but with every ONTAP upgrade we have problems with virtual machines disconnecting. The datastores lose connection for a few seconds, and when they become browsable again the VMs are "disconnected"; the only solution is to reboot the hosts. Has this ever happened to anyone? With NFS 3 it works perfectly!
We don't use NFSv4 yet, but I am speaking here purely in terms of the protocol's features and capabilities.
From NFS version 4 onwards, NFS becomes 'stateful', just like SMB 1/2/2.1 (CIFS). Hence, by its very nature, clients will be disconnected when that state is lost. NFSv3, however, is stateless, which means the client simply keeps retrying, making it a more forgiving and reliable protocol for transparent failover (where a node undergoes a reboot). Ironically, you can compare NFSv3 with SMBv3, because SMB 3 supports transparent failover through the 'continuous availability' (CA) feature, which is a recommended setting for Hyper-V environments. CA suits Hyper-V because it maintains persistent handles and lock state across a node reboot. I believe NFSv4 will do a good job as long as the node is not rebooted.
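As a quick sanity check, it's worth confirming which protocol version each datastore is actually mounted with on the ESXi hosts, since the failover behaviour described above differs between the two. A minimal sketch (these `esxcli` namespaces exist in vSphere 6.x; run from an SSH session on each host):

```shell
# List datastores mounted over NFSv3:
esxcli storage nfs list

# List datastores mounted over NFSv4.1 (ESXi tracks these in a separate namespace):
esxcli storage nfs41 list
```

A datastore accidentally left on NFSv3 while the rest are on 4.1 (or vice versa) would behave very differently during a controller reboot, so this is worth ruling out first.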
I think the NFSv3 protocol is better suited than NFSv4 for transparent planned failovers (where a node is undergoing a reboot), with higher availability and reduced downtime.
I totally understand your point, and what I mentioned was based purely on NFSv3 and the number of controller upgrades I have done over the years. This does not mean that NFSv4 is not fit for production use; that would be completely false. NFSv4 is much more optimized and efficient than NFSv3, and there is no doubt about it. We just need a solid understanding and some testing to determine the recommended settings, and there is very little documentation around this, especially around node failover/giveback.
I think it's not just about NFSv4/4.1 on the server side (NetApp/ONTAP); the onus is also on the client (*nix/VMware) side to re-establish the state ID.
According to RFC 7530 (NFSv4):
If the server loses locking state (usually as a result of a restart or reboot), it must allow clients time to discover this fact and re-establish the lost locking state. A client can determine that server failure (and thus loss of locking state) has occurred, when it receives one of two errors. The NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a client ID invalidated by reboot or restart.
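On the ONTAP side, the window the server gives clients to discover the restart and reclaim their locking state is governed by the NFSv4 grace period. A hedged sketch of how to inspect it from the cluster shell (the SVM name `svm1` is a placeholder; verify the option name against your ONTAP 9.6 command reference):

```shell
# Show the NFSv4 grace period (in seconds) configured on the SVM.
# 'svm1' is a placeholder SVM name.
vserver nfs show -vserver svm1 -fields v4-grace-seconds
```

If the ESXi clients give up before that grace window plays out during a takeover/giveback, that could line up with the symptoms described in the original post.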
I believe more testing is needed to determine what the 'recommended settings' on the ESX side would be, alongside the best-practices guides already published by NetApp and other vendors for NFSv4 adoption. Until then, I would say we must find out the reasons for the failure. Since you have already experienced time-outs and disconnections first-hand, can we look at the logs on the ESXi side to determine what errors were reported? We could then probably tweak the NFS advanced settings; there are a number of these, but one example is 'NFS.DiskFileLockUpdateFreq'. I am not an expert on the NFS protocol, but the RFC is very neatly documented, public-facing information and helps in understanding the basic nature of communication between client and server.
The best option would be to get VMware and NetApp to jointly look at the case (which is doable) and investigate the reasons for the lost connections during the controller node reboot.