Network and Storage Protocols

Any way to make Linux systems more tolerant to NFS disruptions?

kodiak_f
12,838 Views

Hi Folks,

I'm making this post hot on the heels of yet another network blip bringing down NFS hard mounts across a bunch of Linux systems. Most of our systems are reasonably modern: Ubuntu 20.04 LTS and RHEL 7.


The mount arguments are:

rw,relatime,vers=4.1,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=*.*.*.*,local_lock=none,addr=*.*.*.*,_netdev


I believe those are pretty much the default - we set these in /etc/fstab:

nfsvers=4.1,defaults,_netdev,nofail


Sadly we don't have any sort of dedicated NFS network; our NFS shares are exported on one VLAN and generally have to be routed through one intermediate network firewall to reach the client. That's hard to get around given the network we are stuck with.


Any advice is welcome. One thing I was thinking about is really pushing the timeo value, maybe to roughly an hour total by setting timeo=12000,retrans=2, or timeo=600,retrans=60?
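For reference, here's roughly what one of those fstab entries would look like with a longer retry window. The server, export path, and mountpoint are made up, and the timeo/retrans values are just the ones I'm considering (timeo is in tenths of a second, so timeo=600 is 60 seconds per attempt):

filer01:/export/projects  /mnt/projects  nfs  nfsvers=4.1,defaults,_netdev,nofail,hard,timeo=600,retrans=60  0  0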


When our network has a problem it's usually only a problem for about 15-20 minutes.


Thanks!

 

1 ACCEPTED SOLUTION

parisi
12,771 Views

Your firewall may be breaking this. Once the connection is severed with NFSv4, lock reclaims will be attempted for about 90 seconds, but then the storage will get rid of the locks. But that shouldn't impact connectivity.

 

What may be happening here is the firewall is closing the source port that the connection was using.

 

For example, when an NFS mount is made to ONTAP, it uses port 2049 for NFS, but it uses a different *source* port each time.

 

 

tme-a300-efs01::*> network connections active show -service nfs* -fields remote-port,service,local-port -sort-by cid
node              cid       vserver local-port remote-port service
----------------- --------- ------- ---------- ----------- -------
tme-a300-efs01-01 46969301   DEMO   2049       819         nfs
tme-a300-efs01-01 46969303   DEMO   2049       805         nfs
tme-a300-efs01-02 275045092  DEMO   2049       1020        nfs
tme-a300-efs01-02 275045094  DEMO   2049       917         nfs

 

 

What probably happens is that the firewall severs the connection, and when the client tries to reconnect over that same source port, it can't, because that connection is gone.

 

We cover similar issues with NAT in this TR on page 73:

 

https://www.netapp.com/pdf.html?item=/media/10720-tr-4067.pdf

 

And we discuss firewall considerations on page 77. There are a couple of ONTAP NFS server options (-idle-connection-timeout and -allow-idle-connection) that may help here.

 

 

[-idle-connection-timeout <integer>] - Idle Connection Timeout Value (in seconds)
This optional parameter specifies the idle connection timeout for NFS connections. The value specified must be between 120 and 86400.

[-allow-idle-connection {enabled|disabled}] - Are Idle NFS Connections Supported
This optional parameter specifies whether to enable idle NFS connections. The default setting is disabled.
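For example, a rough sketch of enabling both on the NFS server. This uses the same demo SVM as the output above and a 30-minute timeout purely as an example; double-check the exact syntax and privilege level on your ONTAP release:

tme-a300-efs01::*> vserver nfs modify -vserver DEMO -allow-idle-connection enabled -idle-connection-timeout 1800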

 

 


9 REPLIES

GerhardWolf
12,777 Views

Hi,

What might help is tuning the Linux system for better buffer behavior. In our AI environments we use these parameters:

/etc/sysctl.d parameters:

# allow testing with buffers up to 64MB - good for 10GbE
#net.core.rmem_max = 67108864
#net.core.rmem_default = 67108864
#net.core.optmem_max = 67108864
#net.core.wmem_max = 67108864
#net.core.wmem_default = 67108864
# min, default, max number of bytes allocated for the socket's receive/send buffer - good for 10GbE
# 32MB for autotuning TCP buffer limits min, default and max
#net.ipv4.tcp_rmem = 4096 87380 33554432
#net.ipv4.tcp_wmem = 4096 65536 33554432

# allow testing with buffers up to 128MB - good for 40GbE
#net.core.rmem_max = 134217728
#net.core.rmem_default = 134217728
#net.core.optmem_max = 134217728
#net.core.wmem_max = 134217728
#net.core.wmem_default = 134217728
# min, default, max number of bytes allocated for the socket's receive/send buffer - good for 40GbE
# 64MB for autotuning TCP buffer limits min, default and max
#net.ipv4.tcp_rmem = 4096 87380 67108864
#net.ipv4.tcp_wmem = 4096 65536 67108864

# The maximum number of packets queued in received state - good for 10GbE
#net.core.netdev_max_backlog = 30000

# allow testing with buffers up to 256MB - good for 100GbE
net.core.rmem_max = 268435456
net.core.rmem_default = 268435456
net.core.optmem_max = 268435456
net.core.wmem_max = 268435456
net.core.wmem_default = 268435456
# min, default, max number of bytes allocated for the socket's receive/send buffer - good for 100GbE
# 128MB for autotuning TCP buffer limits min, default and max
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

# Enable memory auto tuning - good for all
net.ipv4.tcp_moderate_rcvbuf = 1

# The maximum number of packets queued in received state - good for 100GbE
net.core.netdev_max_backlog = 300000

# for NFSv3 performance (128 for more read workloads, 64 for better write latency, 16 if a lot of clients access the system) - ONTAP only supports 128 slots
sunrpc.tcp_slot_table_entries = 128
sunrpc.tcp_max_slot_table_entries = 128

# Time out closing of TCP connections after 7 seconds
net.ipv4.tcp_fin_timeout = 7

# Avoid falling back to slow start after a connection goes idle
net.ipv4.tcp_slow_start_after_idle = 0

# Enable Forward Acknowledgment, which operates with Selective Acknowledgment (SACK) to reduce congestion
net.ipv4.tcp_fack = 1

# Support windows larger than 64KB
net.ipv4.tcp_window_scaling = 1

# Enable selective acknowledgment, which improves performance by selectively acknowledging packets received out of order
net.ipv4.tcp_sack = 1

# Enable calculation of RTT in a more accurate way (see RFC 1323) than the retransmission timeout
net.ipv4.tcp_timestamps = 1

# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1

# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing = 1

# recommended to enable 'fair queueing'
net.core.default_qdisc = fq
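These live in a file under /etc/sysctl.d/ (the file name is up to you, e.g. 90-nfs-tuning.conf). To load them without a reboot you can run something like:

sudo sysctl --system
sysctl net.core.rmem_max net.ipv4.tcp_rmem

The first command re-reads all sysctl.d files; the second just spot-checks that the new values took effect. Note the sunrpc.* entries only apply once the sunrpc module is loaded.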

 

kodiak_f
12,753 Views

Thanks very much for the replies GerhardWolf and Parisi!

 

For us, we do know what is breaking our mounts: network interruptions that are outside the control of my team (Linux & storage).

These are unavoidable breaks that happen when an intermediate link/hop goes down for longer than 180s (timeo=600,retrans=2), causing the Linux system to give up and hang the mount until we reboot the client system entirely.

 

We're hoping for any guidance on coping with a network outage of up to 20 minutes, so that the mounts can still recover without needing to reboot the client system once the outage resolves.

 

I'm tempted to set the timeo & retrans to 600 & 20 to see if that copes with a ~15 minute outage and then recovers as we'd hope. 
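For those trial runs I'll probably just confirm which options actually took effect and keep an eye on the TCP connection to the filer during a blip, with something like:

nfsstat -m
ss -tn state established '( dport = :2049 )'

(nfsstat -m shows the options each NFS mount is really using; the ss filter lists established TCP connections to port 2049.)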

 

Currently I'm a bit afraid of soft mounts because I can't be sure that our developers and the libraries they use will cope properly with I/O errors and not corrupt data. 

 

GerhardWolf - I appreciate the sysctl settings but I'm not sure if any of those will help with availability problems, though I'm sure they are great for performance.

 

Thanks again all!

 

 - Kodiak

parisi
12,751 Views

As I mentioned, network interruptions alone wouldn't break the mounts. The firewall is probably doing that as a result of the network outage: it sees what looks like a stale connection and severs it. The indefinite hang is probably because the client then can't re-establish the connection, since it's gone.

 

I'd suggest trying the idle timeout and keepalive settings I posted about, and perhaps the mount options you described. But I don't have any real way to test out the scenario you mentioned, so it would be trial and error for you.

GerhardWolf
12,730 Views

Hi,

 

Well, as Parisi mentioned, the firewall is the key point here, along with the network itself. I have also sometimes seen packet-inspecting intrusion detection services harm the communication.

As an old networking guy, I'd be interested in why you have these link outages and where the single point of failure is. It would also help to know how much traffic flows over the link that breaks, in order to design a mitigation.

For small amounts of traffic, or simply to keep the line open, an alternative GSM/phone/satellite (like StarLink) route that takes over when the main line is down could be an option.

An additional mitigation could be to implement a cache on your side, in front of the Linux systems, via an ONTAP Select VM with FlexCache, so you could still read the hot data. I'd have to look into the current state of the cache's write behavior.

kodiak_f
12,668 Views

Thanks again, folks - apologies, I didn't quite grok the earlier posts at first. I appreciate it and will be following up with our infosec department.

 

For the linked tunables on the SVM, do you have thoughts about what values to try? 

 

It sounds like I should toggle -allow-idle-connection away from its default to allow idle connections, then perhaps set the idle connection timeout to a period longer than a likely disruption - e.g. 30 minutes?

 

Thanks again folks!

 

parisi
12,620 Views

Yes, enable idle connections, and then maybe set a timeout lower than 30 minutes and do trial and error. Idle connections that hang around too long can pile up and potentially cause you to hit per-node connection limits (depending on the platform, number of clients, timeout values, etc.).
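For example, to start with a 20-minute window rather than 30 (same demo SVM as the output earlier; adjust for your environment):

tme-a300-efs01::*> vserver nfs modify -vserver DEMO -idle-connection-timeout 1200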

kodiak_f
12,614 Views

Thanks very much Parisi - our NFS servers have very few clients so I think we'll be pretty safe when it comes to running out of connections. 

 

Much appreciated!


JLoudy
6,371 Views

I know this thread is years old, but one thing I've been doing lately is converting mounts to an automount approach using x-systemd.automount,x-systemd.idle-timeout=2min in the fstab entries. These options are available in most recent Linux versions and cause the client to unmount the NFS share if it hasn't been used for 2 minutes. I've seen this greatly reduce hung mounts during network maintenance/outages, unless the mount is actively in use during the interruption. It doesn't truly fix the problem, but it has reduced the cleanup work needed.
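A rough sketch of such an entry - server, export path, and mountpoint are placeholders, and you'd run systemctl daemon-reload after editing fstab so systemd regenerates the mount/automount units:

filer01:/export/projects  /mnt/projects  nfs  nfsvers=4.1,_netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=2min  0  0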
