Solved: Any limit on the number of NFS mounts on a LIF?

netappmagic · ‎2021-09-23

Is there a recommended number of, or a limit on, NFS mounts on a LIF associated with a node?

parisi · ‎2021-09-23

Recommendation is to spread mounts across multiple nodes/LIFs in the cluster. While you can establish up to 100,000 NAS connections on a single node, there are other resource limits you might hit, such as memory, CPU or exec contexts.

TR-4067 covers it in networking considerations.

netappmagic · ‎2021-09-24

@parisi ,

Thanks!

NFS datastores (layer-2) and volumes (layer-3) are heavily used here. NFS mounts of both types do not seem to be evenly distributed across LIFs/nodes. Each node carries 1,200-1,600 mounts.

I have also noticed packet discards on some ports within the ifgrp (4x10GbE or 4x40GbE). On some nodes, some ports have reached discard rates as high as 1.x% or 2.x%. My understanding is that 0.1% or below can be treated as acceptable.

Should "packet discards" be used as the measurement to determine whether LIFs/nodes are overloaded with NFS mounts and whether the mounts need to be redistributed? Are there any other measurements?

Should we upgrade the ifgrp from 4x10GbE to 4x40GbE on the nodes whose 10GbE ports show packet discards?

parisi · ‎2021-09-24

Packet discards aren't generally caused by "too many NFS mounts."

ONTAP won't discard packets if it's overloaded; instead it uses flow control mechanisms to tell the clients to wait until resources are freed. Packet discards are usually network related issues - bad cables, ifgrp config issues, etc.

I'd suggest opening up a support case to try to narrow down why those are happening.

What issue prompted you to look at packet discards? Where are you seeing those accumulate? Ifstat on the cluster? Packet traces?

netappmagic · ‎2021-09-24

We have sometimes experienced performance issues with some batch jobs running on Linux VMs. Latency on the datastore or volume is only 1-2 ms, and we couldn't identify the root cause. That made me look into the ifgrp and ports, where I found the packet discards.

I ran "node run -node node-name ifstat port-name" and found that the "discards/total frames" ratio is higher than expected. Since you point out that too many NFS mounts won't cause discards, would increasing the bandwidth from 4x10GbE to 4x40GbE improve things? I ask because one of the HA pairs has 4x40GbE and its discard rate is very low, < 0.1%.
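
For reference, the "discards/total frames" ratio mentioned above is simple arithmetic. This is a minimal sketch, assuming the two counters have been copied by hand from the ifstat output; the port names and counter values are made up, and the 0.1% threshold is only the rule of thumb from this thread, not an official limit.

```python
# Rule-of-thumb threshold quoted in this thread (assumption, not an official limit).
ACCEPTABLE_PCT = 0.1


def discard_rate(discards: int, total_frames: int) -> float:
    """Return discards as a percentage of total frames (0.0 if no frames seen)."""
    if total_frames <= 0:
        return 0.0
    return 100.0 * discards / total_frames


def flag_ports(counters: dict) -> list:
    """Given {port: (discards, total_frames)}, return ports above the threshold."""
    return sorted(
        port
        for port, (discards, total) in counters.items()
        if discard_rate(discards, total) > ACCEPTABLE_PCT
    )


if __name__ == "__main__":
    # Hypothetical counters; replace with values read from ifstat.
    sample = {"e0a": (1500, 100000), "e0b": (50, 100000)}
    for port in flag_ports(sample):
        print(port, f"{discard_rate(*sample[port]):.2f}%")
```

A high ratio only says that a port is discarding frames. It does not say why, so it is not evidence of too many mounts.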

parisi · ‎2021-09-24

I'd suggest having a look at the section in TR-4067 on RPC slot tables/exec context blocking starting on page 111.

https://www.netapp.com/us/media/tr-4067.pdf

See if your cluster nodes are getting excess blocked execs as per page 112. That may be the source of your latency. Remediation is configuring the clients' slot tables lower or to use nconnect (if your client OS supports it). That would be done on the NFS VMs.

As for discards, you could try to chase that rabbit if you want, and it likely does need to be resolved, but I don't know if it solves your latency problem.

netappmagic · ‎2021-09-27

I have read the section starting on page 111. There is obviously "exec context blocking" on some storage nodes, because the counter increases by at least 10k, often much more, within a 1-minute period. We are running ONTAP 9.7P11, and most of the Linux VM servers are running Red Hat 7.9, all using the default RPC slot tables.

As I said previously, each storage node has 1,200-1,600 NFS mounts, so my questions are:
1. Should we reduce the workload (NFS mounts) on the nodes with more exec context blocking by moving workloads to other nodes, thereby redistributing the load among nodes?

2. For the server in question, because we are all using the default RPC slot tables, there would be no way to identify whether RPC slot tables/exec context blocking is the issue. What can I do in this situation?

Thank you for your valuable message!

parisi · ‎2021-09-27

ONTAP 9.8 and later has an EMS message that tells you which client is overrunning the exec contexts. ONTAP 9.9.1 and later introduced exec context throttling for some node types.

Spreading the workload across nodes may help, but only if a single node’s resources are being overrun. If the resource issue is constrained to single TCP connections, spreading the workload won’t necessarily help there - only reducing the client driven slot tables or using nconnect.

netappmagic · ‎2021-09-28

I have some follow-up below if you can please help me out:

1. Based on the large increase in execs_blocked_on_cid reported by that "statistics" command (> 10k in a minute), it looks like ONTAP is pushing back client requests on some nodes in the cluster. Can we then conclude that these nodes have performance issues caused by too many concurrent NFS operation requests?

2. In your document, you used the workload "creating lots of files or directories" as the example to illustrate the concepts. What about Oracle database files? Could a workload like this also generate a large number of NFS requests at the same time or overrun the exec contexts? I can see about 300 Oracle .dbf datafiles on an NFS file system, and based on the timestamps, all of these files keep changing from time to time (I know they are not OLTP-type applications). To tell accurately whether this client really overruns the exec contexts, we will have to upgrade ONTAP to 9.8 first. Correct?

3. Would it be okay to configure nconnect without knowing whether the client really is overrunning the exec contexts?

4. On page 113, you stated that 3ms of latency was added, but Figure 19 showed ~9ms latency. Why?

5. Does the overrun issue also apply to an NFS datastore mounted on an ESXi host as the client?

parisi · ‎2021-09-28

1. No, we can't make that conclusion. The exec blocking can happen due to TCP resource exhaustion (slot tables per TCP connection) *or* node level exhaustion. It is not clear here which one is causing the issue. ONTAP 9.8 and/or 9.9.1 could help some there.

2. We do see Oracle databases causing slot table exhaustion on occasion, which is why it's covered in the Oracle Best Practice guide on page 36. dNFS can also help there.

https://www.netapp.com/pdf.html?item=/media/8744-tr3633pdf.pdf

3. nconnect won't hurt you (other than using up more TCP connections; each node has a 100K connection limit). The issue is whether the client OS supports it. RHEL 7.9 does not, as far as I am aware, but RHEL 8.3 and later do.

4. 65536 slot figure had 9ms latency. 128 slot figure had 6ms latency. 9ms - 6ms = 3ms

5. ESXi can also overrun the slot tables, yes. But again, we don't have a clear picture on which clients are possibly overrunning the TCP connections. ONTAP 9.8 has an EMS that gives more detail. For example (also seen on page 113):

cluster::*> event log show -node tme-a300-efs01-0* -message-name nblade.execsOverLimit
Time                Node             Severity      Event
------------------- ---------------- ------------- ---------------------------
4/8/2021 17:01:30   node1            ERROR         nblade.execsOverLimit: The number of in-flight requests from client with source IP x.x.x.x to destination LIF x.x.x.x (Vserver 20) is greater than the maximum number of in-flight requests allowed (128). The client might see degraded performance due to request throttling.
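
To see which clients trigger this EMS most often, the saved event log text can be tallied per source IP. This is a rough sketch that assumes the message wording matches the example above exactly; the function names and the sample IP addresses are hypothetical, and the regular expression would need adjusting if the wording differs in another ONTAP release.

```python
import re
from collections import Counter

# Pattern derived only from the example message in this thread (assumption).
EVENT_RE = re.compile(
    r"nblade\.execsOverLimit:.*?source IP (?P<client>\S+) "
    r"to destination LIF (?P<lif>\S+) \(Vserver (?P<vserver>\d+)\)"
    r".*?allowed \((?P<limit>\d+)\)"
)


def parse_execs_over_limit(text: str) -> list:
    """Extract client IP, LIF, Vserver ID and limit from each event in the text."""
    flat = " ".join(text.split())  # undo CLI line wrapping
    return [
        {
            "client": m["client"],
            "lif": m["lif"],
            "vserver": int(m["vserver"]),
            "limit": int(m["limit"]),
        }
        for m in EVENT_RE.finditer(flat)
    ]


def events_per_client(events: list) -> Counter:
    """Count how many execsOverLimit events each client IP produced."""
    return Counter(e["client"] for e in events)


if __name__ == "__main__":
    # Made-up addresses standing in for the redacted x.x.x.x values.
    sample = (
        "4/8/2021 17:01:30 node1 ERROR nblade.execsOverLimit: The number of "
        "in-flight requests from client with source IP 10.0.0.5 to destination "
        "LIF 10.0.0.20 (Vserver 20) is greater than the maximum number of "
        "in-flight requests allowed (128)."
    )
    print(events_per_client(parse_execs_over_limit(sample)).most_common())
```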

netappmagic · ‎2021-09-29

Appreciate your help!

1. Each NFS mount corresponds to a CID, and for each CID ONTAP allows a maximum of 128 execs. If I have 2 NFS mounts on a client, ONTAP would then allow 2 x 128 = 256 execs. If nconnect is not set, there is one TCP connection per mount, so 2 NFS mounts mean 2 TCP connections. ONTAP supports only 128 slot table entries per TCP connection. Are these statements all correct?

2. Clients have 65536 as their default slot table value. Should we change it to 128 as the maximum value, because too many slot table entries on these clients may exhaust exec contexts on the nodes? Does that make sense?

3. A node can have 100K TCP connections, and each TCP connection can use 128 exec contexts, so the node can allow 12,800K exec contexts. Correct?

4. ONTAP enables NAS flow control, so for a cluster that mainly uses NAS protocols, we don't need to worry about whether to enable or disable flow control on the NIC/switch. Correct?

parisi · ‎2021-09-29

1. Correct. Plus, there are per-node exec context limits, so if you have 1000 mounts and each uses 128 execs per CID, you could potentially run out of execs per node. TR-4067 has a section on how to see the max execs per node allowed and to see if you're approaching that limit on page 109.

2. Yes, that would help reduce the overruns. However, I have seen in large environments with lots of clients where the value had to be lowered even further (16 in some cases) due to the per-node limits. ONTAP 9.9.1 has exec context throttling (page 109) that helps mitigate that issue.

3. No. See page 109 of the TR - different node types have different exec limits depending on RAM. And execs aren't always allocated; once the operation is done, ONTAP releases the exec back to the system for a new operation.

4. NAS flow control is specific to CIFS/NFS operations. Switch flow control is for all ethernet operations. They are not really related to one another, so it's ultimately your choice if you want to enable/disable it on the switch.
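
The per-node arithmetic from point 1 (1000 mounts at 128 execs each) can be sketched as below. The 128-per-connection figure comes from this thread. The node limit used here is a made-up placeholder, because the real value depends on the node type and has to be read from the node as described in TR-4067; the function names are hypothetical too.

```python
# Per-connection (CID) exec limit quoted in this thread.
EXECS_PER_CONNECTION = 128

# Placeholder only: the real per-node limit varies with node type and RAM.
HYPOTHETICAL_NODE_EXEC_LIMIT = 50_000


def worst_case_execs(connections: int, client_slots: int = EXECS_PER_CONNECTION) -> int:
    """Upper bound on execs in use if every connection fills its slot table."""
    return connections * min(client_slots, EXECS_PER_CONNECTION)


def slots_to_fit(node_limit: int, connections: int) -> int:
    """Largest per-connection slot count that keeps the worst case under node_limit."""
    if connections <= 0:
        return EXECS_PER_CONNECTION
    return min(EXECS_PER_CONNECTION, node_limit // connections)


if __name__ == "__main__":
    mounts = 1000  # example figure from point 1 above
    print("worst case at 128 slots:", worst_case_execs(mounts))
    print("worst case at 16 slots:", worst_case_execs(mounts, 16))
    print("slots that fit:", slots_to_fit(HYPOTHETICAL_NODE_EXEC_LIMIT, mounts))
```

This is a worst case only. Execs are released as soon as an operation completes, so real usage is normally far below the product.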

heightsnj · ‎2023-05-12

@parisi ,

We found the error below:
ERROR nblade.execsOverLimit: The number of in-flight requests from client with source IP x.x.x.x to destination LIF x.x.x.x (Vserver 20) is greater than the maximum number of in-flight requests allowed (128).

So, does that mean the NFS requests from the client (source) exceeded the allowed RPC slots (128)?

Should we tune the RPC slots on the client down to 128? Which kernel parameters need to be adjusted?

Thank you!

heightsnj · ‎2023-05-12

@parisi
From my research, I need to set both of the parameters below to 128 in order to change the RPC slot table to 128. Please confirm:

tcp_slot_table_entries
tcp_max_slot_table_entries

Also, does this apply to ESXi hosts if we use NFS datastores? Do the same parameters need to be set to 128 there?

parisi · ‎2023-05-22

Not sure. That'd be more of a question for VMware. I suspect it's NFS.MaxQueueDepth instead.

parisi · ‎2023-05-22

Yes, that is covered in TR-4067.

https://www.netapp.com/pdf.html?item=/media/10720-tr-4067.pdf

heightsnj · ‎2023-05-23

I am sorry @parisi, but I am not clear on your response.
1. In order to change the RPC slot table from the default to 128 on Linux clients, should we set both of the following parameters to 128? I couldn't find this information in your document.

tcp_slot_table_entries --> 128
tcp_max_slot_table_entries --> 128

2. For RPC slot table tuning on ESXi hosts, you are referring me to VMware, right?

parisi · ‎2023-05-23

1. Yes, change both.

2. Yes, check with VMware, but this KB covers the queue depth piece: https://kb.netapp.com/onprem/ontap/hardware/Seeing_nblade.execsOverLimit_for_ESXi_host_running_7.0u1_or_higher
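
For the Linux side of point 1, a small helper can show the current values and print the matching module option line. This is a sketch under assumptions: it assumes the sunrpc module is loaded so the values appear under /proc/sys/sunrpc, and that the client distribution reads persistent options from a file under /etc/modprobe.d (the file name is up to you). Check the client OS documentation for when the new values take effect.

```python
from pathlib import Path

# The two sunrpc parameters named in this thread.
PARAMS = ("tcp_slot_table_entries", "tcp_max_slot_table_entries")

# Assumed location of the live values on a Linux client with sunrpc loaded.
SYSCTL_DIR = Path("/proc/sys/sunrpc")


def current_values(base: Path = SYSCTL_DIR) -> dict:
    """Read the live slot table settings; None when a file is not present."""
    values = {}
    for name in PARAMS:
        path = base / name
        values[name] = int(path.read_text().strip()) if path.exists() else None
    return values


def modprobe_options(slots: int = 128) -> str:
    """Build the 'options sunrpc ...' line that sets both parameters to slots."""
    opts = " ".join(f"{name}={slots}" for name in PARAMS)
    return f"options sunrpc {opts}"


if __name__ == "__main__":
    print("current:", current_values())
    print("suggested line:", modprobe_options(128))
```

The script only reads and prints; it does not change anything on the client.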

zizibagnon · ‎2022-04-23

@netappmagic

Is your issue solved?

@parisi

What should I understand from these values? (see below)

ALLCA::*> systemshell -node * sysctl sysvar.nblade | grep -i cid
(system node systemshell)

Node: ALLNA1A
sysvar.nblade.debug.core.cid_in_use: 226
sysvar.nblade.debug.core.cid_max: 115911
sysvar.nblade.debug.core.cid_reserved: 10526
sysvar.nblade.debug.core.cid_allocs: 33452
sysvar.nblade.debug.core.total_execs_blocked_on_per_cid_limit: 6858371
sysvar.nblade.ngprocess.rewind.PerCIDRewindContextCount: 11578

Node: ALLNA1B
sysvar.nblade.debug.core.cid_in_use: 230
sysvar.nblade.debug.core.cid_max: 115911
sysvar.nblade.debug.core.cid_reserved: 10526
sysvar.nblade.debug.core.cid_allocs: 101805
sysvar.nblade.debug.core.total_execs_blocked_on_per_cid_limit: 2844707
sysvar.nblade.ngprocess.rewind.PerCIDRewindContextCount: 11578
2 entries were acted on.

parisi · ‎2022-04-25

In a vacuum, not much. You'd want to monitor it over time and see if that sysvar.nblade.debug.core.total_execs_blocked_on_per_cid_limit value increases a bunch. 9.8 and later also has EMS messages if there are large increases. Again, TR-4067 covers it.
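
Monitoring over time can be as simple as saving the command output twice and comparing the counter per node. A minimal sketch, assuming the output format shown in the post above; the function names are hypothetical, and the second sample in the demo is made up to illustrate the delta.

```python
import re

NODE_RE = re.compile(r"^Node:\s*(\S+)")
COUNTER_RE = re.compile(r"total_execs_blocked_on_per_cid_limit:\s*(\d+)")


def parse_blocked(output: str) -> dict:
    """Map node name -> total_execs_blocked_on_per_cid_limit from saved output."""
    result = {}
    node = None
    for raw in output.splitlines():
        line = raw.strip()
        node_match = NODE_RE.match(line)
        if node_match:
            node = node_match.group(1)
            continue
        counter_match = COUNTER_RE.search(line)
        if counter_match and node is not None:
            result[node] = int(counter_match.group(1))
    return result


def blocked_delta(before: dict, after: dict) -> dict:
    """Per-node increase between two samples (nodes present in both only)."""
    return {n: after[n] - before[n] for n in after if n in before}


if __name__ == "__main__":
    first = """Node: ALLNA1A
sysvar.nblade.debug.core.total_execs_blocked_on_per_cid_limit: 6858371
Node: ALLNA1B
sysvar.nblade.debug.core.total_execs_blocked_on_per_cid_limit: 2844707
"""
    # Made-up later sample, for illustration only.
    second = first.replace("6858371", "6870000").replace("2844707", "2844707")
    print(blocked_delta(parse_blocked(first), parse_blocked(second)))
```

A node whose delta keeps growing between samples is the one to look at; a large but static total says little on its own.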