Network and Storage Protocols
Hey folks,
I have a strange issue and I can't get rid of it.
Environment: ActiveMQ Artemis (deployed via the AMQ Operator) on OKD, with NetApp Trident providing the NFSv4 volumes from our OTS cluster.
ActiveMQ starts normally and operates as expected, but the "LockCount" and "OwnerCount" counters rise steadily while all other counters stay low.
otscl::*> nfs storepool show -vserver mysvm
Node: otscl1
Vserver: mysvm
Data-Ip: 192.168.1.66
Client-Ip Protocol IsTrunked OwnerCount OpenCount DelegCount LockCount
-------------- --------- --------- ---------- ---------- ---------- ---------
192.168.1.67 nfs4.2 false 0 0 0 0
192.168.1.68 nfs4.2 false 26099 23 0 26099
When the Lock/OwnerCount hits ~131k, the following error appears:
otscl1 EMERGENCY Nblade.nfsV4PoolExhaust: NFS Store Pool for Owner exhausted. Associated object type is CLUSTER_NODE with UUID: XXXXXXXXXXXXXXXX.
From that point on, none of the NFSv4 shares on the OTS cluster (across all SVMs) can be accessed anymore until we restart ActiveMQ, which resets the counters.
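In case it helps anyone reading along, both the counters and the exhaustion event can be re-checked from the clustershell roughly like this (SVM name as above; the event query just shows whether a node has already hit the limit):

otscl::*> nfs storepool show -vserver mysvm
otscl::*> event log show -message-name Nblade.nfsV4PoolExhaust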
I also checked the locks in detail; only a handful of entries show up, nowhere near the owner count above. See:
locks show -vserver mysvm
(vserver locks show)
Notice: Using this command can impact system performance. It is recommended
that you specify both the vserver and the volume when issuing this command to
minimize the scope of the command's operation. To abort the command, press Ctrl-C.
Vserver: mysvm
Volume Object Path LIF Protocol Lock Type Client
-------- ------------------------- ----------- --------- ----------- ----------
trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/journal/server.lock
mysvm_lif
nfsv4.1 share-level 192.168.1.66
Sharelock Mode: read_write-deny_none
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/journal/serverlock.1
mysvm_lif
nfsv4.1 share-level 192.168.1.66
Sharelock Mode: read_write-deny_none
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/journal/serverlock.2
mysvm_lif
nfsv4.1 share-level 192.168.1.66
Sharelock Mode: read_write-deny_none
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/bindings/activemq-bindings-1.bindings
mysvm_lif
nfsv4.1 delegation 192.168.1.66
Delegation Type: write
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/bindings/activemq-bindings-2.bindings
mysvm_lif
nfsv4.1 delegation 192.168.1.66
Delegation Type: write
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/journal/activemq-data-1.amq
mysvm_lif
nfsv4.1 share-level 192.168.1.66
Sharelock Mode: read_write-deny_none
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/journal/activemq-data-2.amq
mysvm_lif
nfsv4.1 share-level 192.168.1.66
Sharelock Mode: read_write-deny_none
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/journal/server.lock
mysvm_lif
nfsv4.1 share-level 192.168.1.67
Sharelock Mode: read_write-deny_none
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/journal/serverlock.1
mysvm_lif
nfsv4.1 byte-range 192.168.1.66
Bytelock Offset(Length): 0 (18446744073709551615)
share-level 192.168.1.67
Sharelock Mode: read_write-deny_none
/trident_pvc_1e201be0_e6d6_4ab2_8270_579061df7f89/journal/serverlock.2
mysvm_lif
nfsv4.1 share-level 192.168.1.67
Sharelock Mode: read_write-deny_none
trident_pvc_52530de0_c35d_4a2b_a133_d84b9ef2b9b7
/trident_pvc_52530de0_c35d_4a2b_a133_d84b9ef2b9b7/.healthcheck
mysvm_lif
nfsv4.1 delegation 192.168.1.67
Delegation Type: write
15 entries were displayed.
When running the AMQ Operator and Trident on OpenShift (instead of OKD), the counters stay low, so I thought this could be a kernel or OS issue.
I installed CentOS Stream 10 (kernel 6.12.0-170), which is the OS for OKD, set the same kernel parameters as in the cluster,
mounted the Trident share (with the mount options copied from an OKD node), and deployed AMQ Artemis using the same configuration. The counters stay low there as well.
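To compare the two clients further, something along these lines (mount point and capture file are just examples) should show whether OPEN/LOCK calls keep outpacing CLOSE/LOCKU on the OKD node:

# exact mount options in effect for the Trident volume
findmnt -t nfs4 -o TARGET,SOURCE,OPTIONS
# cumulative NFSv4 client op counts
nfsstat -c
# per-mount, per-op statistics (OPEN/CLOSE, LOCK/LOCKU, ...)
grep -E 'OPEN:|CLOSE:|LOCK:|LOCKU:' /proc/self/mountstats
# optionally capture the traffic for a few minutes and compare in Wireshark
tcpdump -i any -w nfs-okd.pcap port 2049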
During my research, I stumbled across the following comments:
"A not so long while ago I managed to crash a NetApp filer by upgrading a Linux host to an early 6.x kernel and connecting with new NFSv4 features. Seems like its early for all implementors 🙂"
"Just ran across this post and thought it worth mentioning that as of v6.17 there have been over 1k patches to the in-kernel linux NFS client since v5.15 and 2021."
https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4KernelStateNotImpressed?showcomments
Do you have any hints, tweaks, or ideas on how I could investigate this further?
I have already set up an NFSv4 server on Linux, which runs without any problems, so my devs are already thinking about replacing our OTS cluster 😄
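For anyone who wants to reproduce that comparison, an export roughly like this is enough (path, subnet and hostnames are placeholders):

# /etc/exports on the Linux test server
/srv/amq-test 192.168.1.0/24(rw,sync,no_root_squash)
# apply and verify the export
exportfs -ra
exportfs -v
# mount from the client with the same NFS version the Trident volume uses
mount -t nfs4 -o vers=4.1 nfsserver:/srv/amq-test /mnt/amq-test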
Thank you.
This tends to be a pretty common issue, as evidenced by the number of KB articles we have around it. 🙂
This one covers why a client might cause this problem:
https://kb.netapp.com/on-prem/ontap/da/NAS/NAS-KBs/What_are_the_NFSv4_Storepools_why_do_they_exist
How can specific clients potentially cause problems?
This one talks about how to troubleshoot/identify the offending client:
This one consolidates all the different links:
https://kb.netapp.com/on-prem/ontap/da/NAS/NAS-KBs/NFSv4_Storepool
Basically, it's probably a client issue, but I would use the above information to gather data and verify. Then maybe see if restarting the client makes the issue go away. After that, you have ample evidence to convince your group to upgrade the clients. 🙂
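If you want to narrow it down without touching the storage side first, something like this (node and user names are placeholders) restarts only the suspect NFS client and lets you watch whether the counters drop and then start climbing again:

# drain and reboot the OKD node hosting the broker, then bring it back
oc adm drain worker-1 --ignore-daemonsets --delete-emptydir-data
ssh core@worker-1 sudo systemctl reboot
oc adm uncordon worker-1
# back on the cluster, check whether the counters stay low afterwards
otscl::*> nfs storepool show -vserver mysvm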