How can I tell if a LIF is the bottleneck in an NFS performance issue?

heightsnj · ‎2021-08-19

We are an NFS datastore shop. The throughput on a handful of LIFs is constantly very high, more than 1,200 MB/s. Most of the datastores are mounted via these LIFs.

However, the rest of the LIFs see only a few hundred MB/s, with fewer datastores mounted.

There seems to be no way in OCUM, or in any command, to tell me whether a LIF is a bottleneck, or which one.

"> statistics lif show -vserver vserver_name" doesn't show any Recv or Sent Errors

How can I tell if a LIF is overloaded and causing a bottleneck, if there is one?

Thank you!

Ontapforrum · ‎2021-08-19

Starting in ONTAP 9.7, you can.

We’ve had the ability to determine which clients are accessing IP addresses in a cluster (network connections active show), as well as CIFS/SMB session information (cifs session show), but could never get granular information about NFS.

The use cases are varied, but usually fall into:
1) Need to discover who’s using a volume before performing migrations, cutovers, etc.
2) Troubleshooting issues
3) Load distribution

Admin level command:
cluster::> nfs connected-clients show -node <node> -vserver <vserver>

Courtesy:
https://whyistheinternetbroken.wordpress.com/2019/11/08/ontap97-feature-sneak-peek-nfs-client-to-volume-mapping/

pedro_rocha · ‎2021-08-19

Which ONTAP version are you running?

Are you using OCUM or AIQUM?

A LIF's limit is the limit of the underlying physical ports. Are you using LACP?

Are you having issues, or do you just want to monitor and be able to tell when a LIF is overloaded?
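
To put a number on the physical port limit, a rough sketch like the one below converts a LIF's throughput into line-rate utilization. The 10 Gbit/s port speed is an assumption for illustration only, since the port speed has not been stated in this thread. Treating 1 MB as 10^6 bytes is also an assumption, and the function name is hypothetical.

```python
def link_utilization_percent(throughput_mb_s: float, link_gbit_s: float) -> float:
    """Return throughput as a percentage of the link's nominal line rate."""
    if link_gbit_s <= 0:
        raise ValueError("link speed must be positive")
    # MB/s -> Gbit/s, using decimal units (1 MB = 10^6 bytes).
    throughput_gbit_s = throughput_mb_s * 8 / 1000
    return 100.0 * throughput_gbit_s / link_gbit_s


# 1,200 MB/s is the figure from the original question; 10 Gbit/s is assumed.
utilization = link_utilization_percent(1200, 10)
print(f"{utilization:.1f}% of one assumed 10 Gbit/s port")
```

With LACP, a single client-to-LIF flow normally rides one member port, so comparing against the speed of one port rather than the whole group is the conservative check.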

Regards,

Pedro

heightsnj · ‎2021-08-19

To answer your questions:
ONTAP 9.7P7

AIQUM

Yes, we do use LACP

Yes, we do have performance issues, but don't know where the problem is.

pedro_rocha · ‎2021-08-19

The most common culprit is disk. Did you check disk utilization for the aggregates involved with the datastores that are suffering?

heightsnj · ‎2021-08-19

The aggregates seem OK.

I found that the total number of discards is high. It happens on all 4 ports under LACP. Below is an example from one of them.

A certain percentage of drops is acceptable. Can you tell me how to calculate that percentage in my case?

cluster::*> node run -node node-1 ifstat e0f

-- interface e0f (284 days, 21 hours, 36 minutes, 16 seconds) --

pedro_rocha · ‎2021-08-19

I recommend that you first zero the stats for all the interfaces and then start checking the increments in errors, discards, etc.

ifstat -z interface

Old data could point you in the wrong direction.

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/During_periods_of_high_traffic%2C_what_is_an_acceptable_percentage_of_...

Below 0.1% is acceptable according to the article above.

heightsnj · ‎2021-08-19

Yes, I already read that link.

My question is, how is this 0.1% calculated? The numerator is "total discards". What is the denominator in my example?

pedro_rocha · ‎2021-08-19

total frames

but zero first...
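
As a minimal sketch of that calculation: divide the discards by the total frames counted over the same period and compare the result with the 0.1% figure from the KB article. The counter values below are hypothetical, not taken from the ifstat output in this thread, and the helper names are made up for illustration. If the counters cannot be zeroed, subtracting two samples gives the same increments.

```python
ACCEPTABLE_PERCENT = 0.1  # threshold quoted from the KB article in this thread


def discard_percent(discards: int, total_frames: int) -> float:
    """Discards as a percentage of total frames over the same interval."""
    if total_frames <= 0:
        return 0.0  # no traffic counted, so nothing to rate
    return 100.0 * discards / total_frames


def counter_delta(before: int, after: int) -> int:
    """Increment between two samples, for when counters were not zeroed."""
    return after - before


# Hypothetical samples taken some time apart on one port.
frames = counter_delta(before=1_000_000, after=3_000_000)
discards = counter_delta(before=400, after=1_400)

pct = discard_percent(discards, frames)
print(f"{pct:.3f}% discards, acceptable: {pct < ACCEPTABLE_PERCENT}")
```

Run it per port and per direction, so that receive discards are compared against received frames rather than against a mixed total.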

pedro_rocha · ‎2021-08-19

Also, where are you reading the latency from?

pedro_rocha · ‎2021-08-19

Check whether any of the physical ports in the LACP group is saturated. In big environments it is common for traffic to end up unevenly balanced across the ports, because of the way ONTAP implements load balancing, if you do not pay attention to the IPs involved in the communication.
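
To illustrate why the IPs matter, here is a toy model of IP-based port selection. It is not ONTAP's actual distribution function: the XOR-and-modulo hash is a simplified stand-in, and the host and LIF addresses are hypothetical values from a documentation range. The point is only that a deterministic hash over the source and destination IPs can place several clients on the same member port.

```python
import ipaddress
from collections import Counter


def pick_port(src_ip: str, dst_ip: str, num_ports: int) -> int:
    """Toy hash: XOR the last octets of both IPs, then modulo the port count."""
    src_last = int(ipaddress.ip_address(src_ip)) & 0xFF
    dst_last = int(ipaddress.ip_address(dst_ip)) & 0xFF
    return (src_last ^ dst_last) % num_ports


# Hypothetical host IPs all mounting datastores through one LIF IP.
hosts = ["192.0.2.10", "192.0.2.14", "192.0.2.18", "192.0.2.22"]
lif = "192.0.2.100"

# Count how many host-to-LIF flows land on each of the 4 member ports.
usage = Counter(pick_port(h, lif, num_ports=4) for h in hosts)
print(dict(usage))
```

In this made-up example all four hosts hash to the same port index, leaving the other three ports idle for that LIF. Spreading datastores across LIF IPs that hash differently is the kind of attention to IPs meant above.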

pedro_rocha · ‎2021-08-26

Did you find what you needed?