ONTAP Discussions

How could I tell if a LIF is the bottleneck of a performance issue in NFS protocol?

heightsnj

We are an NFS datastore shop. Throughput on a handful of LIFs is constantly very high, more than 1,200 MB/s. Most of our datastores are mounted via these LIFs.

However, the rest of the LIFs carry only a few hundred MB/s, with fewer datastores mounted.

 

There seems to be no way in OCUM, or with any command, to tell whether a LIF is a bottleneck, or which one.

"> statistics lif show -vserver vserver_name" doesn't show any Recv or Sent Errors

 

How can I tell if a LIF is overloaded and causing a bottleneck, if any?

 

Thank you!

11 REPLIES

Ontapforrum

Starting in ONTAP 9.7, you can.

 

We’ve had the ability to determine which clients are accessing IP addresses in a cluster (network connections active show), as well as CIFS/SMB session information (cifs session show), but could never get granular information about NFS.

 

The use cases are varied, but usually fall into:
1) Need to discover who’s using a volume before performing migrations, cutovers, etc.
2) Troubleshooting issues
3) Load distribution

 

Admin level command:
cluster::> nfs connected-clients show -node <node> -vserver <vserver>

 

Courtesy:
https://whyistheinternetbroken.wordpress.com/2019/11/08/ontap97-feature-sneak-peek-nfs-client-to-volume-mapping/

pedro_rocha

Which is the ONTAP version?

 

Are you using OCUM or AIQUM?

 

A LIF's throughput limit is the limit of the underlying physical ports. Are you using LACP?

 

Are you having issues, or do you just want to monitor and be able to tell when a LIF is overloaded?

 

Regards,

Pedro

 

 

To answer your questions:
ONTAP 9.7P7

AIQUM

Yes, we do use LACP

Yes, we do have performance issues, but don't know where the problem is.

 

 

The most common culprit is disk. Did you check disk utilization for the
aggregates involved with the datastores that are suffering?

heightsnj

Aggregate seems ok. 

 

I found that the total number of discards is high. It happens on all 4 ports in the LACP group. Below is an example from one of them.

A certain percentage of drops is acceptable. Can you tell me how to calculate that percentage in my case?

cluster::*> node run -node node-1 ifstat e0f

-- interface e0f (284 days, 21 hours, 36 minutes, 16 seconds) --

RECEIVE
Total frames: 384g | Frames/second: 15608 | Total bytes: 1501t
Bytes/second: 61004k | Total errors: 23 | Errors/minute: 0
Total discards: 3352k | Discards/minute: 8 | Multi/broadcast: 715m
Non-primary u/c: 0 | CRC errors: 0 | Runt frames: 23
Long frames: 0 | Length errors: 0 | Alignment errors: 0
No buffer: 1143k | Pause: 0 | Jumbo: 292g
Noproto: 0 | Bus overruns: 2208k | LRO segments: 241g
LRO bytes: 1480t | LRO6 segments: 0 | LRO6 bytes: 0
Bad UDP cksum: 0 | Bad UDP6 cksum: 0 | Bad TCP cksum: 0
Bad TCP6 cksum: 0 | Mcast v6 solicit: 0 | Lagg errors: 2
Lacp errors: 0 | Lacp PDU errors: 0
TRANSMIT
Total frames: 1086g | Frames/second: 44156 | Total bytes: 70585g
Bytes/second: 2867k | Total errors: 0 | Errors/minute: 0
Total discards: 0 | Queue overflow: 0 | Multi/broadcast: 2454k
Pause: 126k | Jumbo: 833g | Cfg Up to Downs: 0
TSO segments: 0 | TSO bytes: 0 | TSO6 segments: 0
TSO6 bytes: 0 | HW UDP cksums: 0 | HW UDP6 cksums: 0
HW TCP cksums: 0 | HW TCP6 cksums: 0 | Mcast v6 solicit: 0
Lagg drops: 0 | Lagg no buffer: 0 | Lagg no entries: 0
DEVICE
Mcast addresses: 4 | Rx MBuf Sz: 4096
LINK INFO
Speed: 10000M | Duplex: full | Flowcontrol: full
Media state: active | Up to downs: 1

 

 

 

I recommend that you first zero the stats for all the interfaces and then watch how the errors, discards, etc. increment.

 

ifstat -z interface

 

Old data could point you in the wrong direction.

 

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/During_periods_of_high_traffic%2C_what_is_an_acceptable_percentage_of_...

 

Below 0.1% is acceptable, according to the article above.

 

Yes, I already read that link.

 

My question is, how is this 0.1% calculated? The numerator is "total discards". What is the denominator in my example?

total frames

 

but zero first...
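
As a rough illustration of that ratio, here is a minimal Python sketch using the RECEIVE counters from the e0f output posted above (Total discards: 3352k, Total frames: 384g). The helper names are made up for this example, and it assumes the k/m/g/t suffixes in ifstat are decimal multiples; if they are binary multiples the result shifts slightly but stays in the same range.

```python
# Multipliers for ifstat counter suffixes.
# Assumption: suffixes are decimal (k = 1,000), not binary (1,024).
SUFFIXES = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_counter(text):
    """Convert an ifstat counter such as '3352k' or '23' to a number."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def discard_percent(total_discards, total_frames):
    """Return discards as a percentage of total frames."""
    frames = parse_counter(total_frames)
    if frames == 0:
        return 0.0
    return 100.0 * parse_counter(total_discards) / frames


# RECEIVE counters from the e0f output in this thread.
pct = discard_percent("3352k", "384g")
print(f"{pct:.6f}%")  # prints 0.000873%
```

Over the 284 days those counters cover, that is far below 0.1%, but a lifetime average can hide short bursts. After zeroing the counters, the same calculation can be repeated on the new values during a busy period.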

pedro_rocha

Also, where are you reading the latency from?

 

pedro_rocha

Check whether any of the physical ports in the LACP group are saturated. In big environments it is normal for traffic to be spread unevenly across the ports, because of the nature of the load balancing ONTAP implements, if you do not pay attention to the IPs involved in the communication.
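
One simple way to judge saturation on a single port is to compare the Bytes/second figure from ifstat with the link speed. The sketch below is only an illustration with a made-up function name; it uses the e0f sample posted earlier in the thread (RECEIVE 61004k bytes/second, TRANSMIT 2867k bytes/second, Speed 10000M) and assumes "k" means 1,000.

```python
def link_utilization_percent(bytes_per_second, link_speed_mbps):
    """Return utilization of one direction of a link as a percentage."""
    bits_per_second = bytes_per_second * 8
    link_bits_per_second = link_speed_mbps * 10**6
    return 100.0 * bits_per_second / link_bits_per_second


# Values from the e0f ifstat sample; 'k' is assumed to mean 1,000.
rx = link_utilization_percent(61004 * 10**3, 10000)
tx = link_utilization_percent(2867 * 10**3, 10000)
print(f"RX {rx:.2f}%  TX {tx:.2f}%")  # prints RX 4.88%  TX 0.23%
```

That sample is a single moment in time, so it says little about peak load. Repeating the check on each of the four LACP member ports during busy periods would show whether one port carries most of the traffic.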

pedro_rocha

Did you find what you needed?
