How could I tell if a LIF is the bottleneck of a performance issue in NFS protocol?
2021-08-19
06:23 AM
5,061 Views
We are an NFS datastore shop. The throughput on a handful of LIFs is constantly very high, more than 1,200 MB/s. Most of the datastores are mounted via these LIFs.
On the rest of the LIFs, throughput is only a few hundred MB/s, and fewer datastores are mounted through them.
There seems to be no way in OCUM, or with any command, to tell whether a LIF is a bottleneck, or which one.
"> statistics lif show -vserver vserver_name" doesn't show any Recv or Sent errors.
How can I tell if a LIF is overloaded and causing a bottleneck, if any?
Thank you!
11 REPLIES
Starting in ONTAP 9.7, you can.
We’ve had the ability to determine which clients are accessing IP addresses in a cluster (network connections active show), as well as CIFS/SMB session information (cifs session show), but could never get granular information about NFS.
The use cases are varied, but usually fall into:
1) Need to discover who’s using a volume before performing migrations, cutovers, etc.
2) Troubleshooting issues
3) Load distribution
Admin level command:
cluster::> nfs connected-clients show -node <node> -vserver <vserver>
Which ONTAP version are you running?
Are you using OCUM or AIQUM?
A LIF is limited by the underlying physical ports. Are you using LACP?
Are you having issues, or do you just want to monitor the LIFs and be able to tell when one is overloaded?
Regards,
Pedro
To answer your questions:
ONTAP 9.7P7
AIQUM
Yes, we do use LACP
Yes, we do have performance issues, but don't know where the problem is.
The most common culprit is disk. Did you check disk utilization for the aggregates involved with the datastores that are suffering?
The aggregates seem OK.
I found that the total number of discards is high. It happens on all 4 ports in the LACP group; below is the output from one of them.
A certain percentage of drops is acceptable. Can you tell me how to calculate that percentage in my case?
cluster::*> node run -node node-1 ifstat e0f
-- interface e0f (284 days, 21 hours, 36 minutes, 16 seconds) --
RECEIVE
Total frames: 384g | Frames/second: 15608 | Total bytes: 1501t
Bytes/second: 61004k | Total errors: 23 | Errors/minute: 0
Total discards: 3352k | Discards/minute: 8 | Multi/broadcast: 715m
Non-primary u/c: 0 | CRC errors: 0 | Runt frames: 23
Long frames: 0 | Length errors: 0 | Alignment errors: 0
No buffer: 1143k | Pause: 0 | Jumbo: 292g
Noproto: 0 | Bus overruns: 2208k | LRO segments: 241g
LRO bytes: 1480t | LRO6 segments: 0 | LRO6 bytes: 0
Bad UDP cksum: 0 | Bad UDP6 cksum: 0 | Bad TCP cksum: 0
Bad TCP6 cksum: 0 | Mcast v6 solicit: 0 | Lagg errors: 2
Lacp errors: 0 | Lacp PDU errors: 0
TRANSMIT
Total frames: 1086g | Frames/second: 44156 | Total bytes: 70585g
Bytes/second: 2867k | Total errors: 0 | Errors/minute: 0
Total discards: 0 | Queue overflow: 0 | Multi/broadcast: 2454k
Pause: 126k | Jumbo: 833g | Cfg Up to Downs: 0
TSO segments: 0 | TSO bytes: 0 | TSO6 segments: 0
TSO6 bytes: 0 | HW UDP cksums: 0 | HW UDP6 cksums: 0
HW TCP cksums: 0 | HW TCP6 cksums: 0 | Mcast v6 solicit: 0
Lagg drops: 0 | Lagg no buffer: 0 | Lagg no entries: 0
DEVICE
Mcast addresses: 4 | Rx MBuf Sz: 4096
LINK INFO
Speed: 10000M | Duplex: full | Flowcontrol: full
Media state: active | Up to downs: 1
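For anyone reading along, here is a rough way to put the numbers above against the port's line rate. This is only a sketch in Python, not an ONTAP tool: the function name is made up, it assumes the "k" suffix in ifstat means 1,000, and it assumes "MB/s" in the original post means 1,000,000 bytes per second.

```python
def utilization_pct(bytes_per_second, link_speed_mbit):
    """Return throughput as a percentage of the link's line rate."""
    line_rate_bytes = link_speed_mbit * 1_000_000 / 8  # Mbit/s -> bytes/s
    return 100.0 * bytes_per_second / line_rate_bytes


# Values copied from the e0f output above (Speed: 10000M).
# Assumption: the "k" suffix is a decimal multiplier (1,000).
rx = utilization_pct(61_004_000, 10_000)
tx = utilization_pct(2_867_000, 10_000)
print(f"e0f receive:  {rx:.1f}% of line rate")
print(f"e0f transmit: {tx:.1f}% of line rate")

# The 1,200 MB/s reported for the busy LIFs, compared with ONE 10 Gb/s port.
lif = utilization_pct(1_200_000_000, 10_000)
print(f"1,200 MB/s vs one 10G port: {lif:.0f}%")
```

Under those assumptions, e0f was receiving at about 5% of line rate when the output was captured, while 1,200 MB/s would be roughly 96% of a single 10 Gb/s port. Whether a LIF's traffic is spread over several member ports or stays on one depends on how the interface group distributes flows, so the per-port counters are the ones to watch.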
I recommend that you first zero the stats for all the interfaces and then watch how the error and discard counters increase.
ifstat -z <interface>
Old data could point you in the wrong direction.
Below 0.1% is acceptable according to the article above.
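To make the "check the increments" idea concrete, here is a minimal Python sketch that compares two counter snapshots and reports the discard percentage for just that interval. The function name and the sample numbers are invented for illustration; only the 0.1% threshold comes from this thread.

```python
THRESHOLD_PCT = 0.1  # the "acceptable" limit quoted in this thread


def discard_pct(frames_before, discards_before, frames_after, discards_after):
    """Discards as a percentage of frames received between two snapshots."""
    frames = frames_after - frames_before
    discards = discards_after - discards_before
    if frames <= 0:
        raise ValueError("no frames received between the two snapshots")
    return 100.0 * discards / frames


# Hypothetical example: counters were zeroed, then read again an hour later.
pct = discard_pct(0, 0, 56_000_000, 480)
verdict = "above" if pct > THRESHOLD_PCT else "below"
print(f"{pct:.6f}% discards, {verdict} the {THRESHOLD_PCT}% threshold")
```

Taking the second snapshot during the busiest period is what makes this useful, since a lifetime average can hide short bursts of drops.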
Yes, I already read that link.
My question is, how is this 0.1% calculated? The numerator is "Total discards". What is the denominator in my example?
Total frames.
But zero the counters first...
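Applied to the e0f output posted earlier, the arithmetic could look like the Python sketch below. The helper and the suffix table are assumptions: the k/m/g/t suffixes are treated as decimal multipliers, which may not be exactly how ifstat rounds them, so read the result as an order of magnitude.

```python
# Assumed decimal multipliers for the suffixes ifstat prints.
SUFFIXES = {"k": 1e3, "m": 1e6, "g": 1e9, "t": 1e12}


def parse_counter(text):
    """Convert an ifstat-style counter such as '3352k' or '384g' to a number."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


# RECEIVE counters from the e0f output in this thread.
total_frames = parse_counter("384g")
total_discards = parse_counter("3352k")
lifetime_pct = 100.0 * total_discards / total_frames

# Same ratio from the current rates: Discards/minute vs Frames/second.
frames_per_minute = parse_counter("15608") * 60
recent_pct = 100.0 * parse_counter("8") / frames_per_minute

print(f"lifetime discards: {lifetime_pct:.6f}% of received frames")
print(f"current discards:  {recent_pct:.6f}% of received frames")
```

Both figures come out around 0.0009%, far below 0.1%. The lifetime figure covers 284 days though, so it says little about what happens during a busy hour.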
Also, where are you reading the latency from?
Check whether any of the physical ports in the LACP group are saturated. In large environments it is common for traffic to be balanced unevenly across the member ports, because of the way ONTAP's load balancing works, unless you pay attention to the IP addresses involved in the communication.
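To show why the IP addresses matter, here is a toy Python model of IP-based port selection. It is NOT ONTAP's actual hash: it simply XORs the client and LIF addresses and takes the result modulo the number of ports, and every address and port name other than e0f is made up.

```python
import ipaddress
from collections import Counter


def pick_port(client_ip, lif_ip, ports):
    """Toy IP hash (not ONTAP's): XOR both addresses, modulo the port count."""
    a = int(ipaddress.ip_address(client_ip))
    b = int(ipaddress.ip_address(lif_ip))
    return ports[(a ^ b) % len(ports)]


ports = ["e0e", "e0f", "e0g", "e0h"]  # hypothetical LACP member ports
lif_ip = "10.0.0.50"                  # hypothetical LIF address
# Hypothetical ESXi hosts whose addresses happen to step by 4.
clients = [f"10.0.0.{n}" for n in (10, 14, 18, 22, 26, 30)]

load = Counter(pick_port(c, lif_ip, ports) for c in clients)
for port in ports:
    print(f"{port}: {load[port]} client(s)")
```

In this toy model all six clients land on the same member port while the other three stay idle. That is the kind of imbalance to look for in the per-port counters, even though the real distribution function will differ.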
Did you find what you needed?