We are an NFS datastore shop. Throughput on a handful of LIFs is constantly very high, more than 1,200 MB/s. Most of our datastores are mounted via these LIFs.
However, the rest of the LIFs see only a few hundred MB/s, with fewer datastores mounted.
There seems to be no way in OCUM, or with any command, to tell me whether a LIF is a bottleneck, or which one.
"> statistics lif show -vserver vserver_name" doesn't show any Recv or Sent errors.
How can I tell if a LIF is overloaded and causing a bottleneck, if any?
Starting in ONTAP 9.7, you can.
We’ve had the ability to determine which clients are accessing IP addresses in a cluster (network connections active show), as well as CIFS/SMB session information (cifs session show), but could never get granular information about NFS.
The use cases are varied, but usually fall into:
1) Need to discover who’s using a volume before performing migrations, cutovers, etc.
2) Troubleshooting issues
3) Load distribution
Admin-level command:
cluster::> nfs connected-clients show -node <node> -vserver <vserver>
Which ONTAP version are you running?
Are you using OCUM or AIQUM?
A LIF's limit is the limit of the underlying physical ports. Are you using LACP?
Are you having issues, or do you just want to monitor and be able to tell when a LIF is overloaded?
To answer your questions:
Yes, we do use LACP.
Yes, we do have performance issues, but we don't know where the problem is.
The aggregate seems OK.
I found that the total number of discards is high. This is the case on all 4 ports in the LACP group. Below is an example from one of them.
A certain percentage of drops is acceptable. Can you tell me how to calculate that percentage in my case?
cluster::*> node run -node node-1 ifstat e0f
-- interface e0f (284 days, 21 hours, 36 minutes, 16 seconds) --
Total frames: 384g | Frames/second: 15608 | Total bytes: 1501t
Bytes/second: 61004k | Total errors: 23 | Errors/minute: 0
Total discards: 3352k | Discards/minute: 8 | Multi/broadcast: 715m
Non-primary u/c: 0 | CRC errors: 0 | Runt frames: 23
Long frames: 0 | Length errors: 0 | Alignment errors: 0
No buffer: 1143k | Pause: 0 | Jumbo: 292g
Noproto: 0 | Bus overruns: 2208k | LRO segments: 241g
LRO bytes: 1480t | LRO6 segments: 0 | LRO6 bytes: 0
Bad UDP cksum: 0 | Bad UDP6 cksum: 0 | Bad TCP cksum: 0
Bad TCP6 cksum: 0 | Mcast v6 solicit: 0 | Lagg errors: 2
Lacp errors: 0 | Lacp PDU errors: 0
Total frames: 1086g | Frames/second: 44156 | Total bytes: 70585g
Bytes/second: 2867k | Total errors: 0 | Errors/minute: 0
Total discards: 0 | Queue overflow: 0 | Multi/broadcast: 2454k
Pause: 126k | Jumbo: 833g | Cfg Up to Downs: 0
TSO segments: 0 | TSO bytes: 0 | TSO6 segments: 0
TSO6 bytes: 0 | HW UDP cksums: 0 | HW UDP6 cksums: 0
HW TCP cksums: 0 | HW TCP6 cksums: 0 | Mcast v6 solicit: 0
Lagg drops: 0 | Lagg no buffer: 0 | Lagg no entries: 0
Mcast addresses: 4 | Rx MBuf Sz: 4096
Speed: 10000M | Duplex: full | Flowcontrol: full
Media state: active | Up to downs: 1
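For anyone who wants to work with these counters outside the console, here is a minimal Python sketch that turns one line of the output above into numbers. It assumes the k/m/g/t suffixes are decimal (powers of 1000), which the thread does not confirm, and the function names are made up for illustration.

```python
import re

# Suffix multipliers as they appear in the output above.
# Assumption: decimal units (k = 1,000); the thread does not say.
SUFFIX = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_counter(text):
    """Convert a value such as '3352k' or '384g' to an approximate integer.

    Returns None for non-numeric fields such as 'full' or 'active'.
    """
    match = re.fullmatch(r"(\d+)([kmgt]?)", text.strip().lower())
    if match is None:
        return None
    number, suffix = match.groups()
    return int(number) * SUFFIX[suffix]


def parse_ifstat_line(line):
    """Split 'Name: value | Name: value | ...' into a dict of counters."""
    fields = {}
    for cell in line.split("|"):
        name, sep, value = cell.partition(":")
        if sep:  # skip cells that do not contain a 'name: value' pair
            fields[name.strip()] = parse_counter(value)
    return fields


print(parse_ifstat_line("Total discards: 3352k | Discards/minute: 8 | Multi/broadcast: 715m"))
```

Note that the output repeats names such as "Total frames" in two blocks (the first looks like the receive side and the second like the transmit side), so the lines of each block should be parsed into separate dicts.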
I recommend that you first zero the stats for all the interfaces and then watch how the errors, discards, etc. increment.
ifstat -z <interface>
Old data could point you in the wrong direction.
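If zeroing is not an option, the same idea (look only at what changed during a known window) can be sketched by subtracting two samples. This is a rough Python illustration; the sample values are hypothetical and the function name is invented here.

```python
def counter_deltas(before, after):
    """Return after - before for every counter present in both samples."""
    return {name: after[name] - before[name] for name in before if name in after}


# Hypothetical samples taken some minutes apart (not from the thread).
sample_1 = {"Total frames": 1_000_000, "Total discards": 10}
sample_2 = {"Total frames": 1_900_000, "Total discards": 25}

deltas = counter_deltas(sample_1, sample_2)
print(deltas)  # {'Total frames': 900000, 'Total discards': 15}
```

One caveat: once a counter is abbreviated in the output (for example 384g), small increments disappear in the rounding, so differences only make sense while the values are still printed in full. That is another reason to zero first.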
Below 0.1% is acceptable, according to the article above.
Yes, I already read that link.
My question is, how is this 0.1% calculated? The numerator is "Total discards". What is the denominator in my example?
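A minimal sketch of the arithmetic, assuming the denominator is the total frames received on the same port over the same period. That is an assumption: the thread does not confirm it, and the article's own definition should take precedence. It also assumes that the first block of counters in the output is the receive side and that k and g are decimal suffixes.

```python
def discard_percent(discards, frames):
    """Discards as a percentage of frames over the same period."""
    if frames <= 0:
        raise ValueError("frames must be positive")
    return 100.0 * discards / frames


# Lifetime counters for e0f from the output above (first block):
# Total discards: 3352k, Total frames: 384g.
lifetime = discard_percent(3_352_000, 384_000_000_000)
print(f"lifetime: {lifetime:.5f}%")  # lifetime: 0.00087%

# Current rates from the same block:
# Discards/minute: 8, Frames/second: 15608 (so 15608 * 60 frames per minute).
current = discard_percent(8, 15608 * 60)
print(f"current:  {current:.5f}%")  # current:  0.00085%
```

Under these assumptions both figures are far below 0.1%. Lifetime counters over 284 days can still hide short bursts, so the number is more meaningful when taken over a recent window.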
but zero first...
Also, where are you reading the latency from?
Check whether any of the physical ports in the LACP group is saturated. In large environments it is common to end up with uneven load balancing, because of how ONTAP's load balancing works, if you do not pay attention to the IPs involved in the communication.
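To illustrate why the IPs matter, here is a toy Python sketch of address-based link selection. It does not reproduce ONTAP's actual hash: CRC32 is a stand-in, the addresses are hypothetical (documentation range), and the function names are invented. The only point is that each source/destination pair always lands on the same link, so a small number of busy pairs can pile onto one port.

```python
import ipaddress
import zlib


def link_for_flow(src_ip, dst_ip, n_links):
    """Pick a link index from the address pair (stand-in hash, not ONTAP's)."""
    key = ipaddress.ip_address(src_ip).packed + ipaddress.ip_address(dst_ip).packed
    return zlib.crc32(key) % n_links


def flows_per_link(clients, lifs, n_links):
    """Count how many client/LIF pairs map to each link."""
    counts = {index: 0 for index in range(n_links)}
    for client in clients:
        for lif in lifs:
            counts[link_for_flow(client, lif, n_links)] += 1
    return counts


# Hypothetical hosts and LIF addresses.
clients = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
lifs = ["192.0.2.101", "192.0.2.102"]
print(flows_per_link(clients, lifs, n_links=4))
```

With only a few pairs, the spread across four links is rarely even. Keep in mind that each end of the LACP group chooses the outgoing link for the traffic it sends, so traffic arriving at the storage ports is distributed by the switch's hashing policy.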
Did you find what you needed?