Re: Latency difference between LIFs on same port

MaximRonsse · ‎2020-08-25

Hello!

I've been racking my brain over an issue with a 4-node MetroCluster FC (two FAS8200 nodes per site) running in a production environment.

So I have two LIFs on the same controller, hosted on the same LACP group (tested with both one and two member ports). Accessing volumes through the first LIF, I saw latencies of up to 23ms. Accessing these same volumes on the second LIF, I saw latencies of up to 1ms. I'm measuring against volumes hosted on the same controller, as well as on the other controller, which gives me the same results.

I've done the usual things:

Checked CPU/disk utilization (sysstat -x on all controllers): CPU usage stayed below 25% over a few minutes, disk usage below 70%.
Checked volume latency (qos statistics volume performance): 50µs latency.
Checked port usage through AIQUM: one physical port was using 9% of its 10 Gbps.
Checked for duplicate IPs in the subnet: there are none.

Any help or suggestions are greatly appreciated! If you need the output of any commands or logs, I'll provide them. I already have quite a bit of output, but I left it out to keep the post short.

Thanks!

paul_stejskal · ‎2020-08-26

Where are you measuring the latency and which protocol? The port shouldn't affect the internal ONTAP latency, so I'd say get packet traces and review to make sure there isn't loss or something else.

MaximRonsse · ‎2020-08-27

Thanks for the reply, Paul!

All good questions and suggestions indeed.

Sorry for not mentioning it; I'm seeing the latencies over both NFS (over TCP) and CIFS. Actually, even a simple ping to the LIF shows the higher latencies.

At this point the "how did you measure latency over NFS/CIFS" question is probably not relevant anymore, but I used both AIQUM and strace on Linux (which shows the time it takes to open/list files).
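For anyone who wants to reproduce the LIF-to-LIF comparison from a client without relying on ICMP, here is a minimal sketch that times TCP handshakes against each LIF and summarizes the samples. It assumes the LIFs accept connections on the NFS port (2049); the function names and the interleaved probing order are illustrative choices, not something taken from my actual test setup.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=2.0):
    """Return the time in milliseconds to complete one TCP handshake."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection is closed again as soon as it is established
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms):
    """Return min, median and max of a non-empty list of latency samples."""
    return {
        "min": min(samples_ms),
        "median": statistics.median(samples_ms),
        "max": max(samples_ms),
    }


def compare_lifs(lifs, port=2049, count=20, pause=0.1):
    """Probe each LIF in turn so both see roughly the same network conditions.

    lifs maps a label to an address; port 2049 (NFS) is an assumption.
    """
    samples = {name: [] for name in lifs}
    for _ in range(count):
        for name, addr in lifs.items():
            samples[name].append(tcp_connect_ms(addr, port))
            time.sleep(pause)
    return {name: summarize(vals) for name, vals in samples.items()}
```

Calling compare_lifs({"lif1": "192.0.2.10", "lif2": "192.0.2.11"}), with the real LIF addresses in place of these documentation placeholders, returns one min/median/max summary per LIF. A consistent gap between the two medians would point at the path to the LIF rather than at the volumes behind it.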

I already did a tcpdump on the client's end, which shows no TCP retransmissions occurring. Or would you suggest doing the same on the controller's end?
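As a rough illustration of what "no retransmissions" means when checking a trace, the sketch below counts data segments whose sequence number falls below the highest byte already seen in the same flow. It is a simplified heuristic, not what tcpdump or Wireshark actually implement: it ignores sequence-number wraparound and treats out-of-order delivery as retransmission. The input format (flow id, sequence number, payload length) is a hypothetical pre-parsed form of the capture.

```python
def count_retransmissions(segments):
    """Count data segments that start below the highest byte seen in their flow.

    segments: iterable of (flow_id, seq, payload_len) tuples in capture order.
    This is a heuristic: wraparound and reordering are not handled.
    """
    next_expected = {}  # flow_id -> highest (seq + payload_len) seen so far
    retransmits = 0
    for flow, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data to retransmit
        if flow in next_expected and seq < next_expected[flow]:
            retransmits += 1
        next_expected[flow] = max(next_expected.get(flow, 0), seq + length)
    return retransmits
```

A count of zero on the client side only shows that the client did not resend data; a trace taken on the controller side would be needed to see what the other direction looks like.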

My suspicion is the switch in between, so I'm also investigating that in parallel.

By the way, I'm running 9.7P6 at the moment.

paul_stejskal · ‎2020-08-27

Is a case open? Honestly I'd need to look at the data to know why.

MaximRonsse · ‎2020-08-28

No case is open yet. I guess I'll open one then.

Thanks for the help on here anyway!

paul_stejskal · ‎2020-08-28

Maxim,

Sorry man! Without data it's hard to say what is going on, so I can't confirm. A case will let me or someone on my team (Perf L2, or probably more likely NAS L2) review the data.
