Re: Delta between DFM protocol latency and client statistics

colsen · 2013-05-29

Hello,

I found a few "2nd cousin" (almost related) postings along these lines, but didn't find any answers to exactly what we're looking at. Anyway, as our shop has matured in NetApp technologies, we've gotten more and more focused on performance statistics from both the storage controller and the client perspective. We had an incident just recently where we were running an NFS client that performs network testing and logs its test results to an NFS mount on a NetApp filer. Part of the testing is the latency the client experiences when writing to the NFS mount (since storage I/O is part of the test). For several days it logged ~2-5 millisecond latency (which almost directly matches what we see in DFM on the volume), but it recently logged a 2 second lag and then a 6 second lag in being able to write to its log.
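For anyone who wants to reproduce the client-side number, here is a minimal sketch of this kind of write-latency probe. It is not the actual test tool; the function names, the payload, and the choice to fsync after every write are all assumptions made for illustration.

```python
import os
import tempfile
import time


def timed_write(path, payload=b"latency probe\n"):
    """Append payload to path, flush it to storage, return elapsed seconds."""
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        # Assumption: fsync so the client page cache does not hide the round trip.
        os.fsync(fd)
    finally:
        os.close(fd)
    return time.perf_counter() - start


def probe(path, count=5, interval=0.0):
    """Return a list of per-write latencies in seconds."""
    samples = []
    for _ in range(count):
        samples.append(timed_write(path))
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    # Hypothetical target: replace with a file on the NFS mount under test.
    target = os.path.join(tempfile.mkdtemp(), "probe.log")
    for latency in probe(target, count=5, interval=0.1):
        print(f"{latency * 1000:.2f} ms")
```

A number measured this way includes everything between the application and the disk (client OS, NIC, network, filer), which is why it can be larger than what the controller reports for the same write.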

We looked at the NFS latency on the volume at this time and we see some latency spikes which directly correlate with the client's timestamps, but these spikes are only up to ~15-20 milliseconds or so - certainly not 2-6 full seconds. After this period everything settles down and the NFS latency on the volume matches what the client sees. CPU/disk had an uptick at these times, but there's nothing about the controller performance during this time that would lead us to believe that it was saturated. Client performance data was equally unremarkable.

We have a similar situation with our Exchange environment where LUN read latency will be 5-10 milliseconds but perfmon/SCOM records closer to 25ms from the mailbox side - but this delta seems easily attributable to the various moving parts between the client and the storage.

So my question is two-fold: what are other shops seeing with regard to the difference between client and storage latency numbers - AND - when that difference is large, what are the most likely culprits for exacerbating that difference? We're starting some packet captures on the storage side, client and switch, which we're hoping will help us zero in on where the delay is being introduced, but we just didn't want to start blindly pointing fingers at infrastructure components.

Gracious thanks in advance!

Chris

shaunjurr · 2013-06-05

Hi,

Looks like you have a lot of spare time. Getting down to ms latencies on every piece of equipment would be hard enough in an environment where you had dedicated and isolated equipment every step of the way.

FC latencies are definitely a matter of the "moving parts", as you say: mostly queuing/buffering along the paths, and possibly reordering of commands within switches with multiple paths/ISLs.

NFS latencies are probably more complex given congestion algorithms and the possibility of necessary TCP retransmissions (assuming/hoping you use TCP with NFS). NFS also needs to keep track of the status of files (and parts of files) and directories and has quite a bit of RPC "chatter" depending on the version. Depending on the client, you also have a fair amount of buffering within the OS, not to mention normal TCP window sizing, NFS read/write size options, Ethernet jumbo frames, TCP offloading or other tuning options for the NICs, general TCP stack tuning, and the like. NFS just has a lot more slack built in. It's not always a negative thing, but this resilience has its cost if it needs to be used, i.e. the latency spikes you see.
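To give a feel for how retransmissions alone could reach whole seconds, here is a rough sketch of the arithmetic, assuming the classic behaviour where the retransmission timeout (RTO) doubles after each consecutive loss of the same segment. The starting RTO values below are illustrative assumptions, not measurements from any particular client; the real value depends on the TCP stack and the measured round-trip time.

```python
def cumulative_stall(initial_rto, losses):
    """Seconds spent waiting if one segment is lost `losses` times in a row.

    Assumes the sender waits one RTO before each retransmission and doubles
    the RTO after every timeout (exponential backoff).
    """
    total = 0.0
    rto = initial_rto
    for _ in range(losses):
        total += rto
        rto *= 2
    return total


if __name__ == "__main__":
    # Illustrative starting RTOs in seconds; both are assumptions.
    for initial_rto in (0.2, 1.0):
        for losses in (1, 2, 3):
            stall = cumulative_stall(initial_rto, losses)
            print(f"RTO {initial_rto}s, {losses} loss(es): {stall:.1f}s stalled")
```

Time spent this way sits on the wire and in the client's stack, so it would plausibly show up in the client's number without appearing in the per-operation latency the controller reports. A packet capture should show whether retransmissions actually occurred at those timestamps.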

FC is pretty rigid, but almost simple in comparison, and generally has lower latencies. FC also fails in spectacular ways during congestion scenarios that TCP handles with ease, at least unless one uses a much more complex fabric configuration and strict queue depth policies on the servers. It's like a fine sports car, while NFS is like your reliable jeep.

I wish you luck in your endeavor, but I would be slightly surprised if you succeed. Isolating all of the parts and settings would be a very complex matrix.

S.