
stats - latency volume vs LUN

Recently I've been looking into some performance issues with FC attached LUNs on NLSAS drives.  I found a couple of things that have me a little perplexed...right now I'm just trying to better understand the performance statistics I'm looking at.

First, does anyone know the difference between volume latency and the latency of a LUN inside that volume?  I have some LUNs where, if I use stats to measure latency (or look at it in Performance Advisor), the volume measurements are in the tens of ms while the LUN latencies are in the hundreds of ms.  (Yes, I've properly converted the volume latencies from microseconds.)  Strangely, I don't seem to see this issue on faster disks (15K FC or SAS).  On those, the volume and LUN latencies are always very close.

Second, I've been told the LUN latency statistic (stats show lun:lun_name:avg_latency) on the filer is somehow measuring end-to-end performance.  This is not the way I understood it, but would like some clarification on this.  My understanding is that the latency from the filer is the time it takes to service the I/O from the time the I/O is received by the filer to the time that the response is sent out of the filer. 

Say that I have a ~50ms round trip delay in my FC switching, and on the server-side I measure I/O latency as ~100ms.  I would expect the filer latency to show ~50ms, not 100ms, but again I've been told that I should actually see 100ms on the filer.

Re: stats - latency volume vs LUN

It would be helpful to see the LUN, volume and disk statistics.

One reason for the difference in latency could be a different op size at the volume level and at the protocol level. A real example I encountered during an iSCSI performance issue (only 1 LUN in the volume):

LUN:

read_data: 12115386

read_ops: 61

avg_read_latency: 39.9 ms

Volume:

read_data: 12115386

read_ops: 195

avg_read_latency: 10076 us

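The op sizes are very different at the volume level and at the protocol level: the same amount of data was read with 61 LUN ops but 195 volume ops. When you normalize the two layers, you will see the latencies are roughly in the same range: 10 ms * 195 / 61 = 32 ms per LUN op, against the 39.9 ms measured on the LUN.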
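The normalization above can be written out as a short Python sketch. This is only an illustration using the counter values quoted in this post; the function name and variable names are made up for the example and are not part of the stats tooling.

```python
def normalize_volume_latency_ms(volume_latency_us, volume_ops, lun_ops):
    """Scale the average per-volume-op latency to a per-LUN-op equivalent (ms).

    Assumes both op counts were sampled over the same interval and cover
    the same data, as in the single-LUN volume example above.
    """
    if lun_ops <= 0:
        raise ValueError("lun_ops must be positive")
    return (volume_latency_us / 1000.0) * volume_ops / lun_ops


# Counter values quoted in the post (hypothetical variable names).
read_data_bytes = 12115386
lun_read_ops = 61
volume_read_ops = 195
volume_read_latency_us = 10076

normalized_ms = normalize_volume_latency_ms(
    volume_read_latency_us, volume_read_ops, lun_read_ops
)
avg_lun_op_kb = read_data_bytes / lun_read_ops / 1024
avg_volume_op_kb = read_data_bytes / volume_read_ops / 1024

print(f"normalized volume latency: {normalized_ms:.1f} ms per LUN op")
print(f"average LUN op size:       {avg_lun_op_kb:.0f} KB")
print(f"average volume op size:    {avg_volume_op_kb:.0f} KB")
```

With these numbers the average volume op comes out a little under 64 KB, while the average LUN op is roughly three times larger, which is where the 195 / 61 factor comes from.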

The protocol statistics seem to be end-to-end. The volume stats counter description explicitly mentions that it is without network:

Name: read_latency

Description: Average latency in microseconds for the WAFL filesystem to process read request to the volume; not including request processing or network communication time

Properties: average

Unit: microsec

    Base Name: read_ops

    Base Description: Number of reads per second to the volume

    Base Properties: rate

    Base Unit: per_sec

Re: stats - latency volume vs LUN

I'd pretty much echo what Pascal is saying.

What I was told by NetApp support at one point is that volume I/Os are usually broken up into about 64K I/Os, whereas your LUN I/O size is whatever your client sends.  I had a case where I was sending 1 MB I/Os and seeing the same thing you did.  When I did the math, the volume I/O latency was MUCH less, but there were many MORE I/Os.  It kind of equalized out for me in the end.

The problem I see is that I keep hearing that normalizing this out is a bit of a science, and not many people know exactly how to do it.  What I do is concentrate more on LUN IOPS, since that is what your client sees.  If I'm more concerned about disk contention or back-end performance issues, I may look closer at aggregate stats.  I don't find volume stats very useful unless I don't have a LUN to look at.  Perhaps others will disagree.

Re: stats - latency volume vs LUN

bsti@plex.com wrote:

I don't find volume stats very useful unless I don't have a LUN to look at. 

Agree, unless there is a big latency gap between the LUN and volume latency after normalizing. That could indicate an external problem.

Re: stats - latency volume vs LUN

Good point. 

Re: stats - latency volume vs LUN

Thank you, this appears to be correct!

I also notice that there appear to be additional stats in ONTAP 8.x (only avg_latency is available for the LUN object in 7.3), which is good, as well as updated 'explain' text.  In 7.3 the explanation for read_latency does not include the information about network communication time.  Given that the latency numbers are close for volume and LUN after normalizing, even though the LUN counters don't explicitly say that they aren't measuring network latency, I will assume that they are not.

I've done some testing and also confirmed what bsti@plex was told by support.  Essentially, I/Os at the volume level are always 64 KB, and I/O size at the LUN level is determined by the server.

The discrepancy I was seeing between disk types was a mistake on my part.  It has nothing to do with the disk type (NLSAS vs SAS).  The operations I am doing on SAS are MSSQL data operations, which are 64 KB in size and therefore correlate 1:1 with the volume op size.  The operations I am doing on NLSAS are generally Windows file copy/write operations, which can have op sizes up into the 32 MB range, and thus give an enormous difference between volume and LUN op counts as well as latency.
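To put rough numbers on that, here is a small Python sketch of how many volume ops one LUN op would turn into. It assumes, as a simplification, that every LUN op is split into fixed 64 KB volume ops; the function and constant names are made up for the illustration, and real splitting may not be this exact.

```python
import math

VOLUME_OP_BYTES = 64 * 1024  # assumed fixed volume op size (simplification)


def volume_ops_per_lun_op(lun_op_bytes, volume_op_bytes=VOLUME_OP_BYTES):
    """Estimate how many volume ops a single LUN op of the given size needs."""
    if lun_op_bytes <= 0:
        raise ValueError("lun_op_bytes must be positive")
    return math.ceil(lun_op_bytes / volume_op_bytes)


# Op sizes mentioned in this thread: 64 KB (MSSQL), 1 MB, 32 MB (file copy).
for label, size in [("64 KB", 64 * 1024),
                    ("1 MB", 1024 * 1024),
                    ("32 MB", 32 * 1024 * 1024)]:
    print(f"{label:>5} LUN op -> ~{volume_ops_per_lun_op(size)} volume ops")
```

Under that assumption a 64 KB LUN op maps to a single volume op, while a 32 MB LUN op maps to hundreds of them, so the per-op latencies at the two layers are not directly comparable without normalizing.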