Solved: How to use NetApp harvest to find root cause of high latency?

ACHOU_SIMG · ‎2017-01-04

Hi,

I've got NetApp Harvest working finally... I have three NetApp clusters. Two of them shows latency is low. The 3rd one (FAS8020 HA) shows in "NetApp Details:Volume" dashboard, I see in the upper left it says Average Latency is 85ms. Below it the graph "Top Average Latency" shows top 5 volumes with high latencies which are in between 70ms and 135ms. I then checked the "Top Volume end-to-end qos drilldown" and "Top volume QoS latency from drilldown" sections but see there are only single-digit ms #s listed for various volumes. What's more strange is these volumes in the two sections are totally differnt from the "Top Average Latency" graph in the upper right of the page. Am I missing something?

It'd be nice to be able to drill down directly a volume listed inthe "Top Average Latency" graph. It's much more intuitive that way. Is there such dashboard available?

Thank you.

madden · ‎2017-01-05

Hi @ACHOU_SIMG

I think I can explain what you see but first I have to give some background info. ONTAP tracks IOPs at many different layers; protocol layer, QoS layer, Volume layer, LUN layer, LIF layer, and more.

The volume layer is what I show on the top of the page and it is latency and IOPs from the volume (WAFL) layer lower. Latency includes work by the node that owns the volume to service the request, including queueing for CPU, fetching from disk, etc. The IOPs shown can be IOPs that originate from user workload coming in LIFs, or it can be system generated work like snapmirror or deduplication

The QoS layer shown are IOPs and latency that originate from user workload only. Latency includes time from the moment the IO was accepted by the LIF, and until the completed IO is Ack'd back out the LIF. So this is why I call it 'end-to-end'. It can however include delays that don't point to a storage system bottleneck, such as client network delays or intentional QoS throttling.

With these two layers explained next we have different kinds of IOPs: Read, Write, Other (anything not a read or write, could be from a protocol like a file open, or could be a background iop from snapmirror or deduplication). Further,

Volume counters include tracking of read/write/other IOPs and read/write/other Latency. There is also a total IOPs and avg latency.
QoS counters include read/write IOPs and read/write latency. There is also a total IOPs and avg latency.

And the last bit of background, different workloads will have vastly different IOP mixes. If ONTAP is serving SAN workloads, or NAS for applications/databases/virtualization, then you will see mostly Read and Write operations and very few Other operations. But if ONTAP is serving NAS file share workloads then you often see mostly Other operations, maybe 70% of total, with the remaining 30% mixed between Reads and Writes. So if you look to 'avg latency', for some workloads it will basically be the weighted average of read and write latency, whereas on others it might be mostly metadata IOP latency.

Getting back to your question, on the volume dashboard the volume avg latency is shown in the top graphs. This is an attempt to find a balance between the SAN workloads (mostly read/write) and NAS fileserving workloads (mostly other). In the more detailed breakdowns I tend to focus on read/write, because this is usually the most important for end-users. Also, Other IOPs can include background IOPs which were not issued or visible by the end user. Lastly, because of summarization issues, if the IOP count is very low (less than a hundred per second) the latency reported might be less accurate. Harvest will automatically not send latency if the IOPs < 10 because it is most definiately not representative, but if you have 50 IOPs it might still not be too accurate. This is adjustable by using the 'latency_io_reqd' parameter in the netapp-harvest.conf file; see the admin guide for more info.

I suspect in your situation you have some volumes with few Total IOPs, and the IOPs that do occur are Other IOPs. Because they are Other IOPs they don't show in the read/write in most of the other tables/graphs. And if they are background IOPs (snapmirror, deduplication, etc), they won't show in QoS IOP totals either. I would choose one of the worst volume in the "Top Average Latency" graph at the top of the page and set the template dropdown to only show this volume. Then look at the row "Per volume WAFL layer drilldown" and check the IOPs and Latency charts. If you see few IOPs, and almost all Other IOPs, and with high latency [which will cause avg latency to also be high], then my explanation makes sense. If you want to reduce these 'false positives' you could adjust the 'latency_io_reqd' I mentioned earlier.

I also have some more discussion on counters here that you might find helpful.

Regarding your second question about "[2017-01-04 12:18:02] [WARNING] No counter metadata found for: [nfsv3:node][latency]; check if valid counter for this DOT release", this looks to be a bug on my side. This counter (and also avg_read_latency, avg_write_latency) were introduced in 8.3.1 so they should not be in the collection template for 8.3.0. This warning will do no harm other than make for a messy logfile. To resolve you can either ignore the warnings, remove those counters from the template/default/cdot-8.3.0.conf file, or update your cluster to a more recent release. An upgrade from 8.3.0 to the latest in the 8.3 family would be my recommendation.

Cheers,
Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

View solution in original post

ACHOU_SIMG · ‎2017-01-04

I just notice on the harvest server, the log file for the cluster shows repeatedly:

[2017-01-04 12:18:02] [WARNING] No counter metadata found for: [nfsv3:node][latency]; check if valid counter for this DOT release
[2017-01-04 12:18:02] [WARNING] No counter metadata found for: [nfsv3:node][latency]; check if valid counter for this DOT release

The cluster runs on cDOT 8.3GA. NetApp Harvest version 1.3.

Thanks,

madden · ‎2017-01-05

Hi @ACHOU_SIMG

I think I can explain what you see but first I have to give some background info. ONTAP tracks IOPs at many different layers; protocol layer, QoS layer, Volume layer, LUN layer, LIF layer, and more.

The volume layer is what I show on the top of the page and it is latency and IOPs from the volume (WAFL) layer lower. Latency includes work by the node that owns the volume to service the request, including queueing for CPU, fetching from disk, etc. The IOPs shown can be IOPs that originate from user workload coming in LIFs, or it can be system generated work like snapmirror or deduplication

The QoS layer shown are IOPs and latency that originate from user workload only. Latency includes time from the moment the IO was accepted by the LIF, and until the completed IO is Ack'd back out the LIF. So this is why I call it 'end-to-end'. It can however include delays that don't point to a storage system bottleneck, such as client network delays or intentional QoS throttling.

With these two layers explained next we have different kinds of IOPs: Read, Write, Other (anything not a read or write, could be from a protocol like a file open, or could be a background iop from snapmirror or deduplication). Further,

Volume counters include tracking of read/write/other IOPs and read/write/other Latency. There is also a total IOPs and avg latency.
QoS counters include read/write IOPs and read/write latency. There is also a total IOPs and avg latency.

And the last bit of background, different workloads will have vastly different IOP mixes. If ONTAP is serving SAN workloads, or NAS for applications/databases/virtualization, then you will see mostly Read and Write operations and very few Other operations. But if ONTAP is serving NAS file share workloads then you often see mostly Other operations, maybe 70% of total, with the remaining 30% mixed between Reads and Writes. So if you look to 'avg latency', for some workloads it will basically be the weighted average of read and write latency, whereas on others it might be mostly metadata IOP latency.

Getting back to your question, on the volume dashboard the volume avg latency is shown in the top graphs. This is an attempt to find a balance between the SAN workloads (mostly read/write) and NAS fileserving workloads (mostly other). In the more detailed breakdowns I tend to focus on read/write, because this is usually the most important for end-users. Also, Other IOPs can include background IOPs which were not issued or visible by the end user. Lastly, because of summarization issues, if the IOP count is very low (less than a hundred per second) the latency reported might be less accurate. Harvest will automatically not send latency if the IOPs < 10 because it is most definiately not representative, but if you have 50 IOPs it might still not be too accurate. This is adjustable by using the 'latency_io_reqd' parameter in the netapp-harvest.conf file; see the admin guide for more info.

I suspect in your situation you have some volumes with few Total IOPs, and the IOPs that do occur are Other IOPs. Because they are Other IOPs they don't show in the read/write in most of the other tables/graphs. And if they are background IOPs (snapmirror, deduplication, etc), they won't show in QoS IOP totals either. I would choose one of the worst volume in the "Top Average Latency" graph at the top of the page and set the template dropdown to only show this volume. Then look at the row "Per volume WAFL layer drilldown" and check the IOPs and Latency charts. If you see few IOPs, and almost all Other IOPs, and with high latency [which will cause avg latency to also be high], then my explanation makes sense. If you want to reduce these 'false positives' you could adjust the 'latency_io_reqd' I mentioned earlier.

I also have some more discussion on counters here that you might find helpful.

Regarding your second question about "[2017-01-04 12:18:02] [WARNING] No counter metadata found for: [nfsv3:node][latency]; check if valid counter for this DOT release", this looks to be a bug on my side. This counter (and also avg_read_latency, avg_write_latency) were introduced in 8.3.1 so they should not be in the collection template for 8.3.0. This warning will do no harm other than make for a messy logfile. To resolve you can either ignore the warnings, remove those counters from the template/default/cdot-8.3.0.conf file, or update your cluster to a more recent release. An upgrade from 8.3.0 to the latest in the 8.3 family would be my recommendation.

Cheers,
Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

ACHOU_SIMG · ‎2017-01-05

Hi Chris,

Thank you SO much for the background information and detailed explanation! Indeed those high latency NAS volumes have very few IOPS so your explanation makes perfect sense.

Just to clarify on the "other latency:"

* Does it also include file metadata/attributes lookups? In our NAS environment, we often see high percentage of metadata lookups.

* Is there a way to show rough breakdown for snapmirroring, dedup, and else within the "other latency"? I ask because we evaluated Solarwinds last year and it turned out the "other latency" was so significant it skews the results.We could not figure a way to further break down in order to identify what type of operations caused the latency.

Thank you!

madden · ‎2017-01-05

Hi @ACHOU_SIMG,

>>* Does it also include file metadata/attributes lookups? In our NAS environment, we often see high percentage of metadata lookups.

Yes, it does. Usually metadata calls have very low latency, but sometimes not. For example if you had directories with millions of files in them the latency of those IOPs might be higher and would indeed be felt by the end-user.

>>* Is there a way to show rough breakdown for snapmirroring, dedup, and else within the "other latency"? I ask because we evaluated Solarwinds last year and it turned out the "other latency" was so significant it skews the results.We could not figure a way to further break down in order to identify what type of operations caused the latency.

No. The volume counters Other IOPs/latency give you no insight to what kind of IOP it was. This is one reason I don't show Other IOPs or Latency in many of the dashboard graphs. My suggestion would instead be to look at the protocol latency views and see which metadata IOPs have high latency, and if the count of those same IOPs is high, or has bad enough latency to push overall protocol latency up, investigate. If you start at protocol view (either SVM or node view) to identify an incident you could then use the 'related dashboards' feature (preserves timespan and template picks) to navigate over to the volume view and try to make a correlation. It's not ideal, but since we don't have vol counters per protocol IOP type its the best we can do.

Glad to help, especially when a thorough and thoughtful question comes in!

Cheers,
Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

How to use NetApp harvest to find root cause of high latency?

Sign-up for Software Release Notifications!

Join us on Discord