Hi @ACHOU_SIMG
I think I can explain what you see, but first I have to give some background info. ONTAP tracks IOPs at many different layers: the protocol layer, QoS layer, volume layer, LUN layer, LIF layer, and more.
The volume layer is what I show at the top of the page; it reports latency and IOPs from the volume (WAFL) layer and lower. Latency includes the work done by the node that owns the volume to service the request, including queueing for CPU, fetching from disk, etc. The IOPs shown can originate from user workload coming in through the LIFs, or they can be system-generated work like snapmirror or deduplication.
The QoS layer shows IOPs and latency that originate from user workload only. Latency covers the time from the moment the IO is accepted on the LIF until the completed IO is acknowledged back out the LIF, which is why I call it 'end-to-end'. It can, however, include delays that don't point to a storage system bottleneck, such as client network delays or intentional QoS throttling.
With those two layers explained, next we have the different kinds of IOPs: Read, Write, and Other (anything that is not a read or write; it could be a protocol operation like a file open, or a background IOP from snapmirror or deduplication). Further:
- Volume counters track read/write/other IOPs and read/write/other latency. There are also total IOPs and avg latency counters.
- QoS counters track read/write IOPs and read/write latency. There are also total IOPs and avg latency counters.
And the last bit of background: different workloads will have vastly different IOP mixes. If ONTAP is serving SAN workloads, or NAS for applications/databases/virtualization, then you will see mostly Read and Write operations and very few Other operations. But if ONTAP is serving NAS file share workloads, you will often see mostly Other operations, maybe 70% of the total, with the remaining 30% split between Reads and Writes. So if you look at 'avg latency', for some workloads it will basically be the weighted average of read and write latency, whereas for others it will mostly reflect metadata IOP latency.
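To make that weighted average concrete, here is a minimal sketch with made-up numbers (not from any real system) showing how the Other ops can dominate a volume's avg latency on a fileserver-type mix:

    # Made-up example: ~70% Other ops, as in a typical NAS file share workload
    read_ops, write_ops, other_ops = 150, 150, 700      # ops per second
    read_lat, write_lat, other_lat = 0.5, 1.0, 4.0      # ms per op
    total_ops = read_ops + write_ops + other_ops
    avg_lat = (read_ops * read_lat + write_ops * write_lat + other_ops * other_lat) / total_ops
    print(avg_lat)   # 3.025 ms -- driven almost entirely by the Other (metadata) latency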
Getting back to your question: on the volume dashboard, the volume avg latency is shown in the top graphs. This is an attempt to strike a balance between SAN workloads (mostly read/write) and NAS fileserving workloads (mostly other). In the more detailed breakdowns I tend to focus on read/write, because these are usually what matters most to end users. Also, Other IOPs can include background IOPs that were never issued by, or visible to, the end user. Lastly, because of summarization issues, if the IOP count is very low (less than a hundred per second) the reported latency might be less accurate. Harvest automatically withholds latency when IOPs < 10 because at that rate it is definitely not representative, but even at 50 IOPs it might still not be very accurate. This is adjustable with the 'latency_io_reqd' parameter in the netapp-harvest.conf file; see the admin guide for more info.
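As a hedged illustration only (please check the admin guide for the exact section this belongs in and the valid range), raising that threshold in netapp-harvest.conf might look like:

    latency_io_reqd = 100

With that in place, latency for a volume would only be submitted when it is doing at least 100 IOPs, which would hide the noisy values from these nearly idle volumes.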
I suspect that in your situation you have some volumes with few total IOPs, and the IOPs that do occur are Other IOPs. Because they are Other IOPs they don't show up as read/write in most of the other tables/graphs, and if they are background IOPs (snapmirror, deduplication, etc.) they won't show in the QoS IOP totals either. I would pick one of the worst volumes in the "Top Average Latency" graph at the top of the page and set the template dropdown to show only that volume. Then look at the row "Per volume WAFL layer drilldown" and check the IOPs and Latency charts. If you see few IOPs, almost all of them Other IOPs, and with high latency (which will pull the avg latency up as well), then my explanation fits. If you want to reduce these 'false positives' you could raise the 'latency_io_reqd' setting I mentioned earlier.
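If it helps to see the pattern spelled out, here is a small hypothetical sketch (the function and thresholds are mine for illustration, not anything Harvest ships) of the combination I'd look for in those charts:

    # Hypothetical helper: thresholds are illustrative, not Harvest defaults
    def looks_like_quiet_metadata_volume(read_ops, write_ops, other_ops, avg_latency_ms):
        total = read_ops + write_ops + other_ops
        if total == 0:
            return False
        mostly_other = other_ops / total >= 0.9     # almost all Other IOPs
        return total < 100 and mostly_other and avg_latency_ms > 10.0

    # e.g. 2 reads/s, 1 write/s, 45 other/s at 25 ms avg latency -> True
    print(looks_like_quiet_metadata_volume(2, 1, 45, 25.0))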
I also have some more discussion on counters here that you might find helpful.
Regarding your second question about "[2017-01-04 12:18:02] [WARNING] No counter metadata found for: [nfsv3:node][latency]; check if valid counter for this DOT release", this looks to be a bug on my side. This counter (along with avg_read_latency and avg_write_latency) was introduced in 8.3.1, so it should not be in the collection template for 8.3.0. The warning does no harm other than making for a messy logfile. To resolve it you can ignore the warnings, remove those counters from the template/default/cdot-8.3.0.conf file, or update your cluster to a more recent release. An upgrade from 8.3.0 to the latest release in the 8.3 family would be my recommendation.
Cheers,
Chris Madden
Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)
Blog: It all begins with data
If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!