Accepted Solution

difference of "qos_ops" and "total_ops"

I'm using OnCommand Performance Manager and NetApp Harvest to watch performance,

and I found two metrics for total IOPS: "qos_ops" and "total_ops".

 

But sometimes qos_ops is different from total_ops!

When I limit a volume to 1,000 IOPS, qos_ops shows 1,000 IOPS but total_ops is over 1,000 (e.g. 1,600 IOPS).

 

What is the difference between "qos_ops" and "total_ops"?

 

Thank you.

 

cap.JPG

Re: difference of "qos_ops" and "total_ops"

Hi @paso,

 

Good question!  total_ops tracks WAFL operations, while qos_ops tracks protocol operations.  WAFL IOs have a maximum size of 64KB, but client IOs can be much larger, so if a 128KB protocol IO arrives it causes 2 x 64KB WAFL IOs and you will see the counters deviate.  Both the qos and the normal volume counters can be interesting, depending on your use case.  The qos_ counters measure from the frontend client perspective: they can include network delay from a lossy network or a busy host, latency introduced intentionally by QoS throttles, and they count per client IO, which for big IOs can confuse things.  The non-qos_ volume counters measure within WAFL, are capped at 64KB per IO, and tell you about the health of the storage system.
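To make the counter deviation concrete, here is a small sketch (illustrative only; the 64KB cap is the WAFL IO size described above, and the function name is made up):

```python
import math

# Illustrative sketch: how one client (protocol) IO maps to WAFL IOs,
# assuming the 64KB WAFL IO size cap described above.
WAFL_MAX_IO_KB = 64

def wafl_ops_for(client_io_kb):
    """WAFL ops generated by a single client IO of the given size (KB)."""
    return max(1, math.ceil(client_io_kb / WAFL_MAX_IO_KB))

print(wafl_ops_for(128))  # 2: one 128KB qos op becomes two 64KB WAFL ops
print(wafl_ops_for(4))    # 1: small IOs map one-to-one
```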

 

 

 

Hope that helps and if you have a follow-on question fire away!

 

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

 

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO

 

Re: difference of "qos_ops" and "total_ops"

@madden

Thank you for your answer!

And sorry for my poor English.

 

I understand now: we should watch the non-qos_ values if we want to know system performance,
and if our customer wants to know their volume performance, we should check the qos_ values.

 

But I don't know of NFS operations with an over-64KB block size. (We don't use any protocol except NFS.)
For instance, can "other ops" have a big block size?

Re: difference of "qos_ops" and "total_ops"

Hi @paso

 

Yes, the qos_ values will be closer to what the customer sees, while the non-qos values show what happens internally.

 

For latency, the qos_ counters can include network latency, QoS throttling, and CPU time on the frontend node that owns the LIF, and they are measured per client IO.  The non-qos latency counters measure inside the system: the backend node CPU that owns the volume and the wait time for accessing the disk.  So qos latency should always be higher than the non-qos latency, but usually only by a little.

 

For ops, the qos_ counters take the client perspective as well, so 5 x 256KB ops will count as 5 ops.  The non-qos counters use the internal maximum of 64KB, so the same workload would count as 20 ops.

 

For NFSv4 (and 4.1) other ops, I can also imagine they might differ because NFSv4 has the concept of compound operations.  A compound operation allows the client to batch ops: in NFSv3, to read a file you might send LOOKUP, GETATTR, OPEN, SETATTR, and CLOSE operations individually, but in NFSv4 these could be batched into a single compound operation.  So qos might report 1, while non-qos might report 5.  I'm not sure about this one, but it could be.  Do you see the difference in op count for other ops?
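If that batching theory holds, the accounting difference would look like this (a hypothetical sketch; the op list and the 1-vs-5 counting are illustrative, not confirmed ONTAP behavior):

```python
# Hypothetical sketch of the NFSv4 compound-op accounting guessed above;
# op names and counts are illustrative, not confirmed ONTAP behavior.
compound = ["LOOKUP", "GETATTR", "OPEN", "SETATTR", "CLOSE"]

qos_other_ops = 1                  # the client sent one compound request
non_qos_other_ops = len(compound)  # the backend may count each batched op

print(qos_other_ops, non_qos_other_ops)  # 1 5
```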

 

 

Cheers,
Chris Madden


 

Re: difference of "qos_ops" and "total_ops"

@madden

 

Thank you very much.

I learned a lot.

 

>but in NFSv4 these could be batched in a single compound operation.

But we use NFSv3.

 

>I'm not sure on this one but could be.  Do you see the difference in IOP count for other ops?  

Yes, and I found detailed data.

 

I suspect that 'getattr', 'lookup_total', or 'remove' ops have double the WAFL count ;)

 

thank you

 

cap2.png

Re: difference of "qos_ops" and "total_ops"

Hi @paso

 

I think volume 'other' ops might also include backend ops from things like deduplication, SnapMirror, reallocation, etc.  Were there any other activities on the volume that you can correlate with the timing?

 

Cheers,
Chris Madden


 

Re: difference of "qos_ops" and "total_ops"

>Were there any other activities on the volume that you can correlate with the timing?

No, there were not. We don't use dedupe, SnapMirror runs at midnight, and this graph is the same every day.

How can I check these system activities?

 

 

Sometimes a customer generating many other IOPS says the volume is not delivering performance... :(

 

 

Re: difference of "qos_ops" and "total_ops"


Hi @paso

 

Unfortunately, the 'other_ops' bucket is a catch-all counter regardless of requester.  There are other counters at the volume level, like 'nfs_other_ops', that could be used to see whether these ops are caused by NFS, and the same exists for the other protocols, but if the work is not protocol related then we don't have a further breakdown.

 

One feature I really like in cDOT is the set of views enabled by the QoS counters, which are labeled "latency_from_*" in Grafana.  If you have a performance problem on a single volume, I would check the "NetApp Detail: Volume" dashboard, pick the group/cluster/svm/volume from the template, and then look at the following row:

qos_1.png

 

From here you will see the overall workload characteristics (throughput, IOPS, latency), understand its use of cache, and see which layer the latency is coming from.

 

Zooming in on the Latency Breakdown:

qos_2.png

 

Each graph shows the average latency breakdown for IOs by component in the data path, where the 'from' is:

• Network: latency from the network outside NetApp, like waiting on vscan for NAS, or SCSI XFER_RDY for SAN (which includes network and host delay; a write is the example here)
• Throttle: latency from a QoS throttle
• Frontend: latency to unpack/pack the protocol layer and translate to/from cluster messages, occurring on the node that owns the LIF
• Cluster: latency from sending data over the cluster interconnect (the 'latency cost' of indirect IO)
• Backend: latency from the WAFL layer to process the message on the node that owns the volume
• Disk: latency from HDD/SSD access
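As a rough mental model (the component names come from the list above; the numbers are hypothetical), the client-observed qos latency is approximately the sum of these per-component contributions:

```python
# Hypothetical per-IO latency contributions in ms, one entry per
# latency_from_* component described above.
latency_from_ms = {
    "network":  0.9,   # outside NetApp: client/host/network delay
    "throttle": 0.0,   # QoS policy throttling
    "frontend": 0.2,   # protocol handling on the LIF-owning node
    "cluster":  0.1,   # cluster interconnect (indirect IO)
    "backend":  0.3,   # WAFL processing on the volume-owning node
    "disk":     1.5,   # HDD/SSD access
}

total_ms = sum(latency_from_ms.values())
worst = max(latency_from_ms, key=latency_from_ms.get)
print(f"approx. client latency: {total_ms:.1f} ms, largest: {worst}")
```

The largest contributor is where to start investigating, which is exactly how the dashboard row is meant to be read.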

 

So I can see that at times I had nearly 30ms average latency (around 06:00), and at that moment the largest contributor was network.  Maybe doing the same for your trouble volume will give some direction on where to investigate.  If you still need to go deeper, the next step would be to collect a perfstat and open a support case asking the question and referencing the perfstat.  For the perfstat I would recommend 3 iterations of 5 minutes each while the problem is occurring.  Perfstat has more diagnostic info than Harvest collects.

 

Hope this helps!

 

Cheers,
Chris Madden


 

Re: difference of "qos_ops" and "total_ops"

Thank you for your reply!

I checked the Latency Breakdown in Grafana
and understood that the latency comes from the QoS limit.


I like Harvest & Grafana.
It's useful for maintaining the reliability of our cloud service.

thank you very much!

Re: difference of "qos_ops" and "total_ops"

Hi Chris

 

First of all, great tool! I have been using it to monitor our client's environment with petabytes of NetApp storage. Very informative.

 

 

Re: the latency_from_xxxx counters on the volume "QoS latency from" drilldown: we're on cDOT 8.3.  I can see these are collected from the workload_details_volume section in the template, and the counters are literally latency_from_xxxx.  However, I don't see those counters in the ONTAP statistics catalog show counter -object ... output. Or did I miss something obvious?

 

My question is: which counters are used to calculate each section?

In our environment, we don't get any value for latency_from_network. There is no data in Graphite, hence nothing is selected in the Metrics panel of the Grafana dashboard.
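One quick way to confirm whether a series simply has no datapoints in Graphite is to inspect the /render JSON directly. A minimal sketch (the response shape is standard Graphite; the metric names below are made up):

```python
import json

def series_has_data(render_json):
    """Given the body of a Graphite /render?format=json response,
    return True if any series contains a non-null datapoint."""
    return any(value is not None
               for series in json.loads(render_json)
               for value, _ts in series["datapoints"])

# Simulated responses (hypothetical metric names):
all_null  = '[{"target": "latency_from_network", "datapoints": [[null, 1], [null, 2]]}]'
populated = '[{"target": "latency_from_disk", "datapoints": [[0.4, 1], [null, 2]]}]'

print(series_has_data(all_null))   # False: series exists but has no values
print(series_has_data(populated))  # True
```

A metric that Harvest never submits will return no series at all, which this check also reports as False.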

 

Thanks

Lisa