
Difference between "qos_ops" and "total_ops"

paso

I'm using OnCommand Performance Manager and NetApp Harvest to watch performance.

I found two metrics for total IOPS: "qos_ops" and "total_ops".

 

But sometimes qos_ops is different from total_ops!

When I limit a volume to 1000 IOPS, qos_ops shows 1000 IOPS, but total_ops goes over 1000 (e.g. 1600 IOPS).
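For reference, this is roughly how I pull the two series side by side from the Graphite backend that Harvest writes to. The host name and metric paths below are only placeholders for my setup, not the real paths:

import requests

GRAPHITE = "http://graphite.example.com"   # placeholder Graphite host
# Placeholder metric paths -- adjust to the tree your Harvest poller writes
TOTAL_OPS = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1.total_ops"
QOS_OPS   = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1.qos_ops"

def fetch(target, window="-1h"):
    """Return {timestamp: value} for one Graphite series over the window."""
    resp = requests.get(f"{GRAPHITE}/render",
                        params={"target": target, "from": window, "format": "json"})
    resp.raise_for_status()
    data = resp.json()
    if not data:
        return {}
    # Graphite returns datapoints as [value, timestamp] pairs
    return {ts: v for v, ts in data[0]["datapoints"] if v is not None}

total = fetch(TOTAL_OPS)
qos = fetch(QOS_OPS)

# Show the samples where the two counters disagree noticeably
for ts in sorted(total.keys() & qos.keys()):
    diff = total[ts] - qos[ts]
    if abs(diff) > 100:   # arbitrary threshold in ops/s
        print(f"{ts}: total_ops={total[ts]:.0f}  qos_ops={qos[ts]:.0f}  diff={diff:+.0f}")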

 

What is the difference between "qos_ops" and "total_ops"?

 

thank you.

 

cap.JPG

1 ACCEPTED SOLUTION

madden

Hi @paso

 

Unfortunately the 'other_ops' bucket is a catch-all counter regardless of requester.  There are other counters at the volume level, like 'nfs_other_ops', that could be used to see if these are caused by NFS, and the same exists for the other protocols, but if the work is not protocol-related then we don't have a further breakdown.
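If you want a quick check outside Grafana, here is a rough sketch that averages 'other_ops' next to the per-protocol 'other' counters for one volume, straight from the Graphite backend Harvest writes to. The host name, metric paths, and the exact per-protocol counter names below are placeholders/assumptions, so adjust them for your environment and ONTAP release:

import requests

GRAPHITE = "http://graphite.example.com"                    # placeholder
VOLUME = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1"  # placeholder volume path

# other_ops plus the per-protocol 'other' counters; names may differ per ONTAP release
COUNTERS = ["other_ops", "nfs_other_ops", "cifs_other_ops",
            "fcp_other_ops", "iscsi_other_ops"]

def average(target, window="-1h"):
    """Average value of one Graphite series over the window, ignoring nulls."""
    r = requests.get(f"{GRAPHITE}/render",
                     params={"target": target, "from": window, "format": "json"})
    r.raise_for_status()
    data = r.json()
    points = [v for v, _ in data[0]["datapoints"] if v is not None] if data else []
    return sum(points) / len(points) if points else 0.0

for counter in COUNTERS:
    print(f"{counter:>16}: {average(f'{VOLUME}.{counter}'):8.1f} ops/s")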

 

One feature I really like in cDOT is the set of views enabled by the QoS counters, which are labeled "latency_from_*" in Grafana.  If you have a performance problem on a single volume I would check the "NetApp Detail: Volume" dashboard, pick the group/cluster/SVM/volume from the template dropdowns, and then look at the following row:

qos_1.png

 

From here you will see the overall workload characteristics (throughput, IOPS, latency), understand its use of cache, and see which layer the latency is coming from.
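If you are unsure which of these series your poller actually publishes for a volume, you can ask Graphite directly via its /metrics/find endpoint. A small sketch, again with a placeholder host and path:

import requests

GRAPHITE = "http://graphite.example.com"                    # placeholder
VOLUME = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1"  # placeholder volume path

# Graphite's /metrics/find endpoint lists the children under a path
r = requests.get(f"{GRAPHITE}/metrics/find",
                 params={"query": f"{VOLUME}.*", "format": "json"})
r.raise_for_status()

for node in r.json():
    if node.get("leaf"):
        print(node["text"])   # e.g. total_ops, read_ops, the latency_from_* series, ...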

 

Zooming in on the Latency Breakdown:

qos_2.png

 

Each graph shows the average latency breakdown for IOs by component in the data path, where the 'from' is:

• Network: latency from the network outside NetApp, like waiting on vscan for NAS, or SCSI XFER_RDY for SAN (which includes network and host delay; the example here is a write)
• Throttle: latency from the QoS throttle
• Frontend: latency to unpack/pack the protocol layer and translate to/from cluster messages, occurring on the node that owns the LIF
• Cluster: latency from sending data over the cluster interconnect (the 'latency cost' of indirect IO)
• Backend: latency from the WAFL layer to process the message on the node that owns the volume
• Disk: latency from HDD/SSD access

 

So I can see that at times I had nearly 30ms average latency (around 06:00), and at that moment the largest contributor was the network.  Maybe doing the same for your trouble volume will yield some direction on where to investigate.  If you still need to go deeper, the next step would be to collect a perfstat and open a support case asking the question and referencing the perfstat.  For the perfstat I would recommend 3 iterations of 5 minutes each while the problem is occurring.  Perfstat has more diagnostic info than Harvest collects.
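If you would rather do that check outside Grafana, here is a rough sketch along the same lines that finds the worst sample and its largest latency contributor. The host, the metric path, and the latency_from_* leaf names are placeholders/assumptions, so map them to whatever your Harvest tree actually uses:

import requests

GRAPHITE = "http://graphite.example.com"                    # placeholder
VOLUME = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1"  # placeholder volume path

# Assumed leaf names for the per-component latency series described above
COMPONENTS = ["latency_from_network", "latency_from_throttle",
              "latency_from_frontend", "latency_from_cluster",
              "latency_from_backend", "latency_from_disk"]

def series(target, window="-24h"):
    """Return {timestamp: value} for one Graphite series over the window."""
    r = requests.get(f"{GRAPHITE}/render",
                     params={"target": target, "from": window, "format": "json"})
    r.raise_for_status()
    data = r.json()
    return {ts: v for v, ts in data[0]["datapoints"] if v is not None} if data else {}

per_component = {c: series(f"{VOLUME}.{c}") for c in COMPONENTS}

# Find the sample with the highest summed latency and its largest contributor
worst_ts, worst_total, worst_component = None, 0.0, None
for ts in set().union(*(s.keys() for s in per_component.values())):
    values = {c: s.get(ts, 0.0) for c, s in per_component.items()}
    total = sum(values.values())
    if total > worst_total:
        worst_ts, worst_total = ts, total
        worst_component = max(values, key=values.get)

# Units are whatever Harvest publishes for these series (typically ms or us)
print(f"peak latency {worst_total:.1f} at t={worst_ts}, largest contributor: {worst_component}")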

 

Hope this helps!

 

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

 

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO

 
