
Difference between "qos_ops" and "total_ops"

paso

I'm using OnCommand Performance Manager and NetApp Harvest to watch performance.

I found two metrics for total IOPS: "qos_ops" and "total_ops".

 

But sometimes qos_ops is different from total_ops!

When I limit a volume to 1000 IOPS, qos_ops shows 1000 IOPS, but total_ops goes over 1000 (e.g. 1600 IOPS).
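For reference, this is roughly how I pull the two series side by side from the Graphite backend that Harvest writes to. The host name and metric paths below are only placeholders for my setup, not the real paths:

import requests

GRAPHITE = "http://graphite.example.com"   # placeholder Graphite host
# Placeholder metric paths -- adjust to the tree your Harvest poller writes
TOTAL_OPS = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1.total_ops"
QOS_OPS   = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1.qos_ops"

def fetch(target, window="-1h"):
    """Return {timestamp: value} for one Graphite series over the window."""
    resp = requests.get(f"{GRAPHITE}/render",
                        params={"target": target, "from": window, "format": "json"})
    resp.raise_for_status()
    data = resp.json()
    if not data:
        return {}
    # Graphite returns datapoints as [value, timestamp] pairs
    return {ts: v for v, ts in data[0]["datapoints"] if v is not None}

total = fetch(TOTAL_OPS)
qos = fetch(QOS_OPS)

# Show the samples where the two counters disagree noticeably
for ts in sorted(total.keys() & qos.keys()):
    diff = total[ts] - qos[ts]
    if abs(diff) > 100:   # arbitrary threshold in ops/s
        print(f"{ts}: total_ops={total[ts]:.0f}  qos_ops={qos[ts]:.0f}  diff={diff:+.0f}")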

 

What is the difference between "qos_ops" and "total_ops"?

 

thank you.

 

cap.JPG

1 ACCEPTED SOLUTION

madden

Hi @paso

 

Unfortunately the 'other_ops' bucket is a catch-all counter regardless of requester.  There are other counters at the volume level, like 'nfs_other_ops', that could be used to see if these are caused by NFS, and the same exists for the other protocols, but if the work is not protocol-related then we don't have a further breakdown.
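If you want a quick check outside Grafana, here is a rough sketch that averages 'other_ops' next to the per-protocol 'other' counters for one volume, straight from the Graphite backend Harvest writes to. The host name, metric paths, and the exact per-protocol counter names below are placeholders/assumptions, so adjust them for your environment and ONTAP release:

import requests

GRAPHITE = "http://graphite.example.com"                    # placeholder
VOLUME = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1"  # placeholder volume path

# other_ops plus the per-protocol 'other' counters; names may differ per ONTAP release
COUNTERS = ["other_ops", "nfs_other_ops", "cifs_other_ops",
            "fcp_other_ops", "iscsi_other_ops"]

def average(target, window="-1h"):
    """Average value of one Graphite series over the window, ignoring nulls."""
    r = requests.get(f"{GRAPHITE}/render",
                     params={"target": target, "from": window, "format": "json"})
    r.raise_for_status()
    data = r.json()
    points = [v for v, _ in data[0]["datapoints"] if v is not None] if data else []
    return sum(points) / len(points) if points else 0.0

for counter in COUNTERS:
    print(f"{counter:>16}: {average(f'{VOLUME}.{counter}'):8.1f} ops/s")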

 

One feature I really like in cDOT is the set of views enabled by the QoS counters, which are labeled "latency_from_*" in Grafana.  If you have a performance problem on a single volume I would check the "NetApp Detail: Volume" dashboard, pick the group/cluster/SVM/volume from the template dropdowns, and then look at the following row:

qos_1.png

 

From here you will see the overall workload characteristics (throughput, IOPS, latency), understand its use of cache, and see which layer the latency is coming from.
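If you are unsure which of these series your poller actually publishes for a volume, you can ask Graphite directly via its /metrics/find endpoint. A small sketch, again with a placeholder host and path:

import requests

GRAPHITE = "http://graphite.example.com"                    # placeholder
VOLUME = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1"  # placeholder volume path

# Graphite's /metrics/find endpoint lists the children under a path
r = requests.get(f"{GRAPHITE}/metrics/find",
                 params={"query": f"{VOLUME}.*", "format": "json"})
r.raise_for_status()

for node in r.json():
    if node.get("leaf"):
        print(node["text"])   # e.g. total_ops, read_ops, the latency_from_* series, ...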

 

Zooming in on the Latency Breakdown:

qos_2.png

 

Each graph shows the average latency breakdown for IOs by component in the data path, where the 'from' is:

• Network: latency from the network outside NetApp, like waiting on vscan for NAS, or SCSI XFER_RDY for SAN (which includes network and host delay; the example here is a write)
• Throttle: latency from the QoS throttle
• Frontend: latency to unpack/pack the protocol layer and translate to/from cluster messages, occurring on the node that owns the LIF
• Cluster: latency from sending data over the cluster interconnect (the 'latency cost' of indirect IO)
• Backend: latency from the WAFL layer to process the message on the node that owns the volume
• Disk: latency from HDD/SSD access

 

So I can see that at times I had nearly 30ms average latency (around 06:00), and at that moment the largest contributor was the network.  Maybe doing the same for your trouble volume will yield some direction on where to investigate.  If you still need to go deeper, the next step would be to collect a perfstat and open a support case asking the question and referencing the perfstat.  For the perfstat I would recommend 3 iterations of 5 minutes each while the problem is occurring.  Perfstat has more diagnostic info than Harvest collects.
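If you would rather do that check outside Grafana, here is a rough sketch along the same lines that finds the worst sample and its largest latency contributor. The host, the metric path, and the latency_from_* leaf names are placeholders/assumptions, so map them to whatever your Harvest tree actually uses:

import requests

GRAPHITE = "http://graphite.example.com"                    # placeholder
VOLUME = "netapp.perf.mygroup.mycluster.svm.svm1.vol.vol1"  # placeholder volume path

# Assumed leaf names for the per-component latency series described above
COMPONENTS = ["latency_from_network", "latency_from_throttle",
              "latency_from_frontend", "latency_from_cluster",
              "latency_from_backend", "latency_from_disk"]

def series(target, window="-24h"):
    """Return {timestamp: value} for one Graphite series over the window."""
    r = requests.get(f"{GRAPHITE}/render",
                     params={"target": target, "from": window, "format": "json"})
    r.raise_for_status()
    data = r.json()
    return {ts: v for v, ts in data[0]["datapoints"] if v is not None} if data else {}

per_component = {c: series(f"{VOLUME}.{c}") for c in COMPONENTS}

# Find the sample with the highest summed latency and its largest contributor
worst_ts, worst_total, worst_component = None, 0.0, None
for ts in set().union(*(s.keys() for s in per_component.values())):
    values = {c: s.get(ts, 0.0) for c, s in per_component.items()}
    total = sum(values.values())
    if total > worst_total:
        worst_ts, worst_total = ts, total
        worst_component = max(values, key=values.get)

# Units are whatever Harvest publishes for these series (typically ms or us)
print(f"peak latency {worst_total:.1f} at t={worst_ts}, largest contributor: {worst_component}")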

 

Hope this helps!

 

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

 

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO

 
