Active IQ Unified Manager Discussions

Issue with Grafana/Graphite and QOS counters?


Hoping someone has seen this before or could give me some tips on troubleshooting it.  Please view the screenshot below.  I am trying to figure out why one of my QOS policies shows a huge spike in the number of ops.  Below it spikes to 1.5 million ops.  This can't be right.









Hi @James_Castro,


Data ONTAP counters should be monotonically increasing, meaning they only go up.  Harvest basically takes the value at T1, waits a bit, samples again at T2, and then calculates T2-T1 to get the rate of change (plus fancier math depending on the counter type).  If T2-T1 is negative then the counter was not monotonically increasing, so Harvest assumes a counter reset occurred (reboot of node, max int size reached, etc.), skips that data point, and uses the new value as the base for the next iteration.  The logic Harvest uses is the same as any perf-API-calling app would use.
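Harvest itself is written in Perl, but the delta logic above is simple enough to sketch; here is a minimal Python illustration (the function name and sample numbers are my own, not Harvest's):

```python
def rate(prev, curr, interval_s):
    """Per-second rate from two samples of a monotonically increasing counter.

    Returns None when the counter went backwards, which is treated as a
    counter reset (node reboot, max int reached, etc.); the caller should
    skip the data point and keep `curr` as the base for the next poll.
    """
    delta = curr - prev
    if delta < 0:
        return None          # reset detected: skip this point
    return delta / interval_s

# Two samples taken 60 seconds apart:
print(rate(1_000_000, 1_090_000, 60))  # 1500.0 ops/s
print(rate(1_090_000, 500, 60))        # None -> reset, point skipped
```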


I have seen some scenarios where the counter is not monotonically increasing but no reset occurred.  One of them involves the counter aggregations introduced in cDOT.  With an aggregated counter you can have something like the volume:node object, which is a summary of all volumes on a node.  If you poll that object and get the values, then one volume goes offline and you poll again, you might get a decreasing value because that vol's counters are no longer in the aggregation.  So you skip posting the data and now have a new, lower counter value as your base for the next comparison.  Then the volume comes online again, you poll again, and now that volume is included in the calc, resulting in an apparent spike in the counter.  The spike is equivalent to every 'tick' of that counter on that volume since it was created.  I saw this with snapmirror destination volumes because, as part of the 'jumpahead' after each update, the volume is taken offline momentarily.  If a poll occurs during that moment the volume is not included in the aggregation, and on the next poll, when it is, you get the spike.  I opened bug 899768 on this if you want to request a fix.
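A toy reproduction of that aggregation effect in Python (the counter values are made up; vol_b stands in for a snapmirror destination that is briefly offline during the second poll):

```python
# Hypothetical per-volume cumulative op counters at three poll times.
polls = [
    {"vol_a": 5_000, "vol_b": 900_000},  # poll 1: both volumes online
    {"vol_a": 6_000},                    # poll 2: vol_b offline, dropped from aggregation
    {"vol_a": 7_000, "vol_b": 902_000},  # poll 3: vol_b back online
]

base = None
for i, volumes in enumerate(polls, start=1):
    total = sum(volumes.values())        # the aggregated volume:node counter
    if base is not None:
        delta = total - base
        if delta < 0:
            print(f"poll {i}: delta {delta} < 0 -> point skipped, new base {total}")
        else:
            print(f"poll {i}: delta {delta}")
    base = total
```

The second poll yields a negative delta, so the point is skipped and the much lower total becomes the new base; the third poll's delta of 903,000 then includes everything vol_b ever accumulated, which is the apparent spike.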


From the screenshot though it looks like you have a policy group, and from the naming flexclones are involved.  So it's not an aggregation issue like the bug I already opened.  Maybe flexclones inherit counter values on creation, and that causes us to see a spike?  Or maybe a destroy and re-creation somehow causes it?  I would check what activity happened on a volume in that policy group at that point in time.  If you let me know what happened, and the Data ONTAP release, I can try to reproduce it and determine the root cause.


Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data


P.S.  Please select “Options” and then “Accept as Solution” if this response answered your question so that others will find it easily!




I think I understand.  The customer uses a script that searches for clones older than 45 days and applies the policy to them.  They also have another script that reclaims certain clones based on age.  I will verify exactly when these scripts are executed.  Perhaps that's the answer.  The cDOT release is 8.2.3.