Solved: Harvest, storage-schemas and incorrect max values

mzp45 · 2016-08-16

Hi,

We noticed that in our Grafana dashboards the max value of a metric can be inaccurate at times. For example, if the max IOPS during the week of 8/2/16 is 30k, then the max for the entire month of 8/16 should be 30k or more. However, we sometimes see a lower value, e.g. 25k, as the max.

I traced this down to the way Graphite consolidates data points when it reports max values. Here is the issue raised against Grafana: https://github.com/grafana/grafana/issues/1415. That part of the issue can be fixed by using the consolidateBy(max) function.

However the problem arises again when you go from one retention period to another:

This is the resolution/retention for netapp in storage-schemas.conf:

pattern = ^netapp\.perf\.*
retentions = 60s:35d, 5m:100d, 15m:395d, 1h:5y

When data moves from one retention archive to the next, e.g. from the 35d archive to the 100d archive, the values get averaged, so you'll probably lose the 30k max once the week of 8/2/16 is more than 35 days old.

These are the aggregation rules in storage-aggregation.conf:

[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
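
To make the rollover effect concrete, here is a minimal Python sketch of how a 60s archive might be rolled up into a 5m archive under the average and max rules above. The `rollup` helper and the sample values are made up for illustration; this is a simplification, not whisper's actual code.

```python
def average(values):
    return sum(values) / len(values)

def rollup(points, factor, method, x_files_factor=0.5):
    """Aggregate every `factor` fine-grained points into one coarse point."""
    out = []
    for i in range(0, len(points), factor):
        window = points[i:i + factor]
        known = [p for p in window if p is not None]
        # Store a null if too few points in the window are known
        if not known or len(known) / len(window) < x_files_factor:
            out.append(None)
        else:
            out.append(method(known))
    return out

# Ten one-minute IOPS samples containing a single 30k peak (made-up numbers)
one_minute = [20000, 21000, 30000, 22000, 20000,
              19000, 18000, 20000, 21000, 22000]

avg_5m = rollup(one_minute, 5, average)  # as under [default_average]
max_5m = rollup(one_minute, 5, max)      # as if the metric matched [max]

print("average rollup:", avg_5m)
print("max rollup:    ", max_5m)
```

With the average rule the 30k sample is blended into its five-minute window, so the peak is no longer recoverable from the coarser archive. With the max rule it survives.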

Does it make sense to push the max values to a file matching \.max$ and then compute the max in Grafana? Right now the behavior can be confusing to some users. If anyone knows of another way to solve this problem, that would be really helpful.

Thanks

madden · 2016-08-16

Hi @mzp45

Yes, the points you raise are valid; there is downsampling occurring in two places, and each of them uses average by default.

I usually explain that the values in a Grafana panel table are for the datapoints shown in the panel, and Grafana by default displays as many data points as pixels [using Graphite's maxDataPoints API parameter]. So as you zoom in or out the max/min values may change. If you want a true max you can zoom in on the max value you see in the graph a few times until the timespan shows all individual data points.

I suppose you can wrap the metric calls in the consolidateBy(max) function, but I think that result will be misleading too, since you essentially get the max sample of each group of data points being consolidated. So now your average is of the consolidated max points, and your overall graph will show much higher values than it should.
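
A minimal Python sketch of that trade-off, assuming a simplified bucketing scheme (Graphite's real consolidation also handles nulls and bucket alignment). The `consolidate` helper and the sample values are hypothetical:

```python
def average(values):
    return sum(values) / len(values)

def consolidate(values, max_data_points, func):
    """Shrink a series to at most max_data_points by bucketing adjacent points."""
    if len(values) <= max_data_points:
        return list(values)
    per_bucket = -(-len(values) // max_data_points)  # ceiling division
    return [func(values[i:i + per_bucket])
            for i in range(0, len(values), per_bucket)]

# Twelve stored samples with one spike, drawn on a panel only 4 points wide
series = [10, 10, 10, 30, 10, 10, 10, 10, 12, 10, 10, 10]

by_avg = consolidate(series, 4, average)
by_max = consolidate(series, 4, max)

print("true     max/avg:", max(series), average(series))
print("avg-cons max/avg:", max(by_avg), average(by_avg))
print("max-cons max/avg:", max(by_max), average(by_max))
```

Consolidating by average hides the spike from the max column. Consolidating by max preserves it, but the panel's average is then computed from per-bucket maxima and comes out higher than the true average.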

Here is an example:

If you zoom in to the last day or so, the lines more or less overlap (in my lab). If you zoom in further, so that every data point is displayed natively, the lines overlap entirely.

If you use storage-aggregation.conf to do the consolidating across retention archives I think you will have the same problem.

You could solve this by increasing the retention of your most granular archive to keep the submitted values; then you could run queries without maxDataPoints set and you would see the true max. Otherwise you need to create new metrics for max values. For this you'd have to write a script that fetches the actual metric datapoints, consolidates them, and posts the max into some less granular archives (e.g. hourly maxes). If you name them in a standard way and use storage-aggregation.conf, you could ensure they are rolled up as maxes again as well. Sorry, but other than these [unappealing] ideas I'm not sure what to recommend.

One last thing: even the data points submitted by Harvest are averages over the sample period, which is 60s by default. So your actual single-second peak IOPS will never be captured. To see unaveraged data you need to look at metrics that have histograms [which I don't typically collect].
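
The core of such a script might look like the Python sketch below. It assumes the datapoints arrive as [value, timestamp] pairs, as in the JSON output of Graphite's render API, and it formats the result as plaintext-protocol lines ("path value timestamp"). The metric name and the sample values are made up, and the fetching and sending steps are left out.

```python
from collections import defaultdict

def hourly_maxes(datapoints):
    """Return {hour_start_epoch: max value} from [value, timestamp] pairs."""
    buckets = defaultdict(list)
    for value, ts in datapoints:
        if value is not None:  # gaps come back as null in the JSON output
            buckets[ts - ts % 3600].append(value)
    return {hour: max(vals) for hour, vals in sorted(buckets.items())}

def to_plaintext(metric, maxes):
    """Format one '<path> <value> <timestamp>' line per hourly max."""
    return "".join(f"{metric} {value} {hour}\n"
                   for hour, value in maxes.items())

# Made-up samples: three points in one hour (one missing), one in the next
sample = [[20000, 1470096000], [30000, 1470096060],
          [None, 1470096120], [25000, 1470099600]]

# Hypothetical name; the .max suffix is chosen to match the [max] rule
metric = "netapp.perf.example.total_ops.max"
lines = to_plaintext(metric, hourly_maxes(sample))
print(lines, end="")
```

The resulting lines could then be written to carbon's plaintext listener. The new series would also need its own storage-schemas.conf entry with a matching (for example hourly) resolution, which is not shown here.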

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

mzp45 · 2016-08-17

Hi Chris,

Thanks for the explanation. I think we are ok with the current setup since we now understand how the max values are displayed. We may look at increasing the retention to maybe 60 days or so.

Thanks