Harvest avg_processor_busy metrics discrepancy

PabloZorzoli · 2017-04-04

I noticed that Harvest has two places in the metrics path where it reports avg_processor_busy:

harvest.xx.cluster.node.node-xx.system.avg_processor_busy
harvest.xx.cluster.node.node-xx.processor.avg_processor_busy

The Grafana dashboard for the Node uses the processor one for the graph in the System Utilization panel.

From time to time (in my environment) it reports unexpected values, far above 100%, as in the screenshot below (the same happens with Kahuna):

I also noticed that if I rely on the avg_processor_busy under system, these anomalies don't seem to be there. So I'm curious whether this is a real issue in my environment or whether the ONTAP counters are playing games with Harvest.

I hope @madden will come to my rescue on this one.

Pablo

madden · 2017-04-06

Hi @PabloZorzoli

I looked at the code and the 'system' object avg_processor_busy is collected from the system:node object and passed through to Graphite unmodified (except for normalization). The 'processor' avg_processor_busy is actually calculated in the cdot-processor plugin as (sum of per-core processor_busy) / (number of cores). Because you said the Kahuna domain also gets wonky, and that one comes from processor_busy of this same 'processor' object, it implies that either the cluster is returning incorrect values to Harvest or Harvest is processing/summarizing the data incorrectly.
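To make the 'processor' calculation concrete, here is a minimal Python sketch of the averaging described above. It is not the cdot-processor plugin code; the function names, the core names, and all sample values are made up for illustration. It shows how a single out-of-range per-core sample is enough to push the average past 100%.

```python
def avg_processor_busy(per_core_busy):
    """Return (sum of per-core processor_busy) / (number of cores)."""
    if not per_core_busy:
        raise ValueError("need at least one per-core sample")
    return sum(per_core_busy.values()) / len(per_core_busy)


def out_of_range(per_core_busy, low=0.0, high=100.0):
    """Return the per-core samples that fall outside [low, high] percent."""
    return {core: busy for core, busy in per_core_busy.items()
            if not low <= busy <= high}


# Hypothetical samples: every core reports a plausible percentage.
healthy = {"processor0": 20.0, "processor1": 40.0,
           "processor2": 30.0, "processor3": 10.0}

# Same poll, but one core reports an impossible value.
suspect = dict(healthy, processor2=730.0)

print(avg_processor_busy(healthy))   # 25.0
print(avg_processor_busy(suspect))   # 200.0
print(out_of_range(suspect))         # {'processor2': 730.0}
```

A check like out_of_range, run against the per-core values captured in the verbose log, would show whether the bad value is already present in the cluster's response or only appears after summarizing.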

If you can restart the poller with verbose logging enabled (-v option to netapp-worker or netapp-manager) then Harvest will log every response it gets from the cluster and we can investigate which component is to blame. Also, in Harvest v1.3 I added logfile rotation but forgot to document it! You might need to add these key/value pairs to your poller config with sufficiently high values to retain enough logs to capture the issue:

PARAMETER | DESCRIPTION | DEFAULT VALUE
logfile_rotate_mb | Size in MB per logfile before it is rotated | 5
logfile_rotate_keep | Inactive log is archived to log.1, log.2 etc. Set number of archived logfiles to keep | 4
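As a rough illustration, the two keys might be added to a poller section like this. The section name and both values are hypothetical, and the layout assumes the usual key = value style of the Harvest config file; pick values large enough to cover the time between occurrences of the issue.

```ini
# Hypothetical poller section; the name and values are illustrative only
[my-cluster-poller]
logfile_rotate_mb   = 50
logfile_rotate_keep = 20
```

With these example values, up to 20 archived logs of 50 MB each would be retained alongside the active logfile.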

Cheers,
Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data


PabloZorzoli · 2017-04-11

Thanks for the reply, @madden. I have restarted one of the pollers in verbose mode and will try to catch a recurrence of it.