Active IQ Unified Manager Discussions

Grafana plus Graphite plus Harvest issues

Varyanitsa
8,206 Views

Hello,

We have just installed this stack. First off, it works great (mostly)! Unfortunately, we have several issues that ruin most of its visualization features. At the moment we monitor 4 different clusters and several FAS2240 systems, which gives us around 30k metric updates per minute. We use the pre-created dashboards and collect data every 30 seconds, storing these 30s metrics for 10 days.

I frequently see “Request errors” on the LUN dashboard, even for 6-12h graphs. These are accompanied by “Data-Cache miss” entries in carbon’s cache.log, although the carbon metrics dashboard shows no errors. On top of that, data is missing: some LUNs have no average-latency points written for hours, while others record all (or most) of their data.

I have no idea where I should start digging, and I would appreciate any help.

7 REPLIES

madden
8,181 Views

Hi @Varyanitsa

 

>>“Request errors”

I believe this is a defect in a specific build of Graphite.  If you are using NAbox, there was a period when that affected release was bundled.  @yannb, can you comment more?

 

>>Some LUNs don’t have average latency points written for hours whereas others write the whole (or most) data.

This is likely related to LUNs with very low IOPs and the latency_io_reqd feature of Harvest (see the admin guide) kicking in.  See this post where I explain more.
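For reference, this option lives in netapp-harvest.conf. A minimal sketch, assuming a poller section named cluster01 with a placeholder hostname (check the admin guide for where your install keeps the option and what its default is):

# netapp-harvest.conf
[cluster01]
hostname        = 10.0.0.10
latency_io_reqd = 0    # 0 disables the minimum-IOPs requirement, so latency is always submitted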

 

Cheers,
Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data

 

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

 

Varyanitsa
8,111 Views

Hi, madden!

Unfortunately, setting latency_io_reqd to 0 didn't solve my problem; the number of updated metrics rose to 33k, but I still have these gaps. Any other tweaks to try?

 

Also, today I found this in the worker's status log:

 

[2017-02-27 11:45:32] [WARNING] [nfsv3] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:32] [WARNING] [nfsv3:node] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:38] [WARNING] [nfsv4] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:41] [WARNING] [nfsv4:node] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:41] [WARNING] [nfsv4_1] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:41] [WARNING] [nfsv4_1:node] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:41] [WARNING] [nic_common] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:41] [WARNING] [path] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:42] [WARNING] [processor] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:42] [WARNING] [system:node] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:42] [WARNING] [token_manager] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:43] [WARNING] [volume] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:43] [WARNING] [volume:node] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:43] [WARNING] [wafl] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:43] [WARNING] [wafl_hya_per_aggr] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:43] [WARNING] [wafl_hya_sizer] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:43] [WARNING] [workload] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:43] [WARNING] [workload_detail] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:45] [WARNING] [workload_volume] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]
[2017-02-27 11:45:48] [WARNING] [workload_detail_volume] data-list poller refresh overdue; skipped [1] poll(s) from [2017-02-27 11:46:00] to [2017-02-27 11:46:00]

madden
8,022 Views

Hi @Varyanitsa

 

Regarding the gaps: if a LUN has 0 IOs, its latency is also 0, and Harvest does not submit data points with a value of 0.  When Grafana requests those data points it gets null values, which by default are drawn as a connected line.  You can modify the Grafana panel to show null values as 0:

 

[screenshot: null.png (Grafana null value display setting)]
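If you would rather make the change outside the UI, the same behaviour is controlled by the nullPointMode field in the classic graph panel JSON. A minimal sketch with a placeholder panel title (other panel fields omitted):

{
  "type": "graph",
  "title": "LUN Avg Latency",
  "nullPointMode": "null as zero"
}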

 

Regarding the skipped polls, I think this occurs because the time Harvest needs to collect all data from the cluster is greater than your configured poll interval.  After Harvest collects data it schedules the next update, but if that scheduled time is already in the past it skips it.  I would recommend increasing your poll interval to 60s and seeing whether you still get frequent skipped messages.  You would also need to update your Graphite storage-schemas.conf file with this new frequency, and either remove all metric files and let them be recreated or use the whisper-resize.py utility to resize them.  This post has more info about resizing.
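As a rough sketch of what that change could look like, assuming stock Graphite paths and a section name/pattern of my own choosing (match the pattern to whatever your storage-schemas.conf already uses for Harvest metrics):

# /opt/graphite/conf/storage-schemas.conf
[netapp_harvest]
pattern = ^netapp\.
retentions = 60s:10d

# resize existing whisper files in place instead of deleting them
find /opt/graphite/storage/whisper/netapp -name '*.wsp' \
    -exec whisper-resize.py {} 60s:10d \;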

 

Cheers,
Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data

 

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

 

 

Brian-T
7,956 Views

> "Regarding the gaps, if you have LUNs that have 0 IOs the latency is also 0, and Harvest does not submit data points with a value of 0"

 

Is there a way to toggle that behavior besides changing it on the Grafana side? Looking at the Graphite data, it is hard to tell whether a large gap means a data-collection issue or simply no activity on a volume/LUN/network, etc.

 

Brian

madden
7,925 Views

Hi @Brian-T

 

There is no config option in Harvest to submit 0 values; they are always skipped.  The reason is that many counters will never have a value in the lifetime of the system, and sending 0 for them would consume disk space and CPU cycles on the Graphite server.  If you are trying to distinguish an idle resource from missing data points, the best you can do is check other resources.  For example, if you see that no volumes had data points, you know there was a collection problem; but if some volumes had data points and your volume of interest did not, then you can conclude there was no activity on your volume (and not that data points were skipped).
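One way to script that cross-check is the Graphite render API. A minimal sketch with placeholder hostname and metric path (the real Harvest metric tree in your install will differ):

# list which volumes reported any avg_latency points in the last hour
curl -s 'http://<graphite-host>/render?target=<cluster.path>.vol.*.avg_latency&from=-1h&format=json'

A volume that returns only null datapoints over the window was either idle or missed; if every volume comes back null, it was almost certainly a collection problem.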

 

Hope this helps!

 

Cheers,
Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data

 

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

Brian-T
7,905 Views

Hi Chris,

  Thanks for the explanation. I see what you mean about the massive number of data points being collected. The issue I keep running into is that an idle volume may not show up on the graph at all. When the volume is performing IOPs it shows up, but when it's quiet it disappears. For example, the root volume shows all the latency numbers, but an idle test volume does not.

 

Thanks

-Brian

 

 

Brian-T
7,891 Views

I was able to make a work-around for it. Thanks for your help!
