Re: No metrics are being shown in Grafana or Graphite

CHARLES_L · ‎2016-06-28

Recently installed Grafana and Graphite but the only metrics that are showing up are the "Graphite Carbon Metrics".

No other metrics are showing up; I've posted a screenshot of the "NetApp Dashboard Cluster" below.

In Graphite, the Carbon metrics are the only ones that are showing up as well. Does anyone have an idea on what may be the problem here?

Thanks

madden · ‎2016-06-29

Hi @CHARLES_L

The NetApp metrics are sent by NetApp Harvest, downloadable from the Support toolchest. Have you installed Harvest? In the user guide (found on the Toolchest page next to the software download link) are some troubleshooting steps. The first is generally to check the poller logfile in /opt/netapp-harvest/log. I would check in there to see if it is able to collect and forward to Graphite correctly.

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

bavy · ‎2016-07-05

Chris:

I went through the thread linked below and checked everything it covers:

http://community.netapp.com/t5/OnCommand-Storage-Management-Software-Discussions/netapp-harvest-ocum-capacity-metrics/m-p/110608/highlight/true#M19507

I have confirmed that on the graphite server:

$ service carbon-cache status
carbon-cache (instance a) is running with pid 1837 [ OK ]

I have attached the log files from /opt/netapp-harvest/log. I reviewed the troubleshooting section of the document and compared the "authorization failed" and "connection refused" errors from my log files against it, but had no luck getting past the issue. I have also reviewed the possible suspects, but none of them seem to fit.

bash-4.1$ service netapp-harvest status
STATUS          POLLER               SITE
############### #################### ##################
[RUNNING]       OCUM                 dfm
[RUNNING]       chvpk-cmode-poc      poc
[RUNNING]       hou150v3240cmode     hou-poc

I have verified the services. It would be great to learn what I am missing.

Graphite is loading the information from the filers, but I can't get any graphs to update in Grafana.

madden · ‎2016-07-07

Hi @bavy

The logs show entries like:

[2016-07-05 09:20:07] [NORMAL ] Poller status: status, secs=14400, api_time=666, plugin_time=15, metrics=191721, skips=0, fails=0

This status line is logged every 4 hours (secs=14400), and you can see it reports 191721 metric updates sent with no skips or failures. So Harvest is able to talk to Carbon. Since you don't see any metrics, my guess is that Carbon is failing to write the metrics to disk.
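As a rough illustration of reading that status line, the sketch below pulls the key=value counters out of the log entry quoted above. The function name `parse_poller_status` is made up for this example and is not part of Harvest; it only assumes the line format shown in this thread.

```python
import re

# Sample line copied from the poller log quoted above.
SAMPLE = (
    "[2016-07-05 09:20:07] [NORMAL ] Poller status: status, secs=14400, "
    "api_time=666, plugin_time=15, metrics=191721, skips=0, fails=0"
)


def parse_poller_status(line):
    """Return the key=value counters of a 'Poller status' line as ints.

    Returns None for lines that are not poller status lines.
    (Hypothetical helper, not shipped with Harvest.)
    """
    if "Poller status:" not in line:
        return None
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


stats = parse_poller_status(SAMPLE)
# Metrics were sent and nothing was skipped or failed, so Harvest reached Carbon.
harvest_ok = stats["metrics"] > 0 and stats["skips"] == 0 and stats["fails"] == 0
print(stats, harvest_ok)
```

A non-zero `skips` or `fails` value would point back at the Harvest side instead of at Carbon.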

Check the logfile named 'creates.log' in

Ubuntu: /var/log/carbon

Red Hat/Centos: /opt/graphite/storage/log/carbon-cache/carbon-cache-a

I suspect you'll have errors, maybe file permissions related on the data directory:

Ubuntu: /var/lib/graphite

Red Hat/Centos: /opt/graphite/storage

Also make sure you have enough disk space. You can also check the data directory structure to see if you have netapp/perf/<site>/<cluster> where your cluster data should be.
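A minimal sketch of those two checks, assuming the Whisper files live under a `whisper` directory inside the data directory (on Red Hat that would be /opt/graphite/storage/whisper, as seen in the creates.log paths later in this thread). The helper names are made up for illustration.

```python
import os
import shutil


def list_clusters(whisper_root):
    """Return sorted (site, cluster) pairs found under <whisper_root>/netapp/perf."""
    perf_dir = os.path.join(whisper_root, "netapp", "perf")
    pairs = []
    if not os.path.isdir(perf_dir):
        return pairs  # Nothing has been written for Harvest yet.
    for site in sorted(os.listdir(perf_dir)):
        site_dir = os.path.join(perf_dir, site)
        if not os.path.isdir(site_dir):
            continue
        for cluster in sorted(os.listdir(site_dir)):
            if os.path.isdir(os.path.join(site_dir, cluster)):
                pairs.append((site, cluster))
    return pairs


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 2**30
```

For example, `list_clusters("/opt/graphite/storage/whisper")` returning an empty list would suggest Carbon never created files for your clusters, while `free_gib(...)` close to zero would explain failed writes.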

Cheers,
Chris Madden


bavy · ‎2016-07-07

Thank you Chris.

Checked the creates.log file. Do not see any errors.

$ cat creates.log
04/07/2016 20:05:05 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cporal99.nfsv3.symlink_avg_latency matched schema default_1min_for_1day
04/07/2016 20:05:05 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cporal99.nfsv3.symlink_avg_latency matched aggregation schema default_average
04/07/2016 20:05:05 :: creating database file /opt/graphite/storage/whisper/netapp/perf/poc/chvpk-cmode-poc/svm/vs-cporal99/nfsv3/symlink_avg_latency.wsp (archive=[(60,
04/07/2016 20:05:06 :: new metric netapp.perf.poc.chvpk-cmode-poc.node.chvpkv3240-08.nfsv3.symlink_avg_latency matched schema default_1min_for_1day
04/07/2016 20:05:06 :: new metric netapp.perf.poc.chvpk-cmode-poc.node.chvpkv3240-08.nfsv3.symlink_avg_latency matched aggregation schema default_average
04/07/2016 20:05:06 :: creating database file /opt/graphite/storage/whisper/netapp/perf/poc/chvpk-cmode-poc/node/chvpkv3240-08/nfsv3/symlink_avg_latency.wsp (archive=[(
$ tail -f creates.log
04/07/2016 20:05:05 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cporal99.nfsv3.symlink_avg_latency matched schema default_1min_for_1day
04/07/2016 20:05:05 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cporal99.nfsv3.symlink_avg_latency matched aggregation schema default_average
04/07/2016 20:05:05 :: creating database file /opt/graphite/storage/whisper/netapp/perf/poc/chvpk-cmode-poc/svm/vs-cporal99/nfsv3/symlink_avg_latency.wsp (archive=[(60,
04/07/2016 20:05:06 :: new metric netapp.perf.poc.chvpk-cmode-poc.node.chvpkv3240-08.nfsv3.symlink_avg_latency matched schema default_1min_for_1day
04/07/2016 20:05:06 :: new metric netapp.perf.poc.chvpk-cmode-poc.node.chvpkv3240-08.nfsv3.symlink_avg_latency matched aggregation schema default_average
04/07/2016 20:05:06 :: creating database file /opt/graphite/storage/whisper/netapp/perf/poc/chvpk-cmode-poc/node/chvpkv3240-08/nfsv3/symlink_avg_latency.wsp (archive=[(

$ tail -f creates.log.2016_7_3
03/07/2016 02:01:09 :: creating database file /opt/graphite/storage/whisper/netapp/perf/poc/chvpk-cmode-poc/svm/vs-cporal99/vol/cporal99_dbs5/write_latency.wsp (archive
03/07/2016 19:31:09 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cpsapl02.vol.icc_ar.write_latency matched schema default_1min_for_1day
03/07/2016 19:31:09 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cpsapl02.vol.icc_ar.write_latency matched aggregation schema default_average
03/07/2016 19:31:09 :: creating database file /opt/graphite/storage/whisper/netapp/perf/poc/chvpk-cmode-poc/svm/vs-cpsapl02/vol/icc_ar/write_latency.wsp (archive=[(60,
03/07/2016 21:25:10 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cpsapl02.vol.r32_08.avg_latency matched schema default_1min_for_1day
03/07/2016 21:25:10 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cpsapl02.vol.r32_08.avg_latency matched aggregation schema default_average
03/07/2016 21:25:10 :: creating database file /opt/graphite/storage/whisper/netapp/perf/poc/chvpk-cmode-poc/svm/vs-cpsapl02/vol/r32_08/avg_latency.wsp (archive=[(60, 14
03/07/2016 21:25:10 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cpsapl02.vol.r32_08.other_latency matched schema default_1min_for_1day
03/07/2016 21:25:10 :: new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cpsapl02.vol.r32_08.other_latency matched aggregation schema default_average
03/07/2016 21:25:10 :: creating database file /opt/graphite/storage/whisper/netapp/perf/poc/chvpk-cmode-poc/svm/vs-cpsapl02/vol/r32_08/other_latency.wsp (archive=[(60,

Permissions look OK too.

$ cd /opt/graphite/storage
$ ls -al
total 104
drwxr-xr-x 6 apache     apache      4096 Jun 28 19:29 .
drwxr-xr-x 8 appharvest appharvest 4096 Jun 23 13:07 ..
-rw-r--r-- 1 root       root           4 Jun 23 16:51 carbon-cache-a.pid
-rw-r--r-- 1 apache     apache     69632 Jun 28 19:29 graphite.db
-rw-r--r-- 1 apache     apache      1130 Jun 14 15:27 index
drwxr-xr-x 2 apache     apache      4096 May 19 17:47 lists
drwxr-xr-x 4 apache     apache      4096 Jun 14 15:09 log
drwxr-xr-x 2 apache     apache      4096 May 19 17:47 rrd
drwxr-xr-x 5 apache     apache      4096 Jun 14 16:11 whisper

We are also seeing a red "!" in the top-left corner of some graphs in Grafana, but not on all graphs of a given dashboard.

The error says:
{
"error": "Internal Server Error",
"message": "Internal Server Error"
}

And we have a lot of available space on both RHEL VMs.

madden · ‎2016-07-07

Hi @bavy

I see a problem with the log entries:

new metric netapp.perf.poc.chvpk-cmode-poc.svm.vs-cpsapl02.vol.r32_08.avg_latency matched schema default_1min_for_1day

If you followed the Harvest admin guide, you edited /opt/graphite/conf/storage-schemas.conf to add some entries. The order of the rules in that file is important: the first rule whose pattern matches a metric wins, so "default_1min_for_1day" must be the LAST rule in the file. From the creates log I can see that it is not (or the snippets from the Harvest admin guide were never copied in there at all). The result is that you will only have 1 day of history, so you need to reorder that file. Check this post for a much longer discussion and two options to fix it.
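To illustrate why the order matters, here is a small sketch that walks the rules top to bottom and returns the first section whose pattern matches, which is how the paragraph above describes Carbon choosing a schema. The `[netapp_perf_example]` rule and its retention are hypothetical placeholders, not the entries from the Harvest admin guide; use the guide's snippets in the real file.

```python
import configparser
import re

METRIC = "netapp.perf.poc.chvpk-cmode-poc.svm.vs-cpsapl02.vol.r32_08.avg_latency"

# Catch-all rule first: every metric matches it, so later rules are never reached.
WRONG_ORDER = r"""
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d

[netapp_perf_example]
pattern = ^netapp\.perf\.
retentions = 1m:35d
"""

# Specific rule first, catch-all last.
RIGHT_ORDER = r"""
[netapp_perf_example]
pattern = ^netapp\.perf\.
retentions = 1m:35d

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
"""


def first_matching_schema(conf_text, metric):
    """Return the name of the first section whose pattern matches metric."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    for name in parser.sections():  # sections() preserves file order
        pattern = parser.get(name, "pattern", fallback=None)
        if pattern is not None and re.search(pattern, metric):
            return name
    return None


print(first_matching_schema(WRONG_ORDER, METRIC))
print(first_matching_schema(RIGHT_ORDER, METRIC))
```

Note that reordering the file only affects metrics created afterwards; existing .wsp files keep the retention they were created with, which is what the linked discussion's fix options deal with.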

Even with the file as it is now, you should still get 1 day of retention. Since we know the data is being written to disk, either your graphite-web can't read the files, or the Grafana data source is not set up correctly or can't talk to Graphite. Can you load the native Graphite interface in your web browser (i.e. http://graphite:81) and see the metrics there? If yes, go to the edit data source page in Grafana and click the test connection button: does it report success?
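If a browser is not handy, the same check can be sketched against graphite-web's render API, which returns JSON when `format=json` is passed. The base URL is the example address from this post and the target is one of the metric names from the creates.log above; both are assumptions about your setup, and the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://graphite:81"  # example address from this post; adjust to your server
TARGET = "netapp.perf.poc.chvpk-cmode-poc.svm.vs-cporal99.nfsv3.symlink_avg_latency"


def render_url(base_url, target, since="-1h"):
    """Build a graphite-web render URL that asks for JSON output."""
    query = urllib.parse.urlencode({"target": target, "from": since, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def count_datapoints(base_url, target, since="-1h"):
    """Fetch the series and count the non-null datapoints graphite-web returns."""
    with urllib.request.urlopen(render_url(base_url, target, since), timeout=10) as resp:
        series = json.load(resp)
    return sum(1 for s in series for value, _ts in s["datapoints"] if value is not None)


url = render_url(BASE_URL, TARGET)
print(url)
```

Calling `count_datapoints(BASE_URL, TARGET)` from the Grafana host and getting a positive number would show that graphite-web can read the files and is reachable from there; an HTTP error or zero would narrow the problem to graphite-web or the network path.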

Cheers,
Chris Madden


CHARLES_L · ‎2016-07-08

Hi @madden

We haven't made the change you suggested yet; we'll get that done by the end of today.

We have however uploaded some new dashboards and are receiving data from them. There is a new problem we are running into. We open a dashboard and can see the data, but as soon as we change groups it appears that Grafana won't load the data. If we try to change back to the first group that did show data before, it will no longer show data and the graph refresh button won't work. The only thing that does work is either refreshing the browser or going to a different dashboard and back.

This isn't an issue on all dashboards, just some of them. Also, for most of the dashboards that have this issue, if we switch groups, save the dashboard, and then refresh the browser, we do see data for the groups we didn't see before. Since this is the case for most of them, I believe that switching the group is causing the problem; I just don't know how or why.

Some of the graphs also show the red exclamation mark with "Internal Server Error". I do not know if these problems are related; I think each may be its own issue. We are running Grafana version 3.0.2. Do you think an upgrade to version 3.0.4 would fix some of what we are seeing?

Many thanks for your response and help.

madden · ‎2016-07-08

Hi @CHARLES_L

You can load the latest Grafana version to see if it helps. The template items are cascaded: each dropdown is populated based on the selection in the previous one. So if you change the group, for example, the others will have to refresh. For the 'internal server error', I think if you click on the ! you can see a more detailed message. If you check the webserver logs from Graphite, you should see the details of the error there too. If some queries are failing, that could also cause the dropdown list boxes to not populate correctly or consistently.

If your server is really slow, perhaps the dropdowns aren't populated before you click them, giving you the odd behavior. Does your Graphite server have plenty of resources? If you look at top, is the load average OK?
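As a rough way to put a number on "is the load average OK", the sketch below compares the 1-minute load average with the CPU count. Treating a load per CPU above 1.0 as busy is only a common rule of thumb assumed here, not a Graphite or Harvest recommendation, and the helper names are made up.

```python
import os


def load_per_cpu(load1, cpus):
    """Return the 1-minute load average divided by the number of CPUs."""
    return load1 / max(cpus, 1)


def looks_overloaded(load1, cpus, threshold=1.0):
    """Rule-of-thumb check: more runnable work than CPUs (threshold is an assumption)."""
    return load_per_cpu(load1, cpus) > threshold


# On the Graphite server itself (Unix only), the live values can be read like this:
if hasattr(os, "getloadavg"):
    load1, _load5, _load15 = os.getloadavg()
    print(load1, os.cpu_count(), looks_overloaded(load1, os.cpu_count() or 1))
```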

Cheers,
Chris Madden


CHARLES_L · ‎2016-08-08

Just wanted to provide an answer and close this question. We ended up solving most of the issues by upgrading from Grafana 3.0.2 to 3.1.1. According to issues number 5103, 5119, and 5120, there was a problem with switching groups that was fixed in version 3.0.3. We deleted the dashboards that weren't working and re-imported them after the upgrade, and they are all working now.