Re: NetApp Harvest - brilliant but some 7-mode issues...

DARREN_REED · ‎2015-12-30

I've installed NetApp Harvest in two environments now, one with RHEL 6.6 (monitoring OnTAP 8.2.1 C-Mode) and one with RHEL 6.7 (monitoring OnTAP 8.2.1 7-Mode). When and where it works it is brilliant - especially for c-Dot - far better than anything else NetApp have put our way recently!

The RHEL 6.6/C-Mode VM is working flawlessly. Love it!

However the RHEL 6.7/7-Mode is having a few problems.

These rows work without error:

Highlights, Node Workload Drilldown, Node Disk And Cache Drilldown, Top FCP Drilldown (FCP is not used :), Top NFSv3 Drilldown, Top Uptime Drilldown

These rows partially work with some graphs showing a red triangle plus exclamation mark:

Node CPU Drilldown, Node Network Drilldown (Ethernet Port Throughput is "!"), Top ISCSI Drilldown (iSCSI Read * is "!"), Top CIFS Drilldown (CIFS Write IOPs & CIFS Write Latency are "!")

These rows do not work at all (all elements have the red triangle plus "!"):

Top Volumes Drilldown, Top LUNs Drilldown

If I hover over the red triangle plus exclamation mark I get "Timeseries data request error", and clicking on it gives the following:

====================================================

Graphite encountered an unexpected error while handling your request.
Please contact your site administrator if the problem persists.

Request:

Request details

Url /api/datasources/proxy/1/render
Method POST
Content-Type application/x-www-form-urlencoded
Accept application/json, text/plain, */*
Request parameters

target aliasByNode(highestAverage(netapp.perf7.*.*.vol.*.read_data, 5), 3, 5)
from -6h
until now
format json
maxDataPoints 503

Response:

Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/django/core/handlers/base.py", line 109, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/opt/graphite/webapp/graphite/render/views.py", line 122, in renderView
seriesList = evaluateTarget(requestContext, target)
File "/opt/graphite/webapp/graphite/render/evaluator.py", line 10, in evaluateTarget
result = evaluateTokens(requestContext, tokens)
File "/opt/graphite/webapp/graphite/render/evaluator.py", line 21, in evaluateTokens
return evaluateTokens(requestContext, tokens.expression)
File "/opt/graphite/webapp/graphite/render/evaluator.py", line 28, in evaluateTokens
args = [evaluateTokens(requestContext, arg) for arg in tokens.call.args]
File "/opt/graphite/webapp/graphite/render/evaluator.py", line 21, in evaluateTokens
return evaluateTokens(requestContext, tokens.expression)
File "/opt/graphite/webapp/graphite/render/evaluator.py", line 28, in evaluateTokens
args = [evaluateTokens(requestContext, arg) for arg in tokens.call.args]
File "/opt/graphite/webapp/graphite/render/evaluator.py", line 21, in evaluateTokens
return evaluateTokens(requestContext, tokens.expression)
File "/opt/graphite/webapp/graphite/render/evaluator.py", line 24, in evaluateTokens
return fetchData(requestContext, tokens.pathExpression)
File "/opt/graphite/webapp/graphite/render/datalib.py", line 372, in fetchData
dbResults = dbFile.fetch(startTime, endTime, now)

====================================================

Clues?

Is this a case of version X of A needs version Y of B, etc., and a yum update will fix it, or...?

DARREN_REED · ‎2015-12-30

Some more digging and I found this in "/opt/graphite/storage/log/webapp/error.log":

[Wed Dec 23 13:30:59 2015] [error] mod_wsgi (pid=5773): Target WSGI script '/opt/graphite/conf/graphite.wsgi' cannot be loaded as Python module.
[Wed Dec 23 13:30:59 2015] [error] mod_wsgi (pid=5773): Exception occurred processing WSGI script '/opt/graphite/conf/graphite.wsgi'.

which is suggestive of a "wrong version of python" issue.

# yum info mod_wsgi
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
rhel-6-updates | 2.9 kB 00:00
Installed Packages
Name : mod_wsgi
Arch : x86_64
Version : 3.2
Release : 7.el6
Size : 177 k
Repo : installed
From repo : rhel-6-updates
Summary : A WSGI interface for Python web applications in Apache
URL : http://modwsgi.org
License : ASL 2.0
Description : The mod_wsgi adapter is an Apache module that provides a WSGI compliant
: interface for hosting Python based web applications within Apache. The
: adapter is written completely in C code against the Apache C runtime and
: for hosting WSGI applications within Apache has a lower overhead than using
: existing WSGI adapters for mod_python or CGI.

# yum deplist mod_wsgi

...

dependency: libpython2.6.so.1.0()(64bit)

So it appears that wsgi is correctly matched with python.

If I try to run it directly...

# python --version
Python 2.6.6

# python /opt/graphite/conf/graphite.wsgi
/usr/lib/python2.6/site-packages/django/conf/__init__.py:75: DeprecationWarning: The ADMIN_MEDIA_PREFIX setting has been removed; use STATIC_URL instead.
"use STATIC_URL instead.", DeprecationWarning)

but no "cannot load" error.

A more comprehensive extract from that error file is:

[Wed Dec 23 13:30:59 2015] [error] /usr/lib/python2.6/site-packages/django/conf/__init__.py:75: DeprecationWarning: The ADMIN_MEDIA_PREFIX setting has been removed; use STATIC_URL instead.
[Wed Dec 23 13:30:59 2015] [error] "use STATIC_URL instead.", DeprecationWarning)
[Wed Dec 23 13:30:59 2015] [error] mod_wsgi (pid=5773): Target WSGI script '/opt/graphite/conf/graphite.wsgi' cannot be loaded as Python module.
[Wed Dec 23 13:30:59 2015] [error] mod_wsgi (pid=5773): Exception occurred processing WSGI script '/opt/graphite/conf/graphite.wsgi'.
[Wed Dec 23 13:30:59 2015] [error] Traceback (most recent call last):
[Wed Dec 23 13:30:59 2015] [error] File "/opt/graphite/conf/graphite.wsgi", line 25, in <module>
[Wed Dec 23 13:30:59 2015] [error] import graphite.metrics.search
[Wed Dec 23 13:30:59 2015] [error] File "/opt/graphite/webapp/graphite/metrics/search.py", line 6, in <module>
[Wed Dec 23 13:30:59 2015] [error] from graphite.storage import is_pattern, match_entries
[Wed Dec 23 13:30:59 2015] [error] File "/opt/graphite/webapp/graphite/storage.py", line 9, in <module>
[Wed Dec 23 13:30:59 2015] [error] from graphite.remote_storage import RemoteStore
[Wed Dec 23 13:30:59 2015] [error] File "/opt/graphite/webapp/graphite/remote_storage.py", line 8, in <module>
[Wed Dec 23 13:30:59 2015] [error] from graphite.util import unpickle
[Wed Dec 23 13:30:59 2015] [error] File "/opt/graphite/webapp/graphite/util.py", line 82, in <module>
[Wed Dec 23 13:30:59 2015] [error] defaultUser = User.objects.create_user('default','default@localhost.localdomain',randomPassword)
[Wed Dec 23 13:30:59 2015] [error] File "/usr/lib/python2.6/site-packages/django/contrib/auth/models.py", line 160, in create_user
[Wed Dec 23 13:30:59 2015] [error] user.save(using=self._db)
[Wed Dec 23 13:30:59 2015] [error] File "/usr/lib/python2.6/site-packages/django/db/models/base.py", line 463, in save
[Wed Dec 23 13:30:59 2015] [error] self.save_base(using=using, force_insert=force_insert, force_update=force_update)
[Wed Dec 23 13:30:59 2015] [error] File "/usr/lib/python2.6/site-packages/django/db/models/base.py", line 551, in save_base
[Wed Dec 23 13:30:59 2015] [error] result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
[Wed Dec 23 13:30:59 2015] [error] File "/usr/lib/python2.6/site-packages/django/db/models/manager.py", line 203, in _insert
[Wed Dec 23 13:30:59 2015] [error] return insert_query(self.model, objs, fields, **kwargs)
[Wed Dec 23 13:30:59 2015] [error] File "/usr/lib/python2.6/site-packages/django/db/models/query.py", line 1593, in insert_query
[Wed Dec 23 13:30:59 2015] [error] return query.get_compiler(using=using).execute_sql(return_id)
[Wed Dec 23 13:30:59 2015] [error] File "/usr/lib/python2.6/site-packages/django/db/models/sql/compiler.py", line 912, in execute_sql
[Wed Dec 23 13:30:59 2015] [error] cursor.execute(sql, params)
[Wed Dec 23 13:30:59 2015] [error] File "/usr/lib/python2.6/site-packages/django/db/backends/sqlite3/base.py", line 344, in execute
[Wed Dec 23 13:30:59 2015] [error] return Database.Cursor.execute(self, query, params)
[Wed Dec 23 13:30:59 2015] [error] IntegrityError: column username is not unique
[Wed Dec 23 13:31:04 2015] [error] mod_wsgi (pid=5771): Target WSGI script '/opt/graphite/conf/graphite.wsgi' cannot be loaded as Python module.
[Wed Dec 23 13:31:04 2015] [error] mod_wsgi (pid=5771): Exception occurred processing WSGI script '/opt/graphite/conf/graphite.wsgi'.

There is also a bunch of these:

[Wed Dec 23 15:24:29 2015] [error] No handlers could be found for logger "cache"
[Wed Dec 23 15:24:29 2015] [error] No handlers could be found for logger "cache"
[Wed Dec 23 15:24:29 2015] [error] No handlers could be found for logger "cache"
[Wed Dec 23 15:24:29 2015] [error] No handlers could be found for logger "cache"
[Wed Dec 23 15:24:29 2015] [error] No handlers could be found for logger "cache"

and the timing of these is closest to matching when the problem occurs in the web UI.

Ok, now this is a bit more curious:

whisper.CorruptWhisperFile: Unable to read header (/opt/graphite/storage/whisper/netapp/perf7/xxx/wafl/cp_phase_times/P2V_SNAP.wsp)

... and there are quite a few of those! And they're all 0 sized files... likely a result of /opt running out of space!

How do I regrow those files? Just remove the empty ones and restart netapp-harvest?

It seems likely...

Removing all of the zero sized files caused them to be recreated (now that there was disk space aplenty) after restarting all of the services:

# tail /opt/graphite/storage/log/carbon-cache/carbon-cache-a/creates.log
31/12/2015 18:09:10 :: creating database file /opt/graphite/storage/whisper/netapp/perf7/XXX/avg_latency.wsp (archive=[(60, 50400), (300, 28800), (900, 37920), (3600, 43800)] xff=0.5 agg=average)
31/12/2015 18:12:13 :: new metric netapp.perf7.XXX.write_align_histo.4 matched schema netapp.perf
31/12/2015 18:12:13 :: new metric netapp.perf7.XXX.write_align_histo.4 matched aggregation schema default_average
31/12/2015 18:12:13 :: creating database file /opt/graphite/storage/whisper/netapp/perf7/XXX/write_align_histo/4.wsp (archive=[(60, 50400), (300, 28800), (900, 37920), (3600, 43800)] xff=0.5 agg=average)
31/12/2015 18:15:08 :: new metric netapp.perf7.XXX.write_ops matched schema netapp.perf
31/12/2015 18:15:08 :: new metric netapp.perf7.XXX.write_ops matched aggregation schema default_average
31/12/2015 18:15:08 :: creating database file /opt/graphite/storage/whisper/netapp/perf7/XXX/write_ops.wsp (archive=[(60, 50400), (300, 28800), (900, 37920), (3600, 43800)] xff=0.5 agg=average)
31/12/2015 18:15:09 :: new metric netapp.perf7.XXX.write_data matched schema netapp.perf
31/12/2015 18:15:09 :: new metric netapp.perf7.XXX.write_data matched aggregation schema default_average
31/12/2015 18:15:09 :: creating database file /opt/graphite/storage/whisper/netapp/perf7/XXX/write_data.wsp (archive=[(60, 50400), (300, 28800), (900, 37920), (3600, 43800)] xff=0.5 agg=average)
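
As a side note on why a full /opt bites so hard: the archive list in those creates.log lines can be turned into retention periods and an approximate per-file size. The sketch below is only a back-of-the-envelope calculation. It assumes each tuple is (seconds per point, number of points) and that the struct layouts shown match whisper's on-disk format (they are written from memory of whisper.py, so treat them as an assumption and check your installed version).

```python
import struct

# Archive list copied from the creates.log lines above, read as
# (seconds per point, number of points).
ARCHIVES = [(60, 50400), (300, 28800), (900, 37920), (3600, 43800)]

# Assumed whisper record layouts (not taken from this thread):
METADATA_FMT = "!2LfL"     # aggregation type, max retention, xff, archive count
ARCHIVE_INFO_FMT = "!3L"   # offset, seconds per point, points
POINT_FMT = "!Ld"          # timestamp, value

SECONDS_PER_DAY = 86400


def retention_days(archives):
    """Return how many whole days each archive covers."""
    return [secs * points // SECONDS_PER_DAY for secs, points in archives]


def expected_file_size(archives):
    """Return the approximate size in bytes of one .wsp file."""
    header = struct.calcsize(METADATA_FMT)
    header += len(archives) * struct.calcsize(ARCHIVE_INFO_FMT)
    total_points = sum(points for _secs, points in archives)
    return header + total_points * struct.calcsize(POINT_FMT)


if __name__ == "__main__":
    print(retention_days(ARCHIVES))      # days covered by each archive
    print(expected_file_size(ARCHIVES))  # bytes per metric file
```

Under those assumptions each metric keeps 35 days at 1 minute, 100 days at 5 minutes, 395 days at 15 minutes and 1825 days at 1 hour, and each .wsp file takes a little under 2 MB. Multiplied by every volume and LUN counter, that adds up quickly.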

... and now every row *except* "TOP VOLUMES DRILLDOWN" from the dashboard page is ok. On "TOP VOLUMES", there are still 4 with the red triangle, including "Read Latency".

madden · ‎2016-01-02

Hi @DARREN_REED

Indeed, if disk space runs out, graphite-carbon will still create data files at 0 bytes, and graphite-web then cannot parse them, which gives you the errors.

After you grow your disk, I would do the following as the root user to see if there are still 0 byte data files:

cd /opt/graphite/storage/whisper/

find -name '*.wsp' -size 0

If there are, then do the same with the -delete flag to remove those files:

find -name '*.wsp' -size 0 -delete

As you mentioned, at the next polling interval they will get created again (up to the max creates per min set in carbon.conf) so give it a few minutes to handle the new creates. If you still have errors in Grafana I would use 'tail -f error.log' and reload the Grafana page that has the error and you should see some new messages get logged. Similar info should however also be in the debug window of Grafana should you click the ! and check one of those tabs. If you can share the details maybe I can give some tips.
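
To put a rough number on "a few minutes": if carbon only creates a limited number of new files per minute, the time to recreate everything is at least the file count divided by that limit. The sketch below is plain arithmetic, not anything carbon provides. The figures in the usage lines (1200 deleted files, a limit of 50 per minute) are hypothetical, so read the real limit from your own carbon.conf.

```python
import math


def minutes_to_recreate(file_count, max_creates_per_minute):
    """Lower bound, in minutes, for carbon to recreate file_count data files."""
    if max_creates_per_minute <= 0:
        raise ValueError("max_creates_per_minute must be positive")
    return math.ceil(file_count / max_creates_per_minute)


if __name__ == "__main__":
    # Hypothetical numbers: 1200 deleted files, limit of 50 creates per minute.
    print(minutes_to_recreate(1200, 50))
```

Treat the result as a lower bound: a file is only recreated when a new datapoint for that metric arrives, so the real time also depends on the polling interval.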

Also, if you run the latest Grafana (2.5 or newer) you will also need updated dashboards. See this article for more.

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

DARREN_REED · ‎2016-01-03

I cleaned up all of the 0 sized files and data is now populating those.

I have updated the installation of NetApp harvest with the new db_*.json files - that got rid of the "multiple data series" problems.

The error that I now see (but not everywhere) is "Timeseries data request error"

....

This was because some files were still 0 in size but lived on another filesystem, reached through a symbolic link. When I ran "find -size 0", it didn't follow the symbolic link, so it never discovered the extra 0 sized files.
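
For anyone hitting the same thing, here is a minimal sketch of a search that does follow symbolic links, written in Python so it can be dropped into a cron job or a health check. The function name and the follow_links parameter are made up for this example. The whisper path is the one used earlier in this thread. (With GNU find, the -L option gives the same link-following behaviour.)

```python
import os


def find_empty_whisper_files(root, follow_links=True):
    """Return a sorted list of zero-byte .wsp files under root.

    With follow_links=True, directories reached through symbolic links
    (for example a subtree moved to another filesystem) are searched too.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=follow_links):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, name)
            try:
                # os.stat follows links, so a linked file reports the
                # size of its target.
                if os.stat(path).st_size == 0:
                    empty.append(path)
            except OSError:
                # File vanished or link is broken; skip it.
                continue
    return sorted(empty)


if __name__ == "__main__":
    for path in find_empty_whisper_files("/opt/graphite/storage/whisper"):
        print(path)
```

The script only lists files and leaves deletion to you. One caveat: following links can loop forever if a link points back to one of its own parent directories, so check the layout first.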

DARREN_REED · ‎2016-01-10

.

rsr_72 · ‎2016-01-13

I too see this in spots - The error that I now see (but not everywhere) is "Timeseries data request error". Is that just the nature of some selections?

madden · ‎2016-01-13

@rsr_72,

"Timeseries data request error" is the error Grafana throws anytime it gets an error from Graphite. To see the details of the true error you have to click the "!" and see what's in there. My experience is the failures are usually one of:

1) An underlying metric file is corrupt, which causes a parse error during rendering by Graphite. I haven't seen a file corrupt itself once it was active, but I have seen newly created files be zero bytes long. To resolve, make sure there is enough space in the filesystem and delete the offending metric files.

2) A metrics query is so large (like a wildcard across thousands of volumes) that it times out. To resolve, try reducing the number of instances by using the template picklist to narrow in on a specific resource, and see if it can now load in a reasonable time.

3) There are no matching metrics in Graphite for that panel, and the metrics query includes a function which errors because the input list is empty. There is no need to resolve this; simply ignore the panel and treat it the same as a panel that says "No datapoints". Take the Flash Pool panel on the node dashboard as an example: if you don't have Flash Pool, you get the "Timeseries data request error". Looking at the metric strings, the sumSeries function doesn't return an error on no datapoints but the averageSeries function does. So I suppose you could submit a bug to Graphite to have it be silent instead of erroring. There may be more like this.

In all three cases the graphite-web error log should contain why the render failed. If it's not one of these reasons then giving the exact panel would help me to troubleshoot and fix the template.
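
If clicking through Grafana is awkward, the failing query can also be replayed against Graphite's render endpoint directly, with the same parameters that appear in the request details quoted at the top of the thread (target, from, until, format). The sketch below only builds the URL and sends nothing. The host name graphite.example.com is a placeholder, and the function name is made up for this example.

```python
from urllib.parse import urlencode


def build_render_url(base_url, target, from_time="-6h", until="now", fmt="json"):
    """Build a Graphite /render URL for one target expression."""
    query = urlencode({
        "target": target,
        "from": from_time,
        "until": until,
        "format": fmt,
    })
    return base_url.rstrip("/") + "/render?" + query


if __name__ == "__main__":
    # Target copied from the failing "Top Volumes" request earlier in the thread.
    target = "aliasByNode(highestAverage(netapp.perf7.*.*.vol.*.read_data, 5), 3, 5)"
    # Placeholder host; substitute your own graphite-web address.
    print(build_render_url("http://graphite.example.com", target))
```

Opening the printed URL in a browser (or fetching it with curl) while running 'tail -f error.log' should show the same failure Grafana reports, without the dashboard in between.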

Hope this helps!

Cheers,
Chris Madden
