Solved: netapp-harvest graphs not showing more than 1 day

mikecurrin · 2015-10-08

Hi,

I have Graphite working pretty well, except for a few things I'm still trying to fix. This is something I've been wanting to see from our systems for a while.

The main issue is that my graphs only display a maximum of 24 hours of data, even if I zoom out further (e.g. the screenshot below shows a 48-hour period). I copied all the configs according to your instructions in the Quick Start guide, but when I zoom out I don't see any more data.

I'm not sure whether the data is being collected but not displayed, or whether it isn't being collected/stored at all.

Where could I start looking to get more of an idea?

Thanks,

Mike

madden · 2015-10-08

Hi Mike,

Glad you like it so far and I think I know the problem.

Graphite's storage-schemas.conf file controls the frequency and retention of stored metrics. The file can have many entries, and each entry has a regular expression that is compared against the incoming metric name. Entries are processed in order, and the first pattern that matches determines the retentions the metric's file is created with. So having the correct entries in the correct order (in particular, not having a catch-all as the first one) is critical.

If you haven't edited it, the file looks like this:

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d

As you can see, this is 1-minute samples with 1-day retention. Maybe you forgot to edit the file and add the entries from the Harvest install guide 1.2.2, section 7.1? Or maybe you pasted them at the end of the file instead of in front of the default catch-all entry?

If so, fix the file, and new metrics will then be created with the correct settings, which look like this:

[netapp_perf]
pattern = ^netapp(\.poller)?\.perf7?\.
retentions = 1m:35d,5m:100d,15m:395d,1h:5y

[netapp_capacity]
pattern = ^netapp(\.poller)?\.capacity\.
retentions = 15m:100d,1d:5y

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
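To see why the order matters, here is a minimal sketch of the first-match rule in plain sh plus grep -E. The hypothetical helper first_match_schema reads a storage-schemas.conf on stdin and prints the first section whose pattern matches a metric name. It assumes carbon searches each pattern (unanchored) in file order, and that grep -E treats these simple patterns the same way Python's regex engine does; apart from the avg_processor_busy metric, which comes from the whisper-info.py example later in this thread, the metric names are made up.

```shell
#!/bin/sh
# first_match_schema METRIC   (reads a storage-schemas.conf on stdin)
# Prints the [section] of the first entry whose pattern matches METRIC,
# i.e. "first match wins". Assumption: grep -E behaves like carbon's
# Python regex search for the simple patterns used in this thread.
first_match_schema() {
  local metric section line pattern
  metric=$1
  section=""
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      \[*\])
        section=${line#\[}
        section=${section%\]}
        ;;
      pattern*=*)
        pattern=${line#*=}
        pattern=${pattern#"${pattern%%[![:space:]]*}"}   # drop leading blanks
        if printf '%s\n' "$metric" | grep -Eq -- "$pattern"; then
          printf '%s\n' "$section"
          return 0
        fi
        ;;
    esac
  done
  return 1
}

# Catch-all first: the broken layout.
default_first='[default_1min_for_1day]
pattern = .*
retentions = 60s:1d

[netapp_perf]
pattern = ^netapp(\.poller)?\.perf7?\.
retentions = 1m:35d,5m:100d,15m:395d,1h:5y'

# Catch-all last: the corrected layout.
netapp_first='[netapp_perf]
pattern = ^netapp(\.poller)?\.perf7?\.
retentions = 1m:35d,5m:100d,15m:395d,1h:5y

[netapp_capacity]
pattern = ^netapp(\.poller)?\.capacity\.
retentions = 15m:100d,1d:5y

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d'

metric=netapp.perf.nl.blob1.node.blob1-01.system.avg_processor_busy
printf '%s\n' "$default_first" | first_match_schema "$metric"   # default_1min_for_1day
printf '%s\n' "$netapp_first" | first_match_schema "$metric"    # netapp_perf
```

With the catch-all first, the perf metric is created with 60s:1d; with it last, the same metric gets the netapp_perf retentions, and only metrics that match nothing else fall through to the default.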

For existing metrics you have two options: (A) delete them and they will be created again automatically or (B) resize the existing db files.

For (A), delete the files (losing your existing 24 hours of data) so they are recreated automatically, using whichever of these matches your metrics storage location:

rm -rf /opt/graphite/storage/whisper/netapp
or 
rm -rf /var/lib/graphite/whisper/netapp
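If you script option (A), it is worth guarding against picking the wrong location or an empty path. A minimal sketch, assuming only the two storage locations above exist as candidates; whisper_root and delete_netapp_metrics are hypothetical helper names:

```shell
#!/bin/sh
# whisper_root [CANDIDATE...]
# Prints the first candidate that exists as a directory. With no arguments it
# checks the two storage locations from this thread. Returns 1 if none exists.
whisper_root() {
  local d
  [ "$#" -gt 0 ] || set -- /opt/graphite/storage/whisper /var/lib/graphite/whisper
  for d in "$@"; do
    if [ -d "$d" ]; then
      printf '%s\n' "$d"
      return 0
    fi
  done
  return 1
}

# delete_netapp_metrics [CANDIDATE...]
# Option (A): removes <root>/netapp, but only after a root was actually found,
# so an empty variable can never turn the command into "rm -rf /netapp".
delete_netapp_metrics() {
  local root
  root=$(whisper_root "$@") || {
    echo "no whisper storage directory found" >&2
    return 1
  }
  rm -rf -- "$root/netapp"
}
```

Run delete_netapp_metrics with no arguments on the Graphite server; it removes only the netapp subtree of whichever location exists first.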

For (B), to resize the files and keep your existing 24 hours of data:

1. Make sure you have plenty of free space on your filesystem, because the files get much bigger (from 17KB to a little less than 2MB each) with the correct retention.

2. Change to your storage directory (depending on where your metrics are stored):

cd /opt/graphite/storage/whisper/netapp
or
cd /var/lib/graphite/whisper/netapp

3. Update the db files (the command is whisper-resize or whisper-resize.py depending on your distribution; run it by itself to see which one you have):

find perf -name '*.wsp' -exec whisper-resize.py --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find perf7 -name '*.wsp' -exec whisper-resize.py --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find capacity -name '*.wsp' -exec whisper-resize.py --xFilesFactor=0.0 --nobackup {} 15m:100d 1d:5y \;
cd poller;
find perf -name '*.wsp' -exec whisper-resize.py --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find perf7 -name '*.wsp' -exec whisper-resize.py --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find capacity -name '*.wsp' -exec whisper-resize.py --xFilesFactor=0.0 --nobackup {} 15m:100d 1d:5y \;

or 

find perf -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find perf7 -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find capacity -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 15m:100d 1d:5y \;
cd poller;
find perf -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find perf7 -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find capacity -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 15m:100d 1d:5y \;

4. Reset permissions so the files can be read/written by the carbon and webserver processes:

chown -R apache:apache *

or

chown -R _graphite:_graphite *
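The six find commands in step 3 can also be driven from one loop that picks whichever resize tool is installed. This is a sketch, not a tested procedure: pick_resize_tool and resize_netapp_tree are hypothetical names, the retentions are the ones from the storage-schemas.conf entries above, and '*.wsp' is quoted so that find, not the shell, expands it.

```shell
#!/bin/sh
# pick_resize_tool: prints whichever resize command this install provides.
pick_resize_tool() {
  local t
  for t in whisper-resize.py whisper-resize; do
    if command -v "$t" >/dev/null 2>&1; then
      printf '%s\n' "$t"
      return 0
    fi
  done
  return 1
}

# resize_netapp_tree TOOL NETAPP_DIR
# Runs step 3 for NETAPP_DIR (the .../whisper/netapp directory) and for its
# poller/ subdirectory, skipping perf, perf7 or capacity if absent.
resize_netapp_tree() {
  local tool root base sub
  tool=$1
  root=$2
  for base in "$root" "$root/poller"; do
    for sub in perf perf7 capacity; do
      [ -d "$base/$sub" ] || continue
      # Positional parameters now hold the retentions for this subtree.
      if [ "$sub" = capacity ]; then
        set -- 15m:100d 1d:5y
      else
        set -- 1m:35d 5m:100d 15m:395d 1h:5y
      fi
      find "$base/$sub" -name '*.wsp' \
        -exec "$tool" --xFilesFactor=0.0 --nobackup {} "$@" \;
    done
  done
}
```

For example, from the netapp directory of step 2: tool=$(pick_resize_tool) && resize_netapp_tree "$tool" "$PWD". Step 4 is still needed afterwards, since the rewritten files will likely be owned by whoever ran the resize.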

Hope this helps! You aren't the first one to have this problem so I will also update the install guide to be more clear to help others in the future.

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

mikecurrin · 2015-10-08

Hi Chris,

You were right: I still had the [default_1min_for_1day] section at the top of the storage-schemas.conf file. I removed it and added the entry from that other site (which had a one-year retention; I'm just going to use that for now) to the bottom of the file, after the [netapp.*] sections.

I removed the old whisper/netapp folder and then restarted the carbon-cache agent, so that should hopefully do it.

I'll check on it tomorrow (it's 10pm here now, so I'd better call it a day) to see if things are graphing as I expect. I'll probably need to add some more disk to my server too, as the whisper/netapp folder seems to be growing quite a bit; this was only a test install anyway, so I can do that easily.

I have a few other small things I'm struggling with; I assume posting in the discussions here is the best way to ask about them and (hopefully) get them resolved.

Oh, and yes: Graphite is fantastic. This is something I've been looking for since we put in CDOTA, and I hadn't found anything that could do it the way I wanted. Now it's just a matter of interpreting the results to sort out some of our storage "issues".

Regards,
Mike

madden · 2015-10-08

You can always check whether the files have the retention you want by looking in the creates log (/opt/graphite/storage/log/carbon-cache/carbon-cache-a/creates.log), or by using the whisper utility whisper-info.py, which should be installed with Graphite.

Run it like this:

root@sdt-graphite:/# whisper-info.py /opt/graphite/storage/whisper/netapp/perf/nl/blob1/node/blob1-01/system/avg_processor_busy.wsp 

maxRetention: 157680000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 1931104

Archive 0
retention: 3024000
secondsPerPoint: 60
points: 50400
size: 604800
offset: 64

Archive 1
retention: 8640000
secondsPerPoint: 300
points: 28800
size: 345600
offset: 604864

Archive 2
retention: 34128000
secondsPerPoint: 900
points: 37920
size: 455040
offset: 950464

Archive 3
retention: 157680000
secondsPerPoint: 3600
points: 43800
size: 525600
offset: 1405504

That output shows each 'archive', including its seconds per point and the number of points it saves. You can also see immediately how much space each one consumes.
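For a problem like the one in this thread, the useful field is maxRetention: a file still on the 1-day default shows 86400. A minimal sketch that walks a whisper tree and lists such files, assuming whisper-info.py prints a "maxRetention: <seconds>" line exactly as in the output above. The helper names are hypothetical, and the idea that the tool may be called whisper-info on some distributions is an assumption by analogy with whisper-resize.

```shell
#!/bin/sh
# max_retention: reads whisper-info output on stdin, prints maxRetention (s).
max_retention() {
  awk -F': ' '$1 == "maxRetention" { print $2; exit }'
}

# list_short_retention DIR MAX_SECONDS [INFO_CMD]
# Prints each .wsp file under DIR whose maxRetention is MAX_SECONDS or less.
# INFO_CMD defaults to whisper-info.py.
list_short_retention() {
  local dir max info f r
  dir=$1
  max=$2
  info=${3:-whisper-info.py}
  find "$dir" -name '*.wsp' | while IFS= read -r f; do
    r=$("$info" "$f" | max_retention)
    if [ -n "$r" ] && [ "$r" -le "$max" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

For example, list_short_retention /opt/graphite/storage/whisper/netapp 86400 prints every file that would still hold only one day of data.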

For your space utilization, the big space grab occurs during the initial discovery of everything, since the files are preallocated to their full size. If you have it on NetApp storage, though, the zeros written at create time are detected, and you only consume storage on the array as the files are actually filled with real metric data. Note that you do need dedupe enabled on the volume (but not necessarily a scheduled job) for zero detection to work.
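The file sizes follow directly from the retentions, because whisper preallocates every point: 16 bytes of file header, 12 bytes per archive descriptor and 12 bytes per point, where points = retention / seconds per point. A small sketch of that arithmetic (helper names are hypothetical; only single-letter units s, m, h, d, w, y are handled, with y taken as 365 days, and a bare number is read as seconds, which is fine for the precision part) reproduces both the roughly 17KB default file and the fileSize: 1931104 in the whisper-info.py output above:

```shell
#!/bin/sh
# unit_seconds U: seconds in one unit (single-letter units only).
unit_seconds() {
  case $1 in
    s) echo 1 ;;
    m) echo 60 ;;
    h) echo 3600 ;;
    d) echo 86400 ;;
    w) echo 604800 ;;
    y) echo 31536000 ;;
    *) return 1 ;;
  esac
}

# to_seconds VALUE: "35d" -> 3024000; a bare number is taken as seconds.
to_seconds() {
  local num unit mult
  num=${1%%[!0-9]*}
  unit=${1#"$num"}
  if [ -z "$unit" ]; then
    echo "$num"
    return 0
  fi
  mult=$(unit_seconds "$unit") || return 1
  echo $(( num * mult ))
}

# whisper_file_size DEF...: bytes for a file with retentions like 1m:35d.
whisper_file_size() {
  local total def step keep
  total=16                                  # file header
  for def in "$@"; do
    step=$(to_seconds "${def%%:*}") || return 1
    keep=$(to_seconds "${def#*:}") || return 1
    total=$(( total + 12 + (keep / step) * 12 ))   # descriptor + points
  done
  echo "$total"
}

whisper_file_size 60s:1d                         # 17308
whisper_file_size 1m:35d 5m:100d 15m:395d 1h:5y  # 1931104
```

The 4-archive perf file holds 160920 points in total, which is why it is over 100 times larger than the 1440-point default file.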

Good luck and indeed if you have more questions post 'em in the communities!

Cheers,
Chris Madden

GMOORE · 2015-12-14

I had the same issue. Looking back at the install doc, it doesn't say to remove the default retention of 1d. Although I changed the config file based on section 7 of the install document, everything was getting caught by the default retention. This post helped me figure it out. Thanks for the help; NetApp Harvest is great!

GMOORE · 2015-12-16

Chris,

After making the changes you suggested, I think I may have accidentally messed something up in the configuration. Harvest hasn't collected any data since I made the changes. See the screenshot below.

Any ideas how I can remedy this?

madden · 2015-12-16

@GMOORE

If data isn't being displayed, there is a problem either collecting and sending it (Harvest), receiving and storing it (Graphite), or displaying it (Grafana). My guess is it's one of the first two.

I would check the log files from Harvest and Graphite for more detail. The Harvest log files are in /opt/netapp-harvest/log/<poller>.log. The locations of the Graphite carbon logs vary a bit depending on the OS and installation method used, but I have documented them in the Graphite and Grafana Quick Start guide.

I'd also make sure Harvest is running, because that could be the most basic reason you have no data (/opt/netapp-harvest/netapp-manager -status, and then the -start option to start the pollers if they aren't running).
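Another quick way to tell the first two cases apart from the third is to check whether carbon is still writing metric files: if .wsp files under the netapp tree are being modified, data is reaching Graphite and the problem is more likely on the Grafana side. A rough sketch, assuming carbon-cache flushes to disk within a few minutes (it buffers writes, so use a window of several minutes) and that your find supports -mmin (GNU and BSD find do); the function names are hypothetical:

```shell
#!/bin/sh
# recently_written DIR MINUTES
# Counts .wsp files under DIR modified within the last MINUTES minutes.
recently_written() {
  find "$1" -name '*.wsp' -mmin "-$2" | wc -l | tr -d ' '
}

# check_graphite_receiving DIR [MINUTES]
# Heuristic only: recent writes mean carbon is receiving and storing data.
check_graphite_receiving() {
  local n mins
  mins=${2:-10}
  n=$(recently_written "$1" "$mins")
  if [ "$n" -gt 0 ]; then
    echo "receiving: $n .wsp files updated in the last $mins minutes; look at Grafana next"
  else
    echo "not receiving: no .wsp files updated in the last $mins minutes; check the Harvest and carbon logs"
  fi
}
```

For example: check_graphite_receiving /var/lib/graphite/whisper/netapp 10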

If this isn't enough please open a new communities thread with the errors you find in the logfiles.

Cheers,

Chris Madden

markymarcc · 2016-08-12

Chris

I had the same issue with the default 1-day retention, and I plan to delete the netapp tree and change the default.

I just want to make sure I have all the steps, in the right order:

Stop the netapp harvest poller.

sudo service apache2 stop

sudo service grafana-server stop

sudo service carbon-cache stop

then

rm -rf /var/lib/graphite/whisper/netapp

then edit

/etc/carbon/storage-schemas.conf

then start the services back up

madden · 2016-08-13

Hi @markymarcc

Yes, your steps seem ok. You probably don't even need to stop any services but doing so won't hurt.

Cheers,
Chris Madden

rcasero · 2016-09-30

Hi Chris, I made the change to the database and it stopped displaying my graphs in Grafana. I have checked all pollers for errors and even removed the .wsp files. I restarted the server and confirmed I had no errors, and still nothing. Here is exactly what I did: I ran the set of commands you posted earlier.

find perf -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find perf7 -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find capacity -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 15m:100d 1d:5y \;
cd poller;
find perf -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find perf7 -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 1m:35d 5m:100d 15m:395d 1h:5y \;
find capacity -name '*.wsp' -exec whisper-resize --xFilesFactor=0.0 --nobackup {} 15m:100d 1d:5y \;

After that it stopped reporting to Grafana, but it is definitely pulling data. I'm not sure how to open a different thread as you suggested to someone else here, but we both definitely got the same outcome.

Thank you in advance, Ralph.

madden · 2016-10-11

Hi @rcasero

Did you get it working? Maybe file permissions or owner was too restrictive after your resize?

Cheers,

Chris

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

dan_brummer · 2016-10-19

Just FYI if you go with option B (find and resize) you will have to reset the permissions to get data again:

cd /var/lib/graphite/whisper/netapp

chown -R _graphite:_graphite *

-db

mattclayton · 2016-10-20

Thanks, Dan! This tip helped me fix it. On Red Hat, though, the ownership change is as follows:

chown -R apache:apache /opt/graphite/storage/
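To check whether ownership is the problem before (or after) running chown, you can list anything in the whisper tree that the carbon/web server user does not own. A minimal sketch; the helper names are hypothetical, and the user is apache on the Red Hat layout above or _graphite on the /var/lib/graphite layout, per the posts in this thread:

```shell
#!/bin/sh
# files_not_owned_by DIR USER
# Lists everything under DIR that USER does not own, e.g. .wsp files
# rewritten by a resize that was run as root.
files_not_owned_by() {
  find "$1" ! -user "$2" -print
}

# ownership_ok DIR USER: succeeds only if USER owns everything under DIR.
ownership_ok() {
  [ -z "$(files_not_owned_by "$1" "$2" | head -n 1)" ]
}
```

For example: files_not_owned_by /opt/graphite/storage/whisper apache | head. If it prints anything, the chown above is what fixes it.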

madden · 2016-10-21

Thanks @mattclayton, @dan_brummer for the chown step. I edited my earlier post so people will succeed the first time 🙂

rcasero · 2016-10-25

Thank you... It's working well now.

Mal-R · 2017-07-16

Hi Chris,

I have a similar problem; however, only one cluster out of 12 holds just 24 hours of data. The rest look normal.

Can you help me fix this?

Thanks

Malcolm