I've followed the Graphite_Grafana_Quick_Start_v1.4.pdf and NetApp_Harvest_IAG_1.2.2.pdf and added some clusters, and the stats/graphs are all good for most of them. However, for the first 6-node cluster added to the setup, I see a lot of N/As and "!" markers in the Grafana dashboards. Hovering over a "!" shows the detail "timeseries data request errors".
If I use the Grafana node dashboard and look at 'node1', only a couple of graphs show a "!". However, if I look at 'node4', pretty much every graph shows a "!".
If I look at the top CPU domains for the same cluster and nodes in Graphite, I see the stats for the nodes whose graphs draw correctly in Grafana. But as soon as I add a CPU domain stat for a node that produces a "!" in Grafana, I lose all stats in the Graphite browser/composer.
I assume some of the stats are not being polled into Graphite correctly, but I'm unsure where to look next to find the root cause. Can anybody help?
whisper.CorruptWhisperFile: Unable to read header (/opt/graphite/storage/whisper/netapp/perf/EUDC/eu-cnas01/svm/eu-vnasd-01/vol/db_tst_a18/qos_latency.wsp)
So that is telling you that you have a corrupt whisper file. If a file is corrupt, accesses to it from the API will fail. Likely if you had clicked on the Grafana timeseries warning notice and clicked through the debug tabs you would have seen the same error, although given there is a whole traceback it isn't so obvious.
I have seen a corrupt metrics file at just one other installation in all my time working with Graphite at dozens of customers. Maybe your disk was full and that caused it?
Searching Google for the error above might give you some tips on how to find them all, or grepping them from the logs could work too. Once you have the list, assuming you have little history, just delete the files and let them be recreated.
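As a rough sketch of the log-grepping idea, something like the following could pull the affected file names out of the logs. The pattern is taken only from the error line quoted above; the function name, the script name, and the log location in the usage note are assumptions, so adjust them to your install.

```python
import re
import sys

# Pattern modelled on the error line quoted in this thread; other whisper
# errors may be worded differently and would not be matched.
CORRUPT_RE = re.compile(
    r"CorruptWhisperFile: Unable to read header \((?P<path>[^)]+\.wsp)\)"
)


def corrupt_paths_from_log(lines):
    """Return the unique .wsp paths named in matching errors, in first-seen order."""
    seen = []
    for line in lines:
        match = CORRUPT_RE.search(line)
        if match and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


if __name__ == "__main__" and len(sys.argv) > 1:
    # Each command-line argument is treated as a log file to scan.
    collected = []
    for log_name in sys.argv[1:]:
        with open(log_name, errors="replace") as handle:
            collected.extend(handle)
    for path in corrupt_paths_from_log(collected):
        print(path)
```

Saved under a hypothetical name such as find_corrupt_from_log.py, it could be run against the graphite-web logs (on a default install I would expect them under /opt/graphite/storage/log/webapp/, but check your own setup). It only prints the paths and deletes nothing, so you can review the list first.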
Cheers, Chris Madden
Storage Architect, NetApp EMEA (and author of Harvest)
First off, thanks for this (Harvest, Graphite & Grafana) setup. It's amazing, and it certainly fills a big hole in perf stats from 7-Mode that is not presently available direct from NetApp for cDOT (OPM is coming along, but this is way better :), and OCI is mega $$$s).
Re the original issue: yes, I did hit 100% on /opt, and wondered if this was part of the issue (sorry for not mentioning it before!). I will find the affected wsp files and delete them and see where I get...
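For finding the affected files directly on disk, a minimal sketch could walk the whisper tree and flag any .wsp file that is too short to hold the header it declares. This assumes the on-disk layout is a 16-byte metadata header ("!2LfL") followed by one 12-byte record ("!3L") per archive, which is my understanding of the whisper format; the whisper module's own reader remains the authoritative check. The function names are hypothetical, and treating an archive count of zero as suspect is a heuristic of mine.

```python
import os
import struct
import sys

# Assumed whisper header layout (see the note above).
METADATA = struct.Struct("!2LfL")    # aggregation, max retention, xFilesFactor, archive count
ARCHIVE_INFO = struct.Struct("!3L")  # offset, seconds per point, points


def header_looks_ok(path):
    """Return True if the file is big enough to hold the header it declares."""
    try:
        size = os.path.getsize(path)
        with open(path, "rb") as handle:
            raw = handle.read(METADATA.size)
    except OSError:
        return False
    if len(raw) < METADATA.size:
        return False  # truncated or empty file, e.g. after a full disk
    archive_count = METADATA.unpack(raw)[3]
    if archive_count == 0:
        return False  # heuristic: a usable file has at least one archive
    return METADATA.size + archive_count * ARCHIVE_INFO.size <= size


def find_suspect_files(root):
    """Return a sorted list of .wsp files under root that fail the header check."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".wsp"):
                full = os.path.join(dirpath, name)
                if not header_looks_ok(full):
                    suspects.append(full)
    return sorted(suspects)


if __name__ == "__main__" and len(sys.argv) > 1:
    for suspect in find_suspect_files(sys.argv[1]):
        print(suspect)
```

Pointed at /opt/graphite/storage/whisper (the root seen in the error above), it prints the suspect files and leaves deletion as a manual step. A file can pass this check and still be damaged further in, so comparing the result with the paths named in the logs is worthwhile.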