Active IQ Unified Manager Discussions

netapp-harvest + Graphite + Grafana, throughput is way off

DingbatCA
7,370 Views

This is an odd one, and I must admit I am very new to this whole tool set.

 

Followed this install process

http://blog.pkiwi.com/netapp-advanced-performance-monitoring-with-harvest-graphite-and-grafana/

 

System is running RedHat 7.1 x64.

 

Here is what I am seeing.  Throughput for everything seems to be badly off.

 

In Graphite I see the correct speeds, but in Grafana everything is off.  As best I can tell this applies to all throughput values: NFS/CIFS/FCP/disk.  (Screenshots attached: graphite.png, grafana.png.)

 

Any ideas?


6 REPLIES

cbiebers
7,311 Views

It's been a week, so I'm not sure if you're still looking for insight, but I see part of your issue here.

 

If you look carefully at your first image, with the raw Graphite data, you'll see that while on the left you've highlighted write_data, the legend of the graph is actually showing the metrics:

...aggr.total_transfers and ...aggr.Node02_SSD.total_transfers.  Remember that the Graphite interface adds each metric you double-click and removes it when you double-click it again.  It's easy to end up looking at a bunch of unrelated items this way.

 

In your Grafana dashboard you're looking at Node.xxx.fcp.read_data and Node.xxx.fcp.write_data.  These don't measure the same things.  Aggregate total_transfers does not equal protocol reads + writes for a node.  Each aggregate is measuring its own operations to and from disk; the protocols [FCP, iSCSI, CIFS, NFS] are each measuring their operations to and from the client.  Depending on what you want to measure you'd look at one, the other, or both together, but they will not show the same values.
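For reference, here is a rough sketch of the two metric families as Graphite paths.  The exact paths are only illustrative, pieced together from the pattern visible in the screenshots, so substitute your own group/cluster/node/aggregate names:

netapp.perf.$Group.$Cluster.node.$Node.aggr.$Aggr.total_transfers    (backend: that aggregate's operations to and from disk)

netapp.perf.$Group.$Cluster.node.$Node.fcp.read_data                 (frontend: FCP reads served to clients by that node)

netapp.perf.$Group.$Cluster.node.$Node.fcp.write_data                (frontend: FCP writes arriving from clients on that node)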

 

 

I hope that helps.   

DingbatCA
7,309 Views

Yep, still hunting for an answer.

 

None of the throughput values seem correct in Grafana, not just that one.  I know I am pushing 400~600 MB/s on average through my FAS8060, but Grafana is showing 0.4~1.2 MB/s.  So something is off, I just have no clue where to look.

cbiebers
7,286 Views

I just did a comparison of the values below in Grafana against Brocade Switch View for the physical ports in my environment.  (Because I'm comparing to physical ports, I have to look at the node's physical port values, not the SVM LIF values, or I'd have to do a bunch of math.  This is the metric you show on the left side of your initial Graphite screenshot, but not the metric you were actually displaying.)

 

I use the Network Port dashboard and reference these metrics in the Fibre Channel row (metrics captured by choosing Edit on the graph):

 

netapp.perf.$Group.$Cluster.node.$Node.fcp_port.$Port.write_data  highestAverage($TopResources)  aliasByNode(5, 7)
netapp.perf.$Group.$Cluster.node.$Node.fcp_port.$Port.read_data  highestAverage($TopResources)  aliasByNode(5, 7)
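(For anyone reproducing this outside the Grafana query editor: pasted as a raw Graphite target, that chain of functions nests roughly like the lines below.  This is a sketch assuming stock Graphite functions; the aliasByNode indexes depend on your own path depth.)

aliasByNode(highestAverage(netapp.perf.$Group.$Cluster.node.$Node.fcp_port.$Port.write_data, $TopResources), 5, 7)
aliasByNode(highestAverage(netapp.perf.$Group.$Cluster.node.$Node.fcp_port.$Port.read_data, $TopResources), 5, 7)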

 

When I check these values against the Brocade values I'm in alignment, with minor variation because the Brocade SwitchView GUI is displaying 30-second averages versus Harvest's 1-minute averages.

 

What do you see when you look at the values above in Grafana either on your own dashboard, or using the one provided in the package?

 

 

EDIT:  Note the bolded piece above (fcp_port rather than fcp) is different from what you were showing in Grafana.

 

 

 

madden
7,205 Views

Hi,

 

In the netapp-harvest.conf file you will find a default key/value like this:

normalized_xfer   = mb_per_sec   

 

This normalizes all throughput numbers to MB/s.  So in Graphite and Grafana you are viewing MB/s and not the native unit of the Data ONTAP counter manager counter being graphed.  I find normalized data much easier to work with; you can always scale back to whatever unit you need for your use case.
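For example, if you ever want a panel in the counter's native unit again, you can scale the normalized series in the query itself rather than changing what Harvest collects.  A rough sketch using Graphite's scale() function (the metric path is illustrative, and the factor assumes the MB/s normalization is binary, i.e. 1024*1024):

scale(netapp.perf.$Group.$Cluster.node.$Node.fcp.read_data, 1048576)

That turns MB/s back into bytes/sec; alternatively, just set the panel's Y-axis unit in Grafana so the values are labeled as MB/s.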

 

Regarding throughput being off, sometimes it is just user confusion, because with cDOT the node that does the frontend protocol work is not necessarily the same one that does the backend volume work.  Depending on the object you're looking at, you may see frontend or backend numbers.  The default "node" dashboard includes a "protocol backend drilldown" as well as rows like "FCP frontend drilldown" to show both.

 

So in the "frontend" views you see very detailed information about the IOPs arriving at that node.  Those IOPs are then translated into WAFL messages and sent to the backend (on the same or a different node) to be serviced.  At the "backend" the messages are tagged with protocol but otherwise are only tracked as read/write/other, versus the much greater detail tracked at the "frontend" node.  If all traffic is direct (IOPs arrive on a LIF on the same node that owns the volume) then the "frontend" and "backend" numbers should agree, but if you have indirect traffic they will differ.

 

Maybe you can check your setup taking the above info into account and let us know if that helped?  

--If it does, please also "accept as answer" the post that answered your question so that others will see the Q/A is answered.

 

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

DingbatCA
7,189 Views

I feel like a n00b.  Thanks for pointing me to the "normalized_xfer" setting.

 

cat netapp-harvest.conf | grep normalized_xfer

normalized_xfer   = mb_per_sec     

normalized_xfer   = gb_per_sec     

normalized_xfer   = gb_per_sec     

normalized_xfer   = gb_per_sec     
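(Side note for anyone else debugging this: to see which poller section each of those lines lives under, print the section headers alongside the values.  A quick sketch, assuming the INI-style layout of netapp-harvest.conf:)

grep -n -e '^\[' -e 'normalized_xfer' netapp-harvest.conf

The -n line numbers show the section headers and values in file order, so you can tell which [poller] section each normalized_xfer value follows.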

 

Changed the 3 involving NetApp to mb_per_sec and everything is now lining up perfectly!

madden
7,172 Views

Great to hear!  The default dashboards assume you normalize perf info to mb_per_sec and capacity info (OCUM) to gb_per_sec, so it was probably a copy/paste mistake between pollers of different server types.  I can see myself doing this too, so I'll have to think about improving usability here...
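For anyone landing here later, the intended split looks roughly like this in netapp-harvest.conf.  The section and key names below are only illustrative of the INI-style layout (check the template shipped with your Harvest version for the exact options):

[default]
graphite_server   = graphite.example.com     # hypothetical Graphite host

[cluster1]                                   # Data ONTAP (perf) poller
hostname          = cluster1.example.com
normalized_xfer   = mb_per_sec               # perf throughput in MB/s, as the default dashboards expect

[ocum1]                                      # OnCommand Unified Manager (capacity) poller
hostname          = ocum1.example.com
host_type         = OCUM                     # assumed key for a Unified Manager poller
normalized_xfer   = gb_per_sec               # capacity info in GB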

 

Thanks!

Chris
