Active IQ Unified Manager Discussions

netapp-harvest + Graphite + Grafana, throughput is way off

DingbatCA
7,370 Views

This is an odd one, and I must admit I am very new to this whole tool set.

 

Followed this install process

http://blog.pkiwi.com/netapp-advanced-performance-monitoring-with-harvest-graphite-and-grafana/

 

System is running RedHat 7.1 x64.

 

Here is what I am seeing.  Throughput for everything seems to be badly off.

 

In Graphite I see the correct speeds, but in Grafana everything is off.  As best I can tell this applies to all throughput values: NFS/CIFS/FCP/disk.  (Screenshots attached: graphite.png, grafana.png.)

 

Any ideas?


6 REPLIES

cbiebers
7,311 Views

It's been a week, so I'm not sure if you're still looking for insight, but I see part of your issue here.

 

If you look carefully at your first image, with the raw Graphite data, you'll see that while on the left you've highlighted write_data, the legend of the graph is actually showing the metrics:

...aggr.total_transfers and ...aggr.Node02_SSD.total_transfers.  Remember that the Graphite interface adds each metric you double-click and removes it when you double-click it again.  It's easy to end up looking at a bunch of unrelated items this way.

 

In your Grafana dashboard you're looking at Node.xxx.fcp.read_data and Node.xxx.fcp.write_data.  These don't measure the same things.  Aggregate total_transfers does not equal protocol reads + writes for a node.  Each aggregate is measuring its own operations to and from disk; the protocols [FCP, iSCSI, CIFS, NFS] are each measuring their operations to and from the client.  Depending on what you want to measure you'd look at one, the other, or both together, but they will not show the same values.
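For reference, here is a rough sketch of the two metric families as Graphite paths.  The exact paths are only illustrative, pieced together from the pattern visible in the screenshots, so substitute your own group/cluster/node/aggregate names:

netapp.perf.$Group.$Cluster.node.$Node.aggr.$Aggr.total_transfers    (backend: that aggregate's operations to and from disk)

netapp.perf.$Group.$Cluster.node.$Node.fcp.read_data                 (frontend: FCP reads served to clients by that node)

netapp.perf.$Group.$Cluster.node.$Node.fcp.write_data                (frontend: FCP writes arriving from clients on that node)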

 

 

I hope that helps.   

DingbatCA
7,309 Views

Yep, still hunting for an answer.

 

None of the throughput values seem correct in Grafana, not just that one.  I know I am pushing 400~600 MB/s on average through my FAS8060, but Grafana is showing 0.4~1.2 MB/s.  So something is off, I just have no clue where to look.

cbiebers
7,286 Views

I just did a comparison of the values below in Grafana against Brocade Switch View for the physical ports in my environment.  (Because I'm comparing to physical ports, I have to look at the node's physical port values, not the SVM LIF values, or I'd have to do a bunch of math.  This is the metric you show on the left side of your initial Graphite screenshot, but not the metric you were actually displaying.)

 

I use the Network Port dashboard and reference these metrics in the Fibre Channel row (metrics captured by choosing Edit on the graph):

 

netapp.perf.$Group.$Cluster.node.$Node.fcp_port.$Port.write_data  highestAverage($TopResources)  aliasByNode(5, 7)
netapp.perf.$Group.$Cluster.node.$Node.fcp_port.$Port.read_data  highestAverage($TopResources)  aliasByNode(5, 7)
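(For anyone reproducing this outside the Grafana query editor: pasted as a raw Graphite target, that chain of functions nests roughly like the lines below.  This is a sketch assuming stock Graphite functions; the aliasByNode indexes depend on your own path depth.)

aliasByNode(highestAverage(netapp.perf.$Group.$Cluster.node.$Node.fcp_port.$Port.write_data, $TopResources), 5, 7)
aliasByNode(highestAverage(netapp.perf.$Group.$Cluster.node.$Node.fcp_port.$Port.read_data, $TopResources), 5, 7)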

 

When I check these values against the Brocade values I'm in alignment, with minor variation because the Brocade SwitchView GUI is displaying 30-second averages versus Harvest's 1-minute averages.

 

What do you see when you look at the values above in Grafana either on your own dashboard, or using the one provided in the package?

 

 

EDIT:  Note the bolded piece above (fcp_port rather than fcp) is different from what you were showing in Grafana.

 

 

 

madden
7,205 Views

Hi,

 

In the netapp-harvest.conf file you will find a default key/value like this:

normalized_xfer   = mb_per_sec   

 

This normalizes all throughput numbers to MB/s.  So in Graphite and Grafana you are viewing MB/s and not the native unit of the Data ONTAP counter manager counter being graphed.  I find normalized data much easier to work with; you can always scale back to whatever unit you need for your use case.
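For example, if you ever want a panel in the counter's native unit again, you can scale the normalized series in the query itself rather than changing what Harvest collects.  A rough sketch using Graphite's scale() function (the metric path is illustrative, and the factor assumes the MB/s normalization is binary, i.e. 1024*1024):

scale(netapp.perf.$Group.$Cluster.node.$Node.fcp.read_data, 1048576)

That turns MB/s back into bytes/sec; alternatively, just set the panel's Y-axis unit in Grafana so the values are labeled as MB/s.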

 

Regarding throughput being off, sometimes it is just user confusion, because with cDOT the node that does the frontend protocol work is not necessarily the same one that does the backend volume work.  Depending on the object you're looking at, you may see frontend or backend numbers.  The default "node" dashboard includes a "protocol backend drilldown" as well as rows like "FCP frontend drilldown" to show both.

 

So in the "frontend" views you see very detailed information about the IOPs arriving at that node.  Those IOPs are then translated into WAFL messages and sent to the backend (on the same or a different node) to be serviced.  At the "backend" the messages are tagged with protocol but otherwise are only tracked as read/write/other, versus the much greater detail tracked at the "frontend" node.  If all traffic is direct (IOPs arrive on a LIF on the same node that owns the volume) then the "frontend" and "backend" numbers should agree, but if you have indirect traffic they will differ.

 

Maybe you can check your setup taking the above info into account and let us know if that helped?  

--If it does, please also "accept as answer" the post that answered your question so that others will see the Q/A is answered.

 

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

DingbatCA
7,189 Views

I feel like a n00b.  Thanks for pointing me to the "normalized_xfer" setting.

 

cat netapp-harvest.conf | grep normalized_xfer

normalized_xfer   = mb_per_sec     

normalized_xfer   = gb_per_sec     

normalized_xfer   = gb_per_sec     

normalized_xfer   = gb_per_sec     
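(Side note for anyone else debugging this: to see which poller section each of those lines lives under, print the section headers alongside the values.  A quick sketch, assuming the INI-style layout of netapp-harvest.conf:)

grep -n -e '^\[' -e 'normalized_xfer' netapp-harvest.conf

The -n line numbers show the section headers and values in file order, so you can tell which [poller] section each normalized_xfer value follows.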

 

Changed the 3 involving NetApp to mb_per_sec and everything is now lining up perfectly!

madden
7,172 Views

Great to hear!  The default dashboards assume you normalize perf info to mb_per_sec and capacity info (OCUM) to gb_per_sec, so it was probably a copy/paste mistake between pollers of different server types.  I can see myself doing this too, so I'll have to think about improving usability here...
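For anyone landing here later, the intended split looks roughly like this in netapp-harvest.conf.  The section and key names below are only illustrative of the INI-style layout (check the template shipped with your Harvest version for the exact options):

[default]
graphite_server   = graphite.example.com     # hypothetical Graphite host

[cluster1]                                   # Data ONTAP (perf) poller
hostname          = cluster1.example.com
normalized_xfer   = mb_per_sec               # perf throughput in MB/s, as the default dashboards expect

[ocum1]                                      # OnCommand Unified Manager (capacity) poller
hostname          = ocum1.example.com
host_type         = OCUM                     # assumed key for a Unified Manager poller
normalized_xfer   = gb_per_sec               # capacity info in GB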

 

Thanks!

Chris
