Harvest in high network latency scenarios

CHMOELLER · 2016-06-27

Hey guys,

it might be a bit of a trivial question but I was wondering ... I am very pleased with the Harvest-Grafana setup. I have used it to monitor our domestic sites quite successfully and would like to expand it to our international sites as well. However, latencies will come into play here: a couple of hundred ms to our remote locations! (yarp ...)

Now I wonder what would be preferable:

1. Don't worry at all and let Harvest collect via the higher-latency MPLS lines

2. Have local Harvest instances at the remote sites so they collect locally but talk to the Graphite database via the higher-latency MPLS lines

3. Do neither and create a separate Harvest-Grafana setup for (and in) each location

Obviously 3 is not really an option I'd like to consider.

I guess the question really is - is "Harvest-to-filer" or "Harvest-to-Graphite" better suited for slower speeds and higher latencies ...

And a side question, what kind of traffic could be expected "per filer"?

Side note: there are Riverbed SteelHead appliances in the loop in case that makes much of a difference here.

Thanks!

Regards

Chris

madden · 2016-06-28

Hi @CHMOELLER

In the Harvest admin guide 1.2.2 section 2.1 I have this snippet:

Typical bandwidth usage from Harvest to the monitored node is ~ 15Kbps, and from the monitored node 
to Harvest 90Kbps.  Again, as instance count increases the bandwidth used will as well. 

If you have remote nodes with many monitored instances (i.e. many vols, luns, lifs, etc) and significant 
network latency (20ms+)  it may be beneficial to deploy a Harvest poller host local to those nodes and 
send metrics over the WAN to a central Graphite server.  In this way the Harvest polls will not be 
unnecessarily delayed by network latency. To determine if having a local poller would be beneficial, test 
running Harvest from the remote site and compare the poll duration to the poll update frequency (use the 
Grafana Harvest dashboard or start netapp-worker with the -v flag).  If the poll duration is much less 
than the frequency then it is fine to poll from the central site.  But if not, placing the poller on a host near 
the monitored system is recommended.
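
To make the guide's numbers concrete, here is a rough Python sketch (not part of Harvest). It scales the quoted per-node bandwidth figures by node count and compares a measured poll duration against the poll frequency. The function names, the linear scaling, and the 0.5 headroom threshold standing in for "much less than" are assumptions for illustration; the guide does not define a threshold.

```python
# Rough sketch, not part of Harvest. The two bandwidth figures are the
# "typical" values from the admin guide excerpt; everything else is assumed.
TO_NODE_KBPS = 15     # Harvest -> monitored node
FROM_NODE_KBPS = 90   # monitored node -> Harvest


def estimate_bandwidth_kbps(node_count):
    """Return (to_nodes, from_nodes) typical bandwidth in Kbps.

    Assumes linear scaling with node count. The guide notes that usage
    also grows with instance count, which this does not model.
    """
    return node_count * TO_NODE_KBPS, node_count * FROM_NODE_KBPS


def central_polling_ok(poll_duration_s, poll_frequency_s, headroom=0.5):
    """True if the poll duration is 'much less' than the poll frequency.

    headroom is a hypothetical threshold: 0.5 means a poll should finish
    within half of the polling interval.
    """
    return poll_duration_s < headroom * poll_frequency_s


if __name__ == "__main__":
    to_kbps, from_kbps = estimate_bandwidth_kbps(4)
    print(f"4 nodes: ~{to_kbps} Kbps to nodes, ~{from_kbps} Kbps back")
    print("10s poll, 60s interval ok:", central_polling_ok(10, 60))
    print("50s poll, 60s interval ok:", central_polling_ok(50, 60))
```

The poll duration itself still has to be measured as the guide describes (Grafana Harvest dashboard or verbose worker output); the sketch only applies the comparison.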

Maybe this helps?

The communication between Harvest and the cluster is quite chatty (lots of API requests/responses) and the WAN latency adds to each request. For a small cluster, collection might take 10s locally vs 30s over the WAN. Still, as long as a poll takes less than 60s, no worries! Communication between Harvest and Graphite is one-way and carries less data, so I don't think WAN latency will matter much.
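
As a back-of-the-envelope illustration of why a chatty protocol suffers over the WAN, the following Python sketch models poll time as the local duration plus one added round trip per API request. The model, the function names, and the figure of 100 sequential requests are assumptions picked to reproduce the 10s vs 30s example at 200 ms of added latency; the real request count depends on the cluster and is not stated here.

```python
# Simplified model, not Harvest code: assumes API requests are issued
# one after another, so each one pays the added WAN round-trip time.

def estimated_poll_duration_s(local_duration_s, sequential_requests, added_rtt_ms):
    """Local poll time plus one added round trip per sequential request."""
    return local_duration_s + sequential_requests * added_rtt_ms / 1000


def fits_interval(poll_duration_s, poll_interval_s=60):
    """True if the poll finishes before the next one is due."""
    return poll_duration_s < poll_interval_s


if __name__ == "__main__":
    # Hypothetical inputs: 10 s local poll, 100 requests per poll.
    for rtt_ms in (0, 200, 500):
        duration = estimated_poll_duration_s(10, 100, rtt_ms)
        print(f"added RTT {rtt_ms} ms -> ~{duration:.0f} s, "
              f"fits 60 s interval: {fits_interval(duration)}")
```

Under these assumed inputs, a latency spike that more than doubles the round-trip time is enough to push a poll past the interval, which is when skipped polls would start to appear.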

For simplicity I would first try to collect over the WAN. If you see skipped polls, then set up a Harvest poller local to the cluster. In all cases I think a central Graphite server will be fine.

Hope this helps!

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data


CHMOELLER · 2016-09-22

To follow up on this in case anyone might ever wonder.

The collectors polling the remote sites went offline multiple times. Maybe due to latency spikes and resulting timeouts? The "base latency" is around 180ms but it can spike to 500ms+ at times. I don't know if that was actually the cause, but as suggested I installed local collectors and let them push the data to the database in our HQ.

No more issues! So when in doubt this seems to be the more stable way to do it.