Active IQ Unified Manager Discussions

NetApp-Harvest: Performance Monitoring frequency

pogranovich
3,923 Views

Hello,

 

Default Harvest performance monitoring frequency seems to be 1 minute. This is good from a data collection perspective, but is there any concern about the additional load that monitoring this frequently puts on the cluster (especially when the cluster’s load is already approaching critical levels)?

 

Thanks,

Paul Ogranovich

1 ACCEPTED SOLUTION

madden
3,881 Views

Hi Paul,

 

Harvest is not multi-threaded, so it can have at most 1 API call active on a cluster.  I would argue that if a single active API call can cause a noticeable impact on the cluster, that is a defect in Data ONTAP rather than a sign that the polling frequency is too high.  So far (knock on wood) I haven't heard of a complaint about Harvest impacting the servicing of data by the cluster.  API work is handled by the mgwd process, which runs in the HostOS domain.  This domain can run on multiple CPUs concurrently, so responding to APIs does not block other work, but it does consume part of the total CPU resource.  Responding to API requests is also lower priority than data access, so the API should get slow rather than data access being impacted.  Also, the data collected each polling interval comes from the counter manager subsystem, which already has the counter values in memory.  So collecting data involves no new or heavy work; the values are just read from memory, processed, and returned to the API caller, Harvest in this case.

 

But theory and reality don't always match, so your question spurred me to do a quick test against my 2-node cDOT 8.2 lab cluster with different collection frequencies.  Normally I run the poller at 60s, so I have this as a baseline; I then changed it to 10s for a while, and then to 120s for a while.  For the test I used the default template, and I also noticed that at 10s there were skips (i.e. the poller wasn't done collecting yet, so some objects had a skipped sample).
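To illustrate why skips appear at short intervals, here is a minimal sketch of a single-threaded poller that skips a scheduled sample whenever the previous collection is still running. It is not Harvest's actual code: the function name `simulate_poller` and the 25-second collection time are assumptions made up for the example, and the real poller tracks skips per object rather than per poll.

```python
def simulate_poller(interval_s, collect_s, duration_s):
    """Count completed and skipped samples for a single-threaded poller.

    Polls are scheduled at t = 0, interval_s, 2 * interval_s, ... up to
    (but not including) duration_s. A scheduled poll is skipped when the
    previous collection has not finished by that time.
    """
    completed = 0
    skipped = 0
    busy_until = 0  # time at which the collection in progress finishes
    t = 0
    while t < duration_s:
        if t >= busy_until:
            # Poller is idle: start a new collection.
            busy_until = t + collect_s
            completed += 1
        else:
            # Previous collection still running: this sample is skipped.
            skipped += 1
        t += interval_s
    return completed, skipped


# Hypothetical 25 s collection time, observed over 10 minutes.
for interval in (120, 60, 10):
    done, skips = simulate_poller(interval, 25, 600)
    print(f"{interval:>3}s interval: {done} completed, {skips} skipped")
```

With these assumed numbers, the 60s and 120s intervals never skip, while at 10s an API call is effectively always active, which is the worst case from the cluster's point of view.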

 

Here is a graph of HostOS CPU as a proxy for how much CPU responding to APIs costs:

hostOS.CPU.png

The averages when I zoomed in to each period were:

60s until 17:00: 7.5% avg HostOS util per node

10s 17:00-19:18: 15% avg HostOS util per node (outside of huge spike at 18:15-18:20; this is probably some other mgt work like creating a performance ASUP or archiving/compressing logs)

120s after 19:20: 6.5% avg HostOS util per node

Note: These percentages are out of total CPU.  So if you have an FAS8080 with 20 cores, you have 2000% total CPU; in the worst case that is 15% out of 2000%, or 0.75% of each CPU core for all of HostOS, which includes the Harvest workload.
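The arithmetic in the note can be checked with a few lines of Python. This is only a sketch: the helper name `per_core_share` is made up for the example, the utilization figures are the averages quoted above, and the 20-core count is the FAS8080 example from the note, not the lab cluster that was measured.

```python
def per_core_share(hostos_util_pct, cores):
    """Spread a HostOS utilization figure (in % of one core) over all cores."""
    return hostos_util_pct / cores


# Average HostOS utilization per node at each polling interval (from the test above).
hostos_util = {"60s": 7.5, "10s": 15.0, "120s": 6.5}
cores = 20  # FAS8080 example from the note

for interval, util in hostos_util.items():
    share = per_core_share(util, cores)
    print(f"{interval:>4}: {util:.1f}% HostOS -> {share:.3f}% of each core")

# Extra HostOS cost of polling every 10s instead of every 60s,
# in percentage points of a single core.
extra = hostos_util["10s"] - hostos_util["60s"]
print(f"10s vs 60s: +{extra:.1f} points, {per_core_share(extra, cores):.3f}% per core")
```

Keep in mind that the HostOS figure covers all management work, so the part attributable to Harvest alone is smaller than these numbers.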

 

There was varying frontend workload, so for completeness here was avg CPU use as well:

avg.cpu.png

 

 

So while not scientific, hopefully the above gives some idea of the cost of monitoring with Harvest at various frequencies, including the worst case (i.e. a frequency that results in constant skips because an API call is always active).  My takeaway is the CPU impact is so small you won't notice it.

 

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

 

P.S.  Please select “Options” and then “Accept as Solution” if this response answered your question so that others will find it easily!

 


2 REPLIES 2

pogranovich
3,858 Views

Hi Chris,

 

  Thanks for the detailed explanation.

 

Paul Ogranovich.
