NetApp Harvest api request rejected System Busy

paul_p2lab · ‎2022-04-05

Hello,

I've had Harvest version 21.11.1-1 installed for several months, pulling metrics from my NetApp cluster without issue.

Recently, we noticed that our Grafana dashboards were showing "No Data".

Further inspection of Harvest logs showed many timeouts and "api request rejected => System Busy" messages.

We set the timeout to 60s (the default appears to be 10) for zapiperf and zapi in default.yml, then restarted Harvest. We did receive some data, but about an hour later the Harvest log file was again full of errors, for example: "api request rejected => System busy: 12 requests on table \"perf_object_get_instances\" have been pending for 1514620 seconds." This in turn leads to "No Data" in our Grafana dashboards.

The NetApp cluster where we see this issue has a lot of LUNs, aggregates, volumes, etc. We have another, much smaller cluster, and the same Harvest instance is pulling data from it just fine.

Has anyone come across this before, and do you have suggestions for a fix? I was going to look into reducing the polling interval, but I'm not sure whether this is a known issue with a proper fix.

Thanks for any input!

Here's an example log entry:

{"level":"error","Poller":"netapp-cluster","collector":"ZapiPerf:Disk","stack":[{"func":"New","line":"35","source":"errors.go"},{"func":"(*Client).invoke","line":"428","source":"client.go"},{"func":"(*Client).InvokeWithTimers","line":"334","source":"client.go"},{"func":"(*ZapiPerf).PollData","line":"257","source":"zapiperf.go"},{"func":"(*task).Run","line":"61","source":"schedule.go"},{"func":"(*AbstractCollector).Start","line":"293","source":"collector.go"},{"func":"goexit","line":"1581","source":"asm_amd64.s"}],"error":"api request rejected => System busy: 12 requests on table \"perf_object_get_instances\" have been pending for 1514620 seconds. The last completed call took 25 seconds.","task":"data","caller":"goharvest2/cmd/poller/collector/collector.go:340","time":"2022-04-05T16:17:38Z"}
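To gauge how widespread these errors are, a short script along the following lines could tally the "System busy" rejections per collector and table. This is only a minimal sketch: it assumes the Harvest log contains one JSON object per line with the `level`, `collector`, and `error` fields seen in the entry above. The function name `summarize_busy_errors` and the log file path passed on the command line are hypothetical, not part of Harvest.

```python
import json
import re
import sys
from collections import Counter

# Matches the "System busy" message format seen in the log entry above.
BUSY_RE = re.compile(
    r'System busy: (?P<pending>\d+) requests on table "(?P<table>[^"]+)" '
    r"have been pending for (?P<seconds>\d+) seconds"
)


def summarize_busy_errors(lines):
    """Count 'System busy' errors per (collector, table) in JSON log lines.

    Hypothetical helper: assumes each line is a JSON object with the
    'level', 'collector', and 'error' fields shown in the sample entry.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict) or entry.get("level") != "error":
            continue
        match = BUSY_RE.search(str(entry.get("error", "")))
        if match:
            key = (entry.get("collector", "unknown"), match.group("table"))
            counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the user; its location is not assumed here.
    with open(sys.argv[1], encoding="utf-8") as log_file:
        for (collector, table), n in summarize_busy_errors(log_file).most_common():
            print(f"{collector}\t{table}\t{n}")
```

Run against the poller log, this prints one line per collector and table, most frequent first, which helps show whether the rejections are confined to a few ZapiPerf objects or affect all of them.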

paul_stejskal · ‎2022-04-06

Make sure the cluster management LIF is hosted on a node with lower average CPU utilization, and see if that helps. If you need to, you could open a case against ONTAP for this, because it seems the problem lies with ONTAP rather than Harvest (but I am not a Harvest expert, so I could be wrong).