Hello,
I've had Harvest version 21.11.1-1 installed for several months now, pulling metrics from my NetApp cluster without issue.
Recently, we noticed that our Grafana dashboards were showing "No Data".
Further inspection of the Harvest logs showed many timeouts and "api request rejected => System busy" messages.
We raised the timeout to 60s (the default appears to be 10) for zapiperf and zapi in default.yml, then restarted Harvest. We did receive some data, but about an hour later we were back to many errors in the Harvest log file, for example: "api request rejected => System busy: 12 requests on table \"perf_object_get_instances\" have been pending for 1514620 seconds." This in turn leads to "No Data" in our Grafana dashboards.
The NetApp cluster where we see this issue has a lot of LUNs, aggregates, volumes, etc. We have another, much smaller cluster, and the same Harvest instance is pulling data from it just fine.
Has anyone come across this before, and do you have suggestions for a fix? I was going to look into polling less frequently (a longer polling interval), but I'm not sure whether this is a known issue with a proper fix.
Thanks for any input!
Here's an example log entry:
{"level":"error","Poller":"netapp-cluster","collector":"ZapiPerf:Disk","stack":[{"func":"New","line":"35","source":"errors.go"},{"func":"(*Client).invoke","line":"428","source":"client.go"},{"func":"(*Client).InvokeWithTimers","line":"334","source":"client.go"},{"func":"(*ZapiPerf).PollData","line":"257","source":"zapiperf.go"},{"func":"(*task).Run","line":"61","source":"schedule.go"},{"func":"(*AbstractCollector).Start","line":"293","source":"collector.go"},{"func":"goexit","line":"1581","source":"asm_amd64.s"}],"error":"api request rejected => System busy: 12 requests on table \"perf_object_get_instances\" have been pending for 1514620 seconds. The last completed call took 25 seconds.","task":"data","caller":"goharvest2/cmd/poller/collector/collector.go:340","time":"2022-04-05T16:17:38Z"}
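
In case it helps with triage, here is a minimal Python sketch for tallying these "System busy" errors per collector. It assumes the poller log is one JSON object per line with the "level", "collector" and "error" fields seen in the entry above. The function name summarize_busy_errors and the regular expression are my own and are not part of Harvest.

```python
import json
import re
from collections import Counter

# Matches the "System busy" text seen in the log entry above.
# The trailing "last completed call" sentence is treated as optional.
BUSY_RE = re.compile(
    r'System busy: (\d+) requests on table "([^"]+)" '
    r'have been pending for (\d+) seconds\.'
    r'(?: The last completed call took (\d+) seconds\.)?'
)


def summarize_busy_errors(lines):
    """Count "System busy" errors per collector.

    `lines` is any iterable of log lines (for example an open file).
    Returns (counts, details): a Counter keyed by collector name and a
    list of dicts with the numbers parsed from each error message.
    """
    counts = Counter()
    details = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict) or entry.get("level") != "error":
            continue
        match = BUSY_RE.search(str(entry.get("error", "")))
        if match is None:
            continue
        collector = entry.get("collector", "unknown")
        counts[collector] += 1
        last_call = match.group(4)
        details.append({
            "collector": collector,
            "time": entry.get("time"),
            "pending_requests": int(match.group(1)),
            "table": match.group(2),
            "pending_seconds": int(match.group(3)),
            "last_call_seconds": int(last_call) if last_call else None,
        })
    return counts, details
```

To use it, pass an open log file to summarize_busy_errors (the path depends on how Harvest was installed) and print counts.most_common() to see which collectors are rejected most often. That should show whether the problem is limited to a few objects such as ZapiPerf:Disk or affects every collector.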