We have a multi-cluster environment, and the NetApp Harvest worker for our busiest cluster runs well for several days, then dies with the following two messages in the log:
[2016-07-08 06:01:35] [WARNING] [workload_detail] update of data cache failed with reason: Server returned HTTP Error:
[2016-07-08 06:01:35] [WARNING] [workload_detail] data-list update failed.
Restarting the poller fixes it, and it then runs for several more days before failing again. There are no error messages in the Carbon logs. I previously raised 'data_update_freq' to 180 on this cluster because of polling refresh warnings in the logs, and that did seem to eliminate those warnings. The environment is RHEL6.
The design intent of Harvest is that it never dies. It may fail to start if required information is missing from the conf file (such as the hostname of the cluster to monitor), but even if it can't connect because of a DNS resolution failure it will keep trying, hoping someone fixes DNS 🙂
So if it dies, then either there is a bug in Harvest (we recently found a divide-by-zero error that occurred when a port was online at 10Mbit) or a failure in some other module (the NetApp SDK or a module it uses) is causing it to die. In principle a busy cluster can still be monitored; it just might not be able to keep up with all counters every 60s. It may be, though, that because the cluster is very busy some API call responses are incomplete or truncated. If you can send me the entire poller logfile (a private message is fine), I can potentially figure out what is going on.
Until then, a workaround is to add a crontab entry that runs netapp-manager -start every hour (or every minute if you really want to minimize missed data when it dies). netapp-manager finds any pollers that are not running and starts them; if all are already running it does nothing.
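The cron workaround could be wrapped in a small script so that each run leaves a timestamped trace. This is only a minimal sketch in Python (3.5 or later), not part of Harvest: the install path, the log file location and the function name are assumptions to adjust for your site, and only the -start option comes from the description above.

```python
import datetime
import subprocess
import sys

MANAGER = "/opt/netapp-harvest/netapp-manager"  # assumed install path
LOGFILE = "/var/log/harvest_keepalive.log"      # hypothetical log file


def run_and_log(cmd, logfile):
    """Run cmd, append its combined output to logfile, return the exit code."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold stderr into the same stream
        universal_newlines=True,    # decode output as text
    )
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(logfile, "a") as fh:
        fh.write("[{}] exit={}\n{}\n".format(stamp, result.returncode, result.stdout))
    return result.returncode
```

A script scheduled from cron would then call run_and_log([MANAGER, "-start"], LOGFILE). An hourly crontab entry for it might look like `0 * * * * /usr/local/bin/harvest_keepalive.py`, where the script path is again hypothetical.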
Cheers, Chris Madden
Storage Architect, NetApp EMEA (and author of Harvest)
Thank you for the reply. Next time it happens I'll pull the logfile & send it to you.
I have already implemented something very similar to your suggestion as a workaround, with the addition of an e-mail when it has to restart the worker, so I'll know when it happens.
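For anyone wanting the same e-mail notification, one possible shape is sketched below in Python (3.6 or later). It is not the script actually used here. It assumes the captured netapp-manager output marks a freshly started poller with a token such as [STARTED]; that token, the function names and the local SMTP relay are all assumptions, so check them against what your installation really prints.

```python
import smtplib
from email.message import EmailMessage

STARTED_MARKER = "[STARTED]"  # assumed token; verify against real output


def restarted_lines(manager_output, marker=STARTED_MARKER):
    """Return the output lines that contain the restart marker."""
    return [ln.strip() for ln in manager_output.splitlines() if marker in ln]


def build_alert(lines, sender, recipient):
    """Build an e-mail listing restarted pollers, or None if none restarted."""
    if not lines:
        return None
    msg = EmailMessage()
    msg["Subject"] = "Harvest poller restarted ({})".format(len(lines))
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines))
    return msg


def send_alert(msg, host="localhost"):
    """Send msg through an SMTP relay (assumed reachable on host, port 25)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

The idea is to pass the text captured from netapp-manager -start to restarted_lines, and to call send_alert only when build_alert returns a message, so that a run in which every poller was already up stays silent.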