Active IQ Unified Manager Discussions
I've recently started using netapp-harvest to monitor 17 HA pairs running in 7-Mode. My filers are running 8.1.4P7 and 8.2.3P4.
It is working great on all but two HA pairs. One of the failing pairs is identical to a working pair that also runs 8.2.3P4.
The failed pollers report the following in the log file:
[2015-12-01 09:03:45] [NORMAL ] WORKER STARTED [Version: 1.2.2] [Conf: netapp-harvest.conf] [Poller: fit01b09np037]
[2015-12-01 09:03:45] [NORMAL ] [main] Poller will monitor a [FILER] at [fit01b09np037:443]
[2015-12-01 09:03:45] [NORMAL ] [main] Poller will use [password] authentication with username [netapp-harvest] and password [**********]
[2015-12-01 09:03:52] [NORMAL ] [main] Collection of system info from [fit01b09np037] running [NetApp Release 8.2.3P4 7-Mode] successful.
[2015-12-01 09:03:52] [NORMAL ] [main] Using best-fit collection template: [7dot-8.2.0.conf]
[2015-12-01 09:04:57] [WARNING] [all] Update of poller preferred-counter cache failed with reason: Timeout. Could not read API response.
[2015-12-01 09:04:57] [WARNING] [main] object-list update failed; will try again in 10 seconds.
[2015-12-01 09:06:09] [WARNING] [all] Update of poller preferred-counter cache failed with reason: Timeout. Could not read API response.
[2015-12-01 09:06:09] [WARNING] [main] object-list update failed; will try again in 10 seconds.
. . .
[2015-12-01 09:15:49] [WARNING] [all] Update of poller preferred-counter cache failed with reason: Timeout. Could not read API response.
[2015-12-01 09:15:49] [WARNING] [main] object-list update failed; will try again in 10 seconds.
[2015-12-01 09:16:39] [NORMAL ] [main] Using graphite_root [netapp.perf7.DC9X.fit01b09np037]
[2015-12-01 09:16:39] [NORMAL ] [main] Using graphite_meta_metrics_root [netapp.poller.perf7.DC9X.fit01b09np037]
[2015-12-01 09:16:39] [NORMAL ] [main] Startup complete. Polling for new data every [60] seconds.
[2015-12-01 09:17:42] [WARNING] [cifs] update of counter_list cache failed with reason: Timeout. Could not read API response.
[2015-12-01 09:17:42] [WARNING] [cifs] counter-list update failed.
[2015-12-01 09:18:46] [WARNING] [cifs_stats] update of counter_list cache failed with reason: Timeout. Could not read API response.
[2015-12-01 09:18:46] [WARNING] [cifs_stats] counter-list update failed.
[2015-12-01 09:18:59] [WARNING] [cifsdomain] data-list poller next refresh at [2015-12-01 09:18:00] not scheduled because it occurred in the past
[2015-12-01 09:20:02] [WARNING] [disk] update of counter_list cache failed with reason: Timeout. Could not read API response.
[2015-12-01 09:20:02] [WARNING] [disk] counter-list update failed.
[2015-12-01 09:21:05] [WARNING] [ext_cache_obj] update of counter_list cache failed with reason: Timeout. Could not read API response.
[2015-12-01 09:21:05] [WARNING] [ext_cache_obj] counter-list update failed.
[2015-12-01 09:21:56] [WARNING] [fcp] data-list poller next refresh at [2015-12-01 09:18:00] not scheduled because it occurred in the past
[2015-12-01 09:21:56] [WARNING] [fcp] data-list poller next refresh at [2015-12-01 09:19:00] not scheduled because it occurred in the past
. . .
The difference between the working and failing pairs is the network path between the filers and the Harvest server. I have added the Harvest server to the local hosts files on the filers to rule out reverse DNS resolution. I'm beginning to suspect a firewall issue related to specific TCP/IP ports. Any other clues?
Thanks, Sam
Hi,
I think you are on the right track to suspect something in the network. When Harvest starts, it connects to learn the DOT release and mode of the system. It identifies the best-fit collection template and then collects some metadata about all the object types it needs to collect. That work is done when you see the "Startup complete" message. So usually the following two lines would be separated by a few seconds at most:
[2015-12-01 09:03:52] [NORMAL ] [main] Collection of system info from [fit01b09np037] running [NetApp Release 8.2.3P4 7-Mode] successful.
and
[2015-12-01 09:16:39] [NORMAL ] [main] Startup complete. Polling for new data every [60] seconds.
In your case, instead of seconds we see more than 10 minutes. So I think the TCP connections that get created during each API call are somehow getting blocked or interrupted. Maybe a packet trace could help. If you find a solution, please share!
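To put a number on that gap, here is a minimal Python sketch that measures the time between the two log lines quoted above. It is not part of Harvest; the function names are made up for illustration, and it assumes every log entry starts with a `[YYYY-MM-DD HH:MM:SS]` timestamp as in the excerpt.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def entry_time(line):
    """Parse the leading [YYYY-MM-DD HH:MM:SS] field of a log entry."""
    return datetime.strptime(line[1:20], LOG_TIME_FORMAT)


def startup_seconds(lines):
    """Seconds between the system-info line and the 'Startup complete' line.

    Returns None if either line is missing.
    """
    started = finished = None
    for line in lines:
        if "Collection of system info" in line:
            started = entry_time(line)
        elif "Startup complete" in line and started is not None:
            finished = entry_time(line)
            break
    if started is None or finished is None:
        return None
    return (finished - started).total_seconds()


# The two entries from the failing poller's log.
sample = [
    "[2015-12-01 09:03:52] [NORMAL ] [main] Collection of system info from "
    "[fit01b09np037] running [NetApp Release 8.2.3P4 7-Mode] successful.",
    "[2015-12-01 09:16:39] [NORMAL ] [main] Startup complete. "
    "Polling for new data every [60] seconds.",
]
print(startup_seconds(sample))  # 767.0 seconds, i.e. 12 min 47 s
```

Running the same check against the log of a working poller gives a baseline to compare against.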
Cheers,
Chris Madden
Storage Architect, NetApp EMEA (and author of Harvest)
Blog: It all begins with data
Is this a network problem, or is there simply so much data to collect at times that Harvest cannot collect it all in a reasonable time frame?
If this is seen, should the various _update_freq variables be increased?
I've got this working in 3 environments now (1x7-Mode, 2xCluster) and only see it in one of the cluster environments. However that environment is a large configuration - in less than 2 hours, the storage used was at 41G!
Which counters are affected does not seem to be random, e.g.:
# sed -n -e '/not scheduled because/s/.*ING\] \(.*\)\[2016.*\(not scheduled.*\)/\1\2/p' cluster_netapp-harvest.log | sort | uniq -c
32 [cifs:node] data-list poller next refresh at not scheduled because it occurred in the past
32 [cifs:vserver] data-list poller next refresh at not scheduled because it occurred in the past
32 [disk] data-list poller next refresh at not scheduled because it occurred in the past
31 [ext_cache_obj] data-list poller next refresh at not scheduled because it occurred in the past
31 [fcp_lif] data-list poller next refresh at not scheduled because it occurred in the past
33 [iscsi_lif] data-list poller next refresh at not scheduled because it occurred in the past
33 [lif] data-list poller next refresh at not scheduled because it occurred in the past
33 [lun] data-list poller next refresh at not scheduled because it occurred in the past
33 [nfsv3] data-list poller next refresh at not scheduled because it occurred in the past
33 [nfsv3:node] data-list poller next refresh at not scheduled because it occurred in the past
33 [nic_common] data-list poller next refresh at not scheduled because it occurred in the past
33 [processor] data-list poller next refresh at not scheduled because it occurred in the past
32 [system] data-list poller next refresh at not scheduled because it occurred in the past
32 [token_manager] data-list poller next refresh at not scheduled because it occurred in the past
32 [volume] data-list poller next refresh at not scheduled because it occurred in the past
33 [volume:node] data-list poller next refresh at not scheduled because it occurred in the past
32 [wafl] data-list poller next refresh at not scheduled because it occurred in the past
32 [wafl_hya_per_aggr] data-list poller next refresh at not scheduled because it occurred in the past
32 [wafl_hya_sizer] data-list poller next refresh at not scheduled because it occurred in the past
32 [workload] data-list poller next refresh at not scheduled because it occurred in the past
33 [workload_detail] data-list poller next refresh at not scheduled because it occurred in the past
The poller scheduler sorts all object types to be collected alphabetically and then steps serially through each of them, issuing one (or more, depending on instance count) API calls to get the counter values. When the poll for a given object finishes, it schedules the next run. If the next run minute has already passed, it skips that minute, logs the warning to the logfile, and tries the following minute, repeating (and logging each skip) until the scheduled minute is in the future, at which point the object is scheduled. So the idea is that all objects get polled 'as best possible' and no object gets 'starved' if the configured polling interval is too aggressive for the counters and cluster being monitored. Actions to take if you see constant warnings like this are discussed in the Harvest Admin guide, so rather than copy/pasting that text here, please search for the 'occurred in the past' warning message in the Admin guide.
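The skip behaviour described above can be sketched roughly as follows. This is a toy Python model, not Harvest's actual code: the function names and the per-object API times are hypothetical, and times are plain seconds rather than wall-clock timestamps.

```python
def next_poll_time(scheduled, now, interval=60):
    """Find the next slot after `scheduled` that is still in the future.

    Every slot at or before `now` is skipped; each skipped slot stands for
    one 'not scheduled because it occurred in the past' warning.
    Returns (next_slot, skipped_slots).
    """
    skipped = []
    candidate = scheduled + interval
    while candidate <= now:
        skipped.append(candidate)
        candidate += interval
    return candidate, skipped


def simulate_cycle(api_seconds, start=0, interval=60):
    """Poll objects one after another in alphabetical order.

    `api_seconds` maps object name -> seconds its API calls take
    (hypothetical numbers). Returns {object: (next_slot, skip_count)}.
    """
    clock = start
    plan = {}
    for name in sorted(api_seconds):
        clock += api_seconds[name]  # serial: each poll waits for the previous one
        next_slot, skipped = next_poll_time(start, clock, interval)
        plan[name] = (next_slot, len(skipped))
    return plan


# A poll that was due at t=0 but only finished at t=236 misses three slots.
print(next_poll_time(0, 236))  # (240, [60, 120, 180])

# Made-up API times that add up to 75 s against a 60 s interval.
print(simulate_cycle({"volume": 30, "disk": 25, "lun": 20}))
# {'disk': (60, 0), 'lun': (60, 0), 'volume': (120, 1)}
```

In this model the objects late in the alphabet absorb the delay of everything polled before them, and each one is simply pushed to the next future slot instead of being dropped.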
Cheers,
Chris Madden
Ah yes...
"API monitored time per system" ... avg 43, max 64, next 56. Average of over 9000 metrics per minute.
So editing storage-schemas.conf would seem to be the key, once I've worked out what's needed every minute, etc.?
Along with data_update_freq?
I will need to work out which ones are critical for minute-level resolution and which are not.
Hi @DARREN_REED,
You can set the update frequency per performance object in the collection template, but I would probably set it on a per-poller basis in netapp-harvest.conf. The reason is that if you start tweaking the update frequency of individual objects, you will likely still run out of time and get some skipped polls on minute boundaries where everything needs to be collected. So I would prefer to disable collection of non-essential objects with a high API time, or to raise the overall polling interval. You could also split into two pollers, with two different collection templates and update frequencies, to maximize data collected and minimize skips. Personally I would just increase the polling interval to 2, 3, or 5 minutes at the poller level and see how that goes.
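As a rough way to choose between those intervals, here is a small Python sketch that picks the smallest candidate interval that fits the measured API time with some slack. The 25% headroom factor and the function name are assumptions for illustration only; the 64-second figure is the "max" API time reported earlier in this thread.

```python
def pick_interval(max_api_seconds, candidates=(60, 120, 180, 300), headroom=1.25):
    """Return the smallest candidate interval (seconds) that covers the
    measured API time plus headroom, or None if none is large enough.

    `headroom` is an assumed safety margin, not a Harvest setting.
    """
    needed = max_api_seconds * headroom
    for interval in sorted(candidates):
        if interval >= needed:
            return interval
    return None


print(pick_interval(64))  # 120
```

The result would then go into the poller's data_update_freq setting; treat it as a starting point and keep watching the log for skipped polls.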
Also remember that Graphite db files are created according to the settings in storage-schemas.conf the first time a metric comes in. So if you are going to manage different frequencies, you should also modify storage-schemas.conf and resize any existing Graphite db files to match. If you don't update the Graphite db settings, you will just use more disk space than necessary; otherwise Grafana, with the default 'connected' line mode, will simply connect your valid data points and skip the null values that are not being sent.
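To get a feel for the disk space involved, here is a minimal Python sketch that estimates the size of one Graphite (Whisper) db file from a retention string. It relies on the Whisper layout of a 16-byte file header, 12 bytes of header per archive, and 12 bytes per data point. The retention strings below are illustrative only, and the parser handles just the unit-suffixed form such as `60s:35d`.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(token):
    """Convert a token like '60s' or '35d' to seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]


def whisper_file_bytes(retentions):
    """Estimate the on-disk size of one Whisper file for a retention string.

    `retentions` looks like '60s:35d' or '60s:35d,5m:100d' (step:duration).
    """
    archives = [archive.split(":") for archive in retentions.split(",")]
    points = sum(to_seconds(duration) // to_seconds(step) for step, duration in archives)
    return 16 + 12 * len(archives) + 12 * points


print(whisper_file_bytes("60s:35d"))   # 604828 bytes per metric
print(whisper_file_bytes("300s:35d"))  # 120988 bytes per metric
```

So a file sized for 1-minute data but fed only every 5 minutes takes about five times the space it needs, which is the overhead mentioned above.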
Hope this helps.
Cheers,
Chris Madden