
NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

Hi there. Starting 12/31/2018, NetApp-Harvest has not been receiving SVM capacity metrics from UM 9.4. Only the SVM capacity metrics from the UM server are affected; node/aggr capacity metrics ARE being received properly. Things we've tried, without success:

 

1. Restarted the UM NetApp-Harvest pollers on both Grafana/NetApp-Harvest servers
2. Rebooted the UM server
3. Ran SVM and node curl commands from the Grafana/NetApp-Harvest servers; they completed successfully (the servers can reach the UM APIs)

 

Configuration:

ONTAP 9.3P4

NetApp-Harvest 1.4

UM 9.4

 

 

[2019-01-03 00:32:02] [WARNING] [qtree] Cluster name for aggr [svm-nas-oma-c01] (3751f7fb-30fe-11e7-963c-90e2bac3282c:type=vserver,uuid=290f9c35-3211-11e7-963c-90e2bac282c) not found in cache; skipping

 

[2019-01-03 09:16:01] [WARNING] [volume] update failed with reason: Timeout. Could not read API response.

[2019-01-03 09:16:01] [WARNING] [volume] data-list update failed.

 

We need to get this resolved as soon as possible because it is affecting team operations. A NetApp Support case was opened last week, but they referred us here because Harvest is community-supported.


Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

All metrics (both Node/Cluster and SVM) are available within UM. The NetApp-Harvest poller logs all show "...not found in cache; skipping".

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

To get more debug information, try running Harvest manually for that specific poller, without -daemon and with -v:

 

Usage: /opt/netapp-harvest/netapp-worker -poller <str> [-conf <str>] [-confdir <str>] [-logdir <str>] [-daemon] [-v] [-h]

PURPOSE:
    Collect performance data from Data ONTAP or OCUM and submit to Graphite.
VERSION:
    1.4
ARGUMENTS:
    Required:
        -poller <str>   Poller section to run
    Optional:
        -conf <str>     Name of config file to use to find poller name
                        (default: netapp-harvest.conf)
        -confdir <str>  Name of directory where config file is located
                        (default: /opt/netapp-harvest)
        -logdir <str>   Name of directory where log files should be written
                        (default: /opt/netapp-harvest/log)
        -h              Output this help text
        -v              Output verbose output to stdout and logfile
        -daemon         Daemonize process (linux only)
EXAMPLE:
    Run poller netapp-1 interactively in verbose mode
        netapp-worker -poller netapp-1 -v
    Run poller netapp-2 as a daemon
        netapp-worker -poller netapp-2 -daemon
    Run poller netapp-5 from conf file test.conf as a daemon
        netapp-worker -poller netapp-5 -conf test.conf -daemon

 

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

Restarted poller with -v option. Nothing jumps out at me. Metrics collection dropped even lower than before...

 

Normal: ~ 100K metrics

Jan 3 Poller Restart: 12.5K metrics

Jan 9 Poller Restart: 1.5K metrics

 

$ sudo ./netapp-manager --restart --poller nc2pwnaocum01 -v
[OK ] Line [18] is Section [global]
[OK ] Line [26] in Section [global] has Key/Value pair [grafana_api_key]=[**********]
[OK ] Line [28] in Section [global] has Key/Value pair [grafana_url]=[http://nc2plgrafana01:3000]
[OK ] Line [29] in Section [global] has Key/Value pair [grafana_dl_tag]=[]
[OK ] Line [35] is Section [default]
[OK ] Line [37] in Section [default] has Key/Value pair [graphite_enabled]=[1]
[OK ] Line [38] in Section [default] has Key/Value pair [graphite_server]=[10.65.44.73]
[OK ] Line [39] in Section [default] has Key/Value pair [graphite_port]=[2003]
[OK ] Line [40] in Section [default] has Key/Value pair [graphite_proto]=[tcp]
[OK ] Line [41] in Section [default] has Key/Value pair [normalized_xfer]=[mb_per_sec]
[OK ] Line [42] in Section [default] has Key/Value pair [normalized_time]=[millisec]
[OK ] Line [43] in Section [default] has Key/Value pair [graphite_root]=[default]
[OK ] Line [44] in Section [default] has Key/Value pair [graphite_meta_metrics_root]=[default]
[OK ] Line [47] in Section [default] has Key/Value pair [host_type]=[FILER]
[OK ] Line [48] in Section [default] has Key/Value pair [host_port]=[443]
[OK ] Line [49] in Section [default] has Key/Value pair [host_enabled]=[1]
[OK ] Line [50] in Section [default] has Key/Value pair [template]=[default]
[OK ] Line [51] in Section [default] has Key/Value pair [data_update_freq]=[60]
[OK ] Line [52] in Section [default] has Key/Value pair [ntap_autosupport]=[0]
[OK ] Line [53] in Section [default] has Key/Value pair [latency_io_reqd]=[10]
[OK ] Line [54] in Section [default] has Key/Value pair [auth_type]=[password]
[OK ] Line [55] in Section [default] has Key/Value pair [username]=[netapp-harvest]
[OK ] Line [56] in Section [default] has Key/Value pair [password]=[**********]
[OK ] Line [71] is Section [NC2DACSTORE02]
[OK ] Line [72] in Section [NC2DACSTORE02] has Key/Value pair [hostname]=[nc2dacstore02-mgmt.us.ad.lfg.com]
[OK ] Line [73] in Section [NC2DACSTORE02] has Key/Value pair [group]=[gso_dev]
[OK ] Line [78] is Section [nc1pacstore01]
[OK ] Line [79] in Section [nc1pacstore01] has Key/Value pair [password]=[**********]
[OK ] Line [80] in Section [nc1pacstore01] has Key/Value pair [hostname]=[nc1pacstore01-mgmt.us.ad.lfg.com]
[OK ] Line [81] in Section [nc1pacstore01] has Key/Value pair [group]=[gso]
[OK ] Line [83] is Section [NC2PACSTORE01]
[OK ] Line [84] in Section [NC2PACSTORE01] has Key/Value pair [hostname]=[nc2pacstore01-mgmt.us.ad.lfg.com]
[OK ] Line [85] in Section [NC2PACSTORE01] has Key/Value pair [password]=[**********]
[OK ] Line [86] in Section [NC2PACSTORE01] has Key/Value pair [data_update_freq]=[150]
[OK ] Line [87] in Section [NC2PACSTORE01] has Key/Value pair [group]=[gso]
[OK ] Line [89] is Section [NC2PACSTORE02]
[OK ] Line [90] in Section [NC2PACSTORE02] has Key/Value pair [hostname]=[nc2pacstore02-mgmt.us.ad.lfg.com]
[OK ] Line [91] in Section [NC2PACSTORE02] has Key/Value pair [password]=[**********]
[OK ] Line [92] in Section [NC2PACSTORE02] has Key/Value pair [data_update_freq]=[150]
[OK ] Line [93] in Section [NC2PACSTORE02] has Key/Value pair [group]=[gso]
[OK ] Line [95] is Section [NC2PACSTORE03]
[OK ] Line [96] in Section [NC2PACSTORE03] has Key/Value pair [hostname]=[nc2pacstore03-mgmt.us.ad.lfg.com]
[OK ] Line [97] in Section [NC2PACSTORE03] has Key/Value pair [password]=[**********]
[OK ] Line [98] in Section [NC2PACSTORE03] has Key/Value pair [group]=[gso]
[OK ] Line [100] is Section [NC2PACSTORE04]
[OK ] Line [101] in Section [NC2PACSTORE04] has Key/Value pair [hostname]=[nc2pacstore04-mgmt.us.ad.lfg.com]
[OK ] Line [102] in Section [NC2PACSTORE04] has Key/Value pair [password]=[**********]
[OK ] Line [103] in Section [NC2PACSTORE04] has Key/Value pair [group]=[gso]
[OK ] Line [105] is Section [nc2pacstore05]
[OK ] Line [106] in Section [nc2pacstore05] has Key/Value pair [hostname]=[nc2pacstore05-mgmt.us.ad.lfg.com]
[OK ] Line [107] in Section [nc2pacstore05] has Key/Value pair [password]=[**********]
[OK ] Line [108] in Section [nc2pacstore05] has Key/Value pair [group]=[gso]
[OK ] Line [124] is Section [ga1pacstore01]
[OK ] Line [125] in Section [ga1pacstore01] has Key/Value pair [hostname]=[ga1pacstore01-mgmt.us.ad.lfg.com]
[OK ] Line [126] in Section [ga1pacstore01] has Key/Value pair [password]=[**********]
[OK ] Line [127] in Section [ga1pacstore01] has Key/Value pair [group]=[atl]
[OK ] Line [133] is Section [il3pzcstore001]
[OK ] Line [134] in Section [il3pzcstore001] has Key/Value pair [hostname]=[il3pzcstore001-mgmt.us.ad.lfg.com]
[OK ] Line [135] in Section [il3pzcstore001] has Key/Value pair [group]=[il3]
[OK ] Line [150] is Section [va1pzcstore001]
[OK ] Line [151] in Section [va1pzcstore001] has Key/Value pair [hostname]=[va1pzcstore001-mgmt.us.ad.lfg.com]
[OK ] Line [152] in Section [va1pzcstore001] has Key/Value pair [group]=[va1]
[OK ] Line [178] is Section [nh1pacstore01]
[OK ] Line [179] in Section [nh1pacstore01] has Key/Value pair [hostname]=[nh1pacstore01-mgmt.us.ad.lfg.com]
[OK ] Line [180] in Section [nh1pacstore01] has Key/Value pair [password]=[**********]
[OK ] Line [181] in Section [nh1pacstore01] has Key/Value pair [group]=[cnc]
[OK ] Line [187] is Section [pa1pacstore01]
[OK ] Line [188] in Section [pa1pacstore01] has Key/Value pair [hostname]=[pa1pacstore01-mgmt.us.ad.lfg.com]
[OK ] Line [189] in Section [pa1pacstore01] has Key/Value pair [password]=[**********]
[OK ] Line [190] in Section [pa1pacstore01] has Key/Value pair [group]=[rad]
[OK ] Line [196] is Section [in2pacstore01]
[OK ] Line [197] in Section [in2pacstore01] has Key/Value pair [hostname]=[in2pacstore01-mgmt.us.ad.lfg.com]
[OK ] Line [198] in Section [in2pacstore01] has Key/Value pair [password]=[**********]
[OK ] Line [199] in Section [in2pacstore01] has Key/Value pair [group]=[fwa]
[OK ] Line [380] is Section [ct1pacstore01]
[OK ] Line [381] in Section [ct1pacstore01] has Key/Value pair [hostname]=[ct1pacstore01-mgmt.us.ad.lfg.com]
[OK ] Line [382] in Section [ct1pacstore01] has Key/Value pair [password]=[**********]
[OK ] Line [383] in Section [ct1pacstore01] has Key/Value pair [group]=[hfd]
[OK ] Line [388] is Section [nc2pwnaocum01]
[OK ] Line [389] in Section [nc2pwnaocum01] has Key/Value pair [password]=[**********]
[OK ] Line [390] in Section [nc2pwnaocum01] has Key/Value pair [hostname]=[nc2pwnaocum01.us.ad.lfg.com]
[OK ] Line [391] in Section [nc2pwnaocum01] has Key/Value pair [group]=[gso]
[OK ] Line [392] in Section [nc2pwnaocum01] has Key/Value pair [host_type]=[OCUM]
[OK ] Line [393] in Section [nc2pwnaocum01] has Key/Value pair [data_update_freq]=[900]
[OK ] Line [394] in Section [nc2pwnaocum01] has Key/Value pair [normalized_xfer]=[gb_per_sec]
STATUS POLLER GROUP
############### #################### ##################
[STOPPED] nc2pwnaocum01 gso
[STARTED] nc2pwnaocum01 gso

$

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

Restarted the UM collector but it's still throwing the same error messages in the logs. Output attached to the NetApp Support case.

 

sudo ./netapp-manager --restart --poller <UM_Server> -v

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

It's important that you run netapp-worker itself with -v and not as a daemon, then redirect the output to a file. There are other ways, but I think this is the easiest.

netapp-manager is the tool for starting and stopping the daemons; it does not produce meaningful debug output from Harvest.
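
A minimal sketch of that foreground run, using only the -poller and -v flags from the usage text earlier in the thread. The function name and the capture file location are hypothetical; adjust them to your install.

```shell
# Run one Harvest poller in the foreground with verbose logging and keep a
# copy of everything it prints.
#   $1  path to the netapp-worker script
#   $2  poller section name from netapp-harvest.conf
#   $3  file to capture the output in (hypothetical location, pick your own)
run_worker_debug() {
    local worker=$1 poller=$2 capture=$3
    # No -daemon here, so the worker stays attached to the terminal.
    # 2>&1 folds stderr into stdout so that tee records both streams.
    "$worker" -poller "$poller" -v 2>&1 | tee "$capture"
}
```

For the poller in this thread that would be something like: run_worker_debug /opt/netapp-harvest/netapp-worker nc2pwnaocum01 /tmp/harvest-debug.log, stopped with Ctrl-C after at least one polling interval has passed. It is probably wise to stop the daemonized copy of the same poller first so that two workers are not polling the same host.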

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

Restarted netapp-worker. Same error: "update failed with reason: Timeout. Could not read API response". Full UM Poller Log attached to the NetApp Support Case.

 

[2019-01-10 08:11:02] [NORMAL ] Creating output plugins
[2019-01-10 08:11:02] [NORMAL ] Created output plugins
[2019-01-10 08:11:02] [DEBUG ] [lun] data-list poller first poll at [2019-01-10 08:15:00]
[2019-01-10 08:11:02] [DEBUG ] [qtree] data-list poller first poll at [2019-01-10 08:15:00]
[2019-01-10 08:11:02] [DEBUG ] [volume] data-list poller first poll at [2019-01-10 08:15:00]
[2019-01-10 08:11:02] [DEBUG ] [aggregate] data-list poller first poll at [2019-01-10 08:15:00]
[2019-01-10 08:11:02] [NORMAL ] [main] Startup complete. Polling for new data every [900] seconds.
[2019-01-10 08:11:02] [DEBUG ] Sleeping [238] seconds
[2019-01-10 08:15:00] [DEBUG ] [aggregate] Found instance [netapp.capacity.gso.NC2PACSTORE04.node.NC2PACSTORE04-02.aggr.nc2pacstore04_n02_a1_sp3_sas10k900_esx] metric [size-used-per-day] with value [18448096445]

...

[2019-01-10 08:15:01] [DEBUG ] M= netapp.poller.capacity.gso.nc2pwnaocum01.aggregate.pluginTime 0 1547126100
[2019-01-10 08:15:01] [DEBUG ] [aggregate] Issuing new socket connect to Graphite server.
[2019-01-10 08:16:01] [WARNING] [volume] update failed with reason: Timeout. Could not read API response.
[2019-01-10 08:16:01] [DEBUG ] [volume] data-list poller next refresh at [2019-01-10 08:30:00]
[2019-01-10 08:16:01] [WARNING] [volume] data-list update failed.
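
To see at a glance which collectors are failing, the WARNING lines can be tallied per tag. This is a rough sketch that assumes the log format shown above (timestamp, level, then a bracketed collector tag); the function name is hypothetical and the log path is whatever file your poller writes under the log directory.

```shell
# Count WARNING lines per collector tag ([volume], [qtree], ...) in a poller log.
#   $1  path to the poller log file
count_warnings() {
    # In a WARNING line the whitespace-separated fields are:
    #   $1 "[date"  $2 "time]"  $3 "[WARNING]"  $4 "[collector]"
    grep -F '[WARNING]' "$1" |
        awk '{ counts[$4]++ } END { for (t in counts) print counts[t], t }' |
        sort -rn
}
```

A collector that only ever reports "Timeout. Could not read API response." points at slow API replies rather than at missing objects.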

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

Thank you.

This indicates an error talking to OCUM. Are you positive that Harvest connects to the OCUM server successfully?

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

Yes, curl returns output when run against the OCUM server.

 

[prcpa8@nc2plgrafana02 log]$ curl -k -X GET --header 'Accept: application/vnd.netapp.object.inventory.hal+json' 'https://nc2pwnaocum01/rest/v1/svms?limit=20'
<html><head><title>OnCommand Unified Manager | Error</title></head><body><h1>Error 401 - Unauthorized</h1><p>Please go back to the <a href='/'>homepage</a> and try agan.</p></body></html>
[prcpa8@nc2plgrafana02 log]$ curl -k -X GET --header 'Accept: application/vnd.netapp.object.inventory.hal+json' 'https://nc2pwnaocum01/rest/svms?li
<html><head><title>OnCommand Unified Manager | Error</title></head><body><h1>Error 401 - Unauthorized</h1><p>Please go back to the <a href='/'>homepage</a> and try agan.</p></body></html>
[prcpa8@nc2plgrafana02 log]$

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

Harvest is still receiving some OCUM metrics, so they're communicating. However, we've seen a drop in metrics collection:

 

Dec 31: 100K

Jan 3: 12.5K

Jan 9: 1.5K 

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

For anyone following this thread, we found that this is an OCUM SSL certificate validation issue and not related to Harvest / Grafana. The error message that we get when trying to send requests to the OCUM system is:

 

FAILED (13001) with reason [[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)].

We checked OCUM's HTTPS certificate and it's still valid until 2021. I asked @prcpa8w3p to run curl -k -v https://hostname/ and send us the results.
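
One way to double-check the certificate from the Harvest host is with openssl. This is only a sketch: it assumes openssl is installed and that the UM server answers on port 443, and the function names are hypothetical.

```shell
# Print subject, issuer and validity dates of a PEM certificate read from stdin.
pem_cert_dates() {
    openssl x509 -noout -subject -issuer -dates
}

# Fetch the certificate a server presents and summarise it.
#   $1  host name
#   $2  port (defaults to 443)
server_cert_dates() {
    local host=$1 port=${2:-443}
    # </dev/null makes s_client exit after the handshake instead of waiting
    # for input; x509 picks the first PEM block out of the s_client output.
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
        pem_cert_dates
}
```

Usage would be server_cert_dates nc2pwnaocum01. Note that valid dates alone do not rule out CERTIFICATE_VERIFY_FAILED: an issuer the client does not trust (for example a self-signed certificate) or a host name that does not match the certificate can cause the same error.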

 

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

This might be a similar issue to the one described here.

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

Your requests returned an error because a login is required to get volume counters. Please run these two commands instead:

 

export HISTCONTROL=ignoreboth

then:

 curl -v -u 'USERNAME':'PASSWORD' -k -X GET --header 'Accept: application/vnd.netapp.object.inventory.hal+json' 'https://nc2pwnaocum01/rest/v1/svms?limit=2'

Note the leading space before "curl". Don't omit it: with HISTCONTROL=ignoreboth, bash does not save commands that start with a space, so without the space your username and password will be stored in the bash history. Please send us the console output; it will give us a better understanding of why OCUM is rejecting the connection.
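
As an alternative to typing the password on the command line, curl prompts for it when -u is given only a user name. The sketch below wraps the same SVM request and translates the HTTP status into a short message; both function names are hypothetical, and the status texts are my own wording, not UM output.

```shell
# Interpret the HTTP status that curl reports for a UM REST request.
explain_status() {
    case "$1" in
        200) echo "OK: authenticated and answered" ;;
        401) echo "Unauthorized: credentials missing or rejected" ;;
        000) echo "No HTTP response: connection failed or timed out" ;;
        *)   echo "Unexpected status $1" ;;
    esac
}

# Probe the SVM endpoint used in this thread.
#   $1  UM host name
#   $2  user name (curl asks for the password interactively)
probe_um() {
    local host=$1 user=$2 code
    # -o /dev/null discards the body; -w prints only the status code.
    code=$(curl -k -s -o /dev/null -u "$user" \
        -w '%{http_code}' \
        --header 'Accept: application/vnd.netapp.object.inventory.hal+json' \
        "https://$host/rest/v1/svms?limit=2")
    explain_status "$code"
}
```

For example, probe_um nc2pwnaocum01 netapp-harvest. The unauthenticated requests earlier in the thread would report the 401 case.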

Re: NetApp-Harvest 1.4 poller not working - Unified Manager 9.4

In the netapp-worker file, in the section titled "Establish connection parameters for Data ONTAP and OCUM", I changed the set_timeout value from 60 (the default) to 300 and restarted the OCUM pollers. Everything is now collecting normally.

 

We have a NetApp support case open to find the root cause of why the OCUM server takes so long to calculate/process cDOT metrics...

 

## Establish connection parameters for Data ONTAP and OCUM
##
sub connect_naserver($@)
{
    my $major = shift;
    my $minor = shift;

    # Check hostname resolution as this is a common mistake that will prevent the worker from collecting
    if ($connection{'hostname'} =~ /^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$/)
    {
        ;   # Already an IP address, nothing to resolve
    }
    elsif ( my $ip_packed = gethostbyname($connection{'hostname'}) ) # If a name try resolution
    {
        my ($a,$b,$c,$d) = unpack('C4',$ip_packed);
        logger ("DEBUG", "[connect] Resolved hostname [$connection{'hostname'}] to IP address [$a.$b.$c.$d]");
    }
    else # We were a name but failed to resolve. Maybe later resolution will work, so just warn about it.
    {
        logger ("WARNING", "[connect] Unable to resolve hostname [$connection{'hostname'}]. Ensure correct hostname resolution or use IP address instead.");
    }

    my $s = NaServer->new ($connection{'hostname'}, $major, $minor);

    # Check if SDK supports HTTP/1.1 and it is the default version, and for a reverse hostname resolution failure.
    # If both are true it will cause NMSDK failures later (see bug id 881464) and we need to set the less efficient HTTP/1.0
    if ( (eval { $s->get_http_version() }) && ( $s->get_http_version() eq '1.1') )
    {
        if ( ! gethostbyaddr( inet_aton($connection{'hostname'}), AF_INET ))
        {
            $s->set_http_version('1.0');
            logger ("WARNING", "[connect] Setting HTTP/1.0 because reverse hostname resolution (IP -> hostname) fails. To enable HTTP/1.1 ensure reverse hostname resolution succeeds.");
        }
        else
        {
            logger ("DEBUG", "[connect] Reverse hostname lookup successful. Using HTTP/1.1 for communication.");
        }
    }
    else
    {
        logger ("DEBUG", "[connect] Using HTTP/1.0 for communication (either set earlier or only version supported by SDK).");
    }

    ## Force HTTPS
    my $out = $s->set_transport_type('HTTPS');
    if (ref ($out) eq "NaElement") {
        if ($out->results_errno != 0) {
            my $r = $out->results_reason();
            logger ("ERROR", "Connection to $connection{'hostname'} failed: $r");
            exit (1);
        }
    }

    ## Use SSL certificates if possible
    if ($connection{'auth_type'} eq 'ssl_cert')
    {
        $out = $s->set_style('CERTIFICATE');
        if (ref ($out) eq "NaElement") {
            if ($out->results_errno != 0) {
                my $r = $out->results_reason();
                logger ("ERROR", "Connection to $connection{'hostname'} failed: $r");
                exit (1);
            }
        }
        $out = $s->set_server_cert_verification('0');
        $out = $s->set_client_cert_and_key($Bin."/cert/".$connection{'ssl_cert'}, $Bin."/cert/".$connection{'ssl_key'});
        if (ref ($out) eq "NaElement") {
            if ($out->results_errno != 0) {
                my $r = $out->results_reason();
                logger ("ERROR", "Connection to $connection{'hostname'} failed: $r");
                exit (1);
            }
        }
    }
    else # Assume username / password
    {
        $out = $s->set_style('LOGIN');
        if (ref ($out) eq "NaElement") {
            if ($out->results_errno != 0) {
                my $r = $out->results_reason();
                logger ("ERROR", "Connection to $connection{'hostname'} failed: $r");
                exit (1);
            }
        }
        $out = $s->set_admin_user($connection{'username'}, $connection{'password'});
    }

    $out = $s->set_server_type($connection{'host_type'});
    $out = $s->set_port($connection{'host_port'});

    # Max of 300 seconds (raised from the default of 60) before we reset the connection. Especially if API is
    # accepted but reply never comes (Ex. cf takeover occurs mid call) the timeout will ensure we don't block forever
    $out = $s->set_timeout('300');

    return $s;
}
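
For anyone who wants to script the same change, here is a hedged sketch. It assumes the call is written as set_timeout('NN') with a single-quoted number, as in the listing above; the function name is hypothetical. It keeps a .bak copy so the edit can be reverted.

```shell
# Rewrite the set_timeout('NN') value in a netapp-worker script.
#   $1  path to netapp-worker
#   $2  new timeout in seconds
raise_worker_timeout() {
    local file=$1 seconds=$2
    # Keep the original next to the edited file, preserving mode and timestamps.
    cp -p "$file" "$file.bak"
    # Read from the backup and write over the original in place, so the
    # script keeps its permissions.
    sed "s/set_timeout('[0-9]*')/set_timeout('$seconds')/" "$file.bak" > "$file"
}
```

Usage would be raise_worker_timeout /opt/netapp-harvest/netapp-worker 300, followed by a restart of the OCUM pollers. Since this edits the shipped script directly, a reinstall or upgrade of Harvest would most likely put the default back.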
