Solved: NetApp Harvest error: [nic_common] plugin failed to compile

dlmaldonado · ‎2016-06-23

Collection on one of my 8.3.2P2 clusters stopped, with the errors below logged. All other clusters seem to be fine. Has anyone seen this?

[2016-06-21 11:36:00] [NORMAL ] Poller status: status, secs=14400, api_time=8170, plugin_time=274, metrics=1978019, skips=587, fails=0
[2016-06-21 13:00:42] [WARNING] [nic_common] plugin failed to compile: Illegal division by zero at /opt/netapp-harvest/plugin/cdot-nic-common line 86.

[2016-06-21 13:00:42] [ERROR ] [nic_common] Restarting netapp-worker as an attempt to clear issue
[2016-06-21 13:00:42] [NORMAL ] WORKER STARTED [Version: 1.2.2] [Conf: netapp-harvest.conf] [Poller: ntap-cla01]
[2016-06-21 13:00:42] [NORMAL ] [main] Poller will monitor a [FILER] at [192.168.94.1:443]
[2016-06-21 13:00:42] [NORMAL ] [main] Poller will use [password] authentication with username [netapp-harvest] and password [**********]
[2016-06-21 13:00:43] [NORMAL ] [main] Collection of system info from [192.168.94.1] running [NetApp Release 8.3.2P2] successful.
[2016-06-21 13:00:43] [NORMAL ] [main] Using best-fit collection template: [cdot-8.3.0.conf]
[2016-06-21 13:00:43] [NORMAL ] [main] Using graphite_root [netapp.perf.springfield.ntap-cla01]
[2016-06-21 13:00:43] [NORMAL ] [main] Using graphite_meta_metrics_root [netapp.poller.perf.springfield.ntap-cla01]
[2016-06-21 13:00:43] [NORMAL ] [smb2:node] Collection of object not enabled; skipping
[2016-06-21 13:00:43] [NORMAL ] [smb2:vserver] Collection of object not enabled; skipping
[2016-06-21 13:00:43] [NORMAL ] [main] Startup complete. Polling for new data every [60] seconds.
[2016-06-21 13:02:39] [WARNING] [nic_common] plugin failed to compile: Illegal division by zero at /opt/netapp-harvest/plugin/cdot-nic-common line 86.

J_curl · ‎2016-06-23

No, and I was about to push P2 to a DEV cluster. Was this running fine against P1?

dlmaldonado · ‎2016-06-23

It's working on other 8.3.2P2 clusters, and it had been working fine on this one for at least 2 weeks after we upgraded. Not sure why it stopped collecting.

dlmaldonado · ‎2016-06-24

FYI, in order to pull back any metrics I had to comment out these lines in "/opt/netapp-harvest/plugin/cdot-nic-common"

my $rx_pct = sprintf ("%.2f", $h{$start}{$port}{rx_bytes_per_sec} / $link_speed * 100 );
my $tx_pct = sprintf ("%.2f", $h{$start}{$port}{tx_bytes_per_sec} / $link_speed * 100 );
my $pct = sprintf ("%.2f", $tx_pct);
$pct = sprintf ("%.2f", $rx_pct) if ($rx_pct > $tx_pct);
push @emit_items, "$start.$port.rx_pct_util $rx_pct $timestamp";
push @emit_items, "$start.$port.tx_pct_util $tx_pct $timestamp";
push @emit_items, "$start.$port.link_pct_util $pct $timestamp";

I realize this is not a solution, but I need to collect something rather than nothing and, as I said, I only experienced this on one cluster. The others are fine, and this one had been working after the 8.3.2P2 upgrade. It's a 14-node NFS cluster. After a certain date, collection failed with: [WARNING] [nic_common] plugin failed to compile: Illegal division by zero at /opt/netapp-harvest/plugin/cdot-nic-common.
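
Rather than commenting the lines out entirely, the division itself could be guarded. The following is a minimal sketch, not the shipped Harvest fix: `pct_util` is a hypothetical helper, and the sample numbers are made up for illustration. It returns nothing when the link speed is zero or undefined, so the caller can skip that port's utilization metrics and keep emitting everything else.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Harvest): percent utilization, or undef
# when the link speed is missing or zero so the caller can skip the metric.
sub pct_util {
    my ($bytes_per_sec, $link_speed) = @_;
    return unless defined $link_speed && $link_speed > 0;
    return sprintf("%.2f", $bytes_per_sec / $link_speed * 100);
}

# Illustrative values: 12.5 MB/s of traffic on a 1Gbit port (125 MB/s).
my $rx_pct = pct_util(12.5, 125);
print defined $rx_pct ? "rx_pct_util $rx_pct\n" : "skipped: no usable link_speed\n";

# A zero link speed no longer dies; the metric is simply skipped.
my $bad = pct_util(12.5, 0);
print defined $bad ? "rx_pct_util $bad\n" : "skipped: no usable link_speed\n";
```

In the plugin's per-port loop, the equivalent would be a `next` before the three `push @emit_items` lines whenever the helper returns undef.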

madden · ‎2016-06-27

Hi @dlmaldonado

There appears to be an issue with the link_speed counter value on some interface(s) on your cluster. My guess is either something changed in 8.3.2P2, or after the upgrade/reboot some unused interface didn't get a value set as it should (which could also be a new behavior in 8.3.2P2).

Can you restart the poller in verbose mode, wait 5 minutes, and then restart it again in normal mode?

/opt/netapp-harvest/netapp-manager -restart -poller <clustername> -v

<wait 5 minutes>

/opt/netapp-harvest/netapp-manager -restart -poller <clustername>

Then provide the logfile in /opt/netapp-harvest/log/<poller>_netapp-harvest.log

From that log I can see what the incoming link_speed values are and hopefully explain why it's not working as it should.

I will also send you a private message in case you prefer to share the logs privately.

Cheers,
Chris Madden

Storage Architect, NetApp EMEA (and author of Harvest)

Blog: It all begins with data

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

hashiya1112 · ‎2016-08-26

Hi,

We are using Harvest to collect performance data from cDOT 8.2.3 and 8.3.

On Aug 23 we upgraded from 8.2.3 to 8.2.4P4, and after the upgrade the same error occurred.

When I check the netapp-dashboard-cluster dashboard in Grafana, eth port utilization is greater than 3000 percent.

dlmaldonado wrote about these lines in "/opt/netapp-harvest/plugin/cdot-nic-common":

--------------------------------------------------------------------------------------------------------

my $rx_pct = sprintf ("%.2f", $h{$start}{$port}{rx_bytes_per_sec} / $link_speed * 100 );
my $tx_pct = sprintf ("%.2f", $h{$start}{$port}{tx_bytes_per_sec} / $link_speed * 100 );
my $pct = sprintf ("%.2f", $tx_pct);
$pct = sprintf ("%.2f", $rx_pct) if ($rx_pct > $tx_pct);
push @emit_items, "$start.$port.rx_pct_util $rx_pct $timestamp";
push @emit_items, "$start.$port.tx_pct_util $tx_pct $timestamp";
push @emit_items, "$start.$port.link_pct_util $pct $timestamp";

--------------------------------------------------------------------------------------------------------

How should this code be fixed?

madden · ‎2016-08-26

Hi @hashiya1112

Actually, we resolved this offline. One of the ports was link up but at 10Mbit, and the plugin logic was not able to convert that speed correctly. I have added a fix and it will ship in the next Harvest release on the toolchest. In the meantime, perhaps you can find the port(s) that are online at 10Mbit and change them to 100Mbit or faster?

Cheers,
Chris Madden
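
To illustrate how an unhandled speed can end in a division by zero: the shipped plugin code is not shown in this thread, so the table below is an assumption, modeled on the fixed-value conversions quoted elsewhere in the discussion. `lookup_link_speed` and `%mb_per_sec_for` are hypothetical names. With a fixed table, a port negotiated at a speed that is not listed gets no capacity value, and any later division by that value fails unless the caller checks first.

```perl
use strict;
use warnings;

# Assumed for illustration: known link speeds (bits/sec) mapped to MB/s.
# There is deliberately no 10Mbit entry.
my %mb_per_sec_for = (
    100000000   => 12.5,    # 100Mbit
    1000000000  => 125,     # 1Gbit
    10000000000 => 1250,    # 10Gbit
);

# Hypothetical helper: MB/s capacity, or undef for a speed not in the table.
sub lookup_link_speed {
    my ($bits_per_sec) = @_;
    return $mb_per_sec_for{$bits_per_sec};
}

for my $speed (10000000, 1000000000) {
    my $capacity = lookup_link_speed($speed);
    if (defined $capacity) {
        print "$speed bit/s -> $capacity MB/s\n";
    }
    else {
        print "$speed bit/s -> not in table, utilization cannot be computed\n";
    }
}
```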

hashiya1112 · ‎2016-08-26

Hi Chris,

Thank you for the reply.

There is an offline port, but none of the online ports is at 10Mbit.

Should I change the port with the network port modify command?

hashiya1112 · ‎2016-08-26

I tried editing nic_common as follows:

		if ($connection{normalized_xfer} eq 'mb_per_sec')
		{
			$link_speed = 1.25 if ($h{$start}{$port}{link_speed} == 10000000 );  #10Mbit
			$link_speed = 12.5 if ($h{$start}{$port}{link_speed} == 100000000 );  #100Mbit
			$link_speed = 125  if ($h{$start}{$port}{link_speed} == 1000000000 );  #1Gbit
			$link_speed = 1250 if ($h{$start}{$port}{link_speed} == 10000000000 ); #10Gbit
		}
		elsif ($connection{normalized_xfer} eq 'kb_per_sec')
		{
			$link_speed = 1250 if ($h{$start}{$port}{link_speed} == 10000000 );  #10Mbit
			$link_speed = 12500 if ($h{$start}{$port}{link_speed} == 100000000 );  #100Mbit
			$link_speed = 125000  if ($h{$start}{$port}{link_speed} == 1000000000 );  #1Gbit
			$link_speed = 1250000 if ($h{$start}{$port}{link_speed} == 10000000000 ); #10Gbit
		}
		elsif ($connection{normalized_xfer} eq 'b_per_sec')
		{
			$link_speed = 1250000 if ($h{$start}{$port}{link_speed} == 10000000 );  #10Mbit
			$link_speed = 12500000 if ($h{$start}{$port}{link_speed} == 100000000 );  #100Mbit
			$link_speed = 125000000  if ($h{$start}{$port}{link_speed} == 1000000000 );  #1Gbit
			$link_speed = 1250000000 if ($h{$start}{$port}{link_speed} == 10000000000 ); #10Gbit
		}
		elsif ($connection{normalized_xfer} eq 'gb_per_sec')
		{
			$link_speed = .00125 if ($h{$start}{$port}{link_speed} == 10000000 );  #10Mbit
			$link_speed = .0125 if ($h{$start}{$port}{link_speed} == 100000000 );  #100Mbit
			$link_speed = .125  if ($h{$start}{$port}{link_speed} == 1000000000 );  #1Gbit
			$link_speed = 1.25  if ($h{$start}{$port}{link_speed} == 10000000000 ); #10Gbit
		}

The error no longer appears, but the eth port utilization percentage is now strange.

e0M (the node management port) shows 3820 percent utilization, and e0M is a 100Mbit port.

Hmm....

madden · ‎2016-08-26

Hi @hashiya1112

Maybe give this a try:

		my $link_speed = 1;
		if ($connection{normalized_xfer} eq 'mb_per_sec')
		{
			$link_speed = $h{$start}{$port}{link_speed} / 8000000;
		}
		elsif ($connection{normalized_xfer} eq 'kb_per_sec')
		{
			$link_speed = $h{$start}{$port}{link_speed} / 8000;
		}
		elsif ($connection{normalized_xfer} eq 'b_per_sec')
		{
			$link_speed = $h{$start}{$port}{link_speed} / 8;
		}
		elsif ($connection{normalized_xfer} eq 'gb_per_sec')
		{
			$link_speed = $h{$start}{$port}{link_speed} / 8000000000;
		}
		next if ($link_speed == 1); # Skip posting utilization if we couldn't normalize

If you still see odd utilization values, see the instructions earlier in this thread for collecting the logs needed to understand what is happening, and send me those logs in a private message.

Cheers,
Chris Madden
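
One case the division-based conversion does not cover on its own is a port whose link_speed is reported as 0: the result is 0, the `== 1` check does not catch it, and the later percentage calculation would divide by zero again. Whether an offline port actually reports 0 is an assumption here, since the thread does not show the raw counter values. A minimal sketch of the same conversion with an explicit guard follows; `normalize_link_speed` and `%divisor_for` are hypothetical names, and the divisors are the ones from the reply above.

```perl
use strict;
use warnings;

# Divisors that turn link_speed (bits/sec) into the unit named by
# normalized_xfer; values taken from the snippet in this thread.
my %divisor_for = (
    b_per_sec  => 8,
    kb_per_sec => 8000,
    mb_per_sec => 8000000,
    gb_per_sec => 8000000000,
);

# Hypothetical helper: returns the normalized capacity, or undef to signal
# "skip this port" (unknown unit, or link_speed missing or zero).
sub normalize_link_speed {
    my ($bits_per_sec, $xfer_unit) = @_;
    my $divisor = $divisor_for{$xfer_unit};
    return unless defined $divisor;
    return unless defined $bits_per_sec && $bits_per_sec > 0;
    return $bits_per_sec / $divisor;
}

# A 10Mbit port normalizes to 1.25 MB/s; a port reporting 0 is skipped.
for my $bits (10000000, 100000000, 0) {
    my $speed = normalize_link_speed($bits, 'mb_per_sec');
    print defined $speed ? "$bits bit/s -> $speed MB/s\n" : "$bits bit/s -> skipped\n";
}
```

Using undef as the skip signal avoids overloading the value 1, which is also a legitimate result (for example, 8,000,000 bit/s in mb_per_sec).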

hashiya1112 · ‎2016-08-26

Hi Chris,

Thanks for the code!

We also use Harvest in another project, which runs cDOT 8.2.3P3 and cDOT 8.2.4P4.

In that case, should I do the following?

cp /opt/netapp-harvest/plugin/cdot-nic-common /opt/netapp-harvest/plugin/cdot-nic-common-8.2.4

vi /opt/netapp-harvest/plugin/cdot-nic-common-8.2.4
------------------Fix to this code-------------------------
		my $link_speed = 1;
		if ($connection{normalized_xfer} eq 'mb_per_sec')
		{
			$link_speed = $h{$start}{$port}{link_speed} / 8000000;
		}
		elsif ($connection{normalized_xfer} eq 'kb_per_sec')
		{
			$link_speed = $h{$start}{$port}{link_speed} / 8000;
		}
		elsif ($connection{normalized_xfer} eq 'b_per_sec')
		{
			$link_speed = $h{$start}{$port}{link_speed} / 8;
		}
		elsif ($connection{normalized_xfer} eq 'gb_per_sec')
		{
			$link_speed = $h{$start}{$port}{link_speed} / 8000000000;
		}
		next if ($link_speed == 1); # Skip posting utilization if we couldn't normalize
-------------------------------------------

cp /opt/netapp-harvest/template/default/cdot-8.2.0.conf /opt/netapp-harvest/template/default/cdot-8.2.4.conf

vi /opt/netapp-harvest/template/default/cdot-8.2.4.conf

	'nic_common' =>
			{ 
				counter_list     => [ qw(node_name node_uuid instance_name
									rx_bytes_per_sec tx_bytes_per_sec 
									link_speed link_up_to_downs
									) ],
				graphite_leaf    => 'node.{node_name}.eth_port.{instance_name}',
				plugin           => 'cdot-nic-common-8.2.4',
				enabled          => '1'							
			},

Is it right to create a new cdot-8.2.4.conf and cdot-nic-common-8.2.4 like this, and point the plugin entry in the template at the new file, so that the change is kept apart from the original nic_common file?

I ask because the eth port utilization calculation became strange on cDOT 8.2.3P3.

Regards.

madden · ‎2016-08-28

Hi @hashiya1112

I think the issue is related to bug 915637. The counters in nic_common that track tx/rx are stored as 4-byte (32-bit) numbers, which means they roll over quite frequently, and that can distort the displayed values. New 8-byte (64-bit) counters were added in 8.2.4 and 8.3.2, and Harvest v1.3 will include this fix. I will contact you offline to provide a patch in the meantime.

Cheers,
Chris Madden
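
A rough sense of how often such a counter rolls over, as a sketch under stated assumptions: the counter is taken to be a 32-bit (4-byte) running total of bytes, and `seconds_to_wrap` and `counter_delta` are hypothetical helpers, not Harvest code. The second helper shows the usual single-wrap correction when computing the difference between two samples.

```perl
use strict;
use warnings;

use constant WRAP_32 => 2**32;    # a 32-bit counter wraps after 4,294,967,296

# Seconds until a 32-bit byte counter wraps at a sustained byte rate.
sub seconds_to_wrap {
    my ($bytes_per_sec) = @_;
    return WRAP_32 / $bytes_per_sec;
}

# Difference between two samples, allowing for at most one wrap in between.
sub counter_delta {
    my ($prev, $curr) = @_;
    return $curr >= $prev ? $curr - $prev : $curr + WRAP_32 - $prev;
}

# 1Gbit line rate is 125,000,000 bytes/sec.
printf "1Gbit line rate wraps every %.1f s\n", seconds_to_wrap(125_000_000);
printf "delta across a wrap: %d bytes\n", counter_delta(4_294_967_000, 704);
```

At 1Gbit line rate the counter wraps roughly every 34 seconds. With a 60-second polling interval like the one in the log at the top of this thread, it can wrap more than once between samples, and no single-wrap correction can recover the true delta. A wider counter avoids the problem.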
