Solved: High Node Latencies Netapp Harvest 1.4.1 since ONTAP 9.4

jtoaspern · ‎2018-11-26

Hello,

After upgrading our clusters from ONTAP 9.3P4 to 9.4P3 last week, the node latency data from netapp-harvest in Grafana seems to be unrealistically high (NetApp Dashboard: Cluster -> Highlights -> Latency).

We are using netapp-harvest 1.4.1 without the hotfix, as the link seems to be expired:

https://community.netapp.com/t5/OnCommand-Storage-Management-Software-Discussions/NetApp-Harvest-1-4-1-Hotfix-to-fix-2-bugs/m-p/144160#M26247

We copied cdot-9.3.0.conf to cdot-9.4.0.conf and restarted netapp-harvest's pollers.

On the ONTAP CLI, the latencies of all nodes stay within the 100-1500 µs range (statistics node show -interval 5 -iterations 50 -max 4), which is much lower than the latencies reported by netapp-harvest.

In the attached picture, the latency increase since the upgrade to 9.4 is clearly visible.

How are these latencies calculated?
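For context, here is a minimal sketch of how an average latency is commonly derived from cumulative performance counters: the change in accumulated latency between two polls is divided by the change in the operation count over the same interval. This assumes the collector works on counter deltas; the function name, the dictionary keys, and the sample values are hypothetical and are not taken from the Harvest source.

```python
def average_latency_us(prev, curr):
    """Average latency per operation (in microseconds) between two polls.

    Each sample is a dict with two cumulative counters (hypothetical names):
      "ops"        - total operations completed so far
      "latency_us" - total accumulated latency so far, in microseconds
    Returns None when the interval has no operations or a counter went
    backwards (for example after a reset), since no average can be formed.
    """
    delta_ops = curr["ops"] - prev["ops"]
    delta_latency = curr["latency_us"] - prev["latency_us"]
    if delta_ops <= 0 or delta_latency < 0:
        return None
    return delta_latency / delta_ops


# Made-up samples from two consecutive polls
prev = {"ops": 1_000_000, "latency_us": 500_000_000}
curr = {"ops": 1_060_000, "latency_us": 530_000_000}
print(average_latency_us(prev, curr))  # 500.0
```

In a scheme like this, the result depends on the latency counter and its operation count being sampled consistently, which might be one place to look when a graph disagrees with the CLI. The thread does not establish that this was the cause here.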

There are no entries in the cluster's /opt/netapp-harvest/log/*.log file other than NORMAL poller status messages.

This has come up as an off-topic discussion in a few other threads that are marked as solved, which is why I am opening a new one.

Edit: Latencies reported by OCUM's graphs are in line with the values seen on the CLI.

Kind Regards

Joel

jtoaspern · ‎2019-02-25

Looks like NetApp Harvest 1.4.2 fixed the issue; the node latencies shown in Grafana are "realistic" again.

suren · ‎2018-11-28

Hi Joel,

Can you please try to access the link now? It's updated.

https://community.netapp.com/t5/OnCommand-Storage-Management-Software-Discussions/NetApp-Harvest-1-4-1-Hotfix-to-fix-2-bugs/m-p/144160#M26247

Thanks & Regards,

Surendra

jtoaspern · ‎2018-11-29

Hi Surendra,

Thanks for reuploading the hotfix. I applied it earlier today, but it does not seem to have affected the node latency statistics in any way.

There is a distinct pattern in the latency spikes: every 5 minutes the latency goes up to 10-30 ms, while staying at a relatively calm 0-2 ms between those spikes. 0-2 ms would be in line with the data seen with "statistics node show -iterations 50". I kept a close eye on the statistics shown on the CLI, and no such pattern is visible there; the values are continuously 0.3-1.5 ms on all nodes.

The latency graphs under "Dashboard: Node", which show read/write/other for a single node, are pretty close to the CLI and OCUM statistics, only occasionally reaching into the >4 ms range.
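A periodic pattern like this can be checked on exported graph data with a short script. The following is a minimal sketch, assuming the series is available as (timestamp in seconds, latency in ms) pairs; the function name, the threshold, and the sample data are made up for illustration.

```python
def spike_intervals(samples, threshold_ms):
    """Return the gaps in seconds between the starts of consecutive spikes.

    samples is a list of (timestamp_seconds, latency_ms) pairs in time order.
    A spike starts at the first sample at or above threshold_ms that follows
    a sample below the threshold (or the start of the series).
    """
    starts = []
    in_spike = False
    for ts, latency_ms in samples:
        above = latency_ms >= threshold_ms
        if above and not in_spike:
            starts.append(ts)
        in_spike = above
    return [b - a for a, b in zip(starts, starts[1:])]


# Made-up data: one sample per minute, 1 ms baseline, 20 ms every 5th minute
samples = [(t * 60, 20.0 if t % 5 == 0 else 1.0) for t in range(16)]
intervals = spike_intervals(samples, threshold_ms=10.0)
print(intervals)  # [300, 300, 300]
```

Gaps that cluster tightly around one value, such as 300 seconds here, would suggest the spikes follow a fixed schedule rather than the actual workload.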

I attached a screenshot relevant to the matter.

The spikes may be an indicator of some other factor not yet considered here.

Please note that this does not have any actual negative impact on our production; the behaviour since our 9.3->9.4 update just seems odd.

Regards

Joel
