Node resources over-utilized

bghanson · ‎2017-05-03

I have been getting the following message off and on for about the last two weeks. When I looked at OnCommand System Manager, everything looked green. Today I looked into the charts more closely and could see where utilization exceeded 100%. I asked my local VAR, and he said I was probably overcommitting my deduped volumes, and that if I needed to rehydrate a VMDK I would not have enough room. My cluster is used as the datastore for a cluster of five (5) VMware hosts, and all the guests use thin provisioning. My question is: how do I determine which resource is being over-utilized? I have plenty of unused space, so I can increase one of the volumes if it needs more space. If it is a different resource, how do I identify which one?

Warning System-defined Threshold Event

Summary

Trigger Time 12:29 am, 3 May CDT

Description 1 new warning system-defined threshold(s) breached on Cluster s00nacluster1

p-sdt-s00nacluster1-nod-784 Policy Name: Node resources over-utilized

Perf. Capacity Used value of 159% on s00nacluster1-01 has triggered a WARNING event based on threshold setting of 100%.

colsen · ‎2017-05-03

Hello,

The first place to look is at your "scheduled tasks" (i.e. snapmirrors, scheduled snaps, dedupes, etc.). See if any of those events (or several at once) are occurring when you're seeing your spikes. We had a few scenarios where our dedupes were kicking off on top of our snap-then-mirror schedules, which was throwing the spindles and cores into the "over-utilized" zone.
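To make that comparison less manual, here is a minimal Python sketch that checks which scheduled jobs overlap a given spike time. The job names and time windows in SCHEDULES are made-up placeholders, not values read from the cluster; you would fill them in by hand from your own schedules. The spike time used is the 12:29 am trigger time from the alert above.

```python
from datetime import datetime, time

# Hypothetical schedule windows (start, end) in local time.
# These are placeholders for illustration; replace them with your own schedules.
SCHEDULES = {
    "dedupe": (time(0, 0), time(1, 0)),
    "snapmirror": (time(0, 15), time(0, 45)),
    "snapshot": (time(6, 0), time(6, 5)),
}


def in_window(t, start, end):
    """Return True if time-of-day t falls in [start, end].

    Windows that wrap past midnight (start > end) are handled too.
    """
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end


def overlapping_jobs(spike, schedules=SCHEDULES):
    """Return the sorted names of jobs whose window contains the spike time."""
    return sorted(
        name
        for name, (start, end) in schedules.items()
        if in_window(spike.time(), start, end)
    )


# Trigger time from the alert: 12:29 am, 3 May.
spike = datetime(2017, 5, 3, 0, 29)
print(overlapping_jobs(spike))
```

If more than one job name comes back for most of your spikes, staggering those schedules is probably the first thing to try.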

If nothing like that jumps out, you'll need to see if there's an obvious pattern in the over-utilized spikes. If convenient (i.e. not 2 AM), jump on the console and run a statit at the node level during the period when the spike would normally show up. If you have the time, also run a sysstat -M to see what your cores are doing during that window.

If none of that turns up the smoking gun and/or it's just a pain to do that kind of gather due to time/inconsistency, open up a case and look at running a perfstat. Here's a very useful link I got from Daniel Savino for doing a long-running perfstat:

https://kb.netapp.com/support/s/article/ka31A00000012gpQAA/how-to-collect-performance-statistics-for-intermittent-issues?language=en_US

The GUI tool is pretty nice - much easier for a Windoze guy like me to figure out. The TSE you work with might also be able to get some good information from the ASUP performance gather (if you're generating full/HTTPS ASUPs).

Finally (and maybe even first), look at your main workloads and see if your latency is straying out of what you'd consider acceptable ranges. A couple of our nodes regularly report a few periods of over-utilization, but those periods tend to be after hours (our snapmirror times), and our user I/O stays in the <5 ms latency range even then, so we've honestly stopped worrying about it.
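As a rough illustration of that rule of thumb, the Python sketch below flags only the samples where the node is over 100% performance capacity used and latency has also reached the limit. The Sample type, the sample values, and the 5 ms default are assumptions for the example (only the 159% figure comes from the alert above); the numbers would come from whatever you export from your own monitoring.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    """One observation; field names are hypothetical, not an ONTAP schema."""
    label: str
    perf_capacity_used_pct: float
    latency_ms: float


def worrying_samples(
    samples: List[Sample],
    util_threshold_pct: float = 100.0,
    latency_limit_ms: float = 5.0,
) -> List[Sample]:
    """Keep samples that are over-utilized AND at or above the latency limit."""
    return [
        s
        for s in samples
        if s.perf_capacity_used_pct > util_threshold_pct
        and s.latency_ms >= latency_limit_ms
    ]


# Made-up sample data for illustration only.
samples = [
    Sample("00:29 after-hours", 159.0, 3.2),   # over-utilized, latency fine
    Sample("10:15 business hours", 85.0, 2.1),  # normal
    Sample("14:40 business hours", 120.0, 9.5), # over-utilized and slow
]

for s in worrying_samples(samples):
    print(f"{s.label}: {s.perf_capacity_used_pct:.0f}% used, {s.latency_ms} ms")
```

With this data, only the 14:40 sample is reported; the after-hours spike is ignored because user latency stayed under the limit.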

Hope that helps,

Chris

bghanson · ‎2017-05-03

Thanks