ONTAP8 Cluster Mode Performance issues

BruceD · 2015-09-27

So I'm seeing an issue where we have 99% processor usage on Node 1 of our cluster. This happens every Sunday at 2:00 AM server time. The real fun comes when trying to track down what's running that might be contributing to the issue. When we run the job show command, a couple of jobs show as dormant and the rest as queued.

Node::*> job show -fields name,starttime,endtime,queuetime,completion,restarted,progress,state,jobtype,category,priority
id vserver name priority queuetime starttime endtime restarted state completion jobtype category progress
-- --------------- -------------------------- -------- ---------------- ---------------- ---------------- --------- ------ ---------- ---------------------- -------------------- ---------
1 Node "Certificate Expiry Check" Low "09/27 00:00:05" "09/27 00:00:05" "09/27 00:00:05" false Queued DONE "Security Certificate" "Certificate Expiry" Unclaimed
2 node "SnapMirror Service Job" Low "11/11 14:46:09" "11/11 14:46:09" - false Dormant "" smServiceJob SnapMirror "Waiting for next session"
4 node Licensing Low "09/27 00:00:05" "09/27 00:00:05" "09/27 00:00:05" false Queued Succeeded "Cluster Licenses" License Unclaimed
5 node "Vol Reaper" High "09/27 02:51:44" "09/27 02:51:44" "09/27 02:51:44" false Queued "" "Vol Reaper" VOPL Unclaimed
6 node "CLUSTER BACKUP AUTO 8hour" Medium "09/27 02:23:05" "09/27 02:15:00" "09/27 02:23:05" false Queued "" "CLUSTER BACKUP" "CLUSTER BACKUP" Unclaimed
7 node "CLUSTER BACKUP AUTO daily" Medium "09/27 00:18:01" "09/27 00:10:00" "09/27 00:18:01" false Queued "" "CLUSTER BACKUP" "CLUSTER BACKUP" Unclaimed
8 node "CLUSTER BACKUP AUTO weekly" Medium "09/27 00:21:20" "09/27 00:15:00" "09/27 00:21:20" false Queued "" "CLUSTER BACKUP" "CLUSTER BACKUP" Unclaimed
9 node "SnapMirror Service Job" Low "11/11 15:01:34" "11/11 15:01:34" - false Dormant "" smServiceJob SnapMirror "Waiting for next session"
11 node "Network Consistency Diagnostic - weekly" Low "09/27 00:15:01" "09/27 00:15:00" "09/27 00:15:01" false Queued "" "Consistency Checker" "Networking Diagnostic" Queued
49 node "Network Consistency Diagnostic - weekly" Low "09/27 00:15:01" "09/27 00:15:00" "09/27 00:15:01" false Queued "" "Consistency Checker" "Networking Diagnostic" Queued
334 node scmstJob Medium "09/27 02:58:48" "09/27 02:58:48" "09/27 02:58:48" false Queued Done "Message Sequence Timer" "Shadow Copy" Unclaimed
2789 node "Peer Manager for cluster b7c05ff9-9497-11e3-be08-123478563412" Medium "09/27 02:58:20" "09/27 02:58:20" "09/27 02:58:20" false Queued "Periodic task(s) succeeded." X-Cluster system Unclaimed
4189 node "SnapVault Verification Job" High "09/27 02:33:01" "09/27 02:33:01" "09/27 02:33:01" false Queued Succeeded "SnapVault Verification" SnapVault Unclaimed
7904 node "Peer Manager for cluster 5e15a1cd-220c-11e4-a455-123478563412" Medium "09/27 02:58:53" "09/27 02:58:52" "09/27 02:58:53" false Queued "Periodic task(s) succeeded." X-Cluster system -
14 entries were displayed.

As you can see, even with the additional fields called out, the output still doesn't tell us what's hogging the resources. So my question to you all: how can we better retrieve the data we need to identify the problem job?

- Bruce D

RPHELANIN · 2015-09-28

Bruce,

It's not uncommon to see a processor running at 99% with no evident performance issues. Are all CPUs running at 99%, or just a single one? Also, I am pretty sure disk scrub kicks off at 2am on a Sunday, which is expected.
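
If it helps to narrow the job listing above down to the 2:00 AM window, here is a rough Python sketch. It assumes each job from the job show output has been joined onto a single line, with the fields in the same order as the header in the post. The helper names (parse_job_row, started_between) are made up for illustration and are not part of any NetApp tooling. It only filters what job show reports, so it says nothing about activity that never appears in that listing.

```python
import shlex
from datetime import datetime, time

# Field order copied from the header of the job show output in the post.
FIELDS = ["id", "vserver", "name", "priority", "queuetime", "starttime",
          "endtime", "restarted", "state", "completion", "jobtype",
          "category", "progress"]

def parse_job_row(row):
    """Split one job row (already joined onto a single line) into a dict."""
    tokens = shlex.split(row)  # keeps quoted values such as "Vol Reaper" together
    if len(tokens) != len(FIELDS):
        raise ValueError("unexpected field count: %r" % row)
    return dict(zip(FIELDS, tokens))

def started_between(job, start, end):
    """True if the job's start time of day falls in the window [start, end)."""
    if job["starttime"] == "-":
        return False
    clock = job["starttime"].split()[1]  # "09/27 02:51:44" -> "02:51:44"
    started = datetime.strptime(clock, "%H:%M:%S").time()
    return start <= started < end

# A few rows taken from the post, one job per line.
rows = [
    '2 node "SnapMirror Service Job" Low "11/11 14:46:09" "11/11 14:46:09" - false Dormant "" smServiceJob SnapMirror "Waiting for next session"',
    '5 node "Vol Reaper" High "09/27 02:51:44" "09/27 02:51:44" "09/27 02:51:44" false Queued "" "Vol Reaper" VOPL Unclaimed',
    '6 node "CLUSTER BACKUP AUTO 8hour" Medium "09/27 02:23:05" "09/27 02:15:00" "09/27 02:23:05" false Queued "" "CLUSTER BACKUP" "CLUSTER BACKUP" Unclaimed',
    '7 node "CLUSTER BACKUP AUTO daily" Medium "09/27 00:18:01" "09/27 00:10:00" "09/27 00:18:01" false Queued "" "CLUSTER BACKUP" "CLUSTER BACKUP" Unclaimed',
    '4189 node "SnapVault Verification Job" High "09/27 02:33:01" "09/27 02:33:01" "09/27 02:33:01" false Queued Succeeded "SnapVault Verification" SnapVault Unclaimed',
]

jobs = [parse_job_row(r) for r in rows]
suspects = [j for j in jobs if started_between(j, time(2, 0), time(3, 0))]
for j in suspects:
    print(j["id"], j["name"], j["priority"], j["starttime"])
```

Note that this filters on time of day only, so a job that last started at 2 AM on a different date would also match. On the sample rows it leaves the Vol Reaper, the 8-hour cluster backup, and the SnapVault verification job as the ones that started between 02:00 and 03:00.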