ONTAP Discussions
I am seeing CPU activity of 75% - 90% on average, and using Harvester I can see that the node's aggr0 is unusually busy. AutoSupport reports that this bug (see URL below) is present, which I suspect may be the cause of the high CPU and the high aggr0 / root volume activity.
https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1061496
However, I am not sure how to fix it. It's my understanding that access to the systemshell is limited to support, so how can I get access to the file /var/etc/periodic.conf.local to carry out the workaround?
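For context, my understanding is that the systemshell is normally reached by unlocking the built-in diag account, roughly like this (I haven't tried it yet, and the exact steps may vary by ONTAP release):
hncl1::> security login unlock -username diag        (unlock the built-in diag account)
hncl1::> security login password -username diag      (set a password for it)
hncl1::> set -privilege diagnostic                    (systemshell requires diag privilege)
hncl1::*> systemshell -node hncl1-01                  (drop into the BSD systemshell on the affected node)
From there /var/etc/periodic.conf.local could presumably be edited with vi, but I would want to confirm the exact procedure with support before changing anything under the systemshell.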
Problem node:
hncl1::*> node run -node hncl1-01 -command sysstat -d
CPU NFS CIFS HTTP Net kB/s HDD kB/s SSD kB/s Tape kB/s Cache
in out read write read write read write age
89% 15055 212 0 57821 20892 666163 21 2496 0 0 0 17s
92% 15433 232 0 128528 17175 715690 1153 2505 0 0 0 17s
86% 15850 520 0 49546 30543 546324 16 2647 0 0 0 28s
82% 16405 254 0 21778 38242 462387 5 1889 0 0 0 14s
89% 15776 232 0 82837 14417 700850 16 2069 0 0 0 14s
89% 15344 234 0 45265 12588 661101 153785 6578 10992 0 0 14s
87% 13749 533 0 20167 13852 650736 132215 1648 5481 0 0 14s
Good node for comparison:
hncl1::*> node run -node hncl1-02 -command sysstat -d
CPU NFS CIFS HTTP Net kB/s HDD kB/s SSD kB/s Tape kB/s Cache
in out read write read write read write age
35% 2540 3 0 13430 14434 130278 16 3182 0 0 0 22s
37% 3007 1 0 15732 18484 43973 0 4699 0 0 0 24s
32% 3123 9 0 13962 14882 16851 21 4424 0 0 0 24s
31% 3213 2 0 13745 14351 12757 0 4600 0 0 0 24s
31% 2933 6 0 13607 14751 15683 16 4115 0 0 0 19s
32% 2839 1 0 27327 15809 27155 5 3539 0 0 0 19s
39% 3673 3 0 19517 17364 39400 65240 10589 14541 0 0 23s
35% 3361 5 0 16045 19323 17016 6499 4811 909 0 0 23s
37% 2846 3 0 20705 19875 19470 21 5300 0 0 0 23s
Having this high CPU also means I cannot perform an NDU to get to a higher release where the bug is fixed. Once I can get the CPU under control I can perform the NDU.
I hope someone can please provide some guidance.
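In case it helps with troubleshooting, I can also pull a per-processor CPU breakdown from the nodeshell, e.g. at a 5-second interval (sysstat -M, the per-domain variant, may also be available at advanced privilege):
hncl1::*> node run -node hncl1-01 -command sysstat -m 5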
Solved! See The Solution
The problem was indeed caused by:
https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1061496
top output:
1942 root 1 8 - 0K 64K CPU4 4 313.4H 33.40% NwkThd_01
1945 root 1 8 - 0K 64K CPU7 7 313.5H 32.08% NwkThd_04
1944 root 1 8 - 0K 64K WAIT 1 312.9H 31.88% NwkThd_03
1941 root 1 8 - 0K 64K WAIT 7 313.3H 31.64% NwkThd_00
1943 root 1 8 - 0K 64K WAIT 4 313.3H 31.59% NwkThd_02
62704 root 1 -1 0 12552K 3516K spin_r 1 117:32 13.62% du
88252 root 1 -1 0 45320K 37540K spin_r 4 413:44 10.06% du
51860 root 1 -1 0 51464K 30676K spin_r 5 517:23 8.01% du
26842 root 1 -1 0 18696K 8372K spin_r 3 293:33 8.01% du
71060 root 1 -1 0 448M 226M spin_r 3 34.3H 7.76% du
2065 root 1 8 - 0K 32K WAIT 6 99.6H 6.59% gbuf_free_process
2512 root 1 8 - 0K 32K WAIT 6 67.3H 5.57% CsmMpAgentThread
1387 root 32 40 0 305M 29236K uwait 3 207:53 5.47% notifyd
3314 root 1 12 - 0K 64K WAIT 3 65.8H 5.08% wafl_exempt03
3317 root 1 12 - 0K 64K WAIT 3 65.8H 5.03% wafl_exempt06
3313 root 1 12 - 0K 64K WAIT 7 65.8H 5.03% wafl_exempt02
Last night I did a failover and failback under an emergency change. This has addressed the CPU issue temporarily for now. We plan to upgrade to 9.1P10 in the next available change control window.
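For anyone needing to do the same, the takeover/giveback was along these lines (node names are from this thread; check the HA pair is healthy before starting):
hncl1::> storage failover show                        (confirm the partner is healthy)
hncl1::> storage failover takeover -ofnode hncl1-01   (partner takes over the busy node)
hncl1::> storage failover giveback -ofnode hncl1-01   (give the resources back once the node is up)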
Thanks to everyone that replied.
The busy root volume is unlikely to be the main cause of the high CPU.
See private message on what to collect.
While workarounds are possible, ONTAP 9.1P10 is our most recent release of the 9.1 codebase, would be the recommended upgrade from 9.1P1, and would resolve this issue. The Upgrade Advisor functionality of ActiveIQ can walk you through the upgrade process step by step; the upgrade should be non-disruptive (CIFS users may need to reconnect).
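For reference, the automated non-disruptive upgrade is normally driven from the cluster shell roughly as follows (the web server URL and image file name below are placeholders; Upgrade Advisor will generate the exact plan for your cluster):
hncl1::> cluster image package get -url http://webserver.example.com/91P10_image.tgz
hncl1::> cluster image package show-repository         (confirm the package is staged)
hncl1::> cluster image validate -version 9.1P10        (run the pre-update checks)
hncl1::> cluster image update -version 9.1P10          (rolling, non-disruptive update)
hncl1::> cluster image show-update-progress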