ONTAP Discussions
I am seeing CPU activity of 75% - 90% on average, and using Harvester I can see that the node's aggr0 is unusually busy. AutoSupport reports that this bug (see URL below) is present, which I suspect may be the cause of the high CPU and the high aggr0 / root volume activity.
https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1061496
However, I am not sure how to fix it. It's my understanding that access to the systemshell is limited to support, so how can I get access to the file /var/etc/periodic.conf.local to carry out the workaround?
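For context, my understanding is that the systemshell is normally reached by unlocking the built-in diag account, roughly like this (I haven't tried it yet, and the exact steps may vary by ONTAP release):
hncl1::> security login unlock -username diag        (unlock the built-in diag account)
hncl1::> security login password -username diag      (set a password for it)
hncl1::> set -privilege diagnostic                    (systemshell requires diag privilege)
hncl1::*> systemshell -node hncl1-01                  (drop into the BSD systemshell on the affected node)
From there /var/etc/periodic.conf.local could presumably be edited with vi, but I would want to confirm the exact procedure with support before changing anything under the systemshell.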
Problem node:
hncl1::*> node run -node hncl1-01 -command sysstat -d
CPU NFS CIFS HTTP Net kB/s HDD kB/s SSD kB/s Tape kB/s Cache
in out read write read write read write age
89% 15055 212 0 57821 20892 666163 21 2496 0 0 0 17s
92% 15433 232 0 128528 17175 715690 1153 2505 0 0 0 17s
86% 15850 520 0 49546 30543 546324 16 2647 0 0 0 28s
82% 16405 254 0 21778 38242 462387 5 1889 0 0 0 14s
89% 15776 232 0 82837 14417 700850 16 2069 0 0 0 14s
89% 15344 234 0 45265 12588 661101 153785 6578 10992 0 0 14s
87% 13749 533 0 20167 13852 650736 132215 1648 5481 0 0 14s
Good node for comparison:
hncl1::*> node run -node hncl1-02 -command sysstat -d
CPU NFS CIFS HTTP Net kB/s HDD kB/s SSD kB/s Tape kB/s Cache
in out read write read write read write age
35% 2540 3 0 13430 14434 130278 16 3182 0 0 0 22s
37% 3007 1 0 15732 18484 43973 0 4699 0 0 0 24s
32% 3123 9 0 13962 14882 16851 21 4424 0 0 0 24s
31% 3213 2 0 13745 14351 12757 0 4600 0 0 0 24s
31% 2933 6 0 13607 14751 15683 16 4115 0 0 0 19s
32% 2839 1 0 27327 15809 27155 5 3539 0 0 0 19s
39% 3673 3 0 19517 17364 39400 65240 10589 14541 0 0 23s
35% 3361 5 0 16045 19323 17016 6499 4811 909 0 0 23s
37% 2846 3 0 20705 19875 19470 21 5300 0 0 0 23s
Having this high CPU also means I cannot perform an NDU to get to a higher release where the bug is fixed. Once I can get the CPU under control I can perform the NDU.
I hope someone can please provide some guidance.
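In case it helps with troubleshooting, I can also pull a per-processor CPU breakdown from the nodeshell, e.g. at a 5-second interval (sysstat -M, the per-domain variant, may also be available at advanced privilege):
hncl1::*> node run -node hncl1-01 -command sysstat -m 5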
Solved! See The Solution
The problem was indeed caused by:
https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1061496
top output:
1942 root 1 8 - 0K 64K CPU4 4 313.4H 33.40% NwkThd_01
1945 root 1 8 - 0K 64K CPU7 7 313.5H 32.08% NwkThd_04
1944 root 1 8 - 0K 64K WAIT 1 312.9H 31.88% NwkThd_03
1941 root 1 8 - 0K 64K WAIT 7 313.3H 31.64% NwkThd_00
1943 root 1 8 - 0K 64K WAIT 4 313.3H 31.59% NwkThd_02
62704 root 1 -1 0 12552K 3516K spin_r 1 117:32 13.62% du
88252 root 1 -1 0 45320K 37540K spin_r 4 413:44 10.06% du
51860 root 1 -1 0 51464K 30676K spin_r 5 517:23 8.01% du
26842 root 1 -1 0 18696K 8372K spin_r 3 293:33 8.01% du
71060 root 1 -1 0 448M 226M spin_r 3 34.3H 7.76% du
2065 root 1 8 - 0K 32K WAIT 6 99.6H 6.59% gbuf_free_process
2512 root 1 8 - 0K 32K WAIT 6 67.3H 5.57% CsmMpAgentThread
1387 root 32 40 0 305M 29236K uwait 3 207:53 5.47% notifyd
3314 root 1 12 - 0K 64K WAIT 3 65.8H 5.08% wafl_exempt03
3317 root 1 12 - 0K 64K WAIT 3 65.8H 5.03% wafl_exempt06
3313 root 1 12 - 0K 64K WAIT 7 65.8H 5.03% wafl_exempt02
Last night I did a failover and failback under an emergency change. This has addressed the CPU issue temporarily for now. We plan to upgrade to 9.1P10 in the next available change control window.
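For anyone needing to do the same, the takeover/giveback was along these lines (node names are from this thread; check the HA pair is healthy before starting):
hncl1::> storage failover show                        (confirm the partner is healthy)
hncl1::> storage failover takeover -ofnode hncl1-01   (partner takes over the busy node)
hncl1::> storage failover giveback -ofnode hncl1-01   (give the resources back once the node is up)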
Thanks to everyone that replied.
The busy root volume is unlikely to be the main cause of the high CPU.
See private message on what to collect.
While workarounds are possible, ONTAP 9.1P10 is our most recent release of the 9.1 codebase, would be the recommended upgrade from 9.1P1, and would resolve this issue. The Upgrade Advisor functionality of ActiveIQ can walk you through the upgrade process step by step; the upgrade should be non-disruptive (CIFS users may need to reconnect).
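For reference, the automated non-disruptive upgrade is normally driven from the cluster shell roughly as follows (the web server URL and image file name below are placeholders; Upgrade Advisor will generate the exact plan for your cluster):
hncl1::> cluster image package get -url http://webserver.example.com/91P10_image.tgz
hncl1::> cluster image package show-repository         (confirm the package is staged)
hncl1::> cluster image validate -version 9.1P10        (run the pre-update checks)
hncl1::> cluster image update -version 9.1P10          (rolling, non-disruptive update)
hncl1::> cluster image show-update-progress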