ONTAP Discussions

ONTAP 9.1P1 HIGH CPU and Busy aggr0

parkea2
6,887 Views

I am seeing CPU activity of 75%-90% on average, and using Harvest I can see that aggr0 on this node is unusually busy. AutoSupport reports that the bug below is present, which I suspect may be the cause of the high CPU and the high activity on aggr0 / the root volume.

 

https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1061496

 

However, I am not sure how to fix it. It's my understanding that access to the systemshell is limited to support, so how can I get access to the file /var/etc/periodic.conf.local to carry out the workaround?
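
From the docs link below, I believe the general sequence to reach the systemshell is something like the following (hncl1-01 is my problem node; I assume the exact lines to add to /var/etc/periodic.conf.local still have to come from the bug report or support):

hncl1::> set -privilege diag
hncl1::*> security login unlock -username diag
hncl1::*> security login password -username diag
hncl1::*> systemshell -node hncl1-01
hncl1-01% vi /var/etc/periodic.conf.local

But I am wary of doing this on a production cluster without confirmation.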

 

https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-vsmg%2FGUID-A60F4F83-0034-4DB8-838B-06E9D4BEF9A4.html&resultof=%22%73%79%73%7...

 

Problem node:

hncl1::*> node run -node hncl1-01 -command sysstat -d   
 CPU     NFS    CIFS    HTTP     Net   kB/s     HDD   kB/s     SSD   kB/s    Tape   kB/s  Cache
                                  in    out    read  write    read  write    read  write    age
 89%   15055     212       0   57821  20892  666163     21    2496      0       0      0    17s
 92%   15433     232       0  128528  17175  715690   1153    2505      0       0      0    17s
 86%   15850     520       0   49546  30543  546324     16    2647      0       0      0    28s
 82%   16405     254       0   21778  38242  462387      5    1889      0       0      0    14s
 89%   15776     232       0   82837  14417  700850     16    2069      0       0      0    14s
 89%   15344     234       0   45265  12588  661101 153785    6578  10992       0      0    14s
 87%   13749     533       0   20167  13852  650736 132215    1648   5481       0      0    14s
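
(Besides Harvest, the per-aggregate load can be cross-checked from the cluster shell with the statistics commands; I believe something like this works on 9.1:

hncl1::*> statistics aggregate show -aggregate aggr0
hncl1::*> statistics show-periodic -node hncl1-01

though Harvest already makes the picture clear.)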

 

Good node as an example:

 

hncl1::*> node run -node hncl1-02 -command sysstat -d
 CPU     NFS    CIFS    HTTP     Net   kB/s     HDD   kB/s     SSD   kB/s    Tape   kB/s  Cache
                                  in    out    read  write    read  write    read  write    age
 35%    2540       3       0   13430  14434  130278     16    3182      0       0      0    22s
 37%    3007       1       0   15732  18484   43973      0    4699      0       0      0    24s
 32%    3123       9       0   13962  14882   16851     21    4424      0       0      0    24s
 31%    3213       2       0   13745  14351   12757      0    4600      0       0      0    24s
 31%    2933       6       0   13607  14751   15683     16    4115      0       0      0    19s
 32%    2839       1       0   27327  15809   27155      5    3539      0       0      0    19s
 39%    3673       3       0   19517  17364   39400  65240   10589  14541       0      0    23s
 35%    3361       5       0   16045  19323   17016   6499    4811    909       0      0    23s
 37%    2846       3       0   20705  19875   19470     21    5300      0       0      0    23s

 

Having the high CPU also means I cannot perform an NDU (non-disruptive upgrade) to get to a higher release where the bug is fixed. Once I get the CPU under control, I can perform the NDU.

 

I hope someone can provide some guidance.

 

1 ACCEPTED SOLUTION

parkea2
6,605 Views

The problem was indeed caused by:

 

https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1061496

 

top output:

 

 1942 root             1   8    -     0K    64K CPU4    4 313.4H 33.40% NwkThd_01
 1945 root             1   8    -     0K    64K CPU7    7 313.5H 32.08% NwkThd_04
 1944 root             1   8    -     0K    64K WAIT    1 312.9H 31.88% NwkThd_03
 1941 root             1   8    -     0K    64K WAIT    7 313.3H 31.64% NwkThd_00
 1943 root             1   8    -     0K    64K WAIT    4 313.3H 31.59% NwkThd_02
62704 root             1  -1    0 12552K  3516K spin_r  1 117:32 13.62% du
88252 root             1  -1    0 45320K 37540K spin_r  4 413:44 10.06% du
51860 root             1  -1    0 51464K 30676K spin_r  5 517:23  8.01% du
26842 root             1  -1    0 18696K  8372K spin_r  3 293:33  8.01% du
71060 root             1  -1    0   448M   226M spin_r  3  34.3H  7.76% du
 2065 root             1   8    -     0K    32K WAIT    6  99.6H  6.59% gbuf_free_process
 2512 root             1   8    -     0K    32K WAIT    6  67.3H  5.57% CsmMpAgentThread
 1387 root            32  40    0   305M 29236K uwait   3 207:53  5.47% notifyd
 3314 root             1  12    -     0K    64K WAIT    3  65.8H  5.08% wafl_exempt03
 3317 root             1  12    -     0K    64K WAIT    3  65.8H  5.03% wafl_exempt06
 3313 root             1  12    -     0K    64K WAIT    7  65.8H  5.03% wafl_exempt02
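
The long-running du processes sitting in the spin_r state were the giveaway. In principle (and only under support guidance) they can be listed and killed from the systemshell with standard BSD commands, something like:

hncl1-01% ps auxww | grep ' du'
hncl1-01% kill 62704

but I did not go down that route.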

 

Last night I did a failover and failback under an emergency change. This has addressed the CPU issue temporarily for now. We plan to upgrade to 9.1P10 in the next available change control window.
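
For reference, the failover/failback was just the standard storage failover sequence, which (as I understand it) clears the stuck processes because the node reboots during takeover:

hncl1::> storage failover takeover -ofnode hncl1-01
hncl1::> storage failover giveback -ofnode hncl1-01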

 

Thanks to everyone that replied.

 


4 REPLIES

aborzenkov
6,859 Views
Contact support to verify that the high CPU is indeed caused by this bug; they should also be able to guide you through the workaround.

kahuna
6,798 Views

The busy root volume is unlikely to be the main cause of the high CPU.

 

See the private message on what to collect.

 

AlexDawson
6,659 Views

While workarounds are possible, ONTAP 9.1P10 is our most recent release of the 9.1 codebase; it is the recommended upgrade from 9.1P1 and would resolve this issue. The Upgrade Advisor functionality of Active IQ can walk you through the upgrade process step by step, and the upgrade should be non-disruptive (CIFS users may need to reconnect).
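
The automated NDU itself is driven by the cluster image commands; as a rough outline (the URL is a placeholder for wherever you stage the 9.1P10 package):

::> cluster image package get -url http://<web-server>/<9.1P10-image>.tgz
::> cluster image validate -version 9.1P10
::> cluster image update -version 9.1P10
::> cluster image show-update-progress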
