CPU getting killed on v3240 w/ONTAP 8.0.1 7-Mode

ogdenclinic · ‎2011-04-19

We recently installed a new v3240. I have created a total of four 2.5 TB LUNs connected to two physical Win 2008 R2 hosts. There is hardly any I/O going on, as you can see from the output of sysstat -x below. Dedupe is not configured for any of the volumes, SnapMirror is not running, and compression is turned off. We only use FCP, with no NFS or iSCSI. WTF is killing my CPU here? The chassis fans are blowing at full bore. Thanks.

sysstat -x -c 10 1 output:

CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s
                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out
100%      0      0      0      75       1      1      32     24       0      0     1    100%    0%  -     3%      24     51      0     201    344       0      0
100%      0      0      0      38       1      1       0      8       0      0     1    100%    0%  -     1%       0     38      0     259     50       0      0
100%      0      0      0      41       1      1       0      0       0      0     1      -     0%  -     0%       0     41      0     128      1       0      0
100%      0      0      0      28       1      1      24     24       0      0     1    100%    0%  -     4%       0     28      0    1793     33       0      0
100%      0      0      0      31       1      1      48    296       0      0     1    100%   12%  Tf   12%       0     31      0     163     83       0      0
100%      0      0      0     115       1      1     276    252       0      0     1    100%   15%  :    10%       0    115      0     624    452       0      0
100%      0      0      0      46       1      0      24     28       0      0     1    100%    0%  -    13%       5     41      0     237     66       0      0
100%      0      0      0       6       1      0       0      4       0      0     1    100%    0%  -     2%       0      6      0      17      0       0      0
100%      0      0      0      93       1      1       0      0       0      0     1      -     0%  -     0%       0     93      0    1460     27       0      0
100%      0      0      0      13       1      0      24     24       0      0     1    100%    0%  -     4%       0     13      0      36     17       0      0

sysstat -m -c 10 1 output:

ANY  AVG  CPU0 CPU1 CPU2 CPU3
 10%  60%   85%  74%  76%   5%
  9%  60%   85%  74%  76%   5%
 14%  61%   85%  75%  77%   9%
 12%  61%   85%  74%  77%   7%
 10%  60%   85%  74%  76%   6%
 15%  61%   85%  74%  76%   9%
 11%  60%   85%  74%  77%   6%
 12%  60%   85%  74%  77%   6%
 15%  61%   85%  74%  77%   8%
 14%  61%   85%  74%  77%   7%
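For anyone who wants to read per-CPU figures like the ones above at a glance, here is a minimal Python sketch that averages each column across samples and picks the busiest core. It assumes the output has been pasted into a string in exactly the layout shown (one header row, then percentage rows); the function name `per_core_averages` is made up for illustration and is not part of any NetApp tool.

```python
def per_core_averages(text):
    """Average each column of `sysstat -m` style output across samples."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, samples = lines[0], lines[1:]
    totals = [0.0] * len(header)
    for sample in samples:
        for i, cell in enumerate(sample):
            totals[i] += float(cell.rstrip("%"))  # "85%" -> 85.0
    return {name: totals[i] / len(samples) for i, name in enumerate(header)}


# First three samples copied from the output above.
SAMPLE = """
ANY  AVG  CPU0 CPU1 CPU2 CPU3
 10%  60%   85%  74%  76%   5%
  9%  60%   85%  74%  76%   5%
 14%  61%   85%  75%  77%   9%
"""

averages = per_core_averages(SAMPLE)
# Only compare the per-core columns, not the ANY/AVG summary columns.
busiest = max((k for k in averages if k.startswith("CPU")), key=averages.get)
print(busiest, round(averages[busiest], 1))
```

This only summarizes the numbers; it says nothing about which process is consuming the core.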

vmsjaak13 · ‎2011-04-19

Try:

priv set advanced

then run:

ps

This might point you to the culprit.

If this continues, I'd create a call with NetApp support.

Regards,

Niek

ogdenclinic · ‎2011-04-20

I ran ps and am now more confused than ever. If I'm reading the output correctly--which is in the attached text file--it shows that idle threads are using all the CPU time. I know idle threads should be using the CPU when the CPUs are idle, but they most certainly are not idle right now. I'm going to open a support ticket regardless.

sysstat -c 5 -x 1

CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s
                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out
 99%      0      0      0      60       0      1       8     32       0      0    26    100%    0%  -     5%       0     60      0     250    181       0      0
100%      0      0      0      79       1      0       0      0       0      0    26    100%    0%  -     0%       0     79      0     294    382       0      0
100%      0      0      0      31       0      1      16      0       0      0    26    100%    0%  -     0%       0     31      0      34    156       0      0
100%      0      0      0      64       9     11       8     24       0      0    26    100%    0%  -     3%       3     61      0     115    255       0      0
100%      0      0      0     108       0      1       8      0       0      0    26    100%    0%  -     0%       0    108      0     181    648       0      0

thomas_glodde · ‎2011-04-19

https://now.netapp.com/NOW/download/software/ontap/8.0.1P3/

There are several CPU related bugs fixed, you might want to give P3 a shot.

ogdenclinic · ‎2011-04-20

We're indeed running P3. I should have been more clear in my original post.

shaunjurr · ‎2011-04-24

Hi,

If you have a chance to generate an NMI-panic via the "RLM" (can't remember what the new name on the 32xx series is at the moment) then you can get a coredump to send along with your case and then this should get cleared up in a much more concrete way by engineering.

It will "crash" the filer "on purpose" to get a stateful coredump of what is going on. If you have a cluster, it will failover and you could, in theory, do this in a lightly loaded production environment if the host timeouts are set correctly. YMMV (your mileage may vary 😉 )

Good Luck.

dean · ‎2011-11-22

We're seeing this same behavior on 8.0.2. Have you found any more info on this?

c_beseler · ‎2012-03-21

I have the same issue on ONTAP 8.1RC3. I will try a cluster failover tonight. I hope the restart will fix the problem.

c_beseler · ‎2012-03-22

It doesn't help. Same behavior after the restart. One CPU is constantly at 100%. No SIS, no compression, nothing...

dimitrik · ‎2012-04-05

Just looking at CPU is not a reliable indicator of performance. We will use CPUs opportunistically for all kinds of things.

You have to tie CPU to I/O. Your examples show very little I/O. Almost none.

This reminds me of another thread where someone thought their 6280 was slow because of the CPU.

They were able to add 3x the workload before noticing an increase in latency - all the time, CPU was high...

Add some real load to the system and see how it works please.
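To make the "tie CPU to I/O" point concrete, here is a rough Python sketch that parses pasted `sysstat -x` rows and puts average CPU next to average ops and FCP throughput. The column names in `COLUMNS` are labels I made up to match the 23 columns in the output earlier in this thread, and the parser assumes every data row splits into exactly that many whitespace-separated cells; treat it as an illustration, not a supported tool.

```python
# Hypothetical labels for the 23 columns of the `sysstat -x` output above.
COLUMNS = [
    "cpu", "nfs", "cifs", "http", "total", "net_in", "net_out",
    "disk_read", "disk_write", "tape_read", "tape_write",
    "cache_age", "cache_hit", "cp_time", "cp_type", "disk_util",
    "other", "fcp", "iscsi", "fcp_in", "fcp_out", "iscsi_in", "iscsi_out",
]


def parse_rows(text):
    """Return one dict per data row; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        cells = line.split()
        # Data rows start with a CPU percentage and have one cell per column.
        if len(cells) != len(COLUMNS) or not cells[0].endswith("%"):
            continue
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


def cpu_vs_io(rows):
    """Average CPU %, total ops/s, and FCP kB/s (in + out) over the rows."""
    n = len(rows)
    return {
        "avg_cpu_pct": sum(int(r["cpu"].rstrip("%")) for r in rows) / n,
        "avg_ops": sum(int(r["total"]) for r in rows) / n,
        "avg_fcp_kb_s": sum(int(r["fcp_in"]) + int(r["fcp_out"]) for r in rows) / n,
    }


# The five samples posted on 2011-04-20, whitespace collapsed.
SAMPLE = """
 99% 0 0 0  60 0  1  8 32 0 0 26 100% 0% - 5% 0  60 0 250 181 0 0
100% 0 0 0  79 1  0  0  0 0 0 26 100% 0% - 0% 0  79 0 294 382 0 0
100% 0 0 0  31 0  1 16  0 0 0 26 100% 0% - 0% 0  31 0  34 156 0 0
100% 0 0 0  64 9 11  8 24 0 0 26 100% 0% - 3% 3  61 0 115 255 0 0
100% 0 0 0 108 0  1  8  0 0 0 26 100% 0% - 0% 0 108 0 181 648 0 0
"""

summary = cpu_vs_io(parse_rows(SAMPLE))
print(summary)
```

For those samples the CPU column averages just under 100% while the system is doing well under a hundred ops per second and roughly half a megabyte per second of FCP traffic, which is the mismatch being discussed here.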

c_beseler · ‎2012-04-10

I understand your point. But let me give an example: each controller of a FAS2040 has a dual-core CPU. On our second controller, one core is constantly at 100% load. The other core seems to handle all of the NetApp's requests. By the way, during this time Fabric Manager sends continuous "CPU too Busy" messages. Still, the FAS2040 responds to all requests at a "normal" speed, so your point could be correct. But if I now start a dedupe job on just one volume on this controller, the remaining core jumps to 100% load and the NetApp stops responding within an acceptable time. Normally you can run a dedupe job on a controller with one core working on the job and the other handling everything else. This is not acceptable. Our storage system has no heavy workload, and at night it has almost nothing to do. But as long as one core sits at a constant 100% load, I cannot run a dedupe job, and that's an essential feature for us.

By the way: under the "ps" command I can see a process called "wafl_lopri" running at 100%.

Best Regards,

Claudius

dimitrik · ‎2012-04-10

Thanks for adding some info, Claudius.

In that case please open a ticket with support - they need to figure out what's happening. What OS rev?

Regarding dedupe: The idea is that you run it when the system is not busy doing other things. So, I'd rather you run it before you start snapmirror sessions, or do any serious I/O, and, for efficiency reasons, before doing snaps.

However, even dedupe should back off if you want to make a request from the system.

Maybe you can try wafltop...

priv set -q diag

options stats.wafltop.config volume,process,message

wafltop start

(wait 5 min)

wafltop stop

options stats.wafltop.config off

You'll get an idea of what's happening in the system, and at the very least, if you can't understand what it's saying, the support personnel will get a head start.

Also be prepared to give support a perfstat pls.

D

c_beseler · ‎2012-04-11

OK, the support case is open. I'll report any progress.

Regards,

Claudius

c_beseler · ‎2012-04-12

NetApp Technical Support engineers are analyzing the data now. They said, "It seems a newly discovered bug might apply here."

exciting 😉

dimitrik · ‎2012-04-12

Doesn't it feel great to be the first? Send me the support case # pls.

James_Littlefield · ‎2012-05-10

Let's not forget...

The early bird gets the worm, but the second mouse gets the cheese

dgwilhelm · ‎2012-04-12

Are you running OnCommand? There are some reports in there that may help point to the culprit.

SATISH_PAMARTHI1983 · ‎2014-11-25

Hi DGWILHELM,

May I know how to check the OnCommand reports for a performance case?

Thanks in advance,

Satish.

tom_bailey · ‎2012-05-11

Did you get a resolution to this issue? Interested to know if it was a bug or just a quirk of DOT.

c_beseler · ‎2012-05-11

After 3 weeks we've got a solution for the problem. It is definitely a bug, but NetApp Engineering told us it is a rare case. The bug is also in the ONTAP 8.1 GA release and appears mostly with the use of SnapVault. NetApp has only 3 customers worldwide with the problem. If you have the same trouble, open a case with NetApp. Regards

smorganarrisi · ‎2012-05-11

Hey, we have the same thing on our new 2240-2. Initially I was told by support that the 100% CPU was normal (strange I know). I actually referred the tech to this article and said we are hitting BURT 590193. Curious if that is the same as you guys.

Our filer is nothing but a snapmirror target at this time, destination for only about 15 relationships.