ONTAP Discussions

CPU getting killed on v3240 w/ONTAP 8.0.1 7-Mode

ogdenclinic

We recently installed a new v3240.  I have created all of four 2.5 TB LUNs connected to two physical Win 2008 R2 hosts.  There is hardly any I/O going on, as you can see from the output of sysstat -x below.  Dedupe is not configured for any of the volumes, snapmirror is not running, and compression is turned off.  We only use FCP, no NFS or iSCSI.  WTF is killing my CPU here?  The chassis fans are blowing at full bore.  Thanks.

sysstat -x -c 10 1 output:

CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s
                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out
100%      0      0      0      75       1      1      32     24       0      0     1    100%    0%  -     3%      24     51      0     201    344       0      0
100%      0      0      0      38       1      1       0      8       0      0     1    100%    0%  -     1%       0     38      0     259     50       0      0
100%      0      0      0      41       1      1       0      0       0      0     1      -     0%  -     0%       0     41      0     128      1       0      0
100%      0      0      0      28       1      1      24     24       0      0     1    100%    0%  -     4%       0     28      0    1793     33       0      0
100%      0      0      0      31       1      1      48    296       0      0     1    100%   12%  Tf   12%       0     31      0     163     83       0      0
100%      0      0      0     115       1      1     276    252       0      0     1    100%   15%  :    10%       0    115      0     624    452       0      0
100%      0      0      0      46       1      0      24     28       0      0     1    100%    0%  -    13%       5     41      0     237     66       0      0
100%      0      0      0       6       1      0       0      4       0      0     1    100%    0%  -     2%       0      6      0      17      0       0      0
100%      0      0      0      93       1      1       0      0       0      0     1      -     0%  -     0%       0     93      0    1460     27       0      0
100%      0      0      0      13       1      0      24     24       0      0     1    100%    0%  -     4%       0     13      0      36     17       0      0

sysstat -m -c 10 1 output:

ANY  AVG  CPU0 CPU1 CPU2 CPU3
10%  60%   85%  74%  76%   5%
  9%  60%   85%  74%  76%   5%
14%  61%   85%  75%  77%   9%
12%  61%   85%  74%  77%   7%
10%  60%   85%  74%  76%   6%
15%  61%   85%  74%  76%   9%
11%  60%   85%  74%  77%   6%
12%  60%   85%  74%  77%   6%
15%  61%   85%  74%  77%   8%
14%  61%   85%  74%  77%   7%
25 REPLIES

alexander_fassbender

Hello everybody,

We have the same problem on four or five NetApp systems, ranging from the 2040 to the 3270.

The systems are all SnapVault / SnapMirror targets, and the CPU load when a system is "idle" (doing nothing) sits at 100%.

The console is also slow when we run commands such as snapvault status.

This happens only with ONTAP 8.1. We didn't use or test any 8.0.1 releases, only 8.0.2, and with that release we had no problems.

What I could see is that when there is "normal" load on the system, e.g. snapmirror updates or snapvault transfers, the CPU load drops to a normal level and the console gets faster. Once the transfers finish, the CPU load goes back to 100%.

Does anyone have news about this?

Hi Alexander,

Right now BUG 568758 does not have a public report. Please open a case with NetApp Support so that we may investigate your issue in detail.

Regards,

Christine Carcallas

I really recommend opening a support case and referring to our Case #: 2003000599 or the official BUG 568758. NetApp believes that only a few customers worldwide have this problem. But when I read your story...

colin_graham

We had a similar thing last year with our 6210 (801p2). There was minimal load on the filer, yet the CPU showed as pinned at 99%. Due to the silly way the filers show their CPU stats, it was only 99% on one core out of 8. Performance was unaffected.

It turned out to be related to something in the kernel "beneath" ONTAP that was taking up cpu time.

After we dropped into the nodeshell, the ps -auxw output showed that a process, "/usr/bin/env_mgr -l", was using 99% CPU time.

We did get a bug ID from it (515581), but it was one of those that disappeared when the filer was rebooted.
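The gap between a pinned headline figure and the real load can be illustrated with a small sketch. It assumes, as described above, that the headline CPU number tracks the busiest core rather than the average. The 8-core sample is hypothetical; the 4-core sample is the first row of the sysstat -m output in the original post. The function name is made up for illustration.

```python
from statistics import mean


def busiest_and_mean(per_core_pct):
    """Return (busiest core %, mean across all cores %) for one sample."""
    if not per_core_pct:
        raise ValueError("need at least one per-core value")
    return max(per_core_pct), mean(per_core_pct)


# Hypothetical 8-core sample: one core pinned, the rest nearly idle.
hypothetical_8_core = [99, 3, 2, 4, 1, 2, 3, 2]

# CPU0..CPU3 from the first sysstat -m row in the original post.
posted_4_core = [85, 74, 76, 5]

for label, sample in (("hypothetical 8-core", hypothetical_8_core),
                      ("posted 4-core", posted_4_core)):
    busiest, average = busiest_and_mean(sample)
    print(f"{label}: busiest core {busiest}%, mean {average:.1f}%")
```

For the posted 4-core row the mean works out to 60%, which matches the AVG column in that output, so a "100%" headline does not by itself mean every core is saturated.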

tom_bailey

Did you get a resolution to this issue? Interested to know if it was a bug or just a quirk of DOT.

c_beseler

After 3 weeks we've got a solution for the problem. It is definitely a bug, but NetApp Engineering told us that it is a rare case. The bug is also in the ONTAP 8.1 GA release and appears mostly with the use of SnapVault. NetApp has only 3 customers worldwide with the problem. If you have the same trouble, open a case with NetApp. Regards

smorganarrisi

Hey, we have the same thing on our new 2240-2.  Initially I was told by support that the 100% CPU was normal (strange I know).  I actually referred the tech to this article and said we are hitting BURT 590193.  Curious if that is the same as you guys.

Our filer is nothing but a snapmirror target at this time, destination for only about 15 relationships.

c_beseler

I recommend you refer NetApp Support to our Case #: 2003000599. We hit BUG 568758 (but no OSSV or SnapVault is used in our environment). If your case is currently being handled by first-level support, you should insist on escalation to second-level support. We were stuck at first-level support for three weeks and nothing happened. That was really frustrating.

dgwilhelm

Are you running OnCommand? There are some reports in there that may help point to the culprit.

Hi DGWILHELM,

May I know how to use the OnCommand reports to check a performance case?

Thanks in advance,

Satish.

dimitrik

Just looking at CPU is not a reliable indicator of performance. We will use CPUs opportunistically for all kinds of things.

You have to tie CPU to I/O. Your examples show very little I/O. Almost none.

This reminds me of another thread where someone thought their 6280 was slow because of the CPU.

They were able to add 3x the workload before noticing an increase in latency - all the time, CPU was high...

Add some real load to the system and see how it works please.
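The point about tying CPU to I/O can be sketched with a few of the sysstat -x rows from the original post. This is only a rough illustration: the column positions are an assumption based on splitting those quoted rows on whitespace (23 fields each), and the field names in FIELDS are made up for this example.

```python
# Column positions assume the whitespace-split layout of the sysstat -x
# rows quoted in the original post (23 fields per row).
FIELDS = {
    "cpu": 0,
    "total_ops": 4,
    "disk_read_kb": 7,
    "disk_write_kb": 8,
    "fcp_ops": 17,
    "fcp_in_kb": 19,
    "fcp_out_kb": 20,
}

# Three rows copied from the original post.
SAMPLE_ROWS = [
    "100%      0      0      0      75       1      1      32     24       0      0     1    100%    0%  -     3%      24     51      0     201    344       0      0",
    "100%      0      0      0     115       1      1     276    252       0      0     1    100%   15%  :    10%       0    115      0     624    452       0      0",
    "100%      0      0      0       6       1      0       0      4       0      0     1    100%    0%  -     2%       0      6      0      17      0       0      0",
]


def parse_row(line):
    """Pick the CPU and I/O fields out of one sysstat -x data row."""
    tokens = line.split()
    if len(tokens) != 23:
        raise ValueError(f"expected 23 fields, got {len(tokens)}")
    return {name: int(tokens[idx].rstrip("%")) for name, idx in FIELDS.items()}


def summarize(rows):
    """Average CPU, total ops/s and FCP kB/s (in + out) over the rows."""
    parsed = [parse_row(r) for r in rows]
    n = len(parsed)
    return {
        "avg_cpu_pct": sum(p["cpu"] for p in parsed) / n,
        "avg_total_ops": sum(p["total_ops"] for p in parsed) / n,
        "avg_fcp_kb": sum(p["fcp_in_kb"] + p["fcp_out_kb"] for p in parsed) / n,
    }


summary = summarize(SAMPLE_ROWS)
print(f"CPU {summary['avg_cpu_pct']:.0f}%, "
      f"{summary['avg_total_ops']:.1f} ops/s, "
      f"{summary['avg_fcp_kb']:.0f} kB/s over FCP")
```

Seen side by side, the CPU column stays at 100% while the I/O columns stay tiny, which is the mismatch being pointed out here.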

c_beseler

I understand your thought. But let me provide an example: each controller of a FAS2040 has a dual-core CPU. On our second controller, one core is constantly at 100% load. The other core seems to handle all requests to the NetApp. By the way, during this time Fabric Manager continuously sends "CPU too Busy" messages. Still, the FAS2040 responds to all requests at "normal" speed. So your thought could be correct.

But if I now start a dedupe job on just one volume on this controller, the remaining core jumps to 100% load and the NetApp stops responding in an acceptable time. Normally you are able to run a dedupe job on a controller: one core works on the job and the other does the rest. This is not acceptable. Our storage system has no heavy workload, and at night it has nearly nothing to do. But as long as one core is constantly at 100% load, I cannot run a dedupe job. And that's an essential feature for us.

By the way: in the "ps" output I could see a process called "wafl_lopri" running at 100%.

Best Regards,

Claudius

dimitrik

Thanks for adding some info, Claudius.

In that case please open a ticket with support - they need to figure out what's happening. What  OS rev?

Regarding dedupe: The idea is that you run it when the system is not busy doing other things. So, I'd rather you run it before you start snapmirror sessions, or do any serious I/O, and, for efficiency reasons, before doing snaps.

However, even dedupe should back off if you want to make a request from the system.

Maybe you can try wafltop...

priv set -q diag

options stats.wafltop.config volume,process,message

wafltop start

(wait 5 min)

wafltop stop

options stats.wafltop.config off

You'll get an idea of what's happening in the system, and at the very least, if you can't understand what it's saying, the support personnel will get a head start.

Also be prepared to give support a perfstat pls.

D

c_beseler

Ok.. Support Case is open. I'll report any new progress.

Regards,

Claudius

c_beseler

NetApp Technical Support Engineers are analyzing the data now. They said "It seems a newly discovered bug might apply here."


exciting 😉

dimitrik

Doesn't it feel great to be the first? Send me the support case # pls. Dimitri@netapp.com

Let's not forget...

The early bird gets the worm, but the second mouse gets the cheese

c_beseler

I have the same issue on ONTAP 8.1RC3. I will try a cluster failover tonight. I hope the restart will fix the problem.

c_beseler

It didn't help. Same behavior after the restart: one CPU is constantly at 100%. No SIS, no compression, nothing...

dean

We're seeing this same behavior on 8.0.2. Have you found any more info on this?
