ONTAP Discussions
We recently installed a new v3240. I have created a total of four 2.5 TB LUNs connected to two physical Windows Server 2008 R2 hosts. There is hardly any I/O going on, as you can see from the sysstat -x output below. Dedupe is not configured for any of the volumes, SnapMirror is not running, and compression is turned off. We only use FCP, with no NFS or iSCSI. WTF is killing my CPU here? The chassis fans are blowing at full bore. Thanks.
sysstat -x -c 10 1 output:
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
100% 0 0 0 75 1 1 32 24 0 0 1 100% 0% - 3% 24 51 0 201 344 0 0
100% 0 0 0 38 1 1 0 8 0 0 1 100% 0% - 1% 0 38 0 259 50 0 0
100% 0 0 0 41 1 1 0 0 0 0 1 - 0% - 0% 0 41 0 128 1 0 0
100% 0 0 0 28 1 1 24 24 0 0 1 100% 0% - 4% 0 28 0 1793 33 0 0
100% 0 0 0 31 1 1 48 296 0 0 1 100% 12% Tf 12% 0 31 0 163 83 0 0
100% 0 0 0 115 1 1 276 252 0 0 1 100% 15% : 10% 0 115 0 624 452 0 0
100% 0 0 0 46 1 0 24 28 0 0 1 100% 0% - 13% 5 41 0 237 66 0 0
100% 0 0 0 6 1 0 0 4 0 0 1 100% 0% - 2% 0 6 0 17 0 0 0
100% 0 0 0 93 1 1 0 0 0 0 1 - 0% - 0% 0 93 0 1460 27 0 0
100% 0 0 0 13 1 0 24 24 0 0 1 100% 0% - 4% 0 13 0 36 17 0 0
sysstat -m -c 10 1 output:
ANY AVG CPU0 CPU1 CPU2 CPU3
10% 60% 85% 74% 76% 5%
9% 60% 85% 74% 76% 5%
14% 61% 85% 75% 77% 9%
12% 61% 85% 74% 77% 7%
10% 60% 85% 74% 76% 6%
15% 61% 85% 74% 76% 9%
11% 60% 85% 74% 77% 6%
12% 60% 85% 74% 77% 6%
15% 61% 85% 74% 77% 8%
14% 61% 85% 74% 77% 7%
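For anyone who wants to compare cores without eyeballing the columns, here is a minimal Python sketch that averages each column of the sysstat -m samples and reports the busiest core. It assumes the six-column layout shown above (ANY, AVG, CPU0 to CPU3); the function names and the three sample rows pasted in are only for illustration.

```python
# Sketch only: assumes the sysstat -m layout shown in this thread
# (header line "ANY AVG CPU0 CPU1 CPU2 CPU3", one sample per line).

def parse_sysstat_m(text):
    """Return (column_names, samples), each sample a list of ints."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    samples = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(row)}")
        samples.append([int(field.rstrip("%")) for field in row])
    return header, samples


def column_means(header, samples):
    """Average every column across all samples."""
    count = len(samples)
    return {
        name: sum(sample[i] for sample in samples) / count
        for i, name in enumerate(header)
    }


SAMPLE = """
ANY AVG CPU0 CPU1 CPU2 CPU3
10% 60% 85% 74% 76% 5%
14% 61% 85% 75% 77% 9%
12% 61% 85% 74% 77% 7%
"""

header, samples = parse_sysstat_m(SAMPLE)
means = column_means(header, samples)

# Only look at the per-core columns when picking the busiest core.
cores = {name: value for name, value in means.items() if name.startswith("CPU")}
hottest = max(cores, key=cores.get)
print(f"busiest core: {hottest} at {cores[hottest]:.1f}% on average")
```

On these samples three of the four cores sit at roughly 75-85% while CPU3 is nearly idle, which matches the AVG column of about 60%.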
Try:
priv set advanced
then run:
ps
This might point you to the culprit.
If this continues, I'd create a call with NetApp support.
Regards,
Niek
I ran ps and am now more confused than ever. If I'm reading the output correctly--which is in the attached text file--it shows that idle threads are using all the CPU time. I know idle threads should be using the CPU when the CPUs are idle, but they most certainly are not idle right now. I'm going to open a support ticket regardless.
sysstat -c 5 -x 1
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
99% 0 0 0 60 0 1 8 32 0 0 26 100% 0% - 5% 0 60 0 250 181 0 0
100% 0 0 0 79 1 0 0 0 0 0 26 100% 0% - 0% 0 79 0 294 382 0 0
100% 0 0 0 31 0 1 16 0 0 0 26 100% 0% - 0% 0 31 0 34 156 0 0
100% 0 0 0 64 9 11 8 24 0 0 26 100% 0% - 3% 3 61 0 115 255 0 0
100% 0 0 0 108 0 1 8 0 0 0 26 100% 0% - 0% 0 108 0 181 648 0 0
https://now.netapp.com/NOW/download/software/ontap/8.0.1P3/
There are several CPU related bugs fixed, you might want to give P3 a shot.
We're indeed running P3. I should have been more clear in my original post.
Hi,
If you have a chance to generate an NMI-panic via the "RLM" (can't remember what the new name on the 32xx series is at the moment) then you can get a coredump to send along with your case and then this should get cleared up in a much more concrete way by engineering.
It will "crash" the filer "on purpose" to get a stateful coredump of what is going on. If you have a cluster, it will failover and you could, in theory, do this in a lightly loaded production environment if the host timeouts are set correctly. YMMV (your mileage may vary 😉 )
Good Luck.
We're seeing this same behavior on 8.0.2. Have you found any more info on this?
I have the same issue on ONTAP 8.1RC3. I will try a cluster failover tonight. I hope the restart will fix the problem.
It didn't help. Same behavior after the restart: one CPU is constantly at 100%. No SIS, no compression, nothing...
Just looking at CPU is not a reliable indicator of performance. We will use CPUs opportunistically for all kinds of things.
You have to tie CPU to I/O. Your examples show very little I/O. Almost none.
This reminds me of another thread where someone thought their 6280 was slow because of the CPU.
They were able to add 3x the workload before noticing an increase in latency - all the time, CPU was high...
Add some real load to the system and see how it works please.
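One way to tie CPU to I/O from the sysstat -x samples is sketched below in Python: split each row into named fields and list the samples where CPU is high while total ops and disk utilization are low. The column order is assumed from the header pasted earlier in this thread, and the thresholds (90% CPU, 500 ops, 20% disk util) are arbitrary illustrative values, not NetApp guidance.

```python
# Sketch only: column order assumed from the sysstat -x header in this thread.
COLUMNS = [
    "cpu", "nfs", "cifs", "http", "total_ops",
    "net_in", "net_out", "disk_read", "disk_write",
    "tape_read", "tape_write", "cache_age", "cache_hit",
    "cp_time", "cp_type", "disk_util",
    "other", "fcp", "iscsi",
    "fcp_in", "fcp_out", "iscsi_in", "iscsi_out",
]


def parse_row(line):
    """Split one sysstat -x data row into a dict of raw string fields."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def pct(value):
    """Convert a percentage string such as '100%' to an int."""
    return int(value.rstrip("%"))


def busy_but_idle(rows, cpu_min=90, ops_max=500, disk_util_max=20):
    """Return samples with high CPU but little I/O (thresholds are made up)."""
    return [
        row for row in rows
        if pct(row["cpu"]) >= cpu_min
        and int(row["total_ops"]) <= ops_max
        and pct(row["disk_util"]) <= disk_util_max
    ]


# Three rows copied from the first sysstat -x output in this thread.
SAMPLE = """
100% 0 0 0 75 1 1 32 24 0 0 1 100% 0% - 3% 24 51 0 201 344 0 0
100% 0 0 0 31 1 1 48 296 0 0 1 100% 12% Tf 12% 0 31 0 163 83 0 0
100% 0 0 0 115 1 1 276 252 0 0 1 100% 15% : 10% 0 115 0 624 452 0 0
"""

rows = [parse_row(line) for line in SAMPLE.strip().splitlines()]
suspicious = busy_but_idle(rows)
print(f"{len(suspicious)} of {len(rows)} samples show high CPU with little I/O")
```

If most samples land in that list, the CPU figure by itself says little about client-visible performance, and latency under a real workload is the number to watch.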
I understand your point, but let me give an example. Each controller of a FAS2040 has a dual-core CPU. On our second controller, one core is constantly at 100% load, and the other core seems to handle all of the NetApp's requests. By the way, Fabric Manager is continuously sending "CPU too Busy" messages at the moment. The FAS2040 still responds to all requests at a "normal" speed, so your point could be correct.
But if I now start a dedupe job on just one volume on this controller, the remaining core jumps to 100% load and the NetApp stops responding in an acceptable time. Normally you can run a dedupe job on a controller with one core working on the job while the other handles the rest. This is not acceptable. Our storage system has no heavy workload, and at night it has almost nothing to do. But as long as one core is constantly at 100% load, I cannot run a dedupe job, and that is an essential feature for us.
By the way: in the "ps" output I can see a process called "wafl_lopri" running at 100%.
Best Regards,
Claudius
Thanks for adding some info, Claudius.
In that case please open a ticket with support - they need to figure out what's happening. What OS rev?
Regarding dedupe: The idea is that you run it when the system is not busy doing other things. So, I'd rather you run it before you start snapmirror sessions, or do any serious I/O, and, for efficiency reasons, before doing snaps.
However, even dedupe should back off if you want to make a request from the system.
Maybe you can try wafltop...
priv set -q diag
options stats.wafltop.config volume,process,message
wafltop start
(wait 5 min)
wafltop stop
options stats.wafltop.config off
You'll get an idea of what's happening in the system, and at the very least, if you can't understand what it's saying, the support personnel will get a head start.
Also be prepared to give support a perfstat pls.
D
OK, the support case is open. I'll report any progress.
Regards,
Claudius
NetApp technical support engineers are analyzing the data now. They said "It seems a newly discovered bug might apply here."
exciting 😉
Doesn't it feel great to be the first? Send me the support case # pls. Dimitri@netapp.com
Let's not forget...
The early bird gets the worm, but the second mouse gets the cheese
Are you running OnCommand? There are some reports in there that may help point to the culprit.
Hi DGWILHELM,
May I know how to use the OnCommand reports to check a performance case?
Thanks in advance,
Satish.
Did you get a resolution to this issue? Interested to know if it was a bug or just a quirk of DOT.
After 3 weeks we've got a solution for the problem. It is definitely a bug, but NetApp Engineering told us that it is a rare case. The bug is also in the ONTAP 8.1 GA release and appears mostly when SnapVault is used. NetApp has only 3 customers worldwide with the problem. If you have the same trouble, open a case with NetApp. Regards
Hey, we have the same thing on our new 2240-2. Initially I was told by support that the 100% CPU was normal (strange I know). I actually referred the tech to this article and said we are hitting BURT 590193. Curious if that is the same as you guys.
Our filer is nothing but a snapmirror target at this time, destination for only about 15 relationships.