I am running sysstat -m and I am seeing one processor that always seems to be running at about 85% to 95%. I would love to identify the cause.
What are people using to identify the process that would be causing that?
One of the NetApp gurus told me that some processes are not as well threaded as others which causes the asymmetric CPU utilization. I believe it is possible to really see what is going on by doing a "priv set advanced" and then using the ps command but I haven't really tested that. I've also anecdotally been told that multi-threading is better on 7.3.2 and later.
Use this with care
priv set diag
sysstat -M 1
Now you can see what's going on on your filer. You will see several items (called domains). Probably the "Kahuna" domain is hogging your CPU ...
The Kahuna domain contains "all the rest" that is not mentioned separately (WAFL tasks, SnapMirror, deduplication and other system tasks are part of the Kahuna domain; we still have ONTAP 7.2.x).
As of ONTAP 7.3, some tasks are separated from the Kahuna domain. I don't know which tasks.
See for yourself and use with care.
If you don't understand what's going on on your filer, open a support case and they will ask for a perfstat. NetApp support can then see what's going on.
If you have Operations Manager setup, you can also use Performance Advisor inside the NetApp Management Console (NMC) to see something of a graphical breakdown of what sysstat gives you.
Hi Russ and welcome to Communities!
What ONTAP version are you running? As already said in this thread, 7.3.2 and later handle multithreading better than previous versions, so if you are on anything prior to 7.3.2, an ONTAP upgrade may solve the issue.
We are running 7.3.2. It always seems to be the 4th proc that is the highest.
The ps cmd doesn't show anything consuming significant amounts of CPU except the idle_thread* process.
Boeckx's response does show Kahuna consuming 50-60% CPU, which corresponds to CPU3.
Did you run
priv set diag
sysstat -m 1
Can you post some of it here?
You have to use a capital "M" like "sysstat -M 1"
A "sysstat -m 1" will show you:
ANY AVG CPU0 CPU1 CPU2 CPU3
A "sysstat -M 1" will show you: (for ONTAP versions prior to 7.3.2)
ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Storage Raid Target Kahuna WAFL_Ex(Kahu) Cifs Exempt Intr Host Ops/s CP
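If you capture the "sysstat -M 1" output to a file, a small script can pick out the busiest domain and the busiest CPU for you. This is only a rough sketch: the header is the one quoted above, but the sample row is made up for illustration, and I am assuming every column is a single token with no spaces in it (real output may be laid out differently, so check against your own capture).

```python
# Header as quoted in this thread (ONTAP prior to 7.3.2).
HEADER = ("ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Storage "
          "Raid Target Kahuna WAFL_Ex(Kahu) Cifs Exempt Intr Host Ops/s CP")

# Hypothetical sample row, NOT real filer output.
SAMPLE_ROW = ("96% 60% 30% 10% 49% 35% 30% 40% 92% 5% 8% 6% 0% 58% 20% "
              "3% 12% 2% 1% 4500 0%")

CPUS = ["CPU0", "CPU1", "CPU2", "CPU3"]
DOMAINS = ["Network", "Storage", "Raid", "Target", "Kahuna",
           "WAFL_Ex(Kahu)", "Cifs", "Exempt", "Intr", "Host"]


def parse_row(header, row):
    """Pair each header column with the value at the same position."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("row does not match header: %d vs %d columns"
                         % (len(values), len(names)))
    return dict(zip(names, values))


def percent(text):
    """Turn a string such as '58%' into the integer 58."""
    return int(text.rstrip("%"))


def busiest(parsed, columns):
    """Return the column name with the highest percentage."""
    return max(columns, key=lambda name: percent(parsed[name]))


parsed = parse_row(HEADER, SAMPLE_ROW)
print("busiest domain:", busiest(parsed, DOMAINS))
print("busiest CPU:", busiest(parsed, CPUS))
```

With the made-up row above this reports Kahuna and CPU3, which is the pattern described earlier in the thread.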
I am aware of the difference between capital and small m. I personally prefer m to M, that's all really. For us to help further we'd need to know
more about the controller type/model for starters, and also what type of workload is being served off this controller. If ESX is served I'd be looking
for misaligned file systems straight away; that'll cause Kahuna to work extra. It seems virtually everybody suffers from this 🙂
These are the commands I use to troubleshoot performance issues:
always start with "priv set diag"
sysstat -M -i 5
--> already explained
sysstat -x 5
--> shows you the different I/O ("Disk util" is an important one and also "CPty"; see http://now.netapp.com/NOW/knowledge/docs/ontap/rel707/html/ontap/cmdref/man1/na_sysstat.1.html)
lun stats -i 5
--> shows you the reads / writes / latencies of LUNs
stats show lun
--> shows you detailed info of every lun (you will want to capture this in an output file)
stats show volume
--> same as lun but now for the volumes (you will want to capture this in an output file)
--> shows if any reallocation jobs are running (wafl scan status shows you even more info)
If this is not enough you can get some info with statit
a "statit -b" will start the data collection (wait a few minutes)
a "statit -e" will stop the collection and will give you the result. (you will want to capture this in an output file)
If you want to capture the output in a file, connect via PuTTY to the filer. In PuTTY, you can specify a capture file.
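Once everything sits in one PuTTY capture file, it can help to split it back up per command. Here is a minimal sketch of that. It assumes the prompt looks like "filer1*> " (filer name, an optional asterisk, then "> "); the filer name and the sample log are hypothetical, so adjust the pattern to whatever your own capture shows.

```python
import re

# Assumed prompt shape: "<filername>*> <command>" (asterisk optional).
PROMPT = re.compile(r"^[\w.-]+\*?> (?P<cmd>.+)$")


def split_capture(text):
    """Group the lines of a console capture under the command that produced them."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = PROMPT.match(line)
        if match:
            current = match.group("cmd").strip()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


# Hypothetical capture, only to show the shape of the result.
sample_log = (
    "filer1*> sysstat -x 5\n"
    "output line a\n"
    "output line b\n"
    "filer1*> lun stats -i 5\n"
    "output line c\n"
)

sections = split_capture(sample_log)
for command, lines in sections.items():
    print(command, "->", len(lines), "lines")
```

To use it on a real capture, read the PuTTY log file into a string and pass it to split_capture.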
After setting "priv set diag", once you're done is there a command you need to use to take it out of the "priv set diag" mode?
Type "priv set" to go back to admin mode (= normal mode).
Also when you exit the console session, the priv set diag is reset to normal.
You can see the difference between "diag" and "admin" privilege by the "*" (asterisk) that is there after your filer's name in "diag" privilege mode.
Can you shed some light on why a controller will show constant 99% CPU util with a >sysstat , and then an average of 50-60% with a > sysstat -m (all individual CPUs posting no higher than 70%) ?
There is some great info in this post!!
Basic sysstat shows the peak of the highest single core over the sample period. sysstat -m shows you the average load per core.
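To make the peak-versus-average difference concrete, here is a small worked example. The numbers are made up, and the code only illustrates the explanation given here; it is not how ONTAP actually computes its counters.

```python
# Hypothetical CPU busy percentages, one row per moment inside a single
# reporting interval, one column per core (CPU0..CPU3). Not real data.
samples = [
    [99, 40, 30, 50],
    [40, 99, 35, 45],
    [45, 30, 99, 40],
    [50, 45, 40, 99],
]


def per_core_average(rows):
    """Average each core's load over the interval (the 'sysstat -m' view)."""
    cores = len(rows[0])
    return [sum(row[c] for row in rows) / len(rows) for c in range(cores)]


def peak_single_core(rows):
    """Highest load seen on any one core at any moment (the basic 'sysstat' view)."""
    return max(max(row) for row in rows)


averages = per_core_average(samples)
overall = sum(averages) / len(averages)
peak = peak_single_core(samples)

print("per-core average:", averages)  # [58.5, 53.5, 51.0, 58.5]
print("overall average:", overall)    # 55.375
print("peak single core:", peak)      # 99
```

So a different core can be pegged at each moment, giving a constant 99% peak, while no individual core averages above 70% and the overall average sits in the 50-60% range.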