I am monitoring what is causing high CPU utilization, and I am seeing the "domain_busy:wafl_exempt" counter shoot up as the CPU utilization shoots up. Does anyone know what this wafl_exempt pertains to?
Multiprocessor-safe wafl_exempt threads can run in parallel on multiple processors. These threads handle WAFL tasks such as writes, reads and readdir. AFAIK, the purpose is to increase WAFL performance.
If the Filer acts as a snapmirror destination, I assume that it is busy running the Deswizzler after a snapmirror upgrade.
Has something changed in the snapmirror configuration?
But as you said, a busy CPU is not a problem per se. The Filer runs these background system jobs (like the deswizzler) while there is no user workload to serve.
If user workload is introduced into the system, Data ONTAP will throttle down these system jobs to focus on this protocol/user workload.
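To illustrate why a "busy" CPU is not necessarily a problem, here is a deliberately simplified toy model. It assumes that user/protocol work is always served first and that background scans such as the deswizzler simply absorb whatever capacity is left idle. This is an assumption for illustration only, not how the Data ONTAP scheduler is actually implemented, and split_cpu is a hypothetical helper.

```python
def split_cpu(user_demand_pct: int, capacity_pct: int = 100) -> tuple[int, int]:
    """Toy model: return (user_pct, background_pct) for one sampling interval.

    Assumption (NOT Data ONTAP's real scheduler): user/protocol work is
    always served first, and background scans such as the deswizzler
    soak up whatever capacity is left idle.
    """
    if user_demand_pct < 0:
        raise ValueError("user_demand_pct must be non-negative")
    user_pct = min(user_demand_pct, capacity_pct)  # user work is served first
    background_pct = capacity_pct - user_pct       # scans yield to user work
    return user_pct, background_pct


if __name__ == "__main__":
    for demand in (0, 30, 90):
        user, background = split_cpu(demand)
        print(f"user={user}% background={background}% total={user + background}%")
```

In this model the total reported CPU stays at 100% whether or not there is user workload; only the split changes, which is why a high CPU figure on its own says little about user impact.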
The ONTAP 8.1 upgrade changes how the CPUs are utilised on the Filer, which means the CPU monitoring characteristics changed after the upgrade.
A deswizzling process was taking place after the upgrade (shown by "wafl scan status").
Running "sysstat -m" on the Filer showed that the actual CPU usage on each core was spread more evenly across the cores and was considerably lower than the "CPU ANY" figure shown in the DFM graph.
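As far as I understand it, an "ANY" CPU figure counts the share of time during which at least one core is busy, whereas an average divides the total busy time by the number of cores. The sketch below assumes exactly those two definitions and uses made-up busy/idle samples rather than real counter data; any_and_avg is a hypothetical helper, not a NetApp API.

```python
def any_and_avg(samples: list[list[bool]]) -> tuple[float, float]:
    """samples[t][c] is True if core c was busy during time slice t.

    Returns (any_pct, avg_pct):
      any_pct: % of time slices in which at least one core was busy
      avg_pct: mean per-core utilization across all cores
    """
    if not samples or not samples[0]:
        raise ValueError("need at least one time slice and one core")
    cores = len(samples[0])
    if any(len(t) != cores for t in samples):
        raise ValueError("every time slice must report the same number of cores")
    slices = len(samples)
    any_busy = sum(1 for t in samples if any(t))        # slices with >= 1 busy core
    busy_core_slices = sum(sum(t) for t in samples)     # True counts as 1
    return 100.0 * any_busy / slices, 100.0 * busy_core_slices / (slices * cores)


if __name__ == "__main__":
    # Made-up pattern: 4 cores, and in each time slice exactly one core is busy.
    samples = [[core == t for core in range(4)] for t in range(4)]
    any_pct, avg_pct = any_and_avg(samples)
    print(f"ANY={any_pct:.0f}%  AVG={avg_pct:.0f}%")  # ANY=100%  AVG=25%
```

With work rotating across four cores, ANY reads 100% while each core is only 25% busy, which is consistent with a DFM graph based on ANY sitting well above the per-core numbers from "sysstat -m".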
Nothing has changed in Snapmirror and I have none running at the moment; however, we recently upgraded to 8.1.1. I get alerted about high latency (meaning it has passed the threshold I set) on some LUNs and volumes. The latency shoots up at the same time as the CPU (I believe this is what you mean by CPU_any). But when I check each of the four physical processors, I see that wafl_exempt has gone up.
So does wafl_exempt relate to non-system activity on the machine, which normally shows in the Kahuna domain?
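To compare domains such as Kahuna and wafl_exempt on a given processor, per-domain busy counters can be turned into percentages by taking the difference between two samples and dividing by the elapsed time. The sketch below assumes that domain_busy is a cumulative per-domain busy time and that an elapsed-time counter in the same units is available; the helper name, the sample values and the domain subset are hypothetical and chosen only for illustration.

```python
def domain_busy_pct(before: dict[str, int], after: dict[str, int],
                    elapsed_before: int, elapsed_after: int) -> dict[str, float]:
    """Percentage of the interval each domain was busy on ONE processor.

    before/after: hypothetical snapshots of cumulative per-domain busy time.
    elapsed_before/elapsed_after: hypothetical cumulative elapsed-time
    counter, assumed to be in the same units as the busy counters.
    """
    elapsed = elapsed_after - elapsed_before
    if elapsed <= 0:
        raise ValueError("elapsed counter must increase between samples")
    return {
        domain: 100.0 * (after[domain] - before[domain]) / elapsed
        for domain in after
        if domain in before  # ignore domains missing from either sample
    }


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    t0 = {"idle": 1_000_000, "kahuna": 200_000, "wafl_exempt": 300_000}
    t1 = {"idle": 1_200_000, "kahuna": 300_000, "wafl_exempt": 900_000}
    pct = domain_busy_pct(t0, t1, elapsed_before=0, elapsed_after=1_000_000)
    for domain, value in sorted(pct.items(), key=lambda kv: -kv[1]):
        print(f"{domain:12s} {value:5.1f}%")
    busiest = max((d for d in pct if d != "idle"), key=pct.get)
    print("busiest non-idle domain:", busiest)
```

Working from deltas rather than raw values matters because the counters are cumulative; a single snapshot only tells you the totals since boot.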
AGUMADAVALLI has the correct answer.
You can finish your scrubs, but it will make no difference to your high CPU.
We are experiencing the same issue: the console is almost unusable and a 5-minute perfstat can take 20 minutes to run (this is a little worrying, but apparently nothing to worry about).
Before we upgraded to 8.1 and subsequently to 8.1.1, we had the same snapmirror schedule and never saw the filer with high CPU or deswizzling this much. You will notice that if you turn snapmirror off, the CPU stabilises back to normal.
If the deswizzler finishes before the next mirror update, the CPU will also drop right off.
"If user workload is introduced into the system, Data ONTAP will throttle down these system jobs to focus on this protocol/user workload." This is evident if you enter "priv set diag" and run "wafl scan speed"; you should see it throttle back.
It's annoying, as DFM seems to spam us with CPU alerts all the time. We have not seen an increase in latency, but our workload is quite minimal at the moment (purely CIFS).
Hope this helps, but don't expect your reported high CPU to drop once rlw_upgrading finishes (you should still let these scrubs finish).
Hi Paul, do you have any more info on where you are with this issue right now? I am seeing high WAFL_exempt CPU usage on a customer's 8.1.2 system. It is by no means as bad as you describe, but still worryingly high.
I am also curious to know how people resolved this. After our 8.1.2 upgrade on a pair of V3270s, we see elevated CPU, specifically in the WAFL_exempt counters. DFM reports our filer consistently at 80%+ CPU utilization, while sysstat -m shows the processors bouncing between 45% and 70%.
This is not a snapmirror destination, so I do not see any active deswizzle operations. All I see are active bitmap rearrangements.
Any help on what to look at next is appreciated.
I am wondering: how many of you who are reporting high WAFL_Ex utilization are using PAM/FlashCache?
I noticed on an 8.1.2 system with PAM, high hit ratios and lots of IOPS coming from the PAM cards that the WAFL_Ex domain utilization was very high. When I disabled the PAM card, WAFL_Ex utilization was noticeably lower (albeit at the cost of higher disk utilization) and far less spiky. Is there any scientific explanation for this, i.e. are PAM requests served by calls to the WAFL_Ex domain?
Some decrease in the PAM/FlashCache hit percentage is expected immediately after a Data ONTAP upgrade (cache re-warming).
By chance, are there flashpools configured and in use on the storage system?
I do have PAM cards installed in this system. I do not have any flashpools.
Averaged over a week, I would say this system is replacing about 2000 reads, with spikes to 3500. Those replaced reads equate to about a 45-55% hit rate.
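The figures above can be turned into a rough estimate of the total read load, assuming "replaced reads" means reads served from the PAM instead of disk (cache hits) and that the hit rate is hits divided by all reads over the same interval. The thread does not state the units, so the numbers are used as-is; both helper names are hypothetical.

```python
def implied_total_reads(replaced_reads: float, hit_rate: float) -> float:
    """Total reads implied by a cache hit rate (hit_rate = hits / total reads)."""
    if not 0.0 < hit_rate <= 1.0:
        raise ValueError("hit_rate must be in (0, 1]")
    return replaced_reads / hit_rate


def reads_still_from_disk(replaced_reads: float, hit_rate: float) -> float:
    """Reads the cache missed, which would still have to be served from disk."""
    return implied_total_reads(replaced_reads, hit_rate) - replaced_reads


if __name__ == "__main__":
    # Figures quoted in the thread: ~2000 replaced reads at a 45-55% hit rate.
    for rate in (0.45, 0.50, 0.55):
        total = implied_total_reads(2000, rate)
        disk = reads_still_from_disk(2000, rate)
        print(f"hit rate {rate:.0%}: ~{total:.0f} total reads, ~{disk:.0f} from disk")
```

At a 45-55% hit rate, about 2000 replaced reads implies roughly 3600-4400 reads in total. With the PAM disabled, those 2000 reads would also have to come from disk, which is consistent with the higher disk utilization reported when the card was turned off.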