CPU domains are discussed in KB3014084. WAFL Exempt is the domain where parallelized WAFL work happens, so a high value simply means that WAFL is doing a lot of parallel work. If you can correlate frontend work (throughput or IOPS) with CPU load, then it is probably just a sign of a busy system. If not, then it is something for support to analyze to see whether a bug is involved.
I also see in the screenshot that you are at 82%-92% average CPU utilization. At this utilization level new work will often have to queue before it can run on a CPU. My rule of thumb is that from around 70% avg CPU you will start to see measurable queue time while work waits for an available CPU. Also, if you have a controller failover while both nodes are running at this load level, there will be a pretty big shortfall of CPU resource, leading to much higher queuing for CPU. So if the workload is driving the CPU usage (i.e. no bug), I would recommend getting a bigger node, reducing the workload, or accepting latency that is higher than it could be during normal use and very high if a failover occurs.
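To get a feel for why queue time climbs so quickly at high utilization, here is a minimal sketch using the textbook M/M/c (Erlang C) queueing model. This is only an illustration under stated assumptions: ONTAP's scheduler is not an M/M/c queue, the core count of 8 is a made-up number, and the failover figure assumes two identical nodes whose load simply adds up.

```python
from math import factorial


def erlang_c_wait_probability(servers: int, utilization: float) -> float:
    """Probability that new work must queue in an M/M/c system (Erlang C)."""
    if servers < 1 or not 0.0 <= utilization < 1.0:
        raise ValueError("need servers >= 1 and 0 <= utilization < 1")
    offered = servers * utilization  # offered load in Erlangs
    top = offered ** servers / factorial(servers) / (1.0 - utilization)
    bottom = sum(offered ** k / factorial(k) for k in range(servers)) + top
    return top / bottom


def mean_queue_time(servers: int, utilization: float, service_time: float) -> float:
    """Mean time spent waiting for a free CPU, in the units of service_time."""
    p_wait = erlang_c_wait_probability(servers, utilization)
    return p_wait * service_time / (servers * (1.0 - utilization))


def failover_demand(util_node_a: float, util_node_b: float) -> float:
    """CPU demand on the surviving node, as a fraction of one node's capacity.

    Assumes identical nodes and that work moves over one-to-one,
    which is a simplification.
    """
    return util_node_a + util_node_b


if __name__ == "__main__":
    cores = 8  # hypothetical core count, not taken from the post
    for util in (0.50, 0.70, 0.82, 0.92):
        wait = mean_queue_time(cores, util, service_time=1.0)
        print(f"util={util:.0%} mean queue time={wait:.3f} x service time")
    print(f"failover demand: {failover_demand(0.82, 0.92):.0%} of one node")
```

With both nodes in the 82%-92% range, the summed demand is well above 100% of a single node, which is the shortfall described above.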
Also, a handy dashboard for checking the latency cost of each component in the cluster is on the volume page:
Each graph shows the avg latency breakdown for IOs by component in the data path, where the 'from' is:
• Network: latency from the network outside NetApp, like waiting on vscan for NAS, or SCSI XFER_RDY for SAN (which includes network and host delay, for example on a write)
• Throttle: latency from QoS throttle
• Frontend: latency to unpack/pack the protocol layer and translate to/from cluster messages, occurring on the node that owns the LIF
• Cluster: latency from sending data over the cluster interconnect (the 'latency cost' of indirect IO)
• Backend: latency from the WAFL layer to process the message on the node that owns the volume
• Disk: latency from HDD/SSD access
I put a red arrow on backend because here is where you would see evidence of queuing for CPU. It actually includes queue time + service time, but usually service time is less than 500us, so if this is a major contributor to your overall latency I bet it is coming from wait time. A perfstat and subsequent analysis by NetApp can tell you how much queue time and service time you have on a per operation basis.
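As a rough way to apply this reasoning to numbers read off the dashboard, here is a small sketch. The component values in `sample` are invented for illustration, the 500 us figure is only the "usually less than" bound mentioned above used as an assumed service time, and I am assuming the components add up to the total latency. A perfstat analysis remains the way to get real per-operation queue and service times.

```python
ASSUMED_SERVICE_TIME_US = 500.0  # assumption: upper bound on typical service time


def backend_share(breakdown_us: dict) -> float:
    """Fraction of total average latency attributed to the backend component."""
    total = sum(breakdown_us.values())
    if total <= 0:
        raise ValueError("total latency must be positive")
    return breakdown_us["backend"] / total


def estimated_min_queue_time_us(
    backend_us: float, service_time_us: float = ASSUMED_SERVICE_TIME_US
) -> float:
    """Backend latency left over after subtracting the assumed service time.

    Because the assumed service time is an upper bound, this is a
    conservative (low) estimate of the time spent waiting for CPU.
    """
    return max(0.0, backend_us - service_time_us)


# Made-up sample values in microseconds, one per dashboard component.
sample = {
    "network": 200.0,
    "throttle": 0.0,
    "frontend": 100.0,
    "cluster": 50.0,
    "backend": 2500.0,
    "disk": 150.0,
}

if __name__ == "__main__":
    print(f"backend share of latency: {backend_share(sample):.0%}")
    print(f"estimated queue time: {estimated_min_queue_time_us(sample['backend']):.0f} us")
```

If the backend share is large and the leftover after the assumed service time is also large, that points toward CPU wait time rather than the work itself.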
Hope this helps!
Storage Architect, NetApp EMEA (and author of Harvest)
Blog: It all begins with data