Hi @archspangler
Glad you like the monitoring stack!
This write cleaning metric relates to how Data ONTAP (DOT) persists data written by clients in a consistency point (CP). A CP actually has many phases, but I simplify it into two: write cleaning and flushing to disk. Flushing is the actual writing out to disk, and cleaning (or maybe nicer to call it ‘write allocation’) is everything else, including metadata reads needed to write data, building tetrises around fragmented free space, etc. By design DOT flushes lazily to avoid spikes of IO on the disks, and only when it is under write pressure does it flush as fast as possible instead. So typically cleaning takes the minority of the CP time and flushing the majority.
If cleaning becomes the majority then something isn’t healthy. Maybe it’s free space fragmentation, maybe readahead wasn’t able to prefetch in time the metadata or data needed to process partial IOs, maybe a bug caused no prefetch to occur, maybe lots of data was deleted on a full volume and DOT is reclaiming free space in an aggressive mode, or something else again. This is why I show it as a sign of write health.
The only wrinkle I have heard of came from another customer, who saw the graph also go high during a heavy write workload on AFF. My guess (since I don’t have an AFF to test with) is that the flush phase to SSD is so fast that cleaning ends up taking the majority of the time. So keep this caveat in mind when interpreting the graph on AFF.
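To make the "minority vs. majority" reading concrete, here is a minimal Python sketch. It is only an illustration of the reasoning above, not how Harvest or Data ONTAP actually computes the metric. The function names, the two time inputs, and the 50% cutoff are all assumptions chosen for the example.

```python
# Hypothetical illustration: none of these names come from Harvest or ONTAP.

def cleaning_share(cleaning_time, flushing_time):
    """Return the fraction of CP time spent cleaning, or None if there was no CP time."""
    total = cleaning_time + flushing_time
    if total <= 0:
        return None
    return cleaning_time / total


def interpret(cleaning_time, flushing_time, is_aff=False):
    """Classify a sample using the 'cleaning should be the minority' rule of thumb."""
    share = cleaning_share(cleaning_time, flushing_time)
    if share is None:
        return "no CP activity"
    if share <= 0.5:  # assumed cutoff: cleaning is not the majority
        return "normal"
    if is_aff:
        # On AFF a fast flush phase may make cleaning dominate under heavy writes.
        return "cleaning-dominated (possibly expected on AFF)"
    return "cleaning-dominated (investigate)"


if __name__ == "__main__":
    print(interpret(20, 80))               # cleaning is the minority
    print(interpret(70, 30))               # cleaning is the majority
    print(interpret(70, 30, is_aff=True))  # same ratio, AFF caveat applied
```

The inputs can be in any unit as long as both use the same one, since only the ratio matters.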
Did you have a heavy write workload during this incident? Did you delete any large snapshots or large files on full volumes while heavy writes were occurring (check your disk used graphs)? These are some ideas to research to try to get to the root cause. A perfstat also contains much more diagnostic info and could pinpoint the root cause, if you have one from the incident (or, if it happens again, you can collect one then).
Cheers,
Chris Madden
Storage Architect, NetApp EMEA (and author of Harvest)
Blog: It all begins with data