Disks always busy, WAFL scans never-ending

PhilipR · ‎2015-07-03

When I see disk-busy stats for a RAID-DP aggr constantly >90%, 24x7, something seems wrong.

What's causing the I/O according to wafltop ("wafltop show -v io -i 5")? It's mostly aggr0:*:scanner, e.g.:

I/O utilization
                                  ---------MB Read---------- ---------MB Write--------- --------IOs Read---------- --------IOs Write---------
             Application MB Total Standard ExtCache  Hybrid  Standard ExtCache  Hybrid  Standard ExtCache  Hybrid  Standard ExtCache  Hybrid
             ----------- -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
aggr0:vol_xxx:scanner:       55       13        0       41        1        0        0     2127        0    10617        0        0        0
  aggr0:vol_xxx:nfsv3:       15        0        0        0       15        0        0       58        0        4        0        0        0
   aggr0:vol_yyy:cifs:        2        0        0        0        2        0        0       44        0        0        0        0        0
       aggr0::scanner:        1        0        0        1        0        0        0        0        0       23        0        0        0

There's some NFSv3, sure, and a lot of the I/O is being handled well by the hybrid SSDs, but the SATA data disks are running flat out.
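To put a number on "mostly scanner", here is a rough Python sketch that sums the "MB Total" column per workload type from rows pasted out of wafltop. It assumes the row layout shown above (a colon-terminated application name followed by numeric columns); the function name mb_total_by_app is made up for illustration and is not part of any NetApp tooling.

```python
# Rows copied from the wafltop output above. Only the first numeric
# column (MB Total) is used; the layout is assumed from this one sample.
SAMPLE = """\
aggr0:vol_xxx:scanner:       55       13        0       41        1        0        0     2127        0    10617        0        0        0
  aggr0:vol_xxx:nfsv3:       15        0        0        0       15        0        0       58        0        4        0        0        0
   aggr0:vol_yyy:cifs:        2        0        0        0        2        0        0       44        0        0        0        0        0
       aggr0::scanner:        1        0        0        1        0        0        0        0        0       23        0        0        0
"""


def mb_total_by_app(text):
    """Sum MB Total per workload type (the last component of the name)."""
    totals = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip headers, rulers and blank lines: a data row starts with a
        # name ending in ':' followed by an integer.
        if len(parts) < 2 or not parts[0].endswith(":") or not parts[1].isdigit():
            continue
        app = parts[0].rstrip(":").split(":")[-1]
        totals[app] = totals.get(app, 0) + int(parts[1])
    return totals


totals = mb_total_by_app(SAMPLE)
scanner_share = totals.get("scanner", 0) / sum(totals.values())
print(totals, f"scanner share: {scanner_share:.0%}")
```

On the sample above this attributes 56 of 73 MB to the scanner, about 77% of the interval's total.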

netapp1*> wafl scan status vol_xxx
Volume vol_xxx:
 Scan id                   Type of scan     progress
   59244     blocks used summary update     block 1218 of 33705 (fbn 18862)
   59245    active bitmap rearrangement     fbn 2713 of 33704 w/ max_chain_len 3
   77545     snap create summary update     block 6392 of 33705 (fbn 6386)

The weird thing here is the old scan ID for "blocks used", roughly 18000 behind the later IDs, and that "blocks used" never, ever completes. The autosnap settings for these volumes are the defaults, and the snapmirror to the DR site runs semi-sync with the default visibility_interval of 3 minutes. In other words, a snapshot is being created or deleted every 3 minutes by snapmirror, and every hour by autosnap.

Why do "blocks used summary update" and "snap create summary update" make no progress? Because every time there's snapshot churn they restart.

How long would they take to run to completion? Impossible to say, as a progress description like "block 502 of 33705 (fbn 501)" walks an invisible list of fbns (and what is an fbn?).
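One way to test the restart theory, rather than guess, would be to capture "wafl scan status" at intervals and compare the progress counters. The Python sketch below parses the three-column output shown earlier and reports scan IDs whose position went backwards between two captures. Two assumptions to flag loudly: that a restart shows up as the counter dropping for the same scan ID (not confirmed anywhere in this thread), and that every progress line looks like "block N of M" or "fbn N of M" as in the sample. The second capture is invented for illustration, reusing the "block 502" string quoted above with an assumed scan ID.

```python
import re

# id, type of scan, then "block N of M" or "fbn N of M" (format assumed
# from the sample output in this post; other scan types may differ).
LINE_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s{2,}(?:block|fbn)\s+(\d+)\s+of\s+(\d+)")

# First capture: copied from the output earlier in the post.
FIRST = """\
Volume vol_xxx:
 Scan id                   Type of scan     progress
   59244     blocks used summary update     block 1218 of 33705 (fbn 18862)
   59245    active bitmap rearrangement     fbn 2713 of 33704 w/ max_chain_len 3
   77545     snap create summary update     block 6392 of 33705 (fbn 6386)
"""

# Second capture: HYPOTHETICAL, made up to exercise the comparison.
SECOND = """\
Volume vol_xxx:
 Scan id                   Type of scan     progress
   59244     blocks used summary update     block 502 of 33705 (fbn 501)
   59245    active bitmap rearrangement     fbn 2900 of 33704 w/ max_chain_len 3
"""


def parse_scans(text):
    """Return {scan_id: (scan_type, position, total)} for each progress line."""
    scans = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            scan_id, scan_type, pos, total = m.groups()
            scans[int(scan_id)] = (scan_type, int(pos), int(total))
    return scans


def went_backwards(earlier, later):
    """Scan IDs present in both captures whose position decreased."""
    return sorted(i for i in earlier if i in later and later[i][1] < earlier[i][1])


first, second = parse_scans(FIRST), parse_scans(SECOND)
for scan_id in went_backwards(first, second):
    scan_type, pos, total = second[scan_id]
    print(f"scan {scan_id} ({scan_type}) dropped back to {pos} of {total}")
```

If the drops line up with the 3-minute snapmirror snapshots, that would support the restart explanation; if the counters only ever climb, something else is keeping the scans from finishing.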

Does anyone have experience of successfully lowering disk busy %age by careful tweaking of snapmirror or other settings to avoid constant WAFL scans?