Hello everybody,
Could someone help us diagnose the root cause of a problem with our FAS3170? It runs ONTAP 8.1.2 7-mode with a single shelf of 1TB SATA disks, configured as one aggregate. Aggregate utilization is below 45%, and it holds several thin-provisioned volumes, each less than 70% full.
The problem is that from time to time the disks in the aggregate are 99-100% busy (see below) and access times hit 300-700ms (see attached screenshots), which makes all the VMs running from these volumes pretty much unusable.
disk:50000C90:001D0E64:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:8%
disk:50000C90:001D2A34:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:11%
disk:50000C90:001D1F40:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001DD5DC:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001D0D84:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:99%
disk:50000C90:001D1E78:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001D1B9C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001DE5F8:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:99%
disk:50000C90:001DFDCC:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:94%
disk:50000C90:001D0C98:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:97%
disk:50000C90:001DE004:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:94%
disk:50000C90:001DEF90:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001DD5A8:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001DE208:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001DDE1C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:0%
disk:50000C90:001DD560:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:0%
disk:50000C90:001D29D0:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001D0CF8:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%
disk:50000C90:001D0E18:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:97%
disk:50000C90:001D1F60:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:99%
disk:50000C90:001D185C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:99%
disk:50000C90:001D1FAC:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:97%
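In case it helps anyone compare snapshots of this output over time, here is a rough Python sketch that parses lines in the format above and splits the disks into busy and idle groups. The function names and the 90% threshold are my own assumptions for illustration, not anything from ONTAP, and the line format is assumed to be exactly what is pasted above.

```python
import re

# Matches lines like "disk:<uid with colons>:disk_busy:97%".
LINE_RE = re.compile(r"^disk:(?P<uid>.+):disk_busy:(?P<pct>\d+)%$")


def parse_disk_busy(text):
    """Return {disk_uid: busy_percent} for every disk_busy line in text."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:  # silently skip anything that is not a disk_busy line
            result[m.group("uid")] = int(m.group("pct"))
    return result


def split_by_threshold(busy, threshold=90):
    """Split disk UIDs into (hot, cold) lists; threshold is an arbitrary assumption."""
    hot = sorted(uid for uid, pct in busy.items() if pct >= threshold)
    cold = sorted(uid for uid, pct in busy.items() if pct < threshold)
    return hot, cold


if __name__ == "__main__":
    zeros = ":".join(["00000000"] * 8)
    # Three lines taken from the listing above.
    sample = "\n".join([
        f"disk:50000C90:001D0E64:{zeros}:disk_busy:8%",
        f"disk:50000C90:001D1F40:{zeros}:disk_busy:100%",
        f"disk:50000C90:001DDE1C:{zeros}:disk_busy:0%",
    ])
    hot, cold = split_by_threshold(parse_disk_busy(sample))
    print("busy disks:", hot)
    print("idle disks:", cold)
```

Run against the full listing, this should make it easy to see which disks stay at low utilization while the rest are saturated.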
I need just one thing: to find out what is causing the 100% disk busy, using ONTAP's own tools or third-party ones. I have turned off dedup, and there is no SnapMirror activity, no WAFL scans, no realloc, and no high IOPS from the VMs or anywhere else. I still have no f-ing idea why the disks are so busy.
Any help? The tickets we opened with NetApp support were useless: they told us that the system is too old and that this kind of behaviour is expected with SATA shelves...