Re: Disk Busy 100% without much I/O activity

irakli_natsvlishvili · ‎2015-03-25

Hello everybody,

Could someone help us diagnose root cause of the problem with our FAS3170? ONTAP 8.1.2 7-mode, we have a single 1TB SATA disk shelf, configured with one aggregate. Aggregate utilization is less than 45%, there are several thin provisioned volumes, volumes are less than 70% full as well.

The problem is that from time to time, disks on the aggregate are busy 99-100% (see below) and access time is hitting 300-700ms (see attached screenshots), making all our VMs running from the volumes pretty much non-workable.

disk:50000C90:001D0E64:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:8%

disk:50000C90:001D2A34:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:11%

disk:50000C90:001D1F40:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001DD5DC:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001D0D84:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:99%

disk:50000C90:001D1E78:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001D1B9C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001DE5F8:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:99%

disk:50000C90:001DFDCC:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:94%

disk:50000C90:001D0C98:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:97%

disk:50000C90:001DE004:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:94%

disk:50000C90:001DEF90:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001DD5A8:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001DE208:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001DDE1C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:0%

disk:50000C90:001DD560:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:0%

disk:50000C90:001D29D0:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001D0CF8:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:100%

disk:50000C90:001D0E18:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:97%

disk:50000C90:001D1F60:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:99%

disk:50000C90:001D185C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:99%

disk:50000C90:001D1FAC:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:97%

I need just one thing - find out what causes 100% disk busy using ONTAP's or third-party tools. I turned off dedup, there is no snapmirror activity, no WAFL scans, no realloc, there are no high IO/s from VMs or some other places - still have no f-ing idea why disks are so busy.

Any help? Tickets opened at netapp support were useless, they told us that the system is too old and it is expected to have such kind of behaviour with SATA shelves...

JSHACHER11 · ‎2015-03-25

if you are willing to collect and send me a perfstat file (while the disks are 100%) I might be able to help

irakli_natsvlishvili · ‎2015-03-25

Sure, do you need them with some specific settings?

JSHACHER11 · ‎2015-03-25

just sent you a private message with the details

SALVATORE_PUGLISI · ‎2016-06-21

Hello,

Could you find a solution for this issue? We have the same issue with a FAS3250, 4TB Disk, OnTap 8.1.4P8 7-Mode. Disk utilizations runs really high each morning for approx. 5 hours.

Regards

SaPu

naren_s · ‎2016-06-21

Are you checking it from sysstat ? It will give you the busiest disk as it only gives you the busiest CPU.

Go to advanced mode.

*> statit -b

Then wait for one minute

*>statit -e

To stop statit collection and dump data.

*>priv set

Now check the disk utilization in disk/aggregate section which will give the values for every disk.

Refer below document if you need help in running those stats.

https://kb.netapp.com/support/index?page=content&id=1014701

irakli_natsvlishvili · ‎2016-06-24

Well the best solution is to add more disks... In our case it was determined that we simply did not have enough disks. After we moved flash cache cards from another filer, situation got better.

aborzenkov · ‎2016-06-21

"each morning for approx. 5 hours" sounds very much like disk scrub.

J_curl · ‎2016-06-21

+1 for disk scrub. Default duration is 6 hours