ONTAP Discussions
hello all,
I have a question regarding disk utilization.
Is it possible to have ~92% disk util when the CP type is "-"? I think I have some sort of performance problem; any ideas how to check this out?
I can't believe that, even with SATA disks, the disk util is over 90% at just 4 MB/sec of disk read...
Any comments are welcome,
kind regards
-andy
statit from the same timeframe:
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr0/plex0/rg0:
0c.16 9 3.71 0.47 1.00 90842 2.94 15.14 1052 0.30 7.17 442 0.00 .... . 0.00 .... .
1b.17 11 3.86 0.47 1.00 126105 3.14 14.31 1170 0.25 2.20 1045 0.00 .... . 0.00 .... .
1b.18 35 35.52 33.62 1.24 14841 1.63 26.23 965 0.27 15.09 392 0.00 .... . 0.00 .... .
0c.25 78 35.15 33.47 1.13 64924 1.48 28.77 2195 0.20 16.75 1493 0.00 .... . 0.00 .... .
0c.24 34 33.96 32.26 1.13 17318 1.51 28.21 1007 0.20 17.00 257 0.00 .... . 0.00 .... .
1b.22 36 35.40 33.67 1.15 16802 1.51 28.25 1003 0.22 15.56 721 0.00 .... . 0.00 .... .
0c.21 35 34.98 33.27 1.16 17126 1.48 28.75 950 0.22 14.78 820 0.00 .... . 0.00 .... .
1b.28 77 34.93 33.02 1.13 66383 1.56 27.40 3447 0.35 10.21 8392 0.00 .... . 0.00 .... .
1b.23 32 33.02 31.12 1.17 14775 1.53 27.65 1018 0.37 10.80 1321 0.00 .... . 0.00 .... .
0c.20 35 34.41 32.38 1.29 15053 1.66 25.73 976 0.37 9.67 1076 0.00 .... . 0.00 .... .
0c.19 34 34.80 33.07 1.20 15961 1.51 28.30 930 0.22 15.00 681 0.00 .... . 0.00 .... .
1b.26 76 34.41 32.41 1.05 68532 1.63 26.09 3482 0.37 11.93 7698 0.00 .... . 0.00 .... .
1b.27 36 35.15 33.32 1.26 15327 1.56 27.35 1018 0.27 12.82 1170 0.00 .... . 0.00 .... .
/aggr0/plex0/rg1:
0c.29 5 2.00 0.00 .... . 1.63 27.89 1023 0.37 9.80 231 0.00 .... . 0.00 .... .
0c.33 5 2.03 0.00 .... . 1.68 27.13 1095 0.35 8.21 330 0.00 .... . 0.00 .... .
0c.34 32 34.46 32.75 1.19 14272 1.51 29.87 927 0.20 16.63 617 0.00 .... . 0.00 .... .
0c.35 31 32.85 31.00 1.15 14457 1.51 29.87 895 0.35 12.36 1075 0.00 .... . 0.00 .... .
0c.41 32 33.10 31.44 1.20 13396 1.51 29.87 930 0.15 21.83 618 0.00 .... . 0.00 .... .
0c.43 31 32.73 30.92 1.19 13827 1.58 28.47 1005 0.22 15.22 920 0.00 .... . 0.00 .... .
0c.44 31 32.65 31.02 1.11 14986 1.51 29.85 913 0.12 26.00 408 0.00 .... . 0.00 .... .
1b.32 31 32.68 30.87 1.13 14437 1.58 28.48 956 0.22 15.78 627 0.00 .... . 0.00 .... .
1b.36 32 34.70 32.95 1.13 14680 1.56 28.94 975 0.20 16.75 582 0.00 .... . 0.00 .... .
1b.37 31 32.43 30.70 1.21 13836 1.51 29.89 929 0.22 14.78 797 0.00 .... . 0.00 .... .
Just keep in mind that sysstat shows only the busiest disk in the "disk" column, not an average.
So from the statit output it should be possible to narrow down a bit what is going on... but I have to look up what the output tells us exactly.
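For reference, such a statit sample is captured on the 7-mode console by starting and stopping the collector around the workload; statit lives in advanced privilege, so roughly:

>priv set advanced
>statit -b    (begin collecting)
... let the workload run for 30-60 seconds ...
>statit -e    (end collection and print the report)
>priv set admin

Running sysstat -x 1 in parallel gives the per-second CP and disk picture; double-check the details against the man pages of your ONTAP release.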
From the CP column it is clear that the system is not very busy with writes: CPs occur every 10 s, which is one of the triggers for writing a checkpoint (type T), and a larger disk write also occurs at the time of each CP.
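(From memory, the common CP type codes in sysstat are: "-" = no CP in the interval, T = timer, F = NVLog full, B = back-to-back, S = snapshot, H = high watermark; please verify against the sysstat man page for your release.)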
Your cache age seems rather low, as does the cache hit rate, pointing to really random access.
Net out tracks disk read rather well (disk read being slightly higher), which I would regard as normal. Are there any SnapMirror/SnapVault relationships pointing to that system?
> I think I have some sort of performance problem; any ideas how to check this out?
BTW: do you actually HAVE a performance problem? Are people complaining about poor response times or so?
Mark
Users / helpdesk report slow file transfers. Sometimes we do not get above 4-8 MB/sec over CIFS (the network link is not saturated).
Windows roaming profiles get corrupted when they are written back on logout. We did not have that issue when they were on local disks in a four-year-old Sun Fire V20z with ordinary SCSI disks.
The latest NFS write test (mounted with wsize=16384) runs at about 33-36 MB/sec (with different file sizes, from 400 MB to 3 GB).
Is such a write speed normal with so many disks in two RAID groups?
I am testing around a lot, just trying to get some idea of what the problem could be (if there is a problem). I do not have that much experience with NetApp filers (we had an FC/block-based storage system before).
-andy
What I see are 3 disks not behaving like the others:
0c.25, 1b.28 and 1b.26
They show up with >75% utilization but do not do more transfers than the other disks at ~35%; they take much longer to fetch their chains of 1.n 4k blocks, >60,000 usec compared to the ~15,000 usec of the other disks.
Maybe these disks are slowing down overall performance; as a quick check, see the sketch below.
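If you want to spot such disks quickly in a longer statit report, a rough filter helps; a minimal sketch, assuming the report was saved to a file statit.txt on an admin host (in the output above, field 3 is xfers and field 6 is the ureads usecs):

$ awk '$3+0 > 10 && $6+0 > 30000 {print $1, $2"%", $6, "usec"}' statit.txt
0c.25 78% 64924 usec
1b.28 77% 66383 usec
1b.26 76% 68532 usec

i.e. it lists only busy disks whose user reads take far longer than those of their peers; the thresholds are arbitrary and need adapting to your data.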
But because the given data is only a very short snapshot of the whole picture, you had better investigate with your NetApp partner or NetApp Support to get a more reliable overall view.
Regarding your write tests: it is normal that one single client cannot fill up the storage system. In such a scenario you often do not measure the "write performance" of the storage but the latency of your whole system and network... sorry.
If you really want to know what the storage is capable of, then fire at it from multiple clients simultaneously and multithreaded, record with perfstat what the storage is doing during that time, and also monitor the resources of the clients you use; see the sketch below.
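A minimal sketch of such a load test, assuming each client has the filer NFS-mounted under /mnt/filer (path and sizes are just placeholders):

# run 4 parallel 1 GB sequential writers on each client
$ for i in 1 2 3 4; do
    dd if=/dev/zero of=/mnt/filer/ddtest.$i bs=64k count=16384 &
  done; wait

Run it from several clients at the same time while perfstat is collecting, then compare the aggregate throughput with your single-client numbers.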
Another means of gathering data is the stats command, e.g.:
>stats show -n 5 -i 2 cifs:cifs_latency
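The available objects and counters can be listed on the console, e.g. with

>stats list objects
>stats list counters cifs

if I remember the syntax correctly; it may differ slightly between ONTAP releases.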
But as I said: I recommend contacting someone who can guide and help you on-site.
regards
Mark
Thanks for your efforts and help! I'm going to contact NetApp Support.
-andy
Do you have some details about your issue? What was the problem in the end? Was it a hardware problem?