VMware Solutions Discussions

Latency between FAS2050 and VMWare Cluster

thmaine
5,608 Views

Fairly new to Netapp and need some help.

We are starting to notice some latency between our FAS2050 and our VMware cluster. Our 2050 is on ONTAP 7.2.4L1, and capacity on the aggregates is above 90%. We are running ESX 3.5 Update 5. A couple of questions:

1) What is the max IOPS that a 2050 controller head can handle?

2) Is there anything on the controller side that would cause the VMware guests to see upwards of 200 ms latency?

thanks

thomas

13 REPLIES

radek_kubka
5,591 Views

Hi Thomas,

There are countless things which may impact your performance:

- too little free space in the aggregate

- too high fragmentation (could be a result of little space available)

- too few spindles

- networking issues

- filer just not coping with the workload

To be perfectly honest with you, the FAS2050 is a rather slow box (to say the least), e.g. in terms of its CPU capabilities.

Can you post the output of the command below (ideally run for a few minutes while there is a performance issue; end with Ctrl-C)?

sysstat -x -s 5

regards,
Radek

thmaine
5,591 Views

Radek,

Thank you for the response. A few comments on your post.

- The aggregates are around 90% full, so space is at a premium; we are looking at adding more.

- Are there any checks I can run for fragmentation or VM misalignment?

- We should have enough disk spindles: we have 20 15K SAS drives and a shelf (14 drives) of 15K Fibre Channel drives. If anything, the controller is running out of IOPS.

- We are investigating any networking issues. We have two Dell switch stacks: one for the storage network and another for all other traffic.

- I will gather the sysstat data today and post in just a bit.

Thomas Maine

Technical Services Manager

Bond International Software, Inc.

radek_kubka
5,591 Views

Are there any checks I can run to check for fragmentation

The reallocate command will do the trick, both for checking and fixing problems - http://now.netapp.com/NOW/knowledge/docs/ontap/rel80/html/ontap/cmdref/man1/na_reallocate.1.htm

See this thread for some interesting angles, as the topic is not that straightforward and not well documented: http://communities.netapp.com/message/20969#20969

vm misalignment?

This is well documented - http://media.netapp.com/documents/tr-3747.pdf (page 31 talks about detection)
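For a quick sense of what "misaligned" means in practice, here is a small illustration (my sketch, not from the TR): WAFL works in 4 KB blocks, so a guest partition whose starting offset is not a multiple of 4096 bytes turns guest writes into partial writes on the filer. The starting sector comes from something like `fdisk -lu` inside a Linux guest (512-byte sectors); sector 63, the old DOS partitioning default, is the classic offender:

```python
# Illustration of the 4 KB alignment rule: a partition is WAFL-aligned
# when its starting byte offset is a multiple of the 4096-byte block size.
# Sector numbers are 512-byte sectors as reported by `fdisk -lu`.
SECTOR_SIZE = 512
WAFL_BLOCK = 4096

def is_aligned(start_sector):
    # Convert the starting sector to a byte offset and check divisibility.
    return (start_sector * SECTOR_SIZE) % WAFL_BLOCK == 0

print(is_aligned(63))    # old DOS default: misaligned
print(is_aligned(64))    # aligned
print(is_aligned(2048))  # modern partitioning default: aligned
```

TR-3747 (page 31) covers the supported ways to detect this on the filer side; the snippet above is only the arithmetic behind it.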

radek_kubka
5,591 Views

One more thing:

Have you considered upgrading your ONTAP and/or ESX versions?

On the ONTAP side, there are many areas where performance has improved from 7.2.x to 7.3.x (e.g. read-ahead algorithms, which make a difference to CPU utilisation).

thmaine
5,591 Views

We actually have a couple of projects on the schedule over the next month to upgrade our esx and netapp.

Thomas Maine

Technical Services Manager

Bond International Software, Inc.

thmaine
5,591 Views

Average output on c1:

CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
25% 180 0 0 278 31349 1361 1646 6 0 28154 7 85% 0% - 11% 0 98 0 0
37% 178 0 0 260 29367 1253 7374 8445 0 26804 7 98% 44% Tf 19% 0 82 0 0
26% 208 0 0 280 36640 1748 2035 2478 0 33487 7 89% 12% : 9% 0 72 0 0
37% 410 0 0 488 49182 3968 4137 6 0 43429 6 76% 0% - 14% 0 78 0 0
36% 204 0 0 289 36248 1029 4967 10134 0 32847 7 98% 44% T 17% 0 85 0 0
31% 196 0 0 288 47934 1261 978 11 0 43346 7 88% 0% - 7% 0 92 0 0
35% 162 0 0 240 39398 1186 5438 5878 0 36333 7 97% 42% Tf 12% 0 78 0 0
30% 205 0 0 244 46653 1206 1208 2078 0 43149 4 88% 8% : 10% 0 39 0 0
31% 314 0 0 415 39727 1191 2193 11 0 34537 6 91% 0% - 15% 0 101 0 0
48% 188 0 0 249 35331 1225 7724 11444 0 32375 6 98% 69% T 28% 0 61 0 0
21% 214 0 0 272 30676 1167 1310 11 0 27682 6 85% 0% - 8% 0 58 0 0
31% 276 0 0 344 30569 1186 3984 2429 0 27342 5 97% 21% Tf 13% 0 68 0 0
35% 269 0 0 317 50644 1141 1437 6476 0 46819 5 93% 30% : 10% 0 48 0 0
32% 226 0 0 292 51672 1608 1474 11 0 47881 5 83% 0% - 10% 0 66 0 0
40% 288 0 0 317 37889 1043 6662 8482 0 34157 5 98% 57% T 15% 0 29 0 0
40% 475 0 0 649 48276 3486 4014 118 0 42952 6 80% 0% - 15% 0 174 0 0
33% 161 0 0 230 40437 1527 3428 846 0 37067 5 97% 10% Tf 13% 0 69 0 0
32% 228 0 0 298 42014 1358 1842 8757 0 38247 5 92% 41% : 13% 0 70 0 0
35% 353 0 0 427 49049 1545 1543 6 0 43293 5 88% 0% - 10% 0 74 0 0
44% 209 0 0 346 40401 2027 8354 9430 0 36975 9 96% 58% Tf 22% 0 137 0 0
38% 313 0 0 440 41285 1805 2978 1417 0 32781 5 93% 6% : 21% 0 127 0 0
32% 262 0 0 364 43827 1927 2158 6 0 39531 6 83% 0% - 12% 0 102 0 0
46% 177 0 0 203 40369 1027 9911 16051 0 36622 5 99% 92% Tf 19% 0 26 0 0
26% 243 0 0 286 39857 1299 1373 18 0 36451 5 92% 1% : 7% 0 43 0 0
29% 234 0 0 380 41367 1666 1513 6 0 36517 5 86% 0% - 9% 0 146 0 0
39% 192 0 0 266 39504 1206 4408 10552 0 36359 5 98% 54% T 16% 0 74 0 0
28% 235 0 0 274 46429 1145 858 11 0 42480 5 92% 0% - 6% 0 39 0 0

Average output from c2:

CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
54% 140 0 0 226 3963 37786 38384 7512 0 0 6 99% 26% Tf 39% 0 86 0 0
47% 78 0 0 174 2188 37788 37647 3578 0 0 6 99% 15% : 44% 0 96 0 0
48% 82 0 0 271 3286 44996 45144 6 0 0 6 99% 0% - 34% 0 189 0 0
50% 75 0 0 247 3404 38288 41668 10059 0 0 5 99% 48% T 29% 0 172 0 0
47% 72 0 0 247 3440 39371 39317 11 0 0 5 99% 0% - 39% 0 175 0 0
56% 89 0 0 200 3844 39358 42454 5833 0 0 5 99% 34% Tf 43% 0 111 0 0
50% 66 0 0 178 3454 43506 42507 4923 0 0 5 99% 21% : 39% 0 112 0 0
55% 139 0 0 312 2931 45593 45019 11 0 0 5 97% 0% - 46% 0 173 0 0
55% 117 0 0 213 2983 37411 41009 9738 0 0 6 99% 57% T 41% 0 96 0 0
58% 77 0 0 514 5567 44210 43752 11 0 0 6 95% 0% - 49% 0 437 0 0
60% 100 0 0 320 4590 40898 44822 2408 0 0 7 98% 21% Ts 43% 0 220 0 0
51% 77 0 0 297 3911 34172 35075 11124 0 0 7 99% 39% : 54% 0 220 0 0
49% 98 0 0 167 2145 42473 43370 11 0 0 7 99% 0% - 43% 0 69 0 0
50% 88 0 0 188 2359 31868 35439 11345 0 0 7 99% 60% Tf 42% 0 100 0 0
51% 82 0 0 264 4454 38671 37736 770 0 0 7 98% 5% : 44% 0 182 0 0
54% 74 0 0 256 4307 40893 39259 11 0 0 6 97% 0% - 54% 0 182 0 0
52% 70 0 0 274 2955 39878 45548 12642 0 0 6 98% 53% T 31% 0 204 0 0
48% 83 0 0 273 3663 51264 51033 6 0 0 6 97% 0% - 20% 0 190 0 0
55% 104 0 0 264 4086 37853 41557 10845 0 0 6 98% 40% Tf 44% 0 160 0 0
45% 83 0 0 285 4398 35637 36226 1370 0 0 6 98% 6% : 38% 0 202 0 0
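As an aside, when captures run this long it can help to average the key columns with a short script. This is just a sketch, and it assumes the standard sysstat -x column order shown in the header (CPU, NFS, CIFS, HTTP, Total, Net in/out, Disk read/write, ...); the two sample rows are taken from the c2 capture above:

```python
# Rough sketch: average CPU %, total ops, and disk-read kB/s from a
# pasted "sysstat -x" capture. Assumes the column order from the header:
# index 0 = CPU, 4 = Total ops, 7 = Disk read kB/s.
sample = """\
54% 140 0 0 226 3963 37786 38384 7512 0 0 6 99% 26% Tf 39% 0 86 0 0
47% 78 0 0 174 2188 37788 37647 3578 0 0 6 99% 15% : 44% 0 96 0 0
"""

def summarize(text):
    cpu, total, disk_read = [], [], []
    for line in text.splitlines():
        cols = line.split()
        # Data rows start with a percentage; skip headers and blanks.
        if not cols or not cols[0].endswith('%'):
            continue
        cpu.append(int(cols[0].rstrip('%')))
        total.append(int(cols[4]))
        disk_read.append(int(cols[7]))
    n = len(cpu)
    return {
        'avg_cpu_pct': sum(cpu) / n,
        'avg_total_ops': sum(total) / n,
        'avg_disk_read_kbs': sum(disk_read) / n,
    }

print(summarize(sample))
```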

Thomas Maine

Technical Services Manager

Bond International Software, Inc.

thmaine
5,591 Views

Thanks for the links; I will review them today.

Thomas Maine

Technical Services Manager

Bond International Software, Inc.

thmaine
5,591 Views

High CPU on c2:

CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
100% 56 0 0 292 475 2659 7082 11 0 0 2s 52% 0% - 20% 0 236 0 0
100% 64 0 0 353 3076 2228 7096 4050 0 0 2s 55% 23% T 26% 0 289 0 0
100% 83 0 0 319 1076 2551 5927 6 0 0 2s 51% 0% - 20% 0 236 0 0
100% 86 0 0 356 1196 2567 9550 3130 0 0 2s 53% 19% Tf 30% 0 270 0 0
100% 75 0 0 366 1513 2511 6976 3683 0 0 2s 52% 19% : 24% 0 291 0 0
100% 191 0 0 467 1703 2950 7578 6 0 0 2s 51% 0% - 25% 0 276 0 0
100% 223 0 0 807 1710 5685 13415 6738 0 0 2s 56% 35% T 31% 0 584 0 0
100% 79 0 0 699 1421 6027 9835 11 0 0 2s 52% 0% - 21% 0 620 0 0
100% 91 0 0 643 1215 5705 12610 5710 0 0 3s 55% 33% T 28% 0 552 0 0
100% 67 0 0 631 1563 5077 9263 11 0 0 2s 52% 0% - 24% 0 564 0 0
100% 60 0 0 660 4912 6663 12511 3534 0 0 3s 54% 11% Ts 28% 0 600 0 0
100% 83 0 0 835 2668 5683 10466 5217 0 0 2s 54% 17% : 26% 0 752 0 0
100% 90 0 0 572 1463 4574 9668 11 0 0 2s 52% 0% - 24% 0 482 0 0
100% 67 0 0 729 1798 5733 11699 7022 0 0 2s 55% 28% T 26% 0 662 0 0
100% 61 0 0 794 2113 6594 9783 6 0 0 2s 52% 0% - 27% 0 733 0 0
100% 76 0 0 704 1472 5600 11387 5690 0 0 2s 54% 24% T 29% 0 628 0 0
100% 67 0 0 893 4991 6321 9518 11 0 0 2s 52% 0% - 21% 0 826 0 0
100% 88 0 0 1190 10357 4213 11498 8079 0 0 4s 58% 30% Fs 36% 0 1102 0 0
100% 67 0 0 970 9307 5603 9428 6875 0 0 2s 53% 23% : 22% 0 903 0 0
100% 77 0 0 823 2242 6150 13613 15297 0 0 3s 56% 63% F 33% 0 746 0 0
100% 68 0 0 798 2207 5700 9341 6 0 0 2s 52% 0% - 20% 0 730 0 0
99% 91 0 0 782 2175 5566 10245 13 0 0 3s 53% 3% Tn 23% 0 691 0 0
62% 84 0 0 510 1461 2948 12586 12811 0 0 8s 96% 89% Z 51% 0 426 0 0

Thomas Maine

Technical Services Manager

Bond International Software, Inc.

radek_kubka
5,591 Views

OK, fantastic.

When you look at the output with average CPU load, your disks are actually working harder and there is more network traffic than when the CPU peaks. Normally that would indicate some internal process is hammering the CPU.

My first guess is that a de-duplication scan may be causing this CPU peak.

Regards,
Radek

thmaine
5,015 Views

Can you explain the scoring for reallocating? Is there a command you can run that will tell you if the volume needs to be reallocated?

“It says that these volumes are not in need of a reallocate (scoring 1 or 2)”

Thomas Maine

Technical Services Manager

Bond International Software, Inc.

radek_kubka
5,015 Views

The man page I was referring to earlier describes the scoring:

The threshold when a LUN, file or volume is considered unoptimized enough that a reallocation should be performed is given as a number from 3 (moderately optimized) to 10 (very unoptimized)

Although I have seen people on this forum saying they got results as high as 20-something.
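To put those thresholds in one place, here is a tiny sketch of how I would read a reallocate measure score. The exact bands are my own reading of the man page quoted above, not an official mapping:

```python
# Interpret a reallocate measure score, per the man page: 1-2 means the
# layout is optimized, 3 is the threshold where reallocation starts to pay
# off, and 10 (or higher, as sometimes reported) means very unoptimized.
def interpret(score):
    if score <= 2:
        return 'optimized: no reallocation needed'
    if score < 10:
        return 'moderately unoptimized: consider running reallocate'
    return 'very unoptimized: reallocation recommended'

for s in (1, 4, 10, 20):
    print(s, interpret(s))
```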

From the results you have, and also looking at the sysstat outputs (where the disks seem not that busy), fragmentation is not the culprit you are looking for.

Regards,

Radek

p_maniawski
5,015 Views

Could you please suggest what might cause such heavy filer CPU load? Is there any way to check it (networking, disk checksum calculations, etc.)? We're also having performance issues, especially as described in thread 13413.

eric_barlier
5,015 Views

Hi Thomas,

You asked what IOPS the FAS2050 can do; that depends on how much latency you can tolerate. 200 ms is too much, I agree.

How many disks are in the aggregate, and what type?

Your aggregate seems full, and that will potentially be part of the problem. Any filesystem that gets full will have to shuffle blocks around before it can write stripes, so when you say your aggregate is above 90% I am concerned. Can you tell us exactly how full it is? Please run the following commands:

df -Ag aggr.name

aggr show_space -g

snap reserve -A aggr.name

Lets see if we can free up some space.

How many VMs and datastores are you running on this box? Keep in mind the FAS2050 is an entry-level controller.

Finally, I'd like you to collect data using the "statit" command. Please do the following:

priv set advanced

statit -b    (begin)

wait 2 minutes

statit -e  (end)

Collect the output in a text file and attach it to this thread, please.

Cheers,

Eric
