ONTAP Hardware
My FAS3240 filer is showing very high CPU utilization. I have done a tech refresh from 3040 & 3020 to 3240 (Data ONTAP 7.3.6 to 8.1P1).
Other than the disk shelves, everything else in the 3240 is new, including the PAM modules.
The new filer's performance is obviously better now, but its CPU utilization is very high.
sysstat -M 1
ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
90% 58% 31% 12% 50% 46% 46% 49% 62% 47% 0% 0% 6% 10% 0% 26% 77%( 54%) 7% 0% 0% 14% 9% 6% 13529 94%
95% 69% 39% 16% 59% 55% 55% 59% 66% 56% 0% 0% 7% 14% 0% 24% 86%( 60%) 7% 0% 0% 20% 11% 9% 13489 100%
88% 58% 29% 10% 50% 46% 47% 50% 56% 60% 0% 0% 5% 10% 0% 22% 67%( 50%) 4% 0% 0% 13% 11% 7% 17440 75%
95% 65% 37% 17% 57% 50% 50% 57% 70% 52% 0% 0% 8% 14% 0% 32% 82%( 54%) 3% 0% 0% 17% 10% 8% 15628 75%
87% 57% 29% 8% 56% 53% 55% 60% 56% 45% 0% 0% 8% 15% 0% 19% 69%( 51%) 6% 0% 0% 16% 9% 38% 13040 76%
If I look at the WAFL scan status, it shows the following:
Volume vol0:
Scan id Type of scan progress
1 active bitmap rearrangement fbn 791 of 3959 w/ max_chain_len 3
Volume vol1:
Scan id Type of scan progress
2 active bitmap rearrangement fbn 1108 of 13474 w/ max_chain_len 3
Volume vol2:
Scan id Type of scan progress
3 active bitmap rearrangement fbn 159 of 356 w/ max_chain_len 3
Volume vol3:
Scan id Type of scan progress
4 active bitmap rearrangement fbn 311 of 356 w/ max_chain_len 3
----------------------------------------------------------------------------------------------------------------
Let me know if anyone can see a good reason for the high CPU utilization.
Thanks
Those numbers actually look decent. Are you looking at ANY1+ and concluding you have high CPU? That is not really a good indicator; I prefer to look at the individual and AVG columns to determine whether the CPU is the bottleneck. From this information it would seem your workload probably has a lot of writes. Can you pull a statit -b / statit -e (ending it after a few minutes), as well as a few lines of sysstat -x 1? You could also look at volume latencies (stats show volume: -- or look in Operations Manager / Performance Advisor).
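For reference, a rough capture sequence along those lines might be the following (statit needs advanced privilege; the volume name is only an example, and counter names can vary slightly by release):
filer> priv set advanced
filer*> statit -b
(wait a few minutes under a representative load)
filer*> statit -e
filer*> sysstat -x 1
filer*> stats show volume:vol1:avg_latency
filer*> priv set admin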
filer2> sysstat -x 1
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
98% 20607 1 0 20608 17309 82445 134107 148711 0 0 >60 98% 93% H 93% 0 0 0 0 0 0 0
99% 19381 1 0 19382 41222 23018 136308 142793 0 0 0s 99% 65% Hf 97% 0 0 0 0 0 0 0
98% 18134 0 0 18140 16738 20368 136415 158128 0 0 17s 98% 97% Hf 97% 6 0 0 0 0 0 0
97% 16471 0 0 16471 36261 21444 120937 134514 0 0 17s 99% 93% Hf 99% 0 0 0 0 0 0 0
statit -e
Hostname: filer2 ID: 1574797944 Memory: 5958 MB
NetApp Release 8.1P1 7-Mode: Wed Apr 25 23:47:02 PDT 2012
Start time: Wed Jun 13 14:12:19 IST 2012
CPU Statistics
74.881299 time (seconds) 100 %
201.480400 system time 269 %
8.298362 rupt time 11 % (2845179 rupts x 3 usec/rupt)
193.182038 non-rupt system time 258 %
98.044796 idle time 131 %
57.735933 time in CP 77 % 100 %
6.278167 rupt time in CP 11 % (2178820 rupts x 3 usec/rupt)
Multiprocessor Statistics (per second)
cpu0 cpu1 cpu2 cpu3 total
sk switches 81770.60 83865.64 74345.36 61023.43 301005.03
hard switches 34613.49 40316.34 41225.01 25198.45 141353.29
domain switches 2342.83 612.98 418.41 320.28 3694.50
CP rupts 13296.07 2762.87 6259.63 6778.42 29096.98
nonCP rupts 3930.97 617.84 2132.66 2217.40 8898.87
IPI rupts 0.00 0.00 0.00 0.00 0.00
grab kahuna 0.00 0.00 0.04 0.00 0.04
grab kahuna usec 0.00 0.00 4.15 0.00 4.15
CP rupt usec 40250.74 1844.35 17245.57 24500.90 83841.59
nonCP rupt usec 12524.45 363.86 5938.08 8152.22 26978.62
idle 369294.66 368234.77 322539.00 249267.76 1309336.22
kahuna 0.00 0.19 0.00 298462.26 298462.45
storage 99.04 80015.43 179.95 0.00 80294.43
exempt 107146.65 104735.28 62647.36 12.14 274541.44
raid 345.04 325.64 195562.15 0.00 196232.84
target 8.87 10.28 13.14 0.00 32.30
dnscache 0.00 0.00 0.00 0.00 0.00
cifs 79.18 98.78 76.04 0.00 254.00
wafl_exempt 196668.68 184589.57 138866.71 419604.67 939729.68
wafl_xcleaner 26908.44 20709.12 11898.34 0.00 59515.91
sm_exempt 17.39 21.78 20.45 0.00 59.63
cluster 0.00 0.00 0.00 0.00 0.00
protocol 56.14 44.78 63.87 0.00 164.81
nwk_exclusive 1015.94 942.87 786.67 0.00 2745.48
nwk_exempt 222122.75 234019.20 240687.06 0.00 696829.04
nwk_legacy 19608.98 61.80 49.34 0.00 19720.14
hostOS 3852.96 3982.20 3426.14 0.00 11261.32
73.083640 seconds with one or more CPUs active ( 98%)
59.972449 seconds with 2 or more CPUs active ( 80%)
42.233825 seconds with 3 or more CPUs active ( 56%)
13.111190 seconds with one CPU active ( 18%)
17.738624 seconds with 2 CPUs active ( 24%)
19.801098 seconds with 3 CPUs active ( 26%)
22.432726 seconds with all CPUs active ( 30%)
Domain Utilization of Shared Domains (per second)
0.00 idle 623762.03 kahuna
0.00 storage 0.00 exempt
0.00 raid 0.00 target
0.00 dnscache 0.00 cifs
0.00 wafl_exempt 0.00 wafl_xcleaner
0.00 sm_exempt 0.00 cluster
0.00 protocol 558570.79 nwk_exclusive
0.00 nwk_exempt 0.00 nwk_legacy
0.00 hostOS
Miscellaneous Statistics (per second)
141353.29 hard context switches 21695.50 NFS operations
3.47 CIFS operations 0.00 HTTP operations
36522.31 network KB received 19334.33 network KB transmitted
102048.50 disk KB read 138664.10 disk KB written
33106.94 NVRAM KB written 0.00 nolog KB written
3047.17 WAFL bufs given to clients 0.00 checksum cache hits ( 0%)
3022.37 no checksum - partial buffer 0.00 FCP operations
0.00 iSCSI operations
WAFL Statistics (per second)
2371.10 name cache hits ( 28%) 6106.66 name cache misses ( 72%)
341682.51 buf hash hits ( 82%) 76669.41 buf hash misses ( 18%)
89406.30 inode cache hits ( 94%) 5887.97 inode cache misses ( 6%)
73474.18 buf cache hits ( 98%) 1722.50 buf cache misses ( 2%)
442.71 blocks read 1719.07 blocks read-ahead
135.95 chains read-ahead 191.84 dummy reads
2228.51 blocks speculative read-ahead 14006.94 blocks written
40.80 stripes written 2.99 blocks page flipped
0.00 blocks over-written 0.00 wafl_timer generated CP
0.00 snapshot generated CP 0.00 wafl_avail_bufs generated CP
0.84 dirty_blk_cnt generated CP 0.00 full NV-log generated CP
0.01 back-to-back CP 0.00 flush generated CP
0.00 sync generated CP 0.00 deferred back-to-back CP
0.00 container-indirect-pin CP 0.00 low mbufs generated CP
0.00 low datavecs generated CP 54890.67 non-restart messages
849.34 IOWAIT suspends 18.03 next nvlog nearly full msecs
45.70 dirty buffer susp msecs 0.00 nvlog full susp msecs
578860 buffers
RAID Statistics (per second)
4056.07 xors 0.00 long dispatches [0]
0.00 long consumed [0] 0.00 long consumed hipri [0]
0.00 long low priority [0] 0.00 long high priority [0]
0.00 long monitor tics [0] 0.00 long monitor clears [0]
16360.88 long dispatches [1] 49522.63 long consumed [1]
49522.63 long consumed hipri [1] 0.00 long low priority [1]
96.63 long high priority [1] 96.65 long monitor tics [1]
0.01 long monitor clears [1] 18 max batch
61.30 blocked mode xor 539.87 timed mode xor
6.16 fast adjustments 4.74 slow adjustments
0 avg batch start 0 avg stripe/msec
713.72 checksum dispatches 5748.77 checksum consumed
45.62 tetrises written 0.00 master tetrises
0.00 slave tetrises 2226.47 stripes written
1818.54 partial stripes 407.93 full stripes
13773.87 blocks written 5631.73 blocks read
26.11 1 blocks per stripe size 7 14.50 2 blocks per stripe size 7
9.03 3 blocks per stripe size 7 4.66 4 blocks per stripe size 7
2.92 5 blocks per stripe size 7 1.44 6 blocks per stripe size 7
1.98 7 blocks per stripe size 7 95.70 1 blocks per stripe size 9
83.99 2 blocks per stripe size 9 92.81 3 blocks per stripe size 9
95.89 4 blocks per stripe size 9 103.15 5 blocks per stripe size 9
112.98 6 blocks per stripe size 9 128.50 7 blocks per stripe size 9
145.15 8 blocks per stripe size 9 190.50 9 blocks per stripe size 9
49.91 1 blocks per stripe size 10 40.49 2 blocks per stripe size 10
44.94 3 blocks per stripe size 10 54.42 4 blocks per stripe size 10
65.37 5 blocks per stripe size 10 86.60 6 blocks per stripe size 10
113.11 7 blocks per stripe size 10 153.50 8 blocks per stripe size 10
224.25 9 blocks per stripe size 10 215.38 10 blocks per stripe size 10
17.49 1 blocks per stripe size 12 15.38 2 blocks per stripe size 12
12.66 3 blocks per stripe size 12 8.33 4 blocks per stripe size 12
5.41 5 blocks per stripe size 12 3.75 6 blocks per stripe size 12
2.47 7 blocks per stripe size 12 1.47 8 blocks per stripe size 12
1.10 9 blocks per stripe size 12 0.79 10 blocks per stripe size 12
0.28 11 blocks per stripe size 12 0.07 12 blocks per stripe size 12
Network Interface Statistics (per second)
iface side bytes packets multicasts errors collisions pkt drops
e0a recv 10784077.30 11721.23 0.00 0.00 0.00
xmit 5945870.48 14076.35 0.16 0.00 0.00
e0b recv 582.12 7.39 0.00 0.00 0.00
xmit 6.73 0.16 0.16 0.00 0.00
e3a recv 3732557.54 5244.78 0.00 0.00 0.00
xmit 1057090.34 4103.55 0.16 0.00 0.00
e3b recv 5824929.83 10247.98 0.00 0.00 0.00
xmit 7795375.59 13947.39 0.16 0.00 0.00
e3c recv 17056065.72 21738.75 0.00 0.00 0.00
xmit 4999612.71 8512.65 0.16 0.00 0.00
e3d recv 0.00 0.00 0.00 0.00 0.00
xmit 0.00 0.00 0.00 0.00 0.00
c0a recv 208.33 15.80 1.10 0.00 0.00
xmit 203.68 1.10 1.07 0.00 0.00
c0b recv 208.33 1.07 1.10 0.00 0.00
xmit 203.68 1.10 1.07 0.00 0.00
e0M recv 227.43 1.60 1.60 0.00 0.00
xmit 0.00 0.00 0.00 0.00 0.00
e0P recv 0.00 0.00 0.00 0.00 0.00
xmit 0.00 0.00 0.00 0.00 0.00
vh recv 0.00 0.00 0.00 0.00 0.00
xmit 0.00 0.00 0.00 0.00 0.00
vifa recv 37397630.39 48952.74 0.00 0.00 0.00
xmit 19797949.11 40639.93 0.64 0.00 0.00
vif1 recv 37398212.51 48960.13 14.30 0.65 0.00
xmit 19797955.84 40640.09 0.80 0.00 0.00
Disk Statistics (per second)
ut% is the percent of time the disk was busy.
xfers is the number of data-transfer commands issued per second.
xfers = ureads + writes + cpreads + greads + gwrites
chain is the average number of 4K blocks per command.
usecs is the average disk round-trip time per 4K block.
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr2/plex0/rg0:
2b.02.0 17 22.28 1.72 1.00 7434 13.74 46.00 284 6.81 10.69 810 0.00 .... . 0.00 .... .
2b.02.1 22 24.00 1.72 1.00 6535 15.57 40.83 301 6.70 9.89 955 0.00 .... . 0.00 .... .
2b.02.2 56 86.90 31.04 2.37 9899 35.74 13.80 1552 20.13 6.39 2370 0.00 .... . 0.00 .... .
2b.02.3 49 81.33 27.35 2.50 9979 33.74 14.37 1574 20.25 6.67 2011 0.00 .... . 0.00 .... .
2b.02.4 47 80.68 26.46 2.85 7706 34.11 14.28 1510 20.11 6.72 1749 0.00 .... . 0.00 .... .
2b.02.5 47 79.32 24.57 2.83 8098 34.12 14.50 1504 20.62 6.29 1774 0.00 .... . 0.00 .... .
2b.02.6 47 79.94 24.61 3.03 7642 34.62 14.43 1538 20.71 6.11 2018 0.00 .... . 0.00 .... .
2b.02.7 49 84.73 27.65 2.81 7797 35.98 13.39 1704 21.10 6.75 1835 0.00 .... . 0.00 .... .
2b.02.8 48 80.87 25.67 2.72 8206 34.66 14.13 1617 20.54 6.53 1892 0.00 .... . 0.00 .... .
0a.03.23 42 81.52 26.78 2.66 8275 34.40 14.48 1034 20.34 6.06 1648 0.00 .... . 0.00 .... .
2b.02.10 47 77.01 24.17 2.88 7493 33.13 14.67 1541 19.70 6.93 1925 0.00 .... . 0.00 .... .
2b.02.11 44 71.09 24.51 2.54 8791 27.67 12.71 1757 18.91 13.82 793 0.00 .... . 0.00 .... .
/aggr2/plex0/rg1:
2b.02.12 18 36.94 0.00 .... . 19.67 31.97 485 17.27 13.01 402 0.00 .... . 0.00 .... .
2b.02.13 19 37.11 0.00 .... . 19.83 31.73 500 17.28 12.97 462 0.00 .... . 0.00 .... .
2b.02.14 46 82.07 23.95 2.85 7580 34.78 12.48 1627 23.34 7.92 1545 0.00 .... . 0.00 .... .
2b.02.15 47 85.13 24.57 2.64 8426 36.34 10.10 2012 24.21 7.74 1449 0.00 .... . 0.00 .... .
2b.02.16 47 84.50 24.63 2.73 7465 35.98 10.14 1984 23.89 8.03 1386 0.00 .... . 0.00 .... .
2b.02.17 48 85.10 24.83 2.47 8534 35.62 10.31 2004 24.65 7.19 1683 0.00 .... . 0.00 .... .
2b.02.18 47 83.22 24.40 2.71 8260 35.30 10.62 1892 23.52 7.60 1552 0.00 .... . 0.00 .... .
2b.02.19 47 83.62 23.97 2.65 8463 35.58 10.36 1983 24.07 7.53 1690 0.00 .... . 0.00 .... .
2b.02.20 47 85.03 25.04 2.74 7733 36.41 10.12 1981 23.59 7.91 1638 0.00 .... . 0.00 .... .
2b.02.21 46 87.10 26.26 2.94 6865 36.62 10.34 1959 24.23 7.47 1356 0.00 .... . 0.00 .... .
2b.02.22 46 84.02 24.31 3.02 7292 35.78 10.21 1942 23.93 7.54 1482 0.00 .... . 0.00 .... .
/aggr1/plex0/rg0:
2c.04.0 16 24.52 1.78 1.00 6774 13.73 30.51 429 9.01 10.95 671 0.00 .... . 0.00 .... .
2c.04.1 19 26.23 1.71 1.00 5148 15.53 27.20 443 8.99 10.79 633 0.00 .... . 0.00 .... .
2c.04.2 34 56.44 14.29 2.96 6150 26.44 10.72 1732 15.71 6.89 1813 0.00 .... . 0.00 .... .
2c.04.23 30 53.94 11.47 3.25 5656 25.86 10.57 1892 16.61 7.72 1182 0.00 .... . 0.00 .... .
2c.04.4 30 51.68 11.73 3.21 5751 24.89 11.56 1709 15.06 7.31 1336 0.00 .... . 0.00 .... .
2c.04.20 29 51.27 11.75 2.96 6148 24.09 11.58 1773 15.43 7.38 1523 0.00 .... . 0.00 .... .
2c.04.6 28 49.98 11.15 3.33 5418 23.73 11.69 1688 15.09 7.20 1319 0.00 .... . 0.00 .... .
2c.04.7 30 52.19 11.79 2.91 6652 24.76 11.37 1766 15.64 7.14 1406 0.00 .... . 0.00 .... .
2c.04.8 29 51.85 11.33 2.71 6658 24.88 10.69 1810 15.64 8.16 1206 0.00 .... . 0.00 .... .
2c.04.9 29 51.54 11.47 3.15 5685 24.40 11.31 1733 15.67 6.89 1356 0.00 .... . 0.00 .... .
2c.04.10 29 51.74 11.90 3.10 5939 24.35 11.29 1695 15.49 7.63 1234 0.00 .... . 0.00 .... .
2c.04.11 29 51.93 11.43 3.31 5689 24.69 11.01 1734 15.80 7.31 1452 0.00 .... . 0.00 .... .
/aggr1/plex0/rg1:
2c.04.12 13 23.61 0.00 .... . 12.67 33.32 433 10.94 10.18 539 0.00 .... . 0.00 .... .
2c.04.13 13 23.72 0.00 .... . 12.78 33.06 435 10.94 9.88 561 0.00 .... . 0.00 .... .
2c.04.14 32 55.89 10.90 2.65 7448 27.65 10.13 2051 17.35 6.88 1507 0.00 .... . 0.00 .... .
2c.04.15 29 54.33 11.11 3.17 5303 26.27 10.74 1774 16.95 6.72 1449 0.00 .... . 0.00 .... .
2c.04.16 30 53.57 10.87 2.77 7080 25.83 11.16 1694 16.87 6.67 1440 0.00 .... . 0.00 .... .
2c.04.17 30 55.36 11.04 2.87 6676 26.87 10.16 1857 17.44 7.03 1320 0.00 .... . 0.00 .... .
2c.04.18 31 56.41 11.86 2.70 6799 27.18 10.14 1873 17.38 6.67 1344 0.00 .... . 0.00 .... .
2c.04.19 29 54.98 10.43 3.01 5756 26.79 10.32 1774 17.76 6.93 1228 0.00 .... . 0.00 .... .
2c.04.22 29 55.70 10.52 2.91 6102 28.03 9.97 1885 17.15 7.28 1286 0.00 .... . 0.00 .... .
2c.04.21 29 54.52 10.87 2.99 5972 26.43 10.78 1700 17.21 6.62 1386 0.00 .... . 0.00 .... .
2c.04.3 30 56.61 11.61 2.81 6032 27.24 10.45 2191 17.76 6.73 1128 0.00 .... . 0.00 .... .
/aggr0/plex0/rg0:
0a.03.0 8 9.42 2.12 1.00 5346 5.02 14.02 446 2.27 24.16 289 0.00 .... . 0.00 .... .
0a.03.1 10 11.00 2.06 1.00 4364 6.73 10.97 468 2.22 24.21 355 0.00 .... . 0.00 .... .
0a.03.2 8 8.71 4.46 3.79 2299 2.72 7.53 551 1.52 13.25 896 0.00 .... . 0.00 .... .
0a.03.3 2 3.18 0.89 14.18 259 0.96 19.29 554 1.32 16.42 221 0.00 .... . 0.00 .... .
0a.03.4 2 3.13 0.92 14.72 232 0.91 19.87 548 1.30 14.25 232 0.00 .... . 0.00 .... .
0a.03.5 2 3.26 0.83 13.65 262 0.96 19.31 553 1.47 11.72 334 0.00 .... . 0.00 .... .
0a.03.6 2 3.31 0.96 13.61 303 0.92 18.86 618 1.43 13.10 287 0.00 .... . 0.00 .... .
0a.03.7 2 3.66 0.97 13.92 339 1.00 18.43 631 1.68 13.01 511 0.00 .... . 0.00 .... .
0a.03.8 2 3.34 0.96 13.63 112 0.95 18.85 568 1.43 14.21 516 0.00 .... . 0.00 .... .
0a.03.9 1 3.30 0.99 13.59 208 0.88 19.85 552 1.43 14.09 208 0.00 .... . 0.00 .... .
0a.03.10 2 3.29 0.79 13.76 273 0.99 18.26 512 1.51 15.76 292 0.00 .... . 0.00 .... .
0a.03.11 2 3.38 0.97 13.67 218 0.92 19.75 520 1.48 12.75 309 0.00 .... . 0.00 .... .
0a.03.12 2 3.19 0.96 14.39 234 0.88 20.91 480 1.35 15.11 326 0.00 .... . 0.00 .... .
0a.03.13 2 3.41 0.99 14.16 218 0.97 18.74 504 1.44 10.87 264 0.00 .... . 0.00 .... .
/aggr0/plex0/rg1:
0a.03.14 3 4.54 0.00 .... . 2.19 27.69 410 2.35 20.22 422 0.00 .... . 0.00 .... .
0a.03.15 3 4.54 0.00 .... . 2.19 27.69 401 2.35 20.16 453 0.00 .... . 0.00 .... .
2b.02.9 75 259.06 1.06 14.19 938 0.97 19.63 1003 1.38 15.28 898 255.66 64.00 67 0.00 .... .
0a.03.17 2 3.19 1.07 13.39 277 0.91 22.34 421 1.22 13.18 678 0.00 .... . 0.00 .... .
0a.03.18 2 3.37 1.16 13.44 318 0.81 24.31 436 1.39 14.66 628 0.00 .... . 0.00 .... .
0a.03.19 2 3.21 0.97 13.48 99 0.92 21.90 545 1.31 12.60 587 0.00 .... . 0.00 .... .
0a.03.20 1 3.43 1.04 13.37 164 0.97 19.99 421 1.42 13.93 173 0.00 .... . 0.00 .... .
0a.03.21 2 3.38 1.03 13.66 336 0.95 20.51 538 1.40 14.98 531 0.00 .... . 0.00 .... .
0a.03.22 2 3.30 1.03 13.69 269 0.83 23.69 385 1.44 14.70 673 0.00 .... . 0.00 .... .
Aggregate statistics:
Minimum 1 3.13 0.00 0.81 1.22 0.00 0.00
Mean 25 45.71 10.66 18.98 12.37 3.70 0.00
Maximum 75 259.06 31.04 36.62 24.65 255.66 0.00
FCP Statistics (per second)
0.00 FCP Bytes recv 0.00 FCP Bytes sent
0.00 FCP ops
iSCSI Statistics (per second)
0.00 iSCSI Bytes recv 0.00 iSCSI Bytes xmit
0.00 iSCSI ops
Interrupt Statistics (per second)
12.02 int_1 1451.95 PAM II Comp (IRQ 2)
815.46 int_3 1379.25 int_4
4940.31 int_5 6990.17 int_6
7285.64 int_7 0.23 int_9
0.16 int_10 412.33 int_11
6705.43 Gigabit Ethernet (IRQ 12) 6.36 Gigabit Ethernet (IRQ 13)
1.60 int_14 0.00 RTC
0.00 IPI 1000.04 Msec Clock
31000.94 total
Hello,
We had the same experience in the same context: FAS3240 with a 512 GB PAM + 8.1.
We see ANY1+ CPU staying at 99%, particularly when the filer has no real activity (no SnapMirror/SnapVault transfers, dedupe, WAFL scans, CIFS, NFS or iSCSI); the main domain using CPU is WAFL_Ex(Kahu).
In our case it seems to be related to a SnapVault bug: our filer acts as a secondary filer (SnapVault/SnapMirror destination).
sysstat -x 1
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
99% 0 0 0 24 95 5969 3846 0 0 0 45s 100% 0% - 40% 0 0 24 0 0 0 6223
99% 0 0 0 0 9 558 1322 0 0 0 45s 100% 0% - 30% 0 0 0 0 0 0 0
99% 0 0 0 31 159 8599 7485 0 0 0 45s 100% 0% - 36% 0 0 31 0 0 0 8184
priv set diag; sysstat -M 1
ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
99% 70% 4% 1% 44% 38% 37% 31% 69% 1% 0% 0% 2% 1% 0% 9% 157%( 90%) 0% 0% 0% 2% 1% 2% 8 0%
100% 77% 4% 1% 46% 44% 42% 32% 68% 1% 0% 0% 2% 2% 0% 5% 168%( 95%) 0% 0% 0% 2% 2% 2% 8 0%
100% 77% 5% 1% 47% 47% 40% 34% 66% 2% 0% 0% 2% 2% 0% 6% 169%( 94%) 0% 0% 0% 2% 1% 4% 30 0%
100% 78% 3% 1% 46% 45% 39% 33% 66% 1% 0% 0% 2% 1% 0% 5% 170%( 95%) 0% 0% 0% 2% 1% 1% 2 0%
100% 74% 7% 2% 46% 43% 34% 34% 73% 2% 0% 0% 3% 2% 0% 12% 155%( 88%) 3% 0% 0% 5% 2% 1% 24 11%
Are you using your filer as a SnapVault and/or SnapMirror destination?
The related NetApp bug is 568758, but it is currently under investigation and does not have any public details; this bug specifically concerns SnapVault/SnapMirror secondary (destination) filers, so if that is your case it could be the explanation.
Did you open a case with NetApp Support?
Hope this helps!
Regards,
Yannick,
Thanks for your comments.
Our filer is a primary one and we do not use it as a snap destination, but we do use it as a SnapMirror source.
Also, over the last few days the ANY CPU load has stabilized at 70-90%, so I have not opened a case with NetApp. Let us watch it for a few more days.
Anyway, your input on the case is appreciated.
Hi, just out of interest: if you enter priv set diag and type aggr status <aggr name> -v, can you see RLW_ON or RLW_Upgrading?
Also, if you type aggr scrub status -v, when was the last time your aggregates completed a full scrub?
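For anyone following along, the check would look roughly like this (aggr1 is just a placeholder aggregate name; remember to drop back to admin privilege afterwards):
filer> priv set diag
filer*> aggr status aggr1 -v
filer*> aggr scrub status -v
filer*> priv set admin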
Thanks David.
Actually, when we did the tech refresh from the 3040 to the 3240, the system was probably busy doing internal upgrade work related to WAFL, because the system is now showing CPU as expected.
When I look at aggr status -v in diag mode, I do not see anything like RLW_ON or RLW_Upgrading.
But if we had checked this during the issue, we might have seen it.
Anyway, thanks.
Hello,
In my situation I still see rlw_upgrading aggregates with "priv set diag; aggr status -v", but not on all aggregates.
My upgrade happened two months ago now, and aggr scrub status -v shows that the aggregates in "RLW_Upgrading" status have not completed their full scrub operations. So completing a full scrub after the DOT 8.1 upgrade seems to be what moves an aggregate to the "RLW_ON" status:
if the scrub has not fully completed on an aggregate after the DOT 8.1 upgrade, that aggregate stays in "RLW_Upgrading" status.
I'm not sure this is related to the CPU behavior we observe, but it's a lead to follow...
My case (high CPU utilization) is now closed with NetApp Support; we must now follow bug 568758.
Regards,
Yannick
I tried to look up bug ID 568758 but could not find it on the NetApp support site. Can you please double-check the bug ID?
Thanks
This is the link: http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=568758
but there is no explanation in the details.
Regards
I got the same result.
Can you please help me understand what options or configuration you changed, with help from NetApp Support, to fix the issue?
Our issue is not solved; the problem is still occurring for us. NetApp asked us to follow the bug resolution and to upgrade as soon as possible to the Data ONTAP version that will resolve the issue.
We still sometimes see ANY1+ CPU staying at 99% with no disk or network activity.
The ANY1+ counter is not your actual CPU utilization. You are better off using sysstat -M 1 to see the utilization of each CPU. ANY1+ is the percentage of time at least one CPU was busy, ANY2+ the percentage of time at least two CPUs were busy, and so on.
In our case, after the aggregates completed a full scrub, we noticed a drop in CPU utilization.
Also in our case, we have turned off all dedupe jobs for the time being. Previously, if a dedupe job kicked in while the scrub had not yet completed, the system was almost unusable.
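If it helps, the commands involved are along these lines (aggr1 and vol1 are placeholders; scrubs add their own I/O load, so they are best started off-peak, and sis on re-enables dedupe later):
filer> aggr scrub status -v
filer> aggr scrub start aggr1
filer> sis status
filer> sis off /vol/vol1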
Hope this helps you and others.
OK, now I get it.
So there is no resolution to date, and an update from NetApp R&D is expected that will resolve this.
Am I correct?
Yes, you are!
regards,
This sounds like a support-related question. If you have an active NetApp Support login, there are subject matter experts in the NetApp Support Community who may be able to help answer your questions.
If this is an urgent issue, please open a case with NetApp Technical Support.
Regards,
Christine