ONTAP Hardware

3240 filer with high CPU utilization

rajdeepsengupta
17,474 Views

My 3240 filer is showing very high CPU utilization. I have done a tech refresh from 3040 & 3020 to 3240 (ONTAP 7.3.6 to 8.1P1).

So other than the disk shelves, everything else is fresh in the 3240, including the PAM modules.

The new filer's performance is obviously better now.

But the new filer's CPU utilization is very high.

sysstat -M 1

ANY1+ ANY2+ ANY3+ ANY4+  AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host  Ops/s   CP

  90%   58%   31%   12%  50%  46%  46%  49%  62%     47%       0%      0%      6%  10%     0%    26%     77%( 54%)          7%        0%   0%    14%   9%   6%  13529  94%

  95%   69%   39%   16%  59%  55%  55%  59%  66%     56%       0%      0%      7%  14%     0%    24%     86%( 60%)          7%        0%   0%    20%  11%   9%  13489 100%

  88%   58%   29%   10%  50%  46%  47%  50%  56%     60%       0%      0%      5%  10%     0%    22%     67%( 50%)          4%        0%   0%    13%  11%   7%  17440  75%

  95%   65%   37%   17%  57%  50%  50%  57%  70%     52%       0%      0%      8%  14%     0%    32%     82%( 54%)          3%        0%   0%    17%  10%   8%  15628  75%

  87%   57%   29%    8%  56%  53%  55%  60%  56%     45%       0%      0%      8%  15%     0%    19%     69%( 51%)          6%        0%   0%    16%   9%  38%  13040  76%

Also, if I look at the WAFL scan status, it shows the following:

Volume vol0:

Scan id                   Type of scan     progress

       1    active bitmap rearrangement     fbn 791 of 3959 w/ max_chain_len 3

Volume vol1:

Scan id                   Type of scan     progress

       2    active bitmap rearrangement     fbn 1108 of 13474 w/ max_chain_len 3

Volume vol2:

Scan id                   Type of scan     progress

       3    active bitmap rearrangement     fbn 159 of 356 w/ max_chain_len 3

Volume vol3:

Scan id                   Type of scan     progress

       4    active bitmap rearrangement     fbn 311 of 356 w/ max_chain_len 3

----------------------------------------------------------------------------------------------------------------

Let me know if anyone can find a good reason for the high CPU utilization.

Thanks

15 REPLIES

dougsiggins
17,422 Views

Those numbers actually look decent. Are you looking at ANY1+ and thinking you have high CPU? That is not really a good indicator; I prefer to look at the individual CPU and AVG columns to determine whether the CPU is the bottleneck. From this information, it would seem your workload probably has a lot of writes. Can you pull a statit -b / statit -e (with a few minutes between them), as well as a few lines of sysstat -x 1? You could also look at volume latencies (via stats show volume:, or in OM/Performance Advisor).
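For reference, a minimal capture sequence along those lines, assuming 7-Mode advanced privilege and a few-minute sampling window (vol1 is a placeholder volume name):

priv set advanced
statit -b                         # begin collecting counters
(wait a few minutes under a representative load)
statit -e                         # end collection and print the report
sysstat -x 1                      # let a few one-second samples print, then Ctrl-C
stats show volume:vol1:read_latency volume:vol1:write_latency
priv set admin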

rajdeepsengupta
17,422 Views

filer2> sysstat -x 1

CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s

                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out

98%  20607      1      0   20608   17309  82445  134107 148711       0      0   >60     98%   93%  H    93%       0      0      0       0      0       0      0

99%  19381      1      0   19382   41222  23018  136308 142793       0      0     0s    99%   65%  Hf   97%       0      0      0       0      0       0      0

98%  18134      0      0   18140   16738  20368  136415 158128       0      0    17s    98%   97%  Hf   97%       6      0      0       0      0       0      0

97%  16471      0      0   16471   36261  21444  120937 134514       0      0    17s    99%   93%  Hf   99%       0      0      0       0      0       0      0

statit -e

Hostname: filer2  ID: 1574797944  Memory: 5958 MB

  NetApp Release 8.1P1 7-Mode: Wed Apr 25 23:47:02 PDT 2012

    <1O>

  Start time: Wed Jun 13 14:12:19 IST 2012

                       CPU Statistics

      74.881299 time (seconds)       100 %

     201.480400 system time          269 %

       8.298362 rupt time             11 %   (2845179 rupts x 3 usec/rupt)

     193.182038 non-rupt system time 258 %

      98.044796 idle time            131 %

      57.735933 time in CP            77 %   100 %

       6.278167 rupt time in CP               11 %   (2178820 rupts x 3 usec/rupt)

                       Multiprocessor Statistics (per second)

                          cpu0       cpu1       cpu2       cpu3      total

sk switches           81770.60   83865.64   74345.36   61023.43  301005.03

hard switches         34613.49   40316.34   41225.01   25198.45  141353.29

domain switches        2342.83     612.98     418.41     320.28    3694.50

CP rupts              13296.07    2762.87    6259.63    6778.42   29096.98

nonCP rupts            3930.97     617.84    2132.66    2217.40    8898.87

IPI rupts                 0.00       0.00       0.00       0.00       0.00

grab kahuna               0.00       0.00       0.04       0.00       0.04

grab kahuna usec          0.00       0.00       4.15       0.00       4.15

CP rupt usec          40250.74    1844.35   17245.57   24500.90   83841.59

nonCP rupt usec       12524.45     363.86    5938.08    8152.22   26978.62

idle                 369294.66  368234.77  322539.00  249267.76 1309336.22

kahuna                    0.00       0.19       0.00  298462.26  298462.45

storage                  99.04   80015.43     179.95       0.00   80294.43

exempt               107146.65  104735.28   62647.36      12.14  274541.44

raid                    345.04     325.64  195562.15       0.00  196232.84

target                    8.87      10.28      13.14       0.00      32.30

dnscache                  0.00       0.00       0.00       0.00       0.00

cifs                     79.18      98.78      76.04       0.00     254.00

wafl_exempt          196668.68  184589.57  138866.71  419604.67  939729.68

wafl_xcleaner         26908.44   20709.12   11898.34       0.00   59515.91

sm_exempt                17.39      21.78      20.45       0.00      59.63

cluster                   0.00       0.00       0.00       0.00       0.00

protocol                 56.14      44.78      63.87       0.00     164.81

nwk_exclusive          1015.94     942.87     786.67       0.00    2745.48

nwk_exempt           222122.75  234019.20  240687.06       0.00  696829.04

nwk_legacy            19608.98      61.80      49.34       0.00   19720.14

hostOS                 3852.96    3982.20    3426.14       0.00   11261.32

      73.083640 seconds with one or more CPUs active   ( 98%)

      59.972449 seconds with 2 or more CPUs active     ( 80%)

      42.233825 seconds with 3 or more CPUs active     ( 56%)

      13.111190 seconds with one CPU active            ( 18%)

      17.738624 seconds with 2 CPUs active             ( 24%)

      19.801098 seconds with 3 CPUs active             ( 26%)

      22.432726 seconds with all CPUs active           ( 30%)

                       Domain Utilization of Shared Domains (per second)

      0.00 idle                         623762.03 kahuna

      0.00 storage                           0.00 exempt

      0.00 raid                              0.00 target

      0.00 dnscache                          0.00 cifs

      0.00 wafl_exempt                       0.00 wafl_xcleaner

      0.00 sm_exempt                         0.00 cluster

      0.00 protocol                     558570.79 nwk_exclusive

      0.00 nwk_exempt                        0.00 nwk_legacy

      0.00 hostOS

                       Miscellaneous Statistics (per second)

141353.29 hard context switches         21695.50 NFS operations

      3.47 CIFS operations                   0.00 HTTP operations

  36522.31 network KB received           19334.33 network KB transmitted

102048.50 disk KB read                 138664.10 disk KB written

  33106.94 NVRAM KB written                  0.00 nolog KB written

   3047.17 WAFL bufs given to clients        0.00 checksum cache hits  (   0%)

   3022.37 no checksum - partial buffer      0.00 FCP operations

      0.00 iSCSI operations

                       WAFL Statistics (per second)

   2371.10 name cache hits      (  28%)   6106.66 name cache misses    (  72%)

341682.51 buf hash hits        (  82%)  76669.41 buf hash misses      (  18%)

  89406.30 inode cache hits     (  94%)   5887.97 inode cache misses   (   6%)

  73474.18 buf cache hits       (  98%)   1722.50 buf cache misses     (   2%)

    442.71 blocks read                    1719.07 blocks read-ahead

    135.95 chains read-ahead               191.84 dummy reads

   2228.51 blocks speculative read-ahead  14006.94 blocks written

     40.80 stripes written                   2.99 blocks page flipped

      0.00 blocks over-written               0.00 wafl_timer generated CP

      0.00 snapshot generated CP             0.00 wafl_avail_bufs generated CP

      0.84 dirty_blk_cnt generated CP        0.00 full NV-log generated CP

      0.01 back-to-back CP                   0.00 flush generated CP

      0.00 sync generated CP                 0.00 deferred back-to-back CP

      0.00 container-indirect-pin CP         0.00 low mbufs generated CP

      0.00 low datavecs generated CP     54890.67 non-restart messages

    849.34 IOWAIT suspends                  18.03 next nvlog nearly full msecs

     45.70 dirty buffer susp msecs           0.00 nvlog full susp msecs

    578860 buffers

                       RAID Statistics (per second)

   4056.07 xors                              0.00 long dispatches [0]

      0.00 long consumed [0]                 0.00 long consumed hipri [0]

      0.00 long low priority [0]             0.00 long high priority [0]

      0.00 long monitor tics [0]             0.00 long monitor clears [0]

  16360.88 long dispatches [1]           49522.63 long consumed [1]

  49522.63 long consumed hipri [1]           0.00 long low priority [1]

     96.63 long high priority [1]           96.65 long monitor tics [1]

      0.01 long monitor clears [1]             18 max batch

     61.30 blocked mode xor                539.87 timed mode xor

      6.16 fast adjustments                  4.74 slow adjustments

         0 avg batch start                      0 avg stripe/msec

    713.72 checksum dispatches            5748.77 checksum consumed

     45.62 tetrises written                  0.00 master tetrises

      0.00 slave tetrises                 2226.47 stripes written

   1818.54 partial stripes                 407.93 full stripes

  13773.87 blocks written                 5631.73 blocks read

     26.11 1 blocks per stripe size 7       14.50 2 blocks per stripe size 7

      9.03 3 blocks per stripe size 7        4.66 4 blocks per stripe size 7

      2.92 5 blocks per stripe size 7        1.44 6 blocks per stripe size 7

      1.98 7 blocks per stripe size 7       95.70 1 blocks per stripe size 9

     83.99 2 blocks per stripe size 9       92.81 3 blocks per stripe size 9

     95.89 4 blocks per stripe size 9      103.15 5 blocks per stripe size 9

    112.98 6 blocks per stripe size 9      128.50 7 blocks per stripe size 9

    145.15 8 blocks per stripe size 9      190.50 9 blocks per stripe size 9

     49.91 1 blocks per stripe size 10      40.49 2 blocks per stripe size 10

     44.94 3 blocks per stripe size 10      54.42 4 blocks per stripe size 10

     65.37 5 blocks per stripe size 10      86.60 6 blocks per stripe size 10

    113.11 7 blocks per stripe size 10     153.50 8 blocks per stripe size 10

    224.25 9 blocks per stripe size 10     215.38 10 blocks per stripe size 10

     17.49 1 blocks per stripe size 12      15.38 2 blocks per stripe size 12

     12.66 3 blocks per stripe size 12       8.33 4 blocks per stripe size 12

      5.41 5 blocks per stripe size 12       3.75 6 blocks per stripe size 12

      2.47 7 blocks per stripe size 12       1.47 8 blocks per stripe size 12

      1.10 9 blocks per stripe size 12       0.79 10 blocks per stripe size 12

      0.28 11 blocks per stripe size 12      0.07 12 blocks per stripe size 12

                       Network Interface Statistics (per second)

iface    side      bytes    packets multicasts     errors collisions  pkt drops

e0a      recv 10784077.30   11721.23       0.00       0.00                  0.00

         xmit 5945870.48   14076.35       0.16       0.00       0.00

e0b      recv     582.12       7.39       0.00       0.00                  0.00

         xmit       6.73       0.16       0.16       0.00       0.00

e3a      recv 3732557.54    5244.78       0.00       0.00                  0.00

         xmit 1057090.34    4103.55       0.16       0.00       0.00

e3b      recv 5824929.83   10247.98       0.00       0.00                  0.00

         xmit 7795375.59   13947.39       0.16       0.00       0.00

e3c      recv 17056065.72   21738.75       0.00       0.00                  0.00

         xmit 4999612.71    8512.65       0.16       0.00       0.00

e3d      recv       0.00       0.00       0.00       0.00                  0.00

         xmit       0.00       0.00       0.00       0.00       0.00

c0a      recv     208.33      15.80       1.10       0.00                  0.00

         xmit     203.68       1.10       1.07       0.00       0.00

c0b      recv     208.33       1.07       1.10       0.00                  0.00

         xmit     203.68       1.10       1.07       0.00       0.00

e0M      recv     227.43       1.60       1.60       0.00                  0.00

         xmit       0.00       0.00       0.00       0.00       0.00

e0P      recv       0.00       0.00       0.00       0.00                  0.00

         xmit       0.00       0.00       0.00       0.00       0.00

vh       recv       0.00       0.00       0.00       0.00                  0.00

         xmit       0.00       0.00       0.00       0.00       0.00

vifa     recv 37397630.39   48952.74       0.00       0.00                  0.00

         xmit 19797949.11   40639.93       0.64       0.00       0.00

vif1     recv 37398212.51   48960.13      14.30       0.65                  0.00

         xmit 19797955.84   40640.09       0.80       0.00       0.00

                       Disk Statistics (per second)

        ut% is the percent of time the disk was busy.

        xfers is the number of data-transfer commands issued per second.

        xfers = ureads + writes + cpreads + greads + gwrites

        chain is the average number of 4K blocks per command.

        usecs is the average disk round-trip time per 4K block.

disk             ut%  xfers  ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs

/aggr2/plex0/rg0:

2b.02.0           17  22.28    1.72   1.00  7434  13.74  46.00   284   6.81  10.69   810   0.00   ....     .   0.00   ....     .

2b.02.1           22  24.00    1.72   1.00  6535  15.57  40.83   301   6.70   9.89   955   0.00   ....     .   0.00   ....     .

2b.02.2           56  86.90   31.04   2.37  9899  35.74  13.80  1552  20.13   6.39  2370   0.00   ....     .   0.00   ....     .

2b.02.3           49  81.33   27.35   2.50  9979  33.74  14.37  1574  20.25   6.67  2011   0.00   ....     .   0.00   ....     .

2b.02.4           47  80.68   26.46   2.85  7706  34.11  14.28  1510  20.11   6.72  1749   0.00   ....     .   0.00   ....     .

2b.02.5           47  79.32   24.57   2.83  8098  34.12  14.50  1504  20.62   6.29  1774   0.00   ....     .   0.00   ....     .

2b.02.6           47  79.94   24.61   3.03  7642  34.62  14.43  1538  20.71   6.11  2018   0.00   ....     .   0.00   ....     .

2b.02.7           49  84.73   27.65   2.81  7797  35.98  13.39  1704  21.10   6.75  1835   0.00   ....     .   0.00   ....     .

2b.02.8           48  80.87   25.67   2.72  8206  34.66  14.13  1617  20.54   6.53  1892   0.00   ....     .   0.00   ....     .

0a.03.23          42  81.52   26.78   2.66  8275  34.40  14.48  1034  20.34   6.06  1648   0.00   ....     .   0.00   ....     .

2b.02.10          47  77.01   24.17   2.88  7493  33.13  14.67  1541  19.70   6.93  1925   0.00   ....     .   0.00   ....     .

2b.02.11          44  71.09   24.51   2.54  8791  27.67  12.71  1757  18.91  13.82   793   0.00   ....     .   0.00   ....     .

/aggr2/plex0/rg1:

2b.02.12          18  36.94    0.00   ....     .  19.67  31.97   485  17.27  13.01   402   0.00   ....     .   0.00   ....     .

2b.02.13          19  37.11    0.00   ....     .  19.83  31.73   500  17.28  12.97   462   0.00   ....     .   0.00   ....     .

2b.02.14          46  82.07   23.95   2.85  7580  34.78  12.48  1627  23.34   7.92  1545   0.00   ....     .   0.00   ....     .

2b.02.15          47  85.13   24.57   2.64  8426  36.34  10.10  2012  24.21   7.74  1449   0.00   ....     .   0.00   ....     .

2b.02.16          47  84.50   24.63   2.73  7465  35.98  10.14  1984  23.89   8.03  1386   0.00   ....     .   0.00   ....     .

2b.02.17          48  85.10   24.83   2.47  8534  35.62  10.31  2004  24.65   7.19  1683   0.00   ....     .   0.00   ....     .

2b.02.18          47  83.22   24.40   2.71  8260  35.30  10.62  1892  23.52   7.60  1552   0.00   ....     .   0.00   ....     .

2b.02.19          47  83.62   23.97   2.65  8463  35.58  10.36  1983  24.07   7.53  1690   0.00   ....     .   0.00   ....     .

2b.02.20          47  85.03   25.04   2.74  7733  36.41  10.12  1981  23.59   7.91  1638   0.00   ....     .   0.00   ....     .

2b.02.21          46  87.10   26.26   2.94  6865  36.62  10.34  1959  24.23   7.47  1356   0.00   ....     .   0.00   ....     .

2b.02.22          46  84.02   24.31   3.02  7292  35.78  10.21  1942  23.93   7.54  1482   0.00   ....     .   0.00   ....     .

/aggr1/plex0/rg0:

2c.04.0           16  24.52    1.78   1.00  6774  13.73  30.51   429   9.01  10.95   671   0.00   ....     .   0.00   ....     .

2c.04.1           19  26.23    1.71   1.00  5148  15.53  27.20   443   8.99  10.79   633   0.00   ....     .   0.00   ....     .

2c.04.2           34  56.44   14.29   2.96  6150  26.44  10.72  1732  15.71   6.89  1813   0.00   ....     .   0.00   ....     .

2c.04.23          30  53.94   11.47   3.25  5656  25.86  10.57  1892  16.61   7.72  1182   0.00   ....     .   0.00   ....     .

2c.04.4           30  51.68   11.73   3.21  5751  24.89  11.56  1709  15.06   7.31  1336   0.00   ....     .   0.00   ....     .

2c.04.20          29  51.27   11.75   2.96  6148  24.09  11.58  1773  15.43   7.38  1523   0.00   ....     .   0.00   ....     .

2c.04.6           28  49.98   11.15   3.33  5418  23.73  11.69  1688  15.09   7.20  1319   0.00   ....     .   0.00   ....     .

2c.04.7           30  52.19   11.79   2.91  6652  24.76  11.37  1766  15.64   7.14  1406   0.00   ....     .   0.00   ....     .

2c.04.8           29  51.85   11.33   2.71  6658  24.88  10.69  1810  15.64   8.16  1206   0.00   ....     .   0.00   ....     .

2c.04.9           29  51.54   11.47   3.15  5685  24.40  11.31  1733  15.67   6.89  1356   0.00   ....     .   0.00   ....     .

2c.04.10          29  51.74   11.90   3.10  5939  24.35  11.29  1695  15.49   7.63  1234   0.00   ....     .   0.00   ....     .

2c.04.11          29  51.93   11.43   3.31  5689  24.69  11.01  1734  15.80   7.31  1452   0.00   ....     .   0.00   ....     .

/aggr1/plex0/rg1:

2c.04.12          13  23.61    0.00   ....     .  12.67  33.32   433  10.94  10.18   539   0.00   ....     .   0.00   ....     .

2c.04.13          13  23.72    0.00   ....     .  12.78  33.06   435  10.94   9.88   561   0.00   ....     .   0.00   ....     .

2c.04.14          32  55.89   10.90   2.65  7448  27.65  10.13  2051  17.35   6.88  1507   0.00   ....     .   0.00   ....     .

2c.04.15          29  54.33   11.11   3.17  5303  26.27  10.74  1774  16.95   6.72  1449   0.00   ....     .   0.00   ....     .

2c.04.16          30  53.57   10.87   2.77  7080  25.83  11.16  1694  16.87   6.67  1440   0.00   ....     .   0.00   ....     .

2c.04.17          30  55.36   11.04   2.87  6676  26.87  10.16  1857  17.44   7.03  1320   0.00   ....     .   0.00   ....     .

2c.04.18          31  56.41   11.86   2.70  6799  27.18  10.14  1873  17.38   6.67  1344   0.00   ....     .   0.00   ....     .

2c.04.19          29  54.98   10.43   3.01  5756  26.79  10.32  1774  17.76   6.93  1228   0.00   ....     .   0.00   ....     .

2c.04.22          29  55.70   10.52   2.91  6102  28.03   9.97  1885  17.15   7.28  1286   0.00   ....     .   0.00   ....     .

2c.04.21          29  54.52   10.87   2.99  5972  26.43  10.78  1700  17.21   6.62  1386   0.00   ....     .   0.00   ....     .

2c.04.3           30  56.61   11.61   2.81  6032  27.24  10.45  2191  17.76   6.73  1128   0.00   ....     .   0.00   ....     .

/aggr0/plex0/rg0:

0a.03.0            8   9.42    2.12   1.00  5346   5.02  14.02   446   2.27  24.16   289   0.00   ....     .   0.00   ....     .

0a.03.1           10  11.00    2.06   1.00  4364   6.73  10.97   468   2.22  24.21   355   0.00   ....     .   0.00   ....     .

0a.03.2            8   8.71    4.46   3.79  2299   2.72   7.53   551   1.52  13.25   896   0.00   ....     .   0.00   ....     .

0a.03.3            2   3.18    0.89  14.18   259   0.96  19.29   554   1.32  16.42   221   0.00   ....     .   0.00   ....     .

0a.03.4            2   3.13    0.92  14.72   232   0.91  19.87   548   1.30  14.25   232   0.00   ....     .   0.00   ....     .

0a.03.5            2   3.26    0.83  13.65   262   0.96  19.31   553   1.47  11.72   334   0.00   ....     .   0.00   ....     .

0a.03.6            2   3.31    0.96  13.61   303   0.92  18.86   618   1.43  13.10   287   0.00   ....     .   0.00   ....     .

0a.03.7            2   3.66    0.97  13.92   339   1.00  18.43   631   1.68  13.01   511   0.00   ....     .   0.00   ....     .

0a.03.8            2   3.34    0.96  13.63   112   0.95  18.85   568   1.43  14.21   516   0.00   ....     .   0.00   ....     .

0a.03.9            1   3.30    0.99  13.59   208   0.88  19.85   552   1.43  14.09   208   0.00   ....     .   0.00   ....     .

0a.03.10           2   3.29    0.79  13.76   273   0.99  18.26   512   1.51  15.76   292   0.00   ....     .   0.00   ....     .

0a.03.11           2   3.38    0.97  13.67   218   0.92  19.75   520   1.48  12.75   309   0.00   ....     .   0.00   ....     .

0a.03.12           2   3.19    0.96  14.39   234   0.88  20.91   480   1.35  15.11   326   0.00   ....     .   0.00   ....     .

0a.03.13           2   3.41    0.99  14.16   218   0.97  18.74   504   1.44  10.87   264   0.00   ....     .   0.00   ....     .

/aggr0/plex0/rg1:

0a.03.14           3   4.54    0.00   ....     .   2.19  27.69   410   2.35  20.22   422   0.00   ....     .   0.00   ....     .

0a.03.15           3   4.54    0.00   ....     .   2.19  27.69   401   2.35  20.16   453   0.00   ....     .   0.00   ....     .

2b.02.9           75 259.06    1.06  14.19   938   0.97  19.63  1003   1.38  15.28   898 255.66  64.00    67   0.00   ....     .

0a.03.17           2   3.19    1.07  13.39   277   0.91  22.34   421   1.22  13.18   678   0.00   ....     .   0.00   ....     .

0a.03.18           2   3.37    1.16  13.44   318   0.81  24.31   436   1.39  14.66   628   0.00   ....     .   0.00   ....     .

0a.03.19           2   3.21    0.97  13.48    99   0.92  21.90   545   1.31  12.60   587   0.00   ....     .   0.00   ....     .

0a.03.20           1   3.43    1.04  13.37   164   0.97  19.99   421   1.42  13.93   173   0.00   ....     .   0.00   ....     .

0a.03.21           2   3.38    1.03  13.66   336   0.95  20.51   538   1.40  14.98   531   0.00   ....     .   0.00   ....     .

0a.03.22           2   3.30    1.03  13.69   269   0.83  23.69   385   1.44  14.70   673   0.00   ....     .   0.00   ....     .

Aggregate statistics:

Minimum            1   3.13    0.00                0.81                1.22                0.00                0.00     

Mean              25  45.71   10.66               18.98               12.37                3.70                0.00     

Maximum           75 259.06   31.04               36.62               24.65              255.66                0.00     

                       FCP Statistics (per second)

      0.00 FCP Bytes recv                    0.00 FCP Bytes sent

      0.00 FCP ops

                       iSCSI Statistics (per second)

      0.00 iSCSI Bytes recv                  0.00 iSCSI Bytes xmit

      0.00 iSCSI ops

                       Interrupt Statistics (per second)

     12.02 int_1                          1451.95 PAM II Comp (IRQ 2)

    815.46 int_3                          1379.25 int_4

   4940.31 int_5                          6990.17 int_6

   7285.64 int_7                             0.23 int_9

      0.16 int_10                          412.33 int_11

   6705.43 Gigabit Ethernet (IRQ 12)         6.36 Gigabit Ethernet (IRQ 13)

      1.60 int_14                            0.00 RTC

      0.00 IPI                            1000.04 Msec Clock

  31000.94 total

lafoucrier
17,422 Views

Hello,

We had the same experience in the same context: FAS3240 with 512 GB PAM + 8.1.

We see ANY1+ CPU staying at 99%, particularly when the filer has no real activity (I mean: no SnapMirror/SnapVault transfers, dedupe, WAFL scans, CIFS, NFS, or iSCSI); the main process using CPU is WAFL_Ex(Kahu).

In our case it seems to be related to a SnapVault bug: our filer acts as a secondary filer (SnapVault/SnapMirror destination).

sysstat -x 1

CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s

                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out

99%      0      0      0      24      95   5969    3846      0       0      0    45s   100%    0%  -    40%       0      0     24       0      0       0   6223

99%      0      0      0       0       9    558    1322      0       0      0    45s   100%    0%  -    30%       0      0      0       0      0       0      0

99%      0      0      0      31     159   8599    7485      0       0      0    45s   100%    0%  -    36%       0      0     31       0      0       0   8184

priv set diag; sysstat -M 1

ANY1+ ANY2+ ANY3+ ANY4+  AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host  Ops/s   CP

  99%   70%    4%    1%  44%  38%  37%  31%  69%      1%       0%      0%      2%   1%     0%     9%    157%( 90%)          0%        0%   0%     2%   1%   2%      8   0%

100%   77%    4%    1%  46%  44%  42%  32%  68%      1%       0%      0%      2%   2%     0%     5%    168%( 95%)          0%        0%   0%     2%   2%   2%      8   0%

100%   77%    5%    1%  47%  47%  40%  34%  66%      2%       0%      0%      2%   2%     0%     6%    169%( 94%)          0%        0%   0%     2%   1%   4%     30   0%

100%   78%    3%    1%  46%  45%  39%  33%  66%      1%       0%      0%      2%   1%     0%     5%    170%( 95%)          0%        0%   0%     2%   1%   1%      2   0%

100%   74%    7%    2%  46%  43%  34%  34%  73%      2%       0%      0%      3%   2%     0%    12%    155%( 88%)          3%        0%   0%     5%   2%   1%     24  11%

Are you using your filer as a SnapVault and/or SnapMirror destination?

The related NetApp bug is 568758, but it is currently in research and does not have any comments yet. This bug is particularly related to SnapVault/SnapMirror secondary (destination) filers; if that is your case, it could be the explanation.

Did you open a case with NetApp Support?
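To check quickly, the 7-Mode relationship listings show whether a filer appears as a destination (a sketch; column layout varies slightly by release):

snapmirror status                 # relationships where this filer is source or destination
snapvault status                  # likewise for SnapVault relationships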

Hope this helps!

Regards,

Yannick,

rajdeepsengupta
17,422 Views

Thanks for your comments.

 
Our filer is a primary one, and we do not use it as a snap destination. But we do use it as a SnapMirror source.
Also, I have seen that for the last few days the ANY CPU load has stabilized at 70-90%, so I have not opened a case with NetApp. Let us watch it for a few more days.
Anyway, your input on the case is appreciated.

davidrnexon
17,423 Views

Hi, just out of interest: if you enter priv set diag and type aggr status <aggr number> -v, can you see RLW_ON or RLW_Upgrading?

Also, if you type aggr scrub status -v, when was the last time your aggregates completed a full scrub?
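A minimal sketch of both checks (aggr1 is a placeholder aggregate name):

priv set diag
aggr status aggr1 -v              # look for rlw_on or rlw_upgrading among the options
aggr scrub status aggr1 -v        # shows the last full scrub completion per RAID group
priv set admin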

rajdeepsengupta
17,422 Views

Thanks David..

Actually, when we did the tech refresh from 3040 to 3240, the system might have been busy doing internal upgrade work related to WAFL, because now the system is showing CPU as expected.

So when I looked at aggr status -v in diag mode, I did not see anything like RLW_ON or RLW_Upgrading.

But if we had checked this during the issue, we might have seen it.

Anyway, thanks.

lafoucrier
17,422 Views

Hello,

In my situation I still see rlw_upgrading aggregates using "priv set diag; aggr status -v", but not on all of them.

My upgrade occurred 2 months ago now, and at this time the aggr scrub status -v command shows that the aggregates in "RLW_upgrading" status have not completed their full scrub operations. So a complete scrub after the DOT 8.1 upgrade seems to be what moves an aggregate to "RLW_ON" status:

if the scrub is not totally complete on an aggregate after the DOT 8.1 upgrade, that aggregate will stay in "RLW_Upgrading" status.

I'm not sure this is related to the CPU behavior we observe, but it's a lead to follow.
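If you want to push a lagging scrub to completion rather than wait for the scheduled run, a manual start is possible; a sketch (aggr1 is a placeholder, and raid.scrub.perf_impact trades scrub speed against client latency):

aggr scrub start aggr1                    # kick off a full scrub of the aggregate
options raid.scrub.perf_impact medium     # optional: low (default) | medium | high
aggr scrub status aggr1 -v                # watch progress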

My case (high CPU utilization) with NetApp Support is now closed; we must now follow bug 568758.

Regards,

Yannick

rajdeepsengupta
17,422 Views

I tried to look up bug ID 568758, but could not find it on the NetApp support site. Can you please check the bug ID?

Thanks

lafoucrier
9,863 Views

This is the link: http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=568758

But there is no explanation in the detail.

Regards

rajdeepsengupta
9,863 Views

I also got the same.

Can you please help me understand what options or config you changed, with help from NetApp Support, to fix the issue?

lafoucrier
9,863 Views

Our issue is not solved; the problem is still occurring for us. NetApp asked us to follow the bug resolution and upgrade, as soon as possible, to the Data ONTAP version that will resolve the issue.

We still sometimes see ANY1+ CPU staying at 99% with no disk or network activity.

davidrnexon
9,864 Views

The ANY1+ counter is not your actual CPU utilization. You are better off using sysstat -M 1 to see the utilization on each CPU. ANY1+ is the percentage of time at least one of your CPUs was busy, ANY2+ the percentage of time two or more CPUs were busy, and so on. On a 4-CPU system like this, ANY1+ at 99% with a much lower AVG just means some CPU was almost always doing something, not that the system is out of CPU headroom.

In our case, after the aggregates completed a full scrub, we noticed a drop in CPU utilization.

Also in our case, we have turned off all dedupe jobs for the time being. Previously, if a dedupe job kicked in while the scrub had not completed, the system was almost unusable.
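For anyone wanting to do the same, a sketch of checking and pausing dedupe in 7-Mode (/vol/vol1 is a placeholder path):

sis status                        # list dedupe-enabled volumes and their current state
sis config                        # show the dedupe schedule for each volume
sis config -s - /vol/vol1         # a schedule of "-" disables the automatic job
sis stop /vol/vol1                # abort a dedupe operation that is already running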

Hope this helps you and others.

rajdeepsengupta
9,863 Views

OK, now I got it.

So there is no resolution to date, and an update from NetApp R&D is expected that will resolve this.

Am I correct?

lafoucrier
9,863 Views

Yes, you are!

Regards,

christin
17,423 Views

This sounds like a support-related question. If you have an active NetApp Support login, there are subject matter experts in the NetApp Support Community who may help answer your questions.

If this is an urgent issue, please open a case with NetApp Technical Support.

Regards,

Christine
