Network and Storage Protocols
Hi.
I need some help reading the output of a statit, to see if I can identify any problem on our array. It's a FAS2040 running Data ONTAP 7.3.4, with two shelves of 24 disks each, one SAS and one SATA, and two controllers with 12 disks of each type. On the SAS disks I have some LUNs presented over FCP to a few hosts, plus iSCSI LUNs for an ESX environment. On the SATA disks I have NAS shares served via CIFS.
The problem we have is that the NDMP backup window for the NAS shares (approx. 3.8TB) takes more than 31 hours; we get a transfer rate of roughly 100GB/h, which is very low. I've been watching the environment and had an open ticket with NetApp, and the only thing that came out of it was that we have very few disks (RAID-DP with 9 data disks, two parity disks and a spare). Support told us that increasing the number of disks could increase the transfer rate, and I'm trying to get an explanation for this.
I've been taking performance samples with statit for several different scenarios, and could use some help reading one of them and, if possible, an explanation of the disk bottleneck we suspect we have.
I've attached the output, along with some sysstat samples.
Thanks for the help.
JVT
I've had a quick glance at the attached file...
- I cannot see any tape activity in the sysstat or statit output; is the NDMP backup going directly to tape or is it a 3-way backup?
- The SATA disks are 20% loaded, so there is still some headroom before I'd say you need more spindles to get more throughput.
- How much throughput do you get when reading from one of the CIFS shares? You should get at least the same when running the NDMP backup.
- The sysstat hardly shows any activity; what was running while it was captured?
Peter
I have attached another statit output. I'm trying to get one more while the backup is actually dumping data; this one was taken right after launching the backup, so it must be reading inodes, building the backup catalog, etc., before dumping to the VTL we are using. What I see here is that only about 48% of the stripes written are full stripes, which I think is very low, and the partial/full stripe ratio is about 1.07. The cpreads/writes ratio is also over 1.2 on several disks in the SATA aggregate. What bothers me most, though, is that the array is only reading at about 173KB/sec during this period (it is almost doing nothing there). Could it be that the data is very fragmented on the disks?
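For reference, this is roughly how those stripe figures are derived from the "RAID Statistics (per second)" counters in statit. A minimal Python sketch of the calculation follows; since the attachment isn't reproduced here, the example values are taken from the statit pasted later in this thread.
----------------------------------------------------------------------------------
# Sketch: derive stripe-efficiency figures from statit RAID Statistics.
# Example values are from the statit pasted later in this thread
# (267.70 full stripes/s, 70.67 partial stripes/s), not from the attachment.

def stripe_efficiency(full_stripes: float, partial_stripes: float) -> None:
    """Print the full-stripe percentage and the partial/full ratio."""
    total = full_stripes + partial_stripes
    print(f"full stripes : {100.0 * full_stripes / total:.1f}% of stripes written")
    print(f"partial/full : {partial_stripes / full_stripes:.2f}")

stripe_efficiency(full_stripes=267.70, partial_stripes=70.67)
# -> full stripes : 79.1% of stripes written
# -> partial/full : 0.26
----------------------------------------------------------------------------------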
Let's have a look at one where the backup IS running (transferring data). In this one the SATA disks were busier than before, but still not maxed out.
WAFL is a "fragmented" filesystem and in most circumstances has no issues with "fragmentation" (unlike traditional, older filesystems like UFS or NTFS).
I have read several posts about fragmentation and how it becomes a big issue for sequential read/write operations, so backups performed by any method other than snapshots are likely to be affected. I've also read that splitting the volumes to be backed up into several smaller volumes could help, as could adding more physical disks to the array, so I'm trying to figure out how I can improve this.
As soon as I get the statit taken while transferring data, I'll post it.
Hi!
I finally managed to get a reading while a backup was running and transferring data to the VTL. This is the output from the statit:
------------------------------------------------------------------------------------------------------------------------
Hostname: SHUSE-FS01 ID: 0135112970 Memory: 2816 MB
NetApp Release 7.3.4P2: Sat Sep 4 05:11:24 PDT 2010
Start time: Mon Dec 3 19:01:52 CET 2012
CPU Statistics
315.979006 time (seconds) 100 %
169.867678 system time 54 %
9.226622 rupt time 3 % (2600865 rupts x 4 usec/rupt)
160.641056 non-rupt system time 51 %
462.090332 idle time 146 %
150.136114 time in CP 48 % 100 %
6.041689 rupt time in CP 4 % (1561910 rupts x 4 usec/rupt)
Multiprocessor Statistics (per second)
cpu0 cpu1 total
sk switches 59900.95 56480.88 116381.83
hard switches 34656.38 39071.03 73727.41
domain switches 502.97 705.21 1208.18
CP rupts 4463.69 479.39 4943.08
nonCP rupts 2761.82 526.23 3288.05
IPI rupts 63.27 5.57 68.84
grab kahuna 0.23 0.28 0.51
grab w_xcleaner 0.00 71.94 71.94
grab kahuna usec 2.29 0.90 3.19
grab w_xcleaner usec 0.00 21738.43 21738.43
CP rupt usec 18316.76 803.78 19120.54
nonCP rupt usec 9435.77 643.80 10079.57
idle 776445.65 685962.69 1462408.34
kahuna 0.00 223157.23 223157.23
storage 38325.61 12057.61 50383.21
exempt 47537.74 31787.76 79325.50
raid 34005.82 11549.42 45555.24
target 4610.26 4937.77 9548.03
netcache 0.00 0.00 0.00
netcache2 0.00 0.00 0.00
cifs 23013.17 15125.81 38138.99
wafl_exempt 0.00 0.00 0.00
wafl_xcleaner 0.00 0.00 0.00
sm_exempt 31.37 19.85 51.22
cluster 0.00 0.00 0.00
protocol 0.00 0.00 0.00
nwk_exclusive 0.00 0.00 0.00
nwk_exempt 0.00 0.00 0.00
nwk_legacy 48277.83 13954.28 62232.11
nwk_ctx1 0.00 0.00 0.00
nwk_ctx2 0.00 0.00 0.00
nwk_ctx3 0.00 0.00 0.00
nwk_ctx4 0.00 0.00 0.00
120.076056 seconds with one or more CPUs active ( 38%)
76.889564 seconds with one CPU active ( 24%)
43.186492 seconds with both CPUs active ( 14%)
Domain Utilization of Shared Domains (per second)
0.00 idle 0.00 kahuna
0.00 storage 0.00 exempt
0.00 raid 0.00 target
0.00 netcache 0.00 netcache2
0.00 cifs 0.00 wafl_exempt
0.00 wafl_xcleaner 0.00 sm_exempt
0.00 cluster 0.00 protocol
0.00 nwk_exclusive 0.00 nwk_exempt
0.00 nwk_legacy 0.00 nwk_ctx1
0.00 nwk_ctx2 0.00 nwk_ctx3
0.00 nwk_ctx4
CSMP Domain Switches (per second)
From\To idle kahuna storage exempt raid target netcache netcache2 cifs wafl_exempt wafl_xcleaner sm_exempt cluster protocol nwk_exclusive nwk_exempt nwk_legacy nwk_ctx1 nwk_ctx2 nwk_ctx3 nwk_ctx4
idle 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
kahuna 0.00 0.00 11.34 0.96 61.34 1.02 0.00 0.00 195.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 57.42 0.00 0.00 0.00 0.00
storage 0.00 11.34 0.00 0.00 274.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.32 0.00 0.00 0.00 0.00
exempt 0.00 0.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.42 0.00 0.00 0.00 0.00
raid 0.00 61.34 274.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
target 0.00 1.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.09 0.00 0.00 0.00 0.00
netcache 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
netcache2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cifs 0.00 195.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
wafl_exempt 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
wafl_xcleaner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sm_exempt 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cluster 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
protocol 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nwk_exclusive 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nwk_exempt 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nwk_legacy 0.00 57.42 2.32 0.42 0.00 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nwk_ctx1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nwk_ctx2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nwk_ctx3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nwk_ctx4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Miscellaneous Statistics (per second)
73727.41 hard context switches 0.07 NFS operations
1822.92 CIFS operations 0.00 HTTP operations
0.00 NetCache URLs 0.00 streaming packets
7524.12 network KB received 4445.83 network KB transmitted
24311.18 disk KB read 14180.27 disk KB written
9675.15 NVRAM KB written 0.00 nolog KB written
2118.47 WAFL bufs given to clients 0.00 checksum cache hits ( 0%)
0.00 no checksum - partial buffer 154.69 FCP operations
79.51 iSCSI operations
WAFL Statistics (per second)
3604.54 name cache hits ( 98%) 88.36 name cache misses ( 2%)
86664.68 buf hash hits ( 86%) 14148.92 buf hash misses ( 14%)
12829.76 inode cache hits ( 100%) 13.07 inode cache misses ( 0%)
12738.01 buf cache hits ( 88%) 1756.80 buf cache misses ( 12%)
145.96 blocks read 5578.99 blocks read-ahead
1082.83 chains read-ahead 138.71 dummy reads
3855.41 blocks speculative read-ahead 2851.32 blocks written
12.05 stripes written 0.00 blocks over-written
0.03 wafl_timer generated CP 0.00 snapshot generated CP
0.00 wafl_avail_bufs generated CP 0.00 dirty_blk_cnt generated CP
0.03 full NV-log generated CP 0.05 back-to-back CP
0.00 flush generated CP 0.13 sync generated CP
0.00 wafl_avail_vbufs generated CP 0.03 deferred back-to-back CP
0.00 container-indirect-pin CP 0.00 low mbufs generated CP
0.00 low datavecs generated CP 11773.29 non-restart messages
91.64 IOWAIT suspends 122333146.43 next nvlog nearly full msecs
0.00 dirty buffer susp msecs 52.39 nvlog full susp msecs
565192 buffers
RAID Statistics (per second)
408.53 xors 0.00 long dispatches [0]
0.00 long consumed [0] 0.00 long consumed hipri [0]
0.00 long low priority [0] 0.00 long high priority [0]
0.00 long monitor tics [0] 0.00 long monitor clears [0]
0.00 long dispatches [1] 0.00 long consumed [1]
0.00 long consumed hipri [1] 0.00 long low priority [1]
0.00 long high priority [1] 0.00 long monitor tics [1]
0.00 long monitor clears [1] 18 max batch
8.56 blocked mode xor 130.55 timed mode xor
2.53 fast adjustments 1.07 slow adjustments
0 avg batch start 0 avg stripe/msec
13.25 tetrises written 0.00 master tetrises
0.00 slave tetrises 338.36 stripes written
70.67 partial stripes 267.70 full stripes
2867.78 blocks written 140.38 blocks read
5.99 1 blocks per stripe size 9 2.40 2 blocks per stripe size 9
1.67 3 blocks per stripe size 9 1.99 4 blocks per stripe size 9
3.42 5 blocks per stripe size 9 5.56 6 blocks per stripe size 9
12.85 7 blocks per stripe size 9 36.79 8 blocks per stripe size 9
267.70 9 blocks per stripe size 9
Network Interface Statistics (per second)
iface side bytes packets multicasts errors collisions pkt drops
e0P recv 20.56 0.18 0.05 0.00 0.00
xmit 12.51 0.14 0.00 0.00 0.00
e0a recv 161035.45 1006.36 0.00 0.00 0.00
xmit 289917.84 535.97 0.04 0.00 0.00
e0b recv 6466822.64 5105.37 0.00 0.00 0.00
xmit 2994190.18 4591.79 0.03 0.00 0.00
e0c recv 1070085.88 1644.35 0.00 0.00 0.00
xmit 1154351.26 1552.51 0.03 0.00 0.00
e0d recv 6738.45 42.52 0.00 0.00 0.00
xmit 114060.17 105.06 0.00 0.00 0.00
vh recv 0.00 0.00 0.00 0.00 0.00
xmit 0.00 0.00 0.00 0.00 0.00
vif01 recv 7707491.93 7778.64 3.54 0.00 0.00
xmit 4485364.45 6714.86 0.11 0.00 0.00
vif02 recv 6878.26 43.47 0.01 0.00 0.00
xmit 118060.92 108.36 0.00 0.00 0.00
Disk Statistics (per second)
ut% is the percent of time the disk was busy.
xfers is the number of data-transfer commands issued per second.
xfers = ureads + writes + cpreads + greads + gwrites
chain is the average number of 4K blocks per command.
usecs is the average disk round-trip time per 4K block.
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr0_SASdisks/plex0/rg0:
0d.01.0 3 8.10 0.93 1.06 10118 5.81 16.69 182 1.35 3.78 469 0.00 .... . 0.00 .... .
0d.01.2 3 8.33 0.92 1.06 13291 6.04 16.15 184 1.37 4.19 458 0.00 .... . 0.00 .... .
0d.01.4 25 73.77 66.56 3.12 4884 5.05 17.03 530 2.16 4.43 1432 0.00 .... . 0.00 .... .
0d.01.6 24 72.30 65.45 3.17 4707 4.54 18.83 546 2.32 4.76 1446 0.00 .... . 0.00 .... .
0d.01.8 25 72.08 65.34 3.24 4746 4.59 18.56 581 2.15 4.78 1370 0.00 .... . 0.00 .... .
0d.01.10 24 72.49 65.72 3.26 4704 4.58 18.61 560 2.18 4.78 1283 0.00 .... . 0.00 .... .
0d.01.12 24 72.11 65.46 3.20 4802 4.53 18.87 568 2.11 4.98 1510 0.00 .... . 0.00 .... .
0d.01.14 24 73.16 66.09 3.16 4718 4.70 17.94 592 2.37 5.00 1242 0.00 .... . 0.00 .... .
0d.01.16 25 73.00 66.21 3.16 4889 4.62 18.54 614 2.16 4.60 1694 0.00 .... . 0.00 .... .
0d.01.18 25 73.88 67.18 3.16 4795 4.47 19.14 568 2.22 4.68 1337 0.00 .... . 0.00 .... .
0d.01.20 24 72.54 65.73 3.13 4863 4.58 18.55 601 2.23 4.84 1463 0.00 .... . 0.00 .... .
/aggr1_SATAdisks/plex0/rg0:
0d.02.2 8 11.18 0.58 1.00 16228 9.25 26.17 399 1.35 6.16 671 0.00 .... . 0.00 .... .
0d.02.18 8 11.39 0.58 1.00 28214 9.51 25.51 424 1.31 5.29 814 0.00 .... . 0.00 .... .
0d.02.22 80 99.85 88.91 5.03 7772 9.26 25.32 1357 1.67 4.69 5874 0.00 .... . 0.00 .... .
0d.02.4 77 98.44 88.15 5.04 7084 8.69 26.87 1303 1.60 6.09 3727 0.00 .... . 0.00 .... .
0d.02.6 78 98.17 87.79 5.05 7206 8.74 26.70 1283 1.64 5.82 4108 0.00 .... . 0.00 .... .
0d.02.8 78 97.10 86.95 5.11 7108 8.63 27.10 1324 1.52 5.60 4260 0.00 .... . 0.00 .... .
0d.02.10 77 97.71 87.38 5.02 7295 8.69 26.76 1341 1.65 6.29 3969 0.00 .... . 0.00 .... .
0d.02.12 78 99.41 89.02 5.00 7469 8.77 26.57 1330 1.62 5.53 4288 0.00 .... . 0.00 .... .
0d.02.14 78 98.23 88.11 5.03 7235 8.66 27.01 1278 1.46 5.88 4100 0.00 .... . 0.00 .... .
0d.02.16 77 97.74 87.05 5.03 7208 8.81 26.08 1330 1.88 7.13 3392 0.00 .... . 0.00 .... .
0d.02.20 77 98.03 87.43 5.00 7278 8.78 26.54 1301 1.82 5.57 4240 0.00 .... . 0.00 .... .
Aggregate statistics:
Minimum 3 8.10 0.58 4.47 1.31 0.00 0.00
Mean 43 71.77 63.07 6.88 1.82 0.00 0.00
Maximum 80 99.85 89.02 9.51 2.37 0.00 0.00
Spares and other disks:
0d.01.1 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.3 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.5 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.7 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.9 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.11 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.13 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.15 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.17 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.19 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.21 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.22 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.01.23 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.0 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.1 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.3 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.5 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.7 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.9 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.11 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.13 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.15 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.17 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.19 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.21 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0d.02.23 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
FCP Statistics (per second)
2792330.86 FCP Bytes recv 3958402.35 FCP Bytes sent
154.69 FCP ops
iSCSI Statistics (per second)
1303016.39 iSCSI Bytes recv 1195745.34 iSCSI Bytes xmit
79.51 iSCSI ops
Tape Statistics (per second)
tape write bytes blocks read bytes blocks
SHUSE-SAN01:7.125 9849304.89 37.57 0.00 0.00
SHUSE-SAN01:7.125L1 1659.25 0.01 0.00 0.00
SHUSE-SAN01:7.125L2 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L3 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L4 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L5 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L6 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L7 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L8 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L9 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L10 0.00 0.00 0.00 0.00
SHUSE-SAN01:7.125L11 0.00 0.00 0.00 0.00
Interrupt Statistics (per second)
2000.03 Clock (IRQ 0) 4061.30 PCI direct (IRQ 16)
2100.60 PCI direct (IRQ 17) 0.00 RTC
68.84 IPI 8230.77 total
NVRAM Statistics (per second)
0.00 total dma transfer KB 0.00 wafl write req data KB
0.00 dma transactions 0.00 dma descriptors
2787.38 waitdone preempts 0.01 waitdone delays
0.02 transactions not queued 335.84 transactions queued
336.80 transactions done 42.81 total waittime (MS)
1479.39 completion wakeups 197.86 nvdma completion wakeups
118.72 nvdma completion waitdone 9674.19 total nvlog KB
0.00 nvlog shadow header array full 0.00 channel1 dma transfer KB
0.00 channel1 dma transactions 0.00 channel1 dma descriptors
E7520 Data Mover Statistics (per second)
10334.55 total dma transfer KB 4.94 total bcopy transfer KB
2.60 total waittime (MS)
------------------------------------------------------------------------------------------------------------------------
I also got some output from a sysstat that was running at the same time as the statit:
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
26% 0 1947 0 2124 18008 2455 20341 2475 0 0 8s 67% 13% Fn 73% 136 41 287 439 306 1181
71% 0 3462 0 3564 66792 2633 13299 89953 0 0 28s 95% 95% Fn 84% 80 22 342 303 281 276
71% 0 3618 0 3723 64474 3447 19938 78679 0 0 28s 92% 61% Fn 78% 75 30 262 181 230 904
65% 0 3437 0 3535 62042 2941 14108 82709 0 0 4s 94% 89% F 82% 69 29 283 146 352 395
70% 0 3449 0 3570 67054 3342 18552 83138 0 0 30s 93% 84% Ff 85% 86 35 464 74 308 802
17% 0 1852 0 1956 960 2431 22146 0 0 0 4s 62% 0% - 85% 55 49 206 38 540 958
17% 1 1625 0 1805 700 2590 23444 12 0 0 8s 64% 0% - 83% 161 18 594 2425 369 0
19% 0 1707 0 2427 903 2403 27926 0 0 0 8s 65% 0% - 84% 667 53 904 4038 514 1187
48% 1 3718 0 8367 2324 73504 87383 21052 0 0 3s 92% 100% :f 79% 4611 37 247 18768 218 701
33% 0 3156 0 3230 1554 37708 45966 7364 0 0 3s 81% 99% Zf 94% 47 27 432 9 225 197
32% 0 3168 0 3245 1612 44303 58272 3710 0 0 2s 78% 99% Zf 98% 45 32 726 17 143 985
31% 0 3233 0 3311 1830 43011 50216 5395 0 0 54s 85% 99% Zf 100% 58 20 481 16 386 0
34% 0 3611 0 3750 1896 48525 56898 3962 0 0 57s 82% 99% Zf 98% 78 61 945 20 289 852
30% 0 3106 0 3203 1699 43288 58432 5382 0 0 58s 86% 99% Zn 95% 63 34 774 2 277 335
32% 0 2992 0 3104 1880 50770 66186 5510 0 0 59s 87% 99% Zn 98% 58 54 514 18 317 66
29% 0 3019 0 3209 1630 41634 55848 6352 0 0 1 86% 99% Zn 100% 141 49 1023 296 253 1116
43% 0 4055 0 4315 2447 69478 74662 11116 0 0 1 89% 99% Zf 89% 196 64 4232 1454 362 453
Now I do see massive disk usage on the SATA aggregate, but there is still little traffic on the interfaces and only about 24.67MB/s of activity from the disks (at 80% disk utilization?).
Is it possible, from the statit output, to calculate the average read/write IOPS being requested from the array, so I can compare it with the "theoretical" IOPS the array should be capable of serving given the number of disks it has? Does that make sense?
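On the "24.67MB/s at 80% utilization" point, the Disk Statistics columns can be converted into per-disk figures directly, since throughput is roughly ureads x chain x 4KB and xfers is already the per-disk ops rate. Below is a minimal sketch with values hand-copied from the SATA data disks in the statit above; treat it as a back-of-envelope check, not a tool.
----------------------------------------------------------------------------------
# Sketch: convert statit "Disk Statistics" read columns into MB/s and ops/s.
# Values hand-copied from the nine SATA data disks in the statit above:
# (ureads/s, writes/s, cpreads/s, read chain in 4K blocks).

SATA_DATA_DISKS = [
    (88.91, 9.26, 1.67, 5.03),  # 0d.02.22
    (88.15, 8.69, 1.60, 5.04),  # 0d.02.4
    (87.79, 8.74, 1.64, 5.05),  # 0d.02.6
    (86.95, 8.63, 1.52, 5.11),  # 0d.02.8
    (87.38, 8.69, 1.65, 5.02),  # 0d.02.10
    (89.02, 8.77, 1.62, 5.00),  # 0d.02.12
    (88.11, 8.66, 1.46, 5.03),  # 0d.02.14
    (87.05, 8.81, 1.88, 5.03),  # 0d.02.16
    (87.43, 8.78, 1.82, 5.00),  # 0d.02.20
]

iops = sum(u + w + c for u, w, c, _ in SATA_DATA_DISKS)
read_mb_s = sum(u * chain * 4 / 1024 for u, _, _, chain in SATA_DATA_DISKS)

print(f"back-end ops/s on the SATA data disks : ~{iops:.0f}")
print(f"user read throughput for those disks  : ~{read_mb_s:.1f} MB/s")
# -> roughly 880 ops/s but only ~15-16 MB/s of reads: each disk is doing close
#    to 100 ops/s, yet with an average chain of only ~5 x 4K blocks per read
#    the resulting bandwidth stays low. Add the SAS aggregate's reads and you
#    land close to the ~24MB/s "disk KB read" figure in the statit.
----------------------------------------------------------------------------------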
I'd say the SATA disks are the bottleneck (no surprise). You also have a lot of CIFS IOPS while the backup is running, which slows it down further because a lot of CPs are being generated. Maybe you can move the CIFS activity and the backup activity to different timeslots; that would certainly help.
Comparing the current IOPS with the "theoretical" IOPS is difficult, but it can be done. I'd recommend getting someone from NetApp or a partner company with performance troubleshooting experience involved at this stage.
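As a very rough illustration of that comparison, a sketch follows. The per-spindle figure is only the usual rule of thumb for 7.2k SATA under random I/O, not a NetApp-published number, so take the result as an order-of-magnitude check.
----------------------------------------------------------------------------------
# Very rough observed-vs-theoretical comparison for the SATA RAID group.
# Assumption: ~75 random IOPS per 7.2k SATA spindle (rule of thumb only).

RULE_OF_THUMB_SATA_IOPS = 75   # assumed random IOPS per 7.2k spindle
DATA_DISKS = 9                 # data disks in the RAID-DP group (aggr1_SATAdisks)
OBSERVED_XFERS_PER_DISK = 98   # xfers column for the SATA data disks in statit

theoretical = RULE_OF_THUMB_SATA_IOPS * DATA_DISKS
observed = OBSERVED_XFERS_PER_DISK * DATA_DISKS

print(f"rule-of-thumb random IOPS for the group : ~{theoretical}")
print(f"observed back-end ops/s during backup   : ~{observed}")
# -> ~675 vs ~880: the spindles are already working above the random-I/O rule
#    of thumb (possible because part of the I/O is chained/sequential), which
#    is consistent with the 77-80% ut% figures seen during the backup.
----------------------------------------------------------------------------------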
Hi.
Thanks for the response.
I've been doing some tests. Yesterday I created a new 1TB volume on the same SATA aggregate and copied in about 100GB of files of 1-1.5GB each. I ran a dump to null and took some statit/sysstat info. The dump completed in 6 minutes, and I measured a throughput of about 1TB/h (close to 300MB/s).
Today I did another dump to null with a production volume, specifically the userfiles share, which holds about 3TB of data in small files (around 3 million files). The dump was aborted after about one hour, by which point it had read 240GB, a throughput close to 300GB/h (88MB/s). I also took statit/sysstat info for this run.
Both volumes are on the same aggregate, meaning the same physical disks. The only thing I can "conclude" from this is that the directory structure, together with the file sizes and the number of files, is what is slowing down the read process.
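To put rough numbers on that comparison, a trivial sketch; the elapsed times are the approximate ones quoted above, so the results are only ballpark figures.
----------------------------------------------------------------------------------
# Back-of-envelope throughput of the two dump-to-null tests.

def rate_mb_s(gigabytes: float, minutes: float) -> float:
    """Average throughput in MB/s for `gigabytes` read in `minutes`."""
    return gigabytes * 1024 / (minutes * 60)

large_files = rate_mb_s(100, 6)    # new volume, 1-1.5GB files
small_files = rate_mb_s(240, 60)   # userfiles volume, ~3 million small files

print(f"large-file test : ~{large_files:.0f} MB/s")
print(f"small-file test : ~{small_files:.0f} MB/s")
# -> ~285 MB/s vs ~68 MB/s averaged over the whole hour (closer to the quoted
#    88MB/s if the mapping passes are excluded). Same spindles, so the gap is
#    down to file count and directory structure rather than raw disk speed.
----------------------------------------------------------------------------------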
Here's some output from the dump, where you can see the time spent on each Pass of the dump:
----------------------------------------------------------------------------------
DUMP: creating "/vol/usuarios/../snapshot_for_backup.511" snapshot.
DUMP: Using Full Volume Dump
DUMP: Dumping tape file 1 on null
DUMP: Date of this level 0 dump: Tue Dec 18 10:09:03 2012.
DUMP: Date of last level 0 dump: the epoch.
DUMP: Dumping /vol/usuarios to null
DUMP: mapping (Pass I)[regular files]
DUMP: mapping (Pass II)[directories]
DUMP: estimated 3080484582 KB.
DUMP: dumping (Pass III) [directories]
DUMP: Tue Dec 18 10:21:01 2012 : We have written 370385 KB.
DUMP: Tue Dec 18 10:26:01 2012 : We have written 1142394 KB.
DUMP: dumping (Pass IV) [regular files]
DUMP: Tue Dec 18 10:31:01 2012 : We have written 11003960 KB.
DUMP: Tue Dec 18 10:36:01 2012 : We have written 43909314 KB.
DUMP: Tue Dec 18 10:41:01 2012 : We have written 82547223 KB.
DUMP: Tue Dec 18 10:46:01 2012 : We have written 116505114 KB.
DUMP: Tue Dec 18 10:51:01 2012 : We have written 149442003 KB.
DUMP: Tue Dec 18 10:56:01 2012 : We have written 183890952 KB.
DUMP: Tue Dec 18 11:01:01 2012 : We have written 219154461 KB.
DUMP: Tue Dec 18 11:06:01 2012 : We have written 251863963 KB.
----------------------------------------------------------------------------------
I've read about other environments with the same array that hold many more millions of files than we do. Can I actually conclude that this is what is hurting the backups? How can I prove it (with numbers)?
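One way to get those numbers is to let the dump log speak for itself: the timestamped "We have written" lines give the throughput of each five-minute interval, and the slow intervals line up with the passes that walk directories and small files. A small Python sketch follows; it only parses the log format shown above.
----------------------------------------------------------------------------------
# Sketch: compute per-interval throughput from the timestamped
# "We have written" lines of a dump log (format as shown above).

import re
from datetime import datetime

LOG = """\
DUMP: Tue Dec 18 10:21:01 2012 : We have written 370385 KB.
DUMP: Tue Dec 18 10:26:01 2012 : We have written 1142394 KB.
DUMP: Tue Dec 18 10:31:01 2012 : We have written 11003960 KB.
DUMP: Tue Dec 18 10:36:01 2012 : We have written 43909314 KB.
DUMP: Tue Dec 18 10:41:01 2012 : We have written 82547223 KB.
"""

PATTERN = re.compile(r"DUMP: (\w{3} \w{3}\s+\d+ [\d:]+ \d{4}) : We have written (\d+) KB")

samples = []
for line in LOG.splitlines():
    match = PATTERN.search(line)
    if match:
        when = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        samples.append((when, int(match.group(2))))

for (t0, kb0), (t1, kb1) in zip(samples, samples[1:]):
    seconds = (t1 - t0).total_seconds()
    print(f"{t0:%H:%M} -> {t1:%H:%M} : {(kb1 - kb0) / 1024 / seconds:6.1f} MB/s")
# -> ~2.5 MB/s while still in Pass III (directories), 30-125 MB/s once Pass IV
#    (regular files) gets going. Running this over the full log of a production
#    backup would show exactly where the 31-hour window is being spent.
----------------------------------------------------------------------------------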