ONTAP Discussions
Hello,
We are using the FAS2020 at 5 different locations.
We just updated to 7.3.6 because of some bugs.
Generally:
CPU usage: 90%
I/O: 383 KBytes/sec
Ops/sec: 2500
Read latency: 8000 msec
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache
in out read write read write age hit
82% 2242 0 0 2242 18016 44599 58914 16986 0 0 3s 80%
65% 2050 0 0 2050 6394 46315 57136 5035 0 0 2s 83%
68% 1807 0 0 1807 11131 39836 54533 10247 0 0 2s 79%
67% 1833 0 0 1833 10571 37530 50131 10699 0 0 2s 82%
73% 2025 0 0 2025 13901 39879 53675 11781 0 0 3s 81%
I think this performance should be better. What are your I/O and CPU values?
Regards
Hello,
I don't think you will get much more performance from this system; it is the smallest filer of this line.
Maybe someone else has had different experiences.
Regards
Hi there,
Performance problems can be somewhat difficult to identify. I can give you some pointers here, but you will likely require someone with performance analysis skills to go through the system and identify the bottleneck, likely NetApp PS or your own partner PS.
Firstly, you are pushing 50-70 MB/sec through the system, which is not an insignificant amount of data for a FAS2020. The CPU does not seem too busy, but the standard sysstat CPU busy measure is not perfect. If you use "sysstat -ms 1" you will be able to view busy states on all cores in the system in near real time. But I doubt the system is CPU limited; normally CPU does not become an issue until the system is running at 90%+ utilisation all of the time.
This leads on to the likely culprit: disk utilisation. To get an idea of how busy the disks are, "sysstat -us 1" will give a figure for disk utilisation and give you an idea if the disks are the bottleneck. You have not given any indication of the number and type of disks installed in the system, so it is difficult to tell if this is the bottleneck.
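As a rough illustration (the -c, -s and -u flags are the standard sysstat options, visible in the usage text quoted later in this thread), you can let sysstat average things out for you instead of eyeballing one-second samples:

filer> sysstat -c 60 -s -u 1

This takes 60 one-second samples and prints summary statistics at the end, which makes it easier to see whether disk utilisation is constantly pegged or just bursty.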
Finally, you have the "stats" command. This is used to collect data from counters built into the code and provides information on individual disks, volumes, LUNs, etc. This can be used to determine where the performance bottleneck is.
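To give a flavour of it (treat these as illustrative; exact object and counter names vary between ONTAP releases, so list them on your system first, and vol3 is just an assumed example volume name):

filer> stats list objects
filer> stats list counters volume
filer> stats show volume:vol3:avg_latency

The first two commands enumerate what is available; the last reads a specific counter for one volume instance.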
Let me know if you have any more questions.
Cheers
Steve
What kind of disks are in use?
How many VMware machines are running on this system?
Are there any tools running, like SnapMirror or SnapVault?
When does your deduplication start?
In priv set diag mode you can take a look at your CPU performance with the command "sysstat -M 1".
Hello,
Thank you for replies.
I think my problem started when I upgraded to 7.3.6.
We are using SAS 15K disks.
There is no VMware on the system. This is a CDN storage system; 5 servers are connected via NFS.
Depending on which version you upgraded from, there might be some filesystem operations involved after the upgrade, which will take up some IOs.
But basically we need to know the aggregate configuration... (aggr status -r)
Also if you have added disks to the aggregate you might need to do a reallocate on the volumes to spread out the volume on all the disks...
You can measure whether this will make sense with the "reallocate measure" command...
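Something along these lines (a sketch only; vol3 is an assumed volume name, and the rating thresholds vary by release, so check the reallocate man page on your filer):

filer> reallocate measure /vol/vol3
filer> reallocate status -v

The measure job reports an optimization rating for the volume layout; a low value means the data is already well laid out and a "reallocate start" would not gain you much.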
Also "sysstat -u 1" or even "sysstat -x 1" might show some more info, as it shows the CPs... it might be that you are writing a lot of small IOs to the filer...
/Heino
Hello,
aggr status -r
Aggregate aggr0 (online, raid4) (block checksums)
Plex /aggr0/plex0 (online, normal, active, pool0)
RAID group /aggr0/plex0/rg0 (normal)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
parity 0c.00.3 0c 0 3 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 0c.00.7 0c 0 7 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 0c.00.2 0c 0 2 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 0c.00.11 0c 0 11 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 0c.00.6 0c 0 6 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688
data 0c.00.8 0c 0 8 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 0c.00.10 0c 0 10 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688
data 0c.00.4 0c 0 4 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 0c.00.1 0c 0 1 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 0c.00.9 0c 0 9 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 0c.00.5 0c 0 5 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare 0c.00.0 0c 0 0 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
sysstat -u 1
CPU Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk
ops/s in out read write read write age hit time ty util
99% 2656 24611 41014 54957 45706 0 0 14s 75% 90% F 81%
100% 2572 29682 41593 71038 14782 0 0 12s 70% 51% F 75%
99% 3058 29519 44912 64488 25407 0 0 2s 70% 75% : 78%
99% 2257 20982 40047 59549 33126 0 0 2s 75% 93% F 76%
98% 2374 20012 40556 63405 11652 0 0 2s 71% 57% F 75%
100% 2671 21475 41520 67612 25684 0 0 2s 79% 65% : 95%
100% 2494 15177 39469 64092 24460 0 0 2s 77% 68% F 91%
sysstat -x 1
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
98% 2318 0 0 2318 13895 41763 59264 33904 0 0 2s 72% 82% Ff 81% 0 0 0 0 0 0
91% 3137 0 0 3137 21804 49027 71080 4504 0 0 2s 72% 25% : 71% 0 0 0 0 0 0
99% 2715 0 0 2715 19775 42026 65614 21151 0 0 2s 71% 80% Ff 79% 0 0 0 0 0 0
100% 2918 0 0 2918 23577 48491 71632 17542 0 0 18s 75% 83% Fs 84% 0 0 0 0 0 0
95% 2105 0 0 2105 8759 46881 69399 30585 0 0 2s 79% 100% :f 99% 0 0 0 0 0 0
100% 2677 0 0 2677 16101 52417 73874 87 0 0 2s 69% 16% : 85% 0 0 0 0 0 0
100% 2636 0 0 2636 20345 49581 77934 11960 0 0 2s 71% 49% Fs 79% 0 0 0 0 0 0
Also, we upgraded from 7.3.2 to 7.3.6 three days ago.
Best Regards
Read latency: 8000 msec
You mean 8000 microseconds? That translates to a mere 8 milliseconds, which can be deemed good performance; normally anything below 20 ms is good.
Regards,
Radek
I know this is already a good value. My problem is the high CPU usage.
Do you have some stats from your system on 7.3.2?
The only thing I see is that your system has high usage: net in/out, disk in/out.
Have you checked out "sysstat -M 1"?
regards
Sorry, but there is no "-M" option:
usage: sysstat [-c count] [-s] [-u | -x | -f | -i | -b] [interval]
-c count - the number of iterations to execute
-s - print out summary statistics when done
-u - print out utilization format instead
-x - print out all fields (overrides -u)
-f - print out FCP target statistics
-i - print out iSCSI target statistics
-b - print out SAN statistics
I upgraded all 5 of the 5 storage systems to 7.3.6, so I don't have any 7.3.2 logs.
Switch to privileged mode with "priv set diag"; there you can execute this option.
priv set diag
Warning: These diagnostic commands are for use by NetApp
personnel only.
I changed the priv mode but there is still no -M option. Any idea?
Have you tried it?
Of course:
*> priv set diag
*> sysstat -M 1
usage: sysstat [-c count] [-s] [-u | -x | -f | -i | -b] [interval]
-c count - the number of iterations to execute
-s - print out summary statistics when done
-u - print out utilization format instead
-x - print out all fields (overrides -u)
-f - print out FCP target statistics
-i - print out iSCSI target statistics
-b - print out SAN statistics
interval - the interval between iterations in seconds, default is 15 seconds
I have a FAS2040; my output looks like this:
usage: sysstat [-c count] [-s] [-u | -x | -m | -f | -i | -b] [interval]
-c count - the number of iterations to execute
-s - print out summary statistics when done
-u - print out utilization format instead
-x - print out all fields (overrides -u)
-m - print out multiprocessor statistics
-f - print out FCP target statistics
-i - print out iSCSI target statistics
-b - print out SAN statistics
interval - the interval between iterations in seconds, default is 15 seconds
filer2*> sysstat -M 1
ANY1+ ANY2+ AVG CPU0 CPU1 Network Storage Raid Target Kahuna WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0 0%
2% 0% 1% 0% 2% 0% 0% 0% 0% 2% 0% 0% 0% 0% 0% 0% 0 10%
0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0 0%
0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0 0%
The 2020 is a single CPU/single core, so there is no "-M" option. The 2040 is a single CPU/dual core, so "-M" is viable there.
Check that none of your NFS hosts are accessing the filer over the e0M management interface. A single GbE host connecting to that interface, which is only 10/100, causes TCP buffer offloading to the CPU.
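A quick sanity check from the filer side (interface name e0M assumed; "ifstat -a" lists them all):

filer> ifstat e0M
filer> ifstat -a

If the e0M receive/transmit counters show anything beyond trivial management traffic, one of the hosts is mounting through it.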
Also check to make sure that none of your hosts are using the mount options actimeo=0 or noac; these cause a full attribute fetch every time a host looks at, touches, writes, reads, or modifies a file. The 2020 would be hard pressed, even with only five hosts, to run with that mount option set. These options are used mainly for databases... again, the 2020 is not a high-end storage system and is not capable of serving data with these requirements beyond a couple of machines.
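From each client you can check for those options with a generic shell one-liner (paths assumed; adjust for your hosts):

client# grep -E 'noac|actimeo' /etc/fstab
client# mount | grep nfs

The first shows the configured options, the second what is actually in effect on the live mounts.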
- Scott
None of our NFS hosts are accessing via e0M. All of them connect via GbE.
cat /etc/fstab
# Device Mountpoint FStype Options Dump Pass#
/dev/ada0s1b none swap sw 0 0
/dev/ada0s1a / ufs rw 1 1
/dev/acd0 /cdrom cd9660 ro,noauto 0 0
10.50.10.10:/www /var/www nfs rw,tcp,async,noatime,nfsv3,wsize=65536,rsize=65536 0 0
10.50.10.10:/vol/vol3 /storage nfs rw,tcp,async,noatime,nfsv3,wsize=65536,rsize=65536 0 0
We had a similar problem with our 3020 system, though not due to an ONTAP upgrade; it just started one day.
In our case the issue was the network domain being busy, but I wonder why it is happening across all sites in your case. Since your storage does not support sysstat -M (capital M), it is difficult to debug the cause of the busy CPU. The only way out is to send a perfstat to NetApp support to find out which process is keeping the CPU busy.
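If you want to poke at it yourself before involving support, the diag-level "statit" command gives a CPU and per-disk breakdown over a sampling window (diag mode, so use with care; invocation from memory, verify on your release):

filer> priv set diag
filer*> statit -b
... wait 30-60 seconds under normal load ...
filer*> statit -e

The -b flag begins collection; -e ends it and prints the report, including per-disk utilisation and CPU time split by domain.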