ONTAP Hardware

FAS6280. Disappointing performance?

webfarm_aruba
26,627 Views

We have just acquired a shiny new FAS6280 in a MetroCluster configuration, with 2x512GB PAM per controller and a total of 336 FC 15k disks, hoping for awesome performance, but now that we have set it up we are quite disappointed with its NFS performance.

We already have a MetroCluster FAS3160 (our first NetApp), and I have to say we were surprised by its performance: it reaches 35-40k NFSv3 IOPS per controller (our application profile is 80% metadata, 12% reads, 8% writes, over hundreds of millions of files), saturating the CPU (its bottleneck under our profile) with very low latency (disks and bandwidth were OK).

We use just the NFS protocol, nothing more, nothing less, but we use it heavily and we need very high IOPS capacity (we use the storage for real, unlike many customers who never touch its limits), so we decided to move to the new FAS6280 hoping for a huge improvement in CPU performance (a powerful hexa-core X5670 versus a tiny old dual-core AMD Opteron 2218). If you look at CPU benchmark websites (like http://www.cpubenchmark.net) you can see something like 6x the raw CPU power, so we hoped for at least a 3x performance increase, knowing that the bottleneck in our configuration was just the CPU.

A bad surprise: with the same application profile, a single controller barely seems to reach 75-80k IOPS before hitting 100% CPU busy. So, from our point of view, just 2x the performance of a FAS3160. The bad thing is that a FAS6280 obviously doesn't cost 2x a FAS3160...

So we tried to investigate..

The real surprise is how badly (let me stress, badly) the CPU is employed on the FAS6280. For comparison, here is the sysstat -m output from our FAS3160 running near 100% CPU busy:

sysstat -m 1
ANY   AVG   CPU0  CPU1  CPU2  CPU3
100%  86%   70%   96%   90%   90%

As you can see, CPU usage is quite well balanced and all cores are employed. Good.

Now the sysstat -m output on the FAS6280 with the same application profile:

sysstat -m 1
ANY  AVG  CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11
99%  21%    6%   0%   0%   0%   0%   0%   0%   0%  48%  48%  58%  93%

As you can see, the FAS6280 barely uses 4-5 CPU cores out of the 12 available, especially CPU11, which reaches 100% busy very soon. We tried to optimize everything: mounting with NFSv2, v3, and v4, creating more aggregates, more volumes, FlexVols, traditional vols, and so on, but nothing let us increase performance..

From my personal point of view, it seems that the FAS6280 hardware is far more advanced than the ONTAP version running on it (8.0.1), which probably can't take advantage of the larger number of cores in this new family of filers, so it ends up using an advanced CPU like the X5670 as if it were an older dual core, or a little more. The X5670 core is simply faster than a 2218 core, so it obtains better performance.. but far, far away from what it could do.

I read that upcoming ONTAP upgrades should unlock the NVRAM (it can currently use just 2GB out of the 8GB installed) and the cache (currently 48GB out of 96GB), and should provide better multithreading. Will these upgrades unlock some power from our FAS6280?

Another strange thing is that the filer reaches 100% CPU busy with extremely low latencies. I have read that it is not recommended to push the filer above 90% CPU busy, because latencies could then increase very quickly and unpredictably. This sounds reasonable, but an application with 200us latencies is not useful to us at all; we just need to stay under a 12ms limit. For instance, if we hit 100% CPU busy, is it reasonable to keep increasing the filer's load until we reach, say, an 8ms average latency on the slowest op? Or could latency really explode unpredictably and cause problems?
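To illustrate why latency can stay flat for a long time and then "explode" near saturation, here is a minimal queueing sketch. This is a textbook M/M/1 model, not a model of ONTAP's scheduler, and the 0.2 ms service time is a made-up number for illustration:

```python
# Hypothetical M/M/1 sketch: mean response time vs. utilization.
# Shows the non-linear blow-up near 100% busy; purely illustrative.

def response_time_ms(service_time_ms: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

if __name__ == "__main__":
    for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
        print(f"rho={rho:.2f} -> {response_time_ms(0.2, rho):.1f} ms")
```

Under this model, going from 90% to 99% utilization multiplies latency by 10x, which is the kind of "unpredictable" jump the whitepaper warns about.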

What do you think? I sincerely don't know what else to refine; I think we have tried almost everything. Can we balance CPU utilization better? What do you suggest?

Of course I can post more diag commands output if you need it.

Thanks in advance,

Lorenzo.

1 ACCEPTED SOLUTION

lovik_netapp
26,582 Views

Hi Lorenzo,

Welcome to community!

Let me first tell you: you should never judge a NetApp system's load by its CPU usage. It's always a bad idea to conclude that a system is busy just by looking at CPU usage. To get the best out of your NetApp system, always use multithreaded operations with the recommended settings and look at the latency values, which in your case are very good at 200us.

ONTAP ties every type of operation to a scheduling group called a domain. Every domain is coded so that it first uses its explicitly assigned CPU, and only once it saturates that CPU does it shift load to the next one. Every domain has its own priority, and these are hard-coded; for instance, a housekeeping domain will always have lower priority than the NFS/CIFS domains, and ONTAP always makes sure that user requests take priority over internal system work.
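The "fill one CPU, then spill to the next" behaviour described above can be sketched as a toy model. This is purely illustrative; the fill order and per-CPU capacities are assumptions, not ONTAP internals:

```python
# Toy sketch of the "waterfall" CPU assignment described above: load is
# packed onto one CPU first and spills to the next only when that CPU
# saturates. A simplification, not ONTAP's actual domain scheduler.

def distribute_load(total_load: float, n_cpus: int, cpu_capacity: float = 1.0):
    """Return per-CPU load, filling CPUs one at a time."""
    loads = [0.0] * n_cpus
    remaining = total_load
    for i in range(n_cpus):
        take = min(remaining, cpu_capacity)
        loads[i] = take
        remaining -= take
        if remaining <= 0:
            break
    return loads

# A workload worth 2.5 cores on a 12-core box touches only 3 CPUs:
# distribute_load(2.5, 12) -> [1.0, 1.0, 0.5, 0.0, ..., 0.0]
```

This is why a mostly idle sysstat -m row with one or two hot cores does not by itself mean the system is out of headroom.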

One more thing: please don't look at the CPU stats' 'ANY' counter; always look at 'AVG'. The 'ANY' counter always gives me a mild heart attack.

Lastly, I would say the issue you are looking at is cosmetic. However, if you think you have a performance problem, run a benchmark with SIO_ontap and you will see the system's capacity, or open a support case; but I am sure you will hear the same thing from support too.

Cheers!

View solution in original post

43 REPLIES

bosko_radivojevic
17,988 Views

[ it is a bit offtopic, sorry for this ]

How do you measure your IO profile? We are using our storage in a similar way (NFSv3) and have the same CPU-bound problem (on a much smaller 2040 platform).

lovik_netapp
26,583 Views

Hi Lorenzo,

Welcome to community!

Let me first tell you: you should never judge a NetApp system's load by its CPU usage. It's always a bad idea to conclude that a system is busy just by looking at CPU usage. To get the best out of your NetApp system, always use multithreaded operations with the recommended settings and look at the latency values, which in your case are very good at 200us.

ONTAP ties every type of operation to a scheduling group called a domain. Every domain is coded so that it first uses its explicitly assigned CPU, and only once it saturates that CPU does it shift load to the next one. Every domain has its own priority, and these are hard-coded; for instance, a housekeeping domain will always have lower priority than the NFS/CIFS domains, and ONTAP always makes sure that user requests take priority over internal system work.

One more thing: please don't look at the CPU stats' 'ANY' counter; always look at 'AVG'. The 'ANY' counter always gives me a mild heart attack.

Lastly, I would say the issue you are looking at is cosmetic. However, if you think you have a performance problem, run a benchmark with SIO_ontap and you will see the system's capacity, or open a support case; but I am sure you will hear the same thing from support too.

Cheers!

webfarm_aruba
17,988 Views

Hi Lovik,
thanks for your answer! I was looking not at the ANY counter (I had already understood that it is a meaningless parameter) but at the sysstat -x CPU busy parameter (I didn't post it, but it was at 99%). Is that parameter any better for understanding a storage system's load, or is it meaningless too? How can I tell when the storage will reach its limit? Is it just "push until it dies", or is there a predictable way to know?

I mean, you are saying that the cores are employed one after the other (a cool way to reduce CPU context switching!) as they become saturated. So could I use the number of utilized cores as a meter of when the storage is near the end of its performance? The FAS has 12 cores; when I see 11 of them employed, can I say it's time to buy another one? Right now just 4-5 are really employed: does this mean we are far from saturation?
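The "how many cores are employed" meter could be tracked by parsing the sysstat -m data rows shown earlier. A hypothetical helper (the column layout, ANY then AVG then one column per core, is assumed from the output above):

```python
# Hypothetical helper: count "employed" cores from a sysstat -m data row.
# Assumes the layout shown earlier: ANY, AVG, then one column per core.

def busy_cores(sysstat_row: str, threshold: int = 5) -> int:
    """Count per-core busy values above `threshold` percent."""
    fields = [int(f.rstrip("%")) for f in sysstat_row.split()]
    per_core = fields[2:]          # skip the ANY and AVG columns
    return sum(1 for v in per_core if v > threshold)

row = "99%  21%    6%   0%   0%   0%   0%   0%   0%   0%  48%  48%  58%  93%"
print(busy_cores(row))  # -> 5 of the 12 cores doing real work
```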

In a NetApp whitepaper I read that going above 90% CPU busy can give unpredictable results, with latency potentially increasing explosively. Is this right? Of course we have to monitor the storage, and we need to know at least one or two months in advance when we will have to add another system, simply because acquiring new storage is not fast.

For benchmarking purposes we prefer to set up some servers recreating our application environment, to really understand how the storage performs. In the past we saw very different behaviour between benchmark results and real-world results, so we prefer this more traditional approach to avoid (already experienced) surprises.

Currently we have installed 20 dual-Xeon quad-core servers to stress the storage, and we reached 100% CPU busy on the filer. As I said, latency continues to be exceptionally low, but we need to know the real limits of the storage to understand for how many months it will give us sufficient performance. We came from EMC Celerras, and on those appliances everything was much more linear, meaning that when you reach 100% datamover CPU you are dead and everything stops working well, so it is quite easy to calculate the limits and predict when it will saturate. Should I add test servers until I reach our latency limit? How reliable is this approach? I mean, we won't make latency explode with 1% more load on the storage..

Thanks again,
Lorenzo.


PS: Bosko, you can reset your filer's NFS statistics with nfsstat -z, run your application for a while, then run nfsstat again to see the NFS op-mix percentages and sizes.
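The nfsstat workflow above boils down to turning the delta of per-op counters into percentages. A minimal sketch; the op names and counts below are made up for illustration, not real nfsstat output:

```python
# Sketch: turn raw per-op counters (as read off an nfsstat delta) into a
# percentage profile. The sample counts are invented for illustration.

def op_profile(counts: dict) -> dict:
    """Return each op's share of total calls, as a percentage."""
    total = sum(counts.values())
    return {op: round(100.0 * n / total, 1) for op, n in counts.items()}

sample = {"lookup": 56000, "getattr": 9000, "setattr": 7000,
          "read": 12000, "write": 8000, "other": 8000}
print(op_profile(sample))  # lookup comes out at 56.0%, and so on
```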

anton_oks
18,210 Views

Dear Lovik.

We see the same problem Lorenzo describes in our environment. And if your domains really are "shifting load to the next CPU", they certainly stop "shifting" at a maximum of 4 (used) CPUs.

"Opening a case" is a waste of time. NetApp support asks for lots of different logs and things, just to tell you in the end that "it works as designed" and "all will be good with ONTAP xyz".

Regards

anton_oks
18,210 Views

Hi Lorenzo.

Welcome to the club, as we like to say here... and I'm not talking about the NetApp community but about the FAS6xy0 CPU performance issue.

We see the same problem here, and we have been hearing the same explanations and "excuses" from NetApp for many months.

Regards

Anton

webfarm_aruba
18,210 Views

Hi Anton,
I was calmed down by Lovik's answer, and now I'm returning to my depression..

So how do you manage the CPU limit? Do you increase the pressure on your storage until you reach your target latency (I must admit this approach gives me the creeps), or do you keep CPU busy steadily near 90% and avoid going higher?

Thanks,
Lorenzo.

anton_oks
18,210 Views

Hi Lorenzo.

If your application doesn't mind, I'd recommend disabling all atime updates on all volumes. This should help a bit..

In theory, and in the best case, you could drop the MetroCluster, and maybe even the cluster, and use only single heads. The write overhead/penalty of a MetroCluster shows up when you are really pushing those filers to the edge... well, in theory, as I said.

Unfortunately, this is not an option with our (main) application. So what we did in the meantime is introduce (stats-based) latency monitoring and spread the load of the worst application, which we know causes all the trouble (synchronous IOs), over multiple NetApp clusters. The new FAS6280 will host other stuff for now. We hope NetApp will come up with a better ONTAP release very soon... because the FAS6080 has to replace some of the other filers soon.

We started to use jumbo frames and are on 10GbE now... but this may also not free up enough CPU cycles in your case... and we disabled dedup on all performance-relevant NetApp filers.

We have also started to change our "well known" troublemaking application. In the future we will try to use local disks as much as possible for the application logs and similar things, and write less synchronous data to the filer.

Some good answers on this topic can be found here, for instance:

http://nfs.sourceforge.net/nfs-howto/ar01s05.html (see "5.9. Synchronous vs. Asynchronous Behavior in NFS")

http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html (ZFS - synchronous vs. asynchronous IO)

http://kerneltrap.org/mailarchive/linux-fsdevel/2009/8/27/6359123/thread (Linux kernel: adding proper O_SYNC/O_DSYNC)

I still don't have a 100% satisfying answer from NetApp as to why they don't spread their "domains" over all the CPU resources they have... or just give up that kind of design. Maybe there are deep WAFL secrets involved... and maybe the new SSD shelves could help... if you can afford them.

Regards

Anton

webfarm_aruba
14,165 Views

Hi Anton,

I've already disabled atime on all vols and disabled dedup, and unfortunately, like you, I can't disable the MetroCluster and cluster (NetApp was chosen precisely for HA and the MetroCluster feature).

Unfortunately we can't change the application right now either. We are planning for it, but it's a long, long process. The application accesses nearly a billion files (millions of users), and this causes a lot of metadata (lookups at 60%).

We also tried enabling jumbo frames on an LACP trunk of two 10-Gig Ethernet ports, but it doesn't seem to help at all.

Our problem seems to be just the CPU; the disks are barely used (5-10% in sysstat), so I think SSD technology is not very useful for us. We just need to use all the CPU we bought..

I can't understand why NetApp decided to use hexa-core CPUs if they really use only 4-5 cores out of 12..

Thanks,

Lorenzo.

p_maniawski
13,293 Views

We actually experience a similar issue (high CPU load) with small sequential sync writes (database log writes). I'm told that the CPU load in sysstat is not the "real" CPU usage, but no one can explain to me what is shown there either... Enabling jumbo frames for smaller data blocks and disabling VLAN tagging and LACP didn't help. Disk usage 10%, CPU 100%. While our FAS is just an entry-level unit and can't come close to matching the performance of yours, the problem remains the same, so I'm looking forward to your conclusions.

erick_moore
18,210 Views

Sounds like someone is a clock watcher! For starters, latency is generally the most important predictor of performance issues on a storage array. If latency rises and IOPS drop, you certainly have a problem. As it stands now, with your sub-millisecond latency, I would say you are performing exceptionally well! I understand your concern about not wanting to push your CPUs to the point where the system becomes slow, but don't forget about FlexShare! Take a look at TR-3459. FlexShare lets you prioritize your workloads once your system becomes heavily loaded. You can give priority to more critical operations using FlexShare, and the best part is you already have it on your system! If you don't, contact your local SE and they will get you the license, since it is free!

Regards,

Erick

chrisatnav
18,210 Views

Lorenzo,

Going back to your NFS profile (80% metadata), are all of your metadata operations GETATTR calls?  If so, is there any chance you could find a way to reduce the number of those calls?

--

Chris

webfarm_aruba
14,165 Views

Hi Chris,

Metadata is mainly composed of lookups (56%), getattrs (9%), and setattrs (7%); then there are some creates, some removes, and so on, but those are just a small part. Our application needs to access something like a billion files divided into dozens of mountpoints (our new FAS6280 is beginning to be part of this). This probably causes the clients' lookup caches to be invalidated before the same file is recalled, which could explain the high number of lookup requests. Fortunately, lookups are one of the lighter operations on the storage, so this is not a problem for now. Unfortunately, we can't change the application at this time: it is planned, but it is a long process and will take a lot of time.

Thanks,

Lorenzo.

dejanliuit
14,165 Views

Hi.

Looking at how the load is unevenly distributed, I would guess that something in the IRQ (interrupt) handling isn't working as well as it should.

Is all the IO load going through one or a few network interfaces?

I've seen too many cases, especially on older Linux (pre-MSI interrupts), where interrupt handling would load just one CPU.

Also, even with modern MSI interrupt handling, I notice that some cards (QLogic 4Gb/s) still load just the first CPU in certain cases, even though all the cores are available.

In that case I would try to redistribute the load across different network cards and see whether the CPU load distribution moves around; you might find a more optimal distribution.

You might want, or be forced, to use trunked connections to do that, depending on your setup and your requirement to address one or a few NFS IP addresses.

The second thing is to enable NFSv4, if possible, and to look at enabling NFSv4 delegations, again if possible/supported by your client OS.

Read and Write delegations are disabled by default.

Read about delegations in this paper : http://www.nfsconf.com/pres04/baker.pdf

You didn't mention whether you also have high throughput on the NICs.

One thing that really P-Os me is that NetApp decided to drop TCP offloading (TOE).

So now I see a much higher CPU load on NFS with my 10G NICs when doing the exact same tests, compared to 4Gb/s FC.

And I doubt that the protocol differences and block sizes make up for the almost 30-50% higher CPU load I see on 10Gb/s NFS, while still getting higher raw throughput on 2x4Gb/s FC.

NetApp reintroduced TCP TSO (segmentation offloading) in ONTAP 8, but that is just a small part of the whole process of handling network traffic.

TSO : http://en.wikipedia.org/wiki/Large_segment_offload

TOE : http://en.wikipedia.org/wiki/TCP_Offload_Engine

The reasoning I hear is that TOE doesn't give any extra performance compared to today's CPUs and adds overhead.

But IMHO the CPU should be doing other things than calculating checksums, which even the cheapest NICs do at almost wire speed nowadays. And I suppose NetApp doesn't use the cheapest NICs out there.

And what if the CPU is busy, as in your case? Why not switch on TOE then, if you care about performance?

Or at least leave it to us, the customers, to decide what the optimal way is.

shane_bradley
14,226 Views

Re the TOE thing, I heard there were driver dramas; again, it's all just hearsay. We've been using TOE on our 6080s for quite a while; for us it was the difference between bottoming out at about 400MB/s and sustaining 700MB/s with line-speed peaks. The biggest pain with TOE was the lack of trunk support; we used round-robin DNS to work around it, which wasn't too bad.

We're still trying to see what's going on with 8.0.1 and stateless offload. It seems interesting, but we can't really measure the improvement. To be honest we haven't tried, but we probably should one day.

webfarm_aruba
14,277 Views

We have gone further with our tests, and we have probably begun to understand better how this filer works.

First of all Lovik seems to be right.

We pushed CPU utilization further, and what we saw is that the more you push, the more cores come into play. We reached 7 cores in use:

ANY  AVG  CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11
100%  33%   19%   0%   0%   0%   0%   0%  41%  38%  66%  68%  71%  90%

Pushing further, we began to use CPU7, then CPU6. Latency increased, but only slightly: the average was 0.500 ms (the slowest operation was read at 5ms, then mkdir at 2.4ms, with the others under 1ms).

The IOPS have doubled:

CPU   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk
       ops/s      in    out    read  write    read  write    age    hit  time  ty  util
99%  160143  152620 300862   15218 127792       0      0     3     99%  100%  :    12%

The PAM (1TB) seems to be working well, avoiding a lot of disk reads:

Usage    Hit   Meta   Miss Hit Evict Inval Insert Chain Blocks Chain Blocks  Replaced
     %     /s     /s     /s   %    /s    /s     /s    /s     /s    /s     /s        /s
    90   3947   3700   4254  48   105     0   6430  3887   3901   100   6425      3887

The strange thing is that CPU busy was 100% at 80,000 IOPS and is still 100% at 160,000 IOPS, telling me that CPU utilization is not a good indicator of storage usage..

Now the 20 dual-Xeon test servers are saturated: their load average is at 300 and they cannot spit out any more CPU cycles. Even so, the IO wait on the clients remains low, near 4%, which seems to indicate that we haven't reached the storage limit yet.

At this point the only reliable indicator of storage usage seems to be latency, as Lovik said. We'll add test servers to find where the limit is, and to see whether pushing the storage makes latency explode or not (we want a maximum of 12ms latency)..
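The latency watch we have in mind could be as simple as comparing the slowest op's average latency against our 12 ms ceiling. A hypothetical sketch; the warning threshold and op names are illustrative choices, not NetApp recommendations:

```python
# Sketch of a stats-based latency check: flag the filer when the slowest
# op's average latency approaches a 12 ms ceiling. Numbers are illustrative.

SLA_MS = 12.0
WARN_FRACTION = 0.66  # warn at roughly 8 ms, two thirds of the ceiling

def check_latency(latencies_ms: dict) -> str:
    worst_op = max(latencies_ms, key=latencies_ms.get)
    worst = latencies_ms[worst_op]
    if worst >= SLA_MS:
        return f"OVER SLA: {worst_op} at {worst} ms"
    if worst >= SLA_MS * WARN_FRACTION:
        return f"WARN: {worst_op} at {worst} ms, nearing {SLA_MS} ms"
    return "OK"

print(check_latency({"read": 5.0, "mkdir": 2.4, "lookup": 0.3}))  # -> OK
```

The warning tier exists precisely because, as discussed above, latency can climb much faster than linearly once the system nears saturation.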

Thank you all,

Lorenzo.

m_lubinski
14,277 Views

Hi Lorenzo,

One question from me (I am also having very serious performance problems and we don't know where they come from): you wrote that you disabled dedupe, but it didn't help.

Does that mean you only used the command sis off /vol, or did you use the sis undo command (a real disabling of dedupe)?

We only disabled dedupe (without undoing it), and later on we couldn't undo it (due to not enough space).

Just run sis status: if it shows "disabled", then it is still tracking changed blocks on your volume. Try to undo the A-SIS volumes.

webfarm_aruba
14,277 Views

Hi M,
It's probably more correct to say that dedupe was never turned on, and we keep it disabled.
If I run sis status I get "No status entry found."

Is dedupe causing excessive overhead for you? NetApp claims the overhead is very low, but I've never tried it and I'm interested in real customer experience.

Thanks,

Lorenzo.

m_lubinski
13,406 Views

In my case I know dedupe is wrong in many ways. We made some mistakes earlier, like disabling dedupe (without undoing it), and later we couldn't undo it due to not enough space (over-deduplicated).

I also sometimes get a call saying a customer sees a performance issue, and when I check there are, say, 3 concurrent dedupe tasks running (some in verification mode); when I disable them, the problem seems to be solved.

That is why, in the new setup with a clustered (not metro) FAS3160 with PAM II, I won't use deduplication, because of the performance implications.

In my case I could never get help from NetApp, as they always said I had misaligned VMs.

May I ask, by the way, whether you are using your filers for VMware as well, or just physical boxes for websites? And did anything change after going from 1Gbit to 10G?

Marek

webfarm_aruba
13,406 Views

Hi Marek,

This FAS6280 is planned to host only our mail application. We are planning a test with our web hosting service, but not for now, and no virtual machine services at all.

I've checked for differences between the 1Gbit link and the 10Gbit link, and what I've found is that if you don't need the bandwidth, CPU usage doesn't change very much.

I've compared the FAS3160 (linked at 2 x 1Gbit/s) and the FAS6280 (linked at 2 x 10Gbit/s). Using ps -c 1 in advanced mode, I see about 8% CPU usage for each network thread in both the 1Gbit and the 10Gbit configurations, so no difference.

The 10Gbit config does slightly better in the link thread: e0X_qidX (10Gbit) seems to use less CPU than Gb_Enet/eXX (1Gbit), but I'm talking about 1% versus 3%, a very small difference, probably too small to be considered reliable.

I think the processing power needed for networking is low, and you should upgrade to 10Gig only if you need the bandwidth.. but that is just my opinion, of course..

Lorenzo.

m_lubinski
13,293 Views

hi Lorenzo,

thanks for your reply. And tell me, when you do a dd test against your storage, are you able to write at 100MB/s? I often get 20MB/s max, and I have no clue how to deal with it.
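A dd-style sequential write test can be approximated like this. A rough sketch, not a substitute for dd: the fsync mimics dd's conv=fsync so the client page cache doesn't inflate the number, and the path and sizes are placeholders to point at your NFS mount:

```python
# Rough Python equivalent of a "dd" sequential-write test against a
# mounted path. Point `path` at a file on the NFS mount to measure.

import os
import time

def write_throughput_mb_s(path: str, total_mb: int = 64, block_kb: int = 64) -> float:
    block = b"\0" * (block_kb * 1024)
    blocks = total_mb * 1024 // block_kb
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())   # force data to the server, like dd conv=fsync
    elapsed = max(time.monotonic() - start, 1e-9)
    os.unlink(path)            # clean up the test file
    return total_mb / elapsed

# write_throughput_mb_s("/mnt/nfs/ddtest.bin")  # MB/s over the mount
```

Note that without the fsync you would mostly be measuring client RAM, which is one common reason dd numbers vary so wildly between runs.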

Marek
