2011-02-20 01:22 PM
We have just acquired a shiny new FAS6280 in MetroCluster configuration, with 2x512GB PAM per controller and a total of 336 FC 15k disks, hoping for awesome performance. Now that we have set it up, we are quite disappointed with its NFS performance.
We already have a MetroCluster FAS3160 (our first NetApp), and I have to say we were surprised by its performance: it reaches 35-40k NFS v3 IOPS per controller (our application profile is 80% metadata, 12% reads, 8% writes on hundreds of millions of files) with very low latency before saturating the CPU, which is the bottleneck for our profile (disks and bandwidth were OK).
We use only the NFS protocol, nothing more, nothing less, but we use it heavily and we need very high IOPS capacity (we really push our storage, unlike many customers who never get near its limits). So we decided to move to the new FAS6280, hoping for a huge improvement in CPU performance (a powerful hexa-core X5670 versus a tiny old dual-core AMD Opteron 2218). If you look at CPU benchmark websites (like http://www.cpubenchmark.net) you can see something like 6x the performance in pure CPU power, so we hoped for at least a 3x performance improvement, knowing that the only bottleneck in our configuration was the CPU.
A bad surprise: with the same application profile, a single controller barely seems to reach 75-80k IOPS before hitting 100% CPU busy. So from our point of view that is just 2x the performance of a FAS3160. The bad thing is that a FAS6280 obviously costs more than 2x a FAS3160...
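For anyone who wants the arithmetic spelled out, here is a rough Python sketch of the comparison above. It only uses the figures quoted in this post; taking the midpoint of each IOPS range is my own simplification, not a measurement.

```python
# Figures quoted in the post: IOPS per controller at roughly 100% CPU busy.
fas3160_iops = (35_000, 40_000)
fas6280_iops = (75_000, 80_000)

hoped_speedup = 3.0    # the improvement hoped for in the post
benchmark_ratio = 6.0  # rough "pure CPU" ratio read off a benchmark site


def midpoint(low_high):
    """Midpoint of a (low, high) range; a simplifying assumption."""
    low, high = low_high
    return (low + high) / 2


observed_speedup = midpoint(fas6280_iops) / midpoint(fas3160_iops)

print(f"observed speedup:            {observed_speedup:.2f}x")
print(f"share of the hoped-for 3x:   {observed_speedup / hoped_speedup:.0%}")
print(f"share of the benchmark 6x:   {observed_speedup / benchmark_ratio:.0%}")
```

The observed ratio comes out at roughly 2x, which is where the "just 2x" figure above comes from.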
So we tried to investigate..
The real surprise is how badly, badly (let me say badly) the CPU is used on the FAS6280. For comparison, here is the sysstat -m output from our FAS3160 running near 100% CPU busy:
sysstat -m 1
ANY  AVG  CPU0 CPU1 CPU2 CPU3
100% 86%  70%  96%  90%  90%
As you can see, CPU usage is quite well balanced and all cores are in use. Good.
Now the sysstat -m output on the FAS6280 with the same application profile:
sysstat -m 1
ANY  AVG  CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11
99%  21%  6%   0%   0%   0%   0%   0%   0%   0%   48%  48%  58%   93%
As you can see, the FAS6280 barely uses 4-5 CPU cores out of the 12 available, and CPU11 in particular reaches 100% busy very quickly. We tried to optimize everything: mounting with NFS v2, v3 and v4, creating more aggregates, more volumes, FlexVols, traditional volumes and so on, but nothing let us increase performance.
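To put numbers on the two sysstat -m samples above, here is a small Python sketch that parses one header line and one value line and counts how many cores are doing real work. The helper names (parse_sysstat_m, busy_cores, per_core_mean) and the 10% "busy" threshold are my own assumptions for illustration, not NetApp tooling.

```python
def parse_sysstat_m(header, values):
    """Map each column name to its integer percentage."""
    names = header.split()
    numbers = [int(v.rstrip("%")) for v in values.split()]
    if len(names) != len(numbers):
        raise ValueError("header and value counts differ")
    return dict(zip(names, numbers))


def busy_cores(sample, threshold=10):
    """Per-core columns at or above `threshold` percent (hypothetical cutoff)."""
    return [name for name, pct in sample.items()
            if name.startswith("CPU") and pct >= threshold]


def per_core_mean(sample):
    """Mean of the per-core columns, to cross-check the AVG column."""
    cores = [pct for name, pct in sample.items() if name.startswith("CPU")]
    return sum(cores) / len(cores)


# The two samples exactly as posted above.
fas3160 = parse_sysstat_m(
    "ANY AVG CPU0 CPU1 CPU2 CPU3",
    "100% 86% 70% 96% 90% 90%",
)
fas6280 = parse_sysstat_m(
    "ANY AVG CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11",
    "99% 21% 6% 0% 0% 0% 0% 0% 0% 0% 48% 48% 58% 93%",
)

for label, sample in (("FAS3160", fas3160), ("FAS6280", fas6280)):
    cores = busy_cores(sample)
    print(f"{label}: {len(cores)} busy cores {cores}, "
          f"per-core mean {per_core_mean(sample):.1f}%")
```

With the 10% cutoff the FAS6280 sample shows 4 busy cores; lowering the cutoff to 5% also counts CPU0, which is where the "4-5 cores" figure comes from.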
From my personal point of view, it seems that the FAS6280 hardware is far, far ahead of the ONTAP version (8.0.1), which probably can't take advantage of the larger number of cores in this new family of filers, so it ends up using an advanced CPU like the X5670 as if it were an older dual core, or a little more. An X5670 core is simply faster than a 2218 core, so it gets better performance, but far, far from what it could do.
I read that new ONTAP upgrades should unlock NVRAM (it can currently use just 2GB of the 8GB installed) and cache (it can currently use 48GB of the 96GB installed) and should give better multithreading. Will these upgrades unlock some power in our FAS6280?
Another strange thing is that the filer reaches 100% CPU busy with extremely low latencies. I read that it is not recommended to take the filer above 90% CPU busy, because latencies could increase very fast and in an unpredictable way. This sounds reasonable, but 200us latencies are of no use to our application; we just need to stay under the 12ms limit. For instance, if we hit 100% CPU busy, is it reasonable to keep increasing the load on the filer until we reach, for example, 8ms average latency for the slowest op? Or could the latency really explode in an unpredictable way and cause problems?
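As a rough illustration of why latency can climb so steeply near saturation, here is a Python sketch of the textbook single-server M/M/1 queue, where mean response time is service time divided by (1 - utilization). This is only a toy model: a multi-core filer with prioritized domains is not an M/M/1 queue, and the 0.2 ms service time below is a hypothetical value chosen to echo the 200us figure, not a measured one.

```python
def mm1_latency(service_time_ms, utilization):
    """Mean response time of an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


def max_utilization(service_time_ms, latency_limit_ms):
    """Utilization at which the M/M/1 mean latency hits the limit."""
    if latency_limit_ms <= service_time_ms:
        raise ValueError("limit must exceed the service time")
    return 1 - service_time_ms / latency_limit_ms


SERVICE_TIME_MS = 0.2    # hypothetical; not a measured service time
LATENCY_LIMIT_MS = 12.0  # the limit quoted in the post

for rho in (0.50, 0.90, 0.95, 0.99):
    print(f"utilization {rho:.0%}: "
          f"mean latency {mm1_latency(SERVICE_TIME_MS, rho):.1f} ms")

print(f"limit reached at about "
      f"{max_utilization(SERVICE_TIME_MS, LATENCY_LIMIT_MS):.1%} utilization")
```

In this model the curve is almost flat for most of the range and then rises very quickly in the last few percent, which is the behaviour the 90% recommendation is meant to guard against. Whether the filer's "CPU busy" figure maps onto utilization in this sense is exactly the open question here.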
What do you think? I sincerely don't know what else to tune; I think we have tried almost everything. Can we balance CPU utilization better? What do you suggest?
Of course I can post more diag commands output if you need it.
Thanks in advance,
2011-02-21 01:06 AM
[ it is a bit offtopic, sorry for this ]
How do you measure your IO profile? We are using our storage in a similar way (NFS v3) and have the same CPU-bound problem (on the much smaller 2040 platform).
2011-02-21 01:21 AM
Welcome to the community!
Let me first tell you that you should never judge a NetApp system's load by CPU usage. It's always a bad idea to conclude that a system is busy just by looking at CPU usage. To make the best use of your NetApp system, always use multithreaded operation with the recommended settings and look at the latency values, which in your case are very good at 200us.
ONTAP ties every type of operation to a stack called a domain. Every domain is coded so that it first uses its explicitly assigned CPU, and once it saturates that CPU it shifts its load to the next CPU. Every domain has its own priority, and these are hard-coded, so for instance a housekeeping domain will always have lower priority than the NFS/CIFS domain, and ONTAP always makes sure that user requests take priority over internal system work.
One more thing: please don't look at the CPU stats' 'ANY' counter; always look at 'AVG', as the 'ANY' counter always gives me a mild heart attack.
Lastly, I would say that the issue you are looking at is cosmetic. However, if you think you have a performance problem, run a benchmark with SIO_ontap to see the system's capacity, or open a support case, but I am sure you will hear the same thing from support.
2011-02-21 03:15 AM
Thanks for your answer! I was not looking at the ANY CPU counter (which I had already understood to be a meaningless parameter) but at the CPU busy figure from sysstat -x N (I haven't posted it, but it was at 99%). Is this parameter any better for understanding the load on a storage system, or is it meaningless too? How can I tell when the storage will reach its limit? Do I just push until it dies, or is there a predictable way to find out?
I mean, you are saying that the cores are put to work one after the other as they become saturated (a cool way to reduce CPU context switches!). So could I use the number of cores in use as a gauge of when the storage is reaching the end of its performance? The FAS has 12 cores; when I see 11 of them in use, can I say it's time to buy another one? Right now just 4-5 are really in use. Does this mean we are far from saturation?
In a NetApp whitepaper I read that going above 90% CPU busy can give unpredictable results, with latency increasing explosively. Is this right? Of course we have to monitor the storage, and we need to know that we have to add another system at least one or two months before this happens, simply because acquiring new storage is not that fast.
For benchmarking purposes we prefer to install some servers that recreate our application environment, to really understand how the storage performs. In the past we have seen very different behaviour between benchmark results and real-world results, so we prefer a more traditional approach to avoid future surprises (it has already happened).
Currently we have installed 20 dual quad-core Xeon servers to stress the storage, and we reached 100% CPU busy on the filer. As I said, latency continues to be exceptionally low, but we need to know the real limits of the storage to better understand how many months it will give us sufficient performance. We come from EMC Celerras, and on those appliances everything is much more linear, meaning that when you reach 100% CPU on the datamover you are dead and everything stops working well, so it is quite easy to calculate its limits and forecast when it will saturate. Should I keep adding test servers until I reach our latency limit? How reliable is this approach? I mean, we don't want latency to explode with 1% more load on the storage.
PS: Bosko, you can reset your filer's NFS statistics with nfsstat -z, run your application for a while, then run nfsstat again to see the NFS profile as percentages, along with sizes.
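If you copy the per-operation call counts out of nfsstat by hand, turning them into a metadata/read/write split like the 80/12/8 profile mentioned earlier is simple arithmetic. Here is a Python sketch of that step. It does not parse nfsstat output; the counts below are made-up sample numbers, and treating every NFSv3 operation other than READ and WRITE as "metadata" is my own simplifying assumption.

```python
def nfs_profile(op_counts):
    """Return the percentage of metadata, read and write calls.

    `op_counts` maps NFSv3 operation names to call counts. Everything
    other than "read" and "write" is counted as metadata (an assumption).
    """
    total = sum(op_counts.values())
    if total == 0:
        raise ValueError("no operations counted")
    reads = op_counts.get("read", 0)
    writes = op_counts.get("write", 0)
    metadata = total - reads - writes
    return {
        "metadata": 100 * metadata / total,
        "read": 100 * reads / total,
        "write": 100 * writes / total,
    }


# Hypothetical counts, chosen only to illustrate the calculation.
sample_counts = {
    "getattr": 500,
    "lookup": 200,
    "access": 100,
    "read": 120,
    "write": 80,
}

for kind, pct in nfs_profile(sample_counts).items():
    print(f"{kind:>8}: {pct:.0f}%")
```

Resetting the counters first, as described above, matters because otherwise the percentages reflect everything since the last reset rather than the workload you just ran.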
2011-02-21 05:04 AM
Welcome to the club, as we say here... and I'm not talking about the NetApp community but about the FAS6xy0 CPU performance issue.
We see the same problem here. And we have been hearing the same explanations and "excuses" from NetApp for many months.
2011-02-21 05:21 AM
We see the same problem as described by Lorenzo in our environment. And if your domains are really "shifting load to the next CPU", they certainly stop "shifting" at a maximum of 4 (used) CPUs.
To "open a case" is a waste of time. NetApp support asks for lots of different logs and things, just to tell you at the end "it works as designed" and "all will be good with ONTAPxyz".
2011-02-21 06:51 AM
I was calmed down by Lovik's answer, and now I'm sinking back into my depression..
So how do you manage the CPU limit? Do you increase the pressure on your storage until you reach the desired latency (I must admit this approach gives me the creeps), or do you keep CPU busy steadily near 90% and avoid going higher?
2011-02-21 07:39 AM
Going back to your NFS profile (80% metadata), are all of your metadata operations GETATTR calls? If so, is there any chance you could find a way to reduce the number of those calls?
2011-02-21 10:10 AM
If your application doesn't mind, I'd recommend disabling atime updates on all volumes. This should help a bit..
In theory, and in the best case, you could drop the MetroCluster, and maybe even the cluster, and use just the single heads. The write overhead/penalty of a MetroCluster shows up when you are really pushing those filers to the edge... well, in theory, as I said.
Unfortunately, this is not an option with our (main) application. So what we did in the meantime is introduce (stats-based) latency monitoring and spread the load of the worst application, the one we know causes all the trouble (synchronous IOs), over multiple NetApp clusters. The new FAS6280 will host other stuff for now. We hope NetApp will come up with a better ONTAP release very soon... because the FAS6080 has to replace some of the other filers soon.
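To give an idea of what a stats-based latency check can look like, here is a minimal Python sketch. It is not our actual monitoring; the function names, the 95th percentile and the 75% warning level are all hypothetical choices, and the 12 ms default is simply the limit mentioned in the first post. How the latency samples are collected from the filer is left out entirely.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency(samples_ms, limit_ms=12.0, warn_fraction=0.75, pct=95):
    """Classify a window of latency samples against a limit.

    Returns "critical" when the chosen percentile reaches the limit,
    "warning" when it reaches warn_fraction of the limit, else "ok".
    """
    observed = percentile(samples_ms, pct)
    if observed >= limit_ms:
        return "critical"
    if observed >= warn_fraction * limit_ms:
        return "warning"
    return "ok"


# Hypothetical sample windows, in milliseconds.
quiet_window = [0.2] * 100
busy_window = [0.2] * 90 + [10.0] * 10

print(check_latency(quiet_window))
print(check_latency(busy_window))
```

Watching a high percentile rather than the mean is a deliberate choice in this sketch: a handful of slow synchronous operations can hurt the application long before the average moves.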
We started to use "jumbo frames" and now use 10GigE... but this may also not free up enough CPU cycles in your case... and we disabled dedup on all performance-relevant NetApp filers.
We also started to change our "well known" troublemaking application. In the future we will try to use local disks as much as possible for the application logs and similar things, and write less synchronous data to the filer.
Some good answers on that topic can be found here, for example:
http://nfs.sourceforge.net/nfs-howto/ar01s05.html (see "5.9. Synchronous vs. Asynchronous Behavior in NFS")
http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html (ZFS - synchronous vs. asynchronous IO)
http://kerneltrap.org/mailarchive/linux-fsdevel/2009/8/27/6359123/thread (Linux kernel: adding proper O_SYNC/O_DSYNC)
I still don't have a 100% satisfying answer from NetApp as to why they don't spread their "domains" over the CPU resources they have... or just give up on that kind of design. Maybe there are deep WAFL secrets involved... and maybe the new SSD shelves could help then... if you can afford them.
2011-02-21 11:29 AM
Sounds like someone is a clock watcher! For starters, latency is generally the most important predictor of performance issues on a storage array. If latency rises and IOPS drop, you certainly have a problem. As it stands now, with your sub-millisecond latency, I would say you are performing exceptionally well! I understand your concern about not wanting to push your CPUs to the point where the system becomes slow, but don't forget about FlexShare! Take a look at TR-3459. FlexShare lets you prioritize your workload once your system becomes heavily loaded. You can give priority to more critical operations using FlexShare, and the best part is that you already have it on your system! If you don't, contact your local SE and they will get you the license, since it is free!