We have just acquired a shiny new FAS6280 in MetroCluster configuration with 2x512GB PAM per controller and a total of 336 FC 15k disks, hoping for awesome performance, but now that we have set it up we are quite disappointed with the NFS performance.
We already have a MetroCluster FAS3160 (our first NetApp), and I have to say that we were surprised by its performance: it reaches 35-40k NFS v3 IOPS per controller (our application profile is 80% metadata, 12% reads, 8% writes on hundreds of millions of files) with very low latency before saturating the CPU, which is the bottleneck in our profile (disks and bandwidth were OK).
We use just the NFS protocol, nothing more, nothing less, but we use it heavily and we need very high IOPS capacity (we really push our storage, unlike many customers who never get near its limits), so we decided to move to the new FAS6280 hoping for a huge improvement in CPU performance (a powerful hexa-core X5670 versus a tiny old dual-core AMD Opteron 2218). If you look at CPU benchmark websites (like http://www.cpubenchmark.net) you can see something like 6x the performance in pure CPU power, so we hoped for at least a 3x performance increase, knowing that the bottleneck in our configuration was just the CPU.
A bad surprise.. using the same application profile, a single controller seems to barely reach 75-80k IOPS before hitting 100% CPU busy. So from our point of view that is just 2x the performance of a FAS3160. The bad thing is that obviously a FAS6280 doesn't cost 2x a FAS3160...
So we tried to investigate..
The real surprise is how badly, badly (let me say badly) the CPU is employed on the FAS6280. For comparison, here is the sysstat -m from our FAS3160 running near 100% CPU busy:
sysstat -m 1
ANY   AVG   CPU0  CPU1  CPU2  CPU3
100%  86%   70%   96%   90%   90%
As you can see, CPU usage is quite well balanced and all cores are employed. Good.
Now the sysstat -m on the FAS6280 with the same application profile:
sysstat -m 1
ANY  AVG  CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7  CPU8  CPU9  CPU10  CPU11
99%  21%  6%    0%    0%    0%    0%    0%    0%    0%    48%   48%   58%    93%
As you can see, the FAS6280 barely uses 4-5 CPU cores out of the 12 available, with CPU11 in particular reaching 100% busy very soon. We tried to optimize everything: mounting with v2, v3 and v4, creating more aggregates, more volumes, flexvols, traditional vols and so on, but nothing let us increase performance..
From my personal point of view, it seems that the FAS6280 hardware is far, far more advanced than the Ontap version (8.0.1), which probably can't take advantage of the larger number of cores in this new family of filers, so it ends up using an advanced CPU like the X5670 just like an older dual core or a little more.. Simply, an X5670 core is faster than a 2218 core, so it obtains better performance.. but far, far away from what it could do..
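To make the comparison between the two sysstat -m samples above a bit less eyeball-driven, here is a minimal Python sketch that parses a header line and a value line and lists the cores doing real work. The function names and the 5% "busy" threshold are my own assumptions for illustration, not anything Ontap defines.

```python
def parse_sysstat_m(header: str, values: str) -> dict:
    """Map each sysstat -m column name to its integer percentage."""
    names = header.split()
    numbers = [int(v.rstrip("%")) for v in values.split()]
    if len(names) != len(numbers):
        raise ValueError("header and value counts differ")
    return dict(zip(names, numbers))


def busy_cores(sample: dict, threshold: int = 5) -> list:
    """Return the CPUn columns at or above the (hypothetical) busy threshold."""
    return [n for n, pct in sample.items() if n.startswith("CPU") and pct >= threshold]


fas3160 = parse_sysstat_m(
    "ANY AVG CPU0 CPU1 CPU2 CPU3",
    "100% 86% 70% 96% 90% 90%",
)
fas6280 = parse_sysstat_m(
    "ANY AVG CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11",
    "99% 21% 6% 0% 0% 0% 0% 0% 0% 0% 48% 48% 58% 93%",
)

for name, sample in (("FAS3160", fas3160), ("FAS6280", fas6280)):
    cores = busy_cores(sample)
    total = sum(1 for n in sample if n.startswith("CPU"))
    print(f"{name}: {len(cores)} of {total} cores busy: {', '.join(cores)}")
```

On these two samples it reports all 4 cores busy on the FAS3160 and 5 of 12 on the FAS6280, and the AVG column is simply the per-core mean in both cases.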
I read that new ontap upgrades should unlock NVRAM (now it can use just 2Gigs out of 8Gigs installed) and cache (now it can use 48Gigs out of 96Gigs) and should give better multithreading. Will these upgrades unlock some power out of our FAS6280?
Another strange thing is that the filer reaches 100% CPU busy with extremely low latencies. I read that it is not recommended to take the filer above the 90% CPU busy limit, because latencies could increase very fast and in an unpredictable way. This sounds reasonable, but for us it is not useful at all to have the application at 200us latencies.. we just need to stay under the 12ms limit.. For instance, if we hit the 100% CPU bound, is it reasonable to continue to increase the filer usage until we reach, for example, 8ms average latency for the slowest op? Or could the latency really explode in an unpredictable way, causing problems?
What do you think? I sincerely don't know what else to tune; I think we have tried almost everything. Can we better balance CPU utilization? What do you suggest?
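To illustrate why latency is usually described as non-linear near saturation, here is a rough Python sketch using the textbook single-server M/M/1 queue, where response time is service time divided by (1 - utilization). This is only an assumption for illustration: a multi-core filer with prioritized domains is not an M/M/1 queue, and taking the 200us figure as the unloaded service time is also an assumption.

```python
def mm1_response_time_ms(service_ms: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)


def utilization_for_latency(service_ms: float, target_ms: float) -> float:
    """Utilization at which the M/M/1 response time reaches target_ms."""
    if target_ms <= service_ms:
        raise ValueError("target must exceed the service time")
    return 1.0 - service_ms / target_ms


SERVICE_MS = 0.2  # assumed unloaded service time (the 200us from the post)

for rho in (0.50, 0.90, 0.95, 0.975, 0.99):
    print(f"utilization {rho:6.1%} -> {mm1_response_time_ms(SERVICE_MS, rho):6.2f} ms")

print(f"8 ms reached at  {utilization_for_latency(SERVICE_MS, 8.0):.2%} utilization")
print(f"12 ms reached at {utilization_for_latency(SERVICE_MS, 12.0):.2%} utilization")
```

In this simplified model the gap between 8 ms and 12 ms is less than one percentage point of utilization, which is the kind of cliff the 90% recommendation is meant to keep you away from.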
Of course I can post more diag commands output if you need it.
Thanks in advance,
Welcome to the community!
Let me first tell you that "you should never judge a NetApp system's load by CPU usage." It's always a bad idea to think that a system is busy just by looking at CPU usage. To get the best out of your NetApp system, always use multithreaded operations with the recommended settings and look at the latency values, which in your case are very good at 200us.
Ontap ties every type of operation into a stack called a domain. Every domain is coded so that it first uses its explicitly assigned CPU, and once it saturates that CPU it shifts its load to the next CPU. Every domain has its own priority and these are hard coded, so for instance a housekeeping domain will always have lower priority than the NFS/CIFS domain, and Ontap always makes sure that user requests take priority over internal system work.
One more thing: please don't look at the CPU stats 'ANY' counter, always look at 'AVG', as the 'ANY' counter always gives me a mild heart attack.
Finally, I would say that the issue you are looking at is cosmetic. However, if you think you have a performance problem, run a benchmark with SIO_NTAP to see the system's capacity, or open a support case, but I am sure you will hear the same thing from support.
A couple of things: we've been playing around with the 6280s for 3 or 4 months, and I have to say we've seen pretty spectacular NFS performance with them (peak 2.5-3GB/s off disk, much higher off cache, and about 1GB/s straight onto disk, with 144 spindles).
Firstly, sysstat -M 1 doesn't really work anymore. I'm assuming it's because a lot of the subsystems now reside in BSD, things like networking, RAID and WAFL (I believe, don't quote me on that).
Effectively, with the way NVRAM is deployed you'll only get the benefit if you are running single systems. I've been told that with clustering you get 2GB per head with the mirroring, which will stay the same with the upgrade. You'll have 4GB to play with on a single node.
With 1TB of PAM in there, access to another 48GB of memory probably isn't going to make a major difference to throughput.
A statit will give you much better/accurate information about how the cores are loaded up.
We personally haven't been able to break it (other than a few software bugs we found in testing). We find the performance is pretty linear even with a couple of thousand workstations hammering a single node (deliberately trying to kill the node); obviously the total single-node throughput decreases, but the response times were very good.
A perfstat and/or a statit would help.
Our application profile doesn't produce high throughput; instead it generates a lot of small IOPS, especially metadata, causing high CPU utilization on the filer.
At this time I can say that, with 40 dual Xeon quad-core servers with our application installed, we reached the FAS limit, causing high latency (which for us means NFS read > 12ms) and high IOWAIT (>20%) on the servers. To confirm that we had reached the limit of the 6280 we launched a simple ls -R on the same mount point from another, unloaded server and observed that the response was very, very slow but, I have to say, constant (about one directory listing every five seconds).
With this particularly heavy load we reached 110-120k IOPS, with 500mbit/sec bandwidth usage and 40% disk utilization (160 spindles in two plexes).
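As a quick sanity check on how metadata-heavy this load is, the figures above can be turned into an average payload per operation. A minimal Python sketch, assuming "500mbit/sec" means 500 decimal megabits per second and ignoring protocol overhead:

```python
def bytes_per_op(mbit_per_s: float, ops_per_s: float) -> float:
    """Average bytes moved per NFS operation for a given bandwidth and op rate."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8  # decimal megabits -> bytes
    return bytes_per_s / ops_per_s


for iops in (110_000, 120_000):
    print(f"{iops} ops/s at 500 Mbit/s -> {bytes_per_op(500, iops):.0f} bytes per op")
```

That works out to roughly half a kilobyte per operation, which fits a profile dominated by small metadata calls rather than bulk reads and writes.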
Do you know a reliable way to tell when the storage is reaching its limit? I mean, now I can run tests, but when the storage is in production I will need to know exactly how it can perform and anticipate when it will run out of performance capacity (automated tests are giving me some estimates that I hope are as close as possible to production, but of course millions of real users are different!).
As I said, with EMC Celerras it is quite straightforward to understand storage utilization because 100% CPU is a real limit.. if you hit it you are in trouble.. With NetApp I was at 100% CPU busy when the FAS was giving me 60k IOPS and I was at 100% CPU busy when I was at 100k IOPS too.. but in both cases with good performance anyway (as I said, I don't need 0.2ms latency, and unfortunately the latency increase is not linear at all..)
Yes, statit is cool, but from what I've seen it is useful when you are in trouble and want to understand where the problem is. I'm not an expert on statit and probably can't fully understand its output. What are the main parameters to read from statit output that can help me anticipate problems and make predictions about them?
I've read that in the 8.1 release NetApp should fix some threading issues to better utilize modern processors with high core counts like the X5670 in the FAS6280.
Does anyone know when it should be released?
CPU utilisation is a fair indication, but you need to be looking in the right place: with 8.x, sysstat -m 1 doesn't work anymore.
Domain utilisation in a statit is a much more reliable method for checking.
It's not a matter of believing or not; I only replied to your statement:
with 8.x sysstat -m 1 doesnt work anymore
And I gave you the output of this command from Ontap 8.0.1, which shows that this command works perfectly.
But I have one more question. I often get questions from customers: "I am copying (or dd-ing) from/to the NetApp and I get something like 15-20MB/s, it's slow" etc. I know it's bullshit because their applications do not use dd/copy as their main process, but I still can't find a good answer for them. Can you suggest something?
And a last question: how did you know that your filer is optimized for random I/O and not sequential?
Why don't you use the SIO_NTAP utility? Look in the tools section of the NOW site and you will find it there. In any case, if you want to test any storage system's performance, never ever use dd, as it's the worst tool for judging performance since it's single threaded.
Attach a properly configured LUN with FCP or iSCSI and use ATTO Disk Benchmark; it gives you a nice and proper look at REAL performance. That's where you get nice values of 300MB/sec and more from a single LUN : )
dd is single threaded, hence the slow performance. Open up more consoles and run more dds, or even dds from different machines, so you have a degree of parallelism. It's always reading a block, writing a block, plain & simple. Windows is faster since it reads a few blocks, caches them, and writes them after a certain amount of cached data etc.. dd is not, never was, and probably never will be a proper method of measuring performance.
You might play around with bs, 4MB is usually a better value, and google for "why is dd so slow?". You will find plenty of discussions; some suggest using cat over dd, but you will see, as I said, that it's a rather poor performance measuring tool.
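The point about running several dd streams side by side can be sketched in Python as well. This is only a rough illustration, not a benchmark tool: the function names are made up, the sizes are tiny, and it writes into a temporary local directory, so you would have to point it at a directory on the NFS mount (an assumption about your setup) to learn anything about the filer.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_stream(path: str, block_size: int, blocks: int) -> int:
    """Write one sequential stream block by block, like a single dd."""
    block = b"\0" * block_size
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually left the client cache
    return block_size * blocks


def parallel_write(directory: str, streams: int, block_size: int = 1 << 20, blocks: int = 8):
    """Run several write streams concurrently; return (total bytes, elapsed seconds)."""
    paths = [os.path.join(directory, f"stream{i}.bin") for i in range(streams)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        total = sum(pool.map(lambda p: write_stream(p, block_size, blocks), paths))
    return total, time.perf_counter() - start


with tempfile.TemporaryDirectory() as workdir:  # replace with a directory on the NFS mount
    for streams in (1, 4):
        total, elapsed = parallel_write(workdir, streams)
        rate = total / (1 << 20) / max(elapsed, 1e-9)
        print(f"{streams} stream(s): {total >> 20} MiB in {elapsed:.3f}s ({rate:.0f} MiB/s)")
```

Comparing the aggregate rate for 1 stream against 4 or more is what shows whether the single-stream number was limited by the client rather than by the storage.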
You should not only play around with "bs" but maybe also with the "iflag=direct" and "oflag=direct" for direct I/O's and with "oflag=sync" and "oflag=dsync" for synchronous tests.
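To keep such a set of dd variants consistent between runs, the command lines can be generated rather than typed by hand. A small Python sketch, assuming GNU dd; the test file path, the count and the variant names are hypothetical, and the script only prints the commands, it does not run them.

```python
import shlex


def dd_variants(test_file: str, bs: str = "4M", count: int = 256) -> dict:
    """Build argv lists for buffered, direct and synchronous dd runs (GNU dd flags)."""
    write_base = ["dd", "if=/dev/zero", f"of={test_file}", f"bs={bs}", f"count={count}"]
    read_base = ["dd", f"if={test_file}", "of=/dev/null", f"bs={bs}"]
    return {
        "buffered write": write_base,
        "direct write": write_base + ["oflag=direct"],  # bypass the client page cache
        "sync write": write_base + ["oflag=sync"],      # synchronous data and metadata
        "dsync write": write_base + ["oflag=dsync"],    # synchronous data only
        "buffered read": read_base,
        "direct read": read_base + ["iflag=direct"],
    }


# /mnt/nfs_test/ddfile is a hypothetical path on the NFS mount under test.
for name, argv in dd_variants("/mnt/nfs_test/ddfile").items():
    print(f"{name:15s} {shlex.join(argv)}")
```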
I personally also don't really like "dd" as a performance testing tool. Nevertheless, we have UNIX guys here who always use it for their quick performance tests. They use different options, yep, and as their "dd test suite" has been the same for years, they claim the results do say something.
Btw., here is a maybe interesting "pro dd" read:
To understand what may happen with your "dd results" when Linux cached the files and how to prevent this with newer kernels, you may also find this one interesting:
When using iozone, you should always use the "-U" command line switch with "-a" to unmount the filesystem being used after each run. This has the effect of flushing and invalidating the cache so that your test run of the 8k blocks won't influence the 16k run, etc.
OK, I can understand this, but what do I say to customers (who sometimes come from another place where they had "better copying performance")? That it works as designed? There is a huge difference between 20MB/s and 100MB/s. And that's my point.
Would you say that to your customers? Don't worry about copying speeds, that's normal...
We have gone further with our tests and we have probably begun to better understand how this filer works.
First of all, Lovik seems to be right.
We pushed CPU utilization further, and what we have seen is that the harder you press, the more cores begin to be employed. We reached 7 cores in use:
ANY   AVG  CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7  CPU8  CPU9  CPU10  CPU11
100%  33%  19%   0%    0%    0%    0%    0%    41%   38%   66%   68%   71%    90%
Pushing further, we began to use CPU7, then CPU6. Latency increased but stayed very low.. average latency was 0.500 msec (the slowest operation was the read op at 5ms, then mkdir at 2.4ms, then the others at under 1 ms).
The IOPS have doubled:
CPU   Total    Net kB/s         Disk kB/s       Tape kB/s     Cache  Cache  CP    CP  Disk
      ops/s    in      out      read    write   read  write   age    hit    time  ty  util
99%   160143   152620  300862   15218   127792  0     0       3      99%    100%  :   12%
PAM (1TB) seems to work well, avoiding a lot of disk reads:
Usage  Hit    Meta   Miss   Hit  Evict  Inval  Insert  Chain  Blocks  Chain  Blocks  Replaced
%      /s     /s     /s     %    /s     /s     /s      /s     /s      /s     /s      /s
90     3947   3700   4254   48   105    0      6430    3887   3901    100    6425    3887
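The Hit % column in the PAM counters above can be re-derived from the Hit/s and Miss/s columns, which is a handy check when reading these samples. A minimal Python sketch (the function names are mine):

```python
def hit_percent(hits_per_s: float, misses_per_s: float) -> float:
    """Share of cache lookups that were served from the cache, in percent."""
    total = hits_per_s + misses_per_s
    return 100.0 * hits_per_s / total if total else 0.0


def meta_share_percent(meta_hits_per_s: float, hits_per_s: float) -> float:
    """Share of cache hits that were metadata blocks, in percent."""
    return 100.0 * meta_hits_per_s / hits_per_s if hits_per_s else 0.0


hits, meta, misses = 3947, 3700, 4254  # values from the sample above

print(f"hit rate: {hit_percent(hits, misses):.1f}%")
print(f"metadata share of hits: {meta_share_percent(meta, hits):.1f}%")
```

The hit rate comes out at about 48%, matching the Hit % column, and nearly all of the hits are metadata, which is consistent with the application profile described earlier in the thread.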
The strange thing is that CPU busy was 100% at 80000 IOPS and continued to be 100% at 160000 IOPS, telling me that CPU utilization is not a good indicator of storage usage..
Now the 20 dual Xeon test servers are saturated.. they have a load of 300 and cannot spit out more CPU cycles.. anyway, the IO wait on the clients continues to be low, near 4%, and this seems to indicate that we haven't reached the storage limit yet.
At this point the only reliable indication of storage usage seems to be latency, as Lovik said.. we'll try adding test servers to understand where the limit is and whether pressing on the storage will make latency explode or not (we want 12ms latency at most)..
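If latency is the indicator to watch, the check itself is simple: take the per-operation latencies and compare the slowest one with the 12ms limit. A minimal Python sketch; how the latencies are collected is left out, and the function name and sample dictionary are hypothetical (the two values are the read and mkdir figures quoted above).

```python
LIMIT_MS = 12.0  # the maximum acceptable latency mentioned in this thread


def latency_headroom(latencies_ms: dict, limit_ms: float = LIMIT_MS):
    """Return (slowest op, its latency, headroom to the limit, limit exceeded?)."""
    worst_op = max(latencies_ms, key=latencies_ms.get)
    worst = latencies_ms[worst_op]
    return worst_op, worst, limit_ms - worst, worst > limit_ms


sample = {"read": 5.0, "mkdir": 2.4}  # per-op latencies in ms from the test above

op, worst, headroom, exceeded = latency_headroom(sample)
status = "OVER LIMIT" if exceeded else "ok"
print(f"slowest op: {op} at {worst} ms, headroom {headroom} ms ({status})")
```

Tracking that headroom as more test servers are added would show whether latency creeps up gradually or jumps.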
Thank you all,
I am glad you have started to understand NetApp systems.
One question from me (I am also having very serious performance problems and we don't know where they come from): you wrote that you disabled dedupe, but it didn't help.
Does that mean that you only used the command sis off /vol ? Or did you use the sis undo command (really disabling dedupe)?
We only disabled dedupe (without undoing it) and later on we couldn't undo it (due to not enough space).
Just run sis status: if it shows "disabled", then it still checks for changed blocks on your volume. Try to undo the asis volumes.
Probably it's more correct to say that dedupe was never turned on and we keep it disabled.
If I run sis status I get "No status entry found."
Is dedupe causing excessive overhead for you? NetApp claims the overhead is very low, but I've never tried it and I'm interested in real customer applications.
In my case I know dedupe is wrong in many ways. We made some mistakes earlier, like disabling dedupe (without undo), and later on we couldn't undo it due to not enough space (overdeduplicated).
I also sometimes get a call that a customer sees a performance issue, and when I check there are, let's say, 3 concurrent dedupe tasks running (some in verification mode), and when I disable them the problem seems to be solved.
That is why in the new setup with a clustered (no metro) FAS3160 with PAM2 I won't use deduplication, because of the performance implications.
In my case I could never get help from NetApp, as they always said that I have misaligned VMs.
May I ask, btw, whether you are using your filers for VMware as well or just physical boxes for websites? And did anything change after going from 1gbit to 10g?
This FAS6280 is planned to host only our mail application. We are planning a test with our web hosting service, but not for now, and no virtual machine services at all.
I've checked for differences between a 1gbit link and a 10gbit link, but what I've found is that if you don't need the bandwidth, the CPU usage doesn't change very much.
I've checked the FAS3160 (linked at 2 x 1Gbit/sec) and the FAS6280 (linked at 2 x 10gbit/s). Using ps -c 1 in advanced mode I see 8% CPU usage for each network thread in both the 1gbit and 10gbit configurations, so no difference.
Slightly better usage in the 10gbit config is in the link thread e0X_qidX (10gbit), which seems to use less CPU than the Gb_Enet/eXX (1gbit) thread, but I'm talking about 1% against 3%, so a very, very small difference and probably too low to be considered reliable.
I think that the processing power needed for networking is low and you should upgrade to 10gig only if you need the bandwidth.. but this is my opinion of course..