We have just acquired a shiny new FAS6280 in a MetroCluster configuration, with 2x512GB PAM per controller and a total of 336 FC 15k disks, hoping for awesome performance. Now that we have set it up, we are quite disappointed with its NFS performance.
We already have a MetroCluster FAS3160 (our first NetApp), and I have to say that we were surprised by its performance: it reaches 35-40k NFS v3 IOPS per controller with very low latency (our application profile is 80% metadata, 12% reads, 8% writes on hundreds of millions of files). At that point the CPU is saturated; it is the bottleneck for our profile, while disks and bandwidth are fine.
We use just the NFS protocol, nothing more, nothing less, but we use it heavily and we need very high IOPS capacity (we really push our storage, unlike many customers who never come near its limits). So we decided to move to the new FAS6280, hoping for a huge improvement in CPU performance (a powerful hexa-core X5670 versus a tiny old dual-core AMD Opteron 2218). If you look at CPU benchmark websites (like http://www.cpubenchmark.net) you can see something like 6x the performance in pure CPU power, so we hoped for at least a 3x performance increase, knowing that the bottleneck in our configuration was just the CPU.
A bad surprise: with the same application profile, a single controller seems to barely reach 75-80k IOPS before it hits 100% CPU busy. So from our point of view that is just 2x the performance of a FAS3160. The bad thing is that a FAS6280 obviously doesn't cost just 2x a FAS3160...
So we tried to investigate.
The real surprise is how badly (let me say it again: badly) the CPU is used on the FAS6280. For comparison, here is the sysstat -m output from our FAS3160 running at nearly 100% CPU busy:
sysstat -m 1
ANY AVG CPU0 CPU1 CPU2 CPU3
100% 86% 70% 96% 90% 90%
As you can see, CPU usage is quite well balanced and all cores are in use. Good.
Now the sysstat -m output on the FAS6280 with the same application profile:
sysstat -m 1
ANY AVG CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11
99% 21% 6% 0% 0% 0% 0% 0% 0% 0% 48% 48% 58% 93%
As you can see, the FAS6280 barely uses 4-5 CPU cores out of the 12 available, and CPU11 in particular reaches 100% busy very quickly. We tried to optimize everything: mounting with NFS v2, v3 and v4, creating more aggregates, more volumes, flexvols, traditional volumes and so on, but nothing let us increase performance.
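To make the comparison between the two outputs above easier to repeat, here is a minimal Python sketch that parses `sysstat -m` style text and counts how many cores are doing real work. The function names and the 10% "busy" threshold are assumptions chosen for illustration, not anything defined by ONTAP.

```python
def parse_sysstat_m(text):
    """Parse `sysstat -m` style output into a list of {column: percent} dicts."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (int(v.rstrip("%")) for v in row))) for row in rows]


def busy_cores(sample, threshold=10):
    """Return the per-core columns at or above `threshold` percent (assumed cutoff)."""
    return [name for name, pct in sample.items()
            if name.startswith("CPU") and pct >= threshold]


# The two samples quoted in this post.
FAS3160 = """
ANY AVG CPU0 CPU1 CPU2 CPU3
100% 86% 70% 96% 90% 90%
"""

FAS6280 = """
ANY AVG CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11
99% 21% 6% 0% 0% 0% 0% 0% 0% 0% 48% 48% 58% 93%
"""

for label, text in (("FAS3160", FAS3160), ("FAS6280", FAS6280)):
    sample = parse_sysstat_m(text)[0]
    cores = [k for k in sample if k.startswith("CPU")]
    print(f"{label}: {len(busy_cores(sample))}/{len(cores)} cores busy, "
          f"AVG {sample['AVG']}%")
```

With the assumed 10% cutoff, the FAS3160 sample counts all 4 cores as busy, while the FAS6280 sample counts only CPU8 to CPU11.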
From my personal point of view, it seems that the FAS6280 hardware is far more advanced than the ONTAP version (8.0.1), which probably can't take advantage of the larger number of cores in this new family of filers. So it ends up using an advanced CPU like the X5670 as if it were an older dual core, or a little more. An X5670 core is simply faster than a 2218 core, so it gets better performance, but far from what it could do.
I read that new ONTAP upgrades should unlock NVRAM (now it can use just 2GB out of the 8GB installed) and cache (now it can use 48GB out of the 96GB), and should give better multithreading. Will these upgrades unlock some power in our FAS6280?
Another strange thing is that the filer reaches 100% CPU busy with extremely low latencies. I read that it is not recommended to push the filer above 90% CPU busy, because latencies could increase very fast and in an unpredictable way. This sounds reasonable, but 200us latencies are of no use to our application; we just need to stay under the 12ms limit. For instance, if we hit 100% CPU busy, is it reasonable to keep increasing the load on the filer until we reach, for example, 8ms average latency for the slowest op? Or could the latency really explode in an unpredictable way and cause problems?
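As a rough illustration of why latency can climb so steeply near saturation, here is a minimal sketch using the textbook M/M/1 queueing formula R = S / (1 - rho). This is an assumption for illustration only: a multi-core filer is not a single-server queue, the reported "CPU busy" figure may not equal the utilization rho, and treating the 200us figure as the service time S is also an assumption.

```python
def mm1_response_ms(service_ms, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)


def max_utilization(service_ms, limit_ms):
    """Highest utilization that keeps the mean response time at or below limit_ms."""
    return 1.0 - service_ms / limit_ms


SERVICE_MS = 0.2  # assumed service time, taken from the 200us latency quoted above

for rho in (0.50, 0.90, 0.95, 0.975, 0.99):
    print(f"utilization {rho:.3f} -> {mm1_response_ms(SERVICE_MS, rho):.1f} ms")

print(f"8 ms is reached at utilization {max_utilization(SERVICE_MS, 8.0):.3f}")
```

Under these assumptions the model stays below 8ms up to 97.5% utilization, but going from 97.5% to 99% already takes it from 8ms to 20ms, which is the kind of steep, hard-to-control rise the 90% guideline tries to avoid.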
What do you think? I sincerely don't know what else to tune; I think we have tried almost everything. Can we balance CPU utilization better? What do you suggest?
Of course I can post more diagnostic command output if you need it.
Thanks in advance,
Lorenzo.
1 ACCEPTED SOLUTION
Hi Lorenzo,
Welcome to the community!
Let me first tell you that you should never judge a NetApp system's load by CPU usage. It's always a bad idea to conclude that a system is busy just by looking at CPU usage. To get the best out of your NetApp system, always use multithreaded operations with the recommended settings and look at the latency values, which in your case are very good at 200us.
ONTAP ties every type of operation to a stack called a domain. Every domain is coded so that it first uses its explicitly assigned CPU and, once it saturates that CPU, shifts its load to the next CPU. Every domain also has its own hard-coded priority, so, for instance, a housekeeping domain will always have lower priority than the NFS/CIFS domain, and ONTAP always makes sure that user requests take priority over internal system work.
One more thing: please don't look at the 'ANY' counter in the CPU stats; always look at 'AVG', as the 'ANY' counter always gives me a mild heart attack.
Finally, I would say that the issue you are looking at is cosmetic. However, if you think you have a performance problem, run a benchmark with SIO_NTAP to see the system's capacity, or open a support case, but I am sure you will hear the same thing from support.
Cheers!
43 REPLIES
You will never be able to max out the system with a single-stream command like dd. You need multiple concurrent I/O streams for this.
Aborzenkov,
Thanks for your input. But why, then, if I set up a normal physical server with Windows 2008, connect to it from another server via \\servername\c$, and read (or write) a file, do I constantly get 100-120MB/s? I often get that question from my customers, and I can't really explain that behavior.
Do you have a valid explanation for that? I would appreciate it; maybe it would help me understand this better.
This might be a matter of the window size or packet size of your LAN connection, or of the NIC itself. Test your network connection with iperf using a smaller and a larger window size (parameters -w4K and -w128K). With 4K you get 300Mbps, while with 128K you get almost the full bandwidth. Moreover, non-server (e.g. laptop) 1Gbps NICs usually can't do better than 400Mbps. BTW, it is interesting that you achieve 100MB/s with Windows 2008. We can't get that much with 2003.
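The window-size effect can be sketched with the usual bound for a single TCP stream: throughput is at most window / round-trip time. The snippet below is only an illustration under assumptions: it ignores protocol overhead, assumes a 1Gbps link, and derives the round-trip time from the 4K / 300Mbps figure quoted above rather than from any measurement.

```python
def window_limited_mbps(window_bytes, rtt_s):
    """Upper bound on one TCP stream's throughput: window / RTT, in Mbit/s."""
    return window_bytes * 8 / rtt_s / 1e6


def implied_rtt_s(window_bytes, observed_mbps):
    """RTT that would make `window_bytes` the limiting factor at `observed_mbps`."""
    return window_bytes * 8 / (observed_mbps * 1e6)


LINK_MBPS = 1000  # assumed 1 Gbit/s link

# Assumption: the 4K window really was the bottleneck at 300 Mbit/s.
rtt = implied_rtt_s(4 * 1024, 300)
print(f"implied RTT: {rtt * 1e6:.0f} us")

for kib in (4, 16, 64, 128):
    bound = window_limited_mbps(kib * 1024, rtt)
    print(f"{kib:>3} KiB window: up to {min(bound, LINK_MBPS):.0f} Mbit/s")
```

Under these assumptions, anything from a 16 KiB window upward is limited by the link rather than by the window, which is consistent with the near line-rate result at 128K.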
Paweł
Most likely. The stack is also important: Win2k8 is the first Windows OS in years that can use its TCP stack efficiently, hence the massive network performance improvement.
Could you show the exact dd command you used as well as NFS mount options?
Well, the dd command is pretty common:
dd if=/dev/zero of=/tmp/testfile bs=1M count=2000
This creates a 2GB file.
The NFS mount is from an ESX host, so basically the default one, but we also ran the same test on the following NFS mount (physical Linux server):
netapp:/vol/nfs1 /vhosts nfs defaults,noatime,hard,intr 0 0
Hi Marek, I launched 17 dd streams from 17 different servers against the 6280, and I reached 408MByte/sec writing and 621MByte/sec reading. I have to say that our FAS is not optimized for throughput but for random I/O, and it doesn't have many disk loops (the two controllers are installed in different places, so we need to limit the number of fiber patches). This caused a B in the sysstat CP field.
Performance could probably be increased with a higher number of loops.
Lorenzo.
Hi,
Thanks for your answer. How many disks do you have per controller?
Were these servers physical Linux machines or virtual?
And what result did you get from the dd command on each server?
Marek
I am glad you have started to understand NetApp systems.
To tune your filer further, you can look at some other areas.
Try looking at the way files are organized and accessed by the application. Wildcard searches and big or deep directory listings are very memory intensive and can easily bring your filer to its knees. When you say you need to host billions of files, you have to be very careful: you need a good design in place so that your host OS limitations as well as the filer limitations are well understood, such as filenames, directory structure, number of files in a directory, number of files on a system, and so on. Every time you store a file or directory, inode and metadata information gets associated with it, and to read a file or directory all of those details need to be loaded into memory. How much depends on the application's access pattern, but it puts a big load on the server.
So, if you haven't done any design work on the file/directory layout, I would strongly suggest that you engage a consultant and ask them to help you design the structure.
Cheers!
A couple of things. We've been playing around with the 6280s for 3 or 4 months, and I have to say we've seen pretty spectacular NFS performance with them: a peak of 2.5-3GB/s off disk (much higher off cache) and about 1GB/s straight onto disk (144 spindles).
First, sysstat -M 1 doesn't really work anymore. I'm assuming it's because a lot of the subsystems now reside in BSD, things like networking, RAID and WAFL (I believe; don't quote me on that).
With the way NVRAM is deployed, you'll only get the benefit if you are running single systems. I've been told that with clustering you get 2GB per head because of the mirroring, and that will stay the same with the upgrade. You'll have 4GB to play with on a single node.
With 1TB of PAM in there, access to another 48GB of memory probably isn't going to make a major difference to throughput.
A statit will give you much better and more accurate information about how the cores are loaded.
We personally haven't been able to break it (other than a few software bugs we found in testing). We find the performance is pretty linear, even with a couple of thousand workstations hammering a single node (deliberately trying to kill the node). Obviously the total single-node throughput decreases, but the response times were very good.
A perfstat and/or a statit would help.
Hi Shane,
Our application profile doesn't produce high throughput; instead it generates a lot of small I/Os, especially metadata, causing high CPU utilization on the filer.
At this time I can say that, with 40 dual quad-core Xeon servers running our application, we reached the FAS limit, causing high latency (which for us means NFS read > 12ms) and high IOWAIT (>20%) on the servers. To confirm that we had reached the limit of the 6280, we launched a simple ls -R on the same mount point from another, unloaded server and observed that the response was very, very slow but, I have to say, constant (about one directory listing every five seconds).
With this particularly heavy load we reached 110-120k IOPS, with 500Mbit/sec bandwidth usage and 40% disk utilization (160 spindles in two plexes).
Do you know a reliable way to tell when the storage is about to reach its limit? I mean, I can run tests now, but when the storage is in production I will need to know exactly how it can perform and anticipate when it will run out of performance capacity. (Automated tests are giving me some estimates that I hope are as close as possible to production, but of course millions of real users are different!)
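One possible way to turn load-test measurements into a capacity estimate is sketched below. It assumes latency follows the simple queueing shape R = S / (1 - X/C), where X is the offered IOPS, S the unloaded latency and C the saturation IOPS; under that assumption 1/R is linear in X, so a straight-line fit recovers S and C. The model, the function names and above all the sample points are hypothetical (the samples are made up to fit the model exactly, they are not measurements from this filer).

```python
def fit_saturation(samples):
    """Least-squares fit of 1/latency = a + b*iops.

    Returns (service_ms, capacity_iops) for the assumed model R = S / (1 - X/C).
    """
    n = len(samples)
    xs = [x for x, _ in samples]
    ys = [1.0 / r for _, r in samples]
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    if a <= 0 or b >= 0:
        raise ValueError("samples do not show latency rising with load")
    return 1.0 / a, -a / b


def iops_at_latency(service_ms, capacity_iops, limit_ms):
    """IOPS at which the fitted model reaches `limit_ms` mean latency."""
    return capacity_iops * (1.0 - service_ms / limit_ms)


# Hypothetical (iops, latency_ms) pairs, NOT real measurements.
SAMPLES = [(40_000, 0.3), (80_000, 0.6), (100_000, 1.2), (110_000, 2.4)]

service_ms, capacity = fit_saturation(SAMPLES)
print(f"unloaded latency ~{service_ms:.2f} ms, saturation ~{capacity:.0f} IOPS")
print(f"12 ms limit reached near {iops_at_latency(service_ms, capacity, 12.0):.0f} IOPS")
```

Real measurements will not sit on the line this neatly; if the fit is poor, the model does not describe the workload and the estimate should not be trusted.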
As I said, with EMC Celerras it is quite straightforward to understand storage utilization because 100% CPU is a real limit: if you hit it, you are in trouble. With NetApp I was at 100% CPU busy when the FAS was giving me 60k IOPS, and I was at 100% CPU busy at 100k IOPS too, but in both cases with good performance anyway (as I said, I don't need 0.2ms latencies, and unfortunately the latency increase is not linear at all).
Yes, statit is cool, but from what I've seen it is useful when you are in trouble and want to understand where the problem is. I'm not an expert on statit output and probably can't understand it completely. What are the main parameters to read from statit output that can help me anticipate problems and make predictions about them?
I've read that in the 8.1 release NetApp should fix some threading issues to better utilize modern processors with high core counts, like the X5670 in the FAS6280.
Does anyone know when it should be released?
Thanks,
Lorenzo.
But I have one more question. I often get questions from customers: "I am copying (or running dd) from/to the NetApp and I get something like 15-20MB/s; it's slow", etc. I know it's bullshit because their applications do not use dd/copy as their main process, but I still can't find a good answer for them. Can you suggest something?
And one last question: how did you know that your filer is optimized for random I/O and not sequential?
Marek
dd is single threaded and performs slowly. Open up more consoles and run more dds, or even run dds from different machines, so you have a degree of parallelism. It always reads a block, then writes a block, plain and simple. Windows is faster since it reads a few blocks, caches them, and writes them after a certain amount of cached data, etc. dd is not, never was, and probably never will be a proper method of measuring performance.
You might play around with bs; 4MB is usually a better value. Google for "why is dd so slow?" and you will find plenty of discussions. Some suggest using cat over dd, but you will see, as I said, that it's a rather poor performance measuring tool.
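To see the effect of parallelism from a single client without opening many consoles, something like the following Python sketch could be used. It is only an illustration: the function names are made up, the stream counts and sizes are arbitrary, and it writes to a temporary directory so that it runs anywhere. To test a filer, `target` would have to point at a directory on the NFS mount.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_stream(path, total_bytes, block_size=1024 * 1024):
    """Write `total_bytes` of zeros to `path` in `block_size` chunks, then fsync."""
    block = b"\0" * block_size
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push buffered data out before the clock stops
    return written


def parallel_write_mb_s(directory, streams, bytes_per_stream):
    """Run `streams` concurrent writers and return the aggregate MB/s."""
    paths = [os.path.join(directory, f"stream{i}.dat") for i in range(streams)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        total = sum(pool.map(lambda p: write_stream(p, bytes_per_stream), paths))
    elapsed = time.perf_counter() - start
    for p in paths:
        os.remove(p)  # clean up the test files
    return total / elapsed / 1e6


if __name__ == "__main__":
    # Assumption: replace the temporary directory with a path on the NFS mount.
    with tempfile.TemporaryDirectory() as target:
        for n in (1, 4, 8):
            rate = parallel_write_mb_s(target, n, 8 * 1024 * 1024)
            print(f"{n} stream(s): {rate:.0f} MB/s aggregate")
```

Threads are enough here because file writes release Python's global interpreter lock. Streams from several client machines, as suggested above, remain the more realistic test.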
OK, I can understand this, but what do I say to customers (who sometimes come from another place where they had "better copying performance")? That it works as designed? There is a huge difference between 20MB/s and 100MB/s, and that's my point.
Would you say that to your customers? "Don't worry about copying speeds, that's normal..."
Hi.
You should not only play around with "bs" but maybe also with "iflag=direct" and "oflag=direct" for direct I/O, and with "oflag=sync" and "oflag=dsync" for synchronous tests.
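To keep such variants consistent between runs, the command lines can be generated rather than typed by hand. Below is a small, hypothetical Python helper that only builds and prints GNU dd command lines for the output flags mentioned above; it does not run them. The target path /vhosts/testfile is an assumption based on the mount point shown earlier in this thread.

```python
import shlex


def dd_command(target, bs="1M", count=2000, oflag=None, source="/dev/zero"):
    """Build a GNU dd argument list for a write test; `oflag` is optional."""
    argv = ["dd", f"if={source}", f"of={target}", f"bs={bs}", f"count={count}"]
    if oflag:
        argv.append(f"oflag={oflag}")
    return argv


# Write variants to compare: buffered, direct I/O, and the two synchronous modes.
VARIANTS = {"buffered": None, "direct": "direct", "dsync": "dsync", "sync": "sync"}

for name, flag in VARIANTS.items():
    # /vhosts/testfile is an assumed path on the NFS mount under test.
    print(f"{name:9s}", shlex.join(dd_command("/vhosts/testfile", oflag=flag)))
```

The printed lines can be pasted into a shell or passed to `subprocess.run` as lists. For read tests, `iflag=direct` would be added in the same way.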
I personally don't really like "dd" as a performance testing tool either. Nevertheless, we have UNIX guys here who always use it for their quick performance tests. They use different options, yes, and as their "dd test suite" has been the same for years, they claim the results do say something.
By the way, here is a possibly interesting "pro dd" read:
http://cuddletech.com/blog/pivot/entry.php?id=820
To understand what may happen to your "dd results" when Linux has cached the files, and how to prevent this with newer kernels, you may also find this one interesting:
http://blog.straylightrun.net/2009/12/03/clearing-the-linux-buffer-cache/
Regards
When using iozone, you should always use the "-U" command line switch with "-a" to unmount the filesystem being used after each run. This has the effect of flushing and invalidating the cache so that your test run of the 8k blocks won't influence the 16k run, etc.
Why don't you use the SIO_NTAP utility? Look in the tools section of the NOW site and you will find it there. In any case, if you want to test any storage system's performance, never use dd: it's the worst tool for judging performance since it's single threaded.
You see, the problem is that SIO_NTAP does not work.
Attach a properly configured LUN over FCP or iSCSI and use ATTO Disk Benchmark; it gives you a nice and proper look at REAL performance. That's where you get nice values of 300MB/sec and more from a single LUN : )
CPU utilisation is a fair indication, but you need to be looking in the right place: with 8.x, sysstat -m 1 doesn't work anymore.
Domain utilisation in a statit is a much more reliable thing to check.
sysstat -m 1
ANY AVG CPU0 CPU1 CPU2 CPU3
5% 2% 2% 1% 2% 3%
4% 2% 2% 1% 2% 3%
5% 2% 2% 1% 2% 2%
As you can see, sysstat -m 1 works perfectly on ONTAP 8.0.1.