
Large difference between Windows OS Avg. Disk sec/Transfer and LUN avg_latency

bsti

Hello all,

I'm in the process of running some stress/performance tests on a new set of 6280 controllers attached to 15K SAS disks.  At the moment I'm testing with NetApp SIO against an 84-disk aggregate.  In my latest batch of tests, I'm hammering it with 48 threads, a 64k block size, a 50 GB file, 50% reads, and 100% random I/O.  I've launched this process against multiple files on the aggregate.
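
For reference, each SIO instance is invoked roughly like this (the binary name, argument order, and path are just how I remember them, so treat it as a sketch rather than exact syntax):

    sio_ntap_win.exe 50 100 64k 50g 300 48 F:\siotest\file1.dat

That is, 50% reads, 100% random, 64k blocks, a 50 GB file, a 300-second run, and 48 threads per file.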

My question is why I'm seeing such a large disparity between the OS-reported Avg. Disk sec/Transfer and the avg_latency reported by the stats command on the controller.

The OS is reporting around 7800 IOPS, 470 MB/s, and a solid 33 ms per transfer (reported as .036 seconds in perfmon).

The stats command on the controller reports 7800 IOPS, 480 MB/s, and only 1.57 ms of avg_latency.
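
For what it's worth, these are roughly the commands behind the two numbers (counter and object names are from memory, so verify them on your versions):

    On the Windows host (5-second samples):
        typeperf "\PhysicalDisk(*)\Avg. Disk sec/Transfer" -si 5

    On the controller (7-Mode stats):
        stats show lun:*:avg_latency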

I like that the IOPS and MB/s match, but I'm unsure why there's such a large disparity in latency.

My first assumption was that PAM and read caching were skewing my results, so I set this test to 100% random, used a 50 GB file (larger than the 48 GB of RAM the system recognizes), and disabled PAM (options flexscale.enable off).  There is NO other traffic on this SAN, and the server is completely idle apart from this testing.
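
In case it helps anyone reproducing this, here is roughly what I ran to turn PAM off and confirm it isn't serving reads (the stats preset name is from memory, so double-check it on your ONTAP release):

    options flexscale.enable off
    options flexscale
    stats show -p flexscale-access

The last command should show the hit counters staying flat while the test runs.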

My second assumption is latency introduced by the switch stack and/or the FC transport (I'm 100% FC; no iSCSI, NAS, or CIFS).  I'm on a brand-new Brocade switch with no other traffic, and I'm using NetApp FC Host Utilities 5.3 (so my FC adapter queue depths should be correct), NetApp MPIO, and SnapDrive.  My zoning is correct, to my knowledge.

Any ideas on things I should check?         


8 REPLIES

bsti

I forgot to mention, I'm testing on a Windows Server 2008 x64 Enterprise SP2 server with 256 GB of RAM and 46 processor cores.

shaunjurr

Hi,

I'm guessing some of the latency difference is simply coming from how and where it is measured.  You would need to know how deterministic the sampling rate is for perfmon on the Windows side.  I don't know what stats you are collecting on the filer either.  Response times from LUN stats on the filer may reflect activity much farther down in the system than what is going out over the wire (fiber).

From what I've gathered from the Windows gurus here, there have been a lot of patches to the SAN stack in 2008 as well.  You might also want to test an up-to-date R2 build.

Are you using ALUA?

bsti

Thanks for the reply! 

I've verified that my igroups all have ALUA enabled.  As for the stats I'm collecting from the filers, I'm gathering LUN latency for now, but will also be collecting volume and aggregate latency.  I'm not running anything else on the aggregate (or the rest of the controller, for that matter), so I'm fairly confident I should see the same numbers all the way up to the aggregate, but I'll double-check to be sure.
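
For the record, this is roughly what I'm looking at on the filer side (exact counter names may differ slightly by ONTAP release):

    igroup show -v
    stats show lun:*:avg_latency
    stats show volume:*:avg_latency

igroup show -v should list ALUA as enabled for each FC igroup, and the two stats counters cover the LUN- and volume-level latency I mentioned.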

I think you may be onto something with the Windows 2008 angle, though.  We have other DB servers that I receive latency complaints about where the filer says latency is fine.  Unfortunately, our main DB servers are in an MSCS failover cluster, and I can't just upgrade or swap out the OS.

However, just because I need to know, I will run these tests against a non-clustered server and an R2 machine and report back soon.

Thanks for the suggestions, and if anyone has any other ideas, they are welcome.

bsti

I get the same results on a Windows Server 2008 R2 server.  The differential is still very large.

On a related note, I noticed the system is not load balancing across all HBAs.  I have a separate thread open to address this: http://communities.netapp.com/thread/14882

When I have the RR load-balancing policy set, the latency shoots way down, so this may be a big part of my issue.

In later tests, it does reduce the latency on the client side, but it does not close the gap in latency between the server and the SAN.  I'm convinced something in the Windows OS is adding the latency, but I've no idea what.
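
For anyone else chasing this: if you're on the native Microsoft DSM, the load-balance policy can be checked and changed with mpclaim (the disk number below is just an example, and the Data ONTAP DSM has its own management tools, so adjust accordingly):

    mpclaim -s -d
    mpclaim -l -d 4 2

The first command lists MPIO disks with their current policies; the second sets policy 2 (Round Robin) on disk 4.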

eric_lackey

I'm seeing similar results, but I'm using iSCSI.  I'm also using Windows Server 2008 R2 SP1. 

I'm running an Exchange JetStress analysis.  Windows is reporting an average of 24 ms reads, but the NetApp is rarely reporting over 10 ms.  The strange thing is that if I test with SQLIO, the latency numbers are almost identical between Windows and the filer.  I've tested with a 20 GB database file and a 350 MB database file, and the variance remains the same.

bsti

I've done a bunch more testing.  If I run lower-throughput tests, the latency numbers are really close.  They became MUCH closer after I resolved the multipathing issue with my server's HBAs.  (It turns out that issue was related to our Brocades not communicating the correct queue-depth value to the MPIO software.)  The higher the throughput I push in my test, the higher the latencies go, and the larger the spread between client and SAN.

This tells me that the latency the SAN reports is not necessarily what the client will experience.  I had always assumed they would be the same.  It makes sense when you think about it, though: the SAN is reporting how long it takes to turn data around, but that doesn't necessarily mean the client can take it in that fast.
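
A back-of-the-envelope check with Little's Law (outstanding I/Os = IOPS x latency) seems consistent with that, using the numbers from my earlier runs:

    Host view:   7800 IOPS x 0.033 s   ~= 257 I/Os in flight (queued in the OS, HBA, and fabric)
    Filer view:  7800 IOPS x 0.00157 s ~=  12 I/Os actually in service inside the controller

With 48 threads per SIO process across several files, a couple hundred outstanding I/Os is about what I'd expect, so most of the 33 ms looks like queuing above the filer rather than service time inside it.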

In my case, I was using one server (a large one, admittedly) to try to stress a 6280 with 240 spindles of 15K RPM SAS disk.  That wasn't going to happen.

My biggest issue in testing now is that I can't seem to defeat the read cache on the 6280s.  They always report 80-90% hit ratios, even at the highest load.  This means either the read cache really is that good, or NetApp SIO is not generating the right kind of workload to stress it.  I've tried 100% sequential and 100% random, and neither seems to miss the cache at all.

Anybody have any advice on that?

shaunjurr (Accepted Solution)

Hi,

Well, I'm glad that most of the news here is positive.  It seems that you have a "boatload" of resources and need new ways to beat up on it.  There are a number of different Windows benchmarks, but you probably need to find some way to run them in parallel with larger data sets to actually beat up on the cache enough.  The "big boys" use SPECsfs benchmarks for NFS/CIFS, and I guess the top SAN benchmark is from the SPC.  You can view results from a 3270 benchmark run (these go for long runs...) here: http://www.storageperformance.org/benchmark_results_files/SPC-1E/NetApp/AE00004_NetApp_FAS3270A/ae00004_NetApp_FAS3270A_SPC1E_executive-summary.pdf
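
Even something as crude as a batch file kicking off several SIO instances against separate large files, ideally from more than one host, would get you closer (file names, sizes, and the SIO argument order here are purely illustrative):

    start /b sio_ntap_win.exe 50 100 64k 200g 600 48 F:\siotest\file1.dat
    start /b sio_ntap_win.exe 50 100 64k 200g 600 48 F:\siotest\file2.dat
    start /b sio_ntap_win.exe 50 100 64k 200g 600 48 G:\siotest\file3.dat
    start /b sio_ntap_win.exe 50 100 64k 200g 600 48 G:\siotest\file4.dat

The point is simply to push the total working set well past the controller's cache so the hit ratio has to drop.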

As far as obtaining a copy, I think it is still members-only, and then $2,500 to get the software.  The main point here (even if the SPC benchmark was probably just done to stick it to EMC as a mid-range player) is that you might be able to get some idea of how to extrapolate those results onto a 6280 as far as I/O expectations go.

I'm not sure what workloads you are trying to benchmark for, or if this is largely just academic/hacker interest, but you seem to be safely in the range of "big iron" in capacity, and as long as you have enough disk capacity, you probably have a good deal of expansion room for more PAM cards if things ever get tight.

All that aside, whatever your planned implementation is, it will probably be things like the wrong LUN types, unexpected growth, poorly designed applications (or SQL queries), client-side bugs, ONTAP bugs, and stone-age backup methods that cause you more problems than raw I/O response times.  Benchmarking the implementation itself, the SnapManager software, SnapVault or SnapMirror backups, Operations Manager, and "data motion" in ONTAP 8.x, if you're not already proficient in these areas, is probably equally important in getting a handle on a successful implementation.

Sorry about rambling on, but other than parallelizing your currently available benchmark software or spending a load of cash on SPC software, it might just be an idea to move on to other aspects of storage administration that can be equally challenging.

Good luck.

🙂

bsti

Thanks for the reply!  It's always good to hear about other people's experiences, especially when they've done this sort of thing more times than I have.  Thanks for the link to the benchmark report, too.  It will be interesting to parse through it to learn how other people do this.

I'm in agreement with your assessment.  My job is basically to get the best understanding I can of where we will be in terms of utilization, and the utilization ceiling, on this new hardware before we migrate over to it.  As I suspected, it's not a straightforward question to answer.  I ended up settling on a number of parallel SIO tests from multiple hosts, each pushing multiple processes, to come up with a best-guess utilization figure at current load and at maximum load.  It's not going to be 100% accurate, but I'm working with what I have.

At this point, I have some numbers and am satisfied with my findings, and along the way I learned some of the "gotchas" of trying to run this sort of testing.

Now I'm off to application testing, which should prove just as fun...

Thanks again!
