Data Backup and Recovery

Slow FCP transfer speeds


We are seeing slow FCP transfer speeds (65 MBps) over 4 Gbps fiber lines. Watching the server and the filer during the transfer, neither one is maxing out its resources. We have tried different setups to eliminate possible bottlenecks: running through a switch, direct connect between filer port 0b and a blade, and direct connect between filer port 0b and a server. In all cases we still see slow fiber speeds. We have updated HBA drivers, set speed settings to 4 Gbps, adjusted queue depth on the Windows server, etc., to no avail.

Have an open ticket with NetApp Support ( # 2000136708) but that seems to be going nowhere.

Anyone else seen the same or similar results regarding transfer speeds?



Yes, I have. There are limitations as to how much you can push through a single thread. For example, a mere copy of a file is not a very good test because it is a single-threaded process. We have a tool called SIO (Simulated IO) which allows you to craft the exact traffic pattern you want.

For a good test, I recommend you increase the number of threads that your host will 'inflict' upon the NetApp controller, then go from there. Read through the documentation for SIO and download it:

NetApp Simulated IO Tool:

Let us know how this goes.
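If you want a feel for the single-thread vs. multi-thread effect before downloading SIO, here is a rough illustrative sketch (this is not SIO itself; it exercises a local filesystem rather than the filer, and the thread count, block size, and file names are all arbitrary assumptions):

```python
import os
import shutil
import tempfile
import threading
import time

def io_worker(path, block_size, blocks, results, idx):
    """Write `blocks` blocks of `block_size` bytes, fsync, then read them back."""
    buf = os.urandom(block_size)
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        while f.read(block_size):
            pass
    results[idx] = 2 * block_size * blocks  # bytes written plus bytes read

def run_test(threads=4, block_size=64 * 1024, blocks=256):
    """Return aggregate throughput in MB/s across all worker threads."""
    tmpdir = tempfile.mkdtemp()
    try:
        results = [0] * threads
        workers = [
            threading.Thread(
                target=io_worker,
                args=(os.path.join(tmpdir, f"t{i}.dat"), block_size, blocks, results, i),
            )
            for i in range(threads)
        ]
        start = time.perf_counter()
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        elapsed = time.perf_counter() - start
        return sum(results) / elapsed / 1e6
    finally:
        shutil.rmtree(tmpdir)

if __name__ == "__main__":
    for n in (1, 4):
        print(f"{n} thread(s): {run_test(threads=n):.0f} MB/s")
```

SIO lets you control the same variables (threads, block size, read/write mix) against the actual LUN, which is what you want for a real test.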


Excellent response... we often find the same results over any protocol: sio can push much higher throughput with multiple threads. Programmers at some of our customers have changed code to use more or fewer threads and different block sizes based on sio what-if analysis. It's one of my favorite tools now.

Some other things I would check too (not likely the issue here, but it adds to the discussion of what you can look for generally in FCP performance troubleshooting; if you set everything up yourself, you'd know whether these were set). Without having seen the configuration, I would check anything that could be causing slower performance:

1) Priority settings... is FlexShare set up?

2) is igroup throttling enabled?

3) Queue depth setting on the host (with one copy process it won't be an issue, but it could be when running multiple threads; we sometimes have to increase the number of queues from the HBA utility at customer sites when we can't push performance any higher and see no bottleneck on the host or FAS controller, but the host runs out of queues)

4) Switch QoS, or a vendor with a blocking architecture on an oversubscribed port (you direct-connected as well, so this isn't the case here for the single-threaded operation, but it potentially could be with more threads)


Yes, all good responses, however the problem still remains. I have to deal with large SQL DB queries, data transfers, etc., so large single-threaded operations are a fact of life. We have tried multi-threaded operations and we do get a boost (from 65 MBps to around 90 MBps). Still far short of what you would expect from a 4 Gbps fiber line.

The question still remains: why can't we get faster speeds? I can understand needing to tweak settings to get that last 5-10% of speed, but when you're only at 16% of the rated speed there's a problem. From my point of view, 4 Gbps FCP is a well-established technology. I should be able to just set it up and, right from the start, get 60-70% of the rated speed, then tweak for the rest. Not the case here.
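To put numbers on that 16% figure, a quick sketch (assuming 4 Gb FC's 8b/10b encoding leaves roughly 400 MB/s of usable payload bandwidth; the 80% factor is that encoding overhead):

```python
def percent_of_rated(observed_mb_s, rated_gbit_s=4.0, encoding_efficiency=0.8):
    """Observed throughput as a percentage of usable FC payload bandwidth.

    4 Gb FC uses 8b/10b encoding, so only ~80% of the line rate carries data,
    which works out to roughly 400 MB/s usable on a 4 Gbps link.
    """
    usable_mb_s = rated_gbit_s * 1e9 * encoding_efficiency / 8 / 1e6
    return 100 * observed_mb_s / usable_mb_s

print(f"single thread: {percent_of_rated(65):.0f}% of usable bandwidth")
print(f"multi thread:  {percent_of_rated(90):.0f}% of usable bandwidth")
```

65 MBps against ~400 MB/s usable is about 16%, matching the figure above; the 90 MBps multi-threaded result is still under a quarter of the pipe.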

Question back to the readers: What speeds have you seen and what settings are you using?

One answer that came up in researching this is changing the block size. However, in my opinion, this just slants the fiber line towards one particular data type, in this case SQL. My filer is used for more than just SQL, so I would be penalizing the other traffic.

Anyway more thoughts and ideas?

Thanks for the feedback.


Until you have unacceptable latency on the NetApp controller, the controller is not the place to start changing settings. You need to find a way to get the host to push more throughput to the controller.

So is this a host bottleneck? Or is this a NetApp controller bottleneck?

A quick test to confirm or dismiss the controller is to monitor the FCP traffic latency while you run SIO as follows:

  • lun stats -o -i 1

Your average latency should be reasonable (what 'reasonable' means can vary) and you should not be seeing any partner ops.

So what are you seeing?


It might be a good idea to open a case on this; then you can run perfstat and lun stats and measure performance after making the changes they recommend. Most changes will come from the best practice guides for SQL (there are tech reports for SQL). Recommendations in the reports include changing the minra and no_atime_update volume settings, increasing HBA queue depth, and SQL parameters (affinity mask, memory, worker threads, etc.), but follow the recommendations of support. GSC has helped quite a bit with performance cases, including other SQL best practices on the host side (location of tempdb, system databases, etc.)


Yes I have opened a support case and sent in the results of a perfstat.

I also ran a lun stats -o -i 1 <lun path> for both of the luns (lun being read from and the lun being written to).

Results are in the attached Excel file. Would be interested in the latency times and queue depth readings.


Looking at the lun stats I can see both the read & write ops are low. You have not said what disks you are using.

Each disk in the aggregate can do about:

60 ops - SATA

120 ops - 300 GB 10k FC

160 ops - 300 GB 15k FC


The number of ops scales with the number of disks used by WAFL (parity and DP disks not included).
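As a quick sketch of that rule of thumb (the per-disk figures are the rough numbers from this post, not vendor specs, and the helper name is my own):

```python
# Rough per-spindle random IOPS by disk type (rules of thumb from this thread).
DISK_IOPS = {"SATA": 60, "10k FC": 120, "15k FC": 160}

def aggregate_iops(data_disks, disk_type):
    """Estimate aggregate IOPS from data-disk count; parity/DP disks excluded."""
    return data_disks * DISK_IOPS[disk_type]

# e.g. 44 data spindles of 300 GB 10k FC
print(aggregate_iops(44, "10k FC"))
```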

The lun stats also show the queue length rising; above 2 is bad in my book.

During the test, is CPU high? What about disk utilization, above 70%? (sysstat -m)

I have also found performance starts to drop off when the aggregate is about 80% full, and above 90%... oh dear.


The disks are 300 GB 10k FC.

We're using 13 disks per RAID-DP group and there are 52 disks in the aggregate. The aggregate is 79% full.

Summary of a data transfer using sysstat -b -s 1 shows:

Summary Statistics (61 samples, 1.0 secs/sample)
CPU   FCP  iSCSI  Partner  Total    FCP kB/s     iSCSI kB/s   Partner kB/s     Disk kB/s     CP   CP  Disk
                                    in    out     in    out     in   out      read   write  time  ty  util
20%   456     41       0     977  12029  16945     24     0      0    0      25510      0     0%   *   12%
41%  1031    693       4    1846  32412  29936   2266  6079     24    0      41635  51304    74%   *   35%
80%  1527   6668      30    7768  53596  44085  69241 60573    186    8     101453 124016   100%   *   95%

Sorry, the cut and paste didn't come out well. But as you can see, disk util averages 35% while the CPU is 41%.

Interesting comment about the aggregate fill level being a factor. First I've heard about that. Do you have a KB or some other reference you can send me for further reading?


Your system is spiking high but the average times are not too bad. 61 seconds is not a very big sample, however. I have just read up on Storport in W2K3 and now know more than is healthy. I just had a look at one of my SQL boxes and Storport was installed by the NetApp host kit.

I am thinking

52 disks / 13-disk RAID groups = 4 RAID groups

lose two disks in each group for parity (RAID-DP)

11 x 4 = 44 usable disks

180 IOPS per disk

7,920 IOPS available in the aggregate

4 KB per IOP (~31 MB per second)

These numbers can be much higher (200-400 IOPS per disk) but 180 is a good starting point. The cache also improves performance. I would say the throughput you are reporting is not too bad for your system.
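The back-of-envelope above can be written out as a small sketch (the defaults mirror this aggregate; all of the figures are assumptions carried over from the thread, not measurements):

```python
def estimate_throughput(total_disks=52, raid_group_size=13, parity_per_group=2,
                        iops_per_disk=180, io_size_kb=4):
    """Back-of-envelope random-IO throughput estimate for a RAID-DP aggregate."""
    groups = total_disks // raid_group_size                     # 4 RAID groups
    data_disks = groups * (raid_group_size - parity_per_group)  # 44 data spindles
    total_iops = data_disks * iops_per_disk                     # 7,920 IOPS
    mb_per_s = total_iops * io_size_kb / 1024                   # ~31 MB/s at 4 KB IOs
    return data_disks, total_iops, mb_per_s

print(estimate_throughput())
```

Note the result is sensitive to the assumed IO size: at 64 KB per IOP the same spindle count gives roughly 16 times the MB/s.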

The aggregate capacity vs. performance effect is from what we have seen here on our filers. WAFL writes to available empty blocks; the fuller the aggregate, the less chance of striping a full RAID group per op. Have a look at the statit output, under RAID Statistics (per second), at the blocks-per-stripe figure to see how you are doing.


Running SQL on a Windows 2003 host, when you have slow IO with a single thread to one LUN, you may want to look at the Storport LUN queue in addition to the HBA queue depth. Which HBA is it?


The attach kit doesn't set the Storport LUN queues.

It works like this:

Qlogic - the storport LUN queue depth is set to 32 by default, although you can change it in the registry. See

Emulex - In HBAnyware, if you select LUN queues, the value you set is the LUN queue length. If you do not, the LUN queue is the set value divided by 4.

This scenario is almost identical to the one that caused kb26454 to exist. The application is SQL, all your IO is going to just one LUN, and you are seeing poor performance with no apparent disk bottleneck. In that case, raising the Storport LUN queue solved the issue: because the queue was too small for the load, IO was queuing between Storport and the miniport driver, which caused throttling.


kb26454 sounds like a good idea to me. I tried raising the HBA queue from 32 to 128 when I was bench-testing my SQL box before going live. I created load using MS SQLIO.exe, but throughput stayed the same and the filer lun stats showed the queue length as 8+, which I took to be a bad thing, as we have several other servers connected to the SAN.

Would be keen to know if kb26454 works. I have emulex in my SQL box....


I can say firsthand that the kb works.

For Emulex, if you don't select LUN queues, then the LUN queue is 1/4 the default HBA queue depth of 32, i.e. 8. That's why you saw 8. Which filer are you using? How many hosts? With multiple hosts, add up the queues and make sure they don't exceed the queue depth of the port on the filer. If most of your IO goes over just a LUN or two, then click it over to LUN queues and raise it up.


I tried adjusting the queue depth per kb26454 with no results. I don't know if I edited the registry correctly, especially Step 4 (I didn't understand what they meant).

Use the following procedure to change the queue depth parameter:

  1. Click Start, select Run, and open the REGEDIT/REGEDT32 program.

  2. Select HKEY_LOCAL_MACHINE and follow the tree structure down to the QLogic driver as follows: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ql2300\Parameters\Device

  3. Double-click DriverParameter to edit the string.

    Example: DriverParameter REG_SZ qd=0x20

  4. Add "qd=" to the "Value data:" field. If the string "qd=" does not already exist, append it to the end of the string, separated by a semicolon.

    Example: ;qd=value

  5. Enter a value up to 254 (0xFE). The default value is 32 (0x20). The value must be given in hexadecimal format. Set this to match the queue depth value on the HBA.

The queue depth parameter can also be changed via the SANsurfer utility

My reg entry looked like this:

DriverParameterREG_SZqd=FE; qd=value (FE=254)

Also tried:

DriverParameter REG_SZ qd=FE

Server didn't crash but no increase in speed either. Also tried a queue depth of 128 with the same results.

The other item is the note about changing the queue depth via SANsurfer. I have installed SANsurfer but can't find this option.

All tests are being done over a direct connect between the 3070 filer, port 0b and a Windows server. Both HBAs are rated for 4Gbps.


Is your motherboard bus also rated for 4Gbps? If so, is the HBA in the correct slot to achieve that?


We will have to check on the MB rating and slot placement. That said, we are seeing the same FCP speeds across multiple platforms (HP Integrity server, blade servers, and the current standalone server). The only constant in all of this has been the NetApp filer. So either the manufacturers of all these servers have messed up their designs, or there's a problem with the NetApp filer.

All of this is getting off target though.

This is supposed to be a 4 Gbps pipe. I'm getting 65 MBps flow. Why, considering 4 Gbps FCP is a mature technology, did I not get at least 60-70% of the rated speed from the initial setup? I would expect to tweak settings only for that last 5-10% speed increase, not just to get beyond 16% of the rated speed.

I HAVE to be able to do large, single-threaded data transfers. Suggesting I test with multi-threaded transfers doesn't solve the slow transfer of the data I actually need to contend with.

All of the above has been great input and suggestions and I've been trying them out, to no avail.

One thing no one has stated is that they have achieved 4 Gbps FCP data transfer speeds. In fact no one has stated the speeds they're getting. Is anyone getting 1, 2, 3, 3.5 Gbps FCP data transfer speeds? Would love to hear what speeds you're getting.

Interesting item to note is we've done data transfer tests with iSCSI and gotten 900 Mbps data transfer rates (yes, single threaded, large databases). This is on a 1 Gbps iSCSI pipe. And no I can't just switch to iSCSI.

I know I'm sounding frustrated, but there's a lot of pressure about this transfer speed. The next step may be to bring in another vendor's equipment and test their FC products.

Thanks for the help and keep up the suggestions.


Hi Keith,

The HBAs are configured out of the box for servers that have a lot of LUNs. Your situation is different in that you have one LUN that you want to move a lot of data to/from, so you need to configure the HBA parameters for that scenario. Your iSCSI numbers, at 90% of the pipe, would tend to confirm that you need to do some optimization on your FC connection.



It should be qd=0xFE for the QLogic to set it to 254. One thing to be careful of: don't exceed the HBA queue depth. If you've set that to 128, then don't set the LUN queues higher than that. This setting improves throughput when the majority of the IO goes to one or two LUNs; if you have IO going to many LUNs, it probably won't help much.

Here's an example (assumes qlogic):

The port on my filer has a queue depth of 256. I am attaching two hosts, so I set the HBA queue depth on each host to 128. On host 1, I have 2 LUNs with heavy IO; on host 2, I have 8 LUNs. On host 1, since the HBA queue depth is 128 and I only have two LUNs with a LUN queue depth of 32 (32 x 2 = 64), I will never utilize the full queue depth of the HBA, so I would raise the Storport LUN queue depth to 64. For host 2, 128/8 = 16 but the default is 32, so an IO spike on a few LUNs could start throttling between the miniport driver and Storport; I would consider lowering the Storport LUN queue depth to 16.
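That queue budgeting can be sketched as a quick check (a hypothetical helper of my own; the numbers mirror the two-host example above):

```python
def lun_queue_check(port_queue_depth, hosts):
    """hosts: list of (hba_queue_depth, lun_count, lun_queue_depth) tuples.

    Flags hosts whose combined LUN queues can't fill (or can overflow) their
    HBA queue, and checks the summed HBA queues against the filer port.
    """
    assert sum(h[0] for h in hosts) <= port_queue_depth, "port queue oversubscribed"
    advice = []
    for i, (hba_q, luns, lun_q) in enumerate(hosts, 1):
        total_lun_q = luns * lun_q
        if total_lun_q < hba_q:
            advice.append(f"host {i}: raise LUN queue toward {hba_q // luns}")
        elif total_lun_q > hba_q:
            advice.append(f"host {i}: lower LUN queue toward {hba_q // luns}")
        else:
            advice.append(f"host {i}: balanced")
    return advice

# Host 1: 2 LUNs at the default LUN queue of 32 (2*32 = 64 < 128) -> raise to 64.
# Host 2: 8 LUNs at 32 (8*32 = 256 > 128) -> lower toward 16.
for line in lun_queue_check(256, [(128, 2, 32), (128, 8, 32)]):
    print(line)
```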

It's interesting that on the Emulex cards, if you choose an HBA queue instead of LUN queues, the LUN queue is 1/4 the HBA queue. This covers a lot of the cases where you have many LUNs, and only leaves you with those where you have a few. If I had 8 LUNs and was using Emulex, I'd just increase the HBA queue to 128 in our scenario. If I had only two LUNs, I would click LUN queue and set it to 64.

I hope this helps explain it a little better.



The max port queue depth on a single-controller 3070 is 1720. So I'd set the execution throttle to 256 and then qd=0xFE.



Here is some benchmark data.

The servers are HP DL585 G1 8-way beasts with 32 GB RAM and dual 4 Gb HBAs. They connect to a pair of 4100 Silkworms and then a pair of FAS3070s. The test shelves are 300 GB 15k FC disks. The disks are either in one large aggregate on one filer, or split over the two filers with and without software ownership.

Hope this helps.