Being a newbie on NetApp, there might be an obvious answer to the following question, so bear with me. We have most of our stuff running on VMware and NFS, but we still have two SQL clusters running iSCSI (they are planned for virtualization later this year, but that's another story).
I'm migrating a SQL Server 2008 R2 cluster running on Windows 2008 R1 (latest patch level) from EqualLogic iSCSI to NetApp iSCSI. The migration itself went without any problems (failing the current drives, repairing them by pointing to a NetApp iSCSI LUN, bringing them online and copying the data: slow and safe).
After the migration was completed, I wanted to make sure that everything was working (including multipathing), so I copied a large file to a mounted LUN. No problem: each NIC was loaded approx. 20% as expected. Then I tried reading, where I should have seen the same NIC load, but it was between 1-2%. I checked against the EqualLogic box, and it read the file with the expected approx. 20%.
Since I was short on time before I had to initiate a fallback, I googled a bit and came across an article about fragmentation within WAFL. I did a reallocate measure, which came out with a 3, which isn't that bad. I forced a reallocate on one of the LUNs anyway, but it didn't improve the read performance. I also noticed high latency on iSCSI in NetApp OnCommand System Manager 2.0, in the 200-300 ms range, but I don't have any more detail than that.
It's a production system, so I can't try everything, and the next maintenance day is Feb 17th. Oh, and there are no snapshots on the volume. Dedup is, however, enabled (large amounts of unchanged data in our SQL), and it's all SAS drives.
Does anyone have input on how to approach this read performance problem, assuming it can be brought up to a competitive level?
Funny footnote: we replaced the EqualLogic with the NetApp because of the high latency we saw, ranging from 20 ms to 2000 ms (mostly in the 20-200 ms range), and until now I've only seen latency below 10 ms on our NetApp.
Hopefully we can get to the bottom of this and you can get to experience the low latency you enjoy with your other protocol. I have a few thoughts to start with. One thing you looked at was the network, so perhaps we can look a little deeper into the network layer. For instance, run "ifstat -a" and see if any of the network interfaces show errors. We can check for a jumbo frame mismatch by looking at the MTU size with "ifconfig -a" and doing ping tests with large packet sizes. Are you using an ifgrp/vif? Either "ifgrp status" or "vif status" will show you the current configuration (look in particular for "broken" indications). Could it be that the iSCSI traffic is going out a different port than intended? The "iscsi interface show" command will show you which interfaces are enabled. Also check compatibility, i.e. ONTAP version, DSM version, etc. Have you looked up the combination you are using against the interoperability matrix (http://support.netapp.com/matrix)? Finally, have you experienced the latency since your initial configuration?
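The large-packet ping test for a jumbo frame mismatch can be sketched roughly as follows. This is only a minimal Python sketch, assuming a Windows initiator (where "ping -f" sets the don't-fragment bit and "-l" sets the payload size) and plain IPv4 without IP options. The target address 192.0.2.10 and the function names are hypothetical placeholders, not anything from this thread.

```python
from __future__ import annotations

IPV4_HEADER_BYTES = 20   # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8    # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo")
    return payload


def windows_df_ping(target: str, mtu: int, count: int = 4) -> list[str]:
    """Build a Windows ping command that sends full-MTU packets with DF set."""
    return ["ping", "-f", "-l", str(max_ping_payload(mtu)),
            "-n", str(count), target]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address; substitute the filer's iSCSI IP.
    for mtu in (1500, 9000):
        print(mtu, " ".join(windows_df_ping("192.0.2.10", mtu)))
```

If the command built for an MTU of 9000 fails while the one for 1500 succeeds, that would suggest some hop between the host and the filer is not passing jumbo frames.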
Looking forward to getting to the bottom of this with you,
After running some SQLIO and NetApp SIO tests from the servers having the problem and from other servers that didn't seem to have the problem, we could see that above a certain packet size in SIO, throughput dropped to almost nothing (1000-4000 kB/s seen with sysstat -x 1).
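To relate those sysstat figures to the NIC load percentages mentioned earlier in the thread, a rough conversion can help. This is a sketch under two assumptions that the thread does not confirm: a 1 Gbit/s link per NIC, and 1 kB meaning 1024 bytes. The function name is made up for illustration.

```python
def link_utilization_percent(kb_per_s: float,
                             link_bits_per_s: float = 1e9,
                             bytes_per_kb: int = 1024) -> float:
    """Convert a kB/s throughput figure into percent of one link's capacity.

    Defaults are assumptions: a 1 Gbit/s link and 1 kB = 1024 bytes.
    """
    bits_per_s = kb_per_s * bytes_per_kb * 8
    return 100.0 * bits_per_s / link_bits_per_s


if __name__ == "__main__":
    for rate in (1000, 4000):
        pct = link_utilization_percent(rate)
        print(f"{rate} kB/s is about {pct:.1f}% of the assumed link")
```

Under those assumptions, the 1000-4000 kB/s range corresponds to only a few percent of a single link, which is in the same ballpark as the 1-2% NIC load seen during the read test.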
After checking the switch configurations and the NetApp configuration, we agreed that I would try a driver upgrade first, and the support engineer left to support another customer. Updating the driver didn't fix anything. After that I was sure that it was some network-related setting on the server, and I started the painstaking task of trying every single one of them with a reboot between each. After a few hours it turned out to be the Receive Window Auto-Tuning Level, which is set through netsh among other places. The default value was "normal", and when I changed it to "disabled", throughput suddenly went from 1000-4000 kB/s to 110000 kB/s. Disabling and re-enabling each NIC also did the trick if a reboot is out of the question or just takes too long.
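For anyone who wants to check the same setting, here is a minimal Python sketch around the netsh commands involved ("netsh interface tcp show global" and "netsh interface tcp set global autotuninglevel=..."). It assumes an English-language Windows installation, since it matches the output label by its English text, and the set command needs an elevated prompt. The helper names are hypothetical.

```python
from __future__ import annotations

import re
import subprocess
from typing import Optional

# Values accepted by "netsh interface tcp set global autotuninglevel=".
VALID_LEVELS = {"disabled", "highlyrestricted", "restricted",
                "normal", "experimental"}

# Matches the English label printed by "netsh interface tcp show global".
_LEVEL_RE = re.compile(
    r"^\s*Receive Window Auto-Tuning Level\s*:\s*(\S+)",
    re.MULTILINE | re.IGNORECASE,
)


def parse_autotuning_level(netsh_output: str) -> Optional[str]:
    """Extract the auto-tuning level from netsh output, or None if absent."""
    match = _LEVEL_RE.search(netsh_output)
    return match.group(1).lower() if match else None


def set_autotuning_command(level: str) -> list[str]:
    """Build (but do not run) the netsh command that sets the level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown auto-tuning level: {level!r}")
    return ["netsh", "interface", "tcp", "set", "global",
            f"autotuninglevel={level}"]


def current_autotuning_level() -> Optional[str]:
    """Run netsh on the local Windows host and return the current level."""
    result = subprocess.run(
        ["netsh", "interface", "tcp", "show", "global"],
        capture_output=True, text=True, check=True,
    )
    return parse_autotuning_level(result.stdout)
```

Calling current_autotuning_level() before and after the change is one way to confirm the setting actually took effect. The command list from set_autotuning_command("disabled") can be passed to subprocess.run when you are ready to apply it.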