I've followed most of the online guides on how to set up ESXi for iSCSI multipathing, but apparently I'm missing something. Technically, multipathing is working, but I'm unable to utilize more than 50% of both NICs. I have set up a datastore that points to a NetApp LUN. I'm running an sqlio test from within a virtual machine, writing to a hard drive on the NetApp datastore. I've also tried the test with 2 different virtual machines at the same time and still only get 50% total on each 1 Gb connection. The sqlio process itself runs from an internal drive on the ESXi server to make sure that's not causing a bottleneck.
Here is my configuration:
- Both interfaces connected on first controller
- Unique IP address on each interface
- Not using switch trunk or virtual interfaces
- 2 VMkernel ports on 2 separate 1 Gb NIC ports
- Unique IP address on each VMkernel port
- vSwitch set up for IP Hash load balancing
- Storage path selection set to Round Robin (VMware)
- Storage array type set to VMW_SATP_ALUA
- 4 paths found that correspond to the 2 IP addresses assigned to the controller on the NetApp
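For reference, the paths and the active path-selection policy can be confirmed from the ESX console. A rough sketch, assuming ESX/ESXi 4.x syntax (on ESXi 5.x and later the namespace is `esxcli storage nmp` instead); the device ID below is a placeholder, substitute your LUN's actual naa.* ID:

```shell
# List all NMP-managed devices with their SATP and PSP (path selection policy)
esxcli nmp device list

# Show the individual paths for one device
# (naa.xxxxxxxxxxxxxxxx is a placeholder -- use your NetApp LUN's ID)
esxcli nmp path list -d naa.xxxxxxxxxxxxxxxx
```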
HP ProCurve 2848 switch
- Tried with trunking on and off
- All devices are connected directly to this switch.
I'm running sqlio with 64 threads. If I run one instance of this, I get about 116 MB per second. If I run two instances, I get about 60 MB per second each.
My command looks something like this...
sqlio -t64 -b8 -s15 c:\temp\test.dat
I'm running the sqlio command from local storage on the ESXi server; that's drive letter E: on the server I'm testing. The drive letter I'm writing to, C:\, is on the NetApp. The LUN is on a FlexVol, which is on an aggregate with 14 drives.
I've tested iSCSI multipathing from a Windows 2008 box and get closer to 225 MB per second, so I'm fairly certain the NetApp is set up correctly and can perform much better than it is. Both of these servers use the same network configuration (switches, etc.), so I think the bottleneck is somewhere on the ESX server.
OK, I think I've figured this out. I missed the part in the docs about changing the round robin settings in ESXi. By default, the round robin policy switches paths every 1,000 I/Os: path #1 gets 1,000 I/Os, then path #2 gets 1,000 I/Os, and so on. The NetApp can finish handling those 1,000 I/Os before the policy comes back to the same path, so at times a path was just sitting idle waiting for more work. The solution is to modify this setting so that ESX switches paths much sooner. From the documents I read, this number should be around 3, and that gave me the best results in my testing as well.
To set this, you need to get the LUN or device ID from the vSphere client or the CLI. Then log in to the ESX server over SSH and run the following command...
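Something like this, assuming ESX/ESXi 4.x syntax (ESXi 5.x+ moved this under `esxcli storage nmp psp roundrobin deviceconfig set`); the device ID below is a placeholder for your LUN's actual naa.* ID:

```shell
# Switch paths every 3 I/Os instead of the default 1,000
# (naa.xxxxxxxxxxxxxxxx is a placeholder -- use your NetApp LUN's ID)
esxcli nmp roundrobin setconfig --device naa.xxxxxxxxxxxxxxxx --type iops --iops 3
```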
The only thing that stinks is that the setting doesn't persist across reboots, so you have to put that command in your /etc/rc.local file on the ESX server. If anyone knows of a better way, please let me know.
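For the rc.local approach, a loop over the matching devices avoids hard-coding each LUN. This is a sketch: the `naa.60a98` pattern is an assumption (NetApp LUN IDs commonly start with that prefix), so adjust the grep to match your own device IDs:

```shell
# Append to /etc/rc.local -- reapply the round robin IOPS setting at boot.
# The naa.60a98 prefix below is an assumption; match it to your NetApp LUN IDs.
for dev in $(esxcli nmp device list | grep -o 'naa\.60a98[0-9a-f]*' | sort -u); do
    esxcli nmp roundrobin setconfig --device "$dev" --type iops --iops 3
done
```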
With this setting in place, I've hit as high as 200 MB per second in my tests. I'll try it again tonight when nothing else is going on; that should give me a better indication of the maximum speed. But at 200 MB per second, I'm getting pretty close to saturating my 2 Gb connection, so I'm very pleased with that.