Greetings! I am new to iSCSI and I just installed a new FAS2050 configured with dual filers and all 15K SAS drives. It's configured for Active/Active and the disks are split evenly between the two filers. I created a 300 GB volume for VMware on the first filer (fas2050a).
The clients are ESX4 servers, 2 new HP DL385G6 machines and 2 HP DL380G6 machines.
The switches dedicated to iSCSI are 2 ProCurve 2910AL-24G with 10gbit link between them.
Here are the relevant lines from the switch config.
interface 1
   name "FAS2050a_e0a"
   flow-control
   exit
interface 2
   name "FAS2050a_e0b"
   flow-control
   exit
trunk 1-2 Trk3 LACP
spanning-tree Trk3 priority 4
The DL385s are configured with 8 gbit NIC ports. The DL380s only have 2 (will upgrade later).
The filer fas2050a has both NICs connected to one of the switches, configured as virtual interface vif0 for LACP. Within that virtual interface I created a virtual interface for VLAN 100, and the switch ports are trunked for LACP with tagged VLAN 100. All interfaces that will be used for iSCSI are set up for an MTU of 9000. vif0-2 is for our normal "server" VLAN segment and has iSCSI disabled.
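For reference, the filer side of this can be reproduced with something like the following Data ONTAP 7-mode commands (a sketch from memory; the interface names and addresses are the ones from this setup, but verify the exact syntax against your ONTAP release before running anything):

```
# Assumed ONTAP 7-mode syntax -- double-check against your release
vif create lacp vif0 -b ip e0a e0b      # LACP multimode vif, IP-based load balancing
vlan create vif0 100                    # creates tagged VLAN interface vif0-100
ifconfig vif0-100 10.0.100.2 netmask 255.255.255.0 mtusize 9000 partner vif0-100
```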
fas2050a> ifconfig -a
e0a: flags=80908043<BROADCAST,RUNNING,MULTICAST,TCPCKSUM,VLAN> mtu 9000
        ether 02:a0:98:12:b4:f4 (auto-1000t-fd-up) flowcontrol full
        trunked vif0
e0b: flags=80908043<BROADCAST,RUNNING,MULTICAST,TCPCKSUM,VLAN> mtu 9000
        ether 02:a0:98:12:b4:f4 (auto-1000t-fd-up) flowcontrol full
        trunked vif0
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 8160
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (VIA Provider)
vif0: flags=80908043<BROADCAST,RUNNING,MULTICAST,TCPCKSUM,VLAN> mtu 9000
        ether 02:a0:98:12:b4:f4 (Enabled virtual interface)
vif0-2: flags=4948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM,NOWINS> mtu 1500
        inet 10.0.4.60 netmask 0xfffffc00 broadcast 10.0.7.255
        partner vif0-2 (not in use)
        ether 02:a0:98:12:b4:f4 (Enabled virtual interface)
vif0-100: flags=4948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM,NOWINS> mtu 9000
        inet 10.0.100.2 netmask 0xffffff00 broadcast 10.0.100.255
        partner vif0-100 (not in use)
        ether 02:a0:98:12:b4:f4 (Enabled virtual interface)
Since the DL380s only have two NICs, their configuration is simple, so I will stick to that for now.
I have two virtual switches with a single NIC assigned to each: vSwitch0 is for the service console and VM Network, and vSwitch1 is for the iSCSI software HBA. I configured both vSwitch1 and vmnic1 for an MTU of 9000.
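On ESX 4 the jumbo frame settings have to be made from the service console. The steps above would look roughly like this (a sketch; the vSwitch, port group name, and IP are the ones from this setup, and note the vmkernel NIC has to be created with the MTU flag -- you can't change it afterwards without recreating it):

```
# Assumed ESX 4 service console commands (names/addresses from this setup)
esxcfg-vswitch -m 9000 vSwitch1
esxcfg-vmknic -a -i 10.0.100.30 -n 255.255.255.0 -m 9000 iSCSI_VMkernel0
```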
[root@ushat-esx03 ~]# esxcfg-vswitch -l
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         32          8           32                1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network            0        5           vmnic0
  Service Console       0        1           vmnic0

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch1         64          3           64                9000    vmnic1

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  iSCSI_VMkernel0       100      1           vmnic1
vmnic1 has been properly bound to the iSCSI software hba.
[root@ushat-esx03 ~]# esxcli swiscsi nic list -d vmhba33
vmk0
    pNic name: vmnic1
    ipv4 address: 10.0.100.30
    ipv4 net mask: 255.255.255.0
    ipv6 addresses:
    mac address: 00:22:64:c2:2d:9e
    mtu: 9000
    toe: false
    tso: true
    tcp checksum: false
    vlan: true
    link connected: true
    ethernet speed: 1000
    packets received: 283013
    packets sent: 146301
    NIC driver: bnx2
    driver version: 1.6.9
    firmware version: 1.9.6
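For anyone following along, the binding itself is the companion command to the listing above (assumed ESX 4.x syntax):

```
# Bind the vmkernel NIC to the software iSCSI HBA
esxcli swiscsi nic add -n vmk0 -d vmhba33
```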
The test is copying very large files such as ISOs or server images from machine to machine through the local vswitch, or "migrating" virtual machines from the ESX server's fast local storage to the FAS2050 and back.
Here is a sample sysstat output from the filer while copying a 5 GB image file from a file server on the local network to a virtual machine hard drive on the SAN. The numbers are often much lower when migrating a VM from local storage to iSCSI.
It seems the performance tops out around 60-70 MB/s, with rather high CPU usage on the filer. My understanding was that we should see closer to 120 MB/s when using gigabit and jumbo frames. Disabling jumbo frames has hardly any impact on performance.
The performance is a little better when testing on the DL385s, where I have 4 NICs dedicated to 4 separate VMkernels with round robin providing 4 active paths (following TR-3749 as a configuration guide)--about 80-90 MB/s.
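For completeness, the round robin policy from TR-3749 is set per device, roughly like this (a sketch with assumed ESX 4 syntax; the naa ID is a placeholder for your LUN's actual device ID):

```
# Set the path selection policy to round robin (naa ID is a placeholder)
esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR
```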
Am I right to assume we should be seeing quite a bit more throughput from this configuration? I was hoping to see >120 MB/s, since I have 2 gigabit NICs in the FAS2050 filer in an LACP trunk, 4 NICs using round robin on the ESX servers, AND jumbo frames.
First off, you REALLY should open a support case on this to make sure it's tracked and also because they have configuration experts who can typically help better.
That being said, I do have some comments.
1. From a single host, you should not expect more than 1 Gb/s performance even with trunks/multi-mode vifs. It's just a matter of how this technology works. All data from a given host will be sent down the same path by the switch. This is how EtherChannel works: it's port aggregation for wide sharing across multiple hosts, not narrow sharing from a single host.
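To make point 1 concrete, here's a minimal sketch of hash-based link selection in an aggregate. The hash shown (XOR of the last IP octets, modulo the link count) is an assumption for illustration -- real switches and filers use vendor-specific hashes -- but the behavior is the same: a given source/destination pair always maps to the same single link, so one host never spreads across the trunk.

```shell
# Illustrative link-selection hash (assumed; real hashes are vendor-specific)
links=2
pick_link() {   # pick_link <src_last_octet> <dst_last_octet>
  echo $(( ($1 ^ $2) % links ))
}
pick_link 30 2   # host .30 -> filer .2: one link, every time
pick_link 30 2   # repeated flow -> same link; no per-packet spreading
pick_link 31 2   # a different host may land on the other link
```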
2. Are you sure you've configured vlan tagging on the switch? You can't just do it on the storage side and not the switch. I'm not an expert at reading switch configs (hence my suggestion for calling support), but I didn't see any reference to this.
3. Are you sure you've configured jumbo frames from end to end (storage, switch ports, and hosts)? If not, you will have performance issues.
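A quick way to verify jumbo frames end to end is a do-not-fragment ping at the largest payload that fits in a 9000-byte frame (8972 bytes, after the IP and ICMP headers). If any hop in the path is still at MTU 1500, this fails:

```
# From the ESX service console to the filer's iSCSI address
vmkping -d -s 8972 10.0.100.2
```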
Yes, vlan tagging is enabled in the switch for every port/trunk accessing the iSCSI vlan. Sorry I didn't include that line from the switch configuration.
I guess I assumed that with round robin using 4 NICs with 4 separate MAC addresses creating 4 active paths, all apparently carrying I/O at the same time (you can actually see the round robin working by watching the 4 switch ports' activity lights go 1-2-3-4, 1-2-3-4, pretty neat), I would actually get near 4 Gb/s (minus overhead, obviously). Perhaps I assumed wrong.
As for the jumbo frames, yes, it's enabled end to end. ESX vSwitch -> vmnic -> HP switches -> fas vif LACP, fas vif-100 (target virtual interface vlan).
I've enabled flow control all the way through as well.
Normally I wait to call support for anything until I'm sure I have a firm grasp on the problem. I hate calling up and feeling silly when I'm asked a simple question I can not answer because I failed to do my homework. I've read plenty of VMware and NetApp material, now I'm asking the group. Next is opening a ticket.
Hiya guys, I wondered if anyone got anywhere... I have a performance issue also with iSCSI... I'll tell you how I'm getting on...
Basically I have a DL380 G7 ESXi 4.1 host with a 4-port Broadcom iSCSI adapter onboard, and a regular Intel quad-port NIC in a PCI slot...
I have a dedicated vSwitch for iSCSI traffic, and three VMkernel ports, all linked to separate vmhbas (1 swiscsi vmhba and 2 Broadcom vmhbas). These all have individual IQNs, and the filer has all of these in an igroup and mapped a LUN to the group...
The PSP is set to RoundRobin for the RDM LUN, with all three vmhbas having active paths.
On the filer is a dual-port multimode vif (IP hash) going to a trunk group on a ProCurve switch. (Does LACP make that much performance difference?)
I have the RDM mapped to a 2008 64bit VM (the OS disk is on an NFS export)...
I'm using IOMeter for the first time (feel like a noob, but I'm not really used to generating load), but I ran all of the tests, and the CSV output says I'm getting 18-30 MB/s throughput...
is there something better for measuring throughput?
or am I looking for something that isn't going to happen?
You say you have a performance problem, but do you actually have a performance problem (i.e., an application runs slow)? Or did you just run IOMeter and think you have a performance problem?
A benchmark is only valid if it simulates the IO characteristics of your application(s). Without that, and without a baseline, a benchmark is pointless and useless.
People often concentrate on throughput, while for many apps IOPS and latency are much more important. Iometer uses lots of random IO, and for random IO you can't expect high throughput (30MB/s can be pretty decent).
We have VMware hosts running 20 guests at 25 Mbit/s (Mbit, not MBytes!) average on their storage NICs. That's because random IO at small block sizes translates into low throughput but high IOPS. And application performance is excellent!
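The relationship is simple arithmetic: throughput = IOPS x block size, so small random blocks cap MB/s long before the wire does. A quick sketch (illustrative numbers, not measurements from either setup here):

```shell
# throughput (MB/s) = IOPS * block size
# At 4 KB random IO, even a healthy IOPS number is modest throughput:
iops=7680
block_kb=4
echo "$(( iops * block_kb / 1024 )) MB/s"   # 7680 IOPS * 4 KB = 30 MB/s
```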
Well, sort of both. Basically I have a figure that I know a physical server is pushing on its local disk, and I am trying to get the iSCSI links in the new VM environment to that level (essentially 153 MB/s)... and I am trying to disprove the claim that iSCSI won't get to that level...