NetApp DSM MPIO Dynamic Least Queue Depth not working

bsti
7,775 Views

I've a case open and am working with NetApp Tech Support, but I figured I'd post here in case someone has heard of this:

For some reason, when I enable the Dynamic Least Queue Depth LBP on my NetApp LUN, it does not load balance across HBAs.  I can see from my SAN switch monitoring software that traffic ONLY goes across one HBA and only to one target port on the controller.  If I switch to Round Robin, it immediately load balances across both HBAs and all target ports on the controller.

My environment: 

Windows Server 2008 R2 x64 Enterprise

FC Host Utilities 5.2

FC Host Utilities 5.3 (I tested both versions, same results)

NetApp MPIO 3.3.1

NetApp MPIO 3.4 (both versions exhibit the same results)

SnapDrive 6.3

FAS 6280 HA Pair

ONTAP 8.0.1P2

All Fibre Channel (no iSCSI, NFS, etc.)

Two Brocade Switches

I'm kind of stumped.  On one hand, I'd prefer LQD because it's what pretty much everyone (including NetApp) recommends.  I'd go to RR, but the issue I see is that in DSM 3.4 RR does not distinguish between optimized and non-optimized paths.  The manual pointedly states it is NOT a recommended setting for FCP environments.  I'm thinking I may have to resort to RR with Subset (using the non-optimized paths as passive paths).
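
In case it helps anyone following along, here is roughly how I've been checking what Windows thinks the current policy and paths are, using the mpclaim tool that ships with the MPIO feature on 2008 R2.  The disk number below is just an example, and since my LUNs are owned by the Data ONTAP DSM I treat this as read-only (policy changes for the NetApp DSM go through its own management tools):

  :: list all MPIO disks and the load-balance policy each one is currently using
  mpclaim -s -d

  :: show the individual paths and their states for MPIO disk 2 (disk number is an example)
  mpclaim -s -d 2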

Anybody ever see this issue?

10 REPLIES

bsti
7,707 Views

I should clarify:  Failover is working with this LBP.  Load-balancing is not.

lee_harrison
7,708 Views

Hi, just wondering if there was any update on this issue or a fix?

We are experiencing a similar issue in that LQD does not balance the load. As far as I can see, MPIO is all set up and working OK, with individual source IPs and target VIFs traversing two separate logical VLANs. Our platform is based on Hyper-V Server 2008 R2 SP1 (similar to Server Core Enterprise), running over Cisco Nexus 5k and 2k switches, linking to Dell blade compute, Dell M6348 switches, and FAS6280 controllers. The platform is a Hyper-V cluster.

Round Robin does appear to work OK, but we are following the recommendations in TR-3702, which says LQD is best practice!

Also, we are using iSCSI ...

Cheers.

bsti
7,708 Views

No, I never received a fix or solution to this.  I tested with Round Robin (with Subset) and that appears to balance things perfectly, so I stuck with it. 

aham_team
7,706 Views

Enable ALUA.

bsti
7,706 Views

Enabling ALUA is not the solution to the issue for me. I had ALUA enabled and it made no difference in my case. Thanks for the suggestion though.

lee_harrison
7,706 Views

I also tried enabling ALUA, even though it isn't for iSCSI, and it didn't make a difference. Interestingly, I have a smaller LUN which does appear to be load balancing OK? Also, I tried removing the ONTAP DSM and setting the MS DSM to LQD, and that is load balancing fine as well??
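
In case it's useful to anyone, one way to do that with the built-in mpclaim tool (the disk number is just an example; 4 is the Least Queue Depth policy in mpclaim's numbering, and this only applies to disks owned by the Microsoft DSM):

  :: confirm the disk is claimed by MPIO and see its current policy
  mpclaim -s -d

  :: set load-balance policy 4 (Least Queue Depth) on MPIO disk 3 (disk number is an example)
  mpclaim -l -d 3 4

  :: re-check the policy and the per-path states
  mpclaim -s -d 3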

shrinivk
7,707 Views

A couple of thoughts...

What are the results of these two configurations?

1) Two I/O paths through a single dual-ported HBA: with LQD, check whether the I/O goes through both ports.

2) One I/O path through each of two different HBAs: with LQD, check whether the I/O goes through both HBAs.

In both 1) and 2), disable the port the I/O is currently using and see if LQD fails over to the other path.

Then try increasing the load with heavy I/O, or with multiple LUNs all owned by only one of the controllers.
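
One way to generate that kind of load is an I/O tool such as Microsoft's diskspd (just a sketch; the drive letter, file size, and duration are placeholders, and a similar tool like Iometer or sqlio would work too):

  :: 4 threads, 32 outstanding I/Os each, 64K random reads for 2 minutes, host caching disabled
  :: T:\lqdtest.dat is a placeholder test file on the NetApp LUN; -c10G creates it at 10 GB
  diskspd.exe -b64K -d120 -o32 -t4 -r -w0 -Sh -c10G T:\lqdtest.dat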

mekka
7,706 Views

Just to clarify, this is expected behavior. Our load balance policies honor ALUA first unless explicitly overridden, giving the user control over path selection, as is the case with Round Robin with Subset. We recommend LQD because it keeps I/O off of the non-optimized paths, which is desirable for load balancing at the controller as well, and allows the DSM to select the best path based on the queue depth. Using all available paths is not necessarily the optimal way to load balance, depending on the configuration. If it is indeed desired, RRwS is the way to go. The DSM 4.0 documentation has been updated to provide further clarification.
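
For anyone who does want RRwS on a disk owned by the Microsoft DSM, a minimal sketch with mpclaim (the disk number is an example; 3 is the Round Robin with Subset policy, and LUNs owned by the Data ONTAP DSM would normally be changed through the DSM's own management interface instead):

  :: set load-balance policy 3 (Round Robin with Subset) on MPIO disk 2 (disk number is an example)
  mpclaim -l -d 2 3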

bsti
7,706 Views

I've actually revisited this recently.  My more recent tests show that LQD does work, but does not load balance the way I'd expect.  This began for me quite some time ago, so I don't have all of the details any more.  I think part of my issue was some misconfiguration or unexpected behavior somewhere in the Brocade switch, MPIO, SnapDrive, or Windows OS.  Part of it as well was a misunderstanding on my part of how LQD works.

I'm in the process of retesting LQD, and assuming it works properly, I'll start switching servers back over to it.  My initial tests show that it fails over as expected when a path disappears, and it does appear to use both paths, though not equally (which I think is fine).

Thanks for the update.

SLC_TAYLOR
6,481 Views

If your storage is not being heavily taxed, then all paths should be at a queue depth of 0, so it makes no difference which path it uses.  Round Robin might load balance perfectly, but if you have 1% load on 4 paths vs. 4% load on 1 path, the throughput is going to be a wash.

LQD should start spreading I/O across more links when one becomes busy enough to start queuing.  So before assuming it doesn't work, check the queue depth on the LUN that has the bulk of the traffic.
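
A quick way to watch that from the host is the PhysicalDisk queue-length counter in perfmon; for example, from a command prompt (the counter path is the standard Windows one, the sampling interval and count are just examples):

  :: sample the current disk queue length for all physical disks once a second, 60 samples
  typeperf "\PhysicalDisk(*)\Current Disk Queue Length" -si 1 -sc 60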
