VMware Solutions Discussions
Hello all,
I am seeing strange behavior in a customer's virtual environment. We have a vSphere 5.0 cluster which accesses its datastores via FCP. VSC 4.1 is installed and the recommended settings are applied to the cluster. Igroups on the NetApp side have ALUA enabled.
On the vSphere side, the path selection policy is Round Robin and the SATP is ALUA-aware as well (VMW_SATP_ALUA). There are 8 paths from each ESXi server to the NetApp HA pair (4 to each controller, running DOT 8.0.2P3 7-Mode).
With this configuration we are seeing all paths as "Active (I/O)", which indicates that all paths are active and I/O is passing through all of them. My understanding is that with this configuration we should instead have 4 paths as "Active (I/O)" (optimized paths) and 4 paths as "Active" (non-optimized paths).
Does anyone know what could possibly be happening here? What's my mistake?
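For reference, this is how the per-path ALUA group state can be listed from the ESXi shell (the device ID below is just a placeholder); on a correctly working ALUA setup, half the paths should report Group State "active" and the other half "active unoptimized":

Code:
esxcli storage nmp device list -d naa.xxxxxxxxxxxxxxxx
esxcli storage nmp path list -d naa.xxxxxxxxxxxxxxxx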
Kind regards,
Pedro Rocha.
Please follow the KB on this issue:
Hi Doug,
Have done that before, no success. Case opened, let's see what happens. Will post here...
Regards,
Pedro Rocha
Hi Pedro,
I'm having the exact same problem and also have a ticket open but it's been a few weeks and so far no luck. Were you able to get a resolution? My setup is almost identical except my DOT is 8.0.2P4 and ESXi servers are 5.1 (though I had this problem with vSphere 5.0 as well). What kind of SAN switches are you using if I may ask? Thanks!
Sergio
Hi Sergio,
I have not got any answer yet. The case is still open. We are using Brocade 5100 switches for the SAN environment.
I'll post here if I get any luck.
Regards,
Pedro Rocha.
Hello Pedro,
we have the same configuration and the same problem. We use a FAS3240 running ONTAP 8.1.2 7-Mode with FC and Brocade 200E 4Gb switches. The servers are HP blades from a CX7000 enclosure.
On the vSphere side, path selection policy is round-robin and SATP is ALUA.
All 4 paths are "Active (I/O)". When we do a "cf takeover", we lose the communication between host and datastore. When we do a "cf giveback", all 4 paths come back online.
We have had a case open since 29.04.2013. ...we are waiting...
Regards
Werner Komar
Werner,
We are now dealing with VMware. NetApp told us to contact them since the issue does not appear to be related to NetApp.
I'll tell you when we have something.
Regards,
Pedro
Pedro Rocha
+55 61 8203-5800
Sent from my BlackBerry®
Has this been sorted out? I am running into this issue with 5.1 hosts and ALUA-enabled igroups. I can give more details if needed, but I am basically just reaching out before we submit a ticket on this.
Message was edited by: Allan Howard

Here's an nmp device list for one of my datastores:

Code:
~ # esxcli storage nmp device list -d naa.60a98000572d43504f5a636d57537144
naa.60a98000572d43504f5a636d57537144
   Device Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d43504f5a636d57537144)
   Storage Array Type: VMW_SATP_ALUA
   Storage Array Type Device Config: {implicit_support=on;explicit_support=off; explicit_allow=on;alua_followover=on;{TPG_id=2,TPG_state=AO}{TPG_id=2,TPG_state=AO}{TPG_id=2,TPG_state=AO}}
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPathIndex=3: NumIOsPending=0,numBytesPending=0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C0:T1:L61, vmhba2:C0:T0:L61, vmhba3:C0:T0:L61, vmhba3:C0:T1:L61
   Is Local SAS Device: false
   Is Boot USB Device: false
Hello,
Nothing yet. I would recommend checking this with VMware also. From NetApp, what I got was a direction to check with VMware.
Regards,
Pedro.
No luck on my end either. I'm on my 5th NetApp tech, and they referred me to VMware support to resolve the issue. I'm waiting to hear back from them, but they've been awfully quiet.
I'm about to change the PSP to FIXED with a preferred path, as I've been working on this for months and I need to move on with this project. Maybe someone with more time can follow up on this and get it resolved once and for all, since it doesn't sound like it's just me with this problem. I'm gonna dump my setup below along with some of the steps I've taken:
SETUP:
Brocade VA-40FC 8Gbps FC switches (independent dual fabric)
DELL R910/R900 servers with two single port Qlogic QLE2460 HBAs (each HBA to one fabric)
NetApp V3240 controllers with DataONTAP 8.0.2P4, single_image cf mode (1st FC port to fabric 1, 2nd to fabric 2); no V features used (back-end storage array connections)
Single initiator, single target zoning (Each ESXi hba port to one NetApp FC port per controller)
TROUBLESHOOTING (none of the steps below had any effect on pathing):
* Was happening with ESXi 5.0 with same DELL servers and NetApp controllers, and McData DS4700M FC SAN switches
* Built new ESXi 5.1 server on DELL R720
* Tried old firmware HBA QLE 2460 4.00.012 later upgraded to latest 5.09.0000
* Tried old firmware dual-port HBA QLE 2462 4.00.030 later upgraded to latest 5.09.0000
* Changed FC fill-word option on Brocade switch ports from 1 to 3 for ESXi and NetApp ports
* Injected latest Qlogic drivers into ESXi OS (originally using v911.k1.1-26vmw but now 934.5.20.0-1vmw)
* Disabled port features (NPIV, trunking, QoS) on the Brocade for the ESXi ports
* Upgraded to ESXi 5.1 U1
I've submitted a ticket to VMware.
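In case it helps anyone else falling back the same way, switching a device to FIXED with a preferred path can be done from the ESXi shell; a sketch (the device ID and path name below are placeholders, adjust to your own):

Code:
esxcli storage nmp device set -d naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_FIXED
esxcli storage nmp psp fixed deviceconfig set -d naa.xxxxxxxxxxxxxxxx -p vmhba2:C0:T0:L61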
I hate to be the guy who throws a wrench into this, but I've got a site similar to the ones that aren't working, my HQ actually, with HP DL380 G7s and ESXi builds that were upgraded from build 9xxxxx to 1065491. ONTAP is 8.1.2 7-Mode on 3250s. ALUA works just fine here.
Last week I went to my DR site to do a revamp, getting more hosts and its own vCenter, as my hosts there were managed by my vCenter at HQ. ALUA is not working there.
Here are a few of my details on the site that isn't working.
CISCO MDS 9124 (Using MDS 9148 at HQ)
NetApp 3140 HA pair ONTAP 8.1.2 P1 7-mode (yep, difference is the P1 compared to HQ)
HP DL380 G5
ESXi 5.1u1 Build 1065491 (Fresh install, not upgraded like HQ)
Emulex LPe12002 HBA Firmware: 2.00A4 Drivers:8.2.3.1-127vmw (Same as HQ)
I've configured igroups for each of my hosts with ALUA enabled, and yes, I've rebooted them.
I've even run the NetApp VSC "Set Recommended Values" for all of the timeouts and such. That also did not help. Mind you, I haven't run this back at HQ, and it doesn't seem to be an issue there.
I've done similar TS steps as I am sure a lot of you already have: blowing away the igroups and creating them again via the System Manager GUI instead of the command line used originally; creating the igroups without ALUA first, saving the configuration, and then enabling ALUA in both SM and the command line; and creating a new LUN to see what multipathing the host decides on. I cannot enable ALUA on the igroup without all paths going Active (I/O)... Going back to fixed path may be the thing I have to go with too.
I'll hopefully speak to VMware tomorrow.
Hi all,
We are now speaking with VMware support, who directed us back to NetApp
support.
I am going to try to make them speak to each other, since this is really
annoying and several people are getting the same results.
Regards,
Pedro.
Hi Pedro,
this is our ALUA story: NetApp 3240, FCP, ALUA misconfigured.
We have had a case open for over 6 months and NetApp Support doesn't know what's going on. We always have 4 active FCP paths in ESXi (2 "Active (I/O)" and 2 "Active" should be normal). We checked everything (the ESXi config, WWPNs, zoning); nothing helped. After a long time dealing with NetApp Support, we did our own investigation and found out that there is a problem with the "local.single_image.key" and "partner.single_image.key". You can see this in the lun config:
Code:
priv set diag
lun config
output (the output has more information, but these are the important lines):
local.single_image.key = "157459xxxxx"
partner.single_image.key = "15746xxxxx"
local.serialnum = "2000004xxxxx"
partner.serialnum = "5000002xxxxx"
These numbers should match across both HA partners vice versa: each controller's "partner.single_image.key" must equal the other controller's "local.single_image.key". In our case, the "partner.single_image.key" was not the "local.single_image.key" from the other HA controller.
We asked NetApp Support what is going on with these numbers, and they told us it "could be" a problem. Then we asked for an action plan to change the "partner.single_image.key" to correct the problem.
We also asked whether anything else would change if we changed the partner.single_image.key, and NetApp Support said no.
Code:
priv set diag
fcp stop
lun config set partner.single_image.key xxxxxxxxxx (we enter the HA partner's local.single_image.key)
fcp start
priv set
reboot (do automatic takeover)
and... after the reboot... the WWPNs changed automatically on the controller where we changed the partner.single_image key.
We thought about manually changing the WWPNs back to the original settings, but we didn't know if that would open up another problem. We decided instead to change the zoning on both Brocade switches, and: HERE WE GO.
We rebooted our 12 ESXi servers and ALUA did its job. After this, we brought 200 VMs back online.
The problem started with a FAS3240 motherboard replacement.
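For anyone wanting to check for the same mismatch before opening a case: run the commands below on both controllers and verify that each controller's "partner.single_image.key" equals the other controller's "local.single_image.key" (these are diag-level commands, use with care):

Code:
priv set diag
lun config
priv set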
Regards
Werner
We had a similar issue with two FAS3240s in a stretched MetroCluster using FC.
The solution for our environment was to disable ALUA on the igroup on the NetApp side.
The ESX servers then use the default active/active SATP (VMW_SATP_DEFAULT_AA). Path policy is set to Round Robin (VMware).
Maybe this helps.
Regards,
ps-support
Additional info:
you have to reboot the ESX servers after disabling ALUA on the NetApp; then the ESX servers should use VMW_SATP_DEFAULT_AA instead of VMW_SATP_ALUA.
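To confirm which SATP has claimed the devices after the reboot, something like this from the ESXi shell should do:

Code:
esxcli storage nmp device list | grep "Storage Array Type:"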
Regards,
ps-support
Hi,
Disabling ALUA does not seem to be a real solution. AFAIK it is recommended to use ALUA with the round robin policy for this environment. Or not?
We have opened a case with VMware since this appears to be a VMware issue. I'll post here what we find out.
Regards,
Pedro Rocha.
Hi everybody. Same issue for us: a customer with a MetroCluster 3240, two fabrics, ESXi 5.1 hosts (not Update 1) with 2 QLogic HBAs, and all 8 paths in "Active (I/O)".
The same VMware cluster is also zoned to an IBM SVC; the SVC shows all paths as Active (I/O) as well.
From NetApp we expect the traditional ALUA behavior, that is, "Active (I/O)" on half of the paths and the others only "Active".
Is there any response from VMware/NetApp?
What about changing the advanced parameters
Disk.UseDeviceReset from 1 to 0, or
Disk.UseLunReset?
What about the information regarding optimized paths written in this recent VMware KB?
It seems to be something tied to the single device.
Waiting for further responses.
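If anyone wants to experiment with those advanced parameters, they can be read and set from the ESXi shell; a sketch (the values are only the ones suggested above, test outside production first):

Code:
esxcfg-advcfg -g /Disk/UseDeviceReset
esxcfg-advcfg -s 0 /Disk/UseDeviceReset
esxcfg-advcfg -g /Disk/UseLunReset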
I've run the "esxcli storage core device list" command in a working ALUA environment as well as in my non-working environment. Results from both are the same.
Particularly from this KB you mention....
Queue Full Sample Size: 0
Queue Full Threshold: 0
Example:
ALUA working
naa.60a98000572d43504b5a66444d755758
Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d43504b5a66444d755758)
Has Settable Display Name: true
Size: 512078
Device Type: Direct-Access
Multipath Plugin: NMP
Devfs Path: /vmfs/devices/disks/naa.60a98000572d43504b5a66444d755758
Vendor: NETAPP
Model: LUN
Revision: 811a
SCSI Level: 4
Is Pseudo: false
Status: on
Is RDM Capable: true
Is Local: false
Is Removable: false
Is SSD: false
Is Offline: false
Is Perennially Reserved: false
Queue Full Sample Size: 0
Queue Full Threshold: 0
Thin Provisioning Status: yes
Attached Filters: VAAI_FILTER
VAAI Status: supported
Other UIDs: vml.020026000060a98000572d43504b5a66444d7557584c554e202020
Is Local SAS Device: false
Is Boot USB Device: false
ALUA not working
naa.60a98000572d43504f5a6b337a643563
Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d43504f5a6b337a643563)
Has Settable Display Name: true
Size: 266254
Device Type: Direct-Access
Multipath Plugin: NMP
Devfs Path: /vmfs/devices/disks/naa.60a98000572d43504f5a6b337a643563
Vendor: NETAPP
Model: LUN
Revision: 811a
SCSI Level: 4
Is Pseudo: false
Status: on
Is RDM Capable: true
Is Local: false
Is Removable: false
Is SSD: false
Is Offline: false
Is Perennially Reserved: false
Queue Full Sample Size: 0
Queue Full Threshold: 0
Thin Provisioning Status: yes
Attached Filters: VAAI_FILTER
VAAI Status: supported
Other UIDs: vml.020064000060a98000572d43504f5a6b337a6435634c554e202020
Is Local SAS Device: false
Is Boot USB Device: false
Appreciate the time and the KB link.
While it's not fixed yet, I've run this command on my non-working ALUA filers... I think this is our issue, and I'll update as this moves along.
MyFiler> lun config_check -v
Checking for down fcp interfaces
======================================================
The following FCP HBAs appear to be down
0c LINK NOT CONNECTED
0a LINK NOT CONNECTED
Checking initiators with mixed/incompatible settings
======================================================
No Problems Found
Checking igroup ALUA settings
======================================================
(null) No Problems Found
Checking for nodename conflicts
======================================================
No Problems Found
Checking for initiator group and lun map conflicts
======================================================
No Problems Found
Checking for igroup ALUA conflicts
======================================================
No Problems Found
Checking for duplicate WWPNs
======================================================
The following WWPNs are duplicate:
500a0981892accc6
500a0982892accc6
500a0983892accc6
500a0984892accc6
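In case others want to check whether they see the same duplicate-WWPN symptom, the adapter WWNNs/WWPNs can be listed on each 7-Mode controller and compared by hand:

Code:
fcp show adapters
fcp config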
Now we are waiting for a time window on the customer's production infrastructure to try selective zoning.
In other words, the problem seems to be tied to the number of paths to the device being greater than 4.
In another cluster with only 4 paths to the datastores, ALUA works perfectly.
We'll try to remove some paths from the fabric so that a maximum of 4 paths is available, then see the results.
If somebody could test this situation before we do, please write a feedback.
Bye
Sinergy