Re: Active (I/O) paths misconfigured on VMware with ALUA - Page 2

pedro_rocha · ‎2013-04-08

Hello all,

I am seeing strange behavior in a customer's virtual environment. We have a vSphere 5.0 cluster that accesses its datastores via FCP. VSC 4.1 is installed and the recommended settings are applied to the cluster. The igroups on the NetApp side have ALUA enabled.

On the vSphere side, path selection policy is round-robin and SATP is ALUA aware also (VMW_SATP_ALUA). There are 8 paths from each ESXi server to the NetApp HA pair (4 to each controller, running DOT 8.0.2P3 7-mode).

With this configuration we are seeing all paths as "Active (I/O)", which to me indicates that all paths are active and I/O is passing through all of them. My understanding is that with this configuration we should have 4 paths as "Active (I/O)" (optimized paths) and 4 paths as "Active" (non-optimized paths).

Does anyone know what could possibly be happening here? What's my mistake?

Kind regards,

Pedro Rocha.
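
As a rough illustration of the expectation described above, the following sketch checks a list of path states as vSphere displays them. The helper name `alua_looks_healthy` and the sample lists are hypothetical; the only things taken from the post are the two state labels and the 4 + 4 split that was expected.

```python
from collections import Counter

OPTIMIZED = "Active (I/O)"   # paths that round-robin should be using
NON_OPTIMIZED = "Active"     # paths that go through the partner controller


def alua_looks_healthy(path_states):
    """Return True when both optimized and non-optimized paths are present.

    Hypothetical helper: with ALUA working on an HA pair, a LUN is expected
    to show a mix of the two states, not one state on every path.
    """
    counts = Counter(path_states)
    return counts[OPTIMIZED] > 0 and counts[NON_OPTIMIZED] > 0


# What was observed: all 8 paths report the optimized state.
observed = [OPTIMIZED] * 8
# What was expected: 4 optimized paths and 4 non-optimized paths.
expected = [OPTIMIZED] * 4 + [NON_OPTIMIZED] * 4

print(alua_looks_healthy(observed))  # False
print(alua_looks_healthy(expected))  # True
```

The check only looks for a mix of states; it does not verify that the optimized paths actually lead to the controller that owns the LUN.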

Sergio_Santos · ‎2013-05-30

Hi Sinergy,

We've got 4 paths total here and it's not working (they all show Active (I/O)). The four paths break down like this:

Fabric 1

Path 1: ESX vmhba1 to V3240 Controller 1 Port 0d

Path 2: ESX vmhba1 to V3240 Controller 2 Port 0d

Fabric 2

Path 3: ESX vmhba2 to V3240 Controller 1 Port 4a

Path 4: ESX vmhba2 to V3240 Controller 2 Port 4a

I would try to change your pathing scheme anyway, because there does not seem to be a clear pattern to when this happens and when it doesn't. No word from the VMware tech on my end. I don't know if I mentioned it, but all my ESX setups have been fresh, with no upgrades from 4.0 to 5.0 to 5.1. I installed a new vCenter server and reinstalled ESXi each time.

Also, I was trying to pick the brain of the NetApp tech to find out if there's a way to debug the ALUA negotiation between the ESX host and the controller, either from a log or in real time. Alas, it doesn't sound like there is any way to do that.

pedro_rocha · ‎2013-05-31

Hello all,

I am reopening my case with NetApp. VMware is telling me that the TPG parameter is being erroneously passed to the vSphere cluster for some Datastores (and that this is sent from the storage to the vSphere). Here's an example of bad configuration:

~ # esxcli storage nmp device list | grep -A 5 naa.60a9800064656d735a346b436a54426b

naa.60a9800064656d735a346b436a54426b

Device Display Name: NETAPP Fibre Channel Disk (naa.60a9800064656d735a346b436a54426b)

Storage Array Type: VMW_SATP_ALUA

Storage Array Type Device Config: {implicit_support=on;explicit_support=off; explicit_allow=on;alua_followover=on;{TPG_id=2,TPG_state=AO}{TPG_id=2,TPG_state=AO}}

Path Selection Policy: VMW_PSP_RR

Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPathIndex=9: NumIOsPending=0,numBytesPending=0}

Path Selection Policy Device Custom Config:

Best wishes,

Pedro Rocha.
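
The "Storage Array Type Device Config" string above can be checked mechanically. The sketch below pulls out the TPG entries and flags the pattern seen here, where both entries carry the same TPG id and state. The function names are hypothetical, and the health rule is an assumption: that a working HA pair reports two distinct target port groups, one AO (active/optimized) and one ANO (active/non-optimized).

```python
import re

# Device config string copied from the esxcli output above.
SATP_CONFIG = (
    "{implicit_support=on;explicit_support=off; explicit_allow=on;"
    "alua_followover=on;{TPG_id=2,TPG_state=AO}{TPG_id=2,TPG_state=AO}}"
)

TPG_PATTERN = re.compile(r"\{TPG_id=(\d+),TPG_state=(\w+)\}")


def parse_tpgs(config):
    """Extract (TPG id, TPG state) pairs from a SATP device config string."""
    return [(int(tpg_id), state) for tpg_id, state in TPG_PATTERN.findall(config)]


def tpgs_look_suspicious(tpgs):
    """Flag configs whose groups are not distinct or lack a non-optimized group.

    Assumption (not from the thread): a healthy HA pair shows two different
    TPG ids, one in state AO and one in state ANO.
    """
    ids = [tpg_id for tpg_id, _ in tpgs]
    states = {state for _, state in tpgs}
    return len(set(ids)) < len(ids) or "ANO" not in states


tpgs = parse_tpgs(SATP_CONFIG)
print(tpgs)                        # [(2, 'AO'), (2, 'AO')]
print(tpgs_look_suspicious(tpgs))  # True
```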

gatorman1369 · ‎2013-05-31

Pedro,

This was exactly what I was seeing. TPG_id=x on all of my SATDC.

Please run a "lun config_check -v" from your filers. When I did yesterday, I noticed duplicate WWPNs from a filer head swap that was performed a few weeks ago. Also run a "fcp show adapter" on both filers and compare the FC NODENAMEs and FC PORTNAMEs. Guess what, mine were identical on both filers.

https://kb.netapp.com/support/index?page=content&id=1013497&actp=search&viewlocale=en_US&searchid=null#Storage_controller_head_upgrade_best_practice_p...

Resolving duplicate fibre channel WWPNs between nodes in an HA-pair

"WWPNs can be duplicated across ports in an HA-pair under certain circumstances. As a side effect, to duplicate WWPNs, the ALUA configuration will also be duplicated causing all MPIO path states to be either optimized or non-optimized. This issue typically occurs when existing HA-pair systems configured with FCP are split and rejoined with the new nodes. The following procedure must be followed to resolve the duplicate configuration and restore the ALUA configuration. This procedure will cause WWPNs on one of the two nodes to change. The change to WWPNs will require fibre channel zoning configuration updates after the new WWPNs have been generated."

Hope this helps you.

Output follows from my filers to show you an example.

MyFiler> lun config_check -v

Checking for down fcp interfaces

======================================================

The following FCP HBAs appear to be down

0c LINK NOT CONNECTED

0a LINK NOT CONNECTED

Checking initiators with mixed/incompatible settings

======================================================

No Problems Found

Checking igroup ALUA settings

======================================================

(null) No Problems Found

Checking for nodename conflicts

======================================================

No Problems Found

Checking for initiator group and lun map conflicts

======================================================

No Problems Found

Checking for igroup ALUA conflicts

======================================================

No Problems Found

Checking for duplicate WWPNs

======================================================

The following WWPNs are duplicate:

500a0981892accc6

500a0982892accc6

500a0983892accc6

500a0984892accc6

MyFiler> Fri May 31 10:09:20 EDT [MyFiler:scsitarget.conflicting.wwpns:error]: Local node and partner node have conflicting WWPNs and ALUA states which will degrade host MPIO performance.
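
For anyone collecting this output from several filers, here is a minimal sketch that groups the "lun config_check -v" text by its "Checking ..." headers and keeps only the checks that did not report "No Problems Found". The function names are hypothetical, and the sample is an abbreviated copy of the output above.

```python
def parse_config_check(output):
    """Group the lines of `lun config_check -v` output by "Checking ..." header."""
    sections = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and the ===== rulers
        if line.startswith("Checking "):
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def failed_checks(sections):
    """Return the sections whose findings lack a "No Problems Found" line."""
    return {
        name: lines
        for name, lines in sections.items()
        if not any("No Problems Found" in entry for entry in lines)
    }


SAMPLE = """\
Checking for nodename conflicts
======================================================
No Problems Found
Checking for duplicate WWPNs
======================================================
The following WWPNs are duplicate:
500a0981892accc6
500a0982892accc6
500a0983892accc6
500a0984892accc6
"""

for name, lines in failed_checks(parse_config_check(SAMPLE)).items():
    print(name, "->", lines)
```

Note that a clean result here does not rule out the problem: as later posts in this thread show, the check can pass while unconnected ports still share WWPNs with the partner.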

MyFiler1

MyFiler1> fcp show adapter

Slot: 0c

Description: Fibre Channel Target Adapter 0c (QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:89:2a:cc:c6 (500a0980892accc6)

FC Portname: 50:0a:09:83:89:2a:cc:c6 (500a0983892accc6)

Standby: No

Slot: 0d

Description: Fibre Channel Target Adapter 0d (QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:89:2a:cc:c6 (500a0980892accc6)

FC Portname: 50:0a:09:81:89:2a:cc:c6 (500a0981892accc6)

Standby: No

Slot: 0a

Description: Fibre Channel Target Adapter 0a (QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:89:2a:cc:c6 (500a0980892accc6)

FC Portname: 50:0a:09:84:89:2a:cc:c6 (500a0984892accc6)

Standby: No

Slot: 0b

Description: Fibre Channel Target Adapter 0b (QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:89:2a:cc:c6 (500a0980892accc6)

FC Portname: 50:0a:09:82:89:2a:cc:c6 (500a0982892accc6)

Standby: No

MyFiler2

MyFiler2> fcp show adapter

Slot: 0c

Description: Fibre Channel Target Adapter 0c (QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:89:2a:cc:c6 (500a0980892accc6)

FC Portname: 50:0a:09:81:89:2a:cc:c6 (500a0981892accc6)

Standby: No

Slot: 0d

Description: Fibre Channel Target Adapter 0d (QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:89:2a:cc:c6 (500a0980892accc6)

FC Portname: 50:0a:09:82:89:2a:cc:c6 (500a0982892accc6)

Standby: No

Slot: 0a

Description: Fibre Channel Target Adapter 0a (QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:89:2a:cc:c6 (500a0980892accc6)

FC Portname: 50:0a:09:83:89:2a:cc:c6 (500a0983892accc6)

Standby: No

Slot: 0b

Description: Fibre Channel Target Adapter 0b (QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:89:2a:cc:c6 (500a0980892accc6)

FC Portname: 50:0a:09:84:89:2a:cc:c6 (500a0984892accc6)

Standby: No

pedro_rocha · ‎2013-05-31

Hi!

But correcting that solved your issue? Is everything fine now?

Regards,

Pedro Rocha.

gatorman1369 · ‎2013-05-31

I have not performed the operations just yet. I'd like all connected hosts to be down first, and I need to schedule that.

I've identified a major issue that points not just to ALUA, but to all multipathing not functioning correctly. Since determining this, I've gone back to the RHEL and Exchange servers connected to these filers. They are also affected, but it wasn't apparent at first glance. The description of duplicate WWPNs and what ALUA is doing in the NetApp KB certainly fits my issue. Let us know if the lun config_check -v comes up with anything.

pedro_rocha · ‎2013-05-31

Ok, I see.

I will not be able to run the command this week, since this is located at a customer site.

Regarding duplicate WWPNs, I was able to check that via MyAutosupport and it is not the case.

Regards,

Pedro.

Sergio_Santos · ‎2013-05-31

Sweet lord this is the best solution I've heard in months! I don't have the exact same setup, but I might end up trying this anyway when I can schedule an outage window. What differs in my situation is that the four FC ports I'm using do not have duplicate WWPNs, so the zoning "looks" correct. I also hypothesize that's why my "lun config_check -v" doesn't report the duplicate WWPNs message. But (and this is a BIG but), when I compare the complete "fcp show adapter" list on both filers, there are links that are not connected with the exact same WWPNs between both. This may be enough to confuse the NetApps and send the incorrect ALUA info. That makes total sense to me. Here's my fcp show adapter dump (the duplicates are the port names that appear on both filers: 500a098296f7b314, 500a098396f7b314, 500a098496f7b314 and 500a098596f7b314):

filer1> fcp show adapter

Slot: 4a

Description: Fibre Channel Target Adapter 4a (Dual-channel, QLogic 2432 (2464) rev. 3)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:83:96:f7:b3:14 (500a098396f7b314)

Standby: No

Slot: 4b

Description: Fibre Channel Target Adapter 4b (Dual-channel, QLogic 2432 (2464) rev. 3)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:84:96:f7:b3:14 (500a098496f7b314)

Standby: No

Slot: 4c

Description: Fibre Channel Target Adapter 4c (Dual-channel, QLogic 2432 (2464) rev. 3)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:85:96:f7:b3:14 (500a098596f7b314)

Standby: No

Slot: 4d

Description: Fibre Channel Target Adapter 4d (Dual-channel, QLogic 2432 (2464) rev. 3)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:86:96:f7:b3:14 (500a098696f7b314)

Standby: No

Slot: 0d

Description: Fibre Channel Target Adapter 0d (Dual-channel, QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:82:96:f7:b3:14 (500a098296f7b314)

Standby: No

filer2> fcp show adapter

Slot: 4a

Description: Fibre Channel Target Adapter 4a (Dual-channel, QLogic 2432 (2464) rev. 3)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:81:96:f7:b3:14 (500a098196f7b314)

Standby: No

Slot: 4b

Description: Fibre Channel Target Adapter 4b (Dual-channel, QLogic 2432 (2464) rev. 3)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:82:96:f7:b3:14 (500a098296f7b314)

Standby: No

Slot: 4c

Description: Fibre Channel Target Adapter 4c (Dual-channel, QLogic 2432 (2464) rev. 3)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:83:96:f7:b3:14 (500a098396f7b314)

Standby: No

Slot: 4d

Description: Fibre Channel Target Adapter 4d (Dual-channel, QLogic 2432 (2464) rev. 3)

Adapter Type: Local

Status: LINK NOT CONNECTED

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:84:96:f7:b3:14 (500a098496f7b314)

Standby: No

Slot: 0d

Description: Fibre Channel Target Adapter 0d (Dual-channel, QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:86:f7:b3:14 (500a098086f7b314)

FC Portname: 50:0a:09:85:96:f7:b3:14 (500a098596f7b314)

Standby: No
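
Comparing two long "fcp show adapter" dumps by eye is error-prone, so here is a small sketch that does the comparison. It reads only the Slot, Status and FC Portname lines (the sample data is trimmed to those three fields from the dumps above) and reports every WWPN present on both filers, together with the slot and link status on each side. The function names are hypothetical.

```python
import re

SLOT_RE = re.compile(r"^Slot:\s*(\S+)")
STATUS_RE = re.compile(r"^Status:\s*(.+)")
PORTNAME_RE = re.compile(r"^FC Portname:.*\((\w+)\)")


def parse_adapters(output):
    """Map each WWPN to (slot, status) from `fcp show adapter` text."""
    adapters = {}
    slot = status = None
    for raw in output.splitlines():
        line = raw.strip()
        slot_match = SLOT_RE.match(line)
        status_match = STATUS_RE.match(line)
        port_match = PORTNAME_RE.match(line)
        if slot_match:
            slot, status = slot_match.group(1), None
        elif status_match:
            status = status_match.group(1).strip()
        elif port_match:
            adapters[port_match.group(1)] = (slot, status)
    return adapters


def shared_wwpns(first, second):
    """Return the sorted WWPNs that appear on both filers."""
    return sorted(set(first) & set(second))


FILER1 = """\
Slot: 4a
Status: ONLINE
FC Portname: 50:0a:09:83:96:f7:b3:14 (500a098396f7b314)
Slot: 4b
Status: LINK NOT CONNECTED
FC Portname: 50:0a:09:84:96:f7:b3:14 (500a098496f7b314)
Slot: 4c
Status: LINK NOT CONNECTED
FC Portname: 50:0a:09:85:96:f7:b3:14 (500a098596f7b314)
Slot: 4d
Status: LINK NOT CONNECTED
FC Portname: 50:0a:09:86:96:f7:b3:14 (500a098696f7b314)
Slot: 0d
Status: ONLINE
FC Portname: 50:0a:09:82:96:f7:b3:14 (500a098296f7b314)
"""

FILER2 = """\
Slot: 4a
Status: ONLINE
FC Portname: 50:0a:09:81:96:f7:b3:14 (500a098196f7b314)
Slot: 4b
Status: LINK NOT CONNECTED
FC Portname: 50:0a:09:82:96:f7:b3:14 (500a098296f7b314)
Slot: 4c
Status: LINK NOT CONNECTED
FC Portname: 50:0a:09:83:96:f7:b3:14 (500a098396f7b314)
Slot: 4d
Status: LINK NOT CONNECTED
FC Portname: 50:0a:09:84:96:f7:b3:14 (500a098496f7b314)
Slot: 0d
Status: ONLINE
FC Portname: 50:0a:09:85:96:f7:b3:14 (500a098596f7b314)
"""

filer1 = parse_adapters(FILER1)
filer2 = parse_adapters(FILER2)
for wwpn in shared_wwpns(filer1, filer2):
    print(wwpn, "filer1:", filer1[wwpn], "filer2:", filer2[wwpn])
```

On this data the script reports four shared WWPNs, and in three of the four cases an ONLINE port on one filer shares its WWPN with an unconnected port on the other.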

Sergio_Santos · ‎2013-06-03

Hang tight fellas, I'm working on an outage window to do the duplicate WWPN fix. I might be able to test it later this week.

gatorman1369 · ‎2013-06-13

I haven't forgotten about this or you guys.

We are going to perform the maintenance today or tomorrow.

From our technical support engineer's last correspondence:

//Start

I have finally found the root cause of why this happened, basically it is being caused by any of these 2 bugs

268320 - WWPN's are identical on both heads of a single_image cluster

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=268320

538094 - Check for duplicate WWPNs during bootup

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=538094

The steps to fix this issue are in the kb article you sent me before

https://kb.netapp.com/support/index?page=content&id=1013497

Below you will find the steps to change the WWPN, page 147

https://library.netapp.com/ecm/ecm_get_file/ECMM1277787

Steps

1. Take the adapter offline by entering the following command: fcp config adapter down

Example

fcp config 4a down

2. Display the existing WWPNs by entering the following command: fcp portname show [-v]

If you do not use the -v option, all currently used WWPNs and their associated adapters are displayed. If you use the -v option, all other valid WWPNs that are not being used are also shown.

3. Set the new WWPN for a single adapter or swap WWPNs between two adapters. If you are swapping between two adapters, make sure you take both adapters offline first.

Use this command . . .               To do this . . .

fcp portname set adapter wwpn        Set the new WWPN on a single adapter.

fcp portname swap adapter1 adapter2  Swap WWPNs between two adapters.

Example

fcp portname set 4a 10:00:00:00:c9:30:80:2

Example

fcp portname swap 3a 4a

4. Bring the adapter back online by entering the following command: fcp config adapter up

Example

fcp config 4a up

This is not a disruptive process, even though you have to bring the fcp port down there is a path misconfigured that is causing both controllers to use the same wwpn, so we have a duplicate path.

If you go ahead and make the changes in the partner controller, there won’t be any service affectation.

// End

It may be non-disruptive but I've been down that road a few times...Luckily for me, this is at a DR site where I can get away with powering off all of my workloads after hours. YMMV
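
To keep the order of operations straight during a maintenance window, the quoted steps can be written down as a command list before touching the filer. This is only a sketch: the two function names and the `<unused-wwpn>` placeholder are hypothetical, and the commands are the ones quoted in the steps above. It prints the commands and does not run anything.

```python
def wwpn_change_commands(adapter, new_wwpn):
    """Return, in order, the commands for setting a new WWPN on one adapter."""
    return [
        f"fcp config {adapter} down",              # take the adapter offline
        "fcp portname show -v",                    # list used and unused WWPNs
        f"fcp portname set {adapter} {new_wwpn}",  # assign the new WWPN
        f"fcp config {adapter} up",                # bring the adapter back online
    ]


def wwpn_swap_commands(adapter1, adapter2):
    """Return the commands for a swap; both adapters go offline first."""
    return [
        f"fcp config {adapter1} down",
        f"fcp config {adapter2} down",
        f"fcp portname swap {adapter1} {adapter2}",
        f"fcp config {adapter1} up",
        f"fcp config {adapter2} up",
    ]


# "<unused-wwpn>" is a placeholder: pick a value from `fcp portname show -v`.
for command in wwpn_change_commands("4a", "<unused-wwpn>"):
    print(command)
```

Keeping the list in a change record also makes it easy to note which WWPNs changed, since the fabric zoning has to be updated for them afterwards.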

gatorman1369 · ‎2013-06-13

Issue resolved

On one of our filers we did the following...

Issued a "fcp portname show [-v]" to get a list of all available WWPN's. Remember that ours had the same ones for both filers, so we need to select new ones from a list.

Then, we issued the "fcp portname set adapter wwpn" command to change the offending duplicate WWPNs.

We did cf takeovers thinking that was all there was to it. But then we noticed, when issuing a "lun config" on both filers, that the partner.single_image.key values were not set to point at each other as partners.

From the KB: 1013497

Resolving duplicate fibre channel WWPNs between nodes in an HA-pair

https://kb.netapp.com/support/index?page=content&id=1013497&actp=search&viewlocale=en_US&searchid=null#Storage_controller_head_upgrade_best_practice_p...

WWPNs can be duplicated across ports in an HA-pair under certain circumstances. As a side effect, to duplicate WWPNs, the ALUA configuration will also be duplicated causing all MPIO path states to be either optimized or non-optimized. This issue typically occurs when existing HA-pair systems configured with FCP are split and rejoined with the new nodes. The following procedure must be followed to resolve the duplicate configuration and restore the ALUA configuration. This procedure will cause WWPNs on one of the two nodes to change. The change to WWPNs will require fibre channel zoning configuration updates after the new WWPNs have been generated.

Stop the FCP service on both the nodes.

Run the following commands:

> priv set diag

*> lun config set local.single_image.key ""

*> lun config set partner.single_image.key ""

Repeat the process on the other node.

Perform a takeover and giveback of each node, or reboot each node.

After we did this we also had to re-zone the fabric because the filer got new WWN's.

Thinking back on this, I am not sure we even needed to do the "fcp portname set adapter wwpn" commands on all four FCP adapters because when we did the "lun config set..." commands the WWPN's changed again.

Also, there is no way I would have done this during the day as we definitely had all paths down a few times on ESXi hosts as well as some RHEL hosts. Luckily we planned for a few admins to help out with shutdowns and were able to knock this out successfully.

Happy this is resolved for us - I may share some of our before and after configs when I get back in the office in the AM.

gatorman1369 · ‎2013-06-14

Here is a "before" config where you can see that the partner.single_image.key values I mentioned above were not correctly set to each other's local.single_image.key. So not only were our WWPNs the same across filers, but we also had to make sure these keys were correctly configured. ALUA will not work until this partner configuration is correct. I also suspect this may well be the issue for those of you who, like pedro.rocha, aren't seeing the duplicate WWPN messages when you run "lun config_check -v". His WWPNs could be OK but the partner configuration may not be. I'd also take a look at your ispfct.single_image.nodename and related values.

MyFiler1> priv set diag

Warning: These diagnostic commands are for use by NetApp

personnel only.

MyFiler1*> lun config

local.single_image.key = "151702736"

partner.single_image.key = "151702726"

iscsi.nodename = "iqn.1992-08.com.netapp:sn.151702736"

local.serialnum = "70000802"

fc-port-0b = "9"

fc-port-0d = "9"

ispfct.local.nodename = "d0ccfa8980090a50"

fcp.fabric = "dual"

ispfct.portname.0d = "0"

ispfct.portname.0b = "1"

copy_offload.state = "true"

vaw.state = "true"

write_same.state = "true"

ispfct.portname.storage = "0d:0,0b:1,0c:2,0a:3,"

iscsi.disabled_interface = "e0M:vif1-340:"

ispfct.nodename = "c6cc2a8980090a50"

ispfct.single_image.nodename = "c6cc2a8980090a50"

partner.serialnum = "70000801"

filer.serialnum = "70000802"

ispfct.mode = "single_image"

ispfct.partner.nodename = "c6cc2a8980090a50"

MyFiler2> priv set diag

Warning: These diagnostic commands are for use by NetApp

personnel only.

MyFiler2*> lun config

local.single_image.key = "151702726"

partner.single_image.key = "151702484"

copy_offload.state = "true"

vaw.state = "true"

write_same.state = "true"

iscsi.nodename = "iqn.1992-08.com.netapp:sn.151702726"

local.serialnum = "70000801"

local.single_image.key = "151702726"

partner.single_image.key = "151702484"

ispfct.local.nodename = "c6cc2a8980090a50"

fc-port-0a = "9"

fc-port-0b = "9"

fc-port-0c = "9"

fc-port-0d = "9"

ispfct.nodename = "c6cc2a8980090a50"

ispfct.single_image.nodename = "c6cc2a8980090a50"

partner.serialnum = "70000802"

filer.serialnum = "70000801"

fcp.service = "on"

iscsi.disabled_interface = "e0M:vif1-340:"

ispfct.portname.storage = "0c:4,0d:5,0a:6,0b:7,"

ispfct.config.0a = "up"

ispfct.config.0b = "up"

ispfct.config.0c = "up"

ispfct.config.0d = "up"

ispfct.mode = "single_image"

ispfct.partner.nodename = "c6cc2a8980090a50"

iscsi.service = "on"
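
The key relationship described in this post (each node's partner.single_image.key should equal the other node's local.single_image.key) can be checked with a few lines. The sketch below parses the `key = "value"` lines of "lun config" output and reports which direction is broken. The function names are hypothetical, and the sample is trimmed to the two key lines from each dump above.

```python
import re

KEY_RE = re.compile(r'^\s*([\w.\-]+)\s*=\s*"([^"]*)"\s*$')


def parse_lun_config(output):
    """Parse `key = "value"` lines from `lun config` output into a dict."""
    config = {}
    for line in output.splitlines():
        match = KEY_RE.match(line)
        if match:
            config[match.group(1)] = match.group(2)
    return config


def key_mismatches(node_a, node_b):
    """List the ways the two nodes fail to reference each other's key."""
    problems = []
    if node_a["partner.single_image.key"] != node_b["local.single_image.key"]:
        problems.append("node A's partner key does not match node B's local key")
    if node_b["partner.single_image.key"] != node_a["local.single_image.key"]:
        problems.append("node B's partner key does not match node A's local key")
    return problems


FILER1 = '''
local.single_image.key = "151702736"
partner.single_image.key = "151702726"
'''

FILER2 = '''
local.single_image.key = "151702726"
partner.single_image.key = "151702484"
'''

print(key_mismatches(parse_lun_config(FILER1), parse_lun_config(FILER2)))
```

On the "before" config this reports one mismatch: MyFiler2's partner key (151702484) does not match MyFiler1's local key (151702736), while the other direction is consistent.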

Sergio_Santos · ‎2013-06-21

Sorry for the delay guys (it was more difficult to get an outage window than I thought). I tried the steps in Gatorman's link, the ones under "Resolving duplicate fibre channel WWPNs between nodes in an HA-pair": https://kb.netapp.com/support/index?page=content&id=1013497

And I'm glad to report everything is now working as expected! I did a full reboot of the nodes--not a takeover/giveback--and ran the "priv set diag" and then "lun config" to verify the local.single_image.key and partner.single_image.key were updated properly. My test ESX server is now back on the ALUA module and only 2 of the 4 paths have Active (I/O). Stopping the FCP service properly kills the paths on one node and fails over to the partner.

I'm not sure how duplicate WWPNs were missed by multiple NetApp techs, but I can't say I'm not a little disappointed. The Configuration Checker software also doesn't report this, nor is it caught in the NetApp autosupport cloud. Unless you find that specific KB article, you're in the Twilight Zone!

EDIT -- I forgot to mention the fix will only change the WWPNs on one node (the other's will remain the same). For the changed node's WWPNs you will need to go back and re-zone (or change aliases in my case). If you're quick to look over the 'fcp show adapters' output you might miss the changed WWPNs and think the fix didn't work.

gatorman1369 · ‎2013-06-24

I think the reason this was "missed" is that it falls a bit outside the norm of what they generally look for. In our case, the filers were not set up correctly during the initial setup and the configuration was overlooked by a NetApp SE. Glad to hear that we were able to solve this together. I think Pedro.Rocha's issue is still outstanding... any news?

ritchi641 · ‎2013-07-18

While you are talking about ALUA and initiator groups (igroups): we have LUNs on only one controller; the other one is CIFS only.

When we go from non-ALUA to ALUA, do we have to activate ALUA on each igroup on both controllers in case of a failover?