Second controller acts like a teenager, decides it doesn't like running FCP protocol anymore - V-Series concerns before re-enabling

mbonneau79 · 2012-06-19

I have an HA-pair FAS3140-equivalent (IBM N6040) that has been displaying some odd behavior.

The controller heads are attached to 4 ESXi 4.1 hosts and 1 ESXi 5.0 host through Cisco MDS 9124 switches and Qlogic 2492 HBAs. They are also connected to an IBM Bladecenter chassis with 9 online blades via Brocade switches.

Controller 1 has 7 x 300GB FC shelves connected and runs our production cluster and Oracle databases.

Controller 2 has 1 x 146GB FC and 2x 1TB SATA shelves connected and run boot-from-SAN for the Bladecenter (146GB) and CIFS (1TB SATA). I recently needed a chunk of storage for test VMs, so I built a small LUN for storing ISO files and another for storing a few VMs on the 1TB SATA disks since there was a lot of free space available.

The first sign of trouble was latency alerts in VMware, on all the connected hosts, for the two VMware LUNs on the affected controller. The alerts concerned only the two LUNs on the second controller; all the production LUNs on the first controller were fine.

At first I assumed it was because I had SIS running on multiple volumes in that particular aggregate, so I rescheduled them to run over a 4-hour window of low I/O, thinking it would solve the problem. It helped a little bit (the window over which the latency issues occurred got a bit shorter), but not a lot. Then I went through all the test VMs and deleted a few VMware snapshots that were sitting around, hoping that would resolve the issue. Again, fractional improvement.

So I did some more digging and I found this:

netapp2> lun stats -o -i 1 /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun

Read  Write  Other  QFull  Read  Write  Average  Queue   Partner  Partner  Lun
Ops   Ops    Ops           kB    kB     Latency  Length  Ops      kB
46    20     0      0      2592  2464   15.34    0.00    66       5056     /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
78    41     0      0      4616  4810   8.10     0.01    119      9426     /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
55    25     0      0      5152  4953   11.90    0.02    80       10105    /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
68    25     0      0      5072  4953   10.17    0.01    93       10025    /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
38    121    0      0      1416  2678   6.13     0.09    159      4094     /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
52    94     0      0      4992  5609   7.23     0.05    146      10601    /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
63    26     0      0      4800  4928   11.13    0.01    89       9728     /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
75    28     0      0      5392  5185   9.16     0.00    103      10577    /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
41    20     0      0      2408  2457   16.31    0.01    61       4865     /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
45    39     0      0      2736  2845   12.25    0.02    84       5581     /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
46    29     0      0      1512  1409   12.84    0.00    75       2921     /vol/SystemCenter_VMFS/SystemCenter_VMFS.lun
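One way to read these samples is to compare the Partner Ops column against the sum of Read Ops and Write Ops. The following is only a rough sketch in Python, with the numbers copied from the output above; the tuple layout and the function name `partner_share` are made up for illustration, and it assumes Partner Ops counts operations that reached the LUN through the partner controller.

```python
# Values copied from the lun stats samples above.
# Tuple layout (an illustrative choice, not a NetApp format):
# (read_ops, write_ops, partner_ops, read_kb, write_kb, partner_kb)
SAMPLES = [
    (46, 20, 66, 2592, 2464, 5056),
    (78, 41, 119, 4616, 4810, 9426),
    (55, 25, 80, 5152, 4953, 10105),
    (68, 25, 93, 5072, 4953, 10025),
    (38, 121, 159, 1416, 2678, 4094),
    (52, 94, 146, 4992, 5609, 10601),
    (63, 26, 89, 4800, 4928, 9728),
    (75, 28, 103, 5392, 5185, 10577),
    (41, 20, 61, 2408, 2457, 4865),
    (45, 39, 84, 2736, 2845, 5581),
    (46, 29, 75, 1512, 1409, 2921),
]


def partner_share(samples):
    """Fraction of read+write ops that were counted as partner ops."""
    total_ops = sum(read + write for read, write, *_ in samples)
    partner_ops = sum(sample[2] for sample in samples)
    return partner_ops / total_ops if total_ops else 0.0


share = partner_share(SAMPLES)
print(f"{share:.0%} of read/write ops were partner ops")  # prints "100% ..."
```

In every sample the partner figures equal the read and write figures added together, so under that assumption none of the I/O for this LUN is arriving on the controller's own target ports.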

Partner OPS are obviously way higher than they should be (should obviously be 0, like the first controller) and those latency figures are not great. I mean, they aren't bad, but LUNs on the first controller are <5ms. So now we need to determine WHY partner ops are where they are. Let's take a look at the initiator group:

netapp2> igroup show -v

ALUA-VMware (FCP):

OS Type: vmware

Member: 21:00:00:24:ff:22:d0:e7 (logged in on: vtic)

Member: 21:00:00:24:ff:22:d0:c3 (logged in on: vtic)

Member: 21:00:00:24:ff:22:d0:e6 (logged in on: vtic)

Member: 21:00:00:24:ff:22:d0:9b (logged in on: vtic)

Member: 21:00:00:24:ff:22:d0:c2 (logged in on: vtic)

Member: 21:00:00:24:ff:22:d0:9a (logged in on: vtic)

Member: 21:00:00:24:ff:22:d0:a3 (logged in on: vtic)

Member: 21:00:00:24:ff:22:d0:a2 (logged in on: vtic)

Member: 21:00:00:24:ff:26:c3:a7 (logged in on: vtic)

Member: 21:00:00:24:ff:26:c3:a6 (logged in on: vtic)

UUID: 98fa8960-ed12-11e0-b946-00a098105dc6

ALUA: Yes

So here we see 10 initiators, which makes sense since the Qlogic 2492 is a dual-port 4Gbps HBA and the hosts need to be able to communicate with the second controller over all ports. ALUA is enabled and the pathing policy is set to Round Robin. But all of the initiators show as being logged in on VTIC, which as I understand it is the cluster interconnect. Why would hosts that are, in theory at least, properly configured not log in to this controller directly, so that all traffic ends up going over the interconnect?
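For anyone who wants to script this check, here is a minimal Python sketch that scans saved igroup show -v text and lists the members whose only login is vtic. The member lines are copied from the output above; the regular expression, the function name `vtic_only_members`, and the assumption that multiple logins would appear as a comma-separated list inside the parentheses are mine, not something confirmed in this thread.

```python
import re

# Member lines copied from the igroup show -v output above.
IGROUP_OUTPUT = """\
ALUA-VMware (FCP):
OS Type: vmware
Member: 21:00:00:24:ff:22:d0:e7 (logged in on: vtic)
Member: 21:00:00:24:ff:22:d0:c3 (logged in on: vtic)
Member: 21:00:00:24:ff:22:d0:e6 (logged in on: vtic)
Member: 21:00:00:24:ff:22:d0:9b (logged in on: vtic)
Member: 21:00:00:24:ff:22:d0:c2 (logged in on: vtic)
Member: 21:00:00:24:ff:22:d0:9a (logged in on: vtic)
Member: 21:00:00:24:ff:22:d0:a3 (logged in on: vtic)
Member: 21:00:00:24:ff:22:d0:a2 (logged in on: vtic)
Member: 21:00:00:24:ff:26:c3:a7 (logged in on: vtic)
Member: 21:00:00:24:ff:26:c3:a6 (logged in on: vtic)
ALUA: Yes
"""

# Assumption: logins are listed inside the parentheses, comma separated.
MEMBER_RE = re.compile(r"Member:\s+(\S+)\s+\(logged in on:\s*([^)]*)\)")


def vtic_only_members(text):
    """Return WWPNs whose only listed login is the interconnect (vtic)."""
    flagged = []
    for wwpn, ports in MEMBER_RE.findall(text):
        logins = [p.strip() for p in ports.split(",") if p.strip()]
        if logins and all(p == "vtic" for p in logins):
            flagged.append(wwpn)
    return flagged


flagged = vtic_only_members(IGROUP_OUTPUT)
print(f"{len(flagged)} initiators are logged in on vtic only")
```

With the output above, all 10 members are flagged, which matches what the igroup listing shows by eye.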

I took a peek in the Virtual Storage Console and I see my controllers in there, 1 and 2. Both running the same version of ONTAP (8.0.2P4 7-Mode), status on both is Normal, both have a fair bit of free space, both have VAAI Capable set to Enabled, and both support FC, iSCSI. WAIT A MINUTE, the second controller only has iSCSI under "Supported Protocols"! It does NOT have FC listed.

I compare the output from the license command. They both show as being licensed for FCP. But if the second filer won't accept FC as a protocol, I suppose that would explain why it's not accepting any traffic itself - all traffic is moving across the interconnect. fcp status shows the fcp protocol is not running, which is weird since it was working fine previously.

Now, this system was previously configured as a gateway, front-ending IBM DS4300 storage. After some incompatibility issues where the DS4300 would reboot randomly, we made the decision to connect NetApp native disk shelves to it and chuck the DS4300 in the bin. There was no problem with operating in this capacity, all the way until we upgraded from ONTAP 7.3.2 to 8.0.2P4. After rebooting the controller head post-upgrade, it lost access to all its disk shelves because it was still licensed as a gateway and did not have the fc-non-array-adapter-list setting set properly (a list of FC adapter ports required for connecting NetApp-native disk shelves to a gateway-licensed storage system). After setting this value and rebooting, it regained access to its shelves and things were all good (not certain the first controller has this set at this point; I don't think you can check unless you're in the pre-boot environment).

Now I need to add more shelves to this cluster. One pair of 2TB SATA shelves goes into an existing loop (1TB SATA), which is no problem: just break the loop, add the new shelves, watch the messages on the console, ensure proper multi-pathing and reconnect to the original ports to complete the loop. But then I have two other shelves to add to their own loop, which means changing the fc-non-array-adapter-list setting, which can only be done from the pre-boot menu. Should I try running fcp start first, before rebooting the controllers one at a time, to ensure the functional controller can actually pass its FC LUNs to the second controller? This probably makes sense from a functional standpoint; otherwise I'll need to do a complete shutdown of our environment.

I just have a bad feeling about throwing the switch on this since it's blown up in our faces a few times before. Has anyone else encountered this kind of weirdness with a V-Series?

And why would it just not start FCP from boot like it's supposed to?

isaacs · 2012-06-19

Hi,

Have you tried just starting the FCP process? What happens?

The fc-non-array-adapter-list variable only needs to be set for FC attached disk shelves. Newer SAS shelves do not use the FC initiators, and so are not subject to that requirement. Do you know what type of shelves you have? Model numbers?

- Dan Isaacs

mbonneau79 · 2012-06-19

Hi Dan,

I have not tried starting the FCP service. Unfortunately, I will need to schedule a window to make that kind of change on the storage system.

All of our storage is fibre attached, including the new shelves. Ports 0a through 0d are connected to the Fibre Channel switches. We have two 4-port adapters installed in each controller, connected to disk shelves. 1b, 1c, 1d, 4b, 4c and 4d are currently connected, leaving me with 1a and 4a on each controller for the multi-path cabling of the new loop.

My current change request is to connect the new disk shelves (two into an existing loop, two into a new loop), enable FCP on the second controller, do a cf takeover, ensure the fc-non-array-adapter-list settings are set for all the ports required, rinse and repeat on the second controller, then test LUN access.
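The port arithmetic above can be double-checked with a small Python sketch. It assumes the two 4-port adapters sit in slots 1 and 4 with ports lettered a through d, which is inferred from the port names in this post; the names `SLOTS`, `PORT_LETTERS`, `IN_USE` and `free_ports` are purely illustrative.

```python
# Assumption: adapters are in slots 1 and 4, each with ports a-d
# (inferred from the port names mentioned in the post).
SLOTS = ("1", "4")
PORT_LETTERS = "abcd"

# Ports the post says are already cabled to disk shelves.
IN_USE = {"1b", "1c", "1d", "4b", "4c", "4d"}


def free_ports(slots, letters, in_use):
    """List adapter ports that are not yet cabled, in slot/letter order."""
    all_ports = [slot + letter for slot in slots for letter in letters]
    return [port for port in all_ports if port not in in_use]


print(free_ports(SLOTS, PORT_LETTERS, IN_USE))  # ['1a', '4a']
```

That leaves one free port on each adapter per controller, which is what a new multi-pathed loop would use.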

I think that's a good way to proceed, but am willing to yield to the experience of others before I blow something up.

Thanks,

Mike

slusnia · 2012-06-19

This sounds like a question that you should send to NetApp support. They may provide faster resolution.

isaacs · 2012-06-20

You may be well served to do as Steve suggested, and have NGS confirm your plan. That will provide them a handy history in the event something goes wrong and you need to call them during the outage. It could greatly reduce the time it takes to help you, helping ensure you don't overshoot your outage window.

Good luck!

mbonneau79 · 2012-06-22

Unfortunately it's an IBM unit and the support hasn't exactly been stellar, so that puts me in between a rock and a hard place at this point. I have an outage window for tomorrow so will enable FCP on the second controller and see what happens as a result. I expect no drama but want to ensure system stability in that configuration (with LUNs being accessed through optimal paths) before I think about connecting the new disk shelves since that will definitely require a reboot to add the new port numbers for FC-connected shelves using fc-non-array-adapter-list.

mbonneau79 · 2012-06-25

So re-enabling FCP was successful. Initiators with the proper DSM loaded (and VSC) automatically picked up the optimal paths. I still have 3 RHEL servers which are a pain in the butt but whatever, I'll get the MPIO package loaded on those shortly.

Now a ticket off to IBM support regarding the proper order to connect new disk trays to existing fiber ports which aren't specified by the fc-non-array-adapter-list variable.
