I have an issue in vSphere where ESX hosts are reporting that their attached FC LUNs, containing VMware VMFS file systems and virtual machine VMDKs, are disconnecting and then reconnecting soon after. This is reported in the ESX logs as below:
Aug 1 11:24:01 vobd: Aug 01 11:24:01.053: 502404320475us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba1:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01"..
Aug 1 11:24:01 Hostd: [2011-08-01 11:24:01.052 FFBB7B90 info 'ha-eventmgr'] Event 1191 : Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba0:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01".
Aug 1 11:24:01 vobd: Aug 01 11:24:01.053: 502404321292us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba0:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01"..
Aug 1 11:24:01 vobd: Aug 01 11:24:01.054: 502404322123us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba1:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01".
It is interesting to note that although we are seeing these errors on the ESX hosts, nothing is being reported on the storage side, and no VMs have gone down, which I would expect if their storage disappeared. The ESX hosts predominantly run 4.1, though some are still running 4.0, and this issue has been reported against all hosts.
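In case it helps, here is a quick sketch to tally which paths are dropping from these vobd lines (the regex is written against the lines quoted above and may need adjusting for your log format):

```python
import re
from collections import Counter

# Matches vobd "Lost connectivity" lines like the ones quoted above.
LINE_RE = re.compile(
    r"Lost connectivity to storage device (?P<dev>naa\.\w+)\. "
    r"Path (?P<path>vmhba\d+:C\d+:T\d+:L\d+) is down"
)

def path_down_events(log_lines):
    """Return (device, path) tuples, one per path-down log line."""
    events = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            events.append((m.group("dev"), m.group("path")))
    return events

# Example using two of the lines quoted above:
logs = [
    'vobd: [esx.problem.storage.connectivity.lost] Lost connectivity to '
    'storage device naa.60a98000486e58776b5a56574f617069. '
    'Path vmhba1:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01".',
    'vobd: [esx.problem.storage.connectivity.lost] Lost connectivity to '
    'storage device naa.60a98000486e58776b5a56574f617069. '
    'Path vmhba0:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01".',
]
print(Counter(p for _, p in path_down_events(logs)))
```

Counting drops per path over a day or two makes it obvious whether all paths to a device drop together (pointing at the target) or one path flaps (pointing at a cable, port, or HBA).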
Thinking the likely cause was the fabric, I have a case open with IBM (the switches are IBM SAN32b and SAN16b devices), and they have identified the NetApp controllers as a 'slow-draining device', suggesting that a reboot of the controllers might resolve the issue.
Has anyone had a similar issue, and if so, what was the root cause?
Thanks in advance,
We are also facing the same issue with ESXi 4.1. We got VMware support involved and found that connectivity was being lost for about 5 seconds on random paths and then coming back. We only see this on one controller in our cluster, not both. Our guests do not go down during this period either.
I think I am seeing a pattern to this that follows LUN latency. Are you monitoring LUN latency in vSphere and with Operations Manager? If you are, you can cross-correlate and look for a pattern that matches your alerts. I think ESXi has some sort of limit for high latency, and if it's hit, the LUNs are detected as path down. Another thing to check is your dedupe schedule: when these errors occur, is it during your scheduled dedupe runs?
Since disabling our nightly scheduled dedupe runs and moving them to weekends only, I have noticed this alert is less frequent.
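For anyone wanting to test the dedupe-window theory against their own alert history, here is a rough sketch (the timestamps and window below are hypothetical, not from any real environment) that computes what fraction of alerts land inside a daily maintenance window:

```python
from datetime import datetime, time

def in_window(ts, start, end):
    """True if a timestamp falls inside a daily window (the window may
    wrap past midnight, e.g. 23:00-01:00)."""
    t = ts.time()
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end  # wraps past midnight

def overlap_fraction(alerts, start, end):
    """Fraction of alert timestamps that fall inside the window."""
    hits = sum(in_window(ts, start, end) for ts in alerts)
    return hits / len(alerts) if alerts else 0.0

# Hypothetical alert times versus a 00:00-04:00 dedupe window:
alerts = [datetime(2011, 12, 14, 0, 15), datetime(2011, 12, 14, 3, 50),
          datetime(2011, 12, 14, 13, 5)]
print(overlap_fraction(alerts, time(0, 0), time(4, 0)))  # 2 of 3 alerts hit
```

If the fraction is much higher than the window's share of the day, the correlation is worth chasing.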
In our scenario, we did move the dedupe schedules around and spread them out. But the disconnects are still happening between 12:00am and 12:30am each day. Thanks for the pointer on latency; I will check that as well.
@imunro.hug - It was the latency, and the filer being extremely busy during that window, that caused the resets. I rescheduled some backups and spread out the dedupe schedule, which helped the situation in our case. On to buying more spindles now....
Thanks for the help
Is anyone who is experiencing these LUN disconnects also seeing messages like this:
Wed Dec 14 04:22:32 CST [filername: raid.disk.offline:notice]: Marking Disk /aggr8/plex0/rg1/4a.19 Shelf 1 Bay 3 [NETAPP X269_WMARS01TSSX NA01] S/N [WD-WMATV8096873] offline.
Wed Dec 14 04:22:47 CST [filername: raid.disk.online:notice]: Onlining Disk /aggr8/plex0/rg1/4a.19 Shelf 1 Bay 3 [NETAPP X269_WMARS01TSSX NA01] S/N [WD-WMATV8096873].
We are also having the disconnects as well as the messages above. This has been identified as burt 525279; we plan to upgrade to DOT 8.0.2P4 to fix this bug. Not sure if this will stop the LUN disconnects.
Our ESX LUN disconnect issue is solved. It was a hardware issue: the backend fibre cable into the FC initiator card needed to be re-seated. It took a little too long to diagnose, but at least the problem is gone. Now on to the next one.
Did anyone ever figure out the cause of the lost connectivity issue?
garciam99's issue was related to disk disconnects.
In our case we see lost connectivity or redundant-path degradation to the storage device.
The VMware errors correspond to very high NetApp latency (thousands of milliseconds), high HBA latency, and sometimes higher-than-usual throughput (MB/s) on all datastores. There are no errors in the NetApp syslog.
We also have LUNs presented to a SQL server; latency is reported on those LUNs as well, but they do not get disconnected.
Datastores are connected via FC and FCoE. We have 10 ESX hosts with 20 datastores, and most datastores are presented to the majority of ESX hosts.
The connectivity issue happens randomly between any ESX host and any datastore. I cannot pinpoint the root cause of this behavior.
We do have misaligned LUNs due to misaligned Windows 2003 and Linux servers, but we are working towards addressing this, and so far it has not made any difference at all.
ESX hosts lose connectivity 2-3 times a day.
We have 2 controllers, and only one controller experiences this behavior.
ESX 4.1, NetApp v3140, ONTAP 8.0.1, Fibre Channel and FCoE connectivity; some datastores are as large as 1.8 TB, with 10-25 guests per datastore.
I had a call logged with VMware about this early on when we moved to NetApp. It seems that there is some kind of maximum latency figure that, when breached on a path, causes ESX to detect that path as down. I have been trying to find a way to tweak this value higher but have not found one. The issue is definitely related to higher-than-normal workloads on the filer; we are guaranteed to see it during our backup window, when pushing about 400 MB/sec over FC, and also when the dedupe scans run (not at the same time).
If anyone knows how to remedy this, I would be keen to know; I'm getting tired of ack'ing VMware alerts....
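I never found a documented tunable for that latency limit, but the correlation can at least be confirmed offline. A sketch (the sample data, threshold, and run length below are my guesses for illustration, not documented ESX values) that flags sustained latency excursions in a series of per-path samples:

```python
def breach_windows(latency_ms, threshold_ms, min_run=3):
    """Return (start, end) index ranges where latency stays above the
    threshold for at least min_run consecutive samples."""
    runs, start = [], None
    for i, v in enumerate(latency_ms):
        if v > threshold_ms:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Handle an excursion that runs to the end of the series.
    if start is not None and len(latency_ms) - start >= min_run:
        runs.append((start, len(latency_ms) - 1))
    return runs

# Hypothetical per-path latency samples (ms): one sustained spike and
# one isolated blip that should not be flagged.
samples = [12, 15, 2400, 3100, 2800, 14, 11, 5000, 9, 8]
print(breach_windows(samples, 1000))  # -> [(2, 4)]
```

Lining the flagged ranges up against backup and dedupe start times would show whether the path-down events only fire during sustained excursions.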
The misalignment will certainly not be helping this situation. Aligning a single VM will not affect its performance; heck, aligning a dozen VMs may not affect any of their performance. However, what you are doing is reducing the load on the controller and reducing how many partial blocks the controller has to deal with. Under normal load the storage controller can handle the partial blocks without any change in performance, but it has to do some fancy footwork to make this happen (duck on water: calm on the surface but paddling like crazy underneath). However, if you suddenly throw a bunch of partial writes at the controller, or ask it to do something else while handling those partial writes (like dedupe), then you could have a period of very high latency.
I can't say for sure that this is what is happening, but it could be. I would certainly try to address the misaligned VMs and see if that eliminates or reduces the latency spikes. The VSC 4 beta allows you to build optimized LUNs for the misaligned VMs and quickly eliminate that as a possibility.
That is where I would start.
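For anyone who wants to sanity-check alignment themselves, the arithmetic is simple: a guest partition is aligned when its starting byte offset is a multiple of the 4 KiB WAFL block. A minimal sketch:

```python
WAFL_BLOCK = 4096  # NetApp WAFL block size in bytes

def is_aligned(partition_offset_bytes, block=WAFL_BLOCK):
    """A guest partition is aligned when its starting offset is a
    whole multiple of the storage block size."""
    return partition_offset_bytes % block == 0

# Windows 2003 default: the first partition starts at sector 63
# (63 * 512 bytes), so every guest 4 KiB I/O straddles two WAFL blocks.
print(is_aligned(63 * 512))   # -> False
print(is_aligned(64 * 1024))  # -> True (a 64 KiB offset, a common fix)
```

That straddling is exactly the partial-block work described above: each misaligned write touches two back-end blocks instead of one.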
But we have made significant changes to our misalignment -- we even did the infamous "shimmed NetApp LUN" to provide temporary relief from any misalignment woes. Not sure how that fits into your "truck with flat tires" metaphor from your blog post a while back. Did it solve all our problems? No, but it did cut down the noise and helped us focus our investigation in other areas.
Oh, and I've had a ticket open with VMware tech support now for about a week -- they have been less than helpful. As soon as Dmitry or I find the root cause (more likely, root causes) of it, I'm sure we'll share.
We noticed that our latency spikes correlated with slight throughput (MB/s) increases on all datastores. Using the Veeam SCOM management pack, we were able to narrow it down to the VMs that were pushing more IO during these times; not significantly more each, but it was hundreds of VMs doing it.
Further investigation revealed that Symantec Endpoint Protection was pushing anti-virus updates to those VMs at the same time.
After randomizing the anti-virus update schedule, things improved significantly, and we even went a few days without latency spikes.
But it did happen again, and this time there was no additional IO throughput coming through the datastores.
So now we are having a few spikes per week versus a couple daily.
The next step is to upgrade ONTAP from 8.0.1 to 8.0.2.
Will update post if it makes any difference.
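In case it helps anyone else, the randomization itself was trivial; something along these lines (the counts and times below are hypothetical, and this is just the idea, not the SEP scheduler itself):

```python
import random

def jitter_schedule(base_minute_of_day, count, spread_minutes, seed=None):
    """Spread `count` update jobs around a base start time by a random
    offset within +/- spread_minutes, wrapping at midnight (1440 min)."""
    rng = random.Random(seed)
    return [(base_minute_of_day
             + rng.randint(-spread_minutes, spread_minutes)) % 1440
            for _ in range(count)]

# Hypothetical: 200 VMs that all started AV updates at 02:00 sharp;
# spread them across a two-hour band (01:00-03:00) instead.
times = jitter_schedule(2 * 60, 200, 60, seed=42)
print(min(times), max(times))  # every start lands within 01:00-03:00
```

Spreading the starts turns one synchronized IO burst into a low, flat trickle, which is why the spikes mostly disappeared.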
Interesting stuff. Has any work been done on the fabric side of things? Do we know if all the NetApp LUNs are connected on the right path? It seems odd that the ESX servers are having this problem but not the application VMs running on them. If it were something at the NetApp level, you would expect everything connected to it to have problems. I wonder if there is some sort of path-thrashing or partner-path problem? Just a wild guess.
Dmitry did you mean DOT 8.1?
We did check physical connections to the best of our knowledge.
We do occasionally get (about once weekly, with no particular pattern) an "HA Group Notification from NETAPP_CONTROLLER (FCP PARTNER PATH MISCONFIGURED) ERROR", but it does not appear to cause any issues or correspond with our controller latency spikes.
Yes, we are planning to upgrade from 8.0.1 to 8.1RC3.
Your path misconfiguration is due to wrong FC access to your LUN. This occurs if you had a failover and the FC path is still going over the failover controller. If you look at the preferred path on your ESX HBA, you will see that it connects to the WWPN of the NetApp cluster partner. This probably also has some impact on your latency, because access to the LUN always goes over the "wrong" controller and then across the cluster interconnect to the controller that owns the LUN.
Setting the preferred path to a WWPN of the physical controller where the LUN resides should help.
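To illustrate the check (a toy model only; the WWPNs and names below are made up, and in a real environment you would read these from the filer and the ESX host):

```python
# Toy model of the partner-path check: compare the target WWPN each
# ESX path uses against the WWPN of the controller that owns the LUN.
lun_owner = {"lun1": "controllerA"}
controller_wwpns = {
    "controllerA": "50:0a:09:81:00:00:00:01",  # owns lun1
    "controllerB": "50:0a:09:81:00:00:00:02",  # cluster partner
}

def misconfigured_paths(active_paths):
    """Return (lun, wwpn) pairs whose target WWPN belongs to the
    partner rather than the owning controller, i.e. paths whose
    traffic must cross the cluster interconnect."""
    bad = []
    for lun, wwpn in active_paths:
        owner_wwpn = controller_wwpns[lun_owner[lun]]
        if wwpn != owner_wwpn:
            bad.append((lun, wwpn))
    return bad

paths = [("lun1", "50:0a:09:81:00:00:00:01"),
         ("lun1", "50:0a:09:81:00:00:00:02")]  # second path hits the partner
print(misconfigured_paths(paths))
```

Any path the check flags is one where IO takes the extra interconnect hop, which matches the latency impact described above.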