2011-09-13 10:09 AM
I have an issue where, within vSphere, ESX hosts are reporting that their attached FC LUNs containing VMware VMFS file systems and virtual machine VMDKs are disconnecting, then reconnecting soon after. This shows up in the ESX logs as below:
Aug 1 11:24:01 vobd: Aug 01 11:24:01.053: 502404320475us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba1:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01"..
Aug 1 11:24:01 Hostd: [2011-08-01 11:24:01.052 FFBB7B90 info 'ha-eventmgr'] Event 1191 : Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba0:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01".
Aug 1 11:24:01 vobd: Aug 01 11:24:01.053: 502404321292us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba0:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01"..
Aug 1 11:24:01 vobd: Aug 01 11:24:01.054: 502404322123us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba1:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01".
It is interesting to note that although we are seeing these errors on the ESX hosts, nothing is reported on the storage side, and no VMs have gone down, which I would expect if their storage had actually disappeared. The ESX hosts predominantly run 4.1, though some are still running 4.0, and this issue has been reported against all hosts.
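For triage, log lines like those above can be tallied by path and hour of day to expose a time-of-day pattern (a backup or dedupe window, for example). This is only a sketch: the regex is matched to the sample lines above, and the log file location is an assumption to adjust for your host.

```python
import re
from collections import Counter

# Matches vobd/Hostd "connectivity lost" lines like those quoted above, e.g.
# "Aug 1 11:24:01 vobd: ... Path vmhba1:C0:T3:L1 is down. ..."
PATTERN = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<hh>\d{2}):\d{2}:\d{2}.*"
    r"Path\s+(?P<path>vmhba\d+:C\d+:T\d+:L\d+)\s+is down"
)

def tally_path_down(lines):
    """Count path-down events keyed by (path, hour-of-day)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[(m.group("path"), m.group("hh"))] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Log file path is an assumption - point it at your vmkernel/messages log.
    with open(sys.argv[1]) as f:
        for (path, hour), n in sorted(tally_path_down(f).items()):
            print(f"{path}  hour={hour}  events={n}")
```

If the counts cluster in one or two hours, that points at a scheduled workload rather than a flaky cable or HBA.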
Thinking the likely cause was the fabric, I have a case open with IBM (the switches are IBM SAN32B and SAN16B devices), and they have identified the NetApp controllers as a 'slow-draining device', suggesting that a reboot of the controllers might resolve the issue.
Has anyone had a similar issue and if so what was the root cause?
Thanks in advance,
2011-11-29 12:16 PM
We are also facing the same issue with ESXi 4.1. We got VMware support involved and found that connectivity was being lost for roughly 5 seconds on random paths before coming back. We only see this on one controller in our cluster, not both. Our guests do not go down during this period either.
I think I am seeing a pattern to this that follows LUN latency. Are you monitoring LUN latency in vSphere and with Operations Manager? If you are, you can cross-correlate and look for a pattern that matches your alerts. I think ESXi has some sort of limit for high latency, and if it's breached the LUNs are detected as path down. Another thing to check is your dedupe schedule: when these errors occur, is it during your scheduled dedupe runs?
Since disabling our nightly scheduled dedupe runs and moving them to the weekend only, I have noticed this alert is less frequent.
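One way to test the latency theory is to line up the disconnect timestamps against latency samples exported from vSphere or Operations Manager. A minimal sketch, assuming you can get both as timestamped lists; the 100 ms threshold and the 5-minute window are arbitrary assumptions, not ESXi's internal limit:

```python
from datetime import timedelta

def correlate(disconnects, latency_samples, threshold_ms=100.0,
              window=timedelta(minutes=5)):
    """For each disconnect timestamp, report the worst latency sample
    within +/- window that exceeded threshold_ms, or None if there was
    no spike nearby. Threshold and window are assumptions - tune them
    to what 'high' means in your environment."""
    hits = {}
    for d in disconnects:
        spikes = [ms for (t, ms) in latency_samples
                  if abs(t - d) <= window and ms >= threshold_ms]
        hits[d] = max(spikes) if spikes else None
    return hits
```

If every disconnect maps to a spike and the spikes track your dedupe or backup windows, that supports rescheduling those jobs rather than chasing the fabric.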
2011-11-29 02:12 PM
In our scenario, we did move the dedupe schedules around and spread them out, but the disconnects are still happening between 12:00 am and 12:30 am each day. Thanks for the pointer on latency; I will check that as well.
2011-11-30 02:14 AM
I hope you have configured the correct HBA timeouts using the config_hba script, and the queue depth on the ESX servers' HBA cards, using the NetApp ESX Host Utilities 5.x.
2011-11-30 01:15 PM
@imunro.hug - It was the latency, and the filer being extremely busy during that window, that caused the resets. I rescheduled some backups and spread out the dedupe schedule, which helped the situation in our case. On to buying more spindles now....
Thanks for the help
2011-12-16 02:58 PM
Is anyone who is experiencing these LUN disconnects also seeing messages like this:
Wed Dec 14 04:22:32 CST [filername: raid.disk.offline:notice]: Marking Disk /aggr8/plex0/rg1/4a.19 Shelf 1 Bay 3 [NETAPP X269_WMARS01TSSX NA01] S/N [WD-WMATV8096873] offline.
Wed Dec 14 04:22:47 CST [filername: raid.disk.online:notice]: Onlining Disk /aggr8/plex0/rg1/4a.19 Shelf 1 Bay 3 [NETAPP X269_WMARS01TSSX NA01] S/N [WD-WMATV8096873].
We are also having the disconnects as well as the messages above. This has been identified as BURT 525279; we plan to upgrade to Data ONTAP 8.0.2P4 to fix this bug. Not sure if this will stop the LUN disconnects.
2011-12-22 02:00 PM
Our ESX LUN disconnect issue is solved. It was a hardware issue: the back-end fibre cable into the FC initiator card needed to be re-seated. It took a little too long to diagnose, but at least the problem is gone. Now on to the next one.
2012-02-28 03:21 AM
Did anyone ever figure out the cause of the lost connectivity issue?
garciam99's issue was related to disk disconnects.
In our case, we see lost connectivity or redundant-path degradation to storage devices.
The VMware errors correspond to very high NetApp latency (in the thousands of milliseconds), high HBA latency, and sometimes higher-than-usual throughput (MB/s) on all datastores. There are no errors in the NetApp syslog.
We also have LUNs presented to a SQL server; latency is reported on those LUNs as well, but they do not get disconnected.
Datastores are connected via FC and FCoE. We have 10 ESX hosts with 20 datastores, and most datastores are presented to the majority of the ESX hosts.
The connectivity issue happens randomly between any ESX host and any datastore. I cannot pinpoint the root cause of this behavior.
We do have misaligned LUNs due to misaligned Windows 2003 and Linux servers, but we are working towards addressing this, and so far it has not made a difference at all.
The ESX hosts lose connectivity 2-3 times a day.
We have two controllers, and only one of them experiences this behavior.
ESX 4.1, NetApp V3140, Data ONTAP 8.0.1, Fibre Channel and FCoE connectivity; some datastores are as large as 1.8 TB, with 10-25 guests per datastore.
2012-02-28 03:43 AM
I had a call logged with VMware about this early on, when we moved to NetApp. It seems there is some kind of maximum latency figure that, when breached on a path, causes ESX to detect that path as down. I have been trying to find a way to tweak this to a higher value but have not found one. The issue is definitely related to higher-than-normal workloads on the filer: we are guaranteed to see it during our backup window, when pushing about 400 MB/sec over FC, and also when the dedupe scans run (not at the same time).
If anyone knows how to remedy this I would be keen to know; I'm getting tired of ack'ing VMware alerts....
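Until that threshold can be tuned, one workable approach is to capture esxtop in batch mode during the backup window (e.g. `esxtop -b -d 10 -n 360 > latency.csv`) and scan the export for latency spikes that line up with the alerts. A sketch, assuming a perfmon-style CSV with the timestamp in the first column; the `MilliSec/Command` header substring and the 1000 ms threshold are assumptions to adjust to your export:

```python
import csv

def find_latency_spikes(csv_path, header_substr="MilliSec/Command",
                        threshold_ms=1000.0):
    """Scan an esxtop batch-mode CSV for samples where any column whose
    header contains header_substr exceeds threshold_ms. Returns a list
    of (timestamp, column_header, value) tuples."""
    spikes = []
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = [i for i, h in enumerate(header) if header_substr in h]
        for row in reader:
            for i in cols:
                try:
                    v = float(row[i])
                except (ValueError, IndexError):
                    continue  # skip blank or non-numeric cells
                if v >= threshold_ms:
                    spikes.append((row[0], header[i], v))
    return spikes
```

If the spikes land squarely inside the 400 MB/sec backup window, the evidence points at workload scheduling rather than a path or fabric fault.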