VMware Solutions Discussions

ESX hosts experiencing intermittent disconnects from FC LUNs

steven_jenner

Hi,

I have an issue where, within vSphere, ESX hosts are reporting that their attached FC LUNs, which contain VMware VMFS file systems and virtual machine VMDKs, are disconnecting and then reconnecting soon after. This is reported in the ESX logs as below:

Aug 1 11:24:01 vobd: Aug 01 11:24:01.053: 502404320475us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba1:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01"..
Aug 1 11:24:01 Hostd: [2011-08-01 11:24:01.052 FFBB7B90 info 'ha-eventmgr'] Event 1191 : Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba0:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01".
Aug 1 11:24:01 vobd: Aug 01 11:24:01.053: 502404321292us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba0:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01"..
Aug 1 11:24:01 vobd: Aug 01 11:24:01.054: 502404322123us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba1:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01".
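For anyone wanting to tally these events across a larger log, here is a minimal Python sketch that groups the "Lost connectivity" messages by device and path. It assumes the message text matches the excerpt above; the function name `summarize_path_loss` is hypothetical and not part of any VMware tool.

```python
import re
from collections import defaultdict

# Matches the "Lost connectivity" message text as it appears in the excerpt.
# Assumption: other ESX builds may word this message differently.
LOST_RE = re.compile(
    r"Lost connectivity to storage device (?P<device>naa\.[0-9a-f]+)\. "
    r"Path (?P<hba>vmhba\d+):C(?P<channel>\d+):T(?P<target>\d+):L(?P<lun>\d+) is down"
)


def summarize_path_loss(lines):
    """Return {device: set of (hba, target, lun)} for every path reported down."""
    summary = defaultdict(set)
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            path = (match.group("hba"), int(match.group("target")), int(match.group("lun")))
            summary[match.group("device")].add(path)
    return dict(summary)


# The four log lines from the post, copied verbatim.
SAMPLE = """\
Aug 1 11:24:01 vobd: Aug 01 11:24:01.053: 502404320475us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba1:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01"..
Aug 1 11:24:01 Hostd: [2011-08-01 11:24:01.052 FFBB7B90 info 'ha-eventmgr'] Event 1191 : Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba0:C0:T3:L1 is down. Affected datastores: "FC_DW_vol01".
Aug 1 11:24:01 vobd: Aug 01 11:24:01.053: 502404321292us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba0:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01"..
Aug 1 11:24:01 vobd: Aug 01 11:24:01.054: 502404322123us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60a98000486e58776b5a56574f617069. Path vmhba1:C0:T2:L1 is down. Affected datastores: "FC_DW_vol01".
""".splitlines()

result = summarize_path_loss(SAMPLE)
for device, paths in result.items():
    hbas = sorted({p[0] for p in paths})
    targets = sorted({p[1] for p in paths})
    print(device, "HBAs:", hbas, "targets:", targets, "distinct paths down:", len(paths))
```

Run against the excerpt, this shows that the paths reported down cover both HBAs (vmhba0 and vmhba1) and both targets (T2 and T3) for the same device, all within the same second.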

VMware support has been engaged and is pointing the finger at either the storage or the fabric.

It is interesting to note that, although we are seeing these errors on the ESX hosts, nothing is being reported on the storage and no VMs have gone down, which I would expect if their storage had disappeared. The ESX hosts predominantly run 4.1, though some are still running 4.0, and the issue has been reported on all hosts.

Thinking the likely cause was the fabric, I have opened a case with IBM (the switches are IBM SAN32b and SAN16b devices). They have identified the NetApp controllers as a 'slow draining device' and suggested that a reboot of the controllers might resolve the issue.

Has anyone had a similar issue and, if so, what was the root cause?

Thanks in advance,

Steve.
