
FAS270c direct attach to ESX 3.5 question.

sigma77in

Hello All

I have a FAS270c directly attached to two ESX 3.5 hosts via FCP. I have presented one LUN to each host. Both hosts see their assigned LUN, but the second host can't write to its LUN. I get the following error on the second host when doing any disk operation: "SCSI disk error : host 2 channel 0 id 0 lun 0 return code = be0000"

The cfmode setting is single_image. The language setting on the volume is POSIX.

Has anyone come across this? How can I fix it?

Thanks

Raghav Sood

14 REPLIES

madden

From the info you've provided it doesn't sound familiar.

Do you have more logs/errors from the ESX or Data ONTAP sides?

Maybe you can post the output of:

netapp>lun show -m

netapp>fcp show initiator

netapp>igroup show

sigma77in

The outputs are

NetAppSRM> lun show -m

LUN path          Mapped to       LUN ID   Protocol
/vol/vol1/srm     10.11.32.44          0   FCP

NetAppSRM> fcp show initiator

Initiators connected on adapter 0c:

Portname Group

21:00:00:e0:8b:18:24:e0 10.11.32.44

NetAppSRM> igroup show

10.11.32.44 (FCP) (ostype: vmware):

21:00:00:e0:8b:18:24:e0 (logged in on: 0c)

21:00:00:e0:8b:18:ad:df (not logged in)


> The second HBA is not logged in because it is a direct attach setup.

Also, from the VMware side this is what I get:

# vmkpcidivy -q vmhba_devs

vmhba1:0:0 /dev/sda


> This is the LUN from the NetApp

vmhba2:0:0 /dev/sdb


> This is the internal disk

When I try to create a new partition on this disk it fails, and the logs say this:

Jun 24 08:38:54 srmsrv2 kernel: SCSI disk error : host 2 channel 0 id 0 lun 0 return code = be0000

Jun 24 08:38:54 srmsrv2 kernel: I/O error: dev 08:00, sector 0

# fdisk -l /dev/sda

Disk /dev/sda: 84.8 GB, 84828749824 bytes

255 heads, 63 sectors/track, 10313 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sda doesn't contain a valid partition table

Thanks

Raghav Sood

madden

Hi,

It doesn't look like your 2nd host/HBA (21:00:00:e0:8b:18:ad:df) is online. Even though it is physically connected to the other controller, in a direct attach setup it should still show up here as logged in over "vtic", the virtual interconnect.

Here is an example (mine is a single initiator that is fabric connected):

netapp-top*> igroup show

viaRPC.10:00:00:00:c9:2f:6c:84.WIN2003 (FCP) (ostype: windows):

10:00:00:00:c9:2f:6c:84 (logged in on: vtic, 0c)

The things you could check are (example commands follow the list):

1) Cluster is enabled ("cf status" shows enabled)

2) Port on the other controller is configured as a target port and online (an "fcp config" from the other controller would show this).

3) An igroup show from the other controller shows the login
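A sketch of that check sequence, using your controller names (exact output varies by Data ONTAP version):

NetAppSRM> cf status
Cluster enabled, NetAppSRM2 is up.

NetAppSRM2> fcp config
0c: ONLINE <ADAPTER UP> ...

NetAppSRM2> igroup show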

I'm not sure why you even see a LUN from your 2nd host, because if that initiator isn't logged in there is no LUN to see. It could be a stale reference in ESX; if you reboot that ESX host you might find that the LUN isn't there anymore.
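A full reboot may not even be necessary: on the ESX 3.x service console, rescanning the HBA should also refresh stale LUN references (vmhba1 taken from your earlier output):

# esxcfg-rescan vmhba1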

sigma77in

Hi

In my setup only one HBA from each host is connected to each filer head. The second HBA (21:00:00:e0:8b:18:ad:df) is not cabled at all. The reason it didn't show up as logged in on vtic is that I hadn't created an igroup for the host on that controller. I have fixed that, but the end result is the same: I can't write to the disk from the second host. The outputs are:

NetAppSRM2> igroup show

10.11.32.44 (FCP) (ostype: vmware):

21:00:00:e0:8b:18:24:e0 (logged in on: vtic)

21:00:00:e0:8b:18:ad:df (not logged in)

10.11.32.195 (FCP) (ostype: vmware):

21:00:00:e0:8b:18:f3:df (not logged in)

21:00:00:e0:8b:18:17:df (logged in on: 0c)

NetAppSRM> igroup show

10.11.32.195 (FCP) (ostype: vmware):

21:00:00:e0:8b:18:17:df (logged in on: vtic)

21:00:00:e0:8b:18:f3:df (not logged in)

10.11.32.44 (FCP) (ostype: vmware):

21:00:00:e0:8b:18:24:e0 (logged in on: 0c)

21:00:00:e0:8b:18:ad:df (not logged in)

"cf status" shows that the cluster is enabled.

NetAppSRM2> fcp config

0c: ONLINE <ADAPTER UP> Loop No Fabric

host address 0000da

portname 50:0a:09:81:95:84:49:dd nodename 50:0a:09:80:85:84:49:dd

mediatype auto speed auto

NetAppSRM> fcp config

0c: ONLINE <ADAPTER UP> PTP No Fabric


> Here the FC connection is PTP instead of Loop; this might be the reason.

host address 0000ef

portname 50:0a:09:81:85:84:49:dd nodename 50:0a:09:80:85:84:49:dd

mediatype auto speed auto

I will try to change the ptp to loop and let you know if it fixes it.

Thanks

Raghav Sood

madden

Your LUN is also mapped incorrectly.

NetAppSRM> lun show -m

LUN path          Mapped to       LUN ID   Protocol
/vol/vol1/srm     10.11.32.44          0   FCP

And igroup "10.11.32.44" contains these initiators:

NetAppSRM> igroup show

10.11.32.44 (FCP) (ostype: vmware):

21:00:00:e0:8b:18:24:e0 (logged in on: 0c)

21:00:00:e0:8b:18:ad:df (not logged in)

But from your new output I think your initiators are:

21:00:00:e0:8b:18:17:df

21:00:00:e0:8b:18:24:e0

Both of these need to be in the same igroup, or you need to map your LUN to both igroups.
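A sketch of the two options, using the igroup names and LUN path from your output (pick one; adjust the LUN ID if needed):

Option A, put both initiators in one igroup:

NetAppSRM> igroup add 10.11.32.44 21:00:00:e0:8b:18:17:df

Option B, map the LUN to the second igroup as well:

NetAppSRM> lun map /vol/vol1/srm 10.11.32.195 0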

sigma77in

Hi

I have changed PTP to Loop in the HBA BIOS. It helps: the system can now see the disk and write to it, and I can partition the disk, but I still can't create a filesystem. So some things are working, but not everything.

Then I changed the LUN mappings, and now both igroups have access to the LUN. But the result is the same: I can't create a filesystem on it. On the ESX console the error message is:

"Cannot find path to device in a good state"

Thanks

Raghav

madden

Hi,

If your LUN is mapped to both initiators' WWPNs, and the NetApp shows the logins ("igroup show" should show one initiator on the 0c port and the other on vtic), then from a NetApp perspective things should be okay. The fact that you can create a partition but not a filesystem points to an issue on the host or ESX side.
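From the ESX service console you can also check path state per LUN (ESX 3.x syntax):

# esxcfg-mpath -l

A path shown as dead rather than on would line up with your "Cannot find path to device in a good state" message.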

Also, I don't know your entire solution design, but if you intend to use both ESX servers in production with the LUNs hosted off one controller, you probably won't be happy with the performance of a direct connect solution.

In a NetApp SAN both controllers are connected via a cluster interconnect. On each controller you create LUNs owned by that controller, but to enable multipathing and support controller failover each LUN is actually presented out of both controllers. Paths to the controller that owns the LUN are "primary" and run at full performance; paths through the opposite controller are "partner" and slower. Normally on the host you set up path preferences so that "primary" paths are used before any "partner" paths; in the case of a controller failover, however, the "partner" path becomes a "primary" path.

If you have an FC switch it's possible to configure all hosts to use "primary" paths, but because you have no switch you can't share the same LUN as you've intended (with full performance).

So to show the IO path when accessing via the opposite controller:

host -> partner controller -> cluster interconnect -> controller -> disk IO to get data -> controller -> cluster interconnect -> partner controller -> host

Whereas straight IO would go:

host -> controller -> disk IO to get data -> controller -> host
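If I remember the ESX 3.x syntax correctly, the per-LUN path preference is set with esxcfg-mpath; treat these flags as an assumption to verify against your build:

# esxcfg-mpath --policy=fixed --lun=vmhba1:0:0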

Also, the FAS270 is an older model that was the entry-level SAN solution at introduction, so its CPU horsepower (which also has to drive the cluster interconnect) is not that strong.

Have you thought about using NFS datastores?

sigma77in

Thanks for the detailed answer. This is a test setup, and if testing goes well the implementation will be done on higher-end NetApp systems, but I have to use FC for testing. I used the FAS270c because it supports FC and it was available for testing :-).

Also, I am not sharing a single LUN between the two directly connected ESX hosts. I have created two LUNs, one on each controller: two aggregates, one per controller, then one volume per aggregate, then one LUN per volume, with each LUN presented to its respective directly attached ESX host so that each ESX host sees only its own LUN. My whole intent was to bypass the vtic interconnect, since each controller presents its own single LUN to one host.

First_ESX_Host


>Controller A


> LUN X


> Vol X-----> Aggregate X

Second_ESX_Host


> Controller B


> LUN Y


> Vol Y-----> Aggregate Y

So I was able to present the LUNs and scan the respective LUN on each host. I was also able to create a filesystem on the first host, and I am using that LUN for I/O. The problem is on the second host. Yesterday I thought it was a problem with the HBA settings on the second host; I changed the setting from point-to-point to loop and thought it would fix the issue, but it didn't.

ESX has built-in multipathing. Also, I have specified fixed paths to the LUNs on the ESX hosts. It works on one host but not on the other.

My question now is why the second host would go through the cluster interconnect to access its LUN at all. I think it should not, but from what I am seeing it looks like it is :-(.

Thanks

Raghav Sood

madden

I had thought you were creating an ESX datacenter and wanted to make the same LUN available on both hosts. If you have two LUNs, one for each host, then you should map each LUN to only one igroup, and have that igroup include just the WWPN of its host.

So with the picture:

First_ESX_Host_WWPN_1


>Controller A


> LUN X


> Vol X-----> Aggregate X

Second_ESX_Host_WWPN_1


> Controller B


> LUN Y


> Vol Y-----> Aggregate Y

You do these steps for Controller A/First_ESX_Host (and repeat with Controller B/Second_ESX_Host); an example with concrete commands follows the list:

  1. On Controller A create an igroup that includes only First_ESX_Host_WWPN_1. ("igroup create xxx")

  2. On Controller A create a LUN. ("lun create xxx")

  3. On Controller A map the LUN. ("lun map LUN igroup")

So after these steps you have a 1:1:1 relationship between Host:LUN:Controller.
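A minimal sketch of those steps on Controller A; the igroup name "esx1", the LUN path, and the size are illustrative, so substitute your own WWPN, paths, and sizing:

NetAppSRM> igroup create -f -t vmware esx1 21:00:00:e0:8b:18:24:e0
NetAppSRM> lun create -s 80g -t vmware /vol/vol1/lun0
NetAppSRM> lun map /vol/vol1/lun0 esx1 0

Then repeat on Controller B with Second_ESX_Host_WWPN_1.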

If you're still having problems, I think you need to look to VMware for advice, because at that point it's not related to NetApp...

Good luck!

sigma77in

I think it might be an HBA issue... I will change the HBAs today and see if that helps. Thanks for your help.

triantos

I believe you'll be wasting your time with the HBA; it's most likely not the issue. You will need to use the /sbin/partedUtil command for the partitioning.

partedUtil set <device> "partNum startSector endSector type attr"

If you have a partition alignment of 128, you calculate endSector = <Value of stored blocks> + 127:

endSector = 4294961557 + 127 = 4294961684.

Type = 251 is the decimal representation of the 'fb' (VMFS) partition ID.

So the command to set the partition table right would be:

# partedUtil set <device> "1 128 4294961684 251 0"
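To sanity-check the result, you can dump the partition table back afterwards (assuming the "get" subcommand is available in your ESX build):

# partedUtil get <device>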

sigma77in

Thanks, but I have already tried that. I started with fdisk and then, after searching the web, came across partedUtil. But it will only create a partition, not the filesystem. My problem starts when I try to create a filesystem, and then the path to the disk is mysteriously lost.

But the other host, connected to the other controller, has no problem at all.

triantos

Configure iSCSI, map the LUN to the ESX host, and retry. If you have the same issue there, then you know this is not FC HBA or protocol specific.
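A rough sketch of that test; the igroup name and IQN below are placeholders (take the real IQN from the host's iSCSI properties), and the LUN path follows your Controller B layout:

NetAppSRM2> iscsi start
NetAppSRM2> igroup create -i -t vmware esx2_iscsi iqn.1998-01.com.vmware:srmsrv2
NetAppSRM2> lun map /vol/volY/lunY esx2_iscsi 1

Then on the ESX host, enable the software initiator and rescan (the software iSCSI adapter name varies by build):

# esxcfg-swiscsi -e
# esxcfg-rescan vmhba40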

sigma77in

This is fixed now. It was a bad FC cable that was causing it. I never thought this could have been an issue, but it turns out swapping the FC cable for a new one fixed the 'problem'.

Thank you, madden & triantos, for replying to my question.

-Raghav
