Solved: MSCS Cluster Failover problem with NetApp shared storage.

ian_groves · 2010-03-25

Hi all,

I've been tearing my hair out with this one for the past week. We are in the process of implementing a 3-node Active/Active/Passive Exchange cluster. The problem I'm having is that during a failover simulation (hard-resetting one of the active nodes), the passive node fails to bring the Physical Disk resources online.

I can move all resources between all nodes with no problem. If I shut down an active node gracefully, 9 times out of 10 all resources fail over to the passive node without a problem; I have occasionally seen Physical Disk failures, but they are infrequent. If I shut down a server abruptly to simulate a power outage, the disks always fail to come online and can't be brought online by either of the remaining nodes for approximately 10 minutes. However, if the downed node is brought back up, they can be brought online again by any node. Only Physical Disk resources fail; all other resources (those that do not depend on disk resources to come online, obviously) fail over fine.

I suspect our problems are MPIO related, though my understanding of MPIO is fairly limited. Our setup is as follows (each node is identical):

HP ProLiant DL380 G5

Server 2003 Enterprise R2 SP2

Exchange 2003 Enterprise SP2

Microsoft iSCSI initiator v2.08

SnapDrive v6.02

NetApp Host Utilities v5

SnapManager for Exchange

Each node has 5 NICs:

2x iSCSI LAN interfaces, used only for iSCSI traffic, disabled for cluster use and not teamed, on a dedicated iSCSI GbE switch (192.168.12.0/24)

2x public network interfaces, HP teamed, on 10.10.0.0/16, enabled for client access in the cluster

1x heartbeat network, isolated from iSCSI and public (just on a Broadcom hub, 10 Mb/s half duplex)

Backend storage is a NetApp FAS3140HA dual-head filer using Fibre Channel disks

All LUNs created and managed through SnapDrive (24 LUNs in total)

2 EVSs (Exchange Virtual Servers)

MSDTC in its own cluster group

Single Quorum drive (Q:)

When the problems happen, I get lots of bus reset requests and ClusDisk errors in the event log, as well as SnapDrive event 310 (Failed to Enumerate LUN) in the Application log, with the error description saying that the LUN may still be owned by another node. I really am at a loss for ideas here. The failures are consistent across all nodes, and the time it takes before the disks can be brought online again is also consistent, which makes me think I'm waiting for some kind of timeout. It seems that when a node fails, the node trying to take over the resources can't break the SCSI reservations on the disks. As I say, this happens only in a sudden-shutdown scenario; at other times the disks can move around all nodes without issue.

Would really appreciate any insight on this.

Thanks

Ian

davemaynard · 2010-03-26

Ian

Had a similar problem here.

Configuring a persistent reservation key solved our problem.

Check out the documentation in the Microsoft iSCSI Initiator User Guide.

Regards,

Dave

ian_groves · 2010-03-26

Dave, you are an absolute legend, drinks are on me.

Configured a persistent reservation key for each node, rebooted them all, and now failover is working fine.

Thank you so much!

Ian
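
For anyone landing here later: the fix above gives each node its own persistent reservation key. Below is a minimal sketch of how a set of per-node keys could be generated. It assumes an 8-byte key made of a 6-byte cluster-wide prefix plus a 2-byte node-specific suffix, which is my reading of the layout in the Microsoft iSCSI Initiator User Guide; check the guide for the exact format and for where the key is configured on your initiator version. The function name, the prefix value, and the node names are all made up for illustration.

```python
def make_pr_keys(cluster_prefix_hex, node_names):
    """Return a dict of node name -> 16-hex-digit (8-byte) key.

    ASSUMPTION: key = 6-byte cluster prefix + 2-byte node index.
    Verify the real format against the iSCSI Initiator User Guide.
    """
    prefix = bytes.fromhex(cluster_prefix_hex)
    if len(prefix) != 6:
        raise ValueError("cluster prefix must be exactly 6 bytes (12 hex digits)")
    if len(node_names) > 0xFFFF:
        raise ValueError("too many nodes for a 2-byte suffix")

    keys = {}
    for index, name in enumerate(node_names, start=1):
        # Node suffix is the 1-based node index as a big-endian 2-byte value.
        key = prefix + index.to_bytes(2, "big")
        keys[name] = key.hex()
    return keys


if __name__ == "__main__":
    # Hypothetical prefix and node names for a 3-node cluster.
    for node, key in make_pr_keys("aabbccccbbaa", ["NODE1", "NODE2", "NODE3"]).items():
        print(node, key)
```

The point of the scheme is simply that every node ends up with a key that is unique within the cluster but shares a recognisable common prefix. As in the post above, the nodes were rebooted after the keys were configured.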