2010-03-25 03:38 PM
I've been tearing my hair out with this one for the past week. We are in the process of implementing a 3-node Active/Active/Passive Exchange cluster, and the problem I'm having is that during a failover simulation (hard-resetting one of the active nodes), physical disk resources fail to be brought online by the passive node.
I can move all resources between all nodes without a problem. If I shut down an active node gracefully, nine times out of ten all resources fail over to the passive node fine; I have occasionally seen physical disk failures, but these are infrequent. If I shut down a server abruptly to simulate a power outage, the disks always fail to come online and cannot be brought online by either of the remaining active nodes for approximately 10 minutes. However, if the downed node is brought back up, they can be brought online again by any node. It is only Physical Disk resources that fail; all other resources (that do not depend on disk resources to come online, obviously) fail over fine.
I suspect our problems are MPIO related, though my understanding of MPIO is fairly limited. Our setup is as follows (each node is identical):
HP Proliant DL380 G5
Server 2003 Enterprise R2 SP2
Exchange 2003 Enterprise SP2
Microsoft iSCSI initiator v2.08
NetApp Host Utilities v5
SnapManager for Exchange
Each node has five NICs:
2x iSCSI NICs for iSCSI traffic only, disabled for cluster use and not teamed, on a dedicated iSCSI GbE switch (192.168.12.0/24)
2x public network interfaces, HP-teamed on 10.10.0.0/16, enabled for client access in the cluster
1x heartbeat network, isolated from iSCSI and public (just on a Broadcom hub, 10 Mb half duplex)
Backend storage is a NetApp FAS3140HA dual-head filer using Fibre Channel disks
All LUNs created and managed through SnapDrive (24 LUNs in total)
MSDTC in its own Cluster Group
Single quorum drive (Q:)
When the problems happen, I get lots of bus reset requests and ClusDisk errors in the event log, as well as SnapDrive event 310 ("Failed to Enumerate LUN") in the application log, with an error description saying the LUN may still be owned by another node.
I really am at a loss for ideas here. The failures are consistent across all nodes, and the time it takes before the disks can be brought online again is also consistent, which makes me think I'm waiting on some kind of time-out: it seems that when a node fails, the node trying to take over the resources can't break the SCSI reservations on the disks. As I say, this only happens in a sudden-shutdown scenario; otherwise the disks can move between all nodes without issue.
Would really appreciate any insight on this
2010-03-26 01:46 AM
Had a similar problem here.
Configuring a persistent reservation key solved our problem.
Check out the documentation in the MS iSCSI Initiator User Guide.
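If it helps, this is roughly what we set on each node. The registry path and value names below are as I remember them from the iSCSI DSM section of the User Guide, so please verify them against the guide for your initiator version before applying anything; the key value shown is just an example, and each node must get its own unique 8-byte key:

```reg
Windows Registry Editor Version 5.00

; Microsoft iSCSI DSM persistent reservation settings
; (path/value names per the MS iSCSI Initiator User Guide for v2.x --
;  double-check against the guide for your initiator version)
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSiSCDSM\PersistentReservation]
; switch the DSM to SCSI-3 persistent reservations
"UsePersistentReservation"=dword:00000001
; 8-byte reservation key -- MUST be unique on every cluster node
; (example value only; generate your own per node)
"PersistentReservationKey"=hex:01,02,03,04,05,06,07,08
```

We rebooted each node after setting it so the DSM picked the key up.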
2010-03-26 04:42 AM
Dave, you are an absolute legend; drinks are on me.
I configured a persistent reservation key for each node, rebooted them all, and now failover is working fine.
Thank you so much!