I'm testing failover on a new Windows Server 2008 R2 Active/Passive failover cluster. Part of my testing involves simulating a loss of SAN storage connectivity. Each of my cluster nodes has 2x FC SAN connections, and they are using MPIO with NetApp's DSM 3.5. When I pull both fibre connections on the currently active node, it takes ~2 minutes 15 seconds for Windows to notice that the disk I/O path is down and trigger a failover of the resources. In the System event log I can see that the DSM detects the connectivity loss immediately, but Windows (event source "Disk") doesn't register the loss until a full 2 minutes later. It doesn't seem to matter whether there is active disk I/O going on or not.
I've been able to work around the issue by adjusting the PDORemovePeriod registry parameter (HKLM\SYSTEM\CurrentControlSet\Services\ontapdsm\Parameters\PDORemovePeriod). I set the value to 30 seconds, and failover occurred ~33 seconds after disconnecting the storage paths. Per NetApp's documentation, ONTAP DSM sets this value to 130 seconds, which explains the 2-minute failover time I experienced originally. Why is this value set so high, and what are the ramifications of changing it to something more reasonable?
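For reference, this is roughly how I made the change (a sketch using `reg.exe`; the value name and path are taken from the ONTAP DSM parameters key above, and the value is a REG_DWORD in seconds — verify the type against your DSM's documentation before applying):

```shell
:: Inspect the current PDORemovePeriod set by the ONTAP DSM (default 130)
reg query "HKLM\SYSTEM\CurrentControlSet\Services\ontapdsm\Parameters" /v PDORemovePeriod

:: Lower it to 30 seconds (takes effect for new path-loss events;
:: a reboot may be required depending on the DSM version)
reg add "HKLM\SYSTEM\CurrentControlSet\Services\ontapdsm\Parameters" ^
    /v PDORemovePeriod /t REG_DWORD /d 30 /f
```

Note this must be applied on every cluster node so failover behavior stays consistent across the cluster.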
This is the maximum time Windows waits for a disk to recover. In NetApp's case, that means the maximum time allowed for a takeover to complete if one partner fails. If you set it to 30 seconds, you are solely responsible for ensuring that takeover completes within 30 seconds under all conditions. The default value is conservative, but it guarantees that CFO (cluster failover) will complete on any system (and if it doesn't, that's a bug NetApp must investigate and fix).
I would open a case with NetApp support asking them to evaluate your configuration and recommend how far this timeout can safely be decreased. That will keep you supported if problems arise later. But I wouldn't hold my breath ...