It is normal/expected that connectivity to the LUNs is interrupted during a failover/giveback. The duration depends on system load, but anything under one minute should be OK. In my experience it was somewhere between 20 and 45 seconds.
The reason is that disks are "owned" by one of the controllers, and that ownership covers all the volumes and LUNs stored on them. A short period of time is needed for one controller to release the disks, and until the second controller takes over, all data access has to be paused.
You should use NetApp's Virtual Storage Console software to set the recommended values on the HBAs of your ESXi servers. It is also very important to have identical igroup configurations on both controllers: the same name, settings, and members. This can shorten the interruption.
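One hedged way to verify the igroups match, assuming 7-Mode controllers reachable over SSH (the filer hostnames here are placeholders):

```shell
# Sketch: dump verbose igroup definitions from both controllers and diff
# them. Any difference (name, OS type, ALUA setting, member WWPNs) should
# be reconciled before testing a takeover.
ssh filer-a "igroup show -v" > /tmp/igroups-a.txt
ssh filer-b "igroup show -v" > /tmp/igroups-b.txt
diff /tmp/igroups-a.txt /tmp/igroups-b.txt
```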
"With both active-active and active-passive storage arrays, you can set up your host to use different paths to different LUNs so that your adapters are being used evenly. If a path fails, the surviving paths carry all the traffic. Path failover might take a minute or more, because the SAN might converge with a new topology to try to restore service. This delay is necessary to allow the SAN to stabilize its configuration after topology changes."
Also, my understanding of the takeover event is that (essentially) a vfiler is spun up on the other controller to continue serving data, so I'm not sure that ownership actually changes. Is this documented somewhere?
Most important to us, I think, is the state of the guests on the datastores during this period. If they are halted or paused for the duration of the event, that seems unacceptable for a variety of applications.
You are right about the ownership: it does not change in terms of the physical controller (as with disk assign), but the failed controller's identity is moved to the partner hardware (essentially becoming a virtual NetApp controller). If you look at the logs you will see that the disks are released, and the data cannot be accessed until the failed controller's instance is restarted on the surviving controller.
In a scenario where you have a failed path without a controller failover, VMware should recover much faster, and you should not see an all-paths-down event in the logs.
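A rough way to tell the two cases apart on the host, assuming classic ESX 4.x command names (on ESXi 5+ the equivalent is `esxcli storage core path list`):

```shell
# Sketch: list every path per LUN with its state. During a single path
# failure at least one path should remain active; during an FC controller
# takeover you may briefly see all paths to the LUN go dead.
esxcfg-mpath -l

# The vmkernel log records the path-state transitions; the exact wording
# of the all-paths-down messages varies by ESX version.
grep -i "path" /var/log/vmkernel | tail -20
```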
The guests' access to disk is also delayed, but most operating systems should be able to survive such events. There is a best practices document written by NetApp and VMware that has more information. I have many different apps at customer sites, and so far none has failed to survive a controller failover/giveback.
Thanks for the responses. We have set the timeout values appropriately and our problem is not really with guest crashes. It is simply that the entire system pauses for one minute.
I dismissed the datastore drops as a function of the filer failover itself (disk reassignment, etc.) because we have SQL Servers attached to LUNs on these systems, and to my knowledge we never see any drops during takeovers. However, we also don't generally do takeovers during the day, and given that SQL Server may be using the OS timeout values (or its own), I'm willing to concede that the entire system may well hang for a minute.
However, while we don't have a system to test FC-attached ESX, we do have a system to test NFS-attached datastores on a 3000 series. We performed several takeovers, and the worst we saw was a 1-second pause. Sometimes we didn't see anything.
So this leads me to believe the problem is either with the filer and the FC protocol specifically, or with ESX MPIO on FC. With FC, we totally drop the datastores. With NFS, we just see path changes. All we own currently is NetApp so I have no other point of reference.
I am still curious to understand how critical systems are supposed to stay up through maintenance with this. Or, at the very least, not have an outage... a 1-minute pause on a guest OS is an outage, just the same as if the box went down and came back up in 1 minute, right? Consider that one scheduled maintenance (takeover + giveback) would put you nearly halfway to failing five nines...
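The five-nines arithmetic roughly checks out; a quick back-of-the-envelope:

```shell
# Five nines = 99.999% uptime, i.e. 0.001% allowed downtime per year.
awk 'BEGIN {
  budget = 365.25 * 24 * 60 * 0.00001   # ~5.26 minutes of downtime/year
  event  = 2                            # takeover + giveback, ~1 min each
  printf "budget=%.2f min/yr, one maintenance uses %.0f%%\n",
         budget, 100 * event / budget
}'
```

That puts a single 2-minute maintenance at roughly 38% of the annual five-nines budget.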
Maybe it is normal and I'm just overreacting, but this seems unacceptable for critical apps. We want to upgrade our block filers to ONTAP 8, but now we have to contend with two one-minute pauses on sensitive systems.