2012-06-14 11:24 AM
We recently started testing fibre channel datastores in our environment, having always run our datastores on NFS.
During a recent takeover/giveback for maintenance, the ESX hosts appeared to fully lose connectivity to the FC datastores for just under 1 minute during both the takeover and the giveback.
My question is, is this expected? I would assume given MPIO, etc. that we would see a path change but not a full loss.
We are using ALUA, PSP is currently set to round robin. ESX hosts are 4.1.0. We have 2 separate fabrics, one single port HBA per fabric per host.
Is there something we can modify to maintain connectivity?
2012-06-18 06:21 AM
It is normal/expected that the connectivity to the LUNs is interrupted during failover/giveback. The duration depends on system load, but if it is under one minute that should be ok. In my experience this was somewhere between 20-45 seconds.
The reason is that disks are "owned" by one of the controllers, and that includes all the volumes and LUNs that are stored on them. One controller needs a short period of time to release the disks, and all data access has to be paused until the second controller takes over control.
You should use NetApp's Virtual Storage Console software to set recommended values on the HBAs of your ESXi servers. Also, it is very important to have identical configuration of igroups on both controllers: same name, settings and members. This could improve (shorten) the timeout.
2012-06-18 06:38 AM
Thanks for the response.
After doing a bit of research it was our understanding that this was a function of MPIO on ESX.
"With both active-active and active-passive storage arrays, you can set up your host to use different paths to different LUNs so that your adapters are being used evenly. If a path fails, the surviving paths carry all the traffic. Path failover might take a minute or more, because the SAN might converge with a new topology to try to restore service. This delay is necessary to allow the SAN to stabilize its configuration after topology changes."
Also, my understanding of the takeover event is that (essentially) a vfiler is spun up on the other controller to continue serving data, so I'm not sure that ownership actually changes. Is this documented somewhere?
Most important to us, I think, is what is the state of the guests on the datastores during this time period? Because if they are halted or paused for the duration of the event, this seems unacceptable for a variety of applications.
2012-06-18 07:32 AM
You are right about the ownership: it does not change in terms of physical controller (as with disk assign), but the failing controller is moved from one piece of hardware to another (essentially becoming a virtual NetApp controller). If you look at the logs you will see that the disks are being released and the data cannot be accessed until the failing controller is restarted on the surviving controller.
In a scenario where you have a failed path without controller failover, VMware should recover much faster and you should not see an all paths down event in the logs.
The guests' access to disk is also delayed, but most operating systems should be able to survive such events. There is a best practices document written by NetApp and VMware that has more information. I have many different apps at customers' sites and so far have not had any that could not survive a controller failover/giveback.
2012-06-18 10:05 AM
My 2 cents re guest OS timeout values:
A guest OS may bluescreen / hang during failover, unless the disk timeout value is set to 190s:
There is a script available for automating this when a lot of Windows VMs need to be tweaked:
(Unfortunately, changing this requires rebooting the VM.)
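For anyone who wants to roll their own, here is a minimal sketch of how the Windows setting could be generated in bulk. It assumes the usual registry location for the disk timeout (`HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue`, a DWORD in seconds) and only builds the text of a `.reg` file. The function name `build_disk_timeout_reg` is made up for illustration, and this is not the script referred to above.

```python
def build_disk_timeout_reg(timeout_seconds: int = 190) -> str:
    """Return the text of a .reg file that sets the Windows disk timeout.

    Assumption: the timeout lives in the Disk service key as the DWORD
    value "TimeOutValue", expressed in seconds.
    """
    if not 0 < timeout_seconds <= 0xFFFFFFFF:
        raise ValueError("timeout must fit in a positive 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]",
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"TimeOutValue"=dword:{timeout_seconds:08x}',
        "",
    ]
    # .reg files conventionally use Windows line endings.
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_disk_timeout_reg(190))
```

The resulting file would still have to be imported inside each guest, and the reboot requirement applies either way.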
2012-06-20 05:50 AM
Thanks for the responses. We have set the timeout values appropriately and our problem is not really with guest crashes. It is simply that the entire system pauses for one minute.
I dismissed the datastore drops as a function of the filer failover itself (disk reassignment, etc.) because we have SQL servers attached to LUNs on these systems and to my knowledge we never see any drops during takeovers. However, we also don't generally do takeovers during the day and given that SQL may be using the OS timeout values (or using its own timeout values), I'm willing to concede that it is possible that the entire system does hang for a minute.
However, while we don't have a system to test FC attached ESX, we do have a system to test NFS attached datastores on a 3000 series. We performed several takeovers and the worst we saw was a 1 second pause. Sometimes we didn't see anything.
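For what it's worth, a rough sketch of how a pause like this could be measured from inside a guest rather than eyeballed: write and sync a small file at a fixed interval, record when each write completes, and report the largest gap. The function names, the probe file path and the half-second interval are all assumptions for illustration, not how we actually measured it.

```python
import os
import time
from typing import List, Sequence


def probe_writes(path: str, duration_s: float, interval_s: float = 0.5) -> List[float]:
    """Write and fsync a tiny file repeatedly; return completion timestamps."""
    stamps: List[float] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with open(path, "wb") as fh:
            fh.write(b"x")
            fh.flush()
            os.fsync(fh.fileno())  # blocks until the storage acknowledges the write
        stamps.append(time.monotonic())
        time.sleep(interval_s)
    return stamps


def longest_pause(timestamps: Sequence[float]) -> float:
    """Return the largest gap in seconds between consecutive completed writes."""
    if len(timestamps) < 2:
        return 0.0
    ordered = sorted(timestamps)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))
```

Running something like `longest_pause(probe_writes("probe.tmp", 600))` across a takeover would give one number per test. The result includes the probe interval itself, so it slightly overstates the real pause.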
So this leads me to believe the problem is either with the filer and the FC protocol specifically, or with ESX MPIO on FC. With FC, we totally drop the datastores. With NFS, we just see path changes. All we own currently is NetApp so I have no other point of reference.
I am still curious to understand how critical systems are supposed to stay up during maintenance with this. Or, at the very least, not have an outage... a 1 minute pause on a guest OS is an outage, just the same as if the box went down and came back up in 1 minute, right? Consider that one scheduled maintenance (takeover + giveback) would put you nearly halfway to failing five nines...
Maybe it is normal and I'm just overreacting but this seems unacceptable for critical apps. We want to upgrade our block filers to Data ONTAP 8, but now we have to contend with two one-minute pauses on sensitive systems.
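To put a number on that, here is the back-of-the-envelope arithmetic as a small sketch. It assumes a 365.25-day year and treats each pause as a full 60 seconds (ours were just under); the function names are only for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def budget_fraction_used(pause_seconds: float, events: int, availability: float) -> float:
    """Fraction of the yearly downtime budget consumed by `events` pauses."""
    used_minutes = pause_seconds * events / 60.0
    return used_minutes / downtime_budget_minutes(availability)


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.99999)
    used = budget_fraction_used(60, 2, 0.99999)
    print(f"five nines budget: {budget:.2f} min/year, one maintenance uses {used:.0%}")
```

Five nines allows roughly 5.26 minutes of downtime per year, so a takeover plus a giveback at about a minute each uses somewhere around 38% of it.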
2012-06-21 02:10 AM
"With FC, we totally drop the datastores."
"Maybe it is normal and I'm just overreacting but this seems unacceptable for critical apps."
No - it is not normal & you are not overreacting: FC datastores should not be dropped during takeover.
2014-01-31 05:31 AM
Do you think it is normal to see 20-45s delays in FC Datastores with takeovers and givebacks?
I ask because I am experiencing this with FC datastores, but not with NFS datastores.
Moreover, consider an environment with almost 500 VMs. I guess a 190s timeout value for the guest OS would be more suitable. What do you think (consider 4x FAS3270 controllers)?
2014-02-03 08:32 AM
I'm not sure whether NetApp gives a proper definition of "normal" takeover times - however anything under a minute sounds OK(-ish) to me.
So with disk timeout values inside the guests set to 190s, everything should go relatively smoothly if a takeover generates a 60s blackout.