What are Some Real World Cluster Failover Times?

erick_moore · ‎2009-06-09

We are getting ready to upgrade the ONTAP version on a FAS2050 and a FAS3040. I am looking for any information anyone has on what kind of I/O timeout we will see during takeover/giveback operations. To be honest, one of the reasons we have waited to upgrade is the fear that our Windows systems, VMs, and applications would not be able to handle an I/O timeout upwards of 10-15 seconds. We are 100% Fibre Channel in our environment. Can anyone shed some light on this? How have your systems performed during takeover/giveback in FCP environments?

Thanks,

E-

amiller_1 · ‎2009-06-09

Somewhere in the 10-15 second range is what I've seen on 3040s, and probably 15 seconds (or a bit more) on 2050s.

However, if you're in a fully FC environment, you should be completely fine as long as you have the NetApp Host Utilities installed on all the machines in question (they take care of the timeout/retry values).

What machines/OSes do you have in your environment?

erick_moore · ‎2009-06-09

Most of our machines are Windows VMs, so there aren't any host utilities installed on them, but I did write a PowerShell script that sets the SCSI timeout value in the registry on those systems to 180 seconds. The script is posted somewhere on this site if anyone is interested. We have an outage window scheduled, so we are actually shutting down our direct-attached systems (even though they do have the host utilities installed). All of our ESX hosts have the NetApp utilities installed. We haven't really seen how ESX reacts if it can't get to storage for 15+ seconds, so needless to say we are a bit worried. Do you happen to know the failover time on a 3170? I have heard it is much better, and we are thinking of upgrading.
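For readers who want to see what such a registry change looks like, here is a minimal sketch. It is not the PowerShell script mentioned above (that script is not shown in this thread); it is a Python stand-in that only builds the text of a `.reg` file. It assumes the setting in question is the standard Windows disk timeout, `TimeOutValue` (a REG_DWORD, in seconds) under the `Disk` service key. The function name `disk_timeout_reg` is made up for illustration.

```python
def disk_timeout_reg(seconds: int) -> str:
    """Build .reg file text that sets the Windows disk I/O timeout.

    Assumption: the timeout lives in the REG_DWORD value TimeOutValue
    under the Disk service key, expressed in seconds.
    """
    if not 1 <= seconds <= 0xFFFFFFFF:
        raise ValueError("timeout must be positive and fit in a REG_DWORD")
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        # .reg files store DWORDs as 8 hex digits
        f'"TimeOutValue"=dword:{seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # 180 seconds, the value mentioned in the post above
    print(disk_timeout_reg(180))
```

The output could be saved as a `.reg` file and imported on a test guest first; a reboot is generally needed before the disk driver picks up the new value.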

erick_moore · ‎2009-06-11

I forgot to mention that one of the things we will be doing is shelf and disk firmware upgrades on the 2050. That system only has SATA disks, and I was told that until we get to 7.3.1 this process IS disruptive to I/O. NetApp gave us a worst-case scenario of 60 seconds, but I was told that the actual time would be lower. Again, does anyone have any real-world times for this type of upgrade? Can I expect Windows or ESX to disconnect my LUNs?

pascalduk · ‎2009-06-15

In the past, when upgrading AT-FCX modules on ONTAP 7.0, the disruption was 70 seconds, but I have been told that the time is shorter on 7.2.

Cluster failover/giveback on our FAS6080 and FAS3050 clusters running ONTAP 7.2 with a moderate NFS/CIFS/iSCSI load takes about 30 seconds. The lower the load and the fewer FlexVols you have, the faster the failover/giveback. In a test environment without any load, the failover of a FAS6080 cluster took 15 seconds.

You should be ok when the timeouts have been set correctly on your hosts (virtual and physical).
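As a rough illustration of what "set correctly" means here, the sketch below compares each host's configured disk timeout against the longest disruption you expect, with some headroom. The host names, the timeout values, and the 2x safety margin are all made-up assumptions; only the 70-second figure comes from this post.

```python
from typing import Dict, List


def hosts_at_risk(timeouts: Dict[str, float],
                  worst_case_outage: float,
                  margin: float = 2.0) -> List[str]:
    """Return hosts whose timeout is below worst_case_outage * margin.

    margin is an arbitrary safety factor chosen for this example.
    """
    needed = worst_case_outage * margin
    return sorted(host for host, t in timeouts.items() if t < needed)


if __name__ == "__main__":
    # Hypothetical inventory: host name -> disk timeout in seconds
    inventory = {"esx01": 120, "win-vm01": 180, "win-vm02": 60}
    # 70 s is the AT-FCX disruption reported above for ONTAP 7.0
    print(hosts_at_risk(inventory, worst_case_outage=70))
```

With these example numbers the required timeout is 140 seconds, so `esx01` and `win-vm02` would be flagged for review.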

erick_moore · ‎2009-06-19

So, just to give everyone an update: the upgrades worked as planned. The 2050, which has nothing but SATA shelves, actually performed the shelf firmware upgrade without interrupting any of our virtual machines. I will chalk that up to the script I wrote to set the Windows guest timeouts to 190 seconds, and to the NetApp host utilities installed and configured on each of our ESX hosts.

Additionally, I want to make a point of clarification on the failover times. I came from an EMC background, and when I heard I would be looking at a 15-60 second failover time I was terrified. The truth of the matter is that while the entire NetApp may take 15-60 seconds to fail over everything, your applications are not down for that entire period. We saw no interruption (just doing pings) until about 15 seconds into the failover; at that point we saw 1 dropped ping, and then all the HBAs switched to the other controller. It was about another 10 seconds after that before the NetApp reported that the takeover was complete. So while failing over every running service on the NetApp may take a while, the individual protocols and services appear to fail over much, much quicker than we had been told.
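If you want to measure this yourself rather than eyeball a ping window, a small sketch like the one below can turn a list of timestamped probe results into outage windows. It is only an illustration: the sample data is invented to mirror the "1 dropped ping at about 15 seconds" observation above, and how you collect the probes (ping, a small read against a LUN, etc.) is up to you.

```python
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of failed probes.

    start is the timestamp of the first failed probe in a run;
    end is the timestamp of the next successful probe, or of the
    last sample if the run never recovers.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ts = None
    for ts, ok in samples:
        last_ts = ts
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends
            start = None
    if start is not None:
        windows.append((start, last_ts))  # still down at end of log
    return windows


if __name__ == "__main__":
    # One probe per second for 26 seconds, with a single miss at t=15
    probes = [(float(t), t != 15) for t in range(26)]
    for begin, end in outage_windows(probes):
        print(f"outage from t={begin:.0f}s to t={end:.0f}s ({end - begin:.0f}s)")
```

Note that the resolution is limited by the probe interval: with one probe per second, a single miss only tells you the outage lasted somewhere under about two seconds.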

Thanks,

E-

mgrau · ‎2009-06-22

With ONTAP 7.3, failovers and givebacks will be even faster.

If you set the correct HBA timeout values on the ESX side and on the VMs (see ESX Host Utilities), there is absolutely no problem with ESX.

Cheers,

Markus

cannavaro · ‎2011-08-04

I would reckon 15 seconds is the ceiling for any failover I have seen on the filers I have had exposure to over the past 5-6 years.

Out of interest, has anyone experienced any issues with FCP-connected VMware failovers? I guess as long as the failover occurs within the timeout period there "shouldn't" be...

cannavaro · ‎2011-10-21

Just to update: we never experienced any timeout issues on the ESX 4.0 hosts while failing the filer heads over.