2012-03-12 06:21 AM
I have a NetApp FAS3240 MetroCluster in production and I'm looking to buy another NetApp storage system to expand our total capacity. I'm looking for more information about exactly what happens, at a technical level, during a takeover/failover in the MetroCluster, so I can plan further for our OpenVMS setup, failover and the new storage. Can anyone help me find some documentation on how HA/failover works? The more information the better. One of the things I want to know is how one controller takes over from the other. Multipathing, vifs? Is it just a multipath failover, or do the controllers use virtual interfaces so they can take over the partner controller's MAC address? Can a controller serve data directly from the failed partner controller's disks (it has a path to those disks), or will it always use the local partner mirror?
I've tried searching for information and I've tried NetApp University, but haven't been able to find anything that meets my needs. I hope you can help, and thank you very much.
Have a nice day.
2012-03-12 08:00 AM
The usual NetApp sales pitch says there is no downtime with an Active/Active cluster, but I beg to differ: it's a good solution, but not best of breed. Let's see why.
An Active/Active configuration involves two controllers connected to the same disk shelves. Both keep talking to each other through their NVRAM module connections, so whenever one system goes down the other takes on the identity of its partner.
That sounds good, doesn't it? It does, except for a few glitches. When a system goes down unexpectedly or experiences a failure and the partner starts the takeover process, it can take more than 90 seconds. That may be fine in a NAS environment, but for FC and iSCSI it's more than enough time for the host to declare the LUN dead and fail your application.
Now, 90 seconds is the time the surviving node takes to assume the identity of its partner. If you don't have an RLM card (which also gives the cluster hardware-level assistance), it takes an additional 30 seconds for the surviving node to declare its partner dead and start the takeover process, which brings the total to a whopping 120 seconds.
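To make the arithmetic concrete, here is a minimal Python sketch of that takeover budget. The 90 and 30 second figures are the ones quoted in this post; the function names and the host timeout values in the demo are hypothetical, chosen only for illustration and not taken from any OS or vendor default.

```python
# Rough takeover budget using only the figures quoted in this post.
TAKEOVER_SECONDS = 90        # surviving node assumes the partner's identity
DETECTION_WITHOUT_RLM = 30   # extra delay to declare the partner dead without an RLM card


def worst_case_pause(has_rlm: bool) -> int:
    """Return the worst-case I/O pause, in seconds, for an unplanned takeover."""
    detection = 0 if has_rlm else DETECTION_WITHOUT_RLM
    return TAKEOVER_SECONDS + detection


def survives(host_timeout_seconds: int, has_rlm: bool) -> bool:
    """True if a host with this disk timeout would outlast the pause."""
    return host_timeout_seconds > worst_case_pause(has_rlm)


if __name__ == "__main__":
    # Hypothetical host timeouts, for illustration only.
    for timeout in (30, 60, 190):
        for has_rlm in (True, False):
            verdict = "survives" if survives(timeout, has_rlm) else "fails"
            print(f"timeout={timeout}s rlm={has_rlm}: "
                  f"pause={worst_case_pause(has_rlm)}s -> {verdict}")
```

The point of the comparison is simply that a host survives only if its own disk timeout is longer than the pause the storage imposes.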
Now let's look at another scenario: an NDU software update.
As I understand it, NDU means Non-Disruptive Update: when doing a software update I can fail over and fail back the partner nodes, simulating a reboot to put the new code into effect, without any downtime. But as per NetApp KB-22909, failover and failback can take as much as 180 seconds. So how can 180 seconds of downtime on each controller be called non-disruptive?
That was the worst-case scenario. What I have seen on my systems so far is that they take less than a minute to do a failover and failback: 35 seconds on my V6080 systems and 22 seconds on a V3170 filer, rather than 90 seconds (observed with ping). Both are loaded with multiple vifs, CIFS shares, NFS exports, qtree quotas and SnapMirror. Your mileage may vary, as it depends on system configuration, but that's not bad for a NAS-only environment.
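For anyone who wants to measure the pause the same way, the following is a rough Python sketch of how the longest outage could be derived from a series of ping results. It is not the method used for the numbers above; the function name and the sample format (timestamp in seconds, reply received or not) are assumptions made for this example.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage in seconds.

    samples are (timestamp_seconds, reply_received) pairs in time order.
    An outage runs from the first lost probe to the next successful reply.
    An outage still in progress at the end of the samples is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp          # first lost probe of this outage
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # service is back
    return longest


if __name__ == "__main__":
    # Simulated one-second probes with replies missing from t=10 up to t=31.
    simulated = [(float(t), not (10 <= t < 32)) for t in range(40)]
    print(f"longest outage: {longest_outage(simulated)} s")
```

With one probe per second the result is only accurate to about a second, which is worth keeping in mind when comparing figures like 22 and 35 seconds.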
To prove this, a few weeks ago we ran some tests on our V3170 systems to check the VMs of a new project, as the team wanted to see what happens when a system goes offline due to a hardware failure or any other reason. All 300-odd VMs kept running fine without any glitches. During the test we ran a script on all 300-odd Linux VMs that used dd to write and delete a 100MB file every 2 seconds on /root and /tmp; a few copies were modified to write a 500MB file and a 1MB file.
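The original script is not shown here, so the following is only a minimal sketch of that kind of write-and-delete loop, written in Python instead of dd. The file name, function names and parameters are all assumptions for illustration.

```python
import os
import tempfile
import time


def write_delete_cycle(directory: str, size_bytes: int) -> float:
    """Write size_bytes of zeros to a scratch file, fsync it, delete it.

    Returns the elapsed time in seconds, so a stall on the datastore
    shows up as one unusually long cycle.
    """
    path = os.path.join(directory, "failover_probe.tmp")  # hypothetical file name
    chunk = b"\0" * min(size_bytes, 1024 * 1024)
    start = time.monotonic()
    with open(path, "wb") as handle:
        remaining = size_bytes
        while remaining > 0:
            piece = chunk[:remaining]
            handle.write(piece)
            remaining -= len(piece)
        handle.flush()
        os.fsync(handle.fileno())  # force the data to disk, not just the page cache
    os.remove(path)
    return time.monotonic() - start


def run_probe(directory: str, size_bytes: int,
              interval_seconds: float, cycles: int) -> list:
    """Run several write/delete cycles and return each cycle's duration."""
    durations = []
    for _ in range(cycles):
        durations.append(write_delete_cycle(directory, size_bytes))
        time.sleep(interval_seconds)
    return durations


if __name__ == "__main__":
    # Small, quick demo values; the test described above used far larger files.
    results = run_probe(tempfile.gettempdir(), 1024 * 1024, 0.2, 3)
    print(f"slowest cycle: {max(results):.3f} s")
```

To approximate the 100MB-every-2-seconds case, the call would look like `run_probe("/tmp", 100 * 1024 * 1024, 2.0, cycles)` with `cycles` large enough to span the failover.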
While all of the VMs were running the script, we did a failover/failback as well as a hard reboot, which left I/O suspended for a whopping 3 minutes. Surprisingly, none of the VMs had a kernel panic, ended up with a read-only file system, or stopped writing, although during the filer reboot they were painfully slow or frozen. You must be wondering why none of the VMs crashed, since 180 seconds of no response from disk will bring any OS to its knees. So what was special? Here's the magic: if you look into the VM best practices and search the NetApp site for the VM disk timeout settings, you will find they recommend changing the disk timeout to 190 seconds so the guest can survive any kind of controller reboot or failover, and that is how they can call it non-disruptive.
So the next time someone says that with an active/active cluster you don't have any downtime, don't forget to ask how they handle a system crash or upgrade activity. And if you want to deploy VMs on NetApp heads in your environment, don't forget to change that parameter, otherwise even a 22-second I/O pause will have a big impact on your VM environment.
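As a quick way to check that setting on a Linux guest, here is a small Python sketch. It assumes the guest's SCSI disks expose their timeout at /sys/block/<device>/device/timeout, and it uses the 190-second figure quoted above as the threshold; the function names are made up for this example.

```python
import glob
import os

RECOMMENDED_TIMEOUT = 190  # seconds, the figure quoted in this post


def read_disk_timeouts(sys_block: str = "/sys/block") -> dict:
    """Return {device_name: timeout_seconds} for disks that expose a timeout.

    Assumes the Linux sysfs layout <sys_block>/<device>/device/timeout.
    Devices without that file, or with unreadable contents, are skipped.
    """
    timeouts = {}
    pattern = os.path.join(sys_block, "*", "device", "timeout")
    for path in glob.glob(pattern):
        device = path.split(os.sep)[-3]
        try:
            with open(path) as handle:
                timeouts[device] = int(handle.read().strip())
        except (OSError, ValueError):
            continue
    return timeouts


def below_recommended(timeouts: dict, minimum: int = RECOMMENDED_TIMEOUT) -> list:
    """Return the sorted names of devices whose timeout is below the minimum."""
    return sorted(name for name, value in timeouts.items() if value < minimum)


if __name__ == "__main__":
    current = read_disk_timeouts()
    for name in below_recommended(current):
        print(f"{name}: timeout {current[name]}s is below {RECOMMENDED_TIMEOUT}s")
```

This only reports the current values; how the timeout gets changed and made persistent depends on the guest OS and should follow the vendor's own guidance.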
Thanks & Regards,
2012-03-12 08:59 AM
As the post above is a direct copy of Mohit Agarwal's post, I would contact him to get his views. Here is the link that the above post was lifted from:
Mohit is an active participant on the communities, so I would reach out to him about his post as well.
When VMware and NetApp agree that 180 seconds is allowable, it seems to me that those best positioned to know (VMware and NetApp) have the best information.
The 180-second mark for a failover applies when the number of volumes approaches the maximum and/or the NVRAM log is heavily loaded and an extremely large number of events has to be replayed. I have managed upward of 700 systems and can count the number of failover issues in the past 4 years on one hand.