Solved: Re: Cluster Failover (CF -Takeover and CF -Giveback)

qanderson · ‎2012-04-02

I recently upgraded my 3140 to Data ONTAP 7.3.6. I followed the Upgrade Advisor and upgraded both nodes, then went to the 2nd node of my active/active cluster and did a cf takeover. When I did this, the 1st node rebooted. My CIFS shares seemed to stay up and running, but many iSCSI LUNs hooked up to database boxes went away, along with a bunch of NFS exports for the virtual environment. I was under the impression that the active/active cluster configuration allowed all data services from one node to run on the other. Am I thinking correctly or no? I still have to reboot the 2nd node, and management is not too happy with me that the "Non-Disruptive Upgrade" I promised yanked the guts out from underneath our VM environment and a couple of DB clusters!

mcope · ‎2012-04-02

Did you read the Data ONTAP Upgrade Guide before doing the upgrade? There's a section about halfway in that lists all the caveats for Non-Disruptive Upgrades. CIFS and iSCSI are two of the reasons you might have wanted to do a disruptive upgrade, because they are stateful protocols. It sounds like you may also want to review the configuration of your NFS and iSCSI clients to ensure they can tolerate interruptions of up to 120 seconds.

aborzenkov · ‎2012-04-02

“cf takeover” shuts down services on one controller and starts them on the other, so there is a period of service unavailability. Clients must be properly configured (timeouts set, etc.) to continue working without errors, but there will still be a pause before IO can continue. Takeover may theoretically take up to 2 minutes, so your clients must be prepared to wait that long until services are back.

qanderson · ‎2012-04-02

Thanks for pointing me in the right direction, guys. Is there documentation on how to configure the clients to accept the 2 min outage? I'm digging through the upgrade guide but haven't stumbled on it yet...

aborzenkov · ‎2012-04-02

For NFS, mounting with the “hard” option ensures the client never times out and retries the connection to the server indefinitely. For LUNs (FCP/iSCSI), there are Host Utilities for all supported operating systems, which either provide tools to set the necessary parameters automatically or describe how to do it. Note that you may need to reboot hosts for the changed values to take effect. A CIFS client just reconnects; this is normally transparent for simple file browsing (like Explorer) but may be disruptive for something like a database on a CIFS share.
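
For Linux NFS clients, a quick way to audit the “hard” option is to scan the mount table. The following is only a minimal sketch: it assumes the client lists hard or soft explicitly in /proc/mounts (as Linux clients typically do), and the function name and sample paths are made up for illustration.

```python
def soft_nfs_mounts(mounts_text):
    """Return mount points of NFS filesystems that are not mounted 'hard'.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not an NFS client mount.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = fields[3].split(",")
        if "hard" not in options:
            flagged.append(fields[1])
    return flagged
```

On a client you would pass in the contents of /proc/mounts, e.g. `soft_nfs_mounts(open("/proc/mounts").read())`; an empty list means every NFS mount reports the hard option. Hypervisors that mount NFS datastores themselves need to be checked with their own tools.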

In all cases, you are responsible for setting up client applications so they do not fail when IO pauses. For hypervisors, this means that not only the hypervisor but also every guest must have the correct timeouts.
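
For Linux guests, the per-disk SCSI command timeout is exposed in sysfs as /sys/block/<device>/device/timeout (in seconds). A rough sketch of a check is shown here; the 120-second threshold is just the takeover figure mentioned in this thread, not a vendor recommendation, and the function name is hypothetical.

```python
import os


def short_disk_timeouts(min_seconds=120, sysfs_root="/sys/block"):
    """Return {device: timeout} for disks whose timeout is below min_seconds."""
    too_short = {}
    for dev in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, dev, "device", "timeout")
        try:
            with open(path) as f:
                timeout = int(f.read().strip())
        except (OSError, ValueError):
            # Device has no readable timeout attribute (e.g. loop devices).
            continue
        if timeout < min_seconds:
            too_short[dev] = timeout
    return too_short
```

Running `short_disk_timeouts()` inside each guest lists the disks that would give up before a 2 minute pause ends. Use the value from the guest OS tuning documentation for your platform rather than the assumed default here.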

aborzenkov · ‎2012-04-02

And please understand that your problem is not specific to software updates. If it fails during NDU, it will also fail in case of a controller malfunction. So basically your configuration is not highly available, and that should be your primary concern. Doing "cf takeover"/"cf giveback" is an essential step in testing a NetApp system configuration for basic HA features before putting it into production.

qanderson · ‎2012-04-03

I have our VM team checking to ensure the hard option is set for the NFS exports. As for iSCSI, is this something that is set in SnapDrive?

aborzenkov · ‎2012-04-03

No. SnapDrive is a separate product and has nothing to do with controller failover. The correct settings (on Windows) are applied either by Host Utilities or by the MPIO DSM package (at least in recent versions; I believe in the past Host Utilities was a prerequisite).
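
One of the Windows settings involved is the disk IO timeout, stored as the TimeOutValue registry value under HKLM\SYSTEM\CurrentControlSet\Services\Disk. As an illustration only, the sketch below parses the output of `reg query HKLM\SYSTEM\CurrentControlSet\Services\Disk /v TimeOutValue`; the function name is made up, and the value to expect should come from the Host Utilities or DSM documentation for your version.

```python
import re


def parse_disk_timeout(reg_output):
    """Extract TimeOutValue in seconds from `reg query` output, or None if absent."""
    # reg prints REG_DWORD data in hexadecimal, e.g. "TimeOutValue  REG_DWORD  0x3c".
    match = re.search(r"TimeOutValue\s+REG_DWORD\s+(0x[0-9a-fA-F]+)", reg_output)
    return int(match.group(1), 16) if match else None
```

A result of None means the value is not set at all, which is worth checking separately on each cluster node.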

qanderson · ‎2012-04-03

Fantastic, thanks for the assistance. So you're thinking that if all iSCSI LUNs have the timeout set to 2 mins and the hard option is enforced on the NFS exports, then those protocols should be OK when I do a failover?

scottgelb · ‎2012-04-03

Check your partner interfaces and network to confirm that everything can come up on the other node. Even if the partner interfaces are correct, sometimes a switch config won't allow the second IP to work, or the port is on another VLAN, for example, so even though the takeover works, the ports are not usable because of setup external to the NetApp. Aborzenkov made a great point: we test this prior to production on installations to make sure the partner interfaces and protocols work on failover. But with timeouts set, NDU upgrades work great. CIFS is the only protocol that has to reconnect when everything is set up to best practice.

DEF__HEAD · ‎2012-06-26

Hi All

I'm hoping someone can help me. Yesterday I upgraded from 8.02 to 8.02P7, following the non-disruptive procedure from the AutoSupport Upgrade Advisor, and I too had an outage.

My environment is a NetApp FAS3210 HA pair with 3 LUNs on Controller A and 3 LUNs on Controller B, connected via FCP to a Windows Server 2008 R2 file cluster with 3 nodes (Active / Active / Passive). Communication is HA paired between partners via VIF0, and we thoroughly tested this before going into production, so with this being our 1st upgrade I was expecting a non-disruptive process.

When I completed the cf takeover step on Controller A, services shut down on B and moved to A, but our Windows file cluster Node B also failed its resources, went offline, and slowly failed over to passive node C. Once Controller B had rebooted, I completed cf giveback. The same thing happened again when I upgraded Controller B.

Note: We do not have Windows Host Utilities installed on our file cluster nodes, as we have DSM MultiPath IO 3.5 installed instead. I read in the DSM 3.5 new features that Windows Host Utilities is no longer required. Is this correct, or do you think this is where my problem could be?

http://support.netapp.com/NOW/download/software/mpio_win/3.5/

New Features

Thank you

aborzenkov · ‎2012-06-26

Host Utilities were never necessary for operation; they simply helped to automate host settings.

During setup of a Microsoft Cluster, some disk IO related parameters may get reset (this was definitely the case in the past). So you have to check all the settings documented in the Host Utilities manual and correct any that were changed. Host Utilities could also be run in Recover mode to reapply these settings; I do not remember whether DSM also has an option to reapply settings post-installation.