2012-04-02 07:03 AM - last edited on 2015-06-08 11:56 AM by alissa
I recently upgraded my 3140 to Data ONTAP 7.3.6. I followed the Upgrade Advisor and upgraded both nodes, then went to the 2nd node of my active/active cluster and ran cf takeover. When I did this, the 1st node rebooted. My CIFS shares seemed to stay up and running, but many iSCSI LUNs hooked up to database boxes went away, along with a bunch of NFS exports for the virtual environment. I was under the impression that the active/active cluster configuration allowed all data services from one node to run on the other. Am I thinking correctly or not? I still have to reboot the 2nd node, and management is not too happy with me that the "Non-Disruptive Upgrade" I promised yanked the guts out from underneath our VM environment and a couple of DB clusters!
2012-04-02 07:28 AM
Did you read the Data ONTAP Upgrade Guide before doing the upgrade? There's a section about halfway through that lists all the caveats for Non-Disruptive Upgrades. CIFS and iSCSI are two of the reasons why you might have wanted to do a disruptive upgrade, because they are stateful protocols. It sounds like you may also want to review the configuration of your NFS and iSCSI clients to ensure they tolerate interruptions of up to 120 seconds.
2012-04-02 07:30 AM
“cf takeover” shuts down services on one controller and starts them on the other, so there is some period of service unavailability. Clients must be properly configured (timeouts set, etc.) to continue working without errors, but there will still be a pause before I/O can continue. A takeover may theoretically take up to 2 minutes, so your clients must be prepared to wait that long until services are back.
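To illustrate what "prepared to wait" means on the application side, here is a minimal sketch of a retry loop that rides out an I/O pause of up to 2 minutes. It is only an illustration, not anything NetApp ships: the name retry_until, the 5-second retry interval and the choice to retry on OSError are all assumptions. The OS-level timeouts still have to be set correctly underneath it.

```python
import time


def retry_until(operation, max_wait=120.0, interval=5.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call operation() until it succeeds or max_wait seconds have passed.

    Hypothetical helper: retries only on OSError (I/O and network errors)
    and re-raises the last error once the deadline is reached.
    """
    deadline = clock() + max_wait
    while True:
        try:
            return operation()
        except OSError:
            if clock() >= deadline:
                raise  # the pause lasted longer than we agreed to wait
            sleep(interval)


if __name__ == "__main__":
    attempts = []

    def flaky_read():
        # Simulates storage that is unavailable for the first two attempts.
        attempts.append(1)
        if len(attempts) < 3:
            raise OSError("storage paused during takeover")
        return "data"

    print(retry_until(flaky_read, interval=0.1))
```

The clock and sleep parameters exist only so the loop can be tested without actually waiting 2 minutes.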
2012-04-02 07:48 AM
Thanks for pointing me in the right direction, guys. Is there documentation on how to configure the clients to tolerate the 2-minute outage? I'm digging through the upgrade guide but haven't stumbled on it yet...
2012-04-02 08:03 AM
For NFS, mounting with the “hard” option ensures the client never times out and retries the connection to the server indefinitely. For LUNs (FCP/iSCSI), there are Host Utilities for all supported operating systems, which either provide tools to set the necessary parameters automatically or describe how to do it. Note that you may need to reboot the hosts for the changed values to take effect. A CIFS client just reconnects; this is normally transparent for simple file browsing (like Explorer) but may be disruptive for something like a database on a CIFS share.
In all cases, you are responsible for setting up client applications so they do not fail during an I/O pause. For hypervisors, this means that not only the hypervisor but also every guest must have the correct timeouts.
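As a rough way to audit the NFS side on Linux clients, the sketch below scans text in /proc/mounts format and lists NFS mounts that use the “soft” option, which are the ones that can return I/O errors during a takeover. The function name, the sample host filer1 and the sample paths are made up for illustration.

```python
def find_soft_nfs_mounts(mounts_text):
    """Return mount points of NFS filesystems mounted with the 'soft' option.

    Expects text in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    soft = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in ("nfs", "nfs4") and "soft" in options.split(","):
            soft.append(mount_point)
    return soft


# Hypothetical sample data; on a real client read /proc/mounts instead.
SAMPLE = """\
filer1:/vol/vmstore /mnt/vmstore nfs rw,hard,timeo=600,retrans=2 0 0
filer1:/vol/scratch /mnt/scratch nfs rw,soft,timeo=100,retrans=3 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    print(find_soft_nfs_mounts(SAMPLE))
```

On a real client you would pass in the contents of /proc/mounts, e.g. find_soft_nfs_mounts(open("/proc/mounts").read()). This only covers NFS; LUN timeouts are still handled by Host Utilities as described above.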
2012-04-02 11:57 AM
And please understand that your problem is not specific to a software update. If it fails during NDU, it will also fail in case of a controller malfunction. So basically your configuration is not highly available, and that should be your primary concern. Doing "cf takeover"/"cf giveback" is an essential step in testing a NetApp system configuration for basic HA features before putting it into production.
2012-04-03 06:08 AM
No. SnapDrive is an unrelated product and has nothing to do with controller failover. The correct settings (on Windows) are applied either by Host Utilities or by the MPIO DSM package (at least in recent versions; I believe Host Utilities was a prerequisite in the past).
2012-04-03 07:08 AM
Fantastic, thanks for the assistance. So you're thinking that if all iSCSI LUNs have the timeout set to 2 minutes and the hard option is enforced on the NFS mounts, those protocols should be OK when I do a failover?
2012-04-03 08:03 AM
Check your partner interfaces and network to confirm that everything can come up on the other node. Even if the partner interfaces are configured correctly, sometimes the switch config won't allow the second IP to work, or the port is on another VLAN, for example. So even though the takeover works, the ports may not be usable, depending on the setup external to the NetApp. Aborzenkov made a great point: we test this on installations prior to production to make sure the partner interfaces and protocols work on failover. With timeouts set, NDU upgrades work great. CIFS is the only protocol that has to reconnect when everything is set up to best practice.
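One simple way to spot the switch/VLAN kind of problem during a takeover test is to check from a client whether the taken-over node's IP still answers on the protocol ports. The sketch below assumes the standard TCP ports (NFS 2049, iSCSI 3260, CIFS 445); the function name and the example address are hypothetical. An open port only shows that the address is reachable, not that the LUNs or exports behind it are actually being served.

```python
import socket

# Standard TCP ports for the protocols discussed in this thread.
SERVICE_PORTS = {"nfs": 2049, "iscsi": 3260, "cifs": 445}


def check_services(host, ports=SERVICE_PORTS, timeout=3.0):
    """Return {service_name: True/False} for whether a TCP connect succeeds."""
    results = {}
    for name, port in ports.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            # Refused, timed out, or unreachable: treat all as "not usable".
            results[name] = False
    return results


if __name__ == "__main__":
    # Hypothetical address of the node that was taken over.
    print(check_services("192.0.2.10", timeout=1.0))
```

Run it once before the takeover and once while the partner is serving the address; any service that flips from True to False points to something outside the controller, such as the switch port or VLAN.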