Solved: Expected outage on F-A Metrocluster during network migration

sta · ‎2015-09-10

Hi all,

We have 2 x FAS3140 running Data ONTAP 7.3, on 2 distinct sites, in fabric-attached MetroCluster mode.

Each node uses 4 network interfaces for data delivery, joined into an LACP VIF.

The only protocol in use is NFS. It serves NFS datastores to VMware ESX and NFS volumes to some Linux servers.

We need to upgrade Ethernet switches to which these filers are connected.

I am considering 2 migration paths:

1- Perform a takeover / giveback, changing the connectivity on the offline node

2- Without takeover, unplug a single network cable from the old switch and plug it into the new switch. Then repeat the operation for the 3 remaining cables.

My questions:

-In case of takeover, how long should I expect the outage to last? I am afraid that migrating the VIF from the failed node could take up to 180 seconds. If so, I will have an issue with the VMware datastores, and also with the Linux physical servers, which cannot tolerate such a long NFS timeout.

-If the network cables are unplugged and then replugged one at a time, the link aggregate will temporarily be spread across different switches (from distinct vendors). I am not sure that LACP across different switches is supported.

Do you have any suggestions for performing the migration in a non-disruptive way? Or any experience to share about outage duration?

aborzenkov · ‎2015-09-10

Normally LACP across switches from different vendors is not possible, so takeover/giveback is the only way. It should be transparent if everything is configured correctly. But please note that there is always a possibility of misconfiguration when you perform a giveback onto new switches, so you should really plan a maintenance window for such an activity.

sta · ‎2015-09-11

I am just uncertain about the "45 seconds grace period" for NFS during takeover, which could climb to 120 seconds according to several KBs. In that case, all the VMs will be frozen, since ESX will consider the datastore offline.

In the end, it seems less academic, but also less risky, to (disable the network interfaces on the switch + unplug the network cables from the old switch) and then (replug the cables into the new switch + enable the interfaces).

Anyway, a maintenance window is scheduled. I'll try to keep you updated about the results, if you are interested.

aborzenkov · ‎2015-09-11

NetApp redundancy is based on the assumption that takeover/giveback are transparent. If you have reasons to believe they are not, you have a much larger problem and need to address it: it implies that a single controller failure cannot be handled properly.

sta · ‎2015-09-15

Actually, the claim about transparency is totally false:

-A NetApp KB acknowledges that a takeover cuts all CIFS connections

-NetApp support acknowledges that 45 seconds of unavailability is usually observed.

We have just experienced a forced takeover. After solving the problem, we performed a giveback. The timings were as follows:

-Duration of the giveback: 53 seconds

-Duration of the unavailability of network interfaces: 88 seconds.
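For anyone wanting to measure this kind of figure themselves, here is a minimal sketch of how an unavailability window can be derived from periodic reachability probes (for example, one ping per second against the filer's VIF address). The function name `longest_outage` and the probe format are assumptions made for illustration, not something taken from a NetApp tool.

```python
def longest_outage(probes):
    """Return the longest unavailability window, in seconds.

    `probes` is a time-sorted list of (timestamp_seconds, reachable) pairs.
    An outage is measured from the last successful probe before a run of
    failures to the first successful probe after it.
    """
    longest = 0
    last_ok = None        # timestamp of the most recent successful probe
    saw_failure = False   # True while we are inside a run of failures
    for timestamp, reachable in probes:
        if reachable:
            if saw_failure and last_ok is not None:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            saw_failure = False
        else:
            saw_failure = True
    return longest


# Hypothetical probe log: reachable at t=1, unreachable until t=89.
sample = [(0, True), (1, True), (2, False), (50, False), (89, True), (90, True)]
print(longest_outage(sample))
```

The result is only as precise as the probe interval, so a one-second interval gives a figure accurate to roughly a second on each side.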

niels · ‎2015-09-16

Hi sta,

It all depends on the definition of "transparent".

NetApp controllers and the failover/giveback mechanism are designed to work "transparently" in the sense that the client will not have to remount the volume/LUN and that the OS and application continue to run without disruption.

NetApp does not claim a 0-second failover and as such, yes, there is a short pause in IO involved. That's what we document in KBs and best practice guides. We even encourage people to install the respective host utilities kits and, especially for virtual environments, to have the guest OS time-outs adjusted.

This is true for block protocols, NFS and SMBv3 configured with CA shares. CIFS with SMBv1 and SMBv2 experiences a disconnect during the failover/giveback, which is caused by the protocols being stateful. Clients configured correctly will just reconnect automatically once the failover has occurred.

So other than having a 53-second giveback that caused an 88-second IO pause, was there anything that needed to be restarted, remounted, reconfigured? (given the NetApp best practices have been followed)

If not, then I'd consider the giveback "transparent".

Out of curiosity, what's your FAS model and ONTAP version? There have been improvements in failover/giveback timings in each release. Also, cDOT has shorter storage failover times than 7-Mode. We learn and improve.

Additionally, some of the IO pause may be caused by the network configuration, as 88 seconds against a 53-second giveback sounds pretty high.

It's a good idea to either configure port-fast (or the equivalent) on switch ports facing business-critical, non-network devices (not just our stuff), or to disable spanning tree on those ports completely. That makes the port transition from down to forwarding (after the link-up event) nearly instantaneous, instead of up to 45 seconds late.

Kind regards, Niels

sta · ‎2015-09-28

Hi Niels,

Thank you for such a complete answer.

OK, if we consider an 88-second I/O pause to be the expected behaviour, then the takeover / giveback ran as expected.

After cross-checking with the network guys, spanning tree is enabled on the ports concerned, but in "portfast" mode. Anyway, that is not really an issue for me.

In this particular case, nothing had to be reconfigured, since the OSs of the servers using NFS shares were able to handle an 88-second pause.

The only reason to worry is perhaps the VMware datastores. Even with timeouts configured following NetApp recommendations (and this has been done), 88 seconds is dangerously close to the VMware limits:

-t0: loss of the connectivity to the NFS datastore

-t1 = t0 + 120s : NFS connection expires and VMs receive "I/O failure" messages

-t2 = t0 + 140s : The Datastore expires and has to be reconnected

-t3 = t0 + 320s : The VM becomes inconsistent.
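To make the margin explicit, the timeline above can be sketched as a small check. The thresholds are the ones listed in this post (they depend on how the ESX NFS timeouts are set), and the names `THRESHOLDS`, `reached_stages` and `margin_to_first_stage` are made up for this illustration.

```python
# Seconds after loss of connectivity (t0) at which each stage is reached,
# as listed in the post above. These are not universal VMware constants.
THRESHOLDS = [
    (120, "NFS connection expires, VMs receive I/O failure messages"),
    (140, "datastore expires and has to be reconnected"),
    (320, "VM becomes inconsistent"),
]


def reached_stages(pause_s):
    """Return the descriptions of every stage an I/O pause of pause_s hits."""
    return [label for limit, label in THRESHOLDS if pause_s >= limit]


def margin_to_first_stage(pause_s):
    """Seconds left before the first stage (negative if already passed)."""
    return THRESHOLDS[0][0] - pause_s


observed = 88  # I/O pause measured during the giveback
print(reached_stages(observed))         # no stage reached
print(margin_to_first_stage(observed))  # seconds of headroom left
```

With the observed 88 seconds, no stage is reached, but the remaining headroom before t1 is only 32 seconds.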

aborzenkov · ‎2015-09-16

-Duration of the giveback : 53 seconds
-Duration of the unavailability of network interfaces : 88 seconds.

This is an indication that the switch ports to which the NetApp is connected have spanning tree enabled.
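The arithmetic behind that reasoning can be sketched as follows. It assumes classic IEEE 802.1D spanning tree with the default forward delay of 15 seconds, where a port without portfast spends one forward delay in listening and one in learning before it forwards traffic. The actual switch settings are not given in this thread, so the 15-second value is an assumption.

```python
GIVEBACK_S = 53            # duration of the giveback, from the post
INTERFACES_DOWN_S = 88     # unavailability of network interfaces, from the post
STP_FORWARD_DELAY_S = 15   # 802.1D default; assumed, not confirmed for these switches


def extra_network_delay(giveback_s, interfaces_down_s):
    """Time the interfaces stayed unavailable beyond the giveback itself."""
    return interfaces_down_s - giveback_s


def stp_port_delay(forward_delay_s):
    """Listening plus learning time for a port without portfast."""
    return 2 * forward_delay_s


print(extra_network_delay(GIVEBACK_S, INTERFACES_DOWN_S))  # 35
print(stp_port_delay(STP_FORWARD_DELAY_S))                 # 30
```

The two figures being of the same order makes spanning tree a plausible suspect, but it does not prove it was the cause; sta's reply of 2015-09-28 reports that the ports were in fact in portfast mode.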