We have 2 x FAS3140 running Data ONTAP 7.3, on two separate sites, in a fabric-attached MetroCluster configuration.
Each node uses 4 network interfaces for data delivery, joined into an LACP VIF.
The only protocol in use is NFS: it serves NFS datastores to VMware ESX and NFS volumes to some Linux servers.
We need to upgrade the Ethernet switches to which these filers are connected.
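For reference, a VIF of this kind is typically built on each 7-mode node roughly as follows (the interface names, VIF name, and IP address below are placeholders for illustration, not our actual configuration):

```shell
# Minimal 7-mode sketch; e0a-e0d, vif0 and the IP are assumed names.
filer> vif create lacp vif0 -b ip e0a e0b e0c e0d
filer> ifconfig vif0 192.168.10.10 netmask 255.255.255.0 up
filer> vif status vif0      # verify all four links are active in the aggregate
```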
I am considering two migration paths:
1- Perform a takeover/giveback, changing connectivity on the offline node.
2- Without a takeover, unplug a single network cable from the old switch and plug it into the new switch, then repeat the operation three times.
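For clarity, path 1 in 7-mode CLI terms would look roughly like this (node names are placeholders):

```shell
# Sketch of option 1; filer1/filer2 are assumed node names.
filer2> cf status        # confirm the pair is healthy before starting
filer2> cf takeover      # filer2 serves both nodes' data
# ... move filer1's VIF member cables to the new switch ...
filer2> cf giveback      # filer1 resumes serving its data
```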
- In the case of a takeover, what about the duration of the expected outage? I am afraid that migrating the VIF from the failed node could take up to 180 seconds; if so, I will have a problem with the VMware datastores, and also with the Linux physical servers, which cannot tolerate such a long NFS timeout.
- If the network cables are unplugged and replugged one by one, the link aggregate will at some point be split across switches from different vendors. I am not sure that LACP spanning two different switches is supported.
Do you have any suggestions for performing the migration in a non-disruptive way, or any field experience regarding outage duration?
Normally, LACP across switches from different vendors is not possible, so takeover/giveback is the only way. It should be transparent if everything is configured correctly. Note, however, that there is always a possibility of misconfiguration when you perform the giveback onto the new switches, so you should definitely plan a maintenance window for this activity.
I am just uncertain about the "45 seconds grace period" for NFS that occurs during takeover, which according to several KBs can climb to 120 seconds. In that case, all the VMs will be frozen, since ESX will consider the datastore offline.
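For what it's worth, the point at which ESX declares an NFS datastore unavailable is governed by its NFS heartbeat advanced settings. A sketch of how to inspect them on classic ESX (the right values for a given release should be checked against the VMware/NetApp best-practice docs):

```shell
# Hedged sketch: read the current NFS heartbeat tunables on an ESX host.
esxcfg-advcfg -g /NFS/HeartbeatFrequency    # seconds between heartbeats
esxcfg-advcfg -g /NFS/HeartbeatTimeout      # seconds to wait for each heartbeat
esxcfg-advcfg -g /NFS/HeartbeatMaxFailures  # failures before datastore is marked unavailable
```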
Finally, it seems less academic, but also less risky, to disable the network interfaces on the old switch and unplug the cables, then replug the cables into the new switch and enable the interfaces there.
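On the switch side, that per-cable procedure could be sketched as follows (Cisco IOS syntax and interface names are assumptions; other vendors have equivalents):

```shell
# Old switch: administratively disable the port before pulling the cable.
oldswitch(config)# interface GigabitEthernet0/1
oldswitch(config-if)# shutdown
# ... move the cable to the new switch ...
# New switch: bring the port up and let LACP settle before the next cable.
newswitch(config)# interface GigabitEthernet0/1
newswitch(config-if)# no shutdown
```

On the filer, `vif status` can confirm the link has rejoined the aggregate before moving on to the next cable.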
Anyway, a maintenance window is scheduled. I'll try to keep you updated on the results, if anyone is interested.
NetApp redundancy is based on the assumption that takeover/giveback is transparent. If you have reason to believe it is not, you have a much larger problem and need to address it - it implies that a single controller failure cannot be handled properly.
It all depends on the definition of "transparent".
NetApp controllers and the failover/giveback mechanism are designed to be "transparent" in the sense that the client does not have to remount the volume/LUN, and the OS and applications continue to run without disruption.
NetApp does not claim a 0-second failover, so yes, there is a short pause in IO involved. That is what we document in KBs and best-practice guides, and we even encourage people to install the respective Host Utilities kits and, especially for virtual environments, to adjust the guest OS timeouts.
This is true for block protocols, NFS, and SMBv3 configured with continuously available (CA) shares. CIFS with SMBv1 and SMBv2 experiences a disconnect during failover/giveback, which is caused by those protocol versions being stateful. Correctly configured clients will simply reconnect automatically once the failover has occurred.
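On the Linux NFS client side, the usual timeout hardening is a hard mount, so IO blocks and retries rather than erroring out during the pause (server name and paths below are placeholders):

```shell
# Hedged sketch: hard-mount so NFS IO retries indefinitely during a takeover.
# timeo is in tenths of a second (600 = 60 s per retry attempt).
mount -t nfs -o hard,intr,timeo=600,retrans=2 filer:/vol/vol1 /mnt/vol1
```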
So other than a 53-second giveback that caused an 88-second IO pause, was there anything that needed to be restarted, remounted, or reconfigured (given that NetApp best practices have been followed)?
If not, then I'd consider the giveback "transparent".
Out of curiosity, what are your FAS model and ONTAP version? There have been improvements in failover/giveback timing in each release, and clustered Data ONTAP (cDOT) has shorter storage failover times than 7-Mode. We learn and improve.
Additionally, some of the IO pause may be caused by the network configuration, as an 88-second pause for a 53-second giveback sounds pretty high.
It's a good idea to configure PortFast (or the equivalent) on switch ports facing business-critical non-network devices (not just our gear), or to disable spanning tree on those ports completely, making the port transition from down to forwarding after a link-up event nearly instantaneous, instead of up to 45 seconds later.
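On Cisco IOS, for example, that looks roughly like this (the interface name is a placeholder; check your vendor's equivalent command):

```shell
switch(config)# interface GigabitEthernet0/1
switch(config-if)# spanning-tree portfast           # skip listening/learning on link-up
switch(config-if)# spanning-tree bpduguard enable   # optional: err-disable the port if a switch appears
```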