
production outage, multimode dynamic vif up but no connectivity

Hi,

We have a FAS3070 cluster with dual 6x1 Gbit LACP trunks, one to each
core. One of our cores lost a line card, and as a result packet
forwarding through it stopped[1]. The filer never tried to fail over to its
backup multimode vif because the switch was still successfully negotiating
LACP.

Is there an option in ONTAP to monitor packet flow, or to send a simple ARP
probe (similar to the arp_interval / arp_ip_target options of the Linux
bonding driver)? This failure mode could affect other customers as well: as
we just experienced, a host can have an LACP link negotiated yet see zero
packet flow and be unable to ARP anything else.
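
To make the request concrete, here is a minimal Python sketch of the kind of check being asked for. It is not ONTAP code and does not send real ARP packets; `ArpMonitor`, `record_reply`, `is_healthy` and the `missed_limit` threshold are hypothetical names chosen for illustration. What it shows is link health judged by recent probe replies in addition to LACP state, so a trunk that still negotiates LACP but blackholes traffic is marked down anyway.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArpMonitor:
    """Hypothetical probe-based liveness check for one link or trunk.

    Loosely inspired by the idea behind Linux bonding's arp_interval /
    arp_ip_target: a link only counts as healthy if a probe reply was
    seen recently, no matter what LACP reports.
    """
    arp_interval_ms: int          # how often a probe is sent
    missed_limit: int = 2         # hypothetical: missed intervals tolerated before failing
    last_reply_ms: Optional[int] = None

    def record_reply(self, now_ms: int) -> None:
        """Call whenever a probe target answers."""
        self.last_reply_ms = now_ms

    def is_healthy(self, now_ms: int, lacp_up: bool) -> bool:
        if not lacp_up:
            return False  # LACP/carrier down is always a failure
        if self.last_reply_ms is None:
            return False  # no target has ever answered
        # Healthy only if the last reply falls inside the allowed window.
        return now_ms - self.last_reply_ms <= self.arp_interval_ms * self.missed_limit


mon = ArpMonitor(arp_interval_ms=1000)
mon.record_reply(now_ms=0)
print(mon.is_healthy(now_ms=1500, lacp_up=True))  # True: reply is recent
print(mon.is_healthy(now_ms=5000, lacp_up=True))  # False: LACP up, but traffic blackholed
```

The second call is the situation described above: LACP is fine, yet the missing replies alone are enough to declare the path dead.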

I know the cf.takeover.on_network_interface_failure option can cause a
takeover, but it would not have been triggered in our case because the
interface was up and clean.

Any thoughts or recommendations to make our environment more resilient
to failure modes such as this are appreciated.

Cheers,
Brian

[1] The LACP link established properly; I can confirm this from the
lacp_log. We even tried rebooting the filer that lost network
connectivity, but it came up fine, again successfully negotiated a
data link layer LACP connection, and brought up its 802.1Q tagged interfaces.
The failure in the switch was a layer 3 issue that essentially
blackholed all traffic in and out.

Re: production outage, multimode dynamic vif up but no connectivity

Here are the relevant network configurations (sanitized) for netapp1 and netapp2, the two cluster members.

hostname netapp1
vif create lacp -b ip Multi1 e0a e0c e1a e1c e4a e4c
vif create lacp -b ip Multi2 e0b e0d e1b e1d e4b e4d
vif create single Single1 Multi1 Multi2
vif favor Multi1
vlan create Single1 101 102 103 104 105 106 107 108 109 110
ifconfig Single1-101 `hostname`-Single1-101 netmask 255.255.248.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-101
ifconfig Single1-102 `hostname`-Single1-102 netmask 255.255.254.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-102
ifconfig Single1-103 `hostname`-Single1-103 netmask 255.255.252.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-103
ifconfig Single1-104 `hostname`-Single1-104 netmask 255.255.252.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-104
ifconfig Single1-105 `hostname`-Single1-105 netmask 255.255.255.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-105
ifconfig Single1-106 `hostname`-Single1-106 netmask 255.255.255.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-106
ifconfig Single1-107 `hostname`-Single1-107 netmask 255.255.255.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-107
ifconfig Single1-108 `hostname`-Single1-108 netmask 255.255.224.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-108
ifconfig Single1-109 `hostname`-Single1-109 netmask 255.255.252.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-109
ifconfig Single1-110 `hostname`-Single1-110 netmask 255.255.252.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp2-Single2-110
route add default 10.10.10.10

###

hostname netapp2
vif create lacp -b ip Multi3 e0a e0c e1a e1c e4a e4c
vif create lacp -b ip Multi4 e0b e0d e1b e1d e4b e4d
vif create single Single2 Multi3 Multi4
vif favor Multi4
vlan create Single2 101 102 103 104 105 106 107 108 109 110
ifconfig Single2-101 `hostname`-Single2-101 netmask 255.255.248.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-101
ifconfig Single2-102 `hostname`-Single2-102 netmask 255.255.254.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-102
ifconfig Single2-103 `hostname`-Single2-103 netmask 255.255.252.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-103
ifconfig Single2-104 `hostname`-Single2-104 netmask 255.255.252.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-104
ifconfig Single2-105 `hostname`-Single2-105 netmask 255.255.255.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-105
ifconfig Single2-106 `hostname`-Single2-106 netmask 255.255.255.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-106
ifconfig Single2-107 `hostname`-Single2-107 netmask 255.255.255.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-107
ifconfig Single2-108 `hostname`-Single2-108 netmask 255.255.224.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-108
ifconfig Single2-109 `hostname`-Single2-109 netmask 255.255.252.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-109
ifconfig Single2-110 `hostname`-Single2-110 netmask 255.255.252.0 broadcast x.x.x.x mtusize 9000 -wins partner netapp1-Single1-110
route add default 10.10.10.10
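
A small sanity check for configs like these: the hypothetical Python helper below cross-references each `ifconfig <vif>-<id>` line against the IDs listed on the matching `vlan create` line and reports any VLAN interface whose ID was never created. It is a text-level sketch only; it assumes the one-command-per-line layout shown above and does not talk to the filer.

```python
from __future__ import annotations

import re


def missing_vlans(config: str) -> dict[str, set[int]]:
    """Map each base vif to VLAN IDs used by ifconfig but absent from vlan create."""
    declared: dict[str, set[int]] = {}
    used: dict[str, set[int]] = {}
    for line in config.splitlines():
        parts = line.split()
        if parts[:2] == ["vlan", "create"] and len(parts) > 2:
            # vlan create <vif> <id> <id> ...
            declared.setdefault(parts[2], set()).update(int(v) for v in parts[3:])
        elif parts[:1] == ["ifconfig"] and len(parts) > 1:
            # ifconfig <vif>-<id> ...
            m = re.fullmatch(r"(\S+)-(\d+)", parts[1])
            if m:
                used.setdefault(m.group(1), set()).add(int(m.group(2)))
    result = {}
    for vif, ids in used.items():
        missing = ids - declared.get(vif, set())
        if missing:
            result[vif] = missing
    return result


sample = """vlan create Single2 101 102
ifconfig Single2-101 netapp2-Single2-101 mtusize 9000
ifconfig Single2-103 netapp2-Single2-103 mtusize 9000"""
print(missing_vlans(sample))  # {'Single2': {103}}
```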

Re: production outage, multimode dynamic vif up but no connectivity

Hey All,

NetApp support opened RFE 488056 for this issue.  It is worth noting that this outage was caused by a faulty line card and not a configuration snafu.  It could happen to you if you use LACP in your environment.

For now, it appears that single interfaces are resilient to this failure mode because each filer ARPs the other filer. Support told me that LACP links do _not_ ARP back and forth, which is why we got bitten. Once an LACP link is established, that is all the filer cares about in terms of network health on that interface. It seems ONTAP should go further than that, considering it already does this for single links.
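
The difference can be modeled in a few lines. Below is a hypothetical Python sketch of how a single-mode vif such as `Single1` (built over `Multi1` and `Multi2` with `vif favor Multi1`) would choose its active child with and without a probe result factored in. `Child`, `pick_active`, and the `probe_ok` flag are illustrative names, not ONTAP internals, and the selection rule is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Child:
    name: str
    lacp_up: bool
    probe_ok: bool  # hypothetical: result of an ARP probe sent across this child


def pick_active(children: List[Child], favored: str, require_probe: bool) -> Optional[str]:
    """Pick the active child of a single-mode vif (hypothetical model, not ONTAP code).

    require_probe=False: only LACP state counts, as support described for LACP links.
    require_probe=True: children whose probes fail are skipped, as the RFE requests.
    """
    def usable(c: Child) -> bool:
        return c.lacp_up and (c.probe_ok or not require_probe)

    # Stable sort: the favored child first, the rest in their original order.
    for c in sorted(children, key=lambda ch: ch.name != favored):
        if usable(c):
            return c.name
    return None


kids = [Child("Multi1", lacp_up=True, probe_ok=False),  # line card blackholing traffic
        Child("Multi2", lacp_up=True, probe_ok=True)]
print(pick_active(kids, "Multi1", require_probe=False))  # Multi1: stays on the dead path
print(pick_active(kids, "Multi1", require_probe=True))   # Multi2: fails over
```

The first call reproduces the outage (LACP alone keeps the favored, blackholed trunk active); the second is the behavior being requested.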

-Brian