Interface NFO funkiness

rkaramchedu1 · ‎2008-08-10

I have a cluster simulator running, on which I am trying to test the Negotiated Failover (NFO) functionality for interfaces. As background, NFO can be enabled on a physical interface and configured so that if an NFO-enabled interface fails on the partner, a CF event occurs (at least in theory).
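
For reference, a setup like this is typically enabled on each node with commands along the following lines. This is only a sketch: it assumes the interface name ns0 and the option values that appear in the output in this post, so check the option names against your own Data ONTAP release before relying on it.

```
node1> options cf.takeover.on_network_interface_failure on
node1> options cf.takeover.on_network_interface_failure.policy any_nic
node1> ifconfig ns0 nfo
```

The same commands would be repeated on node2, since the policy option itself notes that the same value is recommended on the local node and the partner.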

Here is the node1 configuration of the cluster:

node1> options cf
cf.giveback.auto.cifs.terminate.minutes 5          
cf.giveback.auto.enable      off        
cf.giveback.auto.terminate.bigjobs on         
cf.giveback.check.partner    off        
cf.takeover.change_fsid      on         
cf.takeover.detection.seconds 10         
cf.takeover.on_disk_shelf_miscompare off        
cf.takeover.on_failure       on         
cf.takeover.on_network_interface_failure on         
cf.takeover.on_network_interface_failure.policy any_nic    (same value in local+partner recommended)
cf.takeover.on_panic         on         
cf.takeover.on_short_uptime  on         

node1> ifconfig -a
ns0: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.97.133 netmask 0xffffff00 broadcast 192.168.97.255
        partner inet 192.168.97.135 (not in use)
        ether 00:50:56:1b:03:f8 (Linux AF_PACKET socket)
        nfo enabled
ns1: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:50:56:1c:03:f8 (Linux AF_PACKET socket)
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4064
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)


node1> cf status
Cluster enabled, node2 is up.
Negotiated failover enabled (network_interface).
node1>

And here's the node2 configuration of the cluster:

node2> cf status   
Cluster enabled, node1 is up.
Negotiated failover enabled (network_interface).
node2> options cf
cf.giveback.auto.cifs.terminate.minutes 5          
cf.giveback.auto.enable      off        
cf.giveback.auto.terminate.bigjobs on         
cf.giveback.check.partner    off        
cf.takeover.change_fsid      on         
cf.takeover.detection.seconds 10         
cf.takeover.on_disk_shelf_miscompare off        
cf.takeover.on_failure       on         
cf.takeover.on_network_interface_failure on         
cf.takeover.on_network_interface_failure.policy any_nic    (same value in local+partner recommended)
cf.takeover.on_panic         on         
cf.takeover.on_short_uptime  on         


node2> ifconfig -a
ns0: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.97.135 netmask 0xffffff00 broadcast 192.168.97.255
        partner inet 192.168.97.133 (not in use)
        ether 00:50:56:0f:25:e3 (Linux AF_PACKET socket)
        nfo enabled
ns1: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:50:56:10:25:e3 (Linux AF_PACKET socket)
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4064
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)

node2> cf status
Cluster enabled, node1 is up.
Negotiated failover enabled (network_interface).
node2>

However, when I down the ns0 interface on a node, nothing really happens:

node2> date; ifconfig ns0 down
Sun Aug 10 14:43:53 GMT 2008

node2> ifconfig -a                  
ns0: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.97.135 netmask 0xffffff00 broadcast 192.168.97.255
        partner inet 192.168.97.133 (not in use)
        ether 00:50:56:0f:25:e3 (Linux AF_PACKET socket)
        nfo enabled
ns1: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:50:56:10:25:e3 (Linux AF_PACKET socket)
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4064
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)


node2> ping 192.168.97.133
ping: wrote 192.168.97.133 64 chars, error=Network is down
ping: wrote 192.168.97.133 64 chars, error=Network is down
ping: wrote 192.168.97.133 64 chars, error=Network is down
ping: wrote 192.168.97.133 64 chars, error=Network is down
ping: wrote 192.168.97.133 64 chars, error=Network is down

coffee break..........

node2> date
Sun Aug 10 14:51:29 GMT 2008

node2> cf status   
Cluster enabled, node1 is up.
Negotiated failover enabled (network_interface).

node2>

What am I missing ?

kusek · ‎2008-08-10

Great question,

It is my understanding that NFO on a network interface operates on link failure, as opposed to administratively downing the interface.

I'll need to confirm this in my lab to validate it, but that is how I understood it to operate.

(I'd hate if my cluster just up and failed on me while I'm doing admin work against it, vs actually downing and destroying interfaces)

Let me know if you see something else, and either way, I'll try to lab this sometime next week.

Christopher

philiparnason · ‎2008-08-10

By chance I was testing the same thing today. We don't have enough NICs in our 3100 to do a multimode VIF, so we had to use NFO for failover. I disabled the switch port and it worked fine; you may want to try that instead. One thing I noticed is that after reboot and giveback, the ifconfig -a output shows that the NFO configuration is gone. Not sure if this is by design, but that means it needs to be reconfigured after any takeover event.

Philip Arnason

philiparnason · ‎2008-08-10

To answer my own question, here is a quote from the documentation:

***

You must include this option in the /etc/rc file for it to persist across reboots.

***
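
As an illustration of what that means in practice, the ifconfig line for the interface in /etc/rc would carry the nfo keyword. The fragment below is a sketch only: it reuses the ns0 name and the addresses from node1's ifconfig output earlier in this thread, and the rest of a real /etc/rc file will differ.

```
# /etc/rc fragment (sketch; values taken from node1's ifconfig output above)
ifconfig ns0 192.168.97.133 netmask 255.255.255.0 partner 192.168.97.135 nfo
```

With the keyword on that line, the setting is reapplied every time the file is processed at boot, rather than having to be re-entered by hand.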

Philip Arnason

rkaramchedu1 · ‎2008-08-10

With a simulator, I do not know how to test a "failed" NIC without actually downing it (it is actually a simulator in a VM on a Mac).

I knew that it is not persistent across reboots.

Now, I am drawing a blank, but if we configure a VIF (single or multimode) and there is a VIF failure, it would cause a CF event even without NFO enabled, correct? I ask because I noticed that one can enable NFO on a VIF, and it got me thinking about its use cases.

The active/active guide mentions it, but it would help if they added some clarity to it, for example by saying "Yes, if set up to do so using NFO" as opposed to just "Yes, if set up to do so".

http://now.netapp.com/NOW/knowledge/docs/ontap/rel7251/html/ontap/cluster/failing_over/reference/r_oc_fo_failover-events.html

kusek · ‎2008-08-10

I was going to say the same thing: you do need to commit the configuration to the rc file in order for it to be retained across reboots (and especially to be retained during a takeover).

The way I understand it, if NFO is enabled on a VIF (single or multi) and the entire VIF fails, it may generate an NFO event. However, if the filer isn't configured to fail over as a result of the NFO event, no failover will occur.

The NFO switch can be dangerous if not properly implemented in production scenarios, because you're then dealing with outside forces (networking) that could cause your filer to change its operational state. I would advise enabling this setting only under certain circumstances and configurations.

anilmnayak · ‎2011-02-09

Why doesn't an "ifconfig vif down" trigger an NFO failover?

I am trying to test network interface failure failover before I release the filer for production but do not see failover happening.

Any help here ?

Thanks

-Anil

pratipalsingh001 · ‎2011-11-04

Hi Guys,

Has anyone had any success with this? I am facing the same issue with 2 NetApps.

My NetApps are going into production next week.

Any help would be really appreciated.

Regards,

Pratipal

grantklin · ‎2012-02-22

Just tested this. You must simulate a link breakage from the switch: pull out the cable or down the port on the switch.

If you down the interface and then assign it 0.0.0.0, it will down the link, but the failover still will not work properly.
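
A test along those lines could be sketched as follows, using only commands that already appear in this thread plus cf giveback. It assumes node2's ns0 is the NFO-enabled interface and that its switch port is disabled out of band; the switch-side step is vendor-specific and is not shown. No output is included because none was captured here.

```
node1> cf status
node2> ifconfig -a
  (disable node2's ns0 port on the switch, or pull the cable)
node1> cf status
  (restore the link before continuing)
node1> cf giveback
```

The final cf giveback only applies if a takeover actually occurred; comparing the two cf status results on node1 is what shows whether the link failure triggered one.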