Simulator Discussions
I have a cluster simulator running on which I am trying to test the Negotiated Failover (NFO) functionality for network interfaces. As background, NFO can be enabled on a physical interface and configured so that if an NFO-enabled interface fails on the partner, a CF event occurs (at least in theory).
Here is the node1 configuration of the cluster:
node1> options cf
cf.giveback.auto.cifs.terminate.minutes 5
cf.giveback.auto.enable off
cf.giveback.auto.terminate.bigjobs on
cf.giveback.check.partner off
cf.takeover.change_fsid on
cf.takeover.detection.seconds 10
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure on
cf.takeover.on_network_interface_failure on
cf.takeover.on_network_interface_failure.policy any_nic (same value in local+partner recommended)
cf.takeover.on_panic on
cf.takeover.on_short_uptime on
node1> ifconfig -a
ns0: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.97.133 netmask 0xffffff00 broadcast 192.168.97.255
        partner inet 192.168.97.135 (not in use)
        ether 00:50:56:1b:03:f8 (Linux AF_PACKET socket)
        nfo enabled
ns1: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:50:56:1c:03:f8 (Linux AF_PACKET socket)
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4064
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)
node1> cf status
Cluster enabled, node2 is up.
Negotiated failover enabled (network_interface).
node1>
And here's the node2 configuration of the cluster:
node2> cf status
Cluster enabled, node1 is up.
Negotiated failover enabled (network_interface).
node2> options cf
cf.giveback.auto.cifs.terminate.minutes 5
cf.giveback.auto.enable off
cf.giveback.auto.terminate.bigjobs on
cf.giveback.check.partner off
cf.takeover.change_fsid on
cf.takeover.detection.seconds 10
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure on
cf.takeover.on_network_interface_failure on
cf.takeover.on_network_interface_failure.policy any_nic (same value in local+partner recommended)
cf.takeover.on_panic on
cf.takeover.on_short_uptime on
node2> ifconfig -a
ns0: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.97.135 netmask 0xffffff00 broadcast 192.168.97.255
        partner inet 192.168.97.133 (not in use)
        ether 00:50:56:0f:25:e3 (Linux AF_PACKET socket)
        nfo enabled
ns1: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:50:56:10:25:e3 (Linux AF_PACKET socket)
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4064
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)
node2> cf status
Cluster enabled, node1 is up.
Negotiated failover enabled (network_interface).
node2>
However, when I down the ns0 interface on a node, nothing really happens.
node2> date; ifconfig ns0 down
Sun Aug 10 14:43:53 GMT 2008
node2> ifconfig -a
ns0: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.97.135 netmask 0xffffff00 broadcast 192.168.97.255
        partner inet 192.168.97.133 (not in use)
        ether 00:50:56:0f:25:e3 (Linux AF_PACKET socket)
        nfo enabled
ns1: flags=808042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:50:56:10:25:e3 (Linux AF_PACKET socket)
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4064
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)
node2> ping 192.168.97.133
ping: wrote 192.168.97.133 64 chars, error=Network is down
ping: wrote 192.168.97.133 64 chars, error=Network is down
ping: wrote 192.168.97.133 64 chars, error=Network is down
ping: wrote 192.168.97.133 64 chars, error=Network is down
ping: wrote 192.168.97.133 64 chars, error=Network is down
coffee break..........
node2> date
Sun Aug 10 14:51:29 GMT 2008
node2> cf status
Cluster enabled, node1 is up.
Negotiated failover enabled (network_interface).
node2>
What am I missing?
Great question,
It is my understanding that NFO on a network interface operates on link failure, as opposed to administratively downing the interface.
I'll need to confirm this in my lab to validate it, but that is how I understood it to operate.
(I'd hate if my cluster just up and failed on me while I'm doing admin work against it, vs actually downing and destroying interfaces)
Let me know if you see something else, and either way, I'll try to lab this sometime next week.
Christopher
By chance I was testing the same thing today. We don't have enough NICs in our 3100 to do a multimode vif, so we had to use NFO for failover. I disabled the switch port and it worked fine; you may want to try that instead. One thing I noticed is that after reboot and giveback, ifconfig -a shows that the nfo configuration is gone. Not sure if this is by design, but it means NFO needs to be reconfigured after any takeover event.
Philip Arnason
To answer my own question, here is a quote from the documentation:
***
You must include this option in the /etc/rc file for it to persist across reboots.
***
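As an illustration only, the /etc/rc entry for node1 might look like the sketch below. It assumes the addresses shown in the ifconfig output earlier in the thread and that ns0 is configured with a single ifconfig line; a real /etc/rc normally contains other lines (hostname, routes and so on) that are not shown here.

```
# /etc/rc on node1 (sketch; addresses copied from the ifconfig -a output above)
# The trailing "nfo" keyword turns negotiated failover back on for ns0 at every boot.
ifconfig ns0 192.168.97.133 netmask 255.255.255.0 partner 192.168.97.135 nfo
```

node2 would carry the mirror-image line, with 192.168.97.135 as its own address and 192.168.97.133 as the partner.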
Philip Arnason
With a simulator, I do not know how to test a "failed" NIC without actually downing it. (It is actually a simulator in a VM on a Mac.)
I knew that it is not persistent across reboots.
Now, I am drawing a blank, but if we configure a VIF (single or multimode) and there is a VIF failure, it would cause a CF event even without NFO enabled, correct? I noticed that one can enable NFO on a VIF, and it got me thinking about its use cases.
The active/active guide mentions it, but it would help if they added some clarity. For example, say "Yes, if set up to do so using NFO" as opposed to just "Yes, if set up to do so".
I was going to say the same thing: you do need to commit the configuration to the rc file in order for it to be retained across reboots (and especially to be retained during a takeover).
The way I understand it, if the entire VIF (single or multimode) fails, it may generate an NFO event, especially if NFO is enabled on the interface. But if the filer isn't configured to fail over as a result of the NFO event, no failover will occur.
Using NFO can be dangerous in production if it is not properly implemented, because you are then dealing with outside forces (networking) that could cause your filer to change its operational state. I would advise enabling this setting only under certain circumstances and configurations.
Why doesn't an "ifconfig vif down" trigger an NFO failover?
I am trying to test failover on network interface failure before I release the filer for production, but I do not see failover happening.
Any help here?
Thanks
-Anil
Hi Guys,
Has anyone had any success with this? I am facing the same issue with two NetApps.
My NetApps are going into production next week.
Any help will be really appreciated.
Regards,
Pratipal
Just tested this. You must simulate a link breakage from the switch: pull out the cable or down the port on the switch.
If you down the interface and then assign it 0.0.0.0, the link goes down, but the failover still will not work properly.