Something not quite right with my 2040 cluster

KSTEWARDSON · ‎2012-02-08

Something just isn't quite right with my cluster. I have a FAS2040. On the top controller, e0a and e0b are in a VIF; on the bottom controller, e0a and e0b are in a VIF as well. Both VIFs are set to the same network address.

They plug into a Dell PowerConnect 5524, which supports LACP and LAGs. The port mapping looks like this:

NodeT to 5524

e0a to port 1

e0b to port 2

NodeB to 5524

e0a to port 3

e0b to port 4

Then I set up ports 1, 2, 3, and 4 on the 5524 as Lag1 with LACP. So on the switch they're all clustered together. My goal here is to get 2 Gb/s to NodeT or NodeB, whichever is the active node.

Now I did some testing.

First, I will note that before I had all four ports in Lag1, I had two different LAGs, one for each node, and I was getting TCP/IP conflicts because the two nodes had the same address. Putting them together into one LAG fixed that, though I'm not sure that was the appropriate solution to the problem.

So I start a ping to the NetApp and get a response. The lights for the two ports connected to Node B are actively blinking, so B must be the active node. For my first test, I initiate a cf takeover on Node T. Node T becomes the active node with no interruption to the ping. However, I notice an oddity: the switch lights are still very active for Node B, as if the traffic is still on those ports. I expected the Node T lights to become the active ones. Then I do a cf giveback to hand control back to Node B. Again there is no interruption in the ping, and the switch lights remain active for the ports wired to Node B.

Next I test by restarting Node B. I execute a reboot command on the Node B console without first doing a cf takeover to hand control over to Node T. My ping dies. But here's the weird thing: the switch lights for Node T now act like they're processing the ping traffic, yet the ping gets no response. Once Node B comes back up, the ping remains dead and the lights for Node T remain active. Only after Node B is up and I do a cf takeover from Node T does my ping start to respond again. Then I do a cf giveback on Node T to hand control back to Node B. The ping remains uninterrupted, but the switch link lights for Node T remain active.

Can someone help me understand what I'm seeing here, or what's not working? Do the nodes not take over for each other automatically? Do they rely on me to do a cf takeover to move control around?

SEANMLUCE · ‎2012-02-08

I think the problem is that you have both heads set to the same IP. That is not how an active/active configuration works. Each controller should have a unique IP for each network you are connecting to. In the event of a takeover, the surviving controller takes on the IP identity of both controllers.

T Controller = e0a+e0b=vif1=192.168.1.100 -> lag1

B Controller = e0a+e0b=vif1=192.168.1.101 -> lag2

Switch ports 1+2=Lag1

Switch ports 3+4=Lag2

Even if you are going for an active/passive setup, each controller still needs its own unique presence on the network.

Also, by default a takeover will not occur on link failure. This is a configurable option though.
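For reference, a minimal sketch of how takeover on link failure is typically switched on in Data ONTAP 7-Mode. The interface name vif1 follows the example above. Whether this policy suits a given environment is an assumption, so check the documentation for your release before relying on it.

```
# Allow the partner to take over when monitored network interfaces fail
options cf.takeover.on_network_interface_failure on

# Trigger only when all monitored interfaces have failed
# (the alternative value is any_nic)
options cf.takeover.on_network_interface_failure.policy all_nics

# Mark vif1 as an interface to monitor for negotiated failover;
# add "nfo" to the interface's ifconfig line in /etc/rc so it survives a reboot
ifconfig vif1 nfo
```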

KSTEWARDSON · ‎2012-02-09

SEANMLUCE wrote:

I think the problem is that you have both heads set to the same IP. That is not how an active/active configuration works. Each controller should have a unique IP for each network you are connecting to. In the event of a takeover, the surviving controller takes on the IP identity of both controllers.

T Controller = e0a+e0b=vif1=192.168.1.100 -> lag1

B Controller = e0a+e0b=vif1=192.168.1.101 -> lag2

Switch ports 1+2=Lag1

Switch ports 3+4=Lag2

Even if you are going for an active/passive setup, each controller still needs its own unique presence on the network.

Also, by default a takeover will not occur on link failure. This is a configurable option though.

This sounded good, but when I tried the above config today, it didn't work. After I reconfigured, I was getting responses to pings on both 1.100 and 1.101. However, when I had Node B take over for Node T, my ping to Node T died until I did a giveback. To confirm the result, I had Node T take over Node B, and Node B's ping died until I did a giveback. So clustering doesn't seem to be working at all like this.

Does anyone have any tech docs that describe how NetApp dual-controller 2040 clustering works and how it should be set up?

SEANMLUCE · ‎2012-02-09

You need to set the partner interface in each controller's RC file.

I will get you some sample RC files shortly.

Sean Luce

KSTEWARDSON · ‎2012-02-09

Ok that makes sense, I had not touched any of the file system files directly yet. Looking forward to seeing some samples, thanks.

SEANMLUCE · ‎2012-02-09

Below is a sample RC file from one controller of an active/active pair.

If you don't feel comfortable modifying this file by hand, you can run 'setup' on each controller to generate these RC files for you. Modifying the settings in System Manager or FilerView will not be persistent. The important part of the RC file you are probably missing, if failover is not working correctly, is "partner vif1". This tells the controller which interface it will be taking over in case of a failover.

If you want, paste your RC files in here and I can show you what needs to be changed.

############

hostname myhostname

vif create lacp vif1 -b ip e0a e0b

ifconfig vif1 `hostname`-vif1 mediatype auto netmask 255.255.255.0 partner vif1

ifconfig e0M `hostname`-e0M netmask 255.255.255.0 mtusize 1500 trusted wins mediatype 100tx-fd-cfg_down flowcontrol full down

route add default 1.1.1.1

routed on

options dns.domainname my.domain

options dns.enable on

options nis.enable off

savecore

##########

aborzenkov · ‎2012-02-09

If you don't feel comfortable modifying this file by hand, you can run 'setup' on each controller to generate these RC files for you. Modifying the settings in System Manager or FilerView will not be persistent.

FilerView and System Manager do make persistent changes - that's what a configuration GUI is for, to avoid direct editing of configuration files. They also make the run-time changes at the same time, while editing /etc/rc directly requires a reboot or invoking the same commands manually (and I have seen enough cases where the correct command was used at the console but the line in /etc/rc contained a typo, so after a reboot nothing worked).

The only gotcha is that you of course can't modify interface settings while you are connected through that same interface.
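A sketch of the manual route described above, assuming the vif1 interface name and the illustrative 192.168.1.100 address from earlier in the thread. After editing /etc/rc, the same line is run by hand so the running configuration matches the file, and both are then checked. Given the gotcha above, this is best done from the serial console rather than over vif1 itself.

```
# Show what will be executed at the next boot
rdfile /etc/rc

# Apply the same interface line by hand so it takes effect without a reboot
ifconfig vif1 192.168.1.100 netmask 255.255.255.0 partner vif1

# Confirm the running state matches what the file says
ifconfig vif1
vif status vif1
```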

aborzenkov · ‎2012-02-08

Then I set up ports 1, 2, 3, and 4 on the 5524 as Lag1 with LACP.

That won't work. The two controllers in a FAS2040 are completely separate systems and cannot be part of the same LACP group. You can form a link aggregate only within a single controller.

And of course, as already mentioned, each controller must have a different IP address.

KSTEWARDSON · ‎2012-02-13

I've had some time to tinker. Here are my RC files as they currently stand:

RC on the top controller

#Auto-generated by setup Fri Feb 10 09:48:10 PST 2012

hostname NODET

vif create lacp -b ip vif0-nodeT e0a e0b

ifconfig vif0-nodeT 192.168.1.15 255.255.255.0 partner vif0-nodeB

ifconfig vif0-nodeT 192.168.1.15 up

route add default 10.200.4.1 1

routed on

options dns.

domainname my.domain.com

options dns.enable on

options nis.enable off

savecore

RC on Bottom controller

#Auto-generated by setup Fri Feb 10 09:48:10 PST 2012

hostname NODEB

vif create lacp -b ip vif0-nodeB e0a e0b

ifconfig vif0-nodeB 192.168.1.15 255.255.255.0 partner vif0-nodeT

ifconfig vif0-nodeB 192.168.1.15 up

route add default 10.200.4.1 1

routed on

options dns.

domainname my.domain.com

options dns.enable on

options nis.enable off

savecore

So right now, NodeT (top) and NodeB (bottom) are configured the same in the RC file, except that the partner parameter is different in each. Each is configured into its own LAG on the switch, and the status shows TCP/IP conflicts, which I would expect. I'm basically back to the original post's setup at this point.

When I test with pings, though, and do a cf takeover and cf giveback on each node, the ping never drops, despite the TCP/IP address conflict messages appearing on the console.

The only real difference I see is that the sample RC file has this line, which mine does not:

ifconfig e0M `hostname`-e0M netmask 255.255.255.0 mtusize 1500 trusted wins mediatype 100tx-fd-cfg_down flowcontrol full down

I'm not sure what port this is but I notice you've got it configured as down.

SEANMLUCE · ‎2012-02-15

You still have both controllers set to the same IP address (.15). Each controller (top and bottom) must have its own identity on the network. When one controller takes over for the other it takes on the identity of both controllers.
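As a sketch of what that would look like for the two RC files posted above, the only substantive change is a distinct address per controller. The .16 address for NODEB is a hypothetical choice, and the netmask keyword is written out as in the earlier sample file.

```
# /etc/rc on NODET (top controller)
vif create lacp vif0-nodeT -b ip e0a e0b
ifconfig vif0-nodeT 192.168.1.15 netmask 255.255.255.0 partner vif0-nodeB

# /etc/rc on NODEB (bottom controller)
# 192.168.1.16 is a made-up example address, not taken from the thread
vif create lacp vif0-nodeB -b ip e0a e0b
ifconfig vif0-nodeB 192.168.1.16 netmask 255.255.255.0 partner vif0-nodeT
```

Each vif then pairs with its own LAG on the switch (ports 1+2 and ports 3+4), as laid out earlier in the thread.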

Do not think of either node as primary or secondary. NetApp controller pairs only support Active/Active configurations.

There are some situations where people do a pseudo-active/passive config. For example, the "passive" controller owns only 4 disks (a 3-disk aggregate for rootvol/OS plus 1 spare), and the active controller owns the rest. All clients point to the active controller for storage. The passive controller only really does anything if the primary controller fails. At that time the passive controller takes on the identity of both controllers and continues serving data on behalf of the primary controller.

SEANMLUCE · ‎2012-02-15

You have the following 2 lines in your RC file:

options dns.

options dns.enable on

I believe the first one somehow got fragmented and can be removed (remove the line "options dns.").
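Another reading, going by the sample RC file earlier in the thread (which contains "options dns.domainname my.domain"), is that "options dns." and the "domainname my.domain.com" line after it are a single command that got split across two lines. Under that assumption, the rejoined lines would be:

```
options dns.domainname my.domain.com
options dns.enable on
```

Removing only "options dns." would leave a bare "domainname my.domain.com" line behind, so rejoining the two halves looks like the safer fix.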

KSTEWARDSON · ‎2012-02-14

Here are some errors I'm getting when NodeT is the primary and I do a cf takeover on NodeB. It says my 'mime' file is missing, so I'll have to dig into that.

Tue Feb 14 09:20:30 PST [NODET/NODEB: httpd.config.mime.missing:warning]: /etc/httpd.mimetypes file is missing.

vif: vif0-nodeT is not mapped to a local vif

vif: vif1-nodeT is not mapped to a local vif

ifconfig: vif0-nodeT: no such interface

Tue Feb 14 09:20:31 PST [NODET/NODEB: net.ifconfig.noPartner:error]: ifconfig: 'vif0-nodeT' cannot be configured: Address does not match any partner interface.

ifconfig: vif1-nodeT: no such interface

ifconfig: vif0-nodeT: no such interface

ifconfig: vif1-nodeT: no such interface

Tue Feb 14 09:20:31 PST [NODET/NODEB: net.ifconfig.noPartner:error]: ifconfig: 'vif1-nodeT' cannot be configured: Address does not match any partner interface.

add net default: gateway 10.200.4.1: network unreachable

Tue Feb 14 09:20:31 PST [NODET/NODEB: net.ifconfig.noPartner:error]: ifconfig: 'vif0-nodeT' cannot be configured: Address does not match any partner interface.

Tue Feb 14 09:20:31 PST [NODET/NODEB: net.ifconfig.noPartner:error]: ifconfig: 'vif1-nodeT' cannot be configured: Address does not match any partner interface.

Tue Feb 14 09:20:31 PST [NODET/NODEB: ip.drd.vfiler.info:info]: Although vFiler units are licensed, the routing daemon runs in the default IP space only.

KSTEWARDSON · ‎2012-02-14

Today I tested the following changes to my RC file on NodeB:

#Auto-generated by setup Fri Feb 10 09:48:10 PST 2012

hostname NODEB

vif create lacp -b ip vif0-nodeB e0a e0b

ifconfig vif0-nodeB partner vif0-nodeT

route add default 10.200.4.1 1

routed on

options dns.

domainname my.domain.com

options dns.enable on

options nis.enable off

savecore

This seems to work great failing over and back. I get half a dozen dead ping responses, then it starts replying again on the other node, and the same when it goes back. Is there a better way to do this? I think this sets it up in active/passive mode, correct?

aborzenkov · ‎2012-02-14

Takeover and giveback are both basically a reboot of the filer. A filer cannot answer pings while it is being rebooted, so it is normal that some ping requests are lost.

Regarding "this sets it up in active/passive mode correct?" - what "this" do you mean?

KSTEWARDSON · ‎2012-02-15

'This' meaning the NodeB RC file change I made with this command: ifconfig vif0-nodeB partner vif0-nodeT

Since no IP address is assigned to NodeB, it's not available in parallel with NodeT. It must fail over and configure itself based on its partner, NodeT.

aborzenkov · ‎2012-02-17

You can probably call it an active/passive setup, yes. But do not forget that NodeB still has its own address and is available under it.

You seem to be treating (or trying to treat) the NetApp HA pair as a single system. It is not. It is two independent systems configured as a mutual failover cluster.

SEANMLUCE · ‎2012-02-15

To properly test cluster failover and giveback, open 2 cmd windows and start a ping -t to each controller. Initiate takeover from the top controller. You should see a few pings lost and then it should recover. Typically you should see only 3 or 4 lost if everything is configured correctly.

Reboot the bottom controller

Initiate giveback from the top controller

Repeat the process from the bottom controller.
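The test sequence above can be sketched as follows. The two ping targets are the illustrative addresses from earlier in the thread, not real ones, and lines starting with # are annotations rather than commands to type.

```
# On a Windows client, one cmd window per controller
ping -t 192.168.1.100
ping -t 192.168.1.101

# On the top controller's console
# Confirm failover is enabled and the partner is up
cf status
# Take over the bottom controller
cf takeover
# Check that this node now reports it has taken over its partner
cf status
# Hand resources back once the partner is waiting for giveback
cf giveback
```

The same sequence is then repeated from the bottom controller's console, watching both ping windows each time.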

SEANMLUCE · ‎2012-02-15

One more thing... have you run WireGauge against the configuration?

If not, download the latest version of WireGauge from the NOW site. It tests both controllers for proper cabling and proper cluster configuration, and it will tell you what is wrong with your configuration.

KSTEWARDSON · ‎2012-02-15

Thanks for the tip on WireGauge, I'll go seek that out right now.