Issues with clustering on a FAS2050

npcarlson · ‎2009-09-08

Hi,

I've got a FAS2050 which is no longer covered by a service contract. It has dual controllers; however, clustering is refusing to work. After setting up the partner system IDs, etc., I've got it to the point where if I run 'cf enable', the cluster comes up, but the logs never sync up. One of the controllers (c1) throws some odd errors:

c1> cf enable
c1>

Tue Sep 8 21:57:39 GMT [c1: cf.misc.operatorEnable:warning]: Cluster monitor: operator initiated enabling of cluster
Tue Sep 8 21:57:39 GMT [c1: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of c0 disabled (cluster takeover disabled by partner)
Tue Sep 8 21:57:39 GMT [c1: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of c1 by c0 disabled (unsynchronized log)
Tue Sep 8 21:57:40 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:57:43 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:57:45 GMT [c1: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of c0 disabled (unsynchronized log)
Tue Sep 8 21:57:47 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:57:51 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:58:01 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2

The errors I'm wondering about are the 'Interconnect nic 0 has error on VI <...>' errors - I only see them on this head, and not the other one (c0). Swapping the positions of the controllers makes no difference. Is this likely a hardware issue with the integrated Infiniband controller on c1, or could it be something else?
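To back up the "only on this head" observation with numbers, the console output from both controllers can be tallied by node and event name. The following is a minimal sketch, not a NetApp tool: the regular expression is an assumption based only on the line format shown in the log above, and `count_events` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt in this thread:
#   <timestamp> [<node>: <event.name>:<severity>]: <message>
LINE_RE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]: (?P<msg>.*)"
)

def count_events(log_text):
    """Return a Counter keyed by (node, event) for every line that matches."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            counts[(m.group("node"), m.group("event"))] += 1
    return counts

# Three lines copied from the log above, used as sample input.
sample = """\
Tue Sep 8 21:57:39 GMT [c1: cf.misc.operatorEnable:warning]: Cluster monitor: operator initiated enabling of cluster
Tue Sep 8 21:57:40 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:57:43 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
"""

for (node, event), n in sorted(count_events(sample).items()):
    print(node, event, n)
```

Feeding it the saved console output of both c0 and c1 would show at a glance whether cf.nm.nicViError really is confined to one node.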

Thanks!

npcarlson · ‎2009-09-08

I suppose I should cover what I've already done. I read the post at http://communities.netapp.com/message/9031 and tried the things mentioned there. The interconnect is integrated in a FAS2050, so there is nothing I can do about the cable, and there is only a single cluster interconnect. I wiped all disks in the system clean and started from scratch with new mailbox disks, etc.; that did not help. Reseating the controllers did not help either. I've tried pretty much everything I can think of!

danielpr · ‎2009-09-08

Carlson,

Can you share the /etc/rc configuration file for both the nodes?

Thanks;

Daniel

npcarlson · ‎2009-09-08

Certainly..

c0 (one without the error):

#Auto-generated by setup Fri Sep 4 23:15:09 GMT 2009
hostname c0
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.34
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.36
route add default 192.168.0.30 1
routed on
options dns.enable off
options nis.enable off
savecore

c1 (one with the error):

#Auto-generated by setup Fri Sep 4 23:17:55 GMT 2009
hostname c1
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.33
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.35
route add default 192.168.0.30 1
routed on
options dns.enable off
options nis.enable off
savecore
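For comparing the two rc files side by side, here is a rough sketch that pulls the `partner` argument out of each `ifconfig` line and flags any interface where both nodes name the same partner IP (each node should point at the other node's address, so identical IPs would be suspicious). The helper names `partner_map` and `same_partner_ip` are made up for illustration. It cannot check the local addresses, because `hostname`-e0a is resolved through /etc/hosts, which is not shown in this thread.

```python
import ipaddress

def partner_map(rc_text):
    """Map interface name -> partner argument for each ifconfig line."""
    partners = {}
    for line in rc_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == "ifconfig" and "partner" in tokens[2:]:
            idx = tokens.index("partner", 2)
            if idx + 1 < len(tokens):
                partners[tokens[1]] = tokens[idx + 1]
    return partners

def is_ip(value):
    """True if value parses as an IP address (partner may also be a name)."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

def same_partner_ip(rc_a, rc_b):
    """Interfaces where both nodes list the identical partner IP."""
    a, b = partner_map(rc_a), partner_map(rc_b)
    return sorted(i for i in a.keys() & b.keys() if a[i] == b[i] and is_ip(a[i]))

# ifconfig lines copied from the two rc files above.
c0_rc = """\
hostname c0
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.34
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.36
"""
c1_rc = """\
hostname c1
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.33
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.35
"""

print(partner_map(c0_rc))
print(partner_map(c1_rc))
print(same_partner_ip(c0_rc, c1_rc))
```

Partner values given as interface names are deliberately ignored by the comparison, since the same name on both nodes is legitimate.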

npcarlson · ‎2009-09-08

Also, here is the output of 'cf monitor' and 'cf status':

c1> cf status
c0 is up, takeover disabled because of reason (unsynchronized log)
c1 has disabled takeover by c0 (unsynchronized log)
VIA Interconnect is down (link up).

c1> cf monitor
current time: 09Sep2009 04:40:28
UP 00:08:12, partner 'c0', cluster monitor enabled
VIA Interconnect is down (link up), takeover capability off-line (unsynchronized log)
takeover by partner off-line (unsynchronized log)
partner update TAKEOVER_DISABLED (09Sep2009 04:40:26)

Then, in 'priv set diag', the output of 'cf monitor all':

cf: Current monitor status (09Sep2009 04:41:54):
partner 'c0', VIA Interconnect is down (link up)
state UP, time 578790, event CHECK_FSM, elem ChkMbValid (13)
mirrorConsistencyRequired TRUE
takeoverByPartner 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
mirrorEnabled TRUE, lowMemory FALSE, memio UNINIT, killPackets TRUE
degraded FALSE, reservePolicy ALWAYS_AFTER_TAKEOVER, resetDisks TRUE
hw_assist status:
hw_assist Inactive on c1: c1 not monitoring alerts from partner(c0)
hw_assist Inactive on c0: c0 not monitoring alerts from partner c1
timeouts:
fast 1000, slow 0, mailbox 2500, connect 0
operator 600000, firmware 0 (recvd 15000), dumpcore 576790
booting 300000 (recvd 0)
transit timer enabled TRUE, transit 600000 (last 0)
mailbox disks:
Disk 0c.09.4 is a local mailbox disk
Disk 0c.09.5 is a local mailbox disk
Disk 0c.09.0 is a partner mailbox disk
Disk 0c.09.1 is a partner mailbox disk
primary state:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 237, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 1252471313, sk_time 577790
channel_status 0
channel CHANNEL_IC, abs_time 1252471309, sk_time 573790
channel_status 5
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
backup state:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 4950, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 1252471313, sk_time 577260
channel_status 0
Channel Read Ctx:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 4950, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_IC, abs_time 0, sk_time 0
channel_status 3
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
takeoverState FT_NONE, takeoverString 'No takeover information'
givebackState FT_NONE, givebackString 'No giveback information'
givebackRetries 0, givebackRequested FALSE
autoGivebackEnabled FALSE, autoGivebackWasDone FALSE, autoGivebackCifsStopping FALSE
autoGivebackLastVetoCheck 0, autoGivebackAttemptsExceeded FALSE
Maximum primary disk mailbox io times: normal = 245, transition = 0
Maximum backup disk mailbox io times: normal = 307, transition = 0
Num times logs unsynced : 0
Total system uptime: 579079 msec

rtowry1234 · ‎2009-09-09

Don't know if this will help, but I had a problem when I configured our FAS2050 and used an IP address as the name of the virtual interface (used for Cisco EtherChannel). Once I gave it a name (instead of an IP address), it worked.

danielpr · ‎2009-09-09

Randy,

It looks like some kind of config issue to me, before I would call it a hardware issue. Have you had a duplicate IP address on the network at any time?

Thanks;

Daniel

rtowry1234 · ‎2009-09-09

When I configured the clustering, I specified the partner as the IP address. I tested failover, and it wouldn't work. So I went back and specified the partner address as the interface name: topvif for the bottom controller, and botvif for the top controller.

After that, it worked.

npcarlson · ‎2009-09-10

When you say "I specified the partner as the ip address" -- do you mean in the IP takeover section, or elsewhere?

npcarlson · ‎2009-09-10

Hi Daniel,

Was this supposed to be addressed at me?

There have not been duplicate IPs on the network.

Thanks!

-Nate

danielpr · ‎2009-09-11

Nate,

If you don't find any duplicate address warning on the console then it should be fine. But you should always check the configuration side carefully.

I will try to find out more about this issue and post an update if I have any.

Thanks

Daniel

npcarlson · ‎2009-09-10

Hmm, interesting. We don't use an etherchannel (virtual interface), just a single IP.. but can you post a config example of the issue you had (before/after)?

rtowry1234 · ‎2009-09-10

I can't easily get the configuration. But the fix was specifying the partner interface name instead of the IP address. I forget where, though (it was over a year ago).

npcarlson · ‎2009-09-10

OK, so I tried doing:

ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner e0a

..specifying the name of the partner interface instead of the IP. I made this change on both nodes, still the exact same error.

fcorfdir · ‎2009-09-10

If you want to set up an EtherChannel, your rc should look something like this:

hostname toto
ifconfig e0a down
ifconfig e0b down
vif create lacp toto_trunck -b ip e0a e0b
vlan create toto_trunck 26 1
ifconfig toto_trunck-1 172.16.0.5 netmask 255.255.255.0 mtusize 1500 partner 172.16.0.6 -wins
ifconfig toto_trunck-26 192.168.26.5 netmask 255.255.255.0 mtusize 1500 partner 192.168.26.6 -wins
route add default 172.16.0.254 1
routed on
options dns.domainname toto.intranet
options dns.enable on
options nis.enable off
savecore

My node name is toto. I have created an EtherChannel named toto_trunck, added two VLANs (1 and 26), and set an IP and the partner IP on each VLAN.

Here is the Cisco config (don't forget to create the VLANs):
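Since an rc file like this depends on every `ifconfig` line using exactly the name that `vif create` and `vlan create` produced, a small lint can catch a mistyped interface name before boot. This is only a sketch under assumptions: `undefined_interfaces` is a hypothetical helper, it assumes `vlan create <ifname> <id> ...` yields interfaces named `<ifname>-<id>` as the example above suggests, it does not handle command options placed before the interface name, and the sample uses made-up names (`vif0`, with a deliberate typo `vi0-26`).

```python
def undefined_interfaces(rc_text, physical=("e0a", "e0b")):
    """Return ifconfig targets that no earlier line defined."""
    known = set(physical)
    missing = []
    for line in rc_text.splitlines():
        t = line.split()
        if len(t) >= 4 and t[0] == "vif" and t[1] == "create":
            # vif create <mode> <name> ...
            known.add(t[3])
        elif len(t) >= 4 and t[0] == "vlan" and t[1] == "create":
            # vlan create <ifname> <id> [<id> ...] -> <ifname>-<id>
            known.update(f"{t[2]}-{vid}" for vid in t[3:])
        elif len(t) >= 2 and t[0] == "ifconfig" and t[1] not in known:
            missing.append(t[1])
    return missing

# Hypothetical rc fragment; the last line misspells the VLAN interface.
sample_rc = """\
ifconfig e0a down
ifconfig e0b down
vif create lacp vif0 -b ip e0a e0b
vlan create vif0 26 1
ifconfig vif0-1 172.16.0.5 netmask 255.255.255.0 partner 172.16.0.6
ifconfig vi0-26 192.168.26.5 netmask 255.255.255.0 partner 192.168.26.6
"""

print(undefined_interfaces(sample_rc))
```

The `physical` tuple has to be adjusted to the ports the controller actually has.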

interface GigabitEthernet0/16
 description toto e0a
 switchport trunk native vlan 9
 switchport mode trunk
 channel-group 3 mode active
end

cata-giga#sh run int gig 0/22
Building configuration...

Current configuration : 145 bytes
!
interface GigabitEthernet0/22
 description toto e0b
 switchport trunk native vlan 9
 switchport mode trunk
 channel-group 3 mode active
end

cata-giga#sh run int po 3
Building configuration...

Current configuration : 111 bytes
!
interface Port-channel3
 description lacp toto
 switchport trunk native vlan 9
 switchport mode trunk

npcarlson · ‎2009-09-17

We replaced C1 with a new FAS2050 controller, and it all works perfectly now.