Issues with clustering on a FAS2050

npcarlson

Hi,

I've got a FAS2050 which is no longer covered by a service contract. It has dual controllers; however, clustering is refusing to work. After setting up the partner system IDs, etc., I've got it to the point where if I run 'cf enable', the cluster comes up, but the logs never sync up. One of the controllers (c1) throws some odd errors:

c1> cf enable
c1>

Tue Sep  8 21:57:39 GMT [c1: cf.misc.operatorEnable:warning]: Cluster monitor: operator initiated enabling of cluster
Tue Sep  8 21:57:39 GMT [c1: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of c0 disabled (cluster takeover disabled by partner)
Tue Sep  8 21:57:39 GMT [c1: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of c1 by c0 disabled (unsynchronized log)
Tue Sep  8 21:57:40 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep  8 21:57:43 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep  8 21:57:45 GMT [c1: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of c0 disabled (unsynchronized log)
Tue Sep  8 21:57:47 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep  8 21:57:51 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep  8 21:58:01 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2

The errors I'm wondering about are the 'Interconnect nic 0 has error on VI <...>' errors. I only see them on this head (c1), and not on the other one (c0). Swapping the positions of the controllers makes no difference. Is this likely a hardware issue with the integrated InfiniBand controller on c1, or could it be something else?
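To compare the two heads quickly, the interconnect errors can be tallied per node from a saved console log. This is only a minimal sketch: the function name `count_events` and the sample list are made up for illustration, and the regex assumes the `[node: event:severity]:` layout seen in the lines above.

```python
import re
from collections import Counter

# Assumed line layout, taken from the console output in this thread:
#   Tue Sep  8 21:57:40 GMT [c1: cf.nm.nicViError:info]: <message>
LINE_RE = re.compile(
    r"\[(?P<node>[^:\]]+):\s*(?P<event>[\w.]+):(?P<sev>\w+)\]:\s*(?P<msg>.*)"
)

def count_events(lines, event="cf.nm.nicViError"):
    """Count how many times `event` was logged by each node (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("event") == event:
            counts[m.group("node")] += 1
    return counts

# Three lines copied from the log above.
SAMPLE = [
    "Tue Sep  8 21:57:39 GMT [c1: cf.misc.operatorEnable:warning]: Cluster monitor: operator initiated enabling of cluster",
    "Tue Sep  8 21:57:40 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2",
    "Tue Sep  8 21:57:43 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2",
]

if __name__ == "__main__":
    for node, n in count_events(SAMPLE).items():
        print(f"{node}: {n} nicViError event(s)")
```

Running it against the logs of both heads would show whether the errors really are confined to one controller.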

Thanks!


npcarlson

I suppose I should cover what I've already done. I read the post at http://communities.netapp.com/message/9031 and tried the things mentioned there. The interconnect is integrated in a FAS2050, so there is nothing I can do about the cable, and there is only a single cluster interconnect. I wiped all disks in the system clean and started from scratch with new mailbox disks, etc.; that did not help. I reseated the controllers; that did not help either. I've tried pretty much everything I can think of!

danielpr

Carlson,

Can you share the /etc/rc configuration file for both the nodes?

Thanks;

Daniel

npcarlson

Certainly.

c0 (one without the error):

#Auto-generated by setup Fri Sep  4 23:15:09 GMT 2009
hostname c0
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.34
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.36
route add default 192.168.0.30 1
routed on
options dns.enable off
options nis.enable off
savecore

c1 (one with the error):

#Auto-generated by setup Fri Sep  4 23:17:55 GMT 2009
hostname c1
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.33
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.35
route add default 192.168.0.30 1
routed on
options dns.enable off
options nis.enable off
savecore
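Since the question is whether the two rc files agree with each other, here is a rough sketch of a cross-check. The helper names `partner_map` and `cross_check` are invented for this example, and the only rule it applies is an assumption on my part: both nodes should carry a partner clause on the same interfaces, and the two nodes should not name the same partner IP.

```python
import ipaddress

def partner_map(rc_text):
    """Return {interface: partner} for each ifconfig line that has a partner clause."""
    result = {}
    for raw in rc_text.splitlines():
        tokens = raw.split()
        if len(tokens) >= 2 and tokens[0] == "ifconfig" and "partner" in tokens:
            idx = tokens.index("partner")
            if idx + 1 < len(tokens):
                result[tokens[1]] = tokens[idx + 1]
    return result

def is_ip(value):
    """True if value parses as an IP address (partner may also be an interface name)."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

def cross_check(rc_a, rc_b):
    """List mismatches between two nodes' partner clauses (assumed rule, see above)."""
    a, b = partner_map(rc_a), partner_map(rc_b)
    problems = []
    for iface in sorted(set(a) | set(b)):
        if iface not in a or iface not in b:
            problems.append(f"{iface}: partner clause present on only one node")
        elif is_ip(a[iface]) and a[iface] == b[iface]:
            problems.append(f"{iface}: both nodes name the same partner {a[iface]}")
    return problems

# The ifconfig lines from the two rc files posted above.
RC_C0 = """hostname c0
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.34
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.36
"""
RC_C1 = """hostname c1
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.33
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.35
"""

if __name__ == "__main__":
    print(cross_check(RC_C0, RC_C1) or "no mismatches found")
```

It cannot confirm that each partner address is really the other node's own address, because `hostname`-e0a is resolved through /etc/hosts, which was not posted.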

npcarlson

Also, here is the output of 'cf monitor' and 'cf status':

c1> cf status
c0 is up, takeover disabled because of reason (unsynchronized log)
c1 has disabled takeover by c0 (unsynchronized log)
VIA Interconnect is down (link up).

c1> cf monitor
current time: 09Sep2009 04:40:28
UP 00:08:12, partner 'c0', cluster monitor enabled
VIA Interconnect is down (link up), takeover capability off-line (unsynchronized log)
takeover by partner off-line (unsynchronized log)
partner update TAKEOVER_DISABLED (09Sep2009 04:40:26)

Then, in 'priv set diag', the output of 'cf monitor all':

cf: Current monitor status (09Sep2009 04:41:54):
partner 'c0', VIA Interconnect is down (link up)
state UP, time 578790, event CHECK_FSM, elem ChkMbValid (13)
mirrorConsistencyRequired TRUE
takeoverByPartner 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
mirrorEnabled TRUE, lowMemory FALSE, memio UNINIT, killPackets TRUE
degraded FALSE, reservePolicy ALWAYS_AFTER_TAKEOVER, resetDisks TRUE
hw_assist status:
hw_assist Inactive on c1: c1 not monitoring alerts from partner(c0)
hw_assist Inactive on c0: c0 not monitoring alerts from partner c1
timeouts:
fast 1000, slow 0, mailbox 2500, connect 0
operator 600000, firmware 0 (recvd 15000), dumpcore 576790
booting 300000 (recvd 0)
transit timer enabled TRUE, transit 600000 (last 0)
mailbox disks:
Disk 0c.09.4 is a local mailbox disk
Disk 0c.09.5 is a local mailbox disk
Disk 0c.09.0 is a partner mailbox disk
Disk 0c.09.1 is a partner mailbox disk
primary state:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 237, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 1252471313, sk_time 577790
channel_status 0
channel CHANNEL_IC, abs_time 1252471309, sk_time 573790
channel_status 5
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
backup state:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 4950, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 1252471313, sk_time 577260
channel_status 0
Channel Read Ctx:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 4950, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_IC, abs_time 0, sk_time 0
channel_status 3
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
takeoverState FT_NONE, takeoverString 'No takeover information'
givebackState FT_NONE, givebackString 'No giveback information'
givebackRetries 0, givebackRequested FALSE
autoGivebackEnabled FALSE, autoGivebackWasDone FALSE, autoGivebackCifsStopping FALSE
autoGivebackLastVetoCheck 0, autoGivebackAttemptsExceeded FALSE
Maximum primary disk mailbox io times: normal = 245, transition = 0
Maximum backup disk mailbox io times: normal = 307, transition = 0
Num times logs unsynced : 0
Total system uptime: 579079 msec
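The diag output is long, so a short script can pull out just the channel / channel_status pairs for a side-by-side comparison between the two heads. This is a sketch only: `channel_statuses` is a made-up helper, it assumes each `channel_status` line follows its `channel` line as in the output above, and it does not try to interpret what the numeric codes mean.

```python
import re

# "channel CHANNEL_IC, abs_time ..." introduces a channel;
# "channel_status 5" on a following line carries its status code.
CHANNEL_RE = re.compile(r"^\s*channel\s+(CHANNEL_\w+),")
STATUS_RE = re.compile(r"^\s*channel_status\s+(-?\d+)")

def channel_statuses(text):
    """Return [(channel_name, status_code), ...] in the order they appear."""
    pairs = []
    current = None
    for line in text.splitlines():
        m = CHANNEL_RE.match(line)
        if m:
            current = m.group(1)
            continue
        m = STATUS_RE.match(line)
        if m and current is not None:
            pairs.append((current, int(m.group(1))))
            current = None
    return pairs

# The "primary state" channel lines copied from the output above.
SAMPLE = """channel CHANNEL_MAILBOX, abs_time 1252471313, sk_time 577790
channel_status 0
channel CHANNEL_IC, abs_time 1252471309, sk_time 573790
channel_status 5
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
"""

if __name__ == "__main__":
    for name, code in channel_statuses(SAMPLE):
        print(f"{name}: {code}")
```

Comparing this list from c1 against the same list from c0 would show at a glance whether only the CHANNEL_IC entries differ.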

rtowry1234

Don't know if this will help, but I had a problem when I configured our FAS2050 and used an IP address as the name of the virtual interface (used for Cisco EtherChannel). Once I gave it a name (instead of an IP address), it worked.

danielpr

Randy,

This looks to me like a configuration issue, which I would rule out before calling it a hardware issue. Have you had a duplicate IP address on the network at any time?

Thanks;

Daniel

rtowry1234

When I configured the clustering, I specified the partner as an IP address. I tested failover, and it wouldn't work. So I went back and specified the partner by interface name instead: topvif for the bottom controller, and botvif for the top controller.

After that, it worked.

npcarlson

When you say "I specified the partner as the ip address" -- do you mean in the IP takeover section, or elsewhere?

npcarlson

Hi Daniel,

Was this supposed to be addressed at me?

There have not been duplicate IPs on the network.

Thanks!

-Nate

danielpr

Nate,

 

If you don't find any duplicate address warnings on the console, then it should be fine. But you should always check the configuration side carefully.

I will look into this issue further and post an update if I find anything.

 

Thanks

Daniel

npcarlson

Hmm, interesting. We don't use an EtherChannel (virtual interface), just a single IP. Can you post a config example of the issue you had (before/after)?

rtowry1234

I can't easily get the configuration. But the fix was specifying the partner interface name instead of the IP address. I forget exactly where, though (it was over a year ago).

npcarlson

OK, so I tried doing:

ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner e0a

That is, I specified the name of the partner interface instead of the IP. I made this change on both nodes and still get the exact same error.

fcorfdir

If you want to set up EtherChannel, your rc should look something like this:

hostname toto
ifconfig e0a down
ifconfig e0b down
vif create lacp toto_trunck -b ip e0a e0b
vlan create toto_trunck 26 1
ifconfig toto_trunck-1 172.16.0.5 netmask 255.255.255.0 mtusize 1500 partner 172.16.0.6 -wins
ifconfig toto_trunck-26 192.168.26.5 netmask 255.255.255.0 mtusize 1500 partner 192.168.26.6 -wins
route add default 172.16.0.254 1
routed on
options dns.domainname toto.intranet
options dns.enable on
options nis.enable off
savecore

My node name is toto. I created an EtherChannel named toto_trunck, added two VLANs (1 and 26), and set an IP and the partner IP on each VLAN interface.

Here is the Cisco config (don't forget to create the VLANs):

interface GigabitEthernet0/16
description toto e0a
switchport trunk native vlan 9
switchport mode trunk
channel-group 3 mode active
end

cata-giga#sh run int gig 0/22
Building configuration...

Current configuration : 145 bytes
!
interface GigabitEthernet0/22
description toto e0b
switchport trunk native vlan 9
switchport mode trunk
channel-group 3 mode active
end

cata-giga#sh run int po 3
Building configuration...

Current configuration : 111 bytes
!
interface Port-channel3
description lacp toto
switchport trunk native vlan 9
switchport mode trunk

npcarlson

We replaced C1 with a new FAS2050 controller, and it all works perfectly now.
