ONTAP Hardware
Hi,
I've got a FAS2050 which is no longer covered by a service contract. It has dual controllers; however, clustering is refusing to work. After setting up the partner system IDs, etc., I've got it to the point where if I run 'cf enable', the cluster comes up, but the logs never sync up. One of the controllers (c1) throws some odd errors:
c1> cf enable
c1>
Tue Sep 8 21:57:39 GMT [c1: cf.misc.operatorEnable:warning]: Cluster monitor: operator initiated enabling of cluster
Tue Sep 8 21:57:39 GMT [c1: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of c0 disabled (cluster takeover disabled by partner)
Tue Sep 8 21:57:39 GMT [c1: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of c1 by c0 disabled (unsynchronized log)
Tue Sep 8 21:57:40 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:57:43 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:57:45 GMT [c1: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of c0 disabled (unsynchronized log)
Tue Sep 8 21:57:47 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:57:51 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
Tue Sep 8 21:58:01 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #11 RECV_DESC_ERROR 2
The errors I'm wondering about are the 'Interconnect nic 0 has error on VI <...>' errors - I only see them on this head, and not the other one (c0). Swapping the positions of the controllers makes no difference. Is this likely a hardware issue with the integrated Infiniband controller on c1, or could it be something else?
Thanks!
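To compare the two heads, the console output above can be tallied per node and per event. The following is only a rough Python sketch: the regular expression is inferred from the log lines quoted in this post, and the function name `count_events` is made up for illustration, not part of any NetApp tool.

```python
import re
from collections import Counter

# Pattern inferred from the console lines quoted above, e.g.
# "Tue Sep 8 21:57:40 GMT [c1: cf.nm.nicViError:info]: Interconnect nic 0 ..."
# This is an assumption about the format, not an official specification.
LINE_RE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]: (?P<message>.*)"
)

def count_events(lines, event="cf.nm.nicViError"):
    """Count how often `event` was logged by each node (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("event") == event:
            counts[match.group("node")] += 1
    return counts
```

Feeding it the saved console output from both heads would show whether the interconnect error is really confined to c1, as described above.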
I suppose I should cover what I've already done. I read the post at http://communities.netapp.com/message/9031 and tried the things mentioned there. The interconnect is integrated in a FAS2050, so there is no cable I can check, and there is only a single cluster interconnect. I wiped all disks in the system clean and started from scratch with new mailbox disks, etc.; that did not help. Reseating the controllers did not help either. I've tried pretty much everything I can think of!
Carlson,
Can you share the /etc/rc configuration file for both the nodes?
Thanks;
Daniel
Certainly..
c0 (one without the error):
#Auto-generated by setup Fri Sep 4 23:15:09 GMT 2009
hostname c0
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.34
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.36
route add default 192.168.0.30 1
routed on
options dns.enable off
options nis.enable off
savecore
c1 (one with the error):
#Auto-generated by setup Fri Sep 4 23:17:55 GMT 2009
hostname c1
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner 192.168.0.33
ifconfig e0b `hostname`-e0b mediatype auto flowcontrol full partner 192.168.0.35
route add default 192.168.0.30 1
routed on
options dns.enable off
options nis.enable off
savecore
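For reviewing the two rc files side by side, a small Python sketch like the one below can pull out the `partner` argument of each `ifconfig` line. The helper name `partner_map` is invented for illustration, and the parsing only assumes the line layout shown in the two files above. It does not check each interface's own address, since the `hostname`-e0a names are resolved elsewhere and that mapping is not shown in this thread.

```python
def partner_map(rc_text):
    """Map interface name -> partner argument from ifconfig lines in an rc file.

    Hypothetical helper: assumes lines shaped like
    "ifconfig e0a <address> ... partner <address-or-interface>".
    """
    partners = {}
    for line in rc_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == "ifconfig" and "partner" in tokens:
            idx = tokens.index("partner")
            if idx + 1 < len(tokens):
                # tokens[1] is the interface, the token after "partner" is its peer
                partners[tokens[1]] = tokens[idx + 1]
    return partners


def same_interfaces(rc_a, rc_b):
    """True if both nodes declare a partner for the same set of interfaces."""
    return set(partner_map(rc_a)) == set(partner_map(rc_b))
```

With the two files above, both nodes declare partners for e0a and e0b, so a check like `same_interfaces` would pass; it says nothing about whether the interconnect itself is healthy.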
Also, here is the output of 'cf monitor' and 'cf status':
c1> cf status
c0 is up, takeover disabled because of reason (unsynchronized log)
c1 has disabled takeover by c0 (unsynchronized log)
VIA Interconnect is down (link up).
c1> cf monitor
current time: 09Sep2009 04:40:28
UP 00:08:12, partner 'c0', cluster monitor enabled
VIA Interconnect is down (link up), takeover capability off-line (unsynchronized log)
takeover by partner off-line (unsynchronized log)
partner update TAKEOVER_DISABLED (09Sep2009 04:40:26)
Then, in 'priv set diag', the output of 'cf monitor all':
cf: Current monitor status (09Sep2009 04:41:54):
partner 'c0', VIA Interconnect is down (link up)
state UP, time 578790, event CHECK_FSM, elem ChkMbValid (13)
mirrorConsistencyRequired TRUE
takeoverByPartner 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
mirrorEnabled TRUE, lowMemory FALSE, memio UNINIT, killPackets TRUE
degraded FALSE, reservePolicy ALWAYS_AFTER_TAKEOVER, resetDisks TRUE
hw_assist status:
hw_assist Inactive on c1: c1 not monitoring alerts from partner(c0)
hw_assist Inactive on c0: c0 not monitoring alerts from partner c1
timeouts:
fast 1000, slow 0, mailbox 2500, connect 0
operator 600000, firmware 0 (recvd 15000), dumpcore 576790
booting 300000 (recvd 0)
transit timer enabled TRUE, transit 600000 (last 0)
mailbox disks:
Disk 0c.09.4 is a local mailbox disk
Disk 0c.09.5 is a local mailbox disk
Disk 0c.09.0 is a partner mailbox disk
Disk 0c.09.1 is a partner mailbox disk
primary state:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 237, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 1252471313, sk_time 577790
channel_status 0
channel CHANNEL_IC, abs_time 1252471309, sk_time 573790
channel_status 5
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
backup state:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 4950, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 1252471313, sk_time 577260
channel_status 0
Channel Read Ctx:
version 2, senderSysid <x>
cluster_time 1252471144, hbt 4950, node_status TAKEOVER_DISABLED
info 0x2001 <NVRAM_DOWN,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_IC, abs_time 0, sk_time 0
channel_status 3
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
takeoverState FT_NONE, takeoverString 'No takeover information'
givebackState FT_NONE, givebackString 'No giveback information'
givebackRetries 0, givebackRequested FALSE
autoGivebackEnabled FALSE, autoGivebackWasDone FALSE, autoGivebackCifsStopping FALSE
autoGivebackLastVetoCheck 0, autoGivebackAttemptsExceeded FALSE
Maximum primary disk mailbox io times: normal = 245, transition = 0
Maximum backup disk mailbox io times: normal = 307, transition = 0
Num times logs unsynced : 0
Total system uptime: 579079 msec
Don't know if this will help, but I had a problem when I configured our FAS2050 and used an IP address as the name of the virtual interface (used for Cisco EtherChannel). Once I gave it a name (instead of an IP address), it worked.
Randy,
This looks like some kind of configuration issue to me, before I would call it a hardware issue. Have you had a duplicate IP address on the network at any time?
Thanks;
Daniel
When I configured the clustering, I specified the partner as the ip address. I tested failover, and it wouldn't work. So I went back and specified the partner address as the interface name topvif for the bottom controller, and botvif for the top controller.
After that, it worked.
When you say "I specified the partner as the ip address" -- do you mean in the IP takeover section, or elsewhere?
Hi Daniel,
Was this supposed to be addressed at me?
There have not been duplicate IPs on the network.
Thanks!
-Nate
Nate,
If you don't find any duplicate address warning on the console, then it should be fine. But you should always check the configuration side carefully.
I will look into this issue further and post an update if I find anything.
Thanks
Daniel
Hmm, interesting. We don't use an etherchannel (virtual interface), just a single IP.. but can you post a config example of the issue you had (before/after)?
I can't easily get the configuration, but the fix was specifying the partner interface name instead of the IP address. I forget where, though (it was over a year ago).
OK, so I tried doing:
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full partner e0a
..specifying the name of the partner interface instead of the IP. I made this change on both nodes, still the exact same error.
If you want to set up an EtherChannel, your rc should look something like this:
hostname toto
ifconfig e0a down
ifconfig e0b down
vif create lacp toto_trunck -b ip e0a e0b
vlan create toto_trunck 26 1
ifconfig toto_trunck-1 172.16.0.5 netmask 255.255.255.0 mtusize 1500 partner 172.16.0.6 -wins
ifconfig toto_trunck-26 192.168.26.5 netmask 255.255.255.0 mtusize 1500 partner 192.168.26.6 -wins
route add default 172.16.0.254 1
routed on
options dns.domainname toto.intranet
options dns.enable on
options nis.enable off
savecore
My node name is toto. I created an EtherChannel named toto_trunck, added two VLANs (1 and 26), and put an IP address on each VLAN along with the partner IP.
Here is the Cisco config (don't forget to create the VLANs):
interface GigabitEthernet0/16
description toto e0a
switchport trunk native vlan 9
switchport mode trunk
channel-group 3 mode active
end
cata-giga#sh run int gig 0/22
Building configuration...
Current configuration : 145 bytes
!
interface GigabitEthernet0/22
description toto e0b
switchport trunk native vlan 9
switchport mode trunk
channel-group 3 mode active
end
cata-giga#sh run int po 3
Building configuration...
Current configuration : 111 bytes
!
interface Port-channel3
description lacp toto
switchport trunk native vlan 9
switchport mode trunk
We replaced C1 with a new FAS2050 controller, and it all works perfectly now.