I am having issues with a clustered FAS270 with two additional shelves. It is at a remote site, and we've had to power it down several times due to power issues there. For this last power-down, a halt -f was issued on both heads as usual; however, upon boot one head came back and the other panicked with: panic: Permanent errors on all HA mailbox disks (while writing master block) in process fmmbx_instance. It then halts and drops to the CFE prompt.
When an attempt was made to boot into maintenance mode, we get: In a cluster, you MUST ensure that the partner is (and remains) down, or that takeover is manually disabled on the ... FAILURE TO DO SO CAN RESULT IN YOUR FILESYSTEMS BEING DESTROYED
[cf.nm.nicViError:info] replayed event: Interconnect nic 0 had error on VI #4 SEND_DESC_ERROR 2
[cf.nm.nicReset:warning] Initiating soft reset on Cluster Interconnect card 0 due to rendezvous jammed
There is a possibility that an evaluation cluster license was loaded; however, I am not sure how to boot the partner to add the valid license.
Thanks in advance!
On the partner that did boot, I receive the following:
XXXXX>cf status
partner may be down, takeover disabled because of reason (CFO not licensed)
zap has disabled takeover by partner (version mismatch)
Host Info: Local 3 Partner 0
NVRAM TOC: Local 12 Partner 0
WAFL FSInfo: Local 72 Partner 0
WAFL log: Local 147 Partner 0
RAID: Local 8 Partner 0
RAID NVRAM: Local 13 Partner 0
VIA Interconnect is down (link 0 up, link 1 up).

XXXXX*> cf monitor
current time: 13Sep2009 10:37:46
UP 14+21:57:20, partner 'unknown', cluster monitor enabled
VIA Interconnect is down (link 0 up, link 1 up), takeover capability off-line (CFO not licensed)
takeover by partner off-line (version mismatch)
partner may be down, last partner update UNKNOWN (10Sep2009 05:17:18)

XXXXXX*> cf monitor all
cf: Current monitor status (13Sep2009 10:38:21):
partner 'unknown', VIA Interconnect is down (link 0 up, link 1 up)
state UP, time 1288676133, event CHECK_FSM, elem SChkNoTkOver (13)
mirrorConsistencyRequired TRUE
takeoverByPartner 0x2081 <NVRAM_DOWN,VERSION,TAKEOVER_ON_PANIC>
mirrorEnabled TRUE, lowMemory FALSE, memio UNINIT, killPackets TRUE
degraded FALSE, reservePolicy ALWAYS_AFTER_TAKEOVER, resetDisks TRUE
timeouts:
fast 1000, slow 2500, mailbox 10000, connect 5000
operator 600000, firmware 15000 (recvd 1286605928), dumpcore 60000
booting 300000 (recvd 0)
transit timer enabled TRUE, transit 600000 (last 58904)
mailbox disks:
Disk 0b.59 is a local mailbox disk
Disk 0b.51 is a local mailbox disk
Disk 0b.41 is a partner mailbox disk
primary state:
version 2, senderSysid XXXXXXXX
cluster_time 1252559838, hbt 371252, node_status TAKEOVER_DISABLED
info 0x2081 <NVRAM_DOWN,VERSION,TAKEOVER_ON_PANIC>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 1252838301, sk_time 1288676133
channel_status 0
channel CHANNEL_IC, abs_time 1252833344, sk_time 1283717709
channel_status 5
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
backup state:
version 0, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
channel CHANNEL_MAILBOX, abs_time 0, sk_time 0
channel_status 0
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
channel CHANNEL_IC, abs_time 0, sk_time 0
channel_status 3
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
channel CHANNEL_NETWORK, abs_time 0, sk_time 0
channel_status -1
Channel Read Ctx:
version 2, senderSysid 0
cluster_time 0, hbt 0, node_status UNKNOWN
info 0x0 <>
flags 0x0 <>
takeoverState FT_NONE, takeoverString 'No takeover information'
givebackState FT_NONE, givebackString 'No giveback information'
givebackRetries 0, givebackRequested FALSE
autoGivebackEnabled FALSE, autoGivebackWasDone FALSE, autoGivebackCifsStopping FALSE
autoGivebackLastVetoCheck 0, autoGivebackAttemptsExceeded FALSE
Maximum primary disk mailbox io times: normal = 1112, transition = 0
Maximum backup disk mailbox io times: normal = 545, transition = 0
Num times logs unsynced : 0
Total system uptime: 1288676417 msec
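
For anyone reading the raw numbers in the cf monitor all output, here is a small Python sketch that converts a few of them into readable times. It assumes the cluster_time and abs_time fields are Unix epoch seconds and that the filer prints its clock in UTC. That is an assumption on my part, not something taken from NetApp documentation, although the results do line up with the timestamps printed in the output. The function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def epoch_to_filer_time(seconds):
    """Format epoch seconds like the filer does, e.g. 13Sep2009 10:38:21.

    Assumption: the value is Unix epoch seconds and the filer shows UTC.
    """
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%d%b%Y %H:%M:%S")


def msec_to_uptime(msec):
    """Format a millisecond count as days+HH:MM:SS, like the UP field."""
    delta = timedelta(milliseconds=msec)
    hours, rest = divmod(delta.seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{delta.days}+{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # Values copied from the cf monitor all output above.
    print(epoch_to_filer_time(1252559838))  # primary state cluster_time
    print(epoch_to_filer_time(1252838301))  # CHANNEL_MAILBOX abs_time
    print(msec_to_uptime(1288676417))       # Total system uptime
```

Under those assumptions, cluster_time comes out as 10Sep2009 05:17:18, the same moment as the "last partner update", and the mailbox abs_time comes out as 13Sep2009 10:38:21, the moment the command was run. The uptime works out to 14+21:57:56, which agrees with the UP value in cf monitor taken about half a minute earlier.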
All, thank you for your replies! I am not sure how or why a trial license was installed on this cluster, as I have only recently inherited the responsibility of taking care of it. I do have a valid license; however, I am at the point where I can't boot the partner to add the valid cluster license.
Daniel, what is strange is that we are running the same version of ONTAP on both heads, as it was a working cluster for a while before this last unscheduled outage. I am thinking it is showing a version mismatch because it doesn't even see its partner.
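
One way to sanity-check that guess is to pull the Local/Partner pairs out of the cf status output and see whether any partner value was actually read. The Python sketch below does that with a regular expression. The input string is copied from the cf status output earlier in this thread, and the function name is made up for the example. It only shows that every partner field reads 0. Whether that is what triggers the "version mismatch" message is my assumption, not a documented behavior.

```python
import re

# Version fields copied from the cf status output earlier in this thread.
CF_STATUS_VERSIONS = (
    "Host Info: Local 3 Partner 0 NVRAM TOC: Local 12 Partner 0 "
    "WAFL FSInfo: Local 72 Partner 0 WAFL log: Local 147 Partner 0 "
    "RAID: Local 8 Partner 0 RAID NVRAM: Local 13 Partner 0"
)

# A label, then "Local <n> Partner <n>".
PAIR = re.compile(r"([A-Za-z ]+?): Local (\d+) Partner (\d+)")


def parse_versions(text):
    """Return {label: (local, partner)} for each version pair in text."""
    return {
        match.group(1).strip(): (int(match.group(2)), int(match.group(3)))
        for match in PAIR.finditer(text)
    }


if __name__ == "__main__":
    versions = parse_versions(CF_STATUS_VERSIONS)
    for label, (local, partner) in versions.items():
        print(f"{label:12} local={local:<4} partner={partner}")
    # True when no partner value was reported for any component.
    print(all(partner == 0 for _, partner in versions.values()))
```

All six partner values are 0 while the local ones are not, which fits the idea that this head simply has no information from its partner rather than seeing a genuinely different release.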
I guess I will try, as suggested, cf disable and try to get the other filer booted. Thanks for all your replies! I will let you know how it turns out, as I am walking someone through these steps at a remote site, and I do not want to fly out there....lol.
So the node that is down needs to come up. To make that happen, issue "cf giveback" from the node that is up (in case the filer is in takeover mode) and check whether the down node comes up. If it doesn't, do "cf disable". Then try to boot the second node and enable the cluster once again.