
FAS270c partner will not boot

donbarton

Hi,

I am having issues with a clustered FAS270 with two additional shelves. It is at a remote site, and we've had to power it down several times due to power issues there. On this last power-down, a halt -f was issued on both heads as usual; however, on boot one head came back and the other stopped with: panic: Permanent errors on all HA mailbox disks (while writing master block) in process fmmbx_instance. It then halts and drops to the CFE prompt.

When an attempt was made to boot into maintenance mode, we get: In a cluster, you MUST ensure that the partner is (and remains) down, or that takeover is manually disabled on the ... FAILURE TO DO SO CAN RESULT IN YOUR FILESYSTEMS BEING DESTROYED

I also receive these errors:

[cf.fm.VersionMismatch:error] replayed event: Cluster Monitor: Cluster monitor version mismatch detected: 2/0

[cf.nm.nicViError:info] replayed event: Interconnect nic 0 had error on VI #4 SEND_DESC_ERROR 2

[cf.nm.nicReset:warning] Initiating soft reset on Cluster Intereconnect card 0 due to rendezvous jammed

It is possible that an evaluation cluster license was loaded; however, I am not sure how to boot the partner so that I can add the valid license.

Thanks in advance!

On the partner that did boot, I receive the following:

XXXXX>cf status
partner may be down, takeover disabled because of reason (CFO not licensed)
zap has disabled takeover by partner (version mismatch)
        Host Info: Local 3 Partner 0
        NVRAM TOC: Local 12 Partner 0
        WAFL FSInfo: Local 72 Partner 0
        WAFL log: Local 147 Partner 0
        RAID: Local 8 Partner 0
        RAID NVRAM: Local 13 Partner 0
VIA Interconnect is down (link 0 up, link 1 up).
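
For anyone comparing this output across heads, the reasons shown in parentheses on the takeover lines can be pulled out with a few lines of Python. This is only a rough sketch: the function name takeover_reasons is made up for illustration and is not part of Data ONTAP, and it assumes the cf status text has been captured from the console exactly as pasted above.

```python
import re

def takeover_reasons(cf_status_text: str) -> list[str]:
    """Return the parenthesized reasons found on lines that mention takeover."""
    reasons = []
    for line in cf_status_text.splitlines():
        # Only look at lines that talk about takeover; the interconnect
        # line also has parentheses but describes link state instead.
        if "takeover" in line.lower():
            reasons.extend(re.findall(r"\(([^)]*)\)", line))
    return reasons

# Sample text copied from the cf status output in this post.
sample = """partner may be down, takeover disabled because of reason (CFO not licensed)
zap has disabled takeover by partner (version mismatch)
VIA Interconnect is down (link 0 up, link 1 up)."""

print(takeover_reasons(sample))  # ['CFO not licensed', 'version mismatch']
```

Here the two reasons are independent: one is reported for the local head's ability to take over, the other for takeover by the partner.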


XXXXX*> cf monitor
  current time: 13Sep2009 10:37:46
  UP 14+21:57:20, partner 'unknown', cluster monitor enabled
  VIA Interconnect is down (link 0 up, link 1 up), takeover capability off-line (CFO not licensed)
  takeover by partner off-line (version mismatch)
  partner may be down, last partner update UNKNOWN (10Sep2009 05:17:18)


XXXXXX*> cf monitor all
cf: Current monitor status (13Sep2009 10:38:21):
partner 'unknown', VIA Interconnect is down (link 0 up, link 1 up)
state UP, time 1288676133, event CHECK_FSM, elem SChkNoTkOver (13)
mirrorConsistencyRequired TRUE
takeoverByPartner 0x2081 <NVRAM_DOWN,VERSION,TAKEOVER_ON_PANIC>
mirrorEnabled TRUE, lowMemory FALSE, memio UNINIT, killPackets TRUE
degraded FALSE, reservePolicy ALWAYS_AFTER_TAKEOVER, resetDisks TRUE
timeouts:
    fast 1000, slow 2500, mailbox 10000, connect 5000
    operator 600000, firmware 15000 (recvd 1286605928), dumpcore 60000
    booting 300000 (recvd 0)
    transit timer enabled TRUE, transit 600000 (last 58904)
mailbox disks:
Disk 0b.59 is a local mailbox disk
Disk 0b.51 is a local mailbox disk
Disk 0b.41 is a partner mailbox disk
primary state:
        version 2, senderSysid XXXXXXXX
        cluster_time 1252559838, hbt 371252, node_status TAKEOVER_DISABLED
        info 0x2081 <NVRAM_DOWN,VERSION,TAKEOVER_ON_PANIC>
        flags 0x0 <>
        channel CHANNEL_MAILBOX, abs_time 1252838301, sk_time 1288676133
        channel_status 0
        channel CHANNEL_IC, abs_time 1252833344, sk_time 1283717709
        channel_status 5
        channel CHANNEL_NETWORK, abs_time 0, sk_time 0
        channel_status -1
backup state:
        version 0, senderSysid 0
        cluster_time 0, hbt 0, node_status UNKNOWN
        info 0x0 <>
        flags 0x0 <>
        channel CHANNEL_MAILBOX, abs_time 0, sk_time 0
        channel_status 0
        Channel Read Ctx:
        version 2, senderSysid 0
        cluster_time 0, hbt 0, node_status UNKNOWN
        info 0x0 <>
        flags 0x0 <>
        channel CHANNEL_IC, abs_time 0, sk_time 0
        channel_status 3
        Channel Read Ctx:
        version 2, senderSysid 0
        cluster_time 0, hbt 0, node_status UNKNOWN
        info 0x0 <>
        flags 0x0 <>
        channel CHANNEL_NETWORK, abs_time 0, sk_time 0
        channel_status -1
        Channel Read Ctx:
        version 2, senderSysid 0
        cluster_time 0, hbt 0, node_status UNKNOWN
        info 0x0 <>
        flags 0x0 <>
takeoverState FT_NONE, takeoverString 'No takeover information'
givebackState FT_NONE, givebackString 'No giveback information'
givebackRetries 0, givebackRequested FALSE
autoGivebackEnabled FALSE, autoGivebackWasDone FALSE, autoGivebackCifsStopping FALSE
autoGivebackLastVetoCheck 0, autoGivebackAttemptsExceeded FALSE
Maximum primary disk mailbox io times: normal = 1112, transition = 0
Maximum backup disk mailbox io times: normal = 545, transition = 0
Num times logs unsynced : 0
Total system uptime: 1288676417 msec
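
A note for reading the dump above: cluster_time and abs_time look like Unix epoch seconds, and sk_time looks like milliseconds since boot. That reading is an assumption based on the values lining up with the times printed by cf monitor (and on the filer clock matching UTC); it is not taken from NetApp documentation. A minimal Python sketch, with hypothetical helper names, to convert them:

```python
from datetime import datetime, timezone

def epoch_to_stamp(seconds: int) -> str:
    """Format epoch seconds like the cf monitor timestamps (assumes UTC)."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%d%b%Y %H:%M:%S")

def msec_to_uptime(msec: int) -> str:
    """Format milliseconds since boot as days+HH:MM:SS, like the UP field."""
    days, rem = divmod(msec // 1000, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{secs:02d}"

def flag_bits(value: int) -> list[int]:
    """Split a flag word such as 0x2081 into its individual set bits."""
    return [1 << i for i in range(value.bit_length()) if (value >> i) & 1]

print(epoch_to_stamp(1252559838))   # cluster_time in the primary state
print(epoch_to_stamp(1252838301))   # abs_time of CHANNEL_MAILBOX
print(msec_to_uptime(1288676133))   # sk_time of CHANNEL_MAILBOX
print([hex(b) for b in flag_bits(0x2081)])
```

Under that assumption, cluster_time 1252559838 comes out as 10Sep2009 05:17:18, which is the "last partner update" time shown by cf monitor, so the partner has not updated its state since then. The flag word 0x2081 has three bits set, matching the three names printed next to it; which bit corresponds to which name is not shown in the output.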


eric_barlier

Hi Donald,

It does seem like you might have had a trial license on one of the controllers. I wonder whether that is even possible, but this excerpt seems to suggest it is:

XXXXX>cf status
partner may be down, takeover disabled because of reason (CFO not licensed)

In your case I would definitely contact NetApp tech support. If you are out of warranty, try disabling CF on the working controller:

cf disable

Then try to reboot the non-working controller. Once both are up, check the licenses by running

license

and see if CLUSTER is enabled. Good luck.

Eric

danielpr

Hi Donald,

The status shows that the cluster license on one of the nodes is disabled or expired. There also seems to be a Data ONTAP version mismatch between the two nodes.

You need to run the same Data ONTAP release on both nodes of the CFO pair to enable clustering.

>> VIA Interconnect is down (link 0 up, link 1 up), takeover capability off-line (CFO not licensed)
>> takeover by partner off-line (version mismatch)

Thanks,

Daniel

donbarton

All, thank you for your replies! I am not sure how or why a trial license was installed on this cluster, as I have only recently inherited responsibility for it. I do have a valid license; however, I am at the point where I can't boot the partner to add the valid cluster license.

Daniel, what is strange is that we are running the same version of Data ONTAP on both heads, and it was a working cluster for a while before this last unscheduled outage. I suspect it is showing a version mismatch because it doesn't even see its partner.

I guess I will try, as suggested, cf disable and try to get the other filer booted. Thanks for all your replies!  I will let you know how it turns out, as I am walking someone through these steps at a remote site, and I do not want to fly out there....lol.

danielpr

Donald,

So the node that is down needs to come up. To make that happen, issue "cf giveback" from the node that is up (in case the filer is in takeover mode) and check whether the down node comes up. If it doesn't, run "cf disable" on the node that is up. Then try to boot the second node and enable the cluster again.

Thanks

Daniel
