MCTB did not issue CFOD; it reports the surviving node as being in a takeover state. Did I miss anything?

chao
6,911 Views

I set up MCTB to monitor an SMC. When I power off one controller and the disk shelf on the same side, MCTB detects that one node is down, but it also reports that the other node is already in a takeover state and skips the remaining checks. MCTB also does not issue the CFOD command. Did I miss anything?

 

MCTB version is 2.4, DFM version is 5.2 and ONTAP version is 8.1.3

 

mctb.log: 

2014-11-06 13:50:42,688 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:50:42,688 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:50:42,688 [filerStatusUpdate-1] DEBUG com.netapp.rre.anegada.Filer - fas3220a: cf-status CONNECTED, Enabled: true, IC: true
2014-11-06 13:50:42,688 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status CONNECTED, Enabled: true, IC: true
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking interconnects
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Getting aggr status for fas3220a
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220a: loading aggregate mirror status
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220a: aggr0: mirrored
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Getting aggr status for fas3220b
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220b: loading aggregate mirror status
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220b: aggr0: mirrored
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking CFOD for filer fas3220a
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor CF state: CONNECTED
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor IC: true
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Root Aggr Mirror Degraded: false
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking CFOD for filer fas3220b
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor CF state: CONNECTED
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor IC: true
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Root Aggr Mirror Degraded: false
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking mirror degradation on filer fas3220a
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking mirror degradation on filer fas3220b
2014-11-06 13:50:52,734 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:50:52,734 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:50:52,734 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status TAKEOVER_STARTED, Enabled: true, IC: false
2014-11-06 13:51:13,764 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:51:13,764 [filerStatusUpdate-1] WARN com.netapp.rre.bautils.DfmCommand - DFM_EVENT MetroCluster-TieBreaker:Communications-Event:Api-Error, source: fas3220a: Connection timed out: connect
2014-11-06 13:51:13,764 [filerStatusUpdate-1] DEBUG com.netapp.rre.bautils.DfmCommand - Executing: dfm event generate MetroCluster-TieBreaker:Communications-Event:Api-Error fas3220a Connection timed out: connect
2014-11-06 13:51:14,013 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:51:24,029 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:51:24,029 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:51:24,029 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:51:45,058 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:51:45,058 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:51:55,074 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:51:55,074 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:51:55,074 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:52:16,103 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:52:16,103 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:52:26,119 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:52:26,119 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:52:26,119 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:52:47,148 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:52:47,148 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:52:57,164 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:52:57,164 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:52:57,164 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:53:18,162 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:53:18,162 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:53:28,178 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:53:28,178 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:53:28,178 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:53:49,207 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:53:49,207 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:53:59,223 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:53:59,223 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:53:59,238 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:54:20,252 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:54:20,252 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:54:30,268 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:54:30,268 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:54:30,268 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:54:51,344 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:54:51,344 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:55:01,359 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:55:01,359 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:55:01,375 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:55:22,405 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:55:22,405 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:55:32,420 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:55:32,420 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:55:32,420 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:55:53,434 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:55:53,434 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
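
To make the sequence in a log like this easier to follow, here is a minimal Python sketch that extracts only the cf-status changes per filer. It assumes the line layout shown above (timestamp, thread, level, class, then "<filer>: cf-status <STATE>, Enabled: ..., IC: ..."); the function name cf_transitions is hypothetical and not part of MCTB.

```python
import re

# Matches the DEBUG "cf-status" lines in the layout shown in the log above.
# The ApiException lines do not match, because they lack the "Enabled/IC" tail.
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*? - "
    r"(?P<filer>\S+): cf-status (?P<state>[A-Z_]+), "
    r"Enabled: (?P<enabled>true|false), IC: (?P<ic>true|false)$"
)


def cf_transitions(lines):
    """Yield (timestamp, filer, state, ic_up) whenever a filer's state or IC flag changes."""
    last = {}
    for line in lines:
        m = STATUS_RE.match(line.strip())
        if not m:
            continue
        current = (m["state"], m["ic"] == "true")
        if last.get(m["filer"]) != current:
            last[m["filer"]] = current
            yield (m["ts"], m["filer"], *current)


if __name__ == "__main__":
    # Assumes the log was saved as mctb.log in the current directory.
    with open("mctb.log", encoding="utf-8") as fh:
        for ts, filer, state, ic_up in cf_transitions(fh):
            print(ts, filer, state, "IC up" if ic_up else "IC down")
```

Run against the excerpt above, this should show fas3220b moving from CONNECTED to TAKEOVER_STARTED at 13:50:52 and then to ERROR at 13:51:24, which is the sequence discussed in the replies.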

 

 

config.xml

<configuration>
  <monitor name="MetroCluster1">
    <enabled>true</enabled>
    <pollingInterval>10</pollingInterval>
    <cfodAbortTimeout>90</cfodAbortTimeout>
    <cfodSuccessWaitTimeout>120</cfodSuccessWaitTimeout>
    <testMode>false</testMode>
    <site0 name="SiteA">
      <filer name="fas3220a">
        <hostname>10.128.13.32</hostname>
        <username>root</username>
        <encryptedPassword>pFYLhSg72PUC_ZUf-4NBcA</encryptedPassword>
        <ssl>false</ssl>
        <connectTimeout>5</connectTimeout>
        <connectRetries>1</connectRetries>
      </filer>
    </site0>
    <site1 name="SiteB">
      <filer name="fas3220b">
        <hostname>10.128.13.35</hostname>
        <username>root</username>
        <encryptedPassword>pFYLhSg72PUC_ZUf-4NBcA</encryptedPassword>
        <ssl>false</ssl>
        <connectTimeout>5</connectTimeout>
        <connectRetries>1</connectRetries>
      </filer>
    </site1>
  </monitor>
</configuration>
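
For anyone comparing their own setup against this one, a short Python sketch like the following can summarize the monitor settings. It relies only on the element names visible in the file above; the function name summarize_config and the dictionary keys are hypothetical, and the sketch is not part of MCTB.

```python
import xml.etree.ElementTree as ET


def summarize_config(xml_text):
    """Return one dict per <monitor>, using only elements seen in the posted config.xml."""
    root = ET.fromstring(xml_text)
    monitors = []
    for mon in root.findall("monitor"):
        filers = []
        for site in mon:
            # Site elements are named site0 and site1 in the posted file.
            if not site.tag.startswith("site"):
                continue
            for filer in site.findall("filer"):
                filers.append({
                    "site": site.get("name"),
                    "name": filer.get("name"),
                    "hostname": filer.findtext("hostname"),
                    "connect_timeout": int(filer.findtext("connectTimeout")),
                    "connect_retries": int(filer.findtext("connectRetries")),
                })
        monitors.append({
            "name": mon.get("name"),
            "polling_interval": int(mon.findtext("pollingInterval")),
            "cfod_abort_timeout": int(mon.findtext("cfodAbortTimeout")),
            "test_mode": mon.findtext("testMode") == "true",
            "filers": filers,
        })
    return monitors


if __name__ == "__main__":
    # Assumes the file was saved as config.xml in the current directory.
    with open("config.xml", encoding="utf-8") as fh:
        for mon in summarize_config(fh.read()):
            print(mon)
```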

 

 

On fas3220b:

fas3220b> cf status
fas3220a may be down, takeover disabled because of reason (partner mailbox disks not accessible or invalid)
fas3220b has disabled takeover by fas3220a (interconnect error)
VIA Interconnect is down (link 0 down, link 1 down).
The DR partner site might be dead.
To take it over, power it down or isolate it as described in the Data Protection Guide, and then use cf forcetakeover -d.

 

1 ACCEPTED SOLUTION

abrian
6,844 Views

Hi TC,

 

My comments inline below:

 

1. I think SMC also needs an automated failover tool (MCTB?). Many customers place the two halves of an SMC on different floors or in different cabinets. If one floor or one cabinet loses power, normal CFO can do nothing, so customers need a tool for quick failover.

 

It doesn't matter how far apart the SMC controllers are, as long as they are within acceptable SMC limits ONTAP will still fail over correctly. In SMC, regardless of the distance, MCTB doesn't provide any value.

 

2. I did some more testing and suspect the "issue" is on the MCTB side. Once MCTB detects that a node has started a takeover, it keeps treating the node as being in a take-over state even if the takeover fails because the mailbox disks are missing. But if I stop and start the MCTB service, MCTB refreshes its view, finds that the SMC needs a CFOD, and issues the CFOD.

 

As soon as MCTB detects a takeover state on either controller, or a transition between takeover states, it will refuse to take any action. This is by design, as MCTB cannot be allowed to interfere with normal takeover operations.

 

3. I think that even in an FMC, something could bring the controller down a few seconds before the shelf or switch goes down. The partner would also try to take over first and then fail because the mailbox disks are missing. I doubt MCTB handles that well. I think MCTB could be improved to re-check the cluster status even after it has seen one node attempt a takeover. MCTB should learn whether the takeover succeeded.

 

MCTB is not designed to trigger CFOD on a failed takeover, only in the explicit case where normal takeover cannot happen due to IC failure (which cannot happen in an SMC).

 

I am not sure whether you have a chance to talk with the RRE team. MCTB is a wonderful tool, and we hope it can be extended to SMC and cDOT.

 

I am on the RRE team, and I am the original author of this version of MCTB. The Admin Guide for MCTB describes the use case and the procedures MCTB follows to determine whether to issue CFOD. cDOT will have its own MCCTB solution developed by engineering (not RRE). Again, the specific scenario MCTB is designed to handle cannot happen in an SMC, and so MCTB does not provide any value.

 

Regards,

Brian

 


11 REPLIES

chao
6,849 Views

I have almost found the root cause. To simulate a site outage, I power off one controller and the shelf on the same side. But if I do this in the "wrong" order, I may power off the controller about 1 second before the shelf, and the partner may then try to take over:

 

Thu Nov  6 13:51:12 CST [fas3220b:cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(fas3220a), system_down because power_loss. 

Thu Nov 6 13:51:13 CST [fas3220b:cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of fas3220b by fas3220a disabled (interconnect error).
Thu Nov 6 13:51:13 CST [fas3220b:cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
Thu Nov 6 13:51:13 CST [fas3220b:cf.fm.takeoverStarted:notice]: Failover monitor: takeover started

 

But even though the takeover was unsuccessful, the failover monitor still shows "takeover started".

 

It seems MCTB trusts this state by mistake and will not trigger the CFOD.

 

Is it a bug? Or when will the failover monitor clear the "takeover started" state?

abrian
6,813 Views

Hi Chao,

 

ONTAP will do a normal failover whenever possible when one site of a MetroCluster goes down.  The specific case where ONTAP will not perform a normal takeover is when it cannot communicate with the partner and can't guarantee that it's indeed down.  This is the scenario MCTB is designed to automate.   As long as the IC link between the partners is up when one or the other fails, ONTAP will conduct a normal takeover and MCTB will detect that and take no action, which is what you are seeing.

 

Now, by SMC I assume you mean Stretch MetroCluster, correct?   MCTB is never needed in a SMC configuration because the IC can't go down before the head does (because there are no fabric switches that would bring the IC down prematurely).   For SMC configurations, MCTB is completely unnecessary.

 

Brian

chao
6,798 Views

Hi Brian

 

Thank you very much for your kind reply.

1. I think SMC also needs an automated failover tool (MCTB?). Many customers place the two halves of an SMC on different floors or in different cabinets. If one floor or one cabinet loses power, normal CFO can do nothing, so customers need a tool for quick failover.

 

2. I did some more testing and suspect the "issue" is on the MCTB side. Once MCTB detects that a node has started a takeover, it keeps treating the node as being in a take-over state even if the takeover fails because the mailbox disks are missing. But if I stop and start the MCTB service, MCTB refreshes its view, finds that the SMC needs a CFOD, and issues the CFOD.

 

3. I think that even in an FMC, something could bring the controller down a few seconds before the shelf or switch goes down. The partner would also try to take over first and then fail because the mailbox disks are missing. I doubt MCTB handles that well. I think MCTB could be improved to re-check the cluster status even after it has seen one node attempt a takeover. MCTB should learn whether the takeover succeeded.

 

I am not sure whether you have a chance to talk with the RRE team. MCTB is a wonderful tool, and we hope it can be extended to SMC and cDOT.

 

Thanks and Best Regards!

TC

aborzenkov
6,781 Views

It doesn't matter how far apart the SMC controllers are, as long as they are within acceptable SMC limits ONTAP will still fail over correctly. In SMC, regardless of the distance, MCTB doesn't provide any value. 

Maybe I am missing something obvious here, but: an SMC in two locations. One location completely loses power, both shelves and controller at the same time. Will the partner take over in this case?


As soon as MCTB detects a takeover state on either controller, or a transition between takeover states, it will refuse to take any action. This is by design, as MCTB cannot be allowed to interfere with normal takeover operations.


That's true, but as far as I understand the situation here, the partner never completed the takeover, probably because half of the mailboxes were suddenly lost. That's probably a complicated situation that requires coordination between DOT and MCTB; I would expect DOT to fail the takeover and MCTB then to kick in. So it could be primarily a DOT issue here.

 

abrian
6,712 Views

Maybe I am missing something obvious here, but: an SMC in two locations. One location completely loses power, both shelves and controller at the same time. Will the partner take over in this case?

 

Yes, as long as the IC link is up, ONTAP will perform the takeover (or at least attempt to).

 

That's true, but as far as I understand the situation here, the partner never completed the takeover, probably because half of the mailboxes were suddenly lost. That's probably a complicated situation that requires coordination between DOT and MCTB; I would expect DOT to fail the takeover and MCTB then to kick in. So it could be primarily a DOT issue here.

 

MCTB is not designed to cover this scenario. If ONTAP attempts to perform the takeover and fails, MCTB will take no action. I'm not sure what the ramifications are of performing a CFOD after a failed takeover. At some point, the potential for lost or corrupted data outweighs the need for service availability. We opted to be conservative and only handle the specific case where the IC has failed, which is the most likely problem faced in a site-wide power outage in a Fabric MetroCluster.

aborzenkov
5,431 Views

Yes, as long as the IC link is up, ONTAP will perform the takeover (or at least attempt to).

Please read what I wrote: "One location completely loses power". Somebody pulled the mains plug in the computer room. How can the IC be up when the partner is no longer present?

abrian
5,423 Views

You are talking about a Stretch MetroCluster (SMC), correct?   If so, then the IC is a direct connect between the controllers, right?  If so, then the surviving controller will detect the failure of its partner and perform a normal takeover.

 

If the IC goes through FC switches, then you are talking about a Fabric MetroCluster (FMC).  In that case, you are correct.  If the switches, and thus the IC, go down at the same time as the controller, then it's likely that the partner will detect IC failure before it can detect the partner is down, and will then refuse to do a normal takeover.  This is the primary use case for MCTB.   However, a far safer solution is to maintain the back end FMC fabric (and thus the IC) during a power outage, perhaps by using separate UPSs on the switches.  As long as the IC survives long enough for the partner to detect the partner down, then ONTAP can do the much faster and safer normal takeover.

chao
6,764 Views

Hi Brian

 

Thank you very much for your kind reply, and thank you for helping to develop MCTB. For a long time, customers and field teams were looking for an automated MetroCluster failover tool, until MCTB. It is of great value to us in selling and enabling 7-mode MetroCluster. 🙂 I am also glad to know that the "issue" I hit is by design and not a burt.

 

In the manual (I have the Admin Guide for 2.1 at hand; I am not sure whether there is a newer version):

 

A CFOD is initiated under the following conditions (page 22):

• The controller is not reachable

• The controller’s partner has a cf status of WAITING or ERROR

• The controller’s partner shows that the interconnect link is down

• The aggregate containing the root volume is mirror degraded

• No un-ignored aggregate has exceeded the cfodAbortTimeout threshold.

If all other conditions exist, but an aggregate has exceeded the cfodAbortTimeout, a CFOD-Aborted event is thrown (see event definitions below).
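
Read literally, the conditions quoted above amount to a simple all-or-nothing check. The following Python sketch encodes only that quoted list; the class, field, and function names are hypothetical, and it deliberately leaves out any other behavior MCTB may have (such as how it treats a previously seen takeover state).

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    NO_ACTION = "no action"
    CFOD = "initiate CFOD"
    CFOD_ABORTED = "raise CFOD-Aborted event"


@dataclass
class Observation:
    # Field names are illustrative; each maps to one bullet in the quoted list.
    controller_reachable: bool
    partner_cf_status: str
    partner_ic_up: bool
    root_aggr_mirror_degraded: bool
    aggr_exceeded_abort_timeout: bool


def evaluate(obs):
    """Apply the quoted Admin Guide conditions to one observation."""
    preconditions = (
        not obs.controller_reachable
        and obs.partner_cf_status in {"WAITING", "ERROR"}
        and not obs.partner_ic_up
        and obs.root_aggr_mirror_degraded
    )
    if not preconditions:
        return Decision.NO_ACTION
    # All other conditions hold; the abort timeout decides between CFOD and CFOD-Aborted.
    if obs.aggr_exceeded_abort_timeout:
        return Decision.CFOD_ABORTED
    return Decision.CFOD
```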

 

I mostly understand what MCTB is meant to do, but it would help to have some further explanation of why the SMC scenario is not MCTB's focus. An SMC may not meet all of the MCTB-defined CFOD conditions at the very beginning, but after the first takeover attempt fails, it should meet them all.

 

fas3220b> cf status

fas3220a may be down, takeover disabled because of reason (partner mailbox disks not accessible or invalid)

fas3220b has disabled takeover by fas3220a (interconnect error)

VIA Interconnect is down (link 0 down, link 1 down).

The DR partner site might be dead.

To take it over, power it down or isolate it as described in the Data Protection Guide, and then use cf forcetakeover -d.

 

By the way, I also want to share what my testing showed about when MCTB can help in an SMC.

1. In the SMC scenario, if the shelves are powered off a few seconds before the other components, CFO is disabled because the mailbox disks are missing, and MCTB can then trigger a CFOD.

2. In the SMC scenario, if MCTB has detected that one node is in a take-over state, then even if that takeover fails for any reason, MCTB does not re-evaluate the state and always skips the remaining checks (yes, by design). But if the MCTB service is restarted for any reason, MCTB picks up the current state and triggers the CFOD (that state meets all of the MCTB-defined CFOD conditions).

 

Thanks and Best Regards!

TC

abrian
6,708 Views

Hi TC,

 

1. In the SMC scenario, if the shelves are powered off a few seconds before the other components, CFO is disabled because the mailbox disks are missing, and MCTB can then trigger a CFOD.

 

I'm not familiar enough with SMC or its failure modes to understand why ONTAP is not triggering a normal failover at this point.

 

2. In the SMC scenario, if MCTB has detected that one node is in a take-over state, then even if that takeover fails for any reason, MCTB does not re-evaluate the state and always skips the remaining checks (yes, by design). But if the MCTB service is restarted for any reason, MCTB picks up the current state and triggers the CFOD (that state meets all of the MCTB-defined CFOD conditions).

 

Restarting MCTB should have no effect here. MCTB never caches controller state, so each time it cycles it acquires the CF status and other information fresh from the controllers (using the NMSDK APIs). Since the ONTAP-initiated takeover failed, the controllers are probably in TAKEOVER_FAILED state, which is one of the states that will cause MCTB to refuse to initiate CFOD.

abrian
6,699 Views

@abrian wrote:

Restarting MCTB should have no effect here. MCTB never caches controller state, so each time it cycles it acquires the CF status and other information fresh from the controllers (using the NMSDK APIs). Since the ONTAP-initiated takeover failed, the controllers are probably in TAKEOVER_FAILED state, which is one of the states that will cause MCTB to refuse to initiate CFOD.


Sorry, I misspoke here. When MCTB detects any takeover state, it will remember that until the state returns to CONNECTED. So, even if the CF status becomes ERROR after the controller has been in a takeover state, it will not perform CFOD. Of course, if you restart MCTB, the fact that the ERROR state followed a takeover state is lost, and MCTB will perform a CFOD, which is what you are seeing.

 

The intent is to require a human operator to take action except in the single case where ONTAP cannot initiate a takeover because the IC link between the controllers is down.  Any other failure mode, such as a takeover failure, is outside the criteria.   In effect, restarting MCTB is an operator taking action (although not as fast as the operator just initiating a CFOD manually, which is the intent).
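
The remembered-takeover behavior described above can be pictured as a small in-memory latch. The Python sketch below is an illustration only, not MCTB's actual code: the class name TakeoverLatch is hypothetical, and treating every status that begins with "TAKEOVER" as a takeover state is an assumption based on the TAKEOVER_STARTED and TAKEOVER_FAILED names mentioned in this thread.

```python
class TakeoverLatch:
    """Remembers that a takeover state was seen until the status returns to CONNECTED."""

    def __init__(self):
        # Held in memory only, so a restart starts with a clear latch.
        self.seen_takeover = False

    def blocks_cfod(self, cf_status):
        """Update the latch with the latest cf-status and report whether CFOD is blocked."""
        if cf_status == "CONNECTED":
            self.seen_takeover = False
        elif cf_status.startswith("TAKEOVER"):  # assumption: prefix identifies takeover states
            self.seen_takeover = True
        return self.seen_takeover


# Sequence reported for fas3220b in the log: the latch stays set through ERROR.
latch = TakeoverLatch()
for status in ["CONNECTED", "TAKEOVER_STARTED", "ERROR", "ERROR"]:
    blocked = latch.blocks_cfod(status)
print("blocked after takeover then ERROR:", blocked)

# After a restart, the first status seen is ERROR, so nothing is latched.
restarted = TakeoverLatch()
print("blocked after restart:", restarted.blocks_cfod("ERROR"))
```

Because the latch lives only in memory, a fresh instance that first sees ERROR has nothing to remember, which matches the behavior observed after restarting the service.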

 

Brian
