ONTAP Discussions

MCTB did not issue CFOD. It reports the remaining node in a failover state. Did I miss anything?

chao

I set up MCTB to monitor an SMC. When I powered off one controller and the disk shelf on the same side, MCTB detected that one node was down. But MCTB also reported that the other node was already in a takeover state and skipped the remaining checks. MCTB also did not issue the CFOD command. Did I miss anything?

 

MCTB version is 2.4, DFM version is 5.2 and ONTAP version is 8.1.3

 

mctb.log: 

2014-11-06 13:50:42,688 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:50:42,688 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:50:42,688 [filerStatusUpdate-1] DEBUG com.netapp.rre.anegada.Filer - fas3220a: cf-status CONNECTED, Enabled: true, IC: true
2014-11-06 13:50:42,688 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status CONNECTED, Enabled: true, IC: true
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking interconnects
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Getting aggr status for fas3220a
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220a: loading aggregate mirror status
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220a: aggr0: mirrored
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Getting aggr status for fas3220b
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220b: loading aggregate mirror status
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220b: aggr0: mirrored
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking CFOD for filer fas3220a
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor CF state: CONNECTED
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor IC: true
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Root Aggr Mirror Degraded: false
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking CFOD for filer fas3220b
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor CF state: CONNECTED
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor IC: true
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Root Aggr Mirror Degraded: false
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking mirror degradation on filer fas3220a
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking mirror degradation on filer fas3220b
2014-11-06 13:50:52,734 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:50:52,734 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:50:52,734 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status TAKEOVER_STARTED, Enabled: true, IC: false
2014-11-06 13:51:13,764 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:51:13,764 [filerStatusUpdate-1] WARN com.netapp.rre.bautils.DfmCommand - DFM_EVENT MetroCluster-TieBreaker:Communications-Event:Api-Error, source: fas3220a: Connection timed out: connect
2014-11-06 13:51:13,764 [filerStatusUpdate-1] DEBUG com.netapp.rre.bautils.DfmCommand - Executing: dfm event generate MetroCluster-TieBreaker:Communications-Event:Api-Error fas3220a Connection timed out: connect
2014-11-06 13:51:14,013 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:51:24,029 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:51:24,029 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:51:24,029 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:51:45,058 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:51:45,058 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:51:55,074 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:51:55,074 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:51:55,074 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:52:16,103 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:52:16,103 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:52:26,119 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:52:26,119 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:52:26,119 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:52:47,148 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:52:47,148 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:52:57,164 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:52:57,164 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:52:57,164 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:53:18,162 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:53:18,162 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:53:28,178 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:53:28,178 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:53:28,178 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:53:49,207 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:53:49,207 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:53:59,223 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:53:59,223 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:53:59,238 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:54:20,252 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:54:20,252 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:54:30,268 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:54:30,268 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:54:30,268 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:54:51,344 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:54:51,344 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:55:01,359 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:55:01,359 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:55:01,375 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:55:22,405 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:55:22,405 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:55:32,420 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:55:32,420 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:55:32,420 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:55:53,434 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:55:53,434 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.

 

 

config.xml

<configuration>
  <monitor name="MetroCluster1">
    <enabled>true</enabled>
    <pollingInterval>10</pollingInterval>
    <cfodAbortTimeout>90</cfodAbortTimeout>
    <cfodSuccessWaitTimeout>120</cfodSuccessWaitTimeout>
    <testMode>false</testMode>
    <site0 name="SiteA">
      <filer name="fas3220a">
        <hostname>10.128.13.32</hostname>
        <username>root</username>
        <encryptedPassword>pFYLhSg72PUC_ZUf-4NBcA</encryptedPassword>
        <ssl>false</ssl>
        <connectTimeout>5</connectTimeout>
        <connectRetries>1</connectRetries>
      </filer>
    </site0>
    <site1 name="SiteB">
      <filer name="fas3220b">
        <hostname>10.128.13.35</hostname>
        <username>root</username>
        <encryptedPassword>pFYLhSg72PUC_ZUf-4NBcA</encryptedPassword>
        <ssl>false</ssl>
        <connectTimeout>5</connectTimeout>
        <connectRetries>1</connectRetries>
      </filer>
    </site1>
  </monitor>
</configuration>

 

 

On fas3220b:

fas3220b> cf status
fas3220a may be down, takeover disabled because of reason (partner mailbox disks not accessible or invalid)
fas3220b has disabled takeover by fas3220a (interconnect error)
VIA Interconnect is down (link 0 down, link 1 down).
The DR partner site might be dead.
To take it over, power it down or isolate it as described in the Data Protection Guide, and then use cf forcetakeover -d.

 

1 ACCEPTED SOLUTION

abrian

Hi TC,

 

My comments inline below:

 

1. I think SMC also needs an automated failover tool (MCTB?). Many customers put the SMC nodes on different floors or in different cabinets. If one floor or one cabinet loses power, normal CFO can do nothing, so the customer needs a tool to help with a quick failover.

 

It doesn't matter how far apart the SMC controllers are: as long as they are within acceptable SMC limits, ONTAP will still fail over correctly. In SMC, regardless of the distance, MCTB doesn't provide any value.

 

2. I did some more testing and suspect the "issue" is on the MCTB side. It seems that once MCTB detects a node starting a takeover, it still considers that node to be in a take-over state even if the takeover then fails because the mailbox disks are missing. But if I stop and start the MCTB service, MCTB refreshes its state, finds that the SMC needs CFOD, and then issues CFOD.

 

As soon as MCTB detects a takeover state on either controller, or a transition between takeover states, it will refuse to take any action. This is by design, as MCTB cannot be allowed to interfere with normal takeover operations.
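The behavior described here, together with the log in the original post (where fas3220b moves from TAKEOVER_STARTED to ERROR yet the monitor keeps skipping checks), can be read as a sticky guard. The following is only a minimal sketch of that reading, not MCTB's code: the class `TakeoverGuard`, its methods, and the choice of which cf-status values count as a takeover state are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# TAKEOVER_STARTED is a cf-status value seen in mctb.log; treating it as the
# only "takeover state" is an assumption made for this illustration.
TAKEOVER_STATES = {"TAKEOVER_STARTED"}


@dataclass
class TakeoverGuard:
    """Hypothetical sticky guard: once tripped, it stays tripped."""

    tripped_by: Optional[str] = None

    def observe(self, filer: str, cf_status: str) -> None:
        # Record the first filer seen in a takeover state; later statuses
        # (for example ERROR) do not clear the guard.
        if self.tripped_by is None and cf_status in TAKEOVER_STATES:
            self.tripped_by = filer

    def may_act(self) -> bool:
        # The monitor may only consider CFOD while no takeover has been seen.
        return self.tripped_by is None


guard = TakeoverGuard()
for status in ["CONNECTED", "TAKEOVER_STARTED", "ERROR", "ERROR"]:
    guard.observe("fas3220b", status)
    print(status, "may act:", guard.may_act())
```

In this sketch the guard only resets when a new `TakeoverGuard` is created, which would correspond to the service restart described in point 2.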

 

3. I think that even in an FMC, something can happen that takes the controller down a few seconds before the shelf or switch goes down. The partner will also try to take over first and then fail because the mailbox disks are missing. I doubt MCTB could handle that well. I think MCTB could be improved to re-detect the cluster status even after it has been informed that one node tried to take over. MCTB should learn whether the takeover succeeded.

 

MCTB is not designed to trigger CFOD on a failed takeover, only in the explicit case where normal takeover cannot happen due to IC failure (which cannot happen in an SMC).
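To make the distinction concrete, here is a rough sketch of such a decision as a single predicate. It is an illustration under stated assumptions, not the procedure from the Admin Guide: the names `SurvivorView` and `should_issue_cfod` are hypothetical, and the root aggregate mirror check that appears in the log is left out because this thread does not say how it is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurvivorView:
    """What the monitor knows at one poll (hypothetical structure)."""

    partner_reachable: bool  # did the partner's API answer?
    interconnect_up: bool    # "IC" as reported for the survivor in the log
    takeover_seen: bool      # has any takeover state been observed?


def should_issue_cfod(view: SurvivorView) -> bool:
    # A takeover attempt, successful or failed, rules out CFOD.
    if view.takeover_seen:
        return False
    # If the partner still answers, there is nothing to take over.
    if view.partner_reachable:
        return False
    # Remaining case: partner silent and normal takeover blocked by IC failure.
    return not view.interconnect_up


# Scenario from this thread: the survivor attempted a takeover first.
print(should_issue_cfod(SurvivorView(False, False, True)))
# Assumed site-failure scenario: IC lost with no takeover attempt.
print(should_issue_cfod(SurvivorView(False, False, False)))
```

Under these assumptions the first call returns False, which matches what was observed in the test, and only the second case would qualify for CFOD.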

 

I am not sure whether you have a chance to talk with the RRE team. MCTB is a wonderful tool and we hope it can be extended to SMC and cDOT.

 

I am on the RRE team, and I am the original author of this version of MCTB. The Admin Guide for MCTB describes the use case and the procedures MCTB follows to determine whether to issue CFOD. cDOT will have its own MCCTB solution developed by engineering (not RRE). Again, the specific scenario MCTB is designed to handle cannot happen in an SMC, and so MCTB does not provide any value.

 

Regards,

Brian

 
