Solved: MCTB did not issue CFOD; it reports the remaining node in a failover state. Did I miss anything?

chao · ‎2014-11-05

I set up MCTB to monitor an SMC. When I power off one controller and the disk shelf on the same side, MCTB detects that one node is down, but it also reports that the other node is already in a takeover state and skips the remaining checks. MCTB also did not issue the CFOD command. Did I miss anything?

MCTB version is 2.4, DFM version is 5.2 and ONTAP version is 8.1.3

mctb.log:

2014-11-06 13:50:42,688 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:50:42,688 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:50:42,688 [filerStatusUpdate-1] DEBUG com.netapp.rre.anegada.Filer - fas3220a: cf-status CONNECTED, Enabled: true, IC: true
2014-11-06 13:50:42,688 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status CONNECTED, Enabled: true, IC: true
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking interconnects
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Getting aggr status for fas3220a
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220a: loading aggregate mirror status
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220a: aggr0: mirrored
2014-11-06 13:50:42,703 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Getting aggr status for fas3220b
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220b: loading aggregate mirror status
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Filer - fas3220b: aggr0: mirrored
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking CFOD for filer fas3220a
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor CF state: CONNECTED
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor IC: true
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Root Aggr Mirror Degraded: false
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking CFOD for filer fas3220b
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor CF state: CONNECTED
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Survivor IC: true
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Root Aggr Mirror Degraded: false
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking mirror degradation on filer fas3220a
2014-11-06 13:50:42,719 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking mirror degradation on filer fas3220b
2014-11-06 13:50:52,734 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:50:52,734 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:50:52,734 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status TAKEOVER_STARTED, Enabled: true, IC: false
2014-11-06 13:51:13,764 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:51:13,764 [filerStatusUpdate-1] WARN com.netapp.rre.bautils.DfmCommand - DFM_EVENT MetroCluster-TieBreaker:Communications-Event:Api-Error, source: fas3220a: Connection timed out: connect
2014-11-06 13:51:13,764 [filerStatusUpdate-1] DEBUG com.netapp.rre.bautils.DfmCommand - Executing: dfm event generate MetroCluster-TieBreaker:Communications-Event:Api-Error fas3220a Connection timed out: connect
2014-11-06 13:51:14,013 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:51:24,029 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:51:24,029 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:51:24,029 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:51:45,058 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:51:45,058 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:51:55,074 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:51:55,074 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:51:55,074 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:52:16,103 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:52:16,103 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:52:26,119 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:52:26,119 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:52:26,119 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:52:47,148 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:52:47,148 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:52:57,164 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:52:57,164 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:52:57,164 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:53:18,162 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:53:18,162 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:53:28,178 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:53:28,178 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:53:28,178 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:53:49,207 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:53:49,207 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:53:59,223 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:53:59,223 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:53:59,238 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:54:20,252 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:54:20,252 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:54:30,268 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:54:30,268 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:54:30,268 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:54:51,344 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:54:51,344 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:55:01,359 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:55:01,359 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:55:01,375 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:55:22,405 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:55:22,405 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
2014-11-06 13:55:32,420 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Checking site reachability
2014-11-06 13:55:32,420 [MetroCluster1-monitor] DEBUG com.netapp.rre.anegada.Monitor - Updating filer status
2014-11-06 13:55:32,420 [filerStatusUpdate-2] DEBUG com.netapp.rre.anegada.Filer - fas3220b: cf-status ERROR, Enabled: true, IC: false
2014-11-06 13:55:53,434 [filerStatusUpdate-1] ERROR com.netapp.rre.anegada.Filer - fas3220a: ApiException on cf-status: java.net.ConnectException: Connection timed out: connect
2014-11-06 13:55:53,434 [MetroCluster1-monitor] INFO com.netapp.rre.anegada.Monitor - fas3220b in a take-over state, skipping remaining checks.
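
For anyone reading a long mctb.log like this one, the cf-status lines can be reduced to just the state changes per filer. The following is only a rough sketch: the regular expression is derived from the DEBUG lines shown above and nothing else, and the function name `cf_state_changes` is made up for illustration. Other MCTB versions may log in a different format.

```python
import re
from collections import defaultdict

# Pattern based only on the "cf-status" DEBUG lines in the excerpt above.
CF_STATUS = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .* - "
    r"(?P<filer>\S+): cf-status (?P<state>\w+), "
    r"Enabled: (?P<enabled>\w+), IC: (?P<ic>\w+)"
)


def cf_state_changes(lines):
    """Return {filer: [(timestamp, cf_state, ic_up), ...]}.

    Only changes are recorded, so repeated identical polls collapse
    into a single entry.
    """
    history = defaultdict(list)
    for line in lines:
        m = CF_STATUS.match(line)
        if not m:
            continue  # ApiException lines and other messages are ignored
        entry = (m["state"], m["ic"] == "true")
        seen = history[m["filer"]]
        if not seen or seen[-1][1:] != entry:
            seen.append((m["ts"],) + entry)
    return dict(history)
```

Called as `cf_state_changes(open("mctb.log"))` on the excerpt above, it gives CONNECTED, then TAKEOVER_STARTED, then ERROR for fas3220b, and only CONNECTED for fas3220a, because every later poll of fas3220a ends in an ApiException.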

config.xml

<configuration>
<monitor name="MetroCluster1">
<enabled>true</enabled>
<pollingInterval>10</pollingInterval>
<cfodAbortTimeout>90</cfodAbortTimeout>
<cfodSuccessWaitTimeout>120</cfodSuccessWaitTimeout>
<testMode>false</testMode>
<site0 name="SiteA">
<filer name="fas3220a">
<hostname>10.128.13.32</hostname>
<username>root</username>
<encryptedPassword>pFYLhSg72PUC_ZUf-4NBcA</encryptedPassword>
<ssl>false</ssl>
<connectTimeout>5</connectTimeout>
<connectRetries>1</connectRetries>
</filer>
</site0>
<site1 name="SiteB">
<filer name="fas3220b">
<hostname>10.128.13.35</hostname>
<username>root</username>
<encryptedPassword>pFYLhSg72PUC_ZUf-4NBcA</encryptedPassword>
<ssl>false</ssl>
<connectTimeout>5</connectTimeout>
<connectRetries>1</connectRetries>
</filer>
</site1>
</monitor>
</configuration>
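
When sharing a config like this on a forum, it can help to post only the timing and connection settings and leave out the encryptedPassword field. The following is a minimal sketch that assumes the element names are exactly those in the config.xml above. The function name `summarize_config` and the keys of the returned dictionaries are invented for illustration, and the units of the timeout values are not stated in this thread, so the numbers are returned as-is.

```python
import xml.etree.ElementTree as ET


def summarize_config(xml_text):
    """Return monitor timing settings and filer connection settings.

    Credentials (username, encryptedPassword) are deliberately skipped.
    """
    root = ET.fromstring(xml_text)
    monitors = []
    for mon in root.findall("monitor"):
        filers = []
        for site in mon:
            # Site elements are named site0 / site1 in the config above.
            if not site.tag.startswith("site"):
                continue
            for filer in site.findall("filer"):
                filers.append({
                    "site": site.get("name"),
                    "filer": filer.get("name"),
                    "hostname": filer.findtext("hostname"),
                    "connect_timeout": int(filer.findtext("connectTimeout")),
                    "connect_retries": int(filer.findtext("connectRetries")),
                })
        monitors.append({
            "name": mon.get("name"),
            "polling_interval": int(mon.findtext("pollingInterval")),
            "cfod_abort_timeout": int(mon.findtext("cfodAbortTimeout")),
            "cfod_success_wait_timeout": int(mon.findtext("cfodSuccessWaitTimeout")),
            "test_mode": mon.findtext("testMode") == "true",
            "filers": filers,
        })
    return monitors
```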

On fas3220b:

fas3220b> cf status
fas3220a may be down, takeover disabled because of reason (partner mailbox disks not accessible or invalid)
fas3220b has disabled takeover by fas3220a (interconnect error)
VIA Interconnect is down (link 0 down, link 1 down).
The DR partner site might be dead.
To take it over, power it down or isolate it as described in the Data Protection Guide, and then use cf forcetakeover -d.

abrian · ‎2014-11-07

Hi TC,

My comments inline below:

1. I think SMC also needs an automated failover tool (MCTB?). Many customers place the two SMC nodes on different floors or in different cabinets. If one floor or one cabinet loses power, normal CFO can do nothing, so the customer needs a tool to help fail over quickly.

It doesn't matter how far apart the SMC controllers are; as long as they are within acceptable SMC limits, ONTAP will still fail over correctly. In SMC, regardless of the distance, MCTB doesn't provide any value.

2. I did some more testing and suspect the "issue" is on the MCTB side. It seems that once MCTB detects a node starting a takeover, it still considers the node to be in a take-over state even if the takeover then fails because the mailbox disks are missing. But if I stop and start the MCTB service, MCTB refreshes its state, finds that the SMC needs CFOD, and then issues CFOD.

As soon as MCTB detects a takeover state on either controller, or a transition between takeover states, it will refuse to take any action. This is by design, as MCTB cannot be allowed to interfere with normal takeover operations.

3. I think that even in an FMC, something can bring the controller down a few seconds before the shelf or switch goes down. The partner will then try to take over first and fail because the mailbox disks are missing. I doubt MCTB handles that well. I think MCTB could be improved to re-check the cluster status even after it has been informed that one node tried to take over; MCTB should learn whether the takeover was successful.

MCTB is not designed to trigger CFOD on a failed takeover, only in the explicit case where normal takeover cannot happen due to IC failure (which cannot happen in an SMC).
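
Taken together, the behaviour described in this thread can be sketched roughly as below. This is an illustration of the rule as stated here, not MCTB's actual code: the class and field names are hypothetical, the real checks documented in the Admin Guide include more conditions (the log also shows a root aggregate mirror check), and the idea that the takeover observation is remembered until the service restarts is an assumption based on the stop/start behaviour reported above.

```python
from dataclasses import dataclass


@dataclass
class SurvivorStatus:
    cf_state: str            # e.g. "CONNECTED", "TAKEOVER_STARTED", "ERROR"
    interconnect_up: bool    # the "IC" value in mctb.log
    partner_reachable: bool  # hypothetical: did the partner answer the API call?


class TieBreakerSketch:
    """Illustrative only; not MCTB's real code or its full set of checks."""

    def __init__(self):
        # Assumption: once a takeover state has been seen, it is remembered
        # until the service restarts (a new instance of this class).
        self.takeover_seen = False

    def should_issue_cfod(self, status: SurvivorStatus) -> bool:
        if status.cf_state.startswith("TAKEOVER"):
            self.takeover_seen = True
        if self.takeover_seen:
            # Never interfere with a normal takeover, even a failed one.
            return False
        # The one case described in this thread: the interconnect is down
        # and the partner cannot be reached, so ONTAP will not take over.
        return not status.interconnect_up and not status.partner_reachable
```

Under these assumptions, a monitor that has polled TAKEOVER_STARTED once keeps returning False on later ERROR polls, while a freshly started monitor that sees only ERROR with the interconnect down and the partner unreachable returns True. That matches what was observed after restarting the service.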

I am not sure whether you have a chance to talk with the RRE team. MCTB is a wonderful tool and we hope it can be extended to SMC and cDOT.

I am on the RRE team, and I am the original author of this version of MCTB. The Admin Guide for MCTB describes the use case and the procedure MCTB follows to determine whether to issue CFOD. cDOT will have its own MCCTB solution developed by engineering (not RRE). Again, the specific scenario MCTB is designed to handle cannot happen in an SMC, and so MCTB does not provide any value.

Regards,

Brian


chao · ‎2014-11-05

I have almost found the root cause. When I simulate a site outage, I power off one controller and the shelf on the same side. But if I do this in the "incorrect" order, I may power off the controller about 1 second before the shelf, and the partner may then try to take over, as these messages show:

Thu Nov 6 13:51:12 CST [fas3220b:cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(fas3220a), system_down because power_loss.

Thu Nov 6 13:51:13 CST [fas3220b:cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of fas3220b by fas3220a disabled (interconnect error).
Thu Nov 6 13:51:13 CST [fas3220b:cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
Thu Nov 6 13:51:13 CST [fas3220b:cf.fm.takeoverStarted:notice]: Failover monitor: takeover started

But even though the takeover was unsuccessful, the failover monitor still shows "takeover started".

It seems MCTB trusts this state by mistake and will not trigger the CFOD.

Is this a bug? Or when will the failover monitor clear the "takeover started" state?

abrian · ‎2014-11-06

Hi Chao,

ONTAP will do a normal failover whenever possible when one site of a MetroCluster goes down. The specific case where ONTAP will not perform a normal takeover is when it cannot communicate with the partner and can't guarantee that it's indeed down. This is the scenario MCTB is designed to automate. As long as the IC link between the partners is up when one or the other fails, ONTAP will conduct a normal takeover and MCTB will detect that and take no action, which is what you are seeing.

Now, by SMC I assume you mean Stretch MetroCluster, correct? MCTB is never needed in an SMC configuration because the IC can't go down before the head does (there are no fabric switches that would bring the IC down prematurely). For SMC configurations, MCTB is completely unnecessary.

Brian

chao · ‎2014-11-06

Hi Brian,

Thank you very much for your kind reply.

1. I think SMC also needs an automated failover tool (MCTB?). Many customers place the two SMC nodes on different floors or in different cabinets. If one floor or one cabinet loses power, normal CFO can do nothing, so the customer needs a tool to help fail over quickly.

2. I did some more testing and suspect the "issue" is on the MCTB side. It seems that once MCTB detects a node starting a takeover, it still considers the node to be in a take-over state even if the takeover then fails because the mailbox disks are missing. But if I stop and start the MCTB service, MCTB refreshes its state, finds that the SMC needs CFOD, and then issues CFOD.

3. I think that even in an FMC, something can bring the controller down a few seconds before the shelf or switch goes down. The partner will then try to take over first and fail because the mailbox disks are missing. I doubt MCTB handles that well. I think MCTB could be improved to re-check the cluster status even after it has been informed that one node tried to take over; MCTB should learn whether the takeover was successful.

I am not sure whether you have a chance to talk with the RRE team. MCTB is a wonderful tool and we hope it can be extended to SMC and cDOT.

Thanks and Best Regards!

TC
