ONTAP Discussions
Does a fabric MetroCluster require all 4 mailbox disks to load ONTAP?
Are only the local aggregate's mailbox disks (2) sufficient for a single node?
Is the ISL link mandatory for a fabric MetroCluster?
In force-takeover mode, can the surviving node be shut down and started up without any issues?
Does a fabric MetroCluster require all 4 mailbox disks to load ONTAP?
no
Are only the local aggregate's mailbox disks (2) sufficient for a single node?
yes
Is the ISL link mandatory for a fabric MetroCluster?
no
In force-takeover mode, can the surviving node be shut down and started up without any issues?
yes
BTW, you can always kill/reset the mailboxes with 'mailbox destroy local/partner' in maintenance mode.
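As a minimal sketch of that reset (boot into maintenance mode first; both commands ask for confirmation before clearing any takeover state, so read the warnings carefully before answering yes):
++++++++++++
*> mailbox destroy local
*> mailbox destroy partner
*> halt
++++++++++++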
Hello Thomas
Thanks for the response. Could you direct me to a link or document in support of the above, if any?
You mean the ISL link is not mandatory (i.e., no active connection) while booting a node?
In the case of CFOD, the surviving node was also shut down, and while booting it definitely will not find the two remote-node disks (in the absence of the ISL link on both fabrics). By releasing the local/remote mailbox disks, does the node come up without issue? By destroying the mailbox disks, the takeover status might be lost, so does the node come up in takeover mode or standalone (cf disabled)?
Regards
Kiran
I doubt that this rare edge case is documented properly anywhere 😕 My knowledge comes from real-life experience of bringing up almost-dead MetroClusters 😉
The ISL definitely isn't needed; cf will just be disabled since the partner is not found. As for the missing heartbeat disks, I doubt they prevent a node from booting. I could give it a go in the lab next Monday.
Hello Thomas
Were you able to find a document on the topic? In the meantime, could you try simulating booting a node after takeover (surviving node shut down after takeover)? Will the node come up properly in the absence of the ISL link?
Thanks in advance
Kiran
Are only the local aggregate's mailbox disks (2) sufficient for a single node?
yes
This probably needs some clarification. In a MetroCluster there are normally 8 mailboxes (2 on each plex of the mirrored aggregate on each controller). So do you mean that 2 local and 4 remote mailboxes are sufficient? Or that half of all the mailboxes is sufficient?
Hello,
Hope this helps:
Mailbox disks are used to determine partner status. If the mailbox status is uncertain, cf will be disabled.
The mailboxes don't affect how ONTAP is loaded during boot.
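You can check the current failover state at any time from the normal CLI (a sketch; the prompt name is illustrative):
++++++++++++
nodeA> cf status
++++++++++++
If the mailbox status is uncertain, it should report that takeover is disabled.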
Let's put it this way.
In an HA configuration, the nodes have to know the partner's status.
If the interconnect link is down, the filers can still see that the partner is alive, because the nodes can update the mailbox disks.
The ISL link is used for data traffic and also to sync NVRAM (the interconnect).
If you have a TI zone, then you have dedicated fibre for the interconnect traffic.
You cannot have a fabric MetroCluster without an ISL link.
Force takeover is different from a basic takeover in an HA pair.
Force takeover is only available in MetroClusters.
A "normal" takeover happens when nodeA goes down: nodeB sees that and takes over nodeA's role on the B side to serve data.
Force takeover doesn't happen automatically; you have to type the command to do it.
Scenario:
The ISL link breaks between the sites.
Both nodes are okay, but the mirroring for data and cf can't happen any more, because NVMEM is not in sync, pool 1 (mirrored data) is unavailable, and the mailboxes are unavailable.
In this case, if you do a force takeover on nodeA, it will start serving nodeB's data from the nodeA site (the data is mirrored, so this is possible).
This is not what you want, because nodeB is okay and has been serving data all the time.
Then you have "two nodeBs" available to the clients.
You do a force takeover only when you know that the other site is down, or when you know it is going to be down.
Example: the air conditioning is broken and the heat is rising. You shut down site B to avoid overheating, do the force takeover, and start serving data from site A.
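A rough sketch of that sequence (node names are illustrative), assuming site B is shut down cleanly first:
++++++++++++
nodeB> halt -t 0              (shut down site B)
nodeA> cf forcetakeover -d    (site A takes over and serves B's data from the mirror)
++++++++++++
Keep in mind that cf forcetakeover -d splits the mirrored aggregates, so the giveback afterwards is more involved than a normal cf giveback. Plan for that before you start.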
Br.
Ismo.
Hello Ismo & Thomas
I am looking at a peculiar scenario where I am forcing the entire fabric MetroCluster to shut down for maintenance of both sites on the same day/time. In this scenario, the DWDM link is not active and the ISL is not available to the nodes (both fabrics).
1. I can execute cf disable and bring down both nodes separately.
2. In case the ISL (DWDM) link is not active, can I bring up a single node (cf disabled state)?
3. If I manually disable the ISL ports on the switches and proceed to execute CFOD (cf forcetakeover -d), then shut down the first site and next shut down the surviving node:
upon maintenance completion, if I try to bring up the 2nd node (in the absence of an active ISL link), will the node come up? If it does, what are the steps (like releasing the mailbox disks of local/partner, or any further steps to perform to bring the node back)?
4. I see a KB article on mailbox disks (if a node is not accessible, reset the mailbox disks) but am not sure whether a node comes up with only the 2 local mailbox disks.
Thanks in advance
Kiran
1) Yes.
2) In case the ISL goes down, cf gets disabled automatically.
3) The moment you disable the ISL, run cf forcetakeover -d on one node, and still have the 2nd node running, you have a split brain. Don't do that.
If you need to do maintenance on node 2, do a normal cf takeover first, then disable the ISL and shut down node 2 (see the sketch below).
4) It will.
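Something like this, as a sketch (node names are illustrative):
++++++++++++
node1> cf takeover     (node2 halts to waiting-for-giveback)
                       (disable the ISL ports on the switches)
                       (power off node2 and do the maintenance)
node1> cf giveback     (once node2 is back and the ISL is re-enabled)
++++++++++++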
Hello Thomas,
I want to do maintenance on both sites (MetroCluster) on the same day and time.
Hence I have to shut down both nodes, and once maintenance is complete I have to bring up both nodes. In case the ISL link has failed completely, how do I proceed?
In the above, node1 takes over node2, and node1 (in takeover) is the surviving site; I want to shut down the surviving node as well. When I start node1 (which was in takeover mode before shutdown), will it come up normally?
If it encounters any issue, how do I tackle the situation to bring up at least one node (either in cluster or cluster-disabled state)?
I will be eagerly waiting for the results of the testing next week.
Regards
Kiran
Kiran,
you want to do maintenance on both nodes at the same time?
1) ISL UP, on both nodes:
cf disable
halt -t 0
2) ISL DOWN, on both nodes:
cf disable
halt -t 0
That's it; no need to take over if you don't run any services on either side anyway.
After maintenance is done:
1) ISL UP
Just boot both nodes.
cf enable
2) ISL DOWN
Just boot both nodes.
cf enable will fail as the ISL is down.
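After booting, it's worth confirming the state before putting load back on (a sketch; run on each node):
++++++++++++
node> cf status          (confirm the failover state)
node> aggr status -r     (confirm both plexes of each mirrored aggregate)
++++++++++++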
Hello Thomas
Like you said above, the process works and was followed. Recently, however, I faced a scenario where the ISL links (both fabrics) were down during maintenance, and when I tried to boot up, the node was not coming up; finally I released the disks and destroyed the mailbox disks for local and partner.
Now I am looking for an alternate process, if any, for such maintenance with an unknown issue encountered during the period. I am trying to see whether we can perform a failover of a node to its partner and then shut down both nodes (sites A and B) for maintenance. After maintenance, when we boot the node on site B (the takeover partner), will it require an ISL to come up properly? Secondly, does the takeover node require all 4 mailbox disks to boot up (as said earlier, not required)?
If there is an issue like the takeover node not being able to boot, do we need to follow the same process of releasing the storage disks and destroying the mailbox disks (local, partner)? Will that bring up the node? From the output below it seems the takeover node loses the cluster details; is this true? In this case, how do I recover the cluster status as-is and bring the node up normally?
Even if the cluster state is lost and I am able to bring up the takeover node, is the data intact up to the last write, given there were no changes/writes in the last 30 minutes?
Output:
++++++++++++
*> mailbox destroy local
Destroying mailboxes forces a node to create new empty mailboxes,
which clears any takeover state and removes all knowledge
of out-of-date plexes of mirrored volumes.
Are you sure you want to destroy the local mailboxes? yes
mailboxes destroyed
*> mailbox destroy partner
Destroying mailboxes forces a node to create new empty mailboxes,
which clears any takeover state and removes all knowledge
of out-of-date plexes of mirrored volumes.
Destroying partner mailboxes means that you will not be
able to do a takeover of any sort (including forcetakeover -d)
until the partner reboots successfully. This is dangerous when
the local node is, or should be in, takeover state and
VERY dangerous if the partner has suffered some form of disaster
Are you sure you want to destroy the partner mailboxes?
++++++++++++
Thanks in advance
Regards
Kiran
Like you said above, the process works and was followed. Recently, however, I faced a scenario where the ISL links (both fabrics) were down during maintenance, and when I tried to boot up, the node was not coming up
That's what I expected. Booting in this case would be highly dangerous and could lead to data corruption.
If there is an issue like the takeover node not being able to boot, do we need to follow the same process of releasing the storage disks and destroying the mailbox disks (local, partner)? Will that bring up the node?
It must be decided on a case-by-case basis. It is impossible to give a blanket statement. You need to evaluate the situation and decide on your priorities: immediate service availability with potential data loss, or data integrity by all means.
1. I can execute cf disable and bring down both nodes separately.
2. In case the ISL (DWDM) link is not active, can I bring up a single node (cf disabled state)?
For all I know, this is not possible without effectively destroying the active/active configuration.
3. If I manually disable the ISL ports on the switches and proceed to execute CFOD (cf forcetakeover -d), then shut down the first site and next shut down the surviving node:
upon maintenance completion, if I try to bring up the 2nd node (in the absence of an active ISL link), will the node come up?
Which one is "second"? Let's say you have site A and site B, the intersite link is lost, and site B did "cf forcetakeover -d". Then you should be able to boot site B (it will come up in takeover mode hosting both A and B), but you should not be able to boot site A.
4. I see a KB article on mailbox disks (if a node is not accessible, reset the mailbox disks) but am not sure whether a node comes up with only the 2 local mailbox disks.
This is a majority rule. You have 8 mailboxes in total (if not, the cluster is misconfigured anyway). You need more than half of them for a node to boot: with 8 mailboxes, that means at least 5 must be accessible.
If a node has access to all of its local mailbox disks prior to the node being shut down, it needs access to the same (all) disks when it boots. It only needs access to the partner mailbox disks if a takeover is needed.
If you disable the ISL links while the node(s) are up, the HA mailbox logic will re-configure itself to use the 2 remaining accessible disks. This can take a couple of seconds; there are EMS messages which indicate it has happened. You can then reboot the node with only those 2 remaining mailbox disks.
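If you want to confirm that the mailbox re-configuration has happened before you reboot, one simple check (a sketch, assuming console access) is to read the messages file and look for the mailbox-related EMS events:
++++++++++++
node> rdfile /etc/messages
++++++++++++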
However, if you may want to do 'cf forcetakeover -d', you should not disable the ISL first. You run the risk of the data on the 2 plexes diverging and will lose data.
Thank you, each and every one of you, for providing valuable input on the subject and sparing your time. I will leave the query open for a couple of days.