MetroCluster & NVRAM Interconnect

shawn_lua_cw

Dear Community Members,

Firstly, thank you for viewing my post.

This is a question regarding MetroCluster behaviour.

We are in the process of acquiring a pair of 3160s in a MetroCluster/SyncMirror configuration, with 2 filers (3160a, 3160b) and 4 shelves (a-shelf0-pri, a-shelf0-sec, b-shelf0-pri, b-shelf0-sec).

This configuration provides shelf redundancy as well as filer redundancy.

I understand that after a full site failure, for example of node A (left), manual intervention is required to force a takeover on B. In that case B has lost all communication with A, including its connections to b-shelf0-sec and a-shelf0-pri.

Automatic failover from A to B happens when a hardware failure occurs, e.g. an NVRAM card failure.

But what will happen when I cut or disconnect only the interconnect cables? How do A and B detect that a takeover is not required, and that the loss of connection is due to a connectivity fault rather than an NVRAM card fault?

Also, if for example A's NVRAM card fails, how does B know that it should take over, and that the loss of connection is due to a "real" hardware fault rather than a connectivity fault?

I searched for information on this and found the following:

1 (from Active-Active Controller Configuration Overview & Best Practices Guide) - If an NVRAM card fails, active/active controllers automatically fail over to their partner node and serve data from the surviving storage controller.
and
The interconnect adapter incorporates dual interconnect cables. If one cable fails, the heartbeat and NVRAM data are automatically sent over the second cable without delay or interruption. If both cables fail, failover capability is disabled, but both storage controllers continue to serve data to their respective applications and users. The Cluster Monitor then generates warning messages.
2 (from MetroCluster Design and Implementation Guide) - When a storage controller fails in an active-active configuration, the partner detects the failure and automatically (if enabled) performs a takeover of the data-serving responsibilities from the failed controller. Part of this process relies on the surviving controller being able to read information from the disks on the failed controller. If this quorum of disks is not available, then automatic takeover won’t be performed.
Points 1 and 2 above indicate that the quorum of disks is an important factor in deciding whether failover should take place. But how does the filer determine this?
Does this mean that if the interconnect cables are disconnected, nodes A and B will each try to take over ownership of the other node's shelves, but because the owning controller has not detected any hardware failure, neither releases ownership, so its peer cannot gain access and failover does not take place? And that during a true NVRAM card failure, the node where the failure has occurred releases ownership of its shelves and lets its peer connect, thereby allowing failover?
The above is based on my understanding from the docs, so I would like to know whether this is really how it works, and whether there are any docs that describe it.
Thank you all for your kind attention; I would really appreciate any confirmation.
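To make the question concrete, here is a rough Python sketch of the behaviour as I read it from the two quoted passages. It is only an illustration: the names `Action`, `PartnerView` and `decide` are made up for this post, and it does not claim to reflect how Data ONTAP actually implements the check. In particular, how `partner_failure_detected` would get set when the interconnect is down is exactly the part I am asking about.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NORMAL = "normal operation, failover available"
    FAILOVER_DISABLED = "both nodes keep serving, failover disabled, warnings logged"
    TAKEOVER = "automatic takeover of the partner"
    NO_AUTO_TAKEOVER = "no automatic takeover, manual intervention needed"


@dataclass
class PartnerView:
    """What one controller observes about its partner (hypothetical fields)."""
    interconnect_cables_up: int         # 0, 1 or 2 working interconnect cables
    partner_failure_detected: bool      # assumption: some mechanism reports a real failure
    partner_disk_quorum_readable: bool  # can this node read the partner's disks?
    auto_takeover_enabled: bool = True  # "automatically (if enabled)" in quote 2


def decide(view: PartnerView) -> Action:
    # Quote 2: on a detected controller failure, take over automatically only
    # if that is enabled and the partner's disk quorum can be read.
    if view.partner_failure_detected:
        if view.auto_takeover_enabled and view.partner_disk_quorum_readable:
            return Action.TAKEOVER
        return Action.NO_AUTO_TAKEOVER
    # Quote 1: both cables down and no failure detected means failover is
    # disabled, while both controllers keep serving their own data.
    if view.interconnect_cables_up == 0:
        return Action.FAILOVER_DISABLED
    # Quote 1: a single remaining cable carries heartbeat and NVRAM data.
    return Action.NORMAL


if __name__ == "__main__":
    # Both interconnect cables pulled, partner otherwise healthy.
    print(decide(PartnerView(0, False, True)).value)
    # Partner failure detected and its disks are readable.
    print(decide(PartnerView(0, True, True)).value)
    # Partner failure detected but its disks are unreachable (e.g. site loss).
    print(decide(PartnerView(0, True, False)).value)
```

If this sketch is roughly right, the open question is what feeds `partner_failure_detected` and `partner_disk_quorum_readable` on a real system.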

Message was edited by: shawn.lua.cw for readability.
