Warning, long post.
TL;DR: Halting nodes doesn't remove their eligibility, and I forgot to mention cluster HA.
If I start with a healthy 4-node cluster:
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true true
ontap9-02 true true false
ontap9-03 true true false
ontap9-04 true true false
4 entries were displayed.
ontap9::*> storage failover show
Takeover
Node Partner Possible State Description
-------------- -------------- -------- -------------------------------------
ontap9-01 ontap9-02 true Connected to ontap9-02
ontap9-02 ontap9-01 true Connected to ontap9-01
ontap9-03 ontap9-04 true Connected to ontap9-04
ontap9-04 ontap9-03 true Connected to ontap9-03
4 entries were displayed.
Then halt the second HA pair; it will warn you when you take down the second node:
ontap9::*> halt -node ontap9-03 -inhibit-takeover true
ontap9::*> halt -node ontap9-04 -inhibit-takeover true
(system node halt)
Warning: This operation will cause node "ontap9-04" to be marked as unhealthy.
Unhealthy nodes do not participate in quorum voting. If the node goes
out of service and one more node goes out of service there will be a
data serving failure for the entire cluster. This will cause a client
disruption. Use "cluster show" to verify cluster state. If possible
bring other nodes online to improve the resiliency of this cluster.
Do you want to continue? {y|n}: y
You can see the down nodes are still eligible:
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true true
ontap9-02 true true false
ontap9-03 false true false
ontap9-04 false true false
4 entries were displayed.
And if I try to take over node 2, the command fails, warning that I am about to cause a data service failure:
ontap9::storage failover*> takeover -ofnode ontap9-02
Error: command failed: Taking node "ontap9-02" out of service might result in a data service failure and
client disruption for the entire cluster. If possible, bring an additional node online to improve
the resiliency of the cluster and to ensure continuity of service. Verify the health of the node
using the "cluster show" command, then try the command again, or provide "-ignore-quorum-warnings"
to bypass this check.
If I override it, or just bounce the node, I lose quorum and the RDBs drop offline.
ontap9::storage failover*> takeover -ofnode ontap9-02 -ignore-quorum-warnings true
Warning: A takeover will be initiated. Once the partner node reboots, a giveback will be automatically
initiated. Do you want to continue? {y|n}: y
ontap9::storage failover*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 false true true
ontap9-02 false true false
ontap9-03 false true false
ontap9-04 false true false
4 entries were displayed.
ontap9::storage failover*> cluster ring show
Node UnitName Epoch DB Epoch DB Trnxs Master Online
--------- -------- -------- -------- -------- --------- ---------
ontap9-01 mgmt 0 7 108 - offline
ontap9-01 vldb 0 8 9 - offline
ontap9-01 vifmgr 0 8 65 - offline
ontap9-01 bcomd 0 8 1 - offline
ontap9-01 crs 0 8 1 - offline
Warning: Unable to list entries on node ontap9-02. RPC: Couldn't make connection
Unable to list entries on node ontap9-03. RPC: Couldn't make connection
Unable to list entries on node ontap9-04. RPC: Couldn't make connection
5 entries were displayed.
No sane admin would override those warnings, but when you're in the DC doing maintenance and somebody bumps the wrong power cable, well, things happen.
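For what it's worth, the quorum math behind those warnings is simple enough to sketch. This is just my simplified model of the voting rule, not the actual RDB implementation: only healthy, eligible nodes vote, a simple majority of the eligible nodes keeps the cluster in quorum, and the epsilon holder breaks the tie when exactly half of them are up.

# Toy model of the quorum voting rule (my simplification, not ONTAP code):
# only healthy, eligible nodes vote, a simple majority of the eligible nodes wins,
# and the epsilon holder breaks the tie when exactly half of them are up.

def in_quorum(eligible, healthy, epsilon_holder=None):
    voters = eligible & healthy                # unhealthy or ineligible nodes don't vote
    if len(voters) * 2 > len(eligible):        # clear majority of eligible nodes
        return True
    if len(voters) * 2 == len(eligible):       # exact tie: epsilon decides
        return epsilon_holder in voters
    return False

all_four = {"ontap9-01", "ontap9-02", "ontap9-03", "ontap9-04"}

# Halting 03/04 without touching eligibility: 2 of 4 voters plus epsilon -> still in quorum
print(in_quorum(all_four, {"ontap9-01", "ontap9-02"}, "ontap9-01"))   # True

# Taking over 02 on top of that: 1 of 4 voters -> quorum lost, the RDB units go offline
print(in_quorum(all_four, {"ontap9-01"}, "ontap9-01"))                # False

# Marking 03/04 ineligible first shrinks the electorate: 2 of 2 voters -> safely in quorum
print(in_quorum({"ontap9-01", "ontap9-02"}, {"ontap9-01", "ontap9-02"}, "ontap9-01"))  # True

Which brings us to the safer way of doing this.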
Now let's see how it plays out if I remove their voting rights before I shut them down to move them to a new rack:
ontap9::*> node modify -node ontap9-03 -eligibility false
Warning: When a node's eligibility is set to "false," SAN and NAS access might be affected. This setting
should be used only for unusual maintenance operations. To restore the node's data-serving
capabilities, set the eligibility to "true" and reboot the node. Continue? {y|n}: y
ontap9::*> node modify -node ontap9-04 -eligibility false
Warning: When a node's eligibility is set to "false," SAN and NAS access might be affected. This setting
should be used only for unusual maintenance operations. To restore the node's data-serving
capabilities, set the eligibility to "true" and reboot the node. Continue? {y|n}: y
Waiting for a quorum membership change to complete.
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true true
ontap9-02 true true false
ontap9-03 false false false
ontap9-04 false false false
4 entries were displayed.
No longer eligible, and now for the halt:
ontap9::*> halt -node ontap9-03 -inhibit-takeover true
(system node halt)
Warning: Are you sure you want to halt node "ontap9-03"? {y|n}: y
ontap9::*> halt -node ontap9-04 -inhibit-takeover true
(system node halt)
Warning: Are you sure you want to halt node "ontap9-04"? {y|n}: y
Notice no quorum warnings this time.
I'm not stable enough yet: if my epsilon node bounces, I still lose the cluster. So, during this unusual maintenance activity, I'll turn on cluster HA:
ontap9::*> cluster ha modify -configured true
Warning: High Availability (HA) configuration for cluster services requires that both SFO storage failover
and SFO auto-giveback be enabled. These actions will be performed if necessary.
Do you want to continue? {y|n}: y
Notice: HA is configured in management.
Now I effectively have a two-node cluster again:
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true false
ontap9-02 true true false
ontap9-03 false false false
ontap9-04 false false false
4 entries were displayed.
So if something happens to one of my remaining nodes, I can continue to serve data:
ontap9::*> storage failover takeover -ofnode ontap9-01
Warning: A takeover will be initiated. Once the partner node reboots, a giveback
will be automatically initiated. Do you want to continue?
{y|n}: y
ontap9::*>
And I am limping along on one node, but it is better than the alternative:
ontap9::cluster ha*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 false true false
ontap9-02 true true false
ontap9-03 false false false
ontap9-04 false false false
4 entries were displayed.
ontap9::cluster ha*> cluster ring show
Node UnitName Epoch DB Epoch DB Trnxs Master Online
--------- -------- -------- -------- -------- --------- ---------
Warning: Unable to list entries on node ontap9-01. RPC: Couldn't make
connection
ontap9-02 mgmt 9 9 33 ontap9-02 master
ontap9-02 vldb 10 10 6 ontap9-02 master
ontap9-02 vifmgr 10 10 15 ontap9-02 master
ontap9-02 bcomd 10 10 1 ontap9-02 master
ontap9-02 crs 10 10 1 ontap9-02 master
Warning: Unable to list entries on node ontap9-03. RPC: Couldn't make
connection
Unable to list entries on node ontap9-04. RPC: Couldn't make
connection
5 entries were displayed.
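That also fits the toy model above if you treat cluster HA as a substitute tie-breaker: with only two eligible voters and epsilon cleared, a bare majority can never survive the loss of either node, so it's the cluster HA configuration that lets the lone survivor keep the RDB units online. Roughly (the cluster_ha flag here is just my own knob for the model, not an ONTAP setting):

# Extension of the toy model for the two-eligible-node case: epsilon is cleared,
# so a plain majority vote (2 of 2) can't survive losing a node. Cluster HA is what
# lets the surviving node carry on; modeled here as a hypothetical tie-breaker flag.

def in_quorum_two_node(eligible, healthy, cluster_ha=False):
    voters = eligible & healthy
    if len(voters) * 2 > len(eligible):       # both nodes up
        return True
    return cluster_ha and len(voters) == 1    # lone survivor, saved only by cluster HA

pair = {"ontap9-01", "ontap9-02"}
print(in_quorum_two_node(pair, {"ontap9-02"}, cluster_ha=False))  # False: quorum would be lost
print(in_quorum_two_node(pair, {"ontap9-02"}, cluster_ha=True))   # True: matches the ring output above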
Then when everything is done and the nodes are up in their new home, I can undo those changes and put things back together:
First disable cluster HA:
ontap9::*> cluster ha modify -configured false
Warning: This operation will unconfigure cluster HA. Cluster HA must be configured on a two-node cluster to
ensure data access availability in the event of storage failover.
Do you want to continue? {y|n}: y
Notice: HA is disabled.
Then mark one of the nodes as eligible:
ontap9::*> node modify -node ontap9-03 -eligibility true
And reboot it:
ontap9::*> reboot -node ontap9-03 -inhibit-takeover true
(system node reboot)
Warning: This operation will cause node "ontap9-03" to be marked as unhealthy. Unhealthy nodes do not
participate in quorum voting. If the node goes out of service and one more node goes out of service
there will be a data serving failure for the entire cluster. This will cause a client disruption.
Use "cluster show" to verify cluster state. If possible bring other nodes online to improve the
resiliency of this cluster.
Do you want to continue? {y|n}: y
ontap9::*>
Once it comes back, repeat on the other node:
ontap9::*> node modify -node ontap9-04 -eligibility true
ontap9::*> node reboot -node ontap9-04 -inhibit-takeover true
Warning: Are you sure you want to reboot node "ontap9-04"? {y|n}: y
Once it comes back up, you're good.
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true true
ontap9-02 true true false
ontap9-03 true true false
ontap9-04 true true false
4 entries were displayed.
If this post resolved your issue, help others by selecting ACCEPT AS SOLUTION or adding a KUDO.