ONTAP Discussions
We have a 4 node cluster consisting of two AFF8080 HA Pairs, running 8.3.2P10. Two of the nodes are deliberately configured with only test/dev volumes because we need to shut those nodes down temporarily for about an hour to move them to a new rack.
I want to be sure this temporary shutdown is done cleanly and doesn't cause issues. Here are the steps I have taken or will take before shutdown:
Are there any other items I should consider to ensure a smooth shutdown? In particular I'm wondering if I should do anything else with the root volume protection snapmirror jobs. The root volumes for the test/dev SVMs (and their copies) are currently on the nodes that are staying up. Not sure if that will be an issue.
Would love to hear any feedback or suggestions!
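For reference, this is roughly how the root-volume protection mirrors and their placement can be checked before the shutdown (a sketch; the SVM name is a placeholder, not our real one):

```
ontap9::> snapmirror show -type LS -fields source-path,destination-path,state

ontap9::> volume show -vserver svm_testdev -volume *root* -fields aggregate
ontap9::> storage aggregate show -fields node
```

The first command lists the load-sharing (LS) mirror relationships for the SVM root volumes; the other two show which aggregates (and therefore which nodes) the root volumes and their copies live on.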
Solved! See The Solution
Warning, long post.
TL;DR: Halting nodes doesn't remove their eligibility, and I forgot to mention cluster HA.
If I start with a healthy 4 node cluster:
ontap9::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
ontap9-01            true    true          true
ontap9-02            true    true          false
ontap9-03            true    true          false
ontap9-04            true    true          false
4 entries were displayed.

ontap9::*> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
ontap9-01      ontap9-02      true     Connected to ontap9-02
ontap9-02      ontap9-01      true     Connected to ontap9-01
ontap9-03      ontap9-04      true     Connected to ontap9-04
ontap9-04      ontap9-03      true     Connected to ontap9-03
4 entries were displayed.
Then halt the second HA pair; it will warn you when you take down the second node:
ontap9::*> halt -node ontap9-03 -inhibit-takeover true

ontap9::*> halt -node ontap9-04 -inhibit-takeover true
  (system node halt)

Warning: This operation will cause node "ontap9-04" to be marked as unhealthy.
         Unhealthy nodes do not participate in quorum voting. If the node goes
         out of service and one more node goes out of service there will be a
         data serving failure for the entire cluster. This will cause a client
         disruption. Use "cluster show" to verify cluster state. If possible
         bring other nodes online to improve the resiliency of this cluster.
Do you want to continue? {y|n}: y
You can see the down nodes are still eligible:
ontap9::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
ontap9-01            true    true          true
ontap9-02            true    true          false
ontap9-03            false   true          false
ontap9-04            false   true          false
4 entries were displayed.
And if I try to take over node 2, the command fails with a warning that I am about to cause a data service failure:
ontap9::storage failover*> takeover -ofnode ontap9-02

Error: command failed: Taking node "ontap9-02" out of service might result in
       a data service failure and client disruption for the entire cluster. If
       possible, bring an additional node online to improve the resiliency of
       the cluster and to ensure continuity of service. Verify the health of
       the node using the "cluster show" command, then try the command again,
       or provide "-ignore-quorum-warnings" to bypass this check.
If I override it, or just bounce the node, I lose quorum and the RDBs drop offline.
ontap9::storage failover*> takeover -ofnode ontap9-02 -ignore-quorum-warnings true

Warning: A takeover will be initiated. Once the partner node reboots, a
         giveback will be automatically initiated.
Do you want to continue? {y|n}: y

ontap9::storage failover*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
ontap9-01            false   true          true
ontap9-02            false   true          false
ontap9-03            false   true          false
ontap9-04            false   true          false
4 entries were displayed.

ontap9::storage failover*> cluster ring show
Node      UnitName Epoch    DB Epoch DB Trnxs Master    Online
--------- -------- -------- -------- -------- --------- ---------
ontap9-01 mgmt     0        7        108      -         offline
ontap9-01 vldb     0        8        9        -         offline
ontap9-01 vifmgr   0        8        65       -         offline
ontap9-01 bcomd    0        8        1        -         offline
ontap9-01 crs      0        8        1        -         offline

Warning: Unable to list entries on node ontap9-02. RPC: Couldn't make connection
         Unable to list entries on node ontap9-03. RPC: Couldn't make connection
         Unable to list entries on node ontap9-04. RPC: Couldn't make connection
5 entries were displayed.
No sane admin would override those warnings, but when you're in the DC doing maintenance and somebody bumps the wrong power cable, well, things happen.
Now let's see how it plays out if I remove their voting rights before I shut them down to move them to a new rack:
ontap9::*> node modify -node ontap9-03 -eligibility false

Warning: When a node's eligibility is set to "false," SAN and NAS access might
         be affected. This setting should be used only for unusual maintenance
         operations. To restore the node's data-serving capabilities, set the
         eligibility to "true" and reboot the node.
Continue? {y|n}: y

ontap9::*> node modify -node ontap9-04 -eligibility false

Warning: When a node's eligibility is set to "false," SAN and NAS access might
         be affected. This setting should be used only for unusual maintenance
         operations. To restore the node's data-serving capabilities, set the
         eligibility to "true" and reboot the node.
Continue? {y|n}: y

Waiting for a quorum membership change to complete.

ontap9::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
ontap9-01            true    true          true
ontap9-02            true    true          false
ontap9-03            false   false         false
ontap9-04            false   false         false
4 entries were displayed.
No longer eligible. Now for the halt:
ontap9::*> halt -node ontap9-03 -inhibit-takeover true
  (system node halt)

Warning: Are you sure you want to halt node "ontap9-03"? {y|n}: y

ontap9::*> halt -node ontap9-04 -inhibit-takeover true
  (system node halt)

Warning: Are you sure you want to halt node "ontap9-04"? {y|n}: y
Notice no quorum warnings this time.
I'm not stable enough yet: if my epsilon node bounces, I still lose the cluster. So, for the duration of this unusual maintenance activity, I'll turn cluster HA on:
ontap9::*> cluster ha modify -configured true

Warning: High Availability (HA) configuration for cluster services requires
         that both SFO storage failover and SFO auto-giveback be enabled.
         These actions will be performed if necessary.
Do you want to continue? {y|n}: y

Notice: HA is configured in management.
Now I effectively have a two node cluster again:
ontap9::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
ontap9-01            true    true          false
ontap9-02            true    true          false
ontap9-03            false   false         false
ontap9-04            false   false         false
4 entries were displayed.
So if something happens to one of my remaining nodes, I can continue to serve data:
ontap9::*> storage failover takeover -ofnode ontap9-01

Warning: A takeover will be initiated. Once the partner node reboots, a
         giveback will be automatically initiated.
Do you want to continue? {y|n}: y

ontap9::*>
And I am limping along on one node, but it is better than the alternative:
ontap9::cluster ha*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
ontap9-01            false   true          false
ontap9-02            true    true          false
ontap9-03            false   false         false
ontap9-04            false   false         false
4 entries were displayed.

ontap9::cluster ha*> cluster ring show
Node      UnitName Epoch    DB Epoch DB Trnxs Master    Online
--------- -------- -------- -------- -------- --------- ---------

Warning: Unable to list entries on node ontap9-01. RPC: Couldn't make connection

ontap9-02 mgmt     9        9        33       ontap9-02 master
ontap9-02 vldb     10       10       6        ontap9-02 master
ontap9-02 vifmgr   10       10       15       ontap9-02 master
ontap9-02 bcomd    10       10       1        ontap9-02 master
ontap9-02 crs      10       10       1        ontap9-02 master

Warning: Unable to list entries on node ontap9-03. RPC: Couldn't make connection
         Unable to list entries on node ontap9-04. RPC: Couldn't make connection
5 entries were displayed.
Then when everything is done and the nodes are up in their new home, I can undo those changes and put things back together:
First disable cluster HA:
ontap9::*> cluster ha modify -configured false

Warning: This operation will unconfigure cluster HA. Cluster HA must be
         configured on a two-node cluster to ensure data access availability
         in the event of storage failover.
Do you want to continue? {y|n}: y

Notice: HA is disabled.
Then mark one of the nodes as eligible:
ontap9::*> node modify -node ontap9-03 -eligibility true
And reboot it:
ontap9::*> reboot -node ontap9-03 -inhibit-takeover true
  (system node reboot)

Warning: This operation will cause node "ontap9-03" to be marked as unhealthy.
         Unhealthy nodes do not participate in quorum voting. If the node goes
         out of service and one more node goes out of service there will be a
         data serving failure for the entire cluster. This will cause a client
         disruption. Use "cluster show" to verify cluster state. If possible
         bring other nodes online to improve the resiliency of this cluster.
Do you want to continue? {y|n}: y

ontap9::*>
Once it comes back, repeat on the other node:
ontap9::*> node modify -node ontap9-04 -eligibility true

ontap9::*> node reboot -node ontap9-04 -inhibit-takeover true

Warning: Are you sure you want to reboot node "ontap9-04"? {y|n}: y
Once it comes back up, you're good.
ontap9::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
ontap9-01            true    true          true
ontap9-02            true    true          false
ontap9-03            true    true          false
ontap9-04            true    true          false
4 entries were displayed.
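Putting it all together, the whole procedure above boils down to the following sequence (node names as in the example; warning prompts omitted):

```
::> set advanced
::*> node modify -node ontap9-03 -eligibility false
::*> node modify -node ontap9-04 -eligibility false
::*> system node halt -node ontap9-03 -inhibit-takeover true
::*> system node halt -node ontap9-04 -inhibit-takeover true
::*> cluster ha modify -configured true

     ... move the hardware, power the nodes back up ...

::*> cluster ha modify -configured false
::*> node modify -node ontap9-03 -eligibility true
::*> system node reboot -node ontap9-03 -inhibit-takeover true

     ... wait for ontap9-03 to rejoin, then repeat for ontap9-04 ...

::*> node modify -node ontap9-04 -eligibility true
::*> system node reboot -node ontap9-04 -inhibit-takeover true
::*> cluster show
```

Verify with cluster show and storage failover show after each step before moving on.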
You've got the main things in play. If the LS mirrors of the SVM root volumes (both source and destination copies) are on the surviving nodes, they will be fine. As long as the sources for the production ones stay up, they will be fine too.
I'd also suggest putting in a pre-emptive case against the serial number of the nodes that are going down.
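For extra belt-and-braces, you could also push a fresh update of the LS mirror set before the halt, so the read-only root-volume copies on the surviving nodes are current (a sketch; the SVM and volume names are placeholders):

```
ontap9::> snapmirror update-ls-set -source-path svm_testdev:svm_testdev_root
ontap9::> snapmirror show -type LS -fields state,status
```

The second command just confirms the LS relationships are healthy and idle again before you proceed.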
I would also suggest marking both of the nodes undergoing maintenance as ineligible; otherwise, your cluster would not survive a takeover event on one of the two remaining nodes.
Cluster1::> set advanced

Cluster1::*> node modify -node Cluster1-04 -eligibility false

Warning: When a node's eligibility is set to "false," SAN and NAS access might
         be affected. This setting should be used only for unusual maintenance
         operations. To restore the node's data-serving capabilities, set the
         eligibility to "true" and reboot the node.
Continue? {y|n}:
Thank you for your replies Alex and Shawn!
I had seen the option to mark the nodes as ineligible, but didn't think it was necessary to issue the commands since I'm shutting the nodes down completely during the maintenance. I assumed that would automatically mark them ineligible since they wouldn't be up and running anymore. Is that not correct?
Fantastic post Sean - maybe one for the cookbooks?
Sean,
To say this is immensely helpful is an understatement. This is exactly what I was looking for. I did not realize I would put the cluster at risk by not marking the downed nodes ineligible and turning cluster ha on temporarily. This makes perfect sense. Thank you for taking the time to write this up!