Warning, long post.
TL;DR: Halting nodes doesn't remove their eligibility, and I forgot to mention cluster HA.
If I start with a healthy 4-node cluster:
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true true
ontap9-02 true true false
ontap9-03 true true false
ontap9-04 true true false
4 entries were displayed.
ontap9::*> storage failover show
Takeover
Node Partner Possible State Description
-------------- -------------- -------- -------------------------------------
ontap9-01 ontap9-02 true Connected to ontap9-02
ontap9-02 ontap9-01 true Connected to ontap9-01
ontap9-03 ontap9-04 true Connected to ontap9-04
ontap9-04 ontap9-03 true Connected to ontap9-03
4 entries were displayed.
Then halt the second HA pair; it will warn you when you take down the second node:
ontap9::*> halt -node ontap9-03 -inhibit-takeover true
ontap9::*> halt -node ontap9-04 -inhibit-takeover true
(system node halt)
Warning: This operation will cause node "ontap9-04" to be marked as unhealthy.
Unhealthy nodes do not participate in quorum voting. If the node goes
out of service and one more node goes out of service there will be a
data serving failure for the entire cluster. This will cause a client
disruption. Use "cluster show" to verify cluster state. If possible
bring other nodes online to improve the resiliency of this cluster.
Do you want to continue? {y|n}: y
You can see the down nodes are still eligible:
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true true
ontap9-02 true true false
ontap9-03 false true false
ontap9-04 false true false
4 entries were displayed.
And if I try to take over node 2, the command fails, warning that I am about to cause a data service failure:
ontap9::storage failover*> takeover -ofnode ontap9-02
Error: command failed: Taking node "ontap9-02" out of service might result in a data service failure and
client disruption for the entire cluster. If possible, bring an additional node online to improve
the resiliency of the cluster and to ensure continuity of service. Verify the health of the node
using the "cluster show" command, then try the command again, or provide "-ignore-quorum-warnings"
to bypass this check.
If I override it, or just bounce the node, I lose quorum and the RDBs drop offline.
ontap9::storage failover*> takeover -ofnode ontap9-02 -ignore-quorum-warnings true
Warning: A takeover will be initiated. Once the partner node reboots, a giveback will be automatically
initiated. Do you want to continue? {y|n}: y
ontap9::storage failover*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 false true true
ontap9-02 false true false
ontap9-03 false true false
ontap9-04 false true false
4 entries were displayed.
ontap9::storage failover*> cluster ring show
Node UnitName Epoch DB Epoch DB Trnxs Master Online
--------- -------- -------- -------- -------- --------- ---------
ontap9-01 mgmt 0 7 108 - offline
ontap9-01 vldb 0 8 9 - offline
ontap9-01 vifmgr 0 8 65 - offline
ontap9-01 bcomd 0 8 1 - offline
ontap9-01 crs 0 8 1 - offline
Warning: Unable to list entries on node ontap9-02. RPC: Couldn't make connection
Unable to list entries on node ontap9-03. RPC: Couldn't make connection
Unable to list entries on node ontap9-04. RPC: Couldn't make connection
5 entries were displayed.
No sane admin would override those warnings, but when you're in the DC doing maintenance and somebody bumps the wrong power cable, well, things happen.
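For what it's worth, the quorum math behind those warnings is simple enough to sketch. This is just my simplified model of the voting rule, not the actual RDB implementation: only healthy, eligible nodes vote, a simple majority of the eligible nodes keeps the cluster in quorum, and the epsilon holder breaks the tie when exactly half of them are up.

# Toy model of the quorum voting rule (my simplification, not ONTAP code):
# only healthy, eligible nodes vote, a simple majority of the eligible nodes wins,
# and the epsilon holder breaks the tie when exactly half of them are up.

def in_quorum(eligible, healthy, epsilon_holder=None):
    voters = eligible & healthy                # unhealthy or ineligible nodes don't vote
    if len(voters) * 2 > len(eligible):        # clear majority of eligible nodes
        return True
    if len(voters) * 2 == len(eligible):       # exact tie: epsilon decides
        return epsilon_holder in voters
    return False

all_four = {"ontap9-01", "ontap9-02", "ontap9-03", "ontap9-04"}

# Halting 03/04 without touching eligibility: 2 of 4 voters plus epsilon -> still in quorum
print(in_quorum(all_four, {"ontap9-01", "ontap9-02"}, "ontap9-01"))   # True

# Taking over 02 on top of that: 1 of 4 voters -> quorum lost, the RDB units go offline
print(in_quorum(all_four, {"ontap9-01"}, "ontap9-01"))                # False

# Marking 03/04 ineligible first shrinks the electorate: 2 of 2 voters -> safely in quorum
print(in_quorum({"ontap9-01", "ontap9-02"}, {"ontap9-01", "ontap9-02"}, "ontap9-01"))  # True

Which brings us to the safer way of doing this.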
Now let's see how it plays out if I remove their voting rights before I shut them down to move them to a new rack:
ontap9::*> node modify -node ontap9-03 -eligibility false
Warning: When a node's eligibility is set to "false," SAN and NAS access might be affected. This setting
should be used only for unusual maintenance operations. To restore the node's data-serving
capabilities, set the eligibility to "true" and reboot the node. Continue? {y|n}: y
ontap9::*> node modify -node ontap9-04 -eligibility false
Warning: When a node's eligibility is set to "false," SAN and NAS access might be affected. This setting
should be used only for unusual maintenance operations. To restore the node's data-serving
capabilities, set the eligibility to "true" and reboot the node. Continue? {y|n}: y
Waiting for a quorum membership change to complete.
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true true
ontap9-02 true true false
ontap9-03 false false false
ontap9-04 false false false
4 entries were displayed.
No longer eligible, and now for the halt:
ontap9::*> halt -node ontap9-03 -inhibit-takeover true
(system node halt)
Warning: Are you sure you want to halt node "ontap9-03"? {y|n}: y
ontap9::*> halt -node ontap9-04 -inhibit-takeover true
(system node halt)
Warning: Are you sure you want to halt node "ontap9-04"? {y|n}: y
Notice no quorum warnings this time.
I'm not stable enough yet: if my epsilon node bounces, I still lose the cluster. So, during this unusual maintenance activity, I'll turn on cluster HA:
ontap9::*> cluster ha modify -configured true
Warning: High Availability (HA) configuration for cluster services requires that both SFO storage failover
and SFO auto-giveback be enabled. These actions will be performed if necessary.
Do you want to continue? {y|n}: y
Notice: HA is configured in management.
Now I effectively have a two-node cluster again:
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true false
ontap9-02 true true false
ontap9-03 false false false
ontap9-04 false false false
4 entries were displayed.
So if something happens to one of my remaining nodes, I can continue to serve data:
ontap9::*> storage failover takeover -ofnode ontap9-01
Warning: A takeover will be initiated. Once the partner node reboots, a giveback
will be automatically initiated. Do you want to continue?
{y|n}: y
ontap9::*>
And I am limping along on one node, but it is better than the alternative:
ontap9::cluster ha*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 false true false
ontap9-02 true true false
ontap9-03 false false false
ontap9-04 false false false
4 entries were displayed.
ontap9::cluster ha*> cluster ring show
Node UnitName Epoch DB Epoch DB Trnxs Master Online
--------- -------- -------- -------- -------- --------- ---------
Warning: Unable to list entries on node ontap9-01. RPC: Couldn't make
connection
ontap9-02 mgmt 9 9 33 ontap9-02 master
ontap9-02 vldb 10 10 6 ontap9-02 master
ontap9-02 vifmgr 10 10 15 ontap9-02 master
ontap9-02 bcomd 10 10 1 ontap9-02 master
ontap9-02 crs 10 10 1 ontap9-02 master
Warning: Unable to list entries on node ontap9-03. RPC: Couldn't make
connection
Unable to list entries on node ontap9-04. RPC: Couldn't make
connection
5 entries were displayed.
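That also fits the toy model above if you treat cluster HA as a substitute tie-breaker: with only two eligible voters and epsilon cleared, a bare majority can never survive the loss of either node, so it's the cluster HA configuration that lets the lone survivor keep the RDB units online. Roughly (the cluster_ha flag here is just my own knob for the model, not an ONTAP setting):

# Extension of the toy model for the two-eligible-node case: epsilon is cleared,
# so a plain majority vote (2 of 2) can't survive losing a node. Cluster HA is what
# lets the surviving node carry on; modeled here as a hypothetical tie-breaker flag.

def in_quorum_two_node(eligible, healthy, cluster_ha=False):
    voters = eligible & healthy
    if len(voters) * 2 > len(eligible):       # both nodes up
        return True
    return cluster_ha and len(voters) == 1    # lone survivor, saved only by cluster HA

pair = {"ontap9-01", "ontap9-02"}
print(in_quorum_two_node(pair, {"ontap9-02"}, cluster_ha=False))  # False: quorum would be lost
print(in_quorum_two_node(pair, {"ontap9-02"}, cluster_ha=True))   # True: matches the ring output above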
Then when everything is done and the nodes are up in their new home, I can undo those changes and put things back together:
First disable cluster HA:
ontap9::*> cluster ha modify -configured false
Warning: This operation will unconfigure cluster HA. Cluster HA must be configured on a two-node cluster to
ensure data access availability in the event of storage failover.
Do you want to continue? {y|n}: y
Notice: HA is disabled.
Then mark one of the nodes as eligible:
ontap9::*> node modify -node ontap9-03 -eligibility true
And reboot it:
ontap9::*> reboot -node ontap9-03 -inhibit-takeover true
(system node reboot)
Warning: This operation will cause node "ontap9-03" to be marked as unhealthy. Unhealthy nodes do not
participate in quorum voting. If the node goes out of service and one more node goes out of service
there will be a data serving failure for the entire cluster. This will cause a client disruption.
Use "cluster show" to verify cluster state. If possible bring other nodes online to improve the
resiliency of this cluster.
Do you want to continue? {y|n}: y
ontap9::*>
Once it comes back, repeat on the other node:
ontap9::*> node modify -node ontap9-04 -eligibility true
ontap9::*> node reboot -node ontap9-04 -inhibit-takeover true
Warning: Are you sure you want to reboot node "ontap9-04"? {y|n}: y
Once it comes back up, you're good.
ontap9::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
ontap9-01 true true true
ontap9-02 true true false
ontap9-03 true true false
ontap9-04 true true false
4 entries were displayed.
If this post resolved your issue, help others by selecting ACCEPT AS SOLUTION or adding a KUDO.