ONTAP Discussions

Stretch MetroCluster failover test - some weird things

IMPI

Hi,

I set up my stretch MetroCluster and I'm now testing failover, and I have problems with some of the tests.

 

Please let me first explain my config

 

Two sites (50-60 m apart)

At each site: one FAS8020 (single controller) and one DS2246 shelf with 24 x 900 GB SAS disks.
Everything is configured as a stretch MetroCluster.


The config was done at the factory (and reviewed by a NetApp engineer). When I received it, I cabled it up, ran "metrocluster check run" and then "metrocluster check show", and everything was OK.

I only configured the networking and the NFS exports.

 

On each side of the cluster (site A and site B) I configured a failover group containing e0e and e0f.

I configured two NFS exports per cluster (a rough sketch of the LIF setup is below):
- one with half of the free space, using e0e as home port and the failover group (named DS-FAS8020-A-01; for cluster B, replace A with B!)
- one with the remaining free space, using e0f as home port and the failover group (named DS-FAS8020-A-02)

I'm only using NFS with ESX
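
For reference, the LIF setup looked roughly like this (a sketch from memory, with placeholder names and addresses, so the exact syntax may differ a little from what I actually typed):

network interface failover-groups create -vserver SVM-A -failover-group fg-site-A -targets FAS8020-A:e0e,FAS8020-A:e0f
network interface create -vserver SVM-A -lif lif-nfs-A-01 -role data -data-protocol nfs -home-node FAS8020-A -home-port e0e -address 192.168.1.11 -netmask 255.255.255.0 -failover-group fg-site-A
network interface create -vserver SVM-A -lif lif-nfs-A-02 -role data -data-protocol nfs -home-node FAS8020-A -home-port e0f -address 192.168.1.12 -netmask 255.255.255.0 -failover-group fg-site-A

Same thing on site B with the names and addresses changed.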

 

I installed two VMs, one on DS-FAS8020-A-01 (let's call it VM-A), the other on DS-FAS8020-B-01 (VM-B).
A small script keeps writing to the disk, so I can see when the disk is frozen.

Here are my tests and the results; some of them are quite surprising:

 

-1- I unplugged e0e: the VM-A disk froze for about 10 seconds, then everything was OK. Then I plugged it back in, did a "send home" on the LIF, and there was no problem.

 

Q1: Is 10 seconds a normal failover time in this case?


-2- I did a manual switchover to cluster B: I connected to cluster B, ran "metrocluster switchover", then "metrocluster heal -phase aggregates" and "metrocluster heal -phase root-aggregates".


Cluster A went to the "LOADER-A" prompt: OK
The VM-A disk froze for about 40 seconds!
Then we powered off the cluster A FAS8020 (the one sitting at the LOADER prompt): all OK
Then I wondered, "what about the disk shelf, did the switchover apply to the shelf?". To test it, I powered off the cluster A DS2246
-> once again a disk freeze of about 40 seconds, but for both VMs!!!

 

Q2: Is 40 seconds a normal failover time in this case?
Q3: Is it normal that only the controller fails over, not the associated shelf?
Q4: Why does powering off the shelf freeze both VMs?

 

Then I powered the FAS and DS back on and did a switchback (with -override-vetoes, because a sync was in progress); there was a short disk freeze on VM-A (about 15 seconds).


-3- Real failure, as if we had an actual problem: power off everything at site B (FAS, DS, HP switch).


-> The VM-A disk froze for about 40 seconds, then everything was OK
-> The VM-B disk froze and never recovered... after some minutes I powered shelf B back on -> the VM-B disk came back (after about 400 seconds, because of the minutes I waited)
-> Controller B did not want to boot; I had to run both "metrocluster heal" commands on controller A, then boot_ontap on controller B. Then "metrocluster switchback"; the VM-B disk froze for about 15 seconds, then OK

 

Q6: Why did the cluster not "see" that one site was down?


I'm still in the test phase, so I can run tests, reboot, and so on.


Thanks for your help and advice

6 REPLIES

niels

Hi IMPI,

 

First of all - nothing you have seen concerns me; everything is working as designed. That said, let me explain the details (as far as I can).

 

Q1: Is 10 seconds a normal failover time in this case?

 

I'm not a network specialist so unfortunately I can't help you here too much. It might be related to spanning tree configurations but as said, I'm not an ethernet pro nor do I play one on TV.

My assumption is that it takes some time for the switch to recognize that the MAC address of the LIF has moved to a different port. The switch has to update its MAC address table so that traffic for the LIF is forwarded to the new port, and the clients may need to refresh their ARP caches.

When you send the LIF home after having reconnected the cable, that's a negotiated activity, which may be why the interruption isn't noticed.
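
For reference, the "send home" you did is just the LIF revert, something like:

network interface revert -vserver <svm_name> -lif <lif_name>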

 

Q2: Is 40 seconds a normal failover time in this case?

 

I'd say that about 10 seconds of this is related to the problem above. Again, the LIF suddenly moves to a different physical port on the network, and this information needs to be propagated accordingly. That leaves us with 30 seconds, which is well within the specs. The maximum switchover time is designed to be <120 seconds for systems configured with the maximum number of objects (LIFs, volumes, SVMs, etc.). The minimum switchover time I've observed in my lab was 15 seconds; the majority were around 25-40 seconds.
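
If you want to see exactly how long your switchover took, the operation history records the start and end times - something like:

metrocluster operation show
metrocluster operation history show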

 

Q3: Is it normal that only the controller fails over, not the associated shelf?

 

Yes. When you issue a manual switchover (the so-called negotiated switchover, or NSO), all the other hardware is still functioning, so the system does not see a requirement to break the mirror. As long as the disk shelf stays available, it will continue to synchronously write data to both sites. As the NSO is meant for planned maintenance activities, it's assumed that you take the aggregate plexes offline manually if you want to power down the shelf.

This is to keep the data protected as long as technically possible.
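
If you do want to power down a shelf during such a maintenance window, a rough sketch would be to take the affected plexes offline first (placeholder names, and please double-check the exact syntax for your release):

storage aggregate plex show -aggregate <aggr_name>
storage aggregate plex offline -aggregate <aggr_name> -plex <plex_name>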

 

Q4: Why does powering off the shelf freeze both VMs?

 

When you power off the shelf the disks suddenly disappear and cannot accept any more IO. ONTAP has to wait for the SCSI timeout of the disks for any outstanding IO. This is set to 30 seconds. After this time has expired the missing disks are removed from the disk inventory and writes will continue to the remaining plex only. The aggregate is now in mirror-degraded state.

In your case the shelf seems to hold disks that belong to both nodes' aggregates, which is why the IO of both VMs stalls: both aggregates are impacted.
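
You can watch this on the plex level, something like:

storage aggregate plex show

The plex on the powered-off shelf should show up as failed/offline while the surviving plex keeps serving IO, and the aggregate goes to mirror-degraded as described above.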

 

I'm missing Q5 😉

 

Q6: Why did the cluster not "see" that one site was down?

 

This is also working as designed, because not being able to "see" the other site is the actual issue here. As with any clustered solution, the MetroCluster on its own cannot distinguish between a real site disaster and a loss of connectivity between the sites if the surviving site loses communication with the controller and the disks *at the same time*. It requires a third, independent instance to assist with this decision. You might be familiar with the quorum principle of server clusters.

In order to prevent split brain, the system does not perform a switchover. It will continue to serve its own data though, as that is not impacted (other than the mirror being lost).

By design, a third instance has to declare the disaster and perform the switchover. This could be:

- the administrator, by issuing "metrocluster switchover -forced-on-disaster true" - that would be your task (see the sketch after this list)

- the MetroCluster Tiebreaker software, implemented at a third site with the option "observer-mode" set to false - then that system will initiate the switchover on your behalf
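
A rough sketch of the manual path (mostly the commands you already used; the exact options may differ per release):

metrocluster switchover -forced-on-disaster true    <- on the surviving site, to declare the disaster
metrocluster heal -phase aggregates                 <- once the failed site's storage is reachable again
metrocluster heal -phase root-aggregates
boot_ontap                                          <- at the LOADER prompt of the failed controller
metrocluster switchback                             <- when both sites are healthy again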

 

The switchover would have happened automatically if the controller and the shelf had died one after the other, more than 30 seconds apart (remember the disk timeout from above?). That's because the controllers communicate directly over FC-VI and indirectly over a special set of disks (the so-called mailbox disks). In that case the surviving system can make a sound decision that the other site is in fact down.

 

Hope that helps

 

Kind regards, Niels

 

---------

If this solution resolved your issue, please mark it as resolution and give kudos

 

 

 

 

 

 

IMPI

Niels,

 

Really, thanks for this detailed and thorough reply. I really appreciate it!

I don't know where my Q5 went... 🙂 The Internet stole it!

 

Q1, Q2 and Q4 are all clear to me now.

 

Q3: I understand that next time I want to do an NSO, I'll first take the plexes offline and then issue a "switchover" command. I'll try it.

 

Q6: I've heard about split brain/quorum/tiebreaker, and you are certainly right. I know that one site went down, but the other site cannot know whether that site is down or only the inter-site communication is lost. You're right, this is obvious.

 

I'll have to implement the Tiebreaker; is there any (simple) documentation about that?

Do I have to add much other hardware?

 

Thanks !

niels

Hi IMPI,

 

regarding Q3 - you can only take the plexes offline *after* you have performed the negotiated switchover. The NSO will only work if the MetroCluster is in perfect condition, which means the mirrors have to be OK as well.
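
So the rough order would be (placeholder names again, check the syntax for your release):

metrocluster switchover                              <- negotiated switchover, run on the site that takes over
metrocluster operation show                          <- wait until the operation shows as successful
storage aggregate plex offline -aggregate <aggr_name> -plex <plex_name>   <- for the plexes on the shelf you want to power down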

 

Whether you require the Tiebreaker or not depends on which failure scenarios you want to protect against. As said, it's by design that the MetroCluster will not perform a switchover on its own. We see that 99% of customers want manual intervention, as a disaster usually implies that more than just the storage is gone.

 

If it is a simple power loss you want to protect against, you can use a UPS to get the MetroCluster to perform the switchover on its own. Remember, the switchover will not work if the controller and the disks at one site disappear at the same time. So if you attach the disk shelves to a UPS that is capable of powering them for about 40 seconds, your switchover will work. The sequence would be:

- power-loss

- controller immediately dies; shelves still powered by UPS

- surviving controller can determine other controller is dead as it still has access to the mailbox disks

- surviving site will initiate switchover

- the UPS runs out of power

- aggregates will go into degraded state

 

This simple trick has the advantage that you don't need to configure anything, don't need a server/VM at a third site and don't need to rely on the network infrastructure between the three sites.

 

If you still think you need a Tie Breaker you can find the software and documentation here:

http://mysupport.netapp.com/NOW/download/software/metrocluster_tiebreaker/1.1/

 

Also have a look into the MetroCluster documentation - it has a paragraph about Tie Breaker:

https://library.netapp.com/ecm/ecm_get_file/ECMLP2427457

 

By default the Tiebreaker runs in "observer mode" only. It will only notify you via SNMP if it detects a disaster and will not act on it. In order to have the Tiebreaker initiate the switchover on your behalf, you need to contact your NetApp account team and have them file a PVR.

 

I also recommend reading the MetroCluster Technical Report (TR-4375) to get a better understanding of the architecture.

http://www.netapp.com/us/media/tr-4375.pdf

 

regards, Niels

 

--------

If this solution resolved your issue, please mark it as resolution and give kudos

 

 

 

IMPI

Niels,

 

once again this reply is perfect.

 

Thanks a lot, I'll go ahead, read the links and test.

Many thanks for this great help

 

Regards

IMPI

Hi Niels,

 

I hope you're still reading this thread.

 

Just one more piece of advice please: about the UPS, somebody else gave me the same advice but told me to put it on the controller. Then when we lose power, the controller will see that there is no network and no shelf, and it will initiate a switchover.

The advantage given there, compared to the solution where the UPS is on the shelf, is that we would only need a small UPS. The shelf demands more power.

 

Do you agree with that?

 

thanks

niels
Unfortunately that would only work in the legacy 7-Mode MetroCluster, if the interfaces are configured that way. With clustered Data ONTAP there is no way for the controller to trigger a switchover on a network failure, so you would need the controller to die first (and leave the shelves powered) in order to get a clean, automated and quick switchover.

Regards, Niels