Hi IMPI,
First - nothing you have seen concerns me; everything is working as designed. That said, let me explain the details (as far as I can).
Q1: Is 10 seconds a normal failover time in this case?
I'm not a network specialist, so unfortunately I can't help you too much here. It might be related to the spanning-tree configuration, but as said, I'm not an Ethernet pro nor do I play one on TV.
My assumption is that it takes some time for the network to notice that the LIF is now reachable via a different physical port: the switch has to update its MAC address table, and the clients have to refresh their ARP entries for the LIF's IP address.
When you send the LIF back to its home port after reconnecting the cable, that is a negotiated activity, which may be why the move back is barely noticed.
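If you want to verify this the next time you test, the cluster shell can show you where the LIF currently lives and whether the port came back up - a quick sketch, where svm1, lif1, node_A_1 and e0c are just placeholders for your own names:

Check on which node/port the LIF is currently hosted and whether it's home:
    ::> network interface show -vserver svm1 -lif lif1 -fields curr-node,curr-port,is-home
Check the link state of the port you unplugged:
    ::> network port show -node node_A_1 -port e0c -fields link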
Q2: Is 40 seconds a normal failover time in this case?
I'd say that about 10 seconds of this is related to the problem above: again, the LIF suddenly moves to a different physical port on the network, and this information needs to be distributed accordingly. That leaves us with 30 seconds, which is well within the specs. The maximum switchover time is designed to be <120 seconds for systems configured with the maximum number of objects (LIFs, volumes, SVMs, etc.). The minimum switchover time I've observed in my lab was 15 seconds; the majority were around 25-40 seconds.
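If you want to measure this on your own system, ONTAP keeps track of the switchover operations - a small sketch (standard commands, the exact output columns depend on your release):

Show the state and start/end time of the most recent MetroCluster operation:
    ::> metrocluster operation show
Show all past operations, including earlier switchovers and switchbacks:
    ::> metrocluster operation history show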
Q3: Is it normal that only the controller fails over, not the associated shelf?
Yes. When you issue a manual switchover (the so-called negotiated switchover, or NSO), all other hardware is still functioning, so the system does not see a requirement to break the mirror. As long as the disk shelf stays available, it will continue to synchronously write data to both sites. Since NSO is meant for planned maintenance activities, it's assumed that you take the aggregate plexes offline manually if you want to power down the shelf.
This is to keep the data protected as long as technically possible.
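If you do want to power the shelf down as part of such maintenance, taking the affected plex offline first would look roughly like this - a sketch only, aggr1_node_A and plex0 are placeholders, so check the real names first:

List the plexes of the aggregate and their state:
    ::> storage aggregate plex show -aggregate aggr1_node_A
Take the plex that lives on that shelf offline before powering it off:
    ::> storage aggregate plex offline -aggregate aggr1_node_A -plex plex0
Bring it back online afterwards so the mirror can resync:
    ::> storage aggregate plex online -aggregate aggr1_node_A -plex plex0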
Q4: Why does powering off the shelf freeze both VMs?
When you power off the shelf, the disks suddenly disappear and cannot accept any more IO. ONTAP has to wait for the SCSI timeout of the disks for any outstanding IO, which is set to 30 seconds. After this time has expired, the missing disks are removed from the disk inventory and writes continue to the remaining plex only. The aggregate is now in a mirror-degraded state.
In your case the shelf seems to hold disks that belong to both nodes' aggregates, which is why the IO of both VMs stalls: both aggregates are impacted.
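You can watch this from the CLI as well - a sketch with a placeholder aggregate name; once a plex is gone, the RAID status of the aggregate should show something like "mirror degraded":

    ::> storage aggregate show -aggregate aggr1_node_A -fields state,raidstatus
    ::> storage aggregate plex show -aggregate aggr1_node_A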
I'm missing Q5 😉
Q6: Why did the cluster not "see" that one site was down?
This is also working as designed, because not being able to "see" the other site is the actual issue here. As with any clustered solution, the MetroCluster on its own cannot distinguish between a real site disaster and a loss of connectivity between the sites if the surviving site loses communication to the controller and the disks *at the same time*. It requires a third, independent instance to assist with this decision. You might be familiar with the quorum principle of server clusters.
To prevent a split brain, the system does not perform a switchover. It will continue to serve its own data, though, as that is not impacted (other than the mirror being lost).
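You can check what the surviving site thinks about the configuration at any time - a sketch, the exact output varies by release:

Show the overall MetroCluster state and mode (normal vs. switchover):
    ::> metrocluster show
Show the per-node configuration and DR mirroring state:
    ::> metrocluster node show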
By design, a third instance has to declare the disaster and perform the switchover. This could be:
- the administrator, by issuing "metrocluster switchover -forced-on-disaster true" - that would be your task (see the command sketch after this list)
- deploying the MetroCluster Tiebreaker software at a third site with the option "observer-mode" set to false - then that system will initiate the switchover on your behalf
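For completeness, the manual path roughly looks like this - a sketch only, run from the surviving cluster, with the heal and switchback steps only once the failed site has been repaired:

Declare the disaster and switch over to the surviving site:
    ::> metrocluster switchover -forced-on-disaster true
After the other site is back, heal the data and root aggregates:
    ::> metrocluster heal -phase aggregates
    ::> metrocluster heal -phase root-aggregates
Finally, give operations back to the original site:
    ::> metrocluster switchback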
The switchover would have happened automatically if the controller and the shelf had died one after the other, with more than 30 seconds in between (remember the disk timeout from above?). That's because the controllers communicate directly over FC-VI and indirectly over a special set of disks (the so-called mailbox disks). In that case the surviving system can make the well-founded decision that the other site is in fact down.
Hope that helps
Kind regards, Niels
---------
If this solution resolved your issue, please mark it as resolution and give kudos