I'm currently doing a design for a customer utilising two NetApp MetroCluster pairs running Data ONTAP 7-Mode, and VMware vSphere 5.5 Enterprise Plus. One of the resources I've been referencing is NetApp's TR-4128, vSphere 5 on NetApp MetroCluster Solution. The document is excellent, especially around providing test plans and expected failure scenarios, but there are a couple of recommendations that seem to go against my understanding of best practice.
Admission Control
On page 19,
For the Admission Control option, select “Disable” since the solution's main goal is maximum availability rather than performance in the case of host failure.
This seems like a non sequitur. A large part of providing availability for an application is guaranteeing minimum performance levels, typically dictated by an SLA. In a stretched metro storage cluster environment, with half of your storage and half of your hosts on each site, the solution must provide minimum performance even after losing 50% of both your compute and storage resources.
In the context of a vSphere HA cluster, enforcing minimum performance levels for your applications is done at the VM level through resource reservations. Guess what will happen in an HA event, where you're reduced to 50% capacity, but your VMs have reservations configured that cannot be satisfied by the remaining hosts? The HA placement request will fail at the surviving site and HA will fail to restart them, completely defeating the point of your expensive MetroCluster solution.
Admission Control is designed to prevent exactly this sort of thing from happening. In normal operation, even running at 100% capacity, it will prevent you powering on more VMs than can be supported in the event of a failure. The configuration should be:
Admission Control: Enable: Disallow VM power on operations that violate availability constraints.
Admission Control Policy: Percentage of cluster resources reserved as failover capacity: 50%
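To make the failure mode concrete, here is a minimal sketch (in Python, with hypothetical host and VM figures) of what the percentage-based admission control policy is doing. It is a simplified model: real HA also accounts for memory overhead, VMs without reservations, and per-host granularity.

```python
# Simplified model of the "percentage of cluster resources reserved
# as failover capacity" admission control policy. All figures are
# hypothetical examples, not output from the VMware API.

def admits_power_on(total_mhz, total_mb, reserved_pct,
                    existing_cpu_res, existing_mem_res,
                    new_cpu_res, new_mem_res):
    """Return True if the new VM's reservations still fit within the
    capacity left after setting aside the failover headroom."""
    usable_mhz = total_mhz * (1 - reserved_pct / 100)
    usable_mb = total_mb * (1 - reserved_pct / 100)
    return (existing_cpu_res + new_cpu_res <= usable_mhz and
            existing_mem_res + new_mem_res <= usable_mb)

# Four hosts split across two sites: 4 x 20 GHz CPU, 4 x 128 GB RAM,
# with 50% reserved so that one whole site can fail.
total_mhz, total_mb = 4 * 20_000, 4 * 131_072

# Plenty of headroom left: power-on is admitted.
print(admits_power_on(total_mhz, total_mb, 50,
                      existing_cpu_res=35_000, existing_mem_res=200_000,
                      new_cpu_res=4_000, new_mem_res=16_384))   # True

# This VM would push reservations past the surviving site's capacity,
# so admission control blocks it *before* a failure can strand it.
print(admits_power_on(total_mhz, total_mb, 50,
                      existing_cpu_res=38_000, existing_mem_res=250_000,
                      new_cpu_res=4_000, new_mem_res=16_384))   # False
```

With admission control disabled, the second VM powers on happily in normal operation, and its restart placement is exactly what fails after a site loss.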
This is also the recommendation from the VMware publication, VMware vSphere Metro Storage Cluster Case Study.
On page 10,
Further, because such hosts are equally divided across the two sites, and to ensure that all workloads can be restarted by vSphere HA, configuring the admission control policy to 50 percent is advised
In the case where you're not using reservations, you should configure reasonable ball-park figures for das.vmCpuMinMhz and das.vmMemoryMinMB to ensure that admission control stops you deploying more VMs than you can service during a failover scenario.
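As a back-of-the-envelope check (hypothetical figures again), you can estimate how many reservation-less VMs the 50% policy will admit given ball-park per-VM minimums, which is essentially what das.vmCpuMinMhz and das.vmMemoryMinMB feed into:

```python
# Rough estimate of how many reservation-less VMs fit before the
# percentage-based admission control policy blocks further power-ons,
# using ball-park per-VM minimums. Hypothetical numbers throughout.

def max_vms(total_mhz, total_mb, reserved_pct, vm_min_mhz, vm_min_mb):
    usable_mhz = total_mhz * (1 - reserved_pct / 100)
    usable_mb = total_mb * (1 - reserved_pct / 100)
    # The tighter of the CPU and memory limits wins.
    return int(min(usable_mhz // vm_min_mhz, usable_mb // vm_min_mb))

# 4 hosts x 20 GHz / 128 GB, 50% reserved, 500 MHz / 2 GB per VM.
print(max_vms(80_000, 524_288, 50, vm_min_mhz=500, vm_min_mb=2_048))  # 80
```

If your real per-VM minimums are lower than what the surviving site can actually deliver under load, admission control will admit more VMs than your SLA can tolerate, so err on the generous side.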
Host Isolation Response
On page 20,
In iSCSI/NFS environments in which the management network correlates with the IP storage network, it is impossible for hosts to decide whether it is fully isolated. In these environments, it is better to change the setting to “Shutdown,” which will gracefully shut down the VMs whenever there is an isolation response. This avoids split-brain scenarios too.
Using the IP addresses of the array as isolation addresses means that, when the host triggers its HA response, it knows that it cannot reach its datastores. In this case, the VMs cannot write to their disks and thus cannot gracefully shut down or flush their dirty write buffers. Using the "Shutdown" isolation response will only delay the shutdown; the appropriate response in an IP storage environment is "Power Off".
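The reasoning above can be sketched as a simple decision function (illustrative Python, not a VMware API; the inputs are hypothetical flags standing in for the host's isolation checks):

```python
# Sketch of the isolation-response reasoning: when the isolation
# addresses are the array's IP storage interfaces, failing to reach
# them implies the datastores are gone too, so a graceful guest
# shutdown (which needs disk writes) cannot actually complete.

def isolation_response(can_reach_isolation_addrs: bool,
                       addrs_are_storage_ips: bool) -> str:
    if can_reach_isolation_addrs:
        return "not isolated"          # no HA response triggered
    if addrs_are_storage_ips:
        # No datastore access: guests cannot flush dirty buffers or
        # shut down cleanly, so "Shutdown" only delays the outcome.
        return "Power Off"
    return "Shutdown"

print(isolation_response(False, True))   # Power Off
print(isolation_response(False, False))  # Shutdown
```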
I would love to get some input from others who have designed similar solutions, or clarification on these vSphere HA configurations, especially as they relate to MetroCluster.
Thanks in advance!