NetApp ONTAP 9.10.1 adds many new features that expand the performance and supported scale of the VMware vSphere Metro Storage Cluster (vMSC) solution on SnapMirror Business Continuity (SM-BC). I thought now would be a fantastic time to revisit the solution and introduce you to the ONTAP 9.10.1 enhancements.
What’s new with SM-BC and ONTAP 9.10.1?
With the ONTAP 9.10.1 release, the number of consistency groups has increased from 5 to 20, each guaranteeing dependent write-order consistency for up to 16 volumes (an increase from 12 volumes previously). The number of SnapMirror Synchronous (SM-S) relationships available to SM-BC has also steadily increased, up to 60 in ONTAP 9.9.1 and rising again to 200 in ONTAP 9.10.1.
But there are other related improvements since the original 9.8 release that are also relevant to this discussion.
As you know, in ONTAP 9.8 we also introduced support for the maximum-size VMFS6 datastore (64TB; ONTAP supports LUNs up to 128TB) with ONTAP All SAN Array (ASA) platforms, and since ONTAP 9.9.1, we've improved single-LUN I/O performance dramatically, by nearly 400% under some workloads compared to single-LUN performance in ONTAP 9.8. So now you can safely deploy massive, highly performant datastores to service your largest VMs and protect them with SM-BC by using vMSC.
Implementing vMSC with SM-BC
Let’s look at an example of vMSC using SM-BC deployment.
A quick note: all the steps are taken from the following documents:
And here are some additional references:
Some high-level notes:
- An RTT of 10ms or less between replicas is required by both NetApp and VMware.
- VMware states that storage I/O control (SIOC) metrics must be disabled, which means the reports available in NetApp ONTAP tools for VMware vSphere cannot display latency stats that depend on SIOC.
- Update 2022/12/22: I wanted to add that ATS-only locking is required. Fortunately, this is the default setting for currently supported ESXi versions. However, in a heterogeneous environment, you may have enabled SCSI-2 reserve/release-based locking for another array.
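You can check the locking mode of your VMFS volumes, and change it if needed, from the ESXi shell. A minimal sketch, where the volume label is a placeholder:

```shell
# Show the current locking mode of each mounted VMFS volume
esxcli storage vmfs lockmode list

# Switch a volume back to ATS-only locking ("datastore1" is a placeholder label)
esxcli storage vmfs lockmode set --ats --volume-label=datastore1
```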
The configuration proceeds in the following order.
Prepare your ESXi cluster
To prepare your ESXi cluster, complete the following steps:
- Set the HA admission control with the cluster resource reserve; leave the default 50% CPU/MEM.
- Add two isolation IP addresses that respond to ping, one per site. Do not use the gateway IP. The vSphere HA advanced setting used is das.isolationaddress. You can use ONTAP or Mediator IP addresses for this purpose.
Refer to: https://core.vmware.com/resource/vmware-vsphere-metro-storage-cluster-recommended-practices#sec2-sub5
- Adding an advanced setting called das.heartbeatDsPerHost can increase the number of heartbeat datastores. Use four heartbeat datastores (HB DSs), two per site, with the “Select from List but Complement” option. This is needed because if one site fails, you still need two HB DSs. However, those don’t have to be protected with SM-BC.
Refer to: https://core.vmware.com/resource/vmware-vsphere-metro-storage-cluster-recommended-practices#sec2-sub5
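Taken together, the vSphere HA advanced options for the cluster might look like the following sketch. The IP addresses are placeholders for pingable addresses at each site, and das.usedefaultisolationaddress is commonly set to false when custom isolation addresses are used instead of the gateway:

```
das.usedefaultisolationaddress = false   # don't use the default gateway for isolation checks
das.isolationaddress0 = 10.0.1.50        # pingable address in site A (placeholder)
das.isolationaddress1 = 10.0.2.50        # pingable address in site B (placeholder)
das.heartbeatDsPerHost = 4               # two heartbeat datastores per site
```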
- Set VM Component Protection (VMCP) to Power off and restart VMs for both permanent device loss (PDL) and all paths down (APD). For APD, select the conservative restart policy.
- Leave the response for APD recovery disabled.
- Make sure Disk.AutoremoveOnPDL is set to 1 in the ESXi host Advanced System Settings.
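This setting can be verified (and corrected) per host from the ESXi shell, for example:

```shell
# Check the current value of the setting
esxcli system settings advanced list -o /Disk/AutoremoveOnPDL

# Set it to 1 if needed
esxcli system settings advanced set -o /Disk/AutoremoveOnPDL -i 1
```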
Configure NetApp software
To support the NetApp components, deploy and configure the following software:
- Deploy a CentOS 7.6–7.9 or RHEL 7.6–7.9/8.0–8.4 based virtual machine (VM) to serve as the ONTAP Mediator host.
- Install the ONTAP Mediator and certificates (certs are optional) on the VM.
- Peer your ONTAP clusters and storage virtual machines (SVMs). We’ll be using the standard workflows here.
- Deploy and configure ONTAP tools. Again, we’ll be using the standard workflow here, so add both clusters.
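If you prefer the CLI over System Manager for the peering step, the equivalent looks roughly like the following sketch. The cluster names, SVM names, and intercluster LIF addresses are all placeholders:

```
clusterA::> cluster peer create -address-family ipv4 -peer-addrs 10.0.2.10,10.0.2.11

(accept the peering on clusterB with clusterA's intercluster LIF addresses, then peer the SVMs)

clusterA::> vserver peer create -vserver svm_siteA -peer-vserver svm_siteB -applications snapmirror -peer-cluster clusterB
```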
Provision and protect a SAN datastore
To provision and protect a SAN datastore, complete the following steps:
- Using ONTAP tools, provision a SAN datastore on site A. This will ensure that all hosts in site A have the correct initiator mappings and that the LUN is created with the correct options for optimal use as a vSphere datastore.
- Use ONTAP System Manager to enable SM-BC. Navigate to Protection, expand the menu, and select Relationships. Once there, click Protect and select LUNs. Then create a new consistency group or select an existing one, and uncheck the enforcement option so that it does not conflict with any quality-of-service (QoS) policies set by ONTAP tools based on VM storage policies.
The following two screenshots illustrate the process.
- After SnapMirror is in sync, map the replica LUN on the destination, making sure to use the same LUN ID as on site A, and then rescan storage on the ESXi hosts.
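For reference, the System Manager steps above map to ONTAP CLI commands along these lines. The SVM, volume, consistency-group, and igroup names, as well as the LUN ID, are placeholders:

```
siteB::> snapmirror create -source-path svm_siteA:/cg/cg_vmfs01 -destination-path svm_siteB:/cg/cg_vmfs01_dst -cg-item-mappings vol_vmfs01:@vol_vmfs01_dst -policy AutomatedFailOver

siteB::> snapmirror initialize -destination-path svm_siteB:/cg/cg_vmfs01_dst

siteB::> lun mapping create -vserver svm_siteB -path /vol/vol_vmfs01_dst/lun_vmfs01 -igroup esxi_siteB -lun-id 12
```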
- Which hosts you map the LUNs to depends on your host access topology. For a brief description of the differences between the uniform and non-uniform host access deployment models, see the KB. There are pros and cons to both deployment models, but those are beyond the scope of this post.
Now the datastore should be visible to all mapped hosts.
Notice how you now have double the number of paths, but the active (I/O) paths are still those from the node that owns the aggregate.
Before:
After:
Follow-up work to be done before going into production
The following tasks should be performed on both site A and site B. These tasks are in addition to the generally recommended best practice of using ONTAP tools to tune your ESXi host for recommended settings.
- Consider application dependencies and set the VM restart order appropriately. For example, a Microsoft Windows Active Directory domain controller should be started before a Microsoft SQL Server.
For example:
- Also, you’ll want to create host and VM groups to set site affinities.
- Set the rule type to Should rather than Must.
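If you script your vCenter configuration, the same groups and a should-rule can be sketched with the govc CLI. The cluster, host, VM, group, and rule names below are all placeholders:

```shell
# Host and VM groups for site A
govc cluster.group.create -cluster MetroCluster -name hosts-siteA -host esx-a1 esx-a2
govc cluster.group.create -cluster MetroCluster -name vms-siteA -vm vm01 vm02

# VM-to-host rule; omitting -mandatory keeps it a Should rule rather than a Must rule
govc cluster.rule.create -cluster MetroCluster -name vms-prefer-siteA -vm-host -vm-group vms-siteA -host-affine-group hosts-siteA
```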
- Datastores created by ONTAP tools do not have SIOC enabled by default; however, any datastores created or edited manually might. In those instances, remember to turn it off.
Right-click the datastore and select Configure Storage I/O Control.
Make sure it is disabled.
- Note that disabling SIOC affects the ability of ONTAP tools for VMware vSphere to collect storage statistics.
That about wraps it up. You’re now ready to begin failover testing!