Tech ONTAP Blogs
This is a post about how I unexpectedly needed to use one ONTAP feature in order to test a completely different ONTAP feature. If you haven't heard of it, SVM Migrate is a feature that allows you to migrate a running storage environment from one cluster to a completely different cluster, nondisruptively.
The feature I wanted to test was SnapMirror active sync (SM-AS) running in symmetric active-active mode. We enhanced SM-AS last year to offer symmetric active-active replication. Here’s a basic diagram of what I was working with:
It’s a couple of A700 clusters with SM-AS enabled. I set up my Oracle RAC configuration, including databases and quorum drives, on the jfs_as1 and jfs_as2 SVMs. Oracle RAC is not yet supported with SM-AS in active/active mode, but I couldn’t think of a reason it shouldn’t work, and I wanted to give this a spin. The idea here is creating a single cross-site, ultra-available Oracle RAC cluster. I'll post more on that later.
When you first set up an ONTAP system, it’s a little like VMware ESX. You’ll have an operational cluster, but it doesn’t do anything yet. You need to define a Storage Virtual Machine (SVM). It’s basically a self-contained storage personality. As with VMware, it’s about multitenancy, security, and manageability. You might only have the one SVM on your cluster, but if you want to have different SVMs serving different types of data or managed by different teams, you can do that too. For example, maybe you have a production SVM that is treated extra-carefully, but then you have a development SVM where you give your developers more control over their storage environment.
This isn’t the point of this post, but SnapMirror active sync (SM-AS) is a zero-RPO replication solution. When operated in active-active mode, what you have is the same data and the same LUNs available on two different systems. All reads are serviced locally. Writes obviously must be replicated to the partner cluster to maintain consistency. The result is symmetric active-active access to the same dataset.
I know how it works internally, so I was sure that simply configuring replication would result in a perfectly usable solution. The question I had was about failover. When you configure SM-AS, you also have a mediator service that manages tiebreaking and failover.
My first test was to validate what happens when Cluster2 fails. What SHOULD happen is that replication fails and the mediator signals to Cluster1 that it can resume operations unmirrored. After all, the point here is ultra high availability.
Here’s the issue – my Oracle RAC hosts are all running under VMware using VMDK files hosted on the SVM called jfs_esx. If I cut the power on Cluster2, I’m going to take out my hosts as well. I really, really didn’t want to take the time to configure a new ONTAP system and vMotion my VMDK files over.
I decided to give SVM Migrate a try. It’s been around since ONTAP 9.10, but I never used it before. The purpose of SVM Migrate is to replicate that entire SVM personality. There are some restrictions, but in my case I just had a 1TB NFS share hosting all my VMDKs.
Since I was working in a lab environment that I own, I figured I’d just give this a try. It was a good test of simplicity. I didn't shut anything down. All my VMs are operational and the RAC clusters are running. Will it all survive the migration? Let's find out! I don't need no documentation.
Caution: Please read the documentation. I didn’t read the documentation, but I’ve been working with ONTAP since ’95 and half my job is trying to break things.
I knew the command was probably vserver something (an SVM is known as a vserver at the CLI) so I just started typing and using the tab key to see what arguments were required. It looked like I could just do this:
rtp-a700s-c01::> vserver migrate start -vserver jfs_esx -source-cluster rtp-a700s-c02
Info: To check the status of the migrate operation use the "vserver migrate show" command.
I was then pretty sure I was moving my jfs_esx SVM from cluster2 to cluster1. Then again, maybe I didn't provide a required argument or maybe there was some aspect of configuration that blocked the migration. Let's find out what happened...
The prior command told me to run vserver migrate show to monitor, so that's what I did. I ran it a couple of times.
rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       setup-configuration

rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       transferring
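Rather than rerunning the command by hand, you could poll it from an admin workstation. Here's a minimal sketch of a generic polling helper; the SSH invocation in the comment assumes key-based access to the cluster management LIF, and the hostname is just the one from this post:

```shell
#!/bin/sh
# Poll a command until its output contains a target string, or give up.
# Usage: poll_until <target> <seconds-between-polls> <max-polls> <command...>
poll_until() {
    target=$1; interval=$2; max=$3; shift 3
    i=0
    while [ "$i" -lt "$max" ]; do
        out=$("$@")          # run the command, capture its output
        echo "$out"
        case $out in
            *"$target"*) return 0 ;;   # found the target string
        esac
        sleep "$interval"
        i=$((i + 1))
    done
    return 1                 # gave up before the target appeared
}

# Example (assumption about your environment, not something from the post):
# poll_until "migrate-complete" 30 120 \
#     ssh admin@rtp-a700s-c01 "vserver migrate show"
```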
Looks like it's working. It appears to have configured the destination and commenced data transfer.
SnapMirror
The most important part of the SVM Migrate operation is moving the data itself, which happens via SnapMirror. That's what the word transferring means above. The SVM Migrate operation is transferring my data. How much data do I need to move?
rtp-a700s-c02::> vol show -vserver jfs_esx jfs_esx -fields used
vserver volume used
------- ------- -------
jfs_esx jfs_esx 536.8GB
Looks like I'll need to transfer around a half terabyte of total data. I just have the one volume in this SVM. It's a 1TB volume, but after efficiency savings it's 536GB of data.
I was monitoring the status by repeatedly running snapmirror show when I saw something odd:
rtp-a700s-c01::> snapmirror show -destination-path jfs_esx:jfs_esx -fields snapshot-progress
source-path destination-path snapshot-progress
--------------- ---------------- -----------------
jfs_esx:jfs_esx jfs_esx:jfs_esx 175.8GB
rtp-a700s-c01::> snapmirror show -destination-path jfs_esx:jfs_esx -fields snapshot-progress
source-path destination-path snapshot-progress
--------------- ---------------- -----------------
jfs_esx:jfs_esx jfs_esx:jfs_esx 23.09GB
What happened? Why did I go from 175GB transferred to just 23GB? The reason is that I was looking at a different SnapMirror operation, and the reason for that was snapshots.
I guessed that SVM Migrate had initialized the mirror, and then was transferring the individual snapshots from the source. I checked the snapshots at the destination to confirm:
rtp-a700s-c01::> snapshot show -vserver jfs_esx
                                                                 ---Blocks---
Vserver  Volume   Snapshot                                  Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
jfs_esx  jfs_esx  snapmirror.ac509ea6-fa33-11ed-ae6e-00a098f7d731_2152057340.2023-05-25_180109
                                                         90.08GB     9%   20%
                  smas_testing_baseline                    6.53GB     1%    2%
                  snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520100.2024-02-27_172427
                                                         133.6MB     0%    0%
                  snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184257
                                                           668KB     0%    0%
                  snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184300
                                                          5.35GB     1%    1%
                  nightly.2024-02-28_0105                 14.40GB     1%    4%
6 entries were displayed.

rtp-a700s-c01::> snapshot show -vserver jfs_esx
                                                                 ---Blocks---
Vserver  Volume   Snapshot                                  Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
jfs_esx  jfs_esx  snapmirror.ac509ea6-fa33-11ed-ae6e-00a098f7d731_2152057340.2023-05-25_180109
                                                         90.08GB     9%   20%
                  smas_testing_baseline                    6.53GB     1%    2%
                  snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520100.2024-02-27_172427
                                                         133.6MB     0%    0%
                  snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184257
                                                           668KB     0%    0%
                  snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184300
                                                          5.35GB     1%    1%
                  nightly.2024-02-28_0105                 19.26GB     2%    5%
                  nightly.2024-02-29_0105                 33.84MB     0%    0%
7 entries were displayed.
You can see I went from 6 snapshots to 7 snapshots in just a few moments. I asked engineering, "Hey, does SVM Migrate initialize a baseline transfer of my data, and then start transferring the deltas to copy the snapshots too?" and they said, "Yup".
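That also explains the progress counter reset I saw earlier: each snapshot delta is its own SnapMirror transfer, and snapshot-progress counts within the current transfer only. Here's a toy model of the sequence, with all sizes invented purely for illustration:

```python
# Toy model of an SVM Migrate data move: one baseline transfer,
# then one delta transfer per snapshot. Each transfer tracks its
# own progress, which is why the observed counter "resets".
# All sizes below are invented for illustration.

def plan_transfers(baseline_gb, snapshot_delta_gbs):
    """Return the list of transfer sizes in the order they run."""
    return [baseline_gb] + list(snapshot_delta_gbs)

transfers = plan_transfers(440.0, [90.1, 6.5, 23.1, 14.4])

for n, size in enumerate(transfers):
    kind = "baseline" if n == 0 else f"snapshot delta {n}"
    # snapshot-progress runs from 0 up to the size of THIS transfer,
    # then starts over when the next transfer begins:
    print(f"{kind}: progress runs 0 -> {size} GB, then resets")
```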
There were 15 snapshots on this volume, so I was about halfway done moving them. My transfer had been running for about 10 minutes at this point.
I went back to monitoring the status, but this time I used the show-volume argument rather than show.
rtp-a700s-c01::> vserver migrate show-volume
                                             Volume
Vserver  Volume           State    Healthy   Transfer Status          Errors
-------- ---------------- -------- --------- ------------------------ ------
jfs_esx  jfs_esx          online   true      Transferring             -
         jfs_esx_root     online   true      ReadyForCutoverPreCommit
Looks like one of my volumes is fully transferred, but there's a lot of data in that jfs_esx volume, so that's still running.
After another 5 minutes or so, I got to this:
rtp-a700s-c01::> vserver migrate show-volume
                                             Volume
Vserver  Volume           State    Healthy   Transfer Status          Errors
-------- ---------------- -------- --------- ------------------------ ------
jfs_esx  jfs_esx          online   true      ReadyForCutoverPreCommit -
         jfs_esx_root     online   true      ReadyForCutoverPreCommit
Cool. All data is transferred, and the volumes are ready for the cutover process. If I didn't want this to happen automatically, I could have deferred the cutover. There are several other options available with the vserver migrate command that I didn't know about initially because, as mentioned before, I didn't actually read the documentation.
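From reading the docs afterward: the start command accepts flags to control cutover behavior, and there are subcommands to drive it manually. A sketch of how that would look (verify parameter names against your ONTAP release; these are from the command reference, not something I ran in this test):

```
# Start the migration but defer the cutover:
rtp-a700s-c01::> vserver migrate start -vserver jfs_esx
                   -source-cluster rtp-a700s-c02 -auto-cutover false

# Later, when ready, trigger the cutover manually:
rtp-a700s-c01::> vserver migrate cutover -vserver jfs_esx

# Pause and resume are also available:
rtp-a700s-c01::> vserver migrate pause -vserver jfs_esx
rtp-a700s-c01::> vserver migrate resume -vserver jfs_esx
```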
Once all the basic data is transferred, it's time for SVM Migrate to perform the cutover. Since this is an RPO=0 migration, the underlying data must be brought into an RPO=0 synchronous replication configuration. SVM Migrate orchestrates that process, and I saw that transition occur:
rtp-a700s-c01::> vserver migrate show-volume
                                             Volume
Vserver  Volume           State    Healthy   Transfer Status          Errors
-------- ---------------- -------- --------- ------------------------ ------
jfs_esx  jfs_esx          online   true      InSync                   -
         jfs_esx_root     online   true      InSync                   -
2 entries were displayed.
I then went back to watching the vserver migrate show output and saw these responses:
rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       post-cutover

rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       cleanup

rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       migrate-complete
I'm impressed. I was in some early conversations about the SVM Migrate feature, but I hadn't thought about it since then.
I successfully relocated all the storage for all my VMs, nondisruptively, with a single command, and without even reading the documentation (again, please read the documentation anyway).
It was simple, and it simply worked. As it should.