Tech ONTAP Blogs
This is a post about how I unexpectedly needed to use one ONTAP feature in order to test a completely different ONTAP feature. If you haven't heard of it, SVM Migrate is a feature that allows you to migrate a running storage environment from one cluster to a completely different cluster, nondisruptively.
The feature I wanted to test was SnapMirror active sync (SM-AS) running in symmetric active-active mode. We enhanced SM-AS last year to offer symmetric active-active replication. Here’s a basic diagram of what I was working with:
It’s a couple of A700 clusters with SM-AS enabled. I set up my Oracle RAC configuration, including databases and quorum drives, on the jfs_as1 and jfs_as2 SVMs. Oracle RAC is not yet supported with SM-AS in active/active mode, but I couldn’t think of a reason it shouldn’t work, and I wanted to give this a spin. The idea here is creating a single cross-site, ultra-available Oracle RAC cluster. I'll post more on that later.
When you first set up an ONTAP system, it’s a little like VMware ESX. You’ll have an operational cluster, but it doesn’t do anything yet. You need to define a Storage Virtual Machine (SVM). It’s basically a self-contained storage personality. As with VMware, it’s about multitenancy, security, and manageability. You might only have the one SVM on your cluster, but if you want to have different SVMs serving different types of data or managed by different teams, you can do that too. For example, maybe you have a production SVM that is treated extra-carefully, but then you have a development SVM where you give your developers more control over their storage environment.
This isn’t the point of this post, but SnapMirror active sync (SM-AS) is a zero-RPO replication solution. When operated in active-active mode, what you have is the same data and the same LUNs available on two different systems. All reads are serviced locally. Writes obviously must be replicated to the partner cluster to maintain consistency. The result is symmetric active-active access to the same dataset.
I know how it works internally, so I was sure that simply configuring replication would result in a perfectly usable solution. The question I had was about failover. When you configure SM-AS, you also have a mediator service that manages tiebreaking and failover.
My first test was to validate what happens when Cluster2 fails. What SHOULD happen is that replication fails and the mediator signals to Cluster1 that it can resume operations unmirrored. After all, the point here is ultra high availability.
Here’s the issue – my Oracle RAC hosts are all running under VMware using VMDK files hosted on the SVM called jfs_esx. If I cut the power on Cluster2, I’m going to take out my hosts as well. I really, really didn’t want to take the time to configure a new ONTAP system and vMotion my VMDK files over.
I decided to give SVM Migrate a try. It’s been around since ONTAP 9.10, but I never used it before. The purpose of SVM Migrate is to replicate that entire SVM personality. There are some restrictions, but in my case I just had a 1TB NFS share hosting all my VMDKs.
Since I was working in a lab environment that I own, I figured I’d just give this a try. It was a good test of simplicity. I didn't shut anything down. All my VMs are operational and the RAC clusters are running. Will it all survive the migration? Let's find out! I don't need no documentation.
Caution: Please read the documentation. I didn’t read the documentation, but I’ve been working with ONTAP since ’95 and half my job is trying to break things.
I knew the command was probably vserver something (an SVM is known as a vserver at the CLI) so I just started typing and using the tab key to see what arguments were required. It looked like I could just do this:
rtp-a700s-c01::> vserver migrate start -vserver jfs_esx -source-cluster rtp-a700s-c02
Info: To check the status of the migrate operation use the "vserver migrate show" command.
I was then pretty sure I was moving my jfs_esx SVM from cluster2 to cluster1. Then again, maybe I didn't provide a required argument or maybe there was some aspect of configuration that blocked the migration. Let's find out what happened...
The prior command told me to run vserver migrate show to monitor, so that's what I did. I ran it a couple of times.
rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       setup-configuration

rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       transferring
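Rather than rerunning the command by hand, you could poll it from an admin workstation. Here's a minimal sketch of a generic polling helper; the SSH invocation in the comment assumes key-based access to the cluster management LIF, and the hostname is just the one from this post:

```shell
#!/bin/sh
# Poll a command until its output contains a target string, or give up.
# Usage: poll_until <target> <seconds-between-polls> <max-polls> <command...>
poll_until() {
    target=$1; interval=$2; max=$3; shift 3
    i=0
    while [ "$i" -lt "$max" ]; do
        out=$("$@")          # run the command, capture its output
        echo "$out"
        case $out in
            *"$target"*) return 0 ;;   # found the target string
        esac
        sleep "$interval"
        i=$((i + 1))
    done
    return 1                 # gave up before the target appeared
}

# Example (assumption about your environment, not something from the post):
# poll_until "migrate-complete" 30 120 \
#     ssh admin@rtp-a700s-c01 "vserver migrate show"
```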
Looks like it's working. It appears to have configured the destination and commenced data transfer.
SnapMirror
The most important part of the SVM Migrate operation is moving the data itself, which happens via SnapMirror. That's what the word transferring means above. The SVM Migrate operation is transferring my data. How much data do I need to move?
rtp-a700s-c02::> vol show -vserver jfs_esx jfs_esx -fields used
vserver volume used
------- ------- -------
jfs_esx jfs_esx 536.8GB
Looks like I'll need to transfer around a half terabyte of total data. I just have the one volume in this SVM. It's a 1TB volume, but after efficiency savings it's 536GB of data.
I was monitoring the status by repeatedly running snapmirror show when I saw something odd:
rtp-a700s-c01::> snapmirror show -destination-path jfs_esx:jfs_esx -fields snapshot-progress
source-path destination-path snapshot-progress
--------------- ---------------- -----------------
jfs_esx:jfs_esx jfs_esx:jfs_esx 175.8GB
rtp-a700s-c01::> snapmirror show -destination-path jfs_esx:jfs_esx -fields snapshot-progress
source-path destination-path snapshot-progress
--------------- ---------------- -----------------
jfs_esx:jfs_esx jfs_esx:jfs_esx 23.09GB
What happened? Why did I go from 175GB transferred to just 23GB? The reason is that I was looking at a different SnapMirror operation, and the reason for that was snapshots.
I guessed that SVM Migrate had initialized the mirror, and then was transferring the individual snapshots from the source. I checked the snapshots at the destination to confirm:
rtp-a700s-c01::> snapshot show -vserver jfs_esx
                                                                 ---Blocks---
Vserver  Volume   Snapshot                                  Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
jfs_esx  jfs_esx  snapmirror.ac509ea6-fa33-11ed-ae6e-00a098f7d731_2152057340.2023-05-25_180109
                                                         90.08GB     9%   20%
                  smas_testing_baseline                    6.53GB     1%    2%
                  snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520100.2024-02-27_172427
                                                         133.6MB     0%    0%
                  snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184257
                                                           668KB     0%    0%
                  snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184300
                                                          5.35GB     1%    1%
                  nightly.2024-02-28_0105                 14.40GB     1%    4%
6 entries were displayed.

rtp-a700s-c01::> snapshot show -vserver jfs_esx
                                                                 ---Blocks---
Vserver  Volume   Snapshot                                  Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
jfs_esx  jfs_esx  snapmirror.ac509ea6-fa33-11ed-ae6e-00a098f7d731_2152057340.2023-05-25_180109
                                                         90.08GB     9%   20%
                  smas_testing_baseline                    6.53GB     1%    2%
                  snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520100.2024-02-27_172427
                                                         133.6MB     0%    0%
                  snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184257
                                                           668KB     0%    0%
                  snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184300
                                                          5.35GB     1%    1%
                  nightly.2024-02-28_0105                 19.26GB     2%    5%
                  nightly.2024-02-29_0105                 33.84MB     0%    0%
7 entries were displayed.
You can see I went from 6 snapshots to 7 snapshots in just a few moments. I asked engineering, "Hey, does SVM Migrate initialize a baseline transfer of my data, and then start transferring the deltas to copy the snapshots too?" and they said, "Yup".
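That also explains the progress counter reset I saw earlier: each snapshot delta is its own SnapMirror transfer, and snapshot-progress counts within the current transfer only. Here's a toy model of the sequence, with all sizes invented purely for illustration:

```python
# Toy model of an SVM Migrate data move: one baseline transfer,
# then one delta transfer per snapshot. Each transfer tracks its
# own progress, which is why the observed counter "resets".
# All sizes below are invented for illustration.

def plan_transfers(baseline_gb, snapshot_delta_gbs):
    """Return the list of transfer sizes in the order they run."""
    return [baseline_gb] + list(snapshot_delta_gbs)

transfers = plan_transfers(440.0, [90.1, 6.5, 23.1, 14.4])

for n, size in enumerate(transfers):
    kind = "baseline" if n == 0 else f"snapshot delta {n}"
    # snapshot-progress runs from 0 up to the size of THIS transfer,
    # then starts over when the next transfer begins:
    print(f"{kind}: progress runs 0 -> {size} GB, then resets")
```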
There were 15 snapshots on this volume, so I was about halfway done moving them. My transfer had been running for about 10 minutes at this point.
I went back to monitoring the status, but this time I used the show-volume argument rather than show.
rtp-a700s-c01::> vserver migrate show-volume
                                             Volume
Vserver  Volume           State    Healthy   Transfer Status          Errors
-------- ---------------- -------- --------- ------------------------ ------
jfs_esx  jfs_esx          online   true      Transferring             -
         jfs_esx_root     online   true      ReadyForCutoverPreCommit
Looks like one of my volumes is fully transferred, but there's a lot of data in that jfs_esx volume, so that's still running.
After another 5 minutes or so, I got to this:
rtp-a700s-c01::> vserver migrate show-volume
                                             Volume
Vserver  Volume           State    Healthy   Transfer Status          Errors
-------- ---------------- -------- --------- ------------------------ ------
jfs_esx  jfs_esx          online   true      ReadyForCutoverPreCommit -
         jfs_esx_root     online   true      ReadyForCutoverPreCommit
Cool. All data is transferred, and the volumes are ready for the cutover process. If I didn't want this to happen automatically, I could have deferred the cutover. There are several other options available with the vserver migrate command that I didn't know about initially because, as mentioned before, I didn't actually read the documentation.
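From reading the docs afterward: the start command accepts flags to control cutover behavior, and there are subcommands to drive it manually. A sketch of how that would look (verify parameter names against your ONTAP release; these are from the command reference, not something I ran in this test):

```
# Start the migration but defer the cutover:
rtp-a700s-c01::> vserver migrate start -vserver jfs_esx
                   -source-cluster rtp-a700s-c02 -auto-cutover false

# Later, when ready, trigger the cutover manually:
rtp-a700s-c01::> vserver migrate cutover -vserver jfs_esx

# Pause and resume are also available:
rtp-a700s-c01::> vserver migrate pause -vserver jfs_esx
rtp-a700s-c01::> vserver migrate resume -vserver jfs_esx
```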
Once all the basic data is transferred, it's time for SVM Migrate to perform the cutover. Since this is an RPO=0 migration, the underlying data must be brought into an RPO=0 synchronous replication configuration. SVM Migrate orchestrates that process, and I saw that transition occur:
rtp-a700s-c01::> vserver migrate show-volume
                                             Volume
Vserver  Volume           State    Healthy   Transfer Status          Errors
-------- ---------------- -------- --------- ------------------------ ------
jfs_esx  jfs_esx          online   true      InSync                   -
         jfs_esx_root     online   true      InSync                   -
2 entries were displayed.
I then went back to watching the vserver migrate show output and saw these responses:
rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       post-cutover

rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       cleanup

rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       migrate-complete
I'm impressed. I was in some early conversations about the SVM Migrate feature, but I hadn't thought about it since then.
I successfully relocated all the storage for all my VMs, nondisruptively, with a single command, and without even reading the documentation (again, please read the documentation anyway).
It was simple, and it simply worked. As it should.