Tech ONTAP Blogs

Nondisruptive migration of a running Oracle database to a new cluster

steiner
NetApp

The inevitability of migration

 

There is one certainty in all enterprise IT infrastructures – migration. It doesn’t matter whose product you buy - you'll need to migrate data eventually. Sometimes it's because your current storage array has reached end-of-life. Other times, you'll find that a particular workload is in the wrong place. Sometimes it's about real estate. For example, you might need to power down a data center for maintenance and temporarily relocate a critical workload.

 

When this happens, you have lots of options. If you have an RPO=0 DR plan, you might be able to use your disaster recovery procedures to execute the migration. If it's just a tech refresh, you might use OS features to nondisruptively migrate data on a host-by-host basis. Logical volume managers can help you do that. With ONTAP storage systems, you might choose to swap the controllers and get yourself to a newer hardware platform. If you're moving datasets around geographically, ONTAP's SnapMirror is a convenient and highly scalable option for large-scale migration.

 

This post is about SVM Migrate, a feature that allows you to transparently migrate a complete storage environment from one array to another.

 

SVMs

 

Before I show the data from my testing, I need to explain the Storage Virtual Machine, or SVM. This is one of the most underrated and underutilized features of ONTAP.

 

ONTAP multitenancy is a little like ESX. To do anything useful, you have to create a virtual machine. In the case of ONTAP, we call it an SVM. The SVM is essentially a logical storage array, including security policies, replication policies, LUNs, NFS shares, SMB shares, and so forth. It’s a self-contained storage object, much like a guest on an ESX server is a self-contained operating system. ONTAP isn’t really a hypervisor, of course, but the result is still multitenancy.
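
To make that concrete, here's roughly what carving out a new SVM looks like from the cluster shell. This is just a minimal sketch with hypothetical names (svm_dev, aggr1); a real deployment would go on to create data volumes, network interfaces, and protocol access:

Cluster1::> vserver create -vserver svm_dev -aggregate aggr1 -rootvolume-security-style unix

From that point on, svm_dev is a self-contained tenant with its own volumes, shares, and logins.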

 

Most customers seem to create just a single SVM on their ONTAP cluster, and that usually makes sense to me. Most of them want to share out LUNs and files to various clients, and there is a single team in charge of the array. It's just a single array to them.

 

Sometimes, however, they're missing an opportunity. For example, they could have created two SVMs, one for production and one for development. This would allow them to safely give the developers more direct control over provisioning and management of their storage. They could have created a third SVM that contains sensitive file shares, and they could lock that SVM down to select users.
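
That kind of delegation is handled with SVM-scoped administrative accounts. Here's a hedged sketch, again with hypothetical names, of giving the developers their own vsadmin login on their own SVM:

Cluster1::> security login create -vserver svm_dev -user-or-group-name devadmin -application ssh -authentication-method password -role vsadmin

A devadmin connecting over SSH can provision and manage storage inside svm_dev, but can't see or touch anything in a production SVM.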

 

There’s no right or wrong answer; it depends on business needs. It's really about the granularity of data management.

 

SVM Migrate

 

You can migrate an entire SVM nondisruptively. There are some restrictions, and you can read more here, but if you're running a vanilla NFS configuration with workloads such as VMware or Oracle databases, it can be a great way to perform a nondisruptive migration. As mentioned above, there are many reasons you might want to do that, including moving select storage environments to new hardware, rebalancing workloads as performance needs evolve, or even shifting work around in an emergency.
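
If you want to verify that a given SVM is eligible before committing to anything, recent ONTAP releases also support a precheck mode on the same command (check the documentation for your release):

Cluster1::> vserver migrate start -vserver jfs_svmmigrate -source-cluster rtp-a700s-c01 -check-only true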

 

The key difference between SVM Migrate and other options is that you are essentially migrating a storage array from one hardware platform to another. As mentioned above, an SVM is a complete logical storage array unto itself. Migrating an SVM means migrating all the storage, snapshots, security policies, logins, IP addresses, and other aspects of configuration from one hardware platform to another. It’s also designed to be used on a running system.

 

I’ll explain some of the internals below. It’s easier to understand if you look at the graph.

 

Test environment

 

I usually work with complicated application environments, so to test SVM Migrate I picked the touchiest configuration I could think of – Oracle RAC. I built an Oracle RAC cluster using version 21c for both the Grid and Database software.

 

A test with a database that is just sitting there inert proves nothing, so I added a load generator. I normally use Oracle SLOB, available here. It’s an incredibly powerful tool, and its main value is that it’s a real Oracle database doing real Oracle IO. It’s not synthetic like vdbench. It’s the real thing, and you can measure IO and response times at the database layer. Anything I do in a migration test affects a real database, with real timeouts and real error handling.
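
For anyone who wants to reproduce this kind of load, a typical SLOB 2.x run looks something like the following. The tablespace name and session count here are arbitrary choices for illustration, not anything specific to this test:

$ ./setup.sh IOPS 32    # one-time: create 32 SLOB schemas in the IOPS tablespace
$ ./runit.sh 32         # drive the workload with 32 concurrent sessions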

 

My main interest was in the effect on cutover. At some point, the storage personality (the SVM) is going to have to cease operating on the old hardware platform and start operating on the new platform. That’s the point where IP addresses will be relocated and the location of IO processing will change.

 

What will cutover look like? After multiple migrations back and forth within my lab setup, I decided to graph it.

 

The Graph

 

Here’s what it looks like:

 

[Figure: total database read and write throughput (MB/sec) over time during the SVM migration, with cutover at about the 180-second mark]

 

 

Here’s what was happening:

 

  • I started the workload and let it reach a steady state. There was about 180MB/sec of total database read IO and 35MB/sec of write IO. This isn’t enormous, but it’s a respectable amount of activity for a single database. This is also a very latency-sensitive workload, so any changes to storage system IO service times will be clearly reflected in the graph.
  • I initiated the SVM migration at the 0-second mark shown on the X-axis.
  • For the first 25 seconds or so, I could see setup operations occurring. The new SVM personality was being prepared on the new environment. This requires transferring basic configuration information, future IP addresses, security policies, and so forth. I wouldn’t expect any impact on performance yet, and as expected, there was none.
  • Starting at about 25 seconds, I could see a SnapMirror operation initialize and transfer data. This creates a mirror copy of the source SVM and all of its data from the current hardware cluster to the new cluster. (You can watch this progress from the CLI; see the sketch after this list.)
  • Up through the 175-second mark, I could see repeated SnapMirror transfers as the individual snapshot deltas were also replicated from the source cluster to the destination cluster. I'm not just migrating the data; I'm also migrating the snapshots used for backups, clones, and other purposes.
  • The system then entered a synchronous replication state for a few seconds. You can see throughput drop noticeably on the graph. This is a natural result of a database needing to wait extra time for writes to complete, because those writes are now being committed to two different storage systems before being acknowledged.
  • Cutover occurred at about the 180-second mark.
  • You can then see the cache start to warm up on the destination cluster. The total IO climbs as response times improve.
  • The IO eventually stabilizes at around 250MB/sec of read IO and 45MB/sec of write IO. This increase in IO reflects the fact that the new storage array had a slightly better network connection between the storage system and the database server. There are fewer network hops.
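
As referenced in the list above, each of these phases is visible from the CLI while the migration runs. Checking on progress is a single command:

Cluster1::> vserver migrate show -vserver jfs_svmmigrate

The state reported there walks through the phases described above, although the exact state names vary by ONTAP release.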

 

That’s it. It just works, and all it took was a single command.

 

Cluster1::> vserver migrate start -vserver jfs_svmmigrate -source-cluster rtp-a700s-c01
Info: To check the status of the migrate operation use the "vserver migrate show" command.

Cluster1::>

 

I’m impressed. I know ONTAP internals well enough to have predicted how this would work, and SVM Migrate really isn’t doing anything new. It’s orchestrating basic ONTAP capabilities, but whoever put all this together did a great job. I was able to monitor all the steps as they proceeded, I didn’t note any unexplained or problematic pauses, and the cutover should be almost undetectable to database users.

 

I wouldn’t have hesitated to use SVM Migrate in the middle of the workday if I were still in my prior job. If the DBAs were really looking, they might have noticed a brief, minor impact on performance, but as a practical matter this was a nondisruptive operation.

 

There’s more to the “vserver migrate” command than I showed here, too. For example, you might have a lot of data to move and want to set up the initial copying but defer the cutover until later. You can read about it in the documentation.
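
As I understand it, that deferred workflow uses the auto-cutover option. A sketch, reusing my lab's names:

Cluster1::> vserver migrate start -vserver jfs_svmmigrate -source-cluster rtp-a700s-c01 -auto-cutover false

Then, when you're ready to move the SVM's identity to the new cluster:

Cluster1::> vserver migrate cutover -vserver jfs_svmmigrate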
