SnapMirror active sync and the active-active data center
I've been beating up on SnapMirror active sync in the lab for about a year now, and I have to say this is the coolest feature I’ve seen added to ONTAP in years. In particular, I'm talking about SnapMirror active sync running in symmetric active-active mode. The value isn’t in the feature itself, it’s in the sorts of solutions it allows you to create. The foundation of SnapMirror active sync is SnapMirror synchronous, which has been around much longer and is well-proven, but I only started using the active-active capabilities a year ago.
There are a number of ways to configure SnapMirror active sync, but the one I want to focus on here is an active-active application environment. You can build clustered environments including Oracle RAC, VMware, and Windows Server Failover Clusters (WSFC) where the application cluster is distributed across two sites and IO response times are symmetric. You get RPO=0 through synchronous cross-site mirroring and RTO=0 through built-in ONTAP automation and the fact that storage resources are available on both sites, all the time.
This post should be useful for anyone looking at synchronous replication solutions, but I’m going to use Oracle RAC to illustrate SnapMirror active sync functionality because you can examine the RTO=0 and RPO=0 characteristics of the overall solution from the application layer all the way to the storage layer. The same database is operational at both sites, all the time. A disaster requires breaking a mirror, but you don’t really fail anything over because the services were running at the surviving site all along.
The examples used might be Oracle RAC, but the principles are universal. This includes concepts like "preferred site", heartbeats, tiebreaking, and timeouts. Even if you know next to nothing about Oracle RAC, read on.
Before diving into the details of how SnapMirror active sync (SM-as) works, I’d like to explain what SnapMirror active sync (SM-as) is for.
RPO=0
If you work in IT, you’ve definitely heard about the Recovery Point Objective. How much data can you afford to lose if certain events happen? If you're looking at synchronous replication, that means RPO=0, but that's more nuanced than it seems. When I see a requirement for RPO=0, I usually divide it into two categories:
RPO for common problems
RPO for disaster scenarios
RPO=0 for normal problems is fairly easy to accomplish. For example, if you have a database you should expect to need to recover it occasionally. Maybe a patch went wrong and caused corruption or someone deleted the wrong file. All you usually need is RAID-protected storage and a basic backup plan and you can restore the database, replay logs, and fix the problem with no loss of data. You should expect to have RPO=0 recoverability for these types of common problems, and if you choose ONTAP we can make the procedures easier, faster, and less expensive.
If you want RPO=0 for disaster scenarios, things get more complicated. As an example, what if you have a malicious insider who decided to destroy your data and its backups? That requires a very different approach using the right technology with the right solution. We have some cool things to offer in that space.
SnapMirror active sync is about RPO=0 for disaster scenarios that lead to site loss. It might be permanent site loss because of a fire or maybe it's a temporary loss because of a power outage. Perhaps it's more isolated, like a power surge that destroys a storage array. SM-as addresses these scenarios with synchronous mirroring. It's not just RPO=0 replication either, it's a self-healing, integrated, intelligent RPO=0 replication solution.
RTO=0
A requirement for RPO=0 often comes with a requirement for Recovery Time Objective (RTO) of zero as well. The RTO is the amount of time you can be without a service, and overall it's an easily understood SLA, with one exception, which is RTO=0. What does that mean?
The term "RTO=0" across the whole IT industry. I continue to insist there’s no such thing as RTO=0. That would require instantaneously recognizing a problem exists and then instantly correcting it to restore service. That isn’t possible. From a storage perspective, there’s no way to know whether a 10 millisecond wait for a LUN to respond is because the storage system is merely busy with other IO or the data center itself fell into a black hole and no longer exists.
You can’t have RTO=0, but you can have a near-zero RTO from the perspective of IT operations. That exact definition of “near-zero” will depend on business needs. If an RTO is low enough that operations aren’t significantly affected, that’s what the industry calls RTO=0. It means “nondisruptive”.
The RTO isn't determined solely by storage availability. For example, a typical ONTAP controller failover completes in around 2-4 seconds, and during this time there will be a pause in SAN IO. Host operating systems, however, often have much higher timeouts. ONTAP can fail over in 2 seconds and be fully ready to serve data again, but sometimes the host will wait 15 seconds before retrying an IO and you can’t change that. OS vendors have optimized SAN behavior over the years, but it's not perfect. Those rare but real and occasionally lengthy IO timeouts are probably happening from time to time in your SAN already. Sometimes an FC frame will go missing because a cable is just slightly damaged and there will be a brief performance blip. It shouldn’t cause real-world problems, though. Nothing should crash.
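If you're curious what your own hosts are configured to tolerate, you can inspect the relevant timeouts directly. Here's a minimal sketch for a Linux host; the device name sdX is a placeholder, and your multipath defaults may differ:

cat /sys/block/sdX/device/timeout             # per-device SCSI command timeout, in seconds
multipathd show config | grep no_path_retry   # how long multipath keeps queueing IO when all paths are down

These values, combined with the storage failover time, determine how long an IO can stall before the host gives up on it.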
Since SM-as failover operations should be nondisruptive to ongoing IT operations, including disasters, I’ll call SM-as an RTO=0 storage solution since that's the way the RTO term is used in the industry.
You still have to think about the application layer, though. For example, you could distribute a VMware cluster across sites and replicate the underlying storage with SM-as, but failure of a given site would require VMware HA to detect the loss of a given VM and restart it on the surviving site. That takes some time. Does that requirement meet your RTO? If you can’t start a VM quickly enough, you can consider containerization. One of its benefits is a container can be activated nearly instantaneously. There’s no OS boot sequence. That might be an option to improve the big-picture RTO. Finally, if you’re using a truly active-active application like Oracle RAC, you can proactively have the same database running at both sites all the time. The right solution depends on the RTO and the behavior of the entire solution during failover events.
From a storage perspective, achieving that near-zero RTO is a lot more complicated than you’d think. How does one site know the other site is alive? What should happen if a write cannot be replicated? How does a storage system know the difference between a replication partner that is permanently offline versus a partner that is only temporarily unreachable? What should happen to LUNs that are being accessed by a host but are no longer guaranteed to be able to serve the most recent copy of data? What happens when replication is restored?
On to the diagrams…
SnapMirror active sync architecture
Let’s start with the basics of how SM-as replication works.
Here’s what this shows:
These are two different storage clusters. One of them is Cluster1, and the other is Cluster2. (I'll explain the jfs_as1 and jfs_as2 labels a little later…)
It might look like there are six LUNs but it’s logically only three.
The reason is there are six different LUN images, but LUN1 on jfs_as1 is functionally the same as LUN1 on jfs_as2, the LUN2s are the same, and the LUN3s are the same.
From a host point of view, this looks like ordinary SAN multipathing. LUN1 and its data is available and accessible on either cluster.
IO behavior and IO performance are symmetric.
If you perform a read of LUN1 from the cluster on site A, the read will be serviced by the local copy of the data on site A
If you perform a read of LUN1 from the cluster on site B, the read will be serviced by the local copy of the data on site B
Performing a write of LUN1 will require replication of the write to the opposite site before the write is acknowledged.
Now let's look at failover. There’s another component involved – the mediator.
The mediator is required for safely automating failover. Ideally, it would be placed on an independent 3rd site, but it can still function for most needs if it’s placed on site A or site B. The mediator is not really a tiebreaker, although that’s effectively the function it provides. It's not taking any actions; it’s providing an alternate communication channel to the storage systems.
The #1 challenge with automated failover is the split-brain problem, and that problem arises if your two sites lose connectivity with each other. What should happen? You don’t want to have two different sites activate themselves as the surviving copy of the data, but how can a single site tell the difference between actual loss of the opposite site and an inability to communicate with the opposite site?
This is where the mediator enters the picture. If placed on a 3rd site, and each site has a separate network connection to that site, then you have an additional path for each site to validate the health of the other. Look at the picture above again and consider the following scenarios.
What happens if the mediator fails or is unreachable from one or both sites?
The two clusters can still communicate with each other over the same link used for replication services.
Data is still served with RPO=0 protection
What happens if Site A fails?
Site B will see both of the communication channels to the partner storage system go down.
Site B will take over data services, but without RPO=0 mirroring
What happens if Site B fails?
Site A will see both of the communication channels to the partner storage system go down.
Site A will take over data services, but without RPO=0 mirroring
There is one other scenario to consider: Loss of the data replication link. If the replication link between sites is lost, RPO=0 mirroring will obviously be impossible. What should happen then?
This is controlled by the preferred site status. In an SM-as relationship, one of the sites is secondary to the other. This makes no difference in normal operations, and all data access is symmetric, but if replication is interrupted then the tie will have to be broken to resume operations. The result is that the preferred site will continue operations without mirroring and the secondary will halt IO processing until replication is reestablished. More on this topic below…
SAN design with SnapMirror active sync
As explained above, SM-as functionally provides the same LUNs on two different clusters. That doesn’t mean you necessarily want your hosts to access their LUNs across all available paths. You might, or you might not. There are two basic options called uniform and non-uniform access.
Uniform access
Let’s say you had a single host, and you wanted RPO=0, RTO=0 data protection. You’d probably want to configure SM-as with uniform access, which means the LUNs would be visible to your host from both clusters.
Obviously, this will mean that complete loss of site A would also result in the loss of your host and its applications, but you would have extremely high availability storage services for your data. This is probably a less common use case for SM-as, but it does happen. Not all applications can be clustered. Sometimes all you can do is provide ultra-available storage and accept that you’ll need to find alternate servers on a remote site in the event of disaster.
Another option with uniform access is in a full mesh networked cluster.
If you use SM-as with uniform access in a cluster as shown above, any host should always have access to a usable copy of the data. Even if one of the storage systems failed or storage replication connectivity was lost, the hosts would continue working. From a host multipath point of view, it all looks like a single storage system. A failure of any one component would result in some of the SAN paths disappearing off the network, but otherwise everything would continue working as usual.
Uniform access with local proximity
Another feature of SM-as is the option to configure the storage systems to know where the hosts are located. When you map the LUNs to a given host, you can indicate whether or not they are local to the storage system.
In normal operation, all IO is local IO. Reads and writes are serviced from the local storage array. Write IO will, of course, need to be replicated by the local controller to the remote system before being acknowledged, but all read IO will be serviced locally and will not incur extra latency by traversing the SAN.
The only time the nonoptimized paths will be used is when all active/optimized paths are lost. For example, if the entire array on site A lost power, the hosts at site A would still be operational, although they would be experiencing higher read latency.
Note: There are redundant paths through the local cluster that are not shown on these diagrams for the sake of simplicity. ONTAP storage systems are HA themselves, so a controller failure should not result in site failure. It should merely result in a change in which local paths are used by an affected host.
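If you're using uniform access and want to confirm that a given host is actually preferring its local paths, a quick host-side check is possible. A minimal sketch for a Linux host (the alias oradata0 comes from the multipath configuration shown later in this post; substitute your own device name):

multipath -ll oradata0

The path group reported with status=active holds the active/optimized (local) paths, while the group with status=enabled holds the non-optimized paths that will only be used if the optimized paths disappear.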
Nonuniform access
Nonuniform access means each host has access to only a subset of available ports.
The primary benefit to this approach is SAN simplicity: you remove the need to stretch a SAN over the network. Some users don't have dark fiber between sites or lack the infrastructure to tunnel FC SAN traffic over an IP network. In some cases, the additional latency overhead of an application accessing storage across sites on a regular basis would be unacceptable, rendering the improved availability of uniform access unnecessary.
The disadvantage to nonuniform access is that certain failure scenarios, including loss of the replication link, will result in half of your hosts losing access to storage. Applications that run as single instances, such as a non-clustered database that is inherently only running on a single host at any given moment, would fail if local storage connectivity were lost. The data would still be protected, but the database server would no longer have access. It would need to be restarted on a remote site, preferably through an automated process. For example, VMware HA can detect an all-paths-down situation and restart a VM on another server where paths are available. In contrast, a clustered application such as Oracle RAC can deliver the same services that are constantly running at both sites. Losing a site with Oracle RAC doesn’t mean loss of the application service as a whole. Instances are still available and running at the surviving site.
Uniform access with ASA
The diagrams above showed path prioritization with AFF storage systems. NetApp ASA systems provide active-active multipathing, including with the use of SM-as. Consider the following diagram:
An ASA configuration with non-uniform access would work largely the same as it would with AFF. With uniform access, IO would be crossing the WAN. This may or may not be desirable. If the two sites were 100 meters apart with fiber connectivity there should be no detectable additional latency crossing the WAN, but if the sites were a long distance apart then read performance would suffer on both sites. In contrast, with AFF those WAN-crossing paths would only be used if there were no local paths available and read performance would be better.
ASA with SM-as in a low-latency configuration offers two interesting benefits. First, it essentially doubles the performance for any single host because IO can be serviced by twice as many controllers using twice as many paths. Second, it offers extreme availability because an entire storage system could be lost without interrupting host access.
SnapMirror active sync vs MetroCluster
Those of you that know NetApp probably know SM-as' cousin, MetroCluster. SM-as is similar to NetApp MetroCluster in overall functionality, but there are important differences in the way in which RPO=0 replication is implemented and how it is managed.
A MetroCluster configuration is more like one integrated cluster with nodes distributed across sites. SM-as behaves like two otherwise independent clusters that are cooperating in serving data from specified RPO=0 synchronously replicated LUNs.
The data in a MetroCluster configuration is only accessible from one particular site at any given time. A second copy of the data is present on the opposite site, but the data is passive. It cannot be accessed without a storage system failover.
MetroCluster and SM-as mirroring occur at different levels. MetroCluster mirroring is performed at the RAID layer. The low-level data is stored in a mirrored format using SyncMirror. The use of mirroring is virtually invisible up at the LUN, volume, and protocol layers.
In contrast, SM-as mirroring occurs at the protocol layer. The two clusters are overall independent clusters. Once the two copies of data are in sync, the two clusters only need to mirror writes. When a write occurs on one cluster, it is replicated to the other cluster. The write is only acknowledged to the host when the write has completed on both sites. Other than this protocol splitting behavior, the two clusters are otherwise normal ONTAP clusters.
The sweet spot for MetroCluster is large-scale replication. You can replicate an entire array with RPO=0 and near-zero RTO. This simplifies the failover process because there is only one "thing" to fail over, and it scales extremely well in terms of capacity and IOPS.
The first sweet spot for SM-as is granular replication. Sometimes you don’t want to replicate all data on a storage system as a single unit, or you need to be able to selectively fail over certain workloads.
The second sweet spot for SM-as is for active-active operations, where you want fully usable copies of data to be available on two different clusters located in two different locations with identical performance characteristics and, if desired, no requirement to stretch the SAN across sites. You can have your applications already running on both sites, which reduces the overall RTO during failover operations.
SnapMirror active sync setup
Although the following sections focus on an Oracle RAC configuration, the concepts are applicable to most any SnapMirror active sync deployment. Using Oracle RAC as an example is especially helpful because Oracle RAC is inherently complicated. If you understand how to configure and manage Oracle RAC on SM-as, including all the nuances, requirements, and especially the benefits ONTAP delivers for such a configuration, you’ll be able to use SM-as for any workload.
Additionally, I’ll be using the CLI to configure ONTAP, but you can also use the SystemManager GUI, and I want to say that I think it's outstanding. The UI team did great work with SnapMirror active sync. I’m using the CLI mostly because I personally prefer a CLI, but also because it's easier to explain what's happening with ONTAP if I can go step-by-step with the CLI. The GUI automates multiple operations in a single step.
Finally, I'll assume a mediator is already properly configured.
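If you want to confirm that the plumbing is in place before going further, the peering and mediator status can be checked from the ONTAP CLI. A minimal sketch (the cluster and SVM names are from my lab):

Cluster1::> cluster peer show
Cluster1::> vserver peer show
Cluster1::> snapmirror mediator show

All three should report healthy, peered relationships before you attempt to create a SnapMirror active sync relationship.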
Storage Virtual Machines (SVM)
One item of ONTAP terminology you need to understand is the SVM, also called a vserver. ONTAP storage clusters are built for multitenancy. When you first install a system, it doesn't provide storage services yet, in the same way that a freshly installed VMware server can't run applications until you create a VM.
An ONTAP SVM (again, also called a vserver, especially at the CLI) is not an actual VM running under a hypervisor, but conceptually it's the same thing. It's a virtual storage system with its own network addresses, security configuration, user accounts, and so forth. Once you create an SVM, you then provision your file shares, LUNs, S3 buckets, and other storage resources under the control of that SVM.
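If you want to see which SVMs exist on a cluster, a simple check from the CLI is enough (this is a generic command, not specific to SnapMirror active sync):

Cluster1::> vserver show

The output lists each SVM along with its type and state, and the data SVMs are the ones that own your LUNs and network interfaces.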
You'll see the use of -vserver in the examples below. I have two clusters, Cluster1 and Cluster2. Cluster1 includes an SVM called jfs_as1, which will be participating in a SnapMirror active sync relationship with SVM jfs_as2, located on Cluster2.
Provision storage
Once the SVM is defined, the next step is to create some volumes and LUNs. Remember – with ONTAP, a volume is not a LUN. A volume is a management object for storing data, including LUNs. We usually combine related LUNs in a single volume. The volume layout is usually designed to simplify management of those LUNs.
Confused? Here’s a diagram of my basic volume and LUN layout.
There is no true best practice for any database layout. The best option depends on what you want to do with the data. In this case, there are three volumes:
One volume for the Oracle Grid management and quorum LUNs.
One volume for the Oracle datafiles.
One volume for the Oracle logs, and this includes redo logs, archive logs, and the controlfiles.
I used a LUN count of 8 for the datafiles. This isn't an ONTAP limitation; it's about the host OS. You need multiple LUNs to maximize SAN performance through the host IO stack, and 4-8 LUNs is usually required. The LUN count for logs and other sequentially accessed files is less important.
This design is all about manageability. For example, if you need to restore a database, the quickest way to do it would be to revert the state of the datafiles and then replay logs to the desired restore point. If you store the datafiles in a single volume, you can revert the state of that one volume to an earlier snapshot, but you need to ensure the log files are separated from datafiles. Otherwise, reverting the state of the datafiles would result in loss of archive log data that is critical for RPO=0 data recovery.
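To make that concrete, here's a hedged sketch of what the storage side of such a restore could look like at the ONTAP CLI. The snapshot name is hypothetical, and in a replicated configuration additional steps may be needed around the mirror, so treat this as an illustration of the volume-level granularity rather than a runbook:

Cluster1::> volume snapshot restore -vserver jfs_as1 -volume jfsAA_oradata_siteA -snapshot pre_patch_snapshot

Because the logs live in a different volume, they are untouched, and the database can then be recovered forward using the redo and archive logs.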
Likewise, this design allows you to manage performance more easily. You simply place a QoS limit on the datafile volume, rather than individual LUNs. As an aside, never place a QoS limit on a database transaction log because of the bursty nature of log IO. The average IO for transaction logs is usually fairly low, but the IO occurs in short bursts. If QoS engages during those bursts, the result will be significant performance damage. QoS is for datafiles only.
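As a sketch of how that might look at the CLI (the policy-group name and the limit are made up for illustration):

Cluster1::> qos policy-group create -policy-group jfsAA_datafiles -vserver jfs_as1 -max-throughput 20000iops
Cluster1::> volume modify -vserver jfs_as1 -volume jfsAA_oradata_siteA -qos-policy-group jfsAA_datafiles

The limit applies to the datafile volume as a whole, and the log volume is deliberately left without a policy.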
Create the volumes
I created my three volumes as follows.
Cluster1::> vol create -vserver jfs_as1 -volume jfsAA_grid_siteA -snapshot-policy none -percent-snapshot-space 0 -size 256g -space-guarantee none
Cluster1::> vol create -vserver jfs_as1 -volume jfsAA_oradata_siteA -snapshot-policy none -percent-snapshot-space 0 -size 1t -space-guarantee none
Cluster1::> vol create -vserver jfs_as1 -volume jfsAA_logs_siteA -snapshot-policy none -percent-snapshot-space 0 -size 500g -space-guarantee none
There are a lot of different options when creating a volume. If you use SystemManager, you’ll get volumes with default behavior that is close to universally appropriate, but when using the CLI you might need to look at all the available options.
In my case, I wanted to create volumes for the grid, datafile, and log LUNs that include the following attributes:
Disable scheduled snapshots. Scheduled snapshots can provide powerful on-box data protection, but the schedule, retention policy, and naming conventions need to be based on the SLAs. For now, I’d rather just disable snapshots to ensure no snapshots are unknowingly created.
Set the snapshot reserve to 0%. There is no reason to reserve snapshot space in a SAN environment.
Set space guarantees to none. Space guarantees would reserve the volume’s capacity on the system, which is almost always wasteful with a database. Most databases compress well, so reserving the full size of the volume would be unnecessary.
I want space efficiency settings to be enabled, but this is default behavior and does not require special arguments.
Create the LUNs
The next step is to create the required LUNs inside the volumes. I’ll be using the defaults, plus disabling space reservations. This will result in LUNs only consuming the space actually required by that LUN within the volume.
Cluster1::> lun create -vserver jfs_as1 /vol/jfsAA_grid_siteA/lun0 -size 64g -ostype linux -space-reserve disabled
Cluster1::> lun create -vserver jfs_as1 /vol/jfsAA_grid_siteA/lun1 -size 64g -ostype linux -space-reserve disabled
…
Cluster1::> lun create -vserver jfs_as1 /vol/jfsAA_logs_siteA/lun0 -size 64g -ostype linux -space-reserve disabled
Cluster1::> lun create -vserver jfs_as1 /vol/jfsAA_logs_siteA/lun1 -size 64g -ostype linux -space-reserve disabled
Define the CG
There are now three volumes of data: Oracle datafiles LUNs, Oracle log LUNs, and RAC cluster resource LUNs. While three separate volumes deliver optimal manageability, synchronous replication requires a single container. This RAC environment needs to be replicated and kept in sync as a unified whole. If site failure occurs, all resources at the surviving site need to be consistent with one another. We need a consistency group.
As mentioned above, with ONTAP, a volume is not a LUN, it’s just a management container. If all LUNs in a given dataset are placed in a single volume, you can create snapshots, clone, restore, or replicate that single volume as a unit. In other words, a volume in ONTAP is natively a consistency group.
In many SM-as use cases, placing all the LUNs of a given application in a single volume might be all you need to meet your data protection requirements. Sometimes your requirements are more complicated. You might need to separate an application into multiple volumes based on manageability requirements, but also want to manage the application as a unit. That’s why we created ONTAP Consistency Groups, and you can read more about them here and here.
I can define a consistency group for my three volumes at the CLI by providing a CG name and the current volumes.
Cluster1::> consistency-group create -vserver jfs_as1 -consistency-group jfsAA -volumes jfsAA_oradata_siteA,jfsAA_logs_siteA,jfsAA_grid_siteA
The result is I've created a CG called jfsAA (I used the letters AA to denote the active-active configuration I’m building) based on the current grid, datafile, and log volumes and the LUNs they contain. Also, note that I used a suffix of _siteA so I can more easily keep track of which cluster hosts which volumes and the data within those volumes. More on that below…
Establish replication
There are now volumes of LUNs on site A, but replication isn’t operational yet. The first step to prepare for replication is to create volumes on site B. As mentioned previously, the SystemManager UI automates all of the setup work, but it’s easier to explain each step in the automated sequence of events using the CLI.
Create destination volumes
Before I can replicate data, I need a place to host the replicas.
Cluster2::> vol create -vserver jfs_as2 -volume jfsAA_grid_siteB -snapshot-policy none -percent-snapshot-space 0 -size 256g -type DP -space-guarantee none
Cluster2::> vol create -vserver jfs_as2 -volume jfsAA_oradata_siteB -snapshot-policy none -percent-snapshot-space 0 -size 1t -type DP -space-guarantee none
Cluster2::> vol create -vserver jfs_as2 -volume jfsAA_logs_siteB -snapshot-policy none -percent-snapshot-space 0 -size 500g -type DP -space-guarantee none
The commands above created three volumes on site B using the same settings as used at site A, except they are type “DP”, which means a data protection volume. This identifies a volume that will be joined to a replication relationship. No LUNs are being provisioned. They will be created automatically once replication is initialized and the contents of the volumes on Cluster2 are synchronized from the source volumes on Cluster1.
Initialize replication
The following command creates a SnapMirror active sync relationship in active-active mode. This command is run on the cluster hosting the uninitialized destination. SnapMirror is designed as a pull technology. For example, asynchronous SnapMirror updates pull new data from the source. It is not pushed from the source to a destination.
Cluster2::> snapmirror create -source-path jfs_as1:/cg/jfsAA -destination-path jfs_as2:/cg/jfsAA -cg-item-mappings jfsAA_grid_siteA:@jfsAA_grid_siteB,jfsAA_oradata_siteA:@jfsAA_oradata_siteB,jfsAA_logs_siteA:@jfsAA_logs_siteB -policy AutomatedFailOverDuplex
Operation succeeded: SnapMirror create for the relationship with destination "jfs_as2:/cg/jfsAA".
This operation makes a lot more sense if you see it in the GUI, but I can break down the command. Here’s what it does:
snapmirror create
That's self-evident. It's creating a snapmirror relationship.
-source-path jfs_as1:/cg/jfsAA
Use a source of the SVM called jfs_as1 and the consistency group called jfsAA. You'll see this syntax elsewhere in the ONTAP CLI. A consistency group is denoted as [svm name]:/cg/[cg name].
-destination-path jfs_as2:/cg/jfsAA
Replicate that source CG called jfsAA to the SVM called jfs_as2 and use the same CG name of jfsAA
-cg-item-mappings jfsAA_grid_siteA:@jfsAA_grid_siteB,jfsAA_oradata_siteA:@jfsAA_oradata_siteB,jfsAA_logs_siteA:@jfsAA_logs_siteB
This section controls the mapping of volumes to volumes. The syntax is [source volume]:@[destination volume]. I've mapped the source grid volume to the destination grid volume, oradata to oradata, and logs to logs. The only difference in the name is the suffix. I used a siteA and siteB suffix for the volumes to avoid user errors. If someone is performing management activities on the UI, using either SystemManager or the CLI, they should be readily able to tell whether they're working on the site A or site B system based on the suffix of the volume names.
-policy AutomatedFailOverDuplex
The final argument specifies a snapmirror relationship of type AutomatedFailOverDuplex, which means bidirectional synchronous replication with automated failure detection. The relationship now exists, but it's not yet initialized. This requires the following command on the second cluster.
Cluster2::> snapmirror initialize -destination-path jfs_as2:/cg/jfsAA
Operation is queued: SnapMirror initialize of destination "jfs_as2:/cg/jfsAA".
I can check the status as follows, and the key is looking for a Relationship Status of InSync and Healthy being true.
Cluster2::> snapmirror show -vserver jfs_as2 -destination-path jfs_as2:/cg/jfsAA
Source Path: jfs_as1:/cg/jfsAA
Destination Path: jfs_as2:/cg/jfsAA
Relationship Type: XDP
Relationship Group Type: consistencygroup
SnapMirror Policy: AutomatedFailOverDuplex
Mirror State: Snapmirrored
Relationship Status: InSync
Healthy: true
Define the igroup
Before I make the LUNs available, I need to define the initiator group (igroup). I’m building an Oracle RAC cluster, which means two different hosts will be accessing the same LUNs. I’m using iSCSI, but FC works the same. It just uses WWNs rather than iSCSI initiator IDs.
First, I'll create the igroup on the site A system. Any LUNs mapped to this igroup will be available via iSCSI to hosts using the specified initiator.
Cluster1::> igroup create -vserver jfs_as1 -igroup jfsAA -ostype linux -initiator iqn.1994-05.com.redhat:a8ee93358a32
Next, I'll enter Advanced mode and associate the local initiator name with the local SVM. This is how ONTAP controls path priority. Any host with a WWN or iSCSI initiator listed in the igroup will be able to access LUNs mapped to that igroup, but without proximity settings the path priorities would not be optimal. I want paths originating on site A to only be advertised as optimal paths to the hosts located on site A.
Cluster1::> set advanced
Cluster1::*> igroup initiator add-proximal-vserver -vserver jfs_as1 iqn.1994-05.com.redhat:a8ee93358a32 -proximal-vservers jfs_as1
I then repeat the process on site B, using the WWN or iSCSI initiator of the host on site B.
Cluster2::> igroup create -vserver jfs_as2 -igroup jfsAA -ostype linux -initiator iqn.1994-05.com.redhat:5214562dfc56
Cluster2::> set advanced
Cluster2::*> igroup initiator add-proximal-vserver -vserver jfs_as2 iqn.1994-05.com.redhat:5214562dfc56 -proximal-vservers jfs_as2
I didn't really need to add the proximal-vserver information. Read on to understand why not.
Map the LUNs
Next, I map the LUNs to the igroup so the hosts will be able to access them. Although I used the same igroup name on each site, these igroups are located on different SVMs.
Cluster1::> lun map -vserver jfs_as1 /vol/jfsAA_grid_siteA/lun0 -igroup jfsAA
Cluster1::> lun map -vserver jfs_as1 /vol/jfsAA_grid_siteA/lun1 -igroup jfsAA
…
Cluster1::> lun map -vserver jfs_as1 /vol/jfsAA_oradata_siteA/lun6 -igroup jfsAA
Cluster1::> lun map -vserver jfs_as1 /vol/jfsAA_oradata_siteA/lun7 -igroup jfsAA
Cluster2::> lun map -vserver jfs_as2 /vol/jfsAA_grid_siteB/lun0 -igroup jfsAA
Cluster2::> lun map -vserver jfs_as2 /vol/jfsAA_grid_siteB/lun1 -igroup jfsAA
…
Cluster2::> lun map -vserver jfs_as2 /vol/jfsAA_oradata_siteB/lun6 -igroup jfsAA
Cluster2::> lun map -vserver jfs_as2 /vol/jfsAA_oradata_siteB/lun7 -igroup jfsAA
Oracle Configuration
From this point, setup is exactly like any other Oracle RAC server. Functionally, this is like a 2-site Oracle Extended RAC cluster, except there's no need to configure ASM failgroups. The replication services are built into the storage system.
Device Names
There are multiple ways to control device names with Oracle, but my personal preference is using udev rules and multipath aliases. It takes more up-front work, but I have more control over the exact naming conventions to be used.
The multipath.conf file looks like this on each RAC node:
[root@jfs12 ~]# cat /etc/multipath.conf
multipaths {
    multipath {
        wwid  3600a0980383041327a2b55676c547247
        alias grid0
    }
    multipath {
        wwid  3600a0980383041327a2b55676c547248
        alias grid1
    }
    multipath {
        wwid  3600a0980383041327a2b55676c547249
        alias grid2
    }
    …
    multipath {
        wwid  3600a0980383041334a3f55676c69734d
        alias logs5
    }
    multipath {
        wwid  3600a0980383041334a3f55676c69734e
        alias logs6
    }
    multipath {
        wwid  3600a0980383041334a3f55676c69734f
        alias logs7
    }
}
Then I have the following udev rule:
[root@jfs12 ~]# cat /etc/udev/rules.d/99-asm.rules
ENV{DM_NAME}=="grid*", GROUP:="asmadmin", OWNER:="grid", MODE:="660"
ENV{DM_NAME}=="oradata*", GROUP:="asmadmin", OWNER:="grid", MODE:="660"
ENV{DM_NAME}=="logs*", GROUP:="asmadmin", OWNER:="grid", MODE:="660"
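Neither the multipath aliases nor the udev rules take effect until the services re-read their configuration. A minimal sketch of how I'd apply them on a running node (exact steps can vary by distribution):

[root@jfs12 ~]# systemctl reload multipathd
[root@jfs12 ~]# udevadm control --reload-rules
[root@jfs12 ~]# udevadm trigger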
The result is clear device names that are automatically assigned the correct user and group permissions. You can even see the devices in /dev/mapper.
[root@jfs12 ~]# ls /dev/mapper control grid1 logs0 logs2 logs4 logs6 oradata0 oradata2 oradata4 oradata6 vg00-root
ASM Configuration
Unlike typical Extended RAC, I created the ASM diskgroups with external redundancy, which means no mirroring by ASM itself. Replication services are provided by the storage system, not RAC. I created the following ASM diskgroups:
[root@jfs12 ~]# /grid/bin/asmcmd ls
DBF/
GRID/
LOGS/
From this point on, the installation process is exactly like any other Oracle RAC installation.
Database Creation
I used the following database layout:
[root@jfs12 ~]# /grid/bin/asmcmd ls DBF/NTAP
124D13D9FE3FFF29E06370AAC00A260E/
124DA9A2A3BB954AE06370AAC00A7624/
DATAFILE/
NTAPpdb1/
PARAMETERFILE/
PASSWORD/
TEMPFILE/
pdbseed/
sysaux01.dbf
system01.dbf
temp01.dbf
undotbs01.dbf
undotbs02.dbf
users01.dbf
[root@jfs12 ~]# /grid/bin/asmcmd ls LOGS/NTAP
ARCHIVELOG/
CONTROLFILE/
ONLINELOG/
control01.ctl
control02.ctl
redo01.log
redo02.log
redo03.log
redo04.log
There is a reason for this, but it's not connected to the use of SnapMirror active sync. See the section above called "Provision storage" for an explanation.
Failure scenarios
The following diagram shows the Oracle database and storage configuration as it exists at the hardware level (mediator not shown), and it is using a non-uniform network configuration. There is no SAN connectivity across sites.
I wrote above that I didn't need to configure host proximity. The reason is I'm using a nonuniform configuration. I have not stretched the SAN across sites. The only paths available to hosts are local paths. There's no reason to add host proximity settings because hosts will never be able to see storage paths to the opposite site. I included the instructions for configuring host proximity for readers who may be using uniform configurations and need to know how proximity settings are controlled.
Several of the scenarios described below that resulted in loss of database services would not have happened with a uniform network configuration. With no cross-site SAN connectivity, anything that results in the loss of active paths on a given site means there are no paths remaining at all. In a uniform network configuration, each site would be able to use alternate paths on the opposite site. The reason a non-uniform configuration was chosen for these tests was to illustrate that it is possible to have active-active RPO=0 replication without extending the SAN. It's a more demanding use case, and while it does have some limitations it also has benefits, such as a simpler SAN architecture.
There's another way to look at the storage architecture. The existence of replication is essentially invisible to the Oracle database and the RAC cluster. Various failure scenarios might result in disruption to certain database servers or the loss of certain paths, but as far as the database is concerned, there's just one set of LUNs. Logically, it looks like this:
Preferred sites
The configuration is symmetric, with one exception that is connected to split-brain management.
The question you need to ask is this - what happens if the replication link is lost and neither site has quorum? What do you want to happen? This question applies to both the Oracle RAC and the ONTAP behavior. If changes cannot be replicated across sites, and you want to resume operations, one of the sites will have to survive and the other site will have to become unavailable.
Oracle and css_critical
In the case of Oracle RAC, the default behavior is that one of the nodes in a RAC cluster that consists of an even number of servers will be deemed more important than the other nodes. The site with that higher priority node will survive site isolation while the nodes on the other site will evict. The prioritization is based on multiple factors, but you can also control this behavior using the css_critical setting.
My environment has two nodes, jfs12 and jfs13. The current settings for css_critical are as follows:
[root@jfs12 ~]# /grid/bin/crsctl get server css_critical
CRS-5092: Current value of the server attribute CSS_CRITICAL is no.
[root@jfs13 trace]# /grid/bin/crsctl get server css_critical
CRS-5092: Current value of the server attribute CSS_CRITICAL is no.
I want the site with jfs12 to be the preferred site, so I changed this value to yes on a site A node and restarted services.
[root@jfs12 ~]# /grid/bin/crsctl set server css_critical yes
CRS-4416: Server attribute 'CSS_CRITICAL' successfully changed. Restart Oracle High Availability Services for new value to take effect.
[root@jfs12 ~]# /grid/bin/crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'jfs12'
CRS-2673: Attempting to stop 'ora.crsd' on 'jfs12'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on server 'jfs12'
CRS-2673: Attempting to stop 'ora.ntap.ntappdb1.pdb' on 'jfs12'
…
CRS-2673: Attempting to stop 'ora.gipcd' on 'jfs12'
CRS-2677: Stop of 'ora.gipcd' on 'jfs12' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'jfs12' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@jfs12 ~]# /grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
SnapMirror active sync preferred site
At any given moment, SnapMirror active sync will consider one site the "source" and the other the "destination". This implies a one-way replication relationship, but that's not what's happening. This is how the preferred site is determined. If the replication link is lost, the LUN paths on the source will continue to serve data while the LUN paths on the destination will become unavailable until replication is reestablished and enters a synchronous state. The paths will then resume serving data.
My current configuration has site A as the preferred site for both Oracle and ONTAP. This can be viewed via SystemManager:
or at the CLI:
Cluster2::> snapmirror show -destination-path jfs_as2:/cg/jfsAA
Source Path: jfs_as1:/cg/jfsAA
Destination Path: jfs_as2:/cg/jfsAA
Relationship Type: XDP
Relationship Group Type: consistencygroup
SnapMirror Schedule: -
SnapMirror Policy Type: automated-failover-duplex
SnapMirror Policy: AutomatedFailOverDuplex
Mirror State: Snapmirrored
Relationship Status: InSync
The key is that the source is the SVM on cluster1. As mentioned above, the terms "source" and "destination" don't describe the flow of replicated data. Both sites can process a write and replicate it to the opposite site. In effect, both clusters are sources and destinations. The effect of designating one cluster as a source simply controls which cluster survives as a read-write storage system if the replication link is lost.
Loss of SnapMirror replication connectivity
If I cut the SM-as replication link, write IO cannot be completed because it would be impossible for a cluster to replicate changes to the opposite site. Here's what will happen with an Oracle RAC environment.
Site A
The result on site A of a replication link failure will be an approximately 15 second pause in write IO processing as ONTAP attempts to replicate writes before it determines that the replication link is genuinely inoperable. After the 15 seconds elapses, the ONTAP cluster on site A resumes read and write IO processing. The SAN paths will not change, and the LUNs will remain online.
Site B
Since site B is not the SnapMirror active sync preferred site, its LUN paths will become unavailable after about 15 seconds.
The replication link was cut at the timestamp 15:19:44. The first warning from Oracle RAC arrives 100 seconds later as the 200 second timeout (controlled by the Oracle RAC parameter disktimeout) approaches.
2024-09-10 15:21:24.702 [ONMD(2792)]CRS-1615: No I/O has completed after 50% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 99340 milliseconds.
2024-09-10 15:22:14.706 [ONMD(2792)]CRS-1614: No I/O has completed after 75% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 49330 milliseconds.
2024-09-10 15:22:44.708 [ONMD(2792)]CRS-1613: No I/O has completed after 90% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 19330 milliseconds.
2024-09-10 15:23:04.710 [ONMD(2792)]CRS-1604: CSSD voting file is offline: /dev/mapper/grid2; details at (:CSSNM00058:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc.
2024-09-10 15:23:04.710 [ONMD(2792)]CRS-1606: The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc
2024-09-10 15:23:04.716 [ONMD(2792)]CRS-1699: The CSS daemon is terminating due to a fatal error from thread: clssnmvDiskPingMonitorThread; Details at (:CSSSC00012:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc
2024-09-10 15:23:04.731 [OCSSD(2794)]CRS-1652: Starting clean up of CRSD resources.
Once the 200 second voting disk timeout has been reached, this Oracle RAC node will evict itself from the cluster and reboot.
Loss of Oracle RAC replication
Loss of the Oracle RAC replication link will produce a similar result, except the timeouts will be shorter by default. An Oracle RAC node will wait 200 seconds after loss of storage connectivity before evicting, but it will only wait 30 seconds after loss of the RAC network heartbeat.
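If you want to check the values in effect on your own cluster before changing anything, Oracle Grid Infrastructure exposes them through crsctl (the grid home path below is from my lab, and output will vary with your version):

[root@jfs12 ~]# /grid/bin/crsctl get css disktimeout
[root@jfs12 ~]# /grid/bin/crsctl get css misscount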
The CRS messages are similar to those shown below. You can see the 30 second timeout lapse. Since css_critical was set on jfs12, located on site A, that will be the site to survive and jfs13 on site B will be evicted.
2024-09-12 10:56:44.047 [ONMD(3528)]CRS-1611: Network communication with node jfs13 (2) has been missing for 75% of the timeout interval. If this persists, removal of this node from cluster will occur in 6.980 seconds
2024-09-12 10:56:48.048 [ONMD(3528)]CRS-1610: Network communication with node jfs13 (2) has been missing for 90% of the timeout interval. If this persists, removal of this node from cluster will occur in 2.980 seconds
2024-09-12 10:56:51.031 [ONMD(3528)]CRS-1607: Node jfs13 is being evicted in cluster incarnation 621599354; details at (:CSSNM00007:) in /gridbase/diag/crs/jfs12/crs/trace/onmd.trc.
2024-09-12 10:56:52.390 [CRSD(6668)]CRS-7503: The Oracle Grid Infrastructure process 'crsd' observed communication issues between node 'jfs12' and node 'jfs13', interface list of local node 'jfs12' is '192.168.30.1:33194;', interface list of remote node 'jfs13' is '192.168.30.2:33621;'.
2024-09-12 10:56:55.683 [ONMD(3528)]CRS-1601: CSSD Reconfiguration complete. Active nodes are jfs12 .
2024-09-12 10:56:55.722 [CRSD(6668)]CRS-5504: Node down event reported for node 'jfs13'.
2024-09-12 10:56:57.222 [CRSD(6668)]CRS-2773: Server 'jfs13' has been removed from pool 'Generic'.
2024-09-12 10:56:57.224 [CRSD(6668)]CRS-2773: Server 'jfs13' has been removed from pool 'ora.NTAP'.
Complete loss of replication network
Oracle RAC split-brain detection has a dependency on the Oracle RAC storage heartbeat. If loss of site-to-site connectivity results in simultaneous loss of both the RAC network heartbeat and storage replication services, the RAC sites will not be able to communicate cross-site via either the RAC interconnect or the RAC voting disks. The result with an even-numbered set of nodes may be eviction of both sites under default settings. The exact behavior will depend on the sequence of events and the timing of the RAC network and disk heartbeat polls.
The risk of a 2-site outage can be addressed in two ways. The first is to use an odd number of RAC nodes, preferably by placing a tiebreaker node on a 3rd site. If a 3rd site is unavailable, the tiebreaker instance could be placed on one of the two main sites. Loss of site-to-site connectivity will still cause LUN paths to go down on one site, but one of the RAC sites will retain quorum and will not evict.
If a 3rd site is not available, this problem can be addressed by adjusting the misscount parameter on the RAC cluster. Under the defaults, the RAC network heartbeat timeout is 30 seconds. This is normally used by RAC to identify failed RAC nodes and remove them from the cluster. It also has a connection to the voting disk heartbeat.
If, for example, the conduit carrying intersite traffic for both Oracle RAC and storage replication services is cut by a backhoe, the 30 second misscount countdown will begin. If the RAC preferred site node cannot reestablish contact with the opposite site within 30 seconds, and it also cannot use the voting disks to confirm the opposite site is down within that same 30 second window, then the preferred site nodes will also evict. The result is a full database outage.
Depending on when the misscount polling occurs, 30 seconds may not be enough time for SnapMirror active sync to time out and allow storage on the preferred site to resume services before the 30 second window expires. This 30 second window can be increased.
[root@jfs12 ~]# /grid/bin/crsctl set css misscount 100
CRS-4684: Successful set of parameter misscount to 100 for Cluster Synchronization Services.
This value allows the storage system on the preferred site to resume operations before the misscount timeout expires. The result will then be eviction only of the nodes at the site where the LUN paths were removed. Example below:
2024-09-12 09:50:59.352 [ONMD(681360)]CRS-1612: Network communication with node jfs13 (2) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 49.570 seconds
2024-09-12 09:51:10.082 [CRSD(682669)]CRS-7503: The Oracle Grid Infrastructure process 'crsd' observed communication issues between node 'jfs12' and node 'jfs13', interface list of local node 'jfs12' is '192.168.30.1:46039;', interface list of remote node 'jfs13' is '192.168.30.2:42037;'.
2024-09-12 09:51:24.356 [ONMD(681360)]CRS-1611: Network communication with node jfs13 (2) has been missing for 75% of the timeout interval. If this persists, removal of this node from cluster will occur in 24.560 seconds
2024-09-12 09:51:39.359 [ONMD(681360)]CRS-1610: Network communication with node jfs13 (2) has been missing for 90% of the timeout interval. If this persists, removal of this node from cluster will occur in 9.560 seconds
2024-09-12 09:51:47.527 [OHASD(680884)]CRS-8011: reboot advisory message from host: jfs13, component: cssagent, with time stamp: L-2024-09-12-09:51:47.451
2024-09-12 09:51:47.527 [OHASD(680884)]CRS-8013: reboot advisory message text: oracssdagent is about to reboot this node due to unknown reason as it did not receive local heartbeats for 10470 ms amount of time
2024-09-12 09:51:48.925 [ONMD(681360)]CRS-1632: Node jfs13 is being removed from the cluster in cluster incarnation 621596607
Note: Oracle Support strongly discourages altering the misscount or disktimeout parameters to work around configuration problems. Changing these parameters can, however, be warranted and unavoidable in many cases, including SAN booting, virtualized, and storage replication configurations. If, for example, you had stability problems with a SAN or IP network that were resulting in RAC evictions, you should fix the underlying problem and not change the values of misscount or disktimeout. Changing timeouts to address configuration errors is masking a problem, not solving a problem. Changing these parameters to properly configure a RAC environment based on design aspects of the underlying infrastructure is different and is consistent with Oracle support statements. With SAN booting, it is common to adjust misscount all the way up to 200 to match disktimeout.
Storage system failure
The result of a storage system failure is nearly identical to the result of losing the replication link. The surviving site should experience a roughly 15 second IO pause on writes. Once that 15 second period elapses, IO will resume on that site as usual.
The Oracle RAC node on the failed site will lose storage services and enter the same 200 second disktimeout countdown before eviction and subsequent reboot.
2024-09-11 13:44:38.613 [ONMD(3629)]CRS-1615: No I/O has completed after 50% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 99750 milliseconds.
2024-09-11 13:44:51.202 [ORAAGENT(5437)]CRS-5011: Check of resource "NTAP" failed: details at "(:CLSN00007:)" in "/gridbase/diag/crs/jfs13/crs/trace/crsd_oraagent_oracle.trc"
2024-09-11 13:44:51.798 [ORAAGENT(75914)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 75914
2024-09-11 13:45:28.626 [ONMD(3629)]CRS-1614: No I/O has completed after 75% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 49730 milliseconds.
2024-09-11 13:45:33.339 [ORAAGENT(76328)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 76328
2024-09-11 13:45:58.629 [ONMD(3629)]CRS-1613: No I/O has completed after 90% of the maximum interval. If this persists, voting file /dev/mapper/grid2 will be considered not functional in 19730 milliseconds.
2024-09-11 13:46:18.630 [ONMD(3629)]CRS-1604: CSSD voting file is offline: /dev/mapper/grid2; details at (:CSSNM00058:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc.
2024-09-11 13:46:18.631 [ONMD(3629)]CRS-1606: The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc
2024-09-11 13:46:18.638 [ONMD(3629)]CRS-1699: The CSS daemon is terminating due to a fatal error from thread: clssnmvDiskPingMonitorThread; Details at (:CSSSC00012:) in /gridbase/diag/crs/jfs13/crs/trace/onmd.trc
2024-09-11 13:46:18.651 [OCSSD(3631)]CRS-1652: Starting clean up of CRSD resources.
The SAN path state on the RAC node that has lost storage services looks like this:
oradata7 (3600a0980383041334a3f55676c697347) dm-20 NETAPP,LUN C-Mode
size=128G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 34:0:0:18 sdam 66:96 failed faulty running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 33:0:0:18 sdaj 66:48 failed faulty running
The Linux host detected the loss of the paths much quicker than 200 seconds, but from a database perspective the client connections to the host on the failed site will still be frozen for 200 seconds under the default Oracle RAC settings. Full database operations will only resume after the eviction is completed.
Meanwhile, the Oracle RAC node on the opposite site will record the loss of the other RAC node. It otherwise continues to operate as usual.
2024-09-11 13:46:34.152 [ONMD(3547)]CRS-1612: Network communication with node jfs13 (2) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.020 seconds
2024-09-11 13:46:41.154 [ONMD(3547)]CRS-1611: Network communication with node jfs13 (2) has been missing for 75% of the timeout interval. If this persists, removal of this node from cluster will occur in 7.010 seconds
2024-09-11 13:46:46.155 [ONMD(3547)]CRS-1610: Network communication with node jfs13 (2) has been missing for 90% of the timeout interval. If this persists, removal of this node from cluster will occur in 2.010 seconds
2024-09-11 13:46:46.470 [OHASD(1705)]CRS-8011: reboot advisory message from host: jfs13, component: cssmonit, with time stamp: L-2024-09-11-13:46:46.404
2024-09-11 13:46:46.471 [OHASD(1705)]CRS-8013: reboot advisory message text: At this point node has lost voting file majority access and oracssdmonitor is rebooting the node due to unknown reason as it did not receive local hearbeats for 28180 ms amount of time
2024-09-11 13:46:48.173 [ONMD(3547)]CRS-1632: Node jfs13 is being removed from the cluster in cluster incarnation 621516934
Cut the power to the mediator
The mediator service does not directly control storage operations. It functions as an alternate control path between clusters. It exists primarily to automate failover without the risk of a split-brain scenario. In normal operation, each cluster is replicating changes to its partner, and each cluster therefore can verify that the partner cluster is online and serving data. If the replication link failed, replication would cease.
The reason a mediator is required for safe automated operations is that it would otherwise be impossible for a storage cluster to determine whether the loss of bidirectional communication was the result of a network outage or an actual storage failure.
The mediator provides an alternate path for each cluster to verify the health of its partner. The scenarios are as follows:
If a cluster can contact its partner directly, replication services are operational. No action required.
If a preferred site cannot contact its partner directly or via the mediator, it will assume the partner is either actually unavailable or was isolated and has taken its LUN paths offline. The preferred site will then proceed to release the RPO=0 state and continue processing both read and write IO.
If a non-preferred site cannot contact its partner directly, but can contact it via the mediator, it will take its paths offline and await the return of the replication connection.
If a non-preferred site cannot contact its partner directly or via an operational mediator, it will assume the partner is either actually unavailable or was isolated and has taken its LUN paths offline. The non-preferred site will then proceed to release the RPO=0 state and continue processing both read and write IO. It will assume the role of the replication source and will become the new preferred site.
If the mediator is wholly unavailable:
Failure of replication services for any reason will result in the preferred site releasing the RPO=0 state and resuming read and write IO processing. The non-preferred site will take its paths offline.
Failure of the preferred site will result in an outage because the non-preferred site will be unable to verify that the opposite site is truly offline and therefore it would not be safe for the nonpreferred site to resume services.
Restoring services
It's difficult to demonstrate what happens when power is restored after a failure or a failed replication link is repaired. SnapMirror active sync will automatically detect the presence of a faulty replication relationship and bring it back to an RPO=0 state. Once synchronous replication is reestablished, the paths will come online again.
In many cases, clustered applications will automatically detect the return of failed paths, and those applications will also come back online. In other cases, a host-level SAN scan may be required, or applications may need to be brought back online manually. It depends on the application and how it's configured, and in general such tasks can be easily automated. ONTAP itself is self-healing and should not require any user intervention to resume RPO=0 storage operations.
Another feature of SnapMirror active sync that is difficult to demonstrate is the speed of recovery. Unlike many competitors, ONTAP replication uses a pointer-based mechanism to track changes. The result is that resynchronization after a replication interruption is extremely fast. ONTAP is able to identify changed blocks and efficiently and quickly ship them to the remote system, even if replication was interrupted for days, weeks, or months. Many competitors use a slower and less efficient extent-based approach which requires reading and transferring much more data than merely those blocks which have changed. In some cases, the entire mirror must be rebuilt.
In addition, ONTAP's unique pointer-based technology preserves an intact copy of your data during resynchronization. The copy would be out of date, but it would exist. Resynchronization with many competing technologies results in a copy that is corrupt until the resynchronization is complete. This leaves a customer exposed to significant data loss if the last remaining copy of the data is damaged.
This is one of many examples of how ONTAP features not only deliver results, but also deliver a superior end result because ONTAP just works better.
SnapMirror active sync failover
Changing the preferred site requires a simple operation. IO will pause for a second or two as authority over replication behavior switches between clusters, but IO is otherwise unaffected.
GUI example:
Example of changing it back via the CLI:
Cluster2::> snapmirror failover start -destination-path jfs_as2:/cg/jfsAA
[Job 9575] Job is queued: SnapMirror failover for destination "jfs_as2:/cg/jfsAA ".

Cluster2::> snapmirror failover show
Source            Destination                                          Error
Path              Path              Type     Status    start-time end-time   Reason
----------------- ----------------- -------- --------- ---------- ---------- ----------
jfs_as1:/cg/jfsAA jfs_as2:/cg/jfsAA planned  completed 9/11/2024  9/11/2024
                                                       09:29:22   09:29:32
The new destination path can be verified as follows:
Cluster1::> snapmirror show -destination-path jfs_as1:/cg/jfsAA

                           Source Path: jfs_as2:/cg/jfsAA
                      Destination Path: jfs_as1:/cg/jfsAA
                     Relationship Type: XDP
               Relationship Group Type: consistencygroup
                SnapMirror Policy Type: automated-failover-duplex
                     SnapMirror Policy: AutomatedFailOverDuplex
                          Mirror State: Snapmirrored
                   Relationship Status: InSync
Summary
The examples above are designed to illustrate SnapMirror active sync functionality. As with most enterprise IT projects, there are few true best practices. The right configuration depends on business needs including RPO, RTO, and SLAs. With respect to disaster recovery, it also depends on which disaster scenarios are more likely than others to occur.
The next posts planned will include use of an Oracle RAC tiebreaker, use of uniform networking, and quantifying application IO resumption times. The foundation remains the same - SnapMirror active sync operating in active-active mode, an integrated RPO=0/RTO=0 SAN replication technology that is reliable, self-healing, easily configured, and easily managed.
If you want to learn more, you can visit the official ONTAP documentation for SnapMirror active sync, and if you'd really like to see it in action I recommend getting in touch with Neto and the rest of the CPOC (Customer Proof Of Concept) team. They've got all the cool hardware to simulate your real-world workloads.
... View more
The inevitability of migration
There is one certainty in all enterprise IT infrastructures – migration. It doesn’t matter whose product you buy - you'll need to migrate data eventually. Sometimes it's because your current storage array has reached end-of-life. Other times, you'll find that a particular workload is in the wrong place. Sometimes it's about real estate. For example, you might need to power down a data center for maintenance and a critical workload would need to be relocated temporarily.
When this happens, you have lots of options. If you have an RPO=0 DR plan, you might be able to use your disaster recovery procedures to execute the migration. If it's just a tech refresh, you might use OS features to nondisruptively migrate data on a host-by-host basis. Logical volume managers can help you do that. With ONTAP storage systems, you might choose to swap the controllers and get yourself to a newer hardware platform. If you're moving datasets around geographically, ONTAP's SnapMirror is a convenient and highly scalable option for large-scale migration.
This post is about SVM Migrate, which is a feature that allows you to transparently migrate a complete storage environment from one array to another.
SVMs
Before I show the data from my testing, I need to explain the Storage Virtual Machine, or SVM. This is one of the most underrated and underutilized features of ONTAP.
ONTAP multitenancy is a little like ESX. To do anything useful, you have to create a virtual machine. In the case of ONTAP, we call it an SVM. The SVM is essentially a logical storage array, including security policies, replication policies, LUNs, NFS shares, SMB shares, and so forth. It’s a self-contained storage object, much like a guest on an ESX server is a self-contained operating system. ONTAP isn’t really a hypervisor, of course, but the result is still multitenancy.
Most customers seem to create just a single SVM on their ONTAP cluster, and that usually makes sense to me. Most customers want to share out LUNs and files to various clients and there is a single team in charge of the array. It's just a single array to them.
Sometimes, however, they're missing an opportunity. For example, they could have created two SVMs, one for production and one for development. This would allow them to safely give the developers more direct control over provisioning and management of their storage. They could have created a third SVM that contains sensitive file shares, and they could lock that SVM down to select users.
There’s no right or wrong answer, it depends on business needs. It's really about granularity of data management.
SVM Migrate
You can migrate an entire SVM nondisruptively. There are some restrictions, and you can read more here, but if you're running a vanilla NFS configuration with workloads such as VMware or Oracle databases it can be a great way to perform nondisruptive migration. As mentioned above, there are many reasons you might want to do that, including moving select storage environments to new hardware, rebalancing workloads as performance needs evolve, or even shifting work around in an emergency situation.
The key difference between SVM Migrate and other options is that you are essentially migrating a storage array from one hardware platform to another. As mentioned above, an SVM is a complete logical storage array unto itself. Migrating an SVM means migrating all the storage, snapshots, security policies, logins, IP addresses, and other aspects of configuration from one hardware platform to another. It’s also designed to be used on a running system.
I’ll explain some of the internals below. It’s easier to understand if you look at the graph.
Test environment
I usually work with complicated application environments, so to test SVM Migrate I picked the touchiest configuration I could think of – Oracle RAC. I built an Oracle RAC cluster using version 21c for both the Grid and Database software.
A test with a database that is just sitting there inert proves nothing, so I added a load generator. I normally use Oracle SLOB, available here. It’s an incredibly powerful tool, and the main value is that it’s a real Oracle database doing real Oracle IO. It’s not synthetic like vdbench. It’s the real thing, and you can measure IO and response times at the database layer. Anything I do in a migration test would be affecting a real database and associated real timeouts and error handling.
My main interest was in the effect on cutover. At some point, the storage personality (the SVM) is going to have to cease operating on the old hardware platform and start operating on the new platform. That’s the point where IP addresses will be relocated and the location of IO processing will change.
What will cutover look like? After multiple migrations back and forth within my lab setup, I decided to graph it.
The Graph
Here’s what it looks like:
Here’s what was happening:
I started the workload and let it reach a steady state. There was about 180MB/sec of total database read IO and 35MB/sec of write IO. This isn’t enormous, but it’s a respectable amount of activity for a single database. This is also a very latency-sensitive workload, so any changes to storage system IO service times will be clearly reflected in the graph.
I initiated the SVM migration at the 0 seconds mark shown on the X-axis.
The first 25 seconds or so I could see setup operations occurring. The new SVM personality was being prepared on the new environment. This will require the transfer of basic configuration information, future IP addresses, security policies, and so forth. I wouldn’t expect any impact on performance yet, and as expected, there was none.
Starting at about 25 seconds, I could see a SnapMirror operation initialize and transfer data. This creates a mirror copy of the source SVM and all of its data from the current hardware cluster to the new cluster.
Up through the 175 second mark, I could see repeated SnapMirror transfers as the individual snapshot deltas were also replicated from the source cluster to the destination cluster. I'm not just migrating the data, I'm migrating the snapshots used for backups, clones, and other purposes.
The system then entered a synchronous replication state for a few seconds. You can see throughput drop noticeably on the graph. This is a natural result of a database needing to wait extra time for writes to complete because those writes are now being committed to two different storage systems before being acknowledged.
Cutover occurred at about the 180 second mark.
You can then see the cache start to warm up on the destination cluster. The total IO climbs as response times improve.
The IO eventually stabilizes at around 250MB/sec of read IO and 45MB/sec of write IO. This increase in IO reflects the fact that the new storage array has a slightly better network connection between the storage system and the database server. There are fewer network hops.
That’s it. It just works, and all it took was a single command.
Cluster1::> vserver migrate start -vserver jfs_svmmigrate -source-cluster rtp-a700s-c01

Info: To check the status of the migrate operation use the "vserver migrate show" command.

Cluster1::>
I’m impressed. I know ONTAP internals well enough to have predicted how this would work, and SVM Migrate really isn’t doing anything new. It’s orchestrating basic ONTAP capabilities, but whoever put all this together did a great job. I was able to monitor all the steps as they proceeded, I didn’t note any unexplained or problematic pauses, and the cutover should be almost undetectable to database users.
I wouldn’t have hesitated to use SVM Migrate in the middle of the workday if I was still in my prior job. If the DBAs were really looking, they might have noticed a short and minor impact on performance, but as a practical matter this was a nondisruptive operation.
There’s more to the command “vserver migrate” than I showed here too. For example, you might have a lot of data to move and you want to set up the initial copying by defer the cutover until later. You can read about it in the documentation.
... View more
As far as I know, the same APIs should work on all platforms within reason. Obviously you can't make a call for MetroCluster switchovers if you're not using a MetroCluster in the first place, and an NFS-related API call shouldn't work on an ASA because there are no file protocols on an ASA. Other than that, ONTAP should be ONTAP. What API call did you encounter that failed? If you tried to disable TSSE, that might have been the problem because TSSE cannot be turned off in the C-Series systems. That's documented in the overall C-Series material, but that restriction should be reiterated in the API documentation. I'll file a doc update request on that.
... View more
Active-active data center (with Oracle!)
I've worked with Oracle customers on DR solutions for 15+ years. The perfect solution would, of course, be RPO=0 and RTO=0*, but not all applications can tolerate the write latency involved in an RPO=0 synchronous solution. Sometimes you have to settle for an RPO of 15 minutes or a slightly longer RTO.
Sometimes, however, RPO=0 and RTO=0 are required because the data is really that critical.
We've been able to do this with SnapMirror active sync (formerly known as SnapMirror Business Continuity) for a while, but now we can do it in symmetric active-active mode. You can now have two clusters in two completely different sites, each serving data, with identical performance characteristics, and you don't even need to extend the SAN across sites.
This is the foundation of what customers call "active-active data center". There is no primary site and DR site. There are just two sites. Half your database is running on site A, and the other half is running on site B. Each local storage system will service all read IO from its local copy of the data. Write IO will, of course, be replicated to the opposite site before being acknowledged, because that's how synchronous mirroring works. Symmetric storage IO means symmetric database responses and symmetric application behavior.
SnapMirror active sync in active-active mode is in tech preview now with select customers. Oracle RAC is not yet a supported configuration, but there's no technical reason it shouldn't work, and I wanted to be ready for this feature to become generally available. I've been cutting power and network links for the past couple weeks, and I haven't managed to crash my database yet.
*Note: There's really no such thing as RTO=0 because it takes a certain amount of time to know whether recovery procedures are even warranted. You don't want a total disaster failover triggered just because a single IO operation didn't complete in one second. I consider SnapMirror active sync to be an RTO=0 solution because the environment is already running at the opposite site. The lag time in resuming operations isn't because of the failover itself; it's because it sometimes takes at least 15-30 seconds, even under automation, to be sure that failover is required.
I'm developing reference architectures with and without a 3rd site Oracle RAC tiebreaker and plan to release some accompanying videos, but here's an overview of how it works. Take a look at the diagram and then continue reading to understand the value.
Architecture
This is a typical Oracle RAC configuration with a database called NTAP, with two instances, NTAP1 and NTAP2. The diagram might look complicated at first, but here's the key to understanding it:
SnapMirror active sync is invisible
From an Oracle and host point of view, this is just one set of LUNs on a single cluster. The replication is invisible. It's the same set of LUNs at both sites. I haven't even stretched the SAN across sites, although I could have done that if I wanted to. I'd rather not create a cross-site ISL if I don't have to.
When I installed RAC, I had a couple of hosts that each had a set of 3 LUNs to be used for quorum management. These hosts, jfs12 and jfs13, each see the same LUNs with the same serial numbers and the same data.
When I created the database, I created an 8-LUN ASM diskgroup for the datafiles and an 8-LUN ASM diskgroup for logs. It doesn't matter which host I use to make the database. They're both using the same LUNs.
Think of it as one single system with paths that happen to exist on two different sites. Any path on either cluster leads to the same LUN.
SnapMirror active sync is symmetric
Database connections can now be made to either instance. If that instance needs to perform a read, the data will be retrieved from the local drives. Writes will be replicated to the opposite site before being acknowledged, of course, so site-to-site latency needs to be as low as possible.
It doesn't matter which site you're using. Database performance is the same, unless you intentionally used different controller models with differing performance limits. This is a valid choice. Maybe you want RPO=0/RTO=0 but one of your sites is designed to be just a temporary site, and doesn't require the same storage horsepower as the other site.
SnapMirror active sync is resilient
This is the part I'm still working on documenting. There's a mediator service that acts as a heartbeat to detect controller failures. The mediator isn't an active tiebreaker service, but it's the same idea. It works as an alternate communication channel for each cluster to check the health of the opposite cluster. For example, if the cluster on site B suddenly fails, the cluster on site A will lose the ability to contact cluster B either directly or via the mediator. That allows cluster A to release the mirroring and resume operations.
Overall, "it just works". For example, my initial tests involved simply cutting the power at one site. Here's what happened:
One set of paths ceased responding, while the other set of paths remained available
All write IO paused because it was no longer possible to replicate the writes
After about 30 seconds, the surviving site considered the site with the power failure truly dead and broke the mirroring so the surviving site could resume operations
The Oracle instance on the failed site continued to try to contact storage for a full 200 seconds. This is the default timeout setting with RAC. You can change it if required.
After the 200 second expiration, the Oracle instance performed a self-reboot. It does this to help protect data from corruption due to a lingering IO operation stuck in a retry loop on the host.
This also means that stalled transactions on the failed node were held for 200 seconds before being replayed on the surviving node. This is a good example of how the RTO of a storage system is not the only factor affecting failover times.
The recovery process was unexpectedly seamless:
Power was restored to the failed storage system
It took about 8 minutes to fully power up, self-test, boot, and resume clustered operations
The surviving site detected the return of the other site.
The mirror was asynchronously resynchronized to get the states of site A and site B really close together. This took about 5 minutes.
The mirror then transitioned to synchronized state
The Oracle server detected the presence of SAN paths
The Oracle RAC process, which had been delaying the boot process, found usable RAC quorum devices
The database instance came up again
That was a welcome surprise. I expected more recovery work would be required, but it was just "turn the power back on" and everything went back to normal.
I've got more to do, including getting timings of various operations, collecting logs, tuning RAC, and especially writing up Oracle RAC quorum behavior. It's not complicated, but it's not well documented by Oracle.
Look for a lot more when the next version of ONTAP ships.
... View more
The serial_number property returned by GET /storage/luns/{uuid} can be converted to hex. The ASCII string and the hex are the same value. For example:

[root@jfs0 current]# echo -n '80A2z+UglTqg' | od -A n -t x1
 38 30 41 32 7a 2b 55 67 6c 54 71 67
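If you'd rather do that conversion in a script than with od, a couple of lines of Python will do it. This is just a generic sketch using the serial number from the example above.

# Convert the ASCII serial_number string returned by the REST API into hex
serial = '80A2z+UglTqg'
print(' '.join(f'{b:02x}' for b in serial.encode('ascii')))
# -> 38 30 41 32 7a 2b 55 67 6c 54 71 67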
... View more
This is a post about how I unexpectedly needed to use an ONTAP feature in order to test a completely different ONTAP feature. If you haven't heard of it, SVM Migrate is a high availability feature that allows you to migrate a running storage environment from one cluster to a completely different cluster, nondisruptively.
ONTAP environment
The feature I wanted to test was SnapMirror active sync (SM-AS) running in symmetric active-active mode. We enhanced SM-AS last year to offer symmetric active-active replication. Here’s a basic diagram of what I was working with:
It’s a couple of A700 clusters with SM-AS enabled. I set up my Oracle RAC configuration, including databases and quorum drives, on the jfs_as1 and jfs_as2 SVMs. Oracle RAC is not yet supported with SM-AS in active/active mode, but I couldn’t think of a reason it shouldn’t work, and I wanted to give this a spin. The idea here is creating a single cross-site, ultra-available Oracle RAC cluster. I'll post on this later.
What’s an SVM again?
When you first set up an ONTAP system, it’s a little like VMware ESX. You’ll have an operational cluster, but it doesn’t do anything yet. You need to define a Storage Virtual Machine (SVM). It’s basically a self-contained storage personality. As with VMware, it’s about multitenancy and security and manageability. You might only have the one SVM on your cluster, but if you want to have different SVMs serving different types of data or managed by different teams, you can do that too. For example, maybe you have a production SVM that is treated extra-carefully, but then you have a development SVM where you give your developers more control over their storage environment.
SnapMirror active sync
This isn’t the point of this post, but SnapMirror active sync (SM-AS) is a zero-RPO replication solution. When operated in active-active mode, what you have is the same data and the same LUNs available on two different systems. All reads are serviced locally. Writes obviously must be replicated to the partner cluster to maintain consistency. The result is symmetric active-active access to the same dataset.
I know how it works internally, so I was sure that simply configuring replication would result in a perfectly usable solution. The question I had was about failover. When you configure SM-AS, you also have a mediator service that manages tiebreaking and failover.
The Problem
The first thing I wanted to validate was what happens when Cluster2 fails. What SHOULD happen is that replication fails and the mediator signals to Cluster1 that it can resume operations unmirrored. After all, the point here is ultra high availability.
Here’s the issue – my Oracle RAC hosts are all running under VMware using VMDK files hosted on the SVM called jfs_esx. If I cut the power on Cluster2, I’m going to take out my hosts as well. I really, really didn’t want to take the time to configure a new ONTAP system and vMotion my VMDK files over.
SVM Migrate to the rescue!
I decided to give SVM Migrate a try. It’s been around since ONTAP 9.10, but I never used it before. The purpose of SVM Migrate is to replicate that entire SVM personality. There are some restrictions, but in my case I just had a 1TB NFS share hosting all my VMDKs.
Since I was working in a lab environment that I own, I figured I’d just give this a try. It was a good test of simplicity. I didn't shut anything down. All my VMs are operational and the RAC clusters are running. Will it all survive the migration? Let's find out! I don't need no documentation.
Caution: Please read the documentation. I didn’t read the documentation, but I’ve been working with ONTAP since ’95 and half my job is trying to break things.
Starting the migration
I knew the command was probably vserver something (an SVM is known as a vserver at the CLI) so I just started typing and using the tab key to see what arguments were required. It looked like I could just do this:
rtp-a700s-c01::> vserver migrate start -vserver jfs_esx -source-cluster rtp-a700s-c02

Info: To check the status of the migrate operation use the "vserver migrate show" command.
I was then pretty sure I was moving my jfs_esx SVM from cluster2 to cluster1. Then again, maybe I didn't provide a required argument or maybe there was some aspect of configuration that blocked the migration. Let's find out what happened...
Monitoring
The prior command told me to run vserver migrate show to monitor, so that's what I did. I ran it a couple times.
rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       setup-configuration

rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       transferring
Looks like it's working. It appears to have configured the destination and commenced data transfer.
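If you'd rather not keep re-running the command by hand, a trivial polling loop works too. This is just my own sketch, assuming SSH key access to the cluster management LIF as admin; it simply runs the same CLI command over SSH and waits for the final migrate-complete status shown later in this post.

import subprocess
import time

def migrate_status(cluster='rtp-a700s-c01'):
    # Run 'vserver migrate show' over SSH and return the raw output
    result = subprocess.run(['ssh', 'admin@' + cluster, 'vserver migrate show'],
                            capture_output=True, text=True)
    return result.stdout

# Poll every 30 seconds until the migration reports complete
while 'migrate-complete' not in migrate_status():
    time.sleep(30)
print('SVM migration complete')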
SnapMirror
The most important part of the SVM Migrate operation is moving the data itself, which happens via SnapMirror. That's what the word transferring means above. The SVM Migrate operation is transferring my data. How much data do I need to move?
rtp-a700s-c02::> vol show -vserver jfs_esx jfs_esx -fields used
vserver volume  used
------- ------- -------
jfs_esx jfs_esx 536.8GB
Looks like I'll need to transfer around a half terabyte of total data. I just have the one volume in this SVM. It's a 1TB volume, but after efficiency savings it's 536GB of data.
I was monitoring the status by repeatedly running snapmirror show when I saw something odd.
rtp-a700s-c01::> snapmirror show -destination-path jfs_esx:jfs_esx -fields snapshot-progress
source-path     destination-path snapshot-progress
--------------- ---------------- -----------------
jfs_esx:jfs_esx jfs_esx:jfs_esx  175.8GB

rtp-a700s-c01::> snapmirror show -destination-path jfs_esx:jfs_esx -fields snapshot-progress
source-path     destination-path snapshot-progress
--------------- ---------------- -----------------
jfs_esx:jfs_esx jfs_esx:jfs_esx  23.09GB
What happened? Why did I go from 175GB transferred to just 23GB? The reason is I'm looking at a different SnapMirror operation, and the reason that happened was snapshots.
Snapshot transfers
I guessed that SVM Migrate had initialized the mirror, and then was transferring the individual snapshots from the source. I checked the snapshots at the destination to confirm:
rtp-a700s-c01::> snapshot show -vserver jfs_esx
---Blocks---
Vserver Volume Snapshot Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
jfs_esx jfs_esx
snapmirror.ac509ea6-fa33-11ed-ae6e-00a098f7d731_2152057340.2023-05-25_180109
90.08GB 9% 20%
smas_testing_baseline 6.53GB 1% 2%
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520100.2024-02-27_172427
133.6MB 0% 0%
snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184257
668KB 0% 0%
snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184300
5.35GB 1% 1%
nightly.2024-02-28_0105 14.40GB 1% 4%
6 entries were displayed.
rtp-a700s-c01::> snapshot show -vserver jfs_esx
---Blocks---
Vserver Volume Snapshot Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
jfs_esx jfs_esx
snapmirror.ac509ea6-fa33-11ed-ae6e-00a098f7d731_2152057340.2023-05-25_180109
90.08GB 9% 20%
smas_testing_baseline 6.53GB 1% 2%
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520100.2024-02-27_172427
133.6MB 0% 0%
snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184257
668KB 0% 0%
snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184300
5.35GB 1% 1%
nightly.2024-02-28_0105 19.26GB 2% 5%
nightly.2024-02-29_0105 33.84MB 0% 0%
7 entries were displayed.
You can see I went from 6 snapshots to 7 snapshots in just a few moments. I asked engineering, "Hey, does SVM Migrate initialize a baseline transfer of my data, and then start transferring the deltas to copy the snapshots too?" and they said, "Yup".
There were 15 snapshots on this volume, so I'm halfway done moving them. My transfer had been running for about 10 minutes at this point.
Monitoring, again
I went back to monitoring the status, but this time I used the show-volume argument rather than show.
rtp-a700s-c01::> vserver migrate show-volume
Volume Transfer
Vserver Volume State Healthy Status Errors
-------- ---------------- -------- --------- --------- ----------------------
jfs_esx
jfs_esx online true Transferring
-
jfs_esx_root online true ReadyForCutoverPreCommit
Looks like one of my volumes is fully transferred, but there's a lot of data in that jfs_esx volume, so that's still running.
After another 5 minutes or so, I got to this:
rtp-a700s-c01::> vserver migrate show-volume
Volume Transfer
Vserver Volume State Healthy Status Errors
-------- ---------------- -------- --------- --------- ----------------------
jfs_esx
jfs_esx online true ReadyForCutoverPreCommit
-
jfs_esx_root online true ReadyForCutoverPreCommit
Cool. All data is transferred. Ready for the cutover process. If I didn't want this to happen automatically, I could have deferred the cutover. There are several other options available with the vserver migrate command that I didn't know about initially because, as mentioned before, I didn't actually read the documentation.
SnapMirror Synchronous
Once all the basic data is transferred, it's time for SVM Migrate to perform the cutover. Since this is an RPO=0 migration, the underlying data must be brought into an RPO=0 synchronous replication configuration. SVM Migrate orchestrates that process, and I saw that transition occur:
rtp-a700s-c01::> vserver migrate show-volume
Volume Transfer
Vserver Volume State Healthy Status Errors
-------- ---------------- -------- --------- --------- ----------------------
jfs_esx
jfs_esx online true InSync -
jfs_esx_root online true InSync -
2 entries were displayed.
Finalization
I then went back to watching the vserver migrate show output and saw these responses:
rtp-a700s-c01::> vserver migrate show
Destination Source
Vserver Cluster Cluster Status
---------------- ------------------- ------------------- ---------------------
jfs_esx rtp-a700s-c01 rtp-a700s-c02 post-cutover
rtp-a700s-c01::> vserver migrate show
Destination Source
Vserver Cluster Cluster Status
---------------- ------------------- ------------------- ---------------------
jfs_esx rtp-a700s-c01 rtp-a700s-c02 cleanup
rtp-a700s-c01::> vserver migrate show
Destination Source
Vserver Cluster Cluster Status
---------------- ------------------- ------------------- ---------------------
jfs_esx rtp-a700s-c01 rtp-a700s-c02 migrate-complete
Thoughts
I'm impressed. I was in some early conversations about the SVM Migrate feature, but I hadn't thought about it since then.
I successfully relocated all the storage for all my VMs, nondisruptively, with a single command, and without even reading the documentation (again, please read the documentation anyway).
It was simple, and it simply worked. As it should.
... View more
Fun with automation – ONTAP Consistency Groups
There's a lot to this post. I'll cover what the heck Consistency Groups (CGs) are all about, how to automate CG operations via the REST API, how to convert existing volume snapmirrors into a CG configuration without a requirement to retransfer the whole data set, and finally how to do it all via the CLI.
Some of the content below is copied directly from https://community.netapp.com/t5/Tech-ONTAP-Blogs/Consistency-Groups-in-ONTAP/ba-p/438567. I did that in order to have all the key concepts in the same place.
Consistency Groups in ONTAP
There’s a good reason you should care about CGs – it’s about manageability.
If you have an important application like a database, it probably involves multiple LUNs or multiple filesystems. How do you want to manage this data? Do you want to manage 20 LUNs on an individual basis, or would you prefer just to manage the dataset as a single unit?
Volumes vs LUNs
If you’re relatively new to NetApp, there’s a key concept worth emphasizing – volumes are not LUNs.
Other vendors use those two terms synonymously. We don’t. A Flexible Volume, also known as a FlexVol, or usually just a “volume,” is just a management container. It’s not a LUN. You put data, including NFS/SMB files, LUNs, and even S3 objects, inside of a volume. Yes, it does have attributes such as size, but that’s really just accounting. For example, if you create a 1TB volume, you’ve set an upper limit on whatever data you choose to put inside that volume, but you haven’t actually allocated space on the drives.
This sometimes leads to confusion. When we talk about creating 5 volumes, we don’t mean 5 LUNs. Sometimes customers think that they create one volume and then one LUN within that volume. You can certainly do that if you want, but there’s no requirement for a 1:1 mapping of volume to LUN. The result of this confusion is that we sometimes see administrators and architects designing unnecessarily complicated storage layouts. A volume is not a LUN.
Okay then, what is a volume?
If you go back about eighteen years, an ONTAP volume mapped to specific drives in a storage controller, but that’s ancient history now.
Today, volumes are there mostly for your administrative convenience. For example, if you have a database with a set of 10 LUNs, and you want to limit the performance for the database using a specific quality of service (QoS) policy, you can place those 10 LUNs in a single volume and slap that QoS policy on the volume. No need to do math to figure out per-LUN QoS limits. No need to apply QoS policies to each LUN individually. You could choose to do that, but if you want the database to have a 100K IOPS QoS limit, why not just apply the QoS limit to the volume itself? Then you can create whatever number of LUNs that are required for the workload.
Volume-level management
Volumes are also related to fundamental ONTAP operations, such as snapshots, cloning, and replication. You don’t selectively decide which LUN to snapshot or replicate, you just place those LUNs into a single volume and create a snapshot of the volume, or you set a replication policy for the volume. You’re managing volumes, irrespective of what data is in those volumes.
It also simplifies how you expand the storage footprint of an application. For example, if you add LUNs to that application in the future, just create the new LUNs within the same volume. They will automatically be included in the next replication update, the snapshot schedule will apply to all the LUNs, including the new ones, and the volume-level QoS policy will now apply to IO on all the LUNs, including the new ones.
You can selectively clone individual LUNs if you like, but most cloning workflows operate on datasets, not individual LUNs. If you have an LVM with 20 LUNs, wouldn’t you rather just clone them as a single unit than perform 20 individual cloning operations? Why not put the 20 LUNs in a single volume and then clone the whole volume in a single step?
Conceptually, this makes ONTAP more complicated, because you need to understand that volume abstraction layer, but if you look at real-world needs, volumes make life easier. ONTAP customers don’t buy arrays for just a single LUN, they use them for multiple workloads with LUN counts going into the tens of thousands.
There’s also another important term for a “volume” that you don’t often hear from NetApp. The term is “consistency group,” and you need to understand it if you want maximum manageability of your data.
What’s a Consistency Group?
In the storage world, a consistency group (CG) refers to the management of multiple storage objects as a single unit. For example, if you have a database, you might provision 8 LUNs, configure it as a single logical volume, and create the database. (The term CG is most often used when discussing SAN architectures, but it can apply to files as well.)
What if you want to use array-level replication to protect that database? You can’t just set up 8 individual LUN replication relationships. That won’t work, because the replicated data won’t be internally consistent across volumes. You need to ensure that all 8 replicas of the source LUNs are consistent with one another, or the database will be corrupt.
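Here's a toy illustration of why that matters, just to make the consistency problem concrete. It's plain Python, nothing ONTAP-specific: a "transaction" touches both a data LUN and a log LUN, and replicating each LUN independently captures them at different moments.

import copy

data_lun = {'balance': 100}
log_lun = []

def commit(amount):
    log_lun.append(('debit', amount))     # write-ahead log entry
    data_lun['balance'] -= amount         # then the datafile update

log_replica = copy.deepcopy(log_lun)      # replica of the log LUN taken now
commit(25)                                # a transaction lands in between
data_replica = copy.deepcopy(data_lun)    # replica of the data LUN taken later

print(data_replica, log_replica)          # {'balance': 75} [] -- the pair disagrees
# A consistency group captures both LUNs at the same instant, so the replicas
# always describe a single point in time.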
This is only one aspect of CG data management. CGs are implemented in ONTAP in multiple ways. This shouldn’t be surprising – an ONTAP system can do a lot of different things. The need to manage datasets in a consistent manner requires different approaches depending on the chosen NetApp storage system architecture and which ONTAP feature we’re talking about.
Consistency Groups – ONTAP Volumes
The most basic consistency group is a volume. A volume hosting multiple LUNs is intrinsically a consistency group. I can’t tell you how many times I’ve had to explain this important concept to customers as well as NetApp colleagues simply because we’ve historically never used the term “consistency group.”
Here’s why a volume is a consistency group:
If you have a dataset and you put the dataset components (LUNs or files) into a single ONTAP volume, you can then create snapshots and clones, perform restorations, and replicate the data in that volume as a single consistent unit. A volume is a consistency group. I wish we could update every reference to volumes across all the ONTAP documentation in order to explain this concept, because if you understand it, it dramatically simplifies storage management.
Now, there are times where you can’t put the entire dataset in a single volume. For example, most databases use at least two volumes, one for datafiles and one for logs. You need to be able to restore the datafiles to an earlier point in time without affecting the logs. You might need some of that log data to roll the database forward to the desired point in time. Furthermore, the retention times for datafile backups might differ from log backups.
Native ONTAP Consistency Groups
ONTAP also allows you to configure advanced consistency groups within ONTAP itself. The results are similar to what you’d get with the API calls I mentioned above, except now you don’t have to install extra software like SnapCenter or write a script.
For example, I might have an Oracle database with datafiles distributed across 4 volumes located on 4 different controllers. I often do that to ensure my IO load is guaranteed to be evenly distributed across all controllers in the entire cluster. I also have my logs in 3 different volumes, plus I have a volume for my Oracle binaries.
I can still create snapshots, create clones, and replicate that entire 4-controller configuration. All I have to do is define a consistency group. I’ll be writing more about ONTAP consistency groups in the near future, but I’ll start with an explanation of how to take existing flat volumes replicated with regular asynchronous SnapMirror and convert them to consistency group replication without having to perform a new baseline transfer.
SnapMirror -> CG SnapMirror conversion
Why might you do this? Well, let’s say you have an existing 100TB database spread across 10 different volumes and you’re protecting it with snapshots. You might also be replicating those snapshots to a remote site via SnapMirror. As long as you’ve created those snapshots correctly, you have recoverability at the remote site. The problem is you might have to perform some snaprestore operations to make that data usable.
The point of CG snapmirror is to make a replica of a multi-volume dataset where all the volumes are in lockstep with one another. That yields what I call “break the mirror and go!” recoverability. If you break the mirrors, the dataset is ready without a need for additional steps. It’s essentially the same as recovering from a disaster using synchronous mirroring. That CG snapmirror replica represents the state of your data at a single atomic point in time.
Critical note: when deleting existing SnapMirror relationships, be extremely careful with the API and CLI calls. If you use the wrong JSON with the API calls or the wrong arguments at the CLI, you will delete all common snapshots on the source and destination volumes. If this happens, you will have to perform a new baseline transfer of all data.
SnapMirror and the all-important common snapshot.
The foundation of snapmirror is two volumes with the same snapshot. As long as you have two volumes with the exact same snapshot, you can incrementally update one of those volumes using the data in the other volume. The logic is basically this:
Create a new snapshot on the source.
Identify the changes between that new snapshot and the older common snapshot that exists in both the source and target volumes.
Ship the changes between those two snapshots to the target volume.
Once that’s complete, the state of the target volume now matches the content of that newly created snapshot at the source. There’s a lot of additional capabilities regarding storing and transferring other snapshots, controlling retention policies, and protecting snapshots from deletion. The basic logic is the same, though – you just need two volumes with a common snapshot.
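Here's a toy model of that logic, purely for illustration: treat a snapshot as a frozen map of block number to contents, and it's easy to see why a shared common snapshot is all you need for an incremental update.

# Toy model: a "snapshot" is a frozen map of block number -> contents
def incremental_update(common_snap, new_snap, target_volume):
    # Identify blocks that differ between the new snapshot and the common snapshot
    changed = {blk: data for blk, data in new_snap.items()
               if common_snap.get(blk) != data}
    target_volume.update(changed)          # ship only the changed blocks
    return changed

common = {1: 'A', 2: 'B', 3: 'C'}
newer = {1: 'A', 2: 'B2', 3: 'C', 4: 'D'}  # block 2 changed, block 4 added
target = dict(common)                      # target currently matches the common snapshot
shipped = incremental_update(common, newer, target)
print(shipped)                             # {2: 'B2', 4: 'D'}
print(target == newer)                     # True: target now matches the new snapshot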
Initial configuration - volumes
Here's my current 5 volumes being replicated as 5 ordinary snapmirror replicas:
rtp-a700s-c02::> snapmirror show -destination-path jfs_svm2:jfs3*
Source Path         Destination              Mirror
                    Path                     Status
------------------- ------------------------ --------------
jfs_svm1:jfs3_dbf1  jfs_svm2:jfs3_dbf1_mirr  Snapmirrored
jfs_svm1:jfs3_dbf2  jfs_svm2:jfs3_dbf2_mirr  Snapmirrored
jfs_svm1:jfs3_logs1 jfs_svm2:jfs3_logs1_mirr Snapmirrored
jfs_svm1:jfs3_logs2 jfs_svm2:jfs3_logs2_mirr Snapmirrored
jfs_svm1:jfs3_ocr   jfs_svm2:jfs3_ocr_mirr   Snapmirrored
Common snapshots
Here’s the snapshots I have on the source:
rtp-a700s-c01::> snapshot show -vserver jfs_svm1 -volume jfs3*
Vserver  Volume     Snapshot
-------- ---------- -------------------------------------
jfs_svm1 jfs3_dbf1
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520140.2024-02-23_190259
         jfs3_dbf2
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520141.2024-02-23_190315
         jfs3_logs1
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520142.2024-02-23_190257
         jfs3_logs2
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520143.2024-02-23_190258
         jfs3_ocr
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520139.2024-02-23_190256
And here’s the snapshots on my destination volumes:
rtp-a700s-c02::> snapshot show -vserver jfs_svm2 -volume jfs3*
Vserver  Volume     Snapshot
-------- ---------- -------------------------------------
jfs_svm2 jfs3_dbf1_mirr
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520140.2024-02-23_190259
         jfs3_dbf2_mirr
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520141.2024-02-23_190315
         jfs3_logs1_mirr
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520142.2024-02-23_190257
         jfs3_logs2_mirr
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520143.2024-02-23_190258
         jfs3_ocr_mirr
                    snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520139.2024-02-23_190256
See the common snapshot in each volume? As long as those snapshots exist, I can do virtually anything I want to these volumes and I’ll still be able to resynchronize the replication relationships without a total retransfer of everything.
Do it with REST
The customer request was to automate the conversion process. The output below used a personal toolbox of mine to issue REST API calls and print the complete debug output. I normally script in Python.
The POC code used the following inputs:
Name of the snapmirror destination server
Pattern match for existing snapmirrored volumes
Name for the ONTAP Consistency Groups to be created
The basic steps are these:
Enumerate replicated volumes on the target system using the pattern match
Identify the name of the source volume and the source SVM hosting that volume
Delete the snapmirror relationships
Release the snapmirror destination at the source.
Define a new CG at the source
Define a new CG at the destination
Define a CG snapmirror relationship
Resync the mirror
Caution: Step 4 is the critical step. I'll keep repeating this warning in this post. By default, releasing a snapmirror relationship will delete all common snapshots. You need to use additional, non-default CLI/REST arguments to stop that from happening. If you make an error, you’ll lose your common snapshots.
In the following sections, I’ll walk you through my POC script and show you the REST conversation happening along the way.
The script
Here’s the first few lines:
#! /usr/bin/python3
import sys
sys.path.append(sys.path[0] + "/NTAPlib")
import doREST

svm1='jfs_svm1'
svm2='jfs_svm2'
The highlights are that I’m importing my doREST module and defining a couple of variables with the names of the SVMs I’m using. The SVM jfs_svm1 is the source of the SnapMirror relationships, and jfs_svm2 is the destination SVM.
A note about doREST. It’s a wrapper for ONTAP APIs that is designed to package up the responses in a standard way. It also has a credential management system and hostname registry. I use this module to string together multiple calls and build workflows. It also handles calls synchronously. For calls such as a POST /snapmirror, which is asynchronous, the doREST module will read the job uuid and repeatedly poll ONTAP until the job is complete. It will then return the results. In the examples below, I’ll include the input/output of that looping behavior. If you want to know more, visit my github repo here.
You'll see I'm running it in debug mode where the API, JSON, and REST response are printed at the CLI. I've included that information to help you understand how to build your own REST workflows.
Enumerate the snapmirror relationships
If I'm going to convert a set of snapmirror relationships into a CG configuration, I'll obviously need to know which ones I'm converting.
api='/snapmirror/relationships'
restargs='fields=uuid,' + \
         'state,' + \
         'destination.path,' + \
         'destination.svm.name,' + \
         'destination.svm.uuid,' + \
         'source.path,' + \
         'source.svm.name,' + \
         'source.svm.uuid' + \
         '&query_fields=destination.path' + \
         '&query=jfs_svm2:jfs3*'
snapmirrors=doREST.doREST(svm2,'get',api,restargs=restargs,debug=2)
This code sets up the REST arguments that go with a GET /snapmirror/relationships. I’ve passed a query for a path of jfs_svm2:jfs3* which means the results will only contain the SnapMirror destinations I mentioned earlier in this post. It's a wildcard search.
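If you don't have my doREST toolbox handy, the same call is straightforward with the standard Python requests library. This is just a sketch: the cluster address and credentials are placeholders, while the endpoint, fields, and query arguments are exactly the ones visible in the debug output that follows.

import requests

cluster = 'https://10.192.160.45'            # placeholder management address
auth = ('admin', 'password')                 # placeholder credentials
params = {
    'fields': 'uuid,state,destination.path,destination.svm.name,'
              'destination.svm.uuid,source.path,source.svm.name,source.svm.uuid',
    'query_fields': 'destination.path',
    'query': 'jfs_svm2:jfs3*',
}
resp = requests.get(cluster + '/api/snapmirror/relationships',
                    params=params, auth=auth, verify=False)
for record in resp.json()['records']:
    print(record['uuid'], record['source']['path'], '->', record['destination']['path'])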
Here’s the debug output that shows the REST conversation with ONTAP:
->doREST:REST:API: GET https://10.192.160.45/api/snapmirror/relationships?fields=uuid,state,destination.path,destination.svm.name,destination.svm.uuid,source.path,source.svm.name,source.svm.uuid&query_fields=destination.path&query=jfs_svm2:jfs3* ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "records": [ ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "26b40c82-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_ocr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_ocr_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/26b40c82-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "2759306a-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_logs1", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_logs1_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/2759306a-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "27fdd036-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_logs2", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": 
{ ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_logs2_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/27fdd036-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "28a265e8-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_dbf1", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_dbf1_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/28a265e8-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "320db78d-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_dbf2", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_dbf2_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": 
"snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/320db78d-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: ], ->doREST:REST:RESPONSE: "num_records": 5, ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships?fields=uuid,state,destination.path,destination.svm.name,destination.svm.uuid,source.path,source.svm.name,source.svm.uuid&query_fields=destination.path&query=jfs_svm2:jfs3*" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Highlights:
The uuids of the snapmirror relationships are highlighted in red
Snapmirror sources are highlighted in purple
Snapmirror destinations are in blue
Delete the snapmirror relationships
for record in snapmirrors.response['records']:
    delete=doREST.doREST(svm2,'delete','/snapmirror/relationships/' + record['uuid'] + '/?destination_only=true',debug=2)
This block extracts the records returned by the prior GET /snapmirror/relationships and extracts the uuid. It then deletes all 5 of the relationships.
Caution: the destination_only=true argument is required to stop ONTAP from deleting the common snapshots. Do not overlook this parameter.
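For anyone following along without my toolbox, here is roughly the same loop with the requests library, including the job polling that doREST does for me behind the scenes. The address and credentials are placeholders, and the records come from the earlier GET /snapmirror/relationships call.

import time
import requests

cluster = 'https://10.192.160.45'            # placeholder management address
auth = ('admin', 'password')                 # placeholder credentials

for record in records:                       # records from the earlier GET
    url = cluster + '/api/snapmirror/relationships/' + record['uuid']
    # destination_only=true is what prevents ONTAP from deleting the common snapshots
    resp = requests.delete(url, params={'destination_only': 'true'},
                           auth=auth, verify=False)
    job_uuid = resp.json()['job']['uuid']    # DELETE is asynchronous (202 Accepted)
    # Poll the job until it reaches a terminal state
    while True:
        job = requests.get(cluster + '/api/cluster/jobs/' + job_uuid,
                           params={'fields': 'state,message'},
                           auth=auth, verify=False).json()
        if job['state'] in ('success', 'failure'):
            break
        time.sleep(2)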
->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/26b40c82-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "d905b4e3-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/d905b4e3-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/d905b4e3-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "d905b4e3-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/d905b4e3-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/2759306a-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "d9ad1f48-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/d9ad1f48-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/d9ad1f48-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "d9ad1f48-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/d9ad1f48-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/27fdd036-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "da546656-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/da546656-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/da546656-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "da546656-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/da546656-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } 
->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/28a265e8-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "daf9c09a-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/daf9c09a-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/daf9c09a-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "daf9c09a-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/daf9c09a-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/320db78d-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "dba0429b-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dba0429b-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/dba0429b-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dba0429b-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dba0429b-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200
You can see in the above output that the actual DELETE /snapmirror/relationships operation was asynchronous. The REST call returned a status of 202, which means the operation was accepted but not yet complete.
The doREST module then captured the uuid of the job and polled ONTAP until the job completed.
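The doREST module is just my own wrapper around the ONTAP REST API. If you wanted to reproduce the 202-and-poll pattern without it, a minimal sketch using the Python requests library might look like the following. The cluster address, credentials, and relationship uuid are placeholders, and error handling is omitted.

import time
import requests
import urllib3

urllib3.disable_warnings()                       # lab system with a self-signed certificate

cluster = 'https://10.192.160.45'                # placeholder cluster management address
auth = ('admin', 'password')                     # placeholder credentials

# Issue the asynchronous DELETE; ONTAP answers 202 Accepted with a job reference
r = requests.delete(cluster + '/api/snapmirror/relationships/<relationship-uuid>',
                    params={'destination_only': 'true'},
                    auth=auth, verify=False)
job_uuid = r.json()['job']['uuid']

# Poll the job object until it reaches a terminal state
while True:
    job = requests.get(cluster + '/api/cluster/jobs/' + job_uuid,
                       params={'fields': 'state,message'},
                       auth=auth, verify=False).json()
    if job['state'] in ('success', 'failure'):
        break
    time.sleep(2)

print(job['state'], job['message'])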
Release the snapmirror relationships
The next part of the script is almost identical to the prior snippet, except this time it’s doing a snapmirror release operation.
The relationship itself was deleted in the prior step. That operation ran against the destination controller and included the argument destination_only=true. Because asynchronous SnapMirror is a pull technology, deleting the relationship on the destination is enough to halt further updates.
The next deletion operation will target the source and will include source_info_only=true. We still need to de-register the destination from the source, which is what this step does.
Caution: the source_info_only=true argument is required to stop ONTAP from deleting the common snapshots. Do not overlook this parameter.
for record in snapmirrors.response['records']:
    delete=doREST.doREST(svm1,'delete','/snapmirror/relationships/' + record['uuid'] + '/?source_info_only=true',debug=2)
->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/26b40c82-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "dc4fcade-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dc4fcade-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dc4fcade-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dc4fcade-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dc4fcade-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/2759306a-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "dcfd165f-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dcfd165f-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dcfd165f-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dcfd165f-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dcfd165f-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/27fdd036-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "ddac905c-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/ddac905c-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/ddac905c-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "ddac905c-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/ddac905c-d27e-11ee-a161-00a098f7d731" 
->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/28a265e8-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "de9526a2-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/de9526a2-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/de9526a2-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "de9526a2-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/de9526a2-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/320db78d-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "df43391f-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/df43391f-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/df43391f-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "df43391f-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/df43391f-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
At this point, the original snapmirror relationships are completely deconfigured, but the volumes still contain a common snapshot, which is all that is required to perform a resync.
Create a CG at the source
Assuming it hasn’t already been done, we’ll need to define the source volumes as a CG. The process starts by creating a mapping of source volumes to destination volumes using the information obtained when the original snapmirror data was collected.
mappings={}
for record in snapmirrors.response['records']:
    mappings[record['source']['path'].split(':')[1]] = record['destination']['path'].split(':')[1]
The mappings dictionary looks like this:
{'jfs3_ocr': 'jfs3_ocr_mirr', 'jfs3_logs1': 'jfs3_logs1_mirr', 'jfs3_logs2': 'jfs3_logs2_mirr', 'jfs3_dbf1': 'jfs3_dbf1_mirr', 'jfs3_dbf2': 'jfs3_dbf2_mirr'}
The next step is to create the consistency group using the keys from this dictionary, because the keys are the volumes at the source. Note that I’m naming the CG jfs3, which is the name of the host where this database resides.
vollist=[]
for srcvol in mappings.keys():
    vollist.append({'name':srcvol,'provisioning_options':{'action':'add'}})

api='/application/consistency-groups'
json4rest={'name':'jfs3', \
           'svm.name':'jfs_svm1', \
           'volumes': vollist}
cgcreate=doREST.doREST(svm1,'post',api,json=json4rest,debug=2)
->doREST:REST:API: POST https://10.192.160.40/api/application/consistency-groups ->doREST:REST:JSON: {'name': 'jfs3', 'svm.name': 'jfs_svm1', 'volumes': [{'name': 'jfs3_ocr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_logs1', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_logs2', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_dbf1', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_dbf2', 'provisioning_options': {'action': 'add'}}]} ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "running", ->doREST:REST:RESPONSE: "message": "Unclaimed", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "running", ->doREST:REST:RESPONSE: "message": "Unclaimed", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "running", ->doREST:REST:RESPONSE: "message": "Creating consistency group volume record - 3 of 5 complete.", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Create a CG at the destination
The next step is to create a CG at the destination:
The list of volumes is also taken from the mappings dictionary, except rather than using the keys, I’ll use the values. Those are the snapmirror destination volumes discovered in the first step.
vollist=[]
for srcvol in mappings.keys():
    vollist.append({'name':mappings[srcvol],'provisioning_options':{'action':'add'}})

api='/application/consistency-groups'
json4rest={'name':'jfs3', \
           'svm.name':'jfs_svm2', \
           'volumes': vollist}
cgcreate=doREST.doREST(svm2,'post',api,json=json4rest,debug=2)
->doREST:REST:API: POST https://10.192.160.45/api/application/consistency-groups ->doREST:REST:JSON: {'name': 'jfs3', 'svm.name': 'jfs_svm2', 'volumes': [{'name': 'jfs3_ocr_mirr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_logs1_mirr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_logs2_mirr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_dbf1_mirr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_dbf2_mirr', 'provisioning_options': {'action': 'add'}}]} ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "e25c2f6f-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/e25c2f6f-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/e25c2f6f-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "e25c2f6f-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/e25c2f6f-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Create the consistency group mirror
To define the CG mirror, I need to build the CG snapmirror map. Order matters: I need a list of source volumes and a list of destination volumes, and ONTAP will match element X of the first list to element X of the second list. That’s how you control which volume in the source CG is replicated to which volume in the destination CG.
for record in snapmirrors.response['records']:
    mappings[record['source']['path'].split(':')[1]] = record['destination']['path'].split(':')[1]

srclist=[]
dstlist=[]
for srcvol in mappings.keys():
    srclist.append({'name':srcvol})
    dstlist.append({'name':mappings[srcvol]})
Now I can create the mirror of the jfs3 CG on the source to the jfs3 CG on the destination:
api='/snapmirror/relationships'
json4rest={'source':{'path':'jfs_svm1:/cg/jfs3', \
                     'consistency_group_volumes' : srclist}, \
           'destination':{'path':'jfs_svm2:/cg/jfs3', \
                          'consistency_group_volumes' : dstlist}, \
           'policy':'Asynchronous'}
cgsnapmirror=doREST.doREST(svm2,'post',api,json=json4rest,debug=2)
->doREST:REST:API: POST https://10.192.160.45/api/snapmirror/relationships ->doREST:REST:JSON: {'source': {'path': 'jfs_svm1:/cg/jfs3', 'consistency_group_volumes': [{'name': 'jfs3_ocr'}, {'name': 'jfs3_logs1'}, {'name': 'jfs3_logs2'}, {'name': 'jfs3_dbf1'}, {'name': 'jfs3_dbf2'}]}, 'destination': {'path': 'jfs_svm2:/cg/jfs3', 'consistency_group_volumes': [{'name': 'jfs3_ocr_mirr'}, {'name': 'jfs3_logs1_mirr'}, {'name': 'jfs3_logs2_mirr'}, {'name': 'jfs3_dbf1_mirr'}, {'name': 'jfs3_dbf2_mirr'}]}, 'policy': 'Asynchronous'} ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "e304e8d8-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/e304e8d8-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/e304e8d8-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "e304e8d8-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/e304e8d8-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Retrieve the UUID
My final setup step will be to resync the relationship as a CG replica using the previously existing common snapshots, but in order to do that I need the uuid of the CG snapmirror I created. I’ll reuse the same query as before. Strictly speaking, I don’t need all these fields for this workflow, but for the sake of consistency and futureproofing, I’ll gather all the core information about the snapmirror relationship in a single call.
Note that I’ve changed my query to jfs_svm2:/cg/jfs3. This is the syntax for addressing a CG snapmirror.
svm:/cg/[cg name]
api='/snapmirror/relationships'
restargs='fields=uuid,' + \
         'state,' + \
         'destination.path,' + \
         'destination.svm.name,' + \
         'destination.svm.uuid,' + \
         'source.path,' + \
         'source.svm.name,' + \
         'source.svm.uuid' + \
         '&query_fields=destination.path' + \
         '&query=jfs_svm2:/cg/jfs3'
cgsnapmirror=doREST.doREST(svm2,'get',api,restargs=restargs,debug=2)
cguuid=cgsnapmirror.response['records'][0]['uuid']
->doREST:REST:API: GET https://10.192.160.45/api/snapmirror/relationships?fields=uuid,state,destination.path,destination.svm.name,destination.svm.uuid,source.path,source.svm.name,source.svm.uuid&query_fields=destination.path&query=jfs_svm2:/cg/jfs3 ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "records": [ ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "e304e0fe-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:/cg/jfs3", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:/cg/jfs3", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/e304e0fe-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: ], ->doREST:REST:RESPONSE: "num_records": 1, ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships?fields=uuid,state,destination.path,destination.svm.name,destination.svm.uuid,source.path,source.svm.name,source.svm.uuid&query_fields=destination.path&query=jfs_svm2:/cg/jfs3" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Resync
Now I’m ready to resync with a PATCH operation. I’ll take the first record from the prior query and extract the uuid. If I were doing this in production code, I’d validate the results to ensure that the query returned one and only one record. That ensures I really do have the uuid of the CG I created.
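Purely as an illustration, here's a minimal sketch of that sanity check. It assumes the cgsnapmirror response object from the query above; the error message text is mine.

records = cgsnapmirror.response['records']

# There should be exactly one CG relationship matching the query;
# anything else means the wrong uuid could be patched below.
if len(records) != 1:
    raise RuntimeError('Expected exactly 1 CG snapmirror relationship, found ' + str(len(records)))

cguuid = records[0]['uuid']

With the uuid confirmed, the resync itself is a single PATCH: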
api='/snapmirror/relationships/' + cguuid
json4rest={'state':'snapmirrored'}
cgresync=doREST.doREST(svm2,'patch',api,json=json4rest,debug=2)
->doREST:REST:API: PATCH https://10.192.160.45/api/snapmirror/relationships/e304e0fe-d27e-11ee-a514-00a098af9054 ->doREST:REST:JSON: {'state': 'snapmirrored'} ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "e3b577a8-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/e3b577a8-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/e3b577a8-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "e3b577a8-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/e3b577a8-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Done. I can now see a healthy CG snapmirror relationship.
rtp-a700s-c02::> snapmirror show -destination-path jfs_svm2:/cg/jfs3 Source Path: jfs_svm1:/cg/jfs3 Destination Path: jfs_svm2:/cg/jfs3 Relationship Type: XDP Relationship Group Type: consistencygroup SnapMirror Schedule: - SnapMirror Policy Type: mirror-vault SnapMirror Policy: Asynchronous Tries Limit: - Throttle (KB/sec): unlimited Mirror State: Snapmirrored Relationship Status: Idle File Restore File Count: - File Restore File List: - Transfer Snapshot: - Snapshot Progress: - Total Progress: - Percent Complete for Current Status: - Network Compression Ratio: - Snapshot Checkpoint: - Newest Snapshot: snapmirrorCG.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520139.2024-02-23_190812 Newest Snapshot Timestamp: 02/23 19:09:12 Exported Snapshot: snapmirrorCG.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520139.2024-02-23_190812 Exported Snapshot Timestamp: 02/23 19:09:12 Healthy: true Unhealthy Reason: - Destination Volume Node: - Relationship ID: e304e0fe-d27e-11ee-a514-00a098af9054 Current Operation ID: - Transfer Type: - Transfer Error: - Current Throttle: - Current Transfer Priority: - Last Transfer Type: resync Last Transfer Error: - Last Transfer Size: 99.81KB Last Transfer Network Compression Ratio: 1:1 Last Transfer Duration: 0:1:5 Last Transfer From: jfs_svm1:/cg/jfs3 Last Transfer End Timestamp: 02/23 19:09:17 Progress Last Updated: - Relationship Capability: 8.2 and above Lag Time: 3:24:1 Identity Preserve Vserver DR: - Volume MSIDs Preserved: - Is Auto Expand Enabled: true Backoff Level: - Number of Successful Updates: 0 Number of Failed Updates: 0 Number of Successful Resyncs: 1 Number of Failed Resyncs: 0 Number of Successful Breaks: 0 Number of Failed Breaks: 0 Total Transfer Bytes: 102208 Total Transfer Time in Seconds: 65 FabricLink Source Role: - FabricLink Source Bucket: - FabricLink Peer Role: - FabricLink Peer Bucket: - FabricLink Topology: - FabricLink Pull Byte Count: - FabricLink Push Byte Count: - FabricLink Pending Work Count: - FabricLink Status: -
I would still need to ensure I have the correct snapmirror schedules and policies, but those are essentially the same procedures used for regular volume-based asynchronous snapmirror. The primary difference is that you reference the paths, where necessary, using the svm:/cg/[cg name] syntax. Start here https://docs.netapp.com/us-en/ontap/data-protection/create-replication-job-schedule-task.html for those details.
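For example, if I wanted to change the policy on this CG relationship, a hedged sketch using the same doREST pattern and the cguuid captured earlier might look like this. The policy name MirrorAllSnapshots is just an example; use whatever policy your environment actually requires.

# Hedged sketch: PATCH the CG relationship to a different policy
api='/snapmirror/relationships/' + cguuid
json4rest={'policy':'MirrorAllSnapshots'}
cgmodify=doREST.doREST(svm2,'patch',api,json=json4rest,debug=2)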
CLI procedure
If you’re using ONTAP 9.14.1 or higher, you can do everything via the CLI or System Manager too.
Delete the existing snapmirror relationships
rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_ocr_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_ocr_mirr".

rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_dbf1_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_dbf1_mirr".

rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_dbf2_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_dbf2_mirr".

rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_logs1_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_logs1_mirr".

rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_logs2_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_logs2_mirr".
Release the snapmirror destinations
Don’t forget the "-relationship-info-only true"!
rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_ocr_mirr -relationship-info-only true
[Job 4984] Job succeeded: SnapMirror Release Succeeded

rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_dbf1_mirr -relationship-info-only true
[Job 4985] Job succeeded: SnapMirror Release Succeeded

rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_dbf2_mirr -relationship-info-only true
[Job 4986] Job succeeded: SnapMirror Release Succeeded

rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_logs1_mirr -relationship-info-only true
[Job 4987] Job succeeded: SnapMirror Release Succeeded

rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_logs2_mirr -relationship-info-only true
[Job 4988] Job succeeded: SnapMirror Release Succeeded
Create a CG at the source
rtp-a700s-c01::> consistency-group create -vserver jfs_svm1 -consistency-group jfs3 -volumes jfs3_ocr,jfs3_dbf1,jfs3_dbf2,jfs3_logs1,jfs3_logs2
  (vserver consistency-group create)
[Job 4989] Job succeeded: Success
Create a CG at the destination
rtp-a700s-c02::> consistency-group create -vserver jfs_svm2 -consistency-group jfs3 -volumes jfs3_ocr_mirr,jfs3_dbf1_mirr,jfs3_dbf2_mirr,jfs3_logs1_mirr,jfs3_logs2_mirr
  (vserver consistency-group create)
[Job 5355] Job succeeded: Success
Create the CG snapmirror relationships
rtp-a700s-c02::> snapmirror create -source-path jfs_svm1:/cg/jfs3 -destination-path jfs_svm2:/cg/jfs3 -cg-item-mappings jfs3_ocr:@jfs3_ocr_mirr,jfs3_dbf1:@jfs3_dbf1_mirr,jfs3_dbf2:@jfs3_dbf2_mirr,jfs3_logs1:@jfs3_logs1_mirr,jfs3_logs2:@jfs3_logs2_mirr
Operation succeeded: snapmirror create for the relationship with destination "jfs_svm2:/cg/jfs3".
Perform the resync operation
rtp-a700s-c02::> snapmirror resync -destination-path jfs_svm2:/cg/jfs3
Operation is queued: snapmirror resync to destination "jfs_svm2:/cg/jfs3".
Done!
rtp-a700s-c02::> snapmirror show -destination-path jfs_svm2:/cg/jfs3 Source Path: jfs_svm1:/cg/jfs3 Destination Path: jfs_svm2:/cg/jfs3 Relationship Type: XDP Relationship Group Type: consistencygroup SnapMirror Schedule: - SnapMirror Policy Type: mirror-vault SnapMirror Policy: MirrorAndVault Tries Limit: - Throttle (KB/sec): unlimited Mirror State: Snapmirrored Relationship Status: Idle File Restore File Count: - File Restore File List: - Transfer Snapshot: - Snapshot Progress: - Total Progress: - Percent Complete for Current Status: - Network Compression Ratio: - Snapshot Checkpoint: - Newest Snapshot: snapmirrorCG.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520144.2024-02-26_005106 Newest Snapshot Timestamp: 02/26 00:52:06 Exported Snapshot: snapmirrorCG.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520144.2024-02-26_005106 Exported Snapshot Timestamp: 02/26 00:52:06 Healthy: true Unhealthy Reason: - Destination Volume Node: - Relationship ID: 15f75947-d441-11ee-a514-00a098af9054 Current Operation ID: - Transfer Type: - Transfer Error: - Current Throttle: - Current Transfer Priority: - Last Transfer Type: resync Last Transfer Error: - Last Transfer Size: 663.3KB Last Transfer Network Compression Ratio: 1:1 Last Transfer Duration: 0:1:5 Last Transfer From: jfs_svm1:/cg/jfs3 Last Transfer End Timestamp: 02/26 00:52:11 Progress Last Updated: - Relationship Capability: 8.2 and above Lag Time: 0:0:21 Identity Preserve Vserver DR: - Volume MSIDs Preserved: - Is Auto Expand Enabled: true Backoff Level: - Number of Successful Updates: 0 Number of Failed Updates: 0 Number of Successful Resyncs: 1 Number of Failed Resyncs: 0 Number of Successful Breaks: 0 Number of Failed Breaks: 0 Total Transfer Bytes: 679208 Total Transfer Time in Seconds: 65 FabricLink Source Role: - FabricLink Source Bucket: - FabricLink Peer Role: - FabricLink Peer Bucket: - FabricLink Topology: - FabricLink Pull Byte Count: - FabricLink Push Byte Count: - FabricLink Pending Work Count: - FabricLink Status: -
If the data isn't already encrypted or compressed, then 3:1 is about the median. As Dave said, it's also a lot of "It depends". We evaluated some internal production datafiles here at NetApp, taken at random, and we found between 2:1 and 6:1 efficiency. We've also had customers with a lot of datafiles containing largely empty blocks, and that sort of data gets about 80:1 efficiency because all that actually gets stored is the datafile block header/trailer. We also had a support case a while back where efficiency was basically 1:1. The data wasn't compressed, it was just a massive index of flat files stored elsewhere. It was an extremely efficient way to store data that also happened to be extremely random. Conceptually, it was like compressed data, but it wasn't "compression" as we know it.
Could you provide a snippet of the REST API you're calling to do the commit? Also, have you considered autocommit, where you don't need to explicitly invoke the commit at all and the file is simply committed for the required time after being written? Obviously that wouldn't work if you use different retention times; I was only wondering if you'd considered it.
As a general rule, POST is for creating something or executing an operation, PATCH is for changing something, and GET just retrieves information. I tested this now:

GET https://10.192.160.45/api/storage/volumes?fields=uuid,size,svm.name,svm.uuid,clone.split_initiated,clone.split_complete_percent,clone.split_estimate,nas.path,aggregates,type&name=*&svm.name=jfs_svm2

and it returned what you'd expect for each volume on a system without any current split operations:

->doREST:REST:RESPONSE:     "clone": {
->doREST:REST:RESPONSE:       "split_estimate": 3266568192,
->doREST:REST:RESPONSE:       "split_initiated": false

The split_estimate is misleading. That's the amount of used space on the volume to be split. It does NOT mean that amount of data will be copied or consumed after the split. The space consumption after a split depends on the space allocation policies for the volume. If you're in a fully thin-provisioned configuration, splitting a clone requires no additional space. If you're thick provisioned, the split clone would allocate its full size on the aggregate for itself.
"volume clone split show" is the CLI command. For REST, you can do a GET /storage/volumes and retrieve fields like: clone.split_complete_percent clone.split_estimate clone.split_initiated
Certification just came through a few days ago. In principle, A-Series and C-Series should perform about the same because SAP HANA is all about sequential IO. You'll see a random read latency increase with C-Series, but there's only a very small difference between A-Series and C-Series with sequential IO. It's not zero difference, but it's negligible.
It might take a few more days until the certification is listed on SAP’s webpage: https://www.sap.com/dmc/exp/2014-09-02-hana-hardware/enEN/#/solutions?filters=storage.
The C-Series will definitely have higher latency. A C250 should be able to deliver more total IOPS than the A200, but the latency of the individual IO operations would be higher. You're correct, it's the differing media.
Whether you notice the difference depends on the workload. A lot of virtualization projects really don't need the 150µs latency an A-Series can deliver. A lot of VMware footprints are still using FAS with spinning drives, and they work well. If, however, you're hosting a database on those VMDKs, then C-Series latency might cause problems.
A-Series and C-Series
I've seen an extraordinary amount of interest in the C-Series systems for all sorts of workloads. I'm the Big Scary Enterprise Workload guy, which means I want proof before I recommend anything to a customer. So, I reached out to the Workload Engineering team and got some realistic test data that demonstrates what it can do.
If you want to read all the low-level details about the A-Series and C-Series architectures and their performance characteristics, there are a couple of hyperlinks below.
Before I get into the numbers, I wanted to recap my personal view of "What is C-Series?", or to phrase it better…
Why is C-Series?
For a long time, solid-state storage arrays just used "Flash". For a brief time, it appeared the solid-state market was going to divide into Flash media and 3D XPoint media, which was a much faster solid-state technology, but 3D XPoint couldn't break out of its niche.
Instead, we've seen the commercial Flash market divide into SLC/TLC drives and QLC drives. Without going into details, SLC/TLC are the faster and more durable options aimed at high-speed storage applications, whereas QLC is somewhat slower and less durable* but also less expensive and is aimed at capacity-centric, less IO-intensive storage applications.
*Note: Don't get misled on this topic. The durability difference between TLC and QLC might be important if you're purchasing a drive for your personal laptop, but the durability is essentially identical when the drives are inserted into ONTAP. ONTAP RAID technology is still there protecting data against media failures. Furthermore, ONTAP WAFL technology distributes inbound write data to free blocks across multiple drives. This minimizes overwrites of the individual cells within the drive, which maximizes the drive's useful life. In addition, NetApp support agreements that cover drive failures also include drive replacement for SSDs that have exhausted their write cycles.
The result of the market changes is that NetApp now offers the A-Series for high-speed, latency-sensitive databases or IO-intensive VMware estates, while the C-Series is for less latency-sensitive and more capacity-centric workloads.
That's easy enough to understand in principle, but it's not enough for DBAs, virtualization admins, and storage admins to make decisions. They want to see the numbers…
The Numbers
What makes this graph so compelling is its simplicity.
The IOPS capabilities are comparable. Yes, the C-Series is saturating a touch quicker than A-Series, but it's very close.
As expected, the C-Series is showing higher latency, but it's very consistent and also much closer to A-Series performance than it is to hybrid/spinning disk solutions.
The workload is an Oracle database issuing a roughly 80%/20% random-read, random-write split. We graph total IOPS against read latency because read latency is normally the most important factor with real-world workloads.
The reason we use Oracle databases is twofold. First, we've got thousands and thousands of controllers servicing Oracle databases, so it's an important market for us. Second, Oracle databases are an especially brutal, latency-sensitive workload with lots of interdependencies between different IO operations. If anything is wrong with storage behavior, you should see it in the charts. You can also extrapolate these results to lots of different workloads beyond databases.
We are also graphing the latency as seen from the database itself. The timing is based on the elapsed time between the application submitting an IO and the IO being reported as complete. That means we're measuring the entire storage path from the OS, through the network, into the storage system, back across the network to the OS, and up to the application layer.
I'd also like to point out that the performance of the A-Series is amazing. 900K IOPS without even breaching 1ms of latency runs circles around the competition, but I've posted about that before. This post is focusing on C-Series.
Note: These tests all used ONTAP 9.13.1, which includes some significant performance improvements for both A-Series and C-Series.
Write latency
Obviously write latency is also important, but the A-Series and C-Series both use the same write logic from the point of view of the host. Write operations commit to the mirrored, nonvolatile NVRAM journal. Once the data is in NVRAM, the write is acknowledged and the host continues. The write to the drive layer comes much later.
Want proof?
This is a graph of total IOPS versus the write latency. Note that the latency is reported in microseconds.
You can see your workload's write latency is mostly unaffected by the choice of A-Series or C-Series. As the IOPS increase toward the saturation point, the write latency on C-Series increases more quickly than A-Series as a result of the somewhat slower media in use, but keep this in perspective. Most real-world workloads run as expected so long as write latency remains below 500µs. Even 1ms of write latency is not necessarily a problem, even with databases.
Caching
These tests used 10TB of data within the database (the database itself was larger, but we're accessing 10TB during the test runs). This means the test results above do include some cache hits on the controller, which reflects how storage is used in the real world. There will be some benefit from caching, but nearly all IO in these tests is being serviced from the actual drives.
We also run these tests with storage efficiency features enabled, using a working set with a reasonable level of compressibility. Unrealistic test data can produce an outrageous amount of caching that skews the results, in the same way that a tiny, unrealistically cacheable working set skews the results.
The reason I want to point this out is that customers considering C-Series need to understand that not all IO latency is affected. The higher latencies only appear with read operations that actually require a disk IO. Reads that can be serviced by onboard cache should be measurable in microseconds, as is true with the A-Series. This is important because all workloads, especially databases, include hot blocks of data that require ultra-low latency. Access times for cached blocks should be essentially identical between A-Series and C-Series.
Sequential IO
Sequential IO is much less affected by the drive type than random IO. The reason is that sequential IO involves both readahead and larger blocks. That means the storage system can start performing read IO to the drives before the host even requests the data, and there are far fewer (but larger) IO operations happening on the backend drives.
On the whole, if you're doing sequential IO you should see comparable performance with A-Series, C-Series and even old FAS arrays if they have enough drives.
We compared A-Series to C-Series and saw a peak sequential read throughput of about 30GB/sec with the A-Series and about 27GB/sec with the C-Series. These numbers were measured with synthetic tools. It's difficult to perform a truly representative sequential IO test from a database because of the configuration requirements. You'd need an enormous number of FC adapters to drive a single A800 or C800 controller to its limit, and it's difficult to get a database to try to read 30GB/sec in the first place.
As a practical matter, few workloads are purely sequential IO, but tasks such as Oracle RMAN backups or database full table scans should perform about the same on both A-Series and C-Series. The limiting factor should normally be the available network bandwidth, not the storage controller.
Summary: A-Series or C-Series?
It's all ONTAP; it's just about the media. If you have a workload that genuinely requires consistent latency down in the hundreds of microseconds, then choose A-Series. If your workload can accept 2ms (or so) of read latency (and remember, cache hits and write IO are much faster), then look at C-Series.
As a general principle, think about whether your workload is about computational speed or is about the end-user experience. A bank performing end-of-day account reconciliation probably needs the ultra-low latency of the A-Series. In contrast, a CRM database is usually about the end users. If you're updating customer records, you probably don't care if it takes an extra 2ms to retrieve a block of data that contains the customer contact information.
You can also build mixed clusters and tier your databases between A-Series and C-Series as warranted. It's all still ONTAP, and you can nondisruptively move your workloads between controllers.
Finally, the C-Series is an especially good option for upgrading legacy spinning-drive and hybrid arrays. Spinning-drive latencies are typically around 8-10ms, which means C-Series is about 4X to 5X faster in terms of latency. If you're looking at raw IOPS, there's no comparison. A spinning drive saturates around 120 IOPS/drive. You would need about 6000 spinning drives to reach the 800,000 IOPS delivered by just 24 QLC drives as shown in these tests. It's a huge improvement in terms of performance, power/cooling requirements, and costs, and it comes at a lower price point than arrays using SLC or TLC drives.
If you want to learn more, we published the following two technical reports today:
Oracle Performance on AFF A-Series and C-Series
Virtualized Oracle Performance on AFF A-Series and C-Series
Bonus Chart:
If you're wondering how A-Series and C-Series compare under virtualization, here it is. It's not easy building a truly optimized 4-node virtualized RAC environment, and we probably could have tuned this better to reduce the overhead from ESX and VMDKs, but the results are still outstanding and, more importantly, consistent with the bare-metal tests. The latency is higher with C-Series, but the IOPS levels are comparable to A-Series.
We also did this test with FCP, not NVMe/FC, because most VMware customers are using traditional FCP. The protocol change is the primary reason for the lower maximum IOPS level.
That KB article is incorrect. I just put in a request to remove it. SM-S operates in one of two modes: sync and strict sync. The target customer for the "sync" option is customers who want RPO=0 but do NOT want operations to screech to a halt if the replication link is lost. That applies to most customers. Some customers, however, want 100% guaranteed RPO=0. That's why we also included the strict sync option. If a write cannot be replicated, an IO error is returned to the host OS, which typically results in an application shutdown. In both cases, the timeout is around 15 seconds, but the precise number varies a little and I believe there are some options for tuning. In normal operations, you have RPO=0 synchronous mirroring. If you lose connectivity to the remote site for more than 15 seconds, then regular SM-S will carry on accepting writes in a broken state and resync when it gets the opportunity, while StrictSync will throw an error. I believe the field you've asked about indicates the time since an SM-S Synchronous (not Strict) replica lost sync.
NFS has been around for decades as the premier networked, clustered filesystem. If you're a unix/linux user, and you're storing a lot of files, you're probably using NFS right now, especially if you need multiple hosts accessing the same data.
If you're looking for high-performance NFS, NetApp's implementation is the best in the business. A lot of NetApp's market share was built on ONTAP's unique ability to deliver fast, easy-to-manage NFS storage for Oracle database workloads. It's an especially nice solution for Oracle RAC because it's an inherently clustered filesystem. The connected hosts are just reading and writing files. The actual filesystem management lives on the storage system itself. All the NFS clients on the hosts see the same logical data.
The NFSv3 specification was published in 1995, and that's still the version almost everyone is using today. You can store a huge number of files, it's easy to configure, and it's super-fast. There really wasn't much to improve, and as a result v3 has been the dominant version for decades.
Note: I originally wrote this post for Oracle database customers moving from NFSv3 to NFSv4, but it morphed into a more general explanation of the practical difference between managing NFSv3 storage and managing NFSv4 storage. Any sysadmin using NFS should understand the differences in protocol behavior.
Why NFSv4?
So, why is everyone increasingly looking at NFSv4?
Sometimes it's just perception. NFSv4 is newer, and 'newer' is often seen as 'better'. Most customers I see who are either migrating to NFSv4 or choosing NFSv4 for a new project honestly could have used either v3 or v4 and wouldn't notice a difference between the two. There are exceptions, though. There are subtle improvements in NFSv4 that sometimes make it a much better option, especially in cloud deployments.
This post is about the key practical differences between NFSv3 and NFSv4. I'll cover security improvements, changes in networking behavior, and changes in the locking model. It's especially critical you understand the section NFSv4.1 Locks and Leases. NFSv4 is significantly different from NFSv3. If you're running an application like an Oracle database over NFSv4, you need to change your management practices if you want to avoid accidentally crashing your database.
What this post is not
This is not a re-hash of the ONTAP NFS best practices. You can find that information here, https://www.netapp.com/media/10720-tr-4067.pdf.
NFSv4 versions
If someone says “NFSv4” they're usually referring to NFSv4.1. That’s almost certainly the version you’ll be using.
The first release of NFSv4, which was version 4.0, worked fine, but the NFSv4 protocol was designed to expand and evolve. The primary version you’ll see today is NFSv4.1. For the most part, you don't have to think about the improvements in NFSv4.1. It just works better than NFSv4.0 in terms of performance and resiliency.
For purposes of this post, when I write NFSv4, just assume that I’m talking about NFSv4.1. It’s the most widely adopted and supported version. (NetApp has support for NFSv4.2, but the primary difference is we added support for labelled NFS, which is a security feature that most customers haven’t implemented.)
NFSv4 features
The most confusing part about the NFSv4 specification is the existence of optional features. The NFSv3 spec was quite rigid. A given client or server either supported NFSv3 or did not support NFSv3. In contrast, the NFSv4 spec is loaded with optional features.
Most of these optional NFSv4 features are disabled by default in ONTAP because they're not commonly used by sysadmins. You probably don't need to think about them, but there are some applications on the market that specifically require certain capabilities for optimum performance. If you have one of these applications, there should be a section in the documentation covering NFS that will explain what you need from your storage system and which options should be enabled.
If you plan to enable one of the options (delegations is the most commonly used optional feature), test it first and make sure your OS's NFS client fully supports the option and it's compatible with the application you're using. Some of the advanced features can be revolutionary, but only if the OS and application make use of those features. For more information on optional features, refer to the TR referenced above.
Again, it's rare you'll run into any issues. For most users, NFSv4 is NFSv4. It just works.
Exception #1
NFSv4.1 introduced a feature called parallel NFS (pNFS) which is a significant feature with broad appeal for a lot of customers. It separates the metadata path from the data path, which can simplify management and improve performance in very large scale environments.
For example, let's say you have a 20-node cluster. You could enable the pNFS feature, configure a data interface on all 20 nodes, and then mount your NFSv4.1 filesystems from one IP in the cluster. That IP becomes the control path. The OS will then retrieve the data path information and choose the optimal network interface for data traffic. The result is you can distribute your data all over the entire 20-node cluster and the OS will automatically figure out the correct IP address and network interface to use for data access. The pNFS feature is also supported by Oracle's direct NFS client.
pNFS is not enabled by default. NetApp has supported it for a long time, but at the time of the release some OS's had a few bugs. We didn't want customers to accidentally use a feature that might expose them to OS bugs. In addition, pNFS can silently change the network paths in use to move data around, which could also cause confusion for customers. It was safer to leave pNFS disabled so customers know for sure whether it's being used within their storage network.
"Upgrading" from NFSv3 to NFSv4
Don't think of this as an upgrade. NFSv4 isn't better than NFSv3, NFSv4 is merely different. Whether you get any benefits from those differences depends on the application.
For example - locking. NFSv3 has some basic locking capabilities, but it's essentially an honor system lock. NFSv3 locks aren't enforced by the server. NFSv3 clients can ignore locks. In contrast, NFSv4 servers, including ONTAP, must honor and enforce locks.
That opens up new opportunities for applications. For example, IBM WebSphere and Tibco offer clusterable applications where locking is important. There's nothing stopping those vendors from writing application-level logic that tracks and controls which parts of the application are using which files, but that requires work. NFSv4 can do that work too, natively, right on the storage system itself. NFSv4 servers track the state of open and locked files, which means you can build clustered applications where individual files can be exclusively locked for use by a specific process. When that process is done with the file, it can release the lock and other processes can acquire the lock. The storage system enforces the locking.
That's a cool feature, but do you need any of that? If you have an Oracle database, it's mostly just doing reads and writes of various sizes, and that's all. Oracle databases already manage locking and file access synchronization internally. NetApp does a lot of performance testing with real Oracle databases, and we're not seeing any significant performance difference between NFSv3 and NFSv4. Oracle simply hasn't coded their software to make use of the advanced NFSv4 features.
NFS through a firewall
While the choice of NFS version rarely matters to the applications you're running, it does affect your network infrastructure. In particular, it's much easier to run NFSv4 across a firewall.
With NFSv4, you have a single target port (2049) and the NFSv4 clients are required to renew leases on files and filesystems on regular basis. (more on leases below) This activity keeps the TCP session active. You can normally just open port 2049 through the firewall and NFSv4 will work reliably.
In contrast, NFSv3 is often impossible to run through a firewall. Among the problems experienced by customers trying to make it work is NFSv3 filesystems hanging for up to 30 minutes or more. The problem is that firewalls are almost universally configured to drop a network packet that isn't part of a known TCP session. If you have a lot of NFSv3 filesystems, one of them will probably have quiet periods where the TCP session has low activity. If your TCP session timeout limit on the firewall is set to 15 minutes, and an NFSv3 filesystem is quiet for 15 minutes, the firewall will make the TCP session stale and cease passing packets.
Even worse, it will probably drop them.
If the firewall rejected the packets, that would prompt the client to open a new session, but that's not how firewalls normally work. They'll silently drop the packets. You don't usually want a firewall rejecting a packet because that tells an intruder that the destination exists. Silently dropping an invalid packet is safer because it doesn't reveal anything about the other side of the firewall.
The result of silent packet drops with NFSv3 is that the client will hang while it tries to retransmit packets over and over and over. Eventually it gives up and opens a fresh TCP session. The firewall will register the new TCP session and traffic will resume, but in the interim your OS might have been stalled for 5, 10, 20 minutes or more. Most firewalls can't be configured to avoid this situation. You can increase the allowable timeout for an inactive TCP session, but there has to be some kind of timeout with a fixed number of seconds.
We've had a few customers write scripts that did a repeated "stat" on an NFSv3 mountpoint in order to ensure there's enough network activity on the wire to prevent the firewall from closing the session. This is okay as a one-off hack, but it's not something I'd want to rely on for anything mission-critical and it doesn't scale well.
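Purely as an illustration of what those keep-alive scripts typically looked like (and again, this is a hack, not a recommendation), a minimal Python sketch might be:

#!/usr/bin/env python3
# Keep-alive hack: stat the NFSv3 mountpoint periodically so the firewall
# never sees the TCP session go idle. The path and interval are examples.
import os
import time

MOUNTPOINT = '/mnt/nfsv3_volume'    # placeholder mountpoint
INTERVAL = 300                      # seconds; keep this below the firewall's idle timeout

while True:
    try:
        os.stat(MOUNTPOINT)         # generates a little NFS traffic
    except OSError as e:
        print('stat failed:', e)
    time.sleep(INTERVAL)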
Even if you could increase the timeouts for NFSv3, how do you know which ports to open and ensure they're correctly configured on the firewall? You've got 2049 for NFS, 111 for portmap, 635 for mountd, 4045 for NLM, 4046 for NSM, 4049 for rquota…
NFSv4 works better because there's just a single target port, plus the "heartbeat" of lease renewal would keep the TCP/IP session alive.
NFS Security
NFSv4 is inherently more secure than NFSv3. For example, NFSv4 security is normally based on usernames, not user IDs. The result is it's more difficult for an intruder to spoof credentials to gain access to data on an NFSv4 server. You can also easily tell which clients are actively using an NFSv4 filesystem. It's often impossible to know for sure with NFSv3. You might know a certain client mounted a filesystem at some point in the past, but are they still using the files? Is the filesystem still mounted now? You can't know for sure with NFSv3.
NFS Security - Kerberos
NFSv4 also includes options to make it even more secure. The primary security feature is Kerberos. You have three options -
krb5 - secure authentication
krb5i - data integrity
krb5p - privacy
In a nutshell, basic krb5 security means better, more secure authentication for NFS access. It's not encryption per se, but it uses an encrypted process to ensure that whoever is accessing an NFS resource is who they claim to be. Think of it as a secure login process where the NFS client authenticates to the NFS server.
If you use krb5i, you add a validation layer to the payload of the NFS conversation. If a malicious middleman gained access to the network layer and tried to modify the data in transit, krb5i would detect it and stop it. The intruder might still be able to read data from the conversation, but they wouldn't be able to tamper with it undetected.
If you're concerned about an intruder being able to read network packets on the wire, you can go all the way to krb5p. The letter p in krb5p means privacy. It delivers complete encryption.
In the field, few administrators use these options for a simple reason - what are the odds a malicious intruder is going to gain access to a data center and start snooping on IP packets on the wire? If someone was able to do that, they'd probably also be able to get actual login credentials to the database server itself. They'd then be able to freely access data as an actual user.
With increased interest in cloud, some customers are demanding that all data on the wire be encrypted, no exceptions, ever, and they're demanding krb5p. They don't necessarily use it across all NFS filesystems, but they want the option to turn it on. This is also an example of how NFSv4 security is superior to NFSv3. While some of NFSv3 could be krb5p encrypted, not all NFSv3 functions could be "kerberized". NFSv4, however, can be 100% encrypted.
NFSv4 with krb5p is still not generally used because the encryption/decryption work has overhead. Latency will increase and maximum throughput will drop. Most databases would not be affected to the point users would notice a difference, but it depends on the IO load and latency sensitivity. Users of a very active database would probably experience a noticeable performance hit with full krb5p encryption. That's a lot of CPU work for both the OS and the storage system. CPU cycles are not free.
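If you do decide to test krb5p, the configuration itself is conceptually simple. Here's a hedged sketch, assuming Kerberos is already configured for both the SVM and the client; the export policy name, rule index, LIF name, and mount point are made up:

# ONTAP: allow Kerberos privacy on the export policy rule for the volume
vserver export-policy rule modify -vserver jfsCloud4 -policyname oracle_nfs -ruleindex 1 -rorule krb5p -rwrule krb5p

# Client: mount with sec=krb5p instead of sec=sys
mount -t nfs -o vers=4.1,sec=krb5p nfs-lif1:/jfs0_oradata0 /oradata

The heavy lifting is in the Kerberos setup itself (KDC, principals, keytabs), not in these two commands.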
NFS Security - Private VLANs
If you're genuinely concerned about network traffic being intercepted and decoded in-transit, I would recommend looking at all available options. Yes, you could turn on krb5p, but you could also isolate certain NFS traffic to a dedicated switch. Many switches support private VLANs where individual network ports can communicate with the storage system, but all other port-to-port traffic is blocked. An outside intruder wouldn't be able to intercept network traffic because there would be no other ports on the logical network. It's just the client and the server. This option mitigates the risk of an intruder intercepting traffic without imposing a performance overhead.
NFS Security - IPSec
In addition, you may want to consider IPSec. Any network administrator should know IPSec already, and it's been part of OSs for years. It's a lot like the VPN client you have on your PC, except it's used by server OSs and network devices.
As an ONTAP example, you can configure an IPsec endpoint on a Linux host and an IPsec endpoint on ONTAP, and subsequently all IP traffic between the two will use that IPsec tunnel. The protocol doesn't really matter (although I wouldn't recommend running krb5p over IPsec - you don't really need to re-encrypt already encrypted traffic). NFS should perform about the same under IPsec as it would with krb5p, and in some environments IPsec is easier to configure than krb5p.
Note: You can also use IPsec with NFSv3 if you need to secure an NFS connection and NFSv4 is not yet an option for you.
NFS Security - Application Layer
Applications can encrypt data too.
For example, if you're an Oracle database user, consider encryption at the database layer. That also delivers encryption of data on the wire, plus one additional benefit - the backups are encrypted. A lot of the data leaks you read about are the result of someone leaving an unprotected backup in an insecure location. Oracle's Transparent Data Encryption (TDE) encrypts the tablespaces themselves, which means a breach of the backup location yields data that is still encrypted. As long as the Oracle Wallet, which contains the decryption keys, is not stored with the backups themselves, that backup data is still secure.
Additionally, TDE scales better. The encryption/decryption work is distributed across all your database servers, which means more CPUs sharing the work. And unlike krb5p encryption, TDE incurs zero overhead on the storage system itself.
NFSv4.1 Locks and Leases
In my opinion, this is the most important section of this post. If you don't understand this topic, you're likely to accidentally crash your database.
NFSv3 is stateless. That effectively means that the NFS server (ONTAP) doesn't keep track of which filesystems are mounted, by whom, or which locks are truly in place. ONTAP does have some features that will record mount attempts so you have an idea which clients may be accessing data, and there may be advisory locks present, but that information isn't guaranteed to be 100% complete. It can't be complete, because tracking NFS client state is not part of the NFSv3 standard.
In contrast, NFSv4 is stateful. The NFSv4 server tracks which clients are using which filesystems, which files are open, which files and/or regions of files are locked, etc. This means there needs to be regular communication between the NFSv4 client and the NFSv4 server to keep the state data current.
The most important states being managed by the NFS server are NFSv4 Locks and NFSv4 Leases, and they are very much intertwined. You need to understand how each works by itself, and how they relate to one another.
Locking
With NFSv3, locks are advisory. An NFS client can still modify or delete a "locked" file. An NFSv3 lock doesn't expire by itself, it must be removed. This creates problems. For example, if you have a clustered application that creates NFSv3 locks, and one of the nodes fails, what do you do? You can code the application on the surviving nodes to remove the locks, but how do you know that's safe? Maybe the "failed" node is operational, but isn't communicating with the rest of the cluster?
With NFSv4, locks have a limited duration. As long as the client holding the locks continues to check in with the NFSv4 server, no other client is permitted to acquire those locks. If a client fails to check in with the server, the locks eventually get revoked, and other clients will be able to request and obtain them.
Now we have to add a layer - leases. NFSv4 locks are associated with an NFSv4 lease.
Leases
When an NFSv4 client establishes a connection with an NFSv4 server, it gets a lease. If the client obtains a lock (there are many types of locks) then the lock is associated with the lease.
This lease has a defined timeout. By default, ONTAP will set the timeout value to 30 seconds:
EcoSystems-A200-B::*> nfs server show -vserver jfsCloud4 -fields v4-lease-seconds
vserver   v4-lease-seconds
--------- ----------------
jfsCloud4 30
This means that an NFSv4 client needs to check in with the NFSv4 server every 30 seconds to renew its leases.
The lease is automatically renewed by any activity, so if the client is doing work, there's no need to perform additional operations. If an application goes quiet and isn't doing real work, it needs to perform a keep-alive operation (called a SEQUENCE) instead. It's essentially just saying, "I'm still here, please refresh my leases."
Question: What happens if you lose network connectivity for 31 seconds?
With NFSv3, nothing happens. The server is stateless and isn't expecting communication from clients. With NFSv4, once the lease period elapses, the lease expires, the locks are revoked, and the locked files are made available to other clients.
With NFSv3, you could move network cables around, reboot network switches, make configuration changes, and be fairly sure that nothing bad would happen. Applications would normally just wait patiently for the network connection to work again. Many applications would wait until the end of time, but even an application like Oracle RAC allowed for a 200 second loss of storage connectivity by default. I've personally powered down and physically relocated NetApp storage systems that were serving NFSv3 shares to various applications, knowing that everything would just freeze until I completed my work and work would resume when I put the system back on the network.
With NFSv4, you have 30 seconds (unless you've increased the value of that parameter within ONTAP) to complete your work. If you exceed that, your leases time out. Normally this results in application crashes.
Example: Network failure with an Oracle Database using NFSv4
If you have an Oracle database, and you experience a loss of network connectivity (sometimes called a "network partition") that exceeds the lease timeout, you will crash the database.
Here's an example of what happens in the Oracle alert log if this happens:
2022-10-11T15:52:55.206231-04:00
Errors in file /orabin/diag/rdbms/ntap/NTAP/trace/NTAP_ckpt_25444.trc:
ORA-00202: control file: '/redo0/NTAP/ctrl/control01.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 1
Additional information: 4294967295
2022-10-11T15:52:59.842508-04:00
Errors in file /orabin/diag/rdbms/ntap/NTAP/trace/NTAP_ckpt_25444.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/redo1/NTAP/ctrl/control02.ctl'
ORA-27061: waiting for async I/Os failed
If you look at the syslogs, you should see several of these errors:
Oct 11 15:52:55 jfs0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Oct 11 15:52:55 jfs0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Oct 11 15:52:55 jfs0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
The log messages are usually the first sign of a problem, other than the application freeze. Typically, you see nothing at all during the network outage because processes and the OS itself are blocked attempting to access the NFS filesystem.
The errors appear after the network is operational again. In the example above, once connectivity was reestablished, the OS attempted to reacquire the locks, but it was too late. The lease had expired and the locks were removed. That results in an error that propagates up to the Oracle layer and causes the messages in the alert log. You might see variations on these patterns depending on the version and configuration of the database.
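If you suspect this is what happened, the client logs are the quickest confirmation (a minimal sketch; log locations vary by distribution):

# Look for lock reclaim failures logged after connectivity returned
grep -i "lock reclaim failed" /var/log/messages

# or, on systemd-based hosts
journalctl -k | grep -i "reclaim failed"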
There's nothing stopping vendors from writing software that detects the loss of locks and reacquires the file handles, but I'm not aware of any vendor who has done that.
In summary, NFSv3 tolerates network interruption, but NFSv4 is more sensitive and imposes a defined lease period.
Now, what if a 30 second timeout isn't acceptable? What if you manage a dynamically changing network where switches are rebooted or cables are relocated and the result is the occasional network interruption? You could choose to extend the lease period, but whether you want to do that requires an explanation of NFSv4 grace periods.
NFSv4 grace periods
Remember how I said that NFSv3 is stateless, while NFSv4 is stateful? That affects storage failover operations as well as network interruptions.
If an NFSv3 server is rebooted, it's ready to serve IO almost instantly. It was not maintaining any sort of state about clients. The result is that an ONTAP takeover operation often appears to be close to instantaneous. The moment a controller is ready to start serving data it will send an ARP to the network that signals the change in topology. Clients normally detect this almost instantly and data resumes flowing.
NFSv4, however, will produce a brief pause. Neither NetApp nor OS vendors can do anything about it - it's just part of how NFSv4 works.
Remember how NFSv4 servers need to track the leases, locks, and who's using what? What happens if an NFS server panics and reboots, or loses power for a moment, or is restarted during maintenance activity? The lease/lock and other client information is lost. The server needs to figure out which client is using what data before resuming operation. This is where the grace period comes in.
Let's say you suddenly power cycle your NFSv4 server. When it comes back up, clients that attempt to resume IO will get a response that essentially says, "Hi there, I have lost lease/lock information. Would you like to re-register your locks?"
That's the start of the grace period. It defaults to 45 seconds on ONTAP:
EcoSystems-A200-B::> nfs server show -vserver jfsCloud4 -fields v4-grace-seconds
vserver   v4-grace-seconds
--------- ----------------
jfsCloud4 45
The result is that, after a restart, a controller will pause IO while all the clients reclaim their leases and locks. Once the grace period ends, the server will resume IO operations.
Lease timeouts vs grace periods
The grace period and the lease period are connected. As mentioned above, the default lease timeout is 30 seconds, which means NFSv4 clients must check in with the server at least every 30 seconds or they lose their leases and, in turn, their locks. The grace period exists to allow an NFS server to rebuild lease/lock data, and it defaults to 45 seconds. ONTAP requires the grace period to be at least 15 seconds longer than the lease period, which guarantees that clients designed to renew their leases at least every 30 seconds get the opportunity to check back in with the server after a restart.
As I mentioned above:
Now, what if a 30 second timeout isn't acceptable? What if you manage a dynamically changing network where switches are rebooted or cables are relocated and the result is the occasional network interruption? You could choose to extend the lease period, but whether you want to do that requires an explanation of NFSv4 grace periods.
If you want to increase the lease timeout to 60 seconds in order to withstand a 60 second network outage, you're going to have to increase the grace period to at least 75 seconds. ONTAP requires it to be 15 seconds higher than the lease period. That means you're going to experience longer IO pauses during controller failovers.
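Here's a hedged sketch of what that change looks like, using the same command family shown earlier (setting both parameters in one step keeps the 15-second rule satisfied):

EcoSystems-A200-B::*> nfs server modify -vserver jfsCloud4 -v4-lease-seconds 60 -v4-grace-seconds 75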
This shouldn't normally be a problem. Typical users only update ONTAP controllers once or twice per year, and unplanned failovers due to hardware failures are extremely rare. Also, let's be realistic: if you have a network where a 60-second outage is a real possibility, and you need to set the lease timeout to 60 seconds, then you probably wouldn't object to rare storage system failovers resulting in a 75-second pause either. You've already acknowledged you have a network that pauses for 60+ seconds rather frequently.
You do, however, need to be aware that the NFSv4 grace period exists. I was initially confused when I noted IO pauses on Oracle databases running in the lab, and I thought I had a network problem that was delaying failover, or maybe storage failover was slow. NFSv3 failover was virtually instantaneous, so why isn't NFSv4 just as quick? That's how I learned about the real-world impact of NFSv4 lease periods and NFSv4 grace periods.
Deep Dive - ONTAP lease/lock monitoring
If you really, really want to see what's going on with leases and locks, ONTAP can tell you.
The commands and output can be confusing because there are two ways to look at NFSv4 locks:
The NFSv4 server needs to know which NFSv4 clients currently own NFSv4 locks
The NFSv4 server needs to know which NFSv4 files are currently locked by an NFSv4 client.
The end result is the networking part of ONTAP needs to maintain a list of NFSv4 clients and which NFSv4 locks they hold. Meanwhile, the data part of ONTAP also needs to maintain a list of open NFSv4 files and which NFSv4 locks exist on those files. In other words, NFSv4 locks are indexed by the client that holds them, and NFSv4 locks are indexed by the file they apply to.
Note: I've simplified the screen shots below a little so they're not 200 characters wide and 1000 lines long. Your ONTAP output will have extra columns and lines.
If you want to get NFSv4 locking data from the NFSv4 client point of view, you use the vserver locks show command. It accepts various arguments and filters.
Here's an example of what's locked on one of the datafile volumes on one of my Oracle lab systems:
EcoSystems-A200-A::vserver locks*> vserver locks show -volume jfs0_oradata0 -fields volume,path,lockid,client-id
volume        path                             lockid                               client-id
------------- -------------------------------- ------------------------------------ ----------------
jfs0_oradata0 /jfs0_oradata0/NTAP/system01.dbf 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/system01.dbf 721e4ce8-e6e3-4011-b8cc-7cea6e53661b 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/sysaux01.dbf bb7afcdb-6f8c-4fea-b47d-4a161cd45ceb 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/sysaux01.dbf 2eacf804-7209-4678-ada5-0b9cdefceee0 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/users01.dbf  693d3bb8-aed5-4abd-939b-2fdb8af54ae6 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/users01.dbf  a7d24881-b502-40b6-b264-7414df8a98f5 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS000.dbf  1a33008c-573b-4ab7-ae87-33e9b5891e6a 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS000.dbf  b6ef3873-217a-46e3-bdc7-5703fb6c82f4 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS004.dbf  fef3204b-406c-4f44-a02b-d14adaba807c 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS004.dbf  9f9f737b-52de-4d7a-b169-3ba15df8bcc5 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS008.dbf  b322f896-1989-43ab-9d83-eaa2850f916a 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS008.dbf  cd33d350-ff79-4e29-8e13-f64ed994bc4e 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS012.dbf  e4a54f25-5290-4da3-9a93-28c4ea389480 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS012.dbf  f3faed7f-3232-46f4-a125-4d2ad8059bc4 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS016.dbf  be7ad0d4-bb70-45a8-85b5-45edcb626487 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS016.dbf  ce26918c-8a44-4d02-8c41-fafb7e5d2954 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS020.dbf  47517938-b944-4a0b-a9e8-960b721602f4 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS020.dbf  2808307d-46c9-4afa-af2a-bb13f0908ea3 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS024.dbf  f21b6f26-0726-4405-9bac-d9e680baa4df 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS024.dbf  0a95f55b-3dfa-45db-8713-c5ad717441ae 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS028.dbf  a0196191-4012-4615-b2fd-dda0ce2d7c3f 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS028.dbf  fc769b9d-0fff-4e74-944a-068b82702fd1 0100000028aa6c80
The first time I used this command, I immediately asked, "Hey, where's the lease data? How many seconds are left on the lease for those locks?" That information is held elsewhere. Since an NFSv4 file might be the target of multiple locks with different lease periods, and the NFSv4 server needs to enforce those locks, the server has to track the detailed locking data down at the file level. You get that data with vserver locks nfsv4 show. Yes, it's almost the same command.
In other words, the vserver locks show command tells you which locks exist. The vserver locks nfsv4 show command tells you the details about a lock.
Let's take the first line in the above output:
EcoSystems-A200-A::vserver locks*> vserver locks show -volume jfs0_oradata0 -fields volume,path,lockid,client-id
volume        path                             lockid                               client-id
------------- -------------------------------- ------------------------------------ ----------------
jfs0_oradata0 /jfs0_oradata0/NTAP/system01.dbf 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 0100000028aa6c80
If I want to know how many seconds are left on that lock, I can run this command:
EcoSystems-A200-A::*> vserver locks nfsv4 show -vserver jfsCloud3 -lock-uuid 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2
There are no entries matching your query.
Wait, why didn't that work?
The reason is that I'm using a 2-node cluster. The NFSv4 client-centric command (vserver locks show) shows locking information up at the network layer. The NFSv4 server spans all ONTAP controllers in the cluster, so that command looks the same on every controller. Individual file management, however, is handled by the controller that owns the drives, which means the low-level locking information is only available on a particular controller.
Here are the individual controllers in my HA pair:
EcoSystems-A200-A::*> network int show
            Logical       Status     Network            Current            Current Is
Vserver     Interface     Admin/Oper Address/Mask       Node               Port    Home
----------- ------------- ---------- ------------------ ------------------ ------- ----
EcoSystems-A200-A
            A200-01_mgmt1 up/up      10.63.147.141/24   EcoSystems-A200-01 e0M     true
            A200-01_mgmt2 up/up      10.63.147.142/24   EcoSystems-A200-02 e0M     true
If I ssh into the cluster, and the management IP is currently hosted on EcoSystems-A200-01, then the command vserver locks nfsv4 show will only look at NFSv4 locks that exist on the files that are owned by that controller.
If I open an ssh connection to 10.63.147.142 then I'll be able to view the NFSv4 locks for files owned by EcoSystems-A200-02:
EcoSystems-A200-A::*> vserver locks nfsv4 show -lock-uuid 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2
Logical
Interface   Lock UUID                            Lock Type
----------- ------------------------------------ ------------
jfs3_nfs2   72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 share-level
This is where I can see the lease data:
EcoSystems-A200-A::*> vserver locks nfsv4 show -lock-uuid 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 -fields lease-remaining
lif       lock-uuid                            lease-remaining
--------- ------------------------------------ ---------------
jfs3_nfs1 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 9
This particular system is set to a lease-seconds of 10. There's an active Oracle database, which means it's constantly performing IO, which in turn means it's constantly renewing the lease. If I cut the power on that host, you'd see the lease-remaining field count down to 0 and then disappear as the leases and associated locks expire.
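If you want to watch that countdown happen, a simple loop from an admin host does the trick (a sketch; the management address, credentials, and lock UUID here are just the ones from my lab):

while true; do
    ssh admin@10.63.147.142 "vserver locks nfsv4 show -lock-uuid 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 -fields lease-remaining"
    sleep 1
done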
The chance of anyone needing to go into these diag-level details is close to zero, but I was troubleshooting an Oracle dNFS bug related to leases and got to know all these commands. I thought it was worth writing up in case someone else ended up working on a really obscure problem.
So, that's the story on NFSv4. The top takeaways are:
NFSv4 isn't necessarily any better than NFSv3. Use whatever makes sense to you.
NFSv4 includes multiple security enhancements, but there are also other ways to secure an NFS connection.
NFSv4 is way WAY easier to run through a firewall than NFSv3
NFSv4 is much more sensitive to network interruptions than NFSv3, and you may need to tune the ONTAP NFS server.
NFSv4 Bonus Tip:
If you're playing with NFSv4, don't forget the domain. This is also documented in the big NFS TR linked above, but I missed it the first time through, and I've seen customers miss this as well. It's confusing because if you forget it, there's a good chance that NFSv4 will MOSTLY work, but you'll have some strange behavior with permissions.
Here's how my ONTAP systems are configured in my lab:
EcoSystems-A200-A::> nfs server show -vserver jfsCloud3 -fields v4-id-domain
vserver   v4-id-domain
--------- ------------
jfsCloud3 jfs.lab
and this is on my hosts:
[root@jfs0 ~]# more /etc/idmapd.conf | grep Domain
Domain = jfs.lab
They match. If I'd forgotten to update the default /etc/idmapd.conf, weird things would have happened.
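If you do need to set it, it's one parameter on each side (a sketch using my lab names; substitute your own domain):

# ONTAP side
nfs server modify -vserver jfsCloud3 -v4-id-domain jfs.lab

# Client side: set "Domain = jfs.lab" in /etc/idmapd.conf, then clear the idmap cache
nfsidmap -c    # on some distributions, restart rpc.idmapd instead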
Sorry about the delay responding, I was out of the office for a few weeks.

The queuing behavior is an internal architectural detail of the version of ONTAP and the controller I was using. The behavior does vary from version to version. I could probably find more details from the engineering team, but it changes often. The reason is ONTAP is designed to service multiple workloads in a manageable, predictable way. They're always tuning and improving something. Sometimes when you run a single, synthetic, IO-intensive workload you run into unusual patterns, and this happens to be one of them.

Spreading my writes across two volumes slightly improved parallelism, which meant the write latency was a little lower. I was using SLOB to drive the database, which meant many of the read IO's could not happen until the preceding write IO completed. That almost never happens with a regular database. Datafile reads and datafile writes are largely independent of one another. With SLOB, I have that dependency. The result is the tiny difference in write latency gets magnified, and it makes the graphs look odd.

The performance difference between 1 and 2 volumes isn't nearly as significant in the real world as that graph suggests. I almost didn't add that graph, but we do have a lot of customers who host just one Big Huge Database on a single storage system, and they truly want the maximum possible performance. Every microsecond counts. Almost nobody will notice a 10µs write latency difference, but if you do, it's worth spreading a database across volumes.
How many LUNs do I need?
(notes on Consistency Groups in ONTAP)
This post is part 2 of 2. In the prior post, I explained consistency groups and how they exist in ONTAP in multiple forms. I’ll now explain the connection between consistency groups and performance, then show you some basic performance envelopes of individual LUNs and volumes.
Those two things may not seem connected at first, but they are. It all comes down to LUNs.
LUNs are accessed using the SCSI protocol, which has been around for over 40 years and is showing its age. The tech industry has worked miracles improving LUN technology over the years, but limits in host OSs, drivers, HBAs, storage systems, and the drives themselves constrain the performance of a single LUN.
The end result is this – sometimes you need more than one LUN to host your dataset in order to get optimum performance. If you want to take advantage of the advanced features of a modern storage array, you’ll need to manage those multiple LUNs together, as a unit. You’ll need a consistency group.
My previous post explained how ONTAP delivers consistency group management. This post explains how you figure out just how many LUNs you might need in that group, and how to ensure you have the simplest, most easily managed configuration.
Note: There are a lot of performance numbers shown below. They do NOT represent maximum controller performance. I did not do any tuning at all beyond implementing basic best practices. I created some basic storage configurations, configured a database, and ran some tests to illustrate my point. That's it.
ONTAP < 9.9.1
First, let’s go back a number of years to the ONTAP versions prior to the 9.9.1 release. There were some performance boundaries and requirements that contributed to the need for multiple LUNs in order to get optimum performance.
Test Procedure
To compare performance with differing LUN/volume layouts, I wrote a script that built an Oracle database on each of the following configurations:
1 LUN in a single volume
2 LUNs in a single volume
4 LUNs in a single volume
8 LUNs in a single volume
16 LUNs in a single volume
16 LUNs in a single volume
16 LUNs across 2 volumes
16 LUNs across 4 volumes
16 LUNs across 8 volumes
16 LUNs across 16 volumes
Warning: I normally ban anyone from using the term “IOPS” in my presence without providing a definition, because “IOPS” has a lot of different meanings. What’s the block size? Sequential or random ratio? Read/write mix? Measured from where? What’s my latency cutoff? All that matters.
In the graphs below, IOPS refers to random reads, using 8K blocks, as measured from the Oracle database. Most tests used 100% reads.
I used SLOB2 for driving the workload. The results shown below are not the theoretical storage maximums, they're the result of a complicated test using an actual Oracle database where a lot of IO has interdependencies on other IO. If you used a synthetic tool like fio, you’d see higher IOPS.
The question was - "How many LUNs do I need?" These tests used *one* volume. Multiple LUNs, but one volume. Let's say you have a database. How many LUNs do you need in that LVM or Oracle ASM volume group to support your workload? What's the expected performance? Here's the answer to that question when using a single AFF8080 controller prior to ONTAP 9.9.1.
There are three important takeaways from that test:
A single LUN hit the wall at about 35K IOPS.
A single volume hit the wall at about 115K IOPS.
The sweet spot for LUN count in a single volume was about 8, but there was some benefit going all the way to 16.
To rephrase that:
If you had a single workload that didn’t need more than 35K IOPS, just drop it on a single LUN.
If you had a single workload that didn’t need more than 115K IOPS, just drop it on a single volume, but distribute it across 8 LUNs.
If you had more than 115K IOPS, you would have needed more than one volume.
That’s all <9.9.1 performance data, so let see what improved in 9.9.1 and how it erased a lot of those prior limitations and vastly simplified consistency group architectures.
ONTAP >= 9.9.1
Threading is important for modern storage arrays, because they are primarily used to support multiple workloads. On occasion, we’ll see a single database hungry enough to consume the full performance capabilities of an A900 system, but usually we see dozens of databases hosted per storage system.
We have to strike a balance between providing good performance to individual workloads while also supporting lots of independent workloads in a predictable, manageable way. Without naming names, there are some competitors out there whose products offer impressive performance with a single workload but suffer badly from poor and unpredictable performance with multiple workloads. One database starts stepping on another, different LUNs outcompete others for attention from the storage OS, and things get bad. One of the ways storage systems manage multiple workloads is through threading, where work is divided into queues that can be processed in parallel.
ONTAP 9.9.1 included many improvements to internal threading. In prior versions, SAN IO was essentially being serviced in per-volume queues. Normally, this was not a problem. Controllers would be handling multiple workloads running with a lot of parallelism, all the queues stayed busy all the time, and it was easy for customers to reach platform maximum performance.
Most of my work is in the database space, and we’d often have the One Big Huge Giant Database challenge. I’ve architected systems where a single database ate the maximum capabilities of 12, yes twelve controllers. If you only have one workload, it can be difficult trying to create a configuration that ensures all those threads are busy and processing IO all the time. You had to be careful to avoid having one REALLY busy queue, while others would be idle. The result is leaving potential performance on the table, and you would not get maximum controller performance.
Those concerns are 99% gone as of 9.9.1. There are still threaded operations, of course, but overall, the queues that led to those performance concerns don’t exist anymore. ONTAP services SAN IO more like a general pool of FC operations, spread across all the CPUs all the time.
To illustrate, let’s start with the same set of tests I showed for <9.9.1, with a single volume and varying numbers of LUN in the diskgroup:
I see four important takeaways here:
A single LUN yields about 4X more IOPS than before.
A single LUN not only delivers 4X more IOPS, but the latency is also about 40% lower.
A single volume (with 8 LUNs) yields about 2X more IOPS
A single volume (with 8 LUNs) delivers 2X more IOPS and with 40% lower latency.
That might seem simple, but there are a lot of implications to those four points. Here are some of the things you need to understand.
ONTAP is only part of the picture
The graph above does show that two LUNs are faster than one LUN, but it doesn't say why. It's not really ONTAP that is the limiting factor, it's the SCSI protocol itself. Even if ONTAP were infinitely fast and delivered FC LUN latency of 0µs, it couldn't service IO it hasn't received yet.
You also have to think about the host-side limits. Hosts have queues too: per-LUN queues, per-path queues, and HBA queues. You still need some parallelism up at the host level to get maximum performance.
In the tests above, you can see incremental improvements in performance as we bring more LUNs into play. I’m sure some of the benefits are a result of ONTAP parallelizing work better, but that’s only a small part of it. Most of the benefits flow from having more LUNs driven by the OS itself.
The reason I wanted to explain this is because we have a lot of support cases about performance that aren’t exactly complaints, but are instead more like “Why isn’t my database faster than it is?” There’s always a bottleneck somewhere. If there wasn’t, all storage operations would complete in 0 microseconds, database queries would complete in 0 milliseconds, and servers would boot in 0 seconds.
We often discover that whatever the performance bottleneck might be, it ain't ONTAP. The performance counters show the controller is nowhere near any limits, and in many cases, ONTAP is outright bored. The limit is usually up at the host. In my experience, the #1 cause of SAN performance complaints is an insufficient number of LUNs at the host OS layer. We therefore advise the customer to add more LUNs so they can increase the parallelism through the host storage stack.
Yes, LUNS simply got faster
A lot of customers had single-LUN workloads that suddenly became a lot faster, because they updated to 9.9.1 or higher. Maybe it was a boot LUN that got faster and now patching is peppier. Maybe there was an application on a single LUN that included an embedded database, and now that application is suddenly a lot more responsive.
A volume of LUNs got faster too
Previously, I maxed out SAN IOPS in a single volume at about 110K IOPS. The limit roughly doubled to 240K IOPS in 9.9.1. That’s a big increase. IO-intensive workloads that previously required multiple volumes can be consolidated to a single volume. That means simpler management. You can create a single snapshot, clone a single volume, set a single QoS policy, or configure a single SnapMirror replication relationship.
Even if you don’t need the extra IOPS, you still get better performance
The latency dropped, too. Even a smaller database that only required 25K IOPS and was happily running on a single volume prior to 9.9.1 should see noticeably improved performance, because the response times of those individual 25K IOPS got better. Application response times get better, queries complete faster, and end users get happier.
How Many Volumes Do I Need?
I’d like to start by saying there is no best practice suggesting the use of one LUN per volume. I don’t know for sure where this idea originated, but I think it came from a very old performance benchmark whitepaper that included a 1:1 LUN:Volume ratio.
As mentioned above, it used to be important to distribute a workload across volumes in some cases, but it mostly only applied to single-workload configuration. If we were setting up a 10-node Oracle RAC cluster, and we wanted to push performance to the limit, and we wanted to get every possible IOP with the lowest possible latency, then we’d need perhaps 16 volumes per controller. There were often only a small number of LUNs on the system as a whole, so we may have used a 1:1 LUN:Volume ratio.
We didn’t HAVE to do that, and it’s in no way a best practice. We often just wanted to squeeze out a few extra percentage points of performance.
Also, don’t forget that there’s no value in unneeded performance. Configure what you need. If you only need 80K IOPS, do yourself a favor and configure a 2-LUN or perhaps 4-LUN diskgroup. It’s not hard to create more LUNs if you need them, but why do that? Why create unnecessary storage objects to manage? Why clutter up the output of commands like “lun show” with extra items that aren’t providing value? I often use the post office as an analogy – a 200MPH vehicle is faster than a 100MPH vehicle, but the neighborhood postal carrier won’t get any benefit from that extra performance.
If you have an unusual management need where one-LUN-per-volume makes more sense, that’s fine, but you have more things to manage, too. Look at the big picture and decide what’s best for you.
Want proof that multiple volumes don’t help? Check this out.
It’s the same line! In this example, I created a 16-LUN volume group and compared performance between configurations where those 16 LUNs were in a single volume, 2 volumes, 4, 8, and 16. There’s literally no difference, nor should there be. As mentioned above, ONTAP SAN processing as of 9.9.1 does not care if the underlying LUNs were located in different volumes. The FC IO was processed as a common pool of FC IO operations.
Things get a little different when you introduce writes, because there still is some queuing behavior related to writes that may be important to you.
Write IO processing
If you have heavy write IO, you might want more than one volume. The graphs below illustrate the basic concepts, but these are synthetic tests. In the real world, especially with databases, you get different patterns of IO interdependencies.
For example, picture a banking database used to support online banking activity by customers. That will be mostly concurrent activity where a little extra latency doesn’t matter. If you need to withdraw money at the ATM, would you care if it took 2.001 seconds rather than the usual 2 seconds?
In contrast, if you have a banking database used for end-of-day processing, you have dependencies. Read #352 might only occur after read #351 has completed. A small increase in latency can have a ripple effect on the overall workload.
The graphs below show what happens when one IO depends on a prior IO and latency increases. It’s also a borderline worst-case scenario.
First, let’s look at a repeat of my first 9.9.1 test, but this time I’m doing 70% reads and 30% writes. What happens?
The maximum measured IOPS dropped. Why? The reason is that writes are more expensive for a storage array to complete than reads. Obviously, platform maximums will be reduced as write IO becomes a larger and larger percentage of the total, but this is just one volume. I'm nowhere near controller maximums. Performance remains awesome. I'm at about 150µs latency for most of the curve, and even at 100K IOPS, I'm only at 300µs of latency. That's great, but it is slower than the 100% read test.
What you’re seeing is the result of read IOPS getting held back by the write IOPS. There were more IOPS available to my database from this volume, but they weren’t consumed, because my database was waiting on write IO to complete. The result is that the total IOPS dropped quite a bit.
Multi-Volume write IOPS
Here’s what happens when I spread these LUNs across two volumes.
Looks weird, doesn’t it? Why would 2 volumes be 2X as fast as a single volume, and why would 2, 4, 8, and 16 volumes perform about the same?
The reason is that ONTAP is establishing queues for writes. If I want to maximize write IOPS, I’m going to need more queues, which will require more volumes. The exact behavior can change between configurations and platforms, so there’s no true best practice here. I’m just calling out the potential need to spread your database across more than one volume.
Key takeaways:
If I have 16 LUNs, there is literally no benefit to splitting them amongst multiple volumes with a 100% read workload. Look at that earlier graph. The datasets all graphed as a single line.
Two volumes with a 70% read workload showed a big improvement going from 1 volume to 2, but then nothing further. That’s because, in my configuration, there are two queues for write processing within ONTAP. Two volumes are no different than 3 or 4 or 5 in terms of keeping those queues busy.
I also want to repeat – the graphs are the worst-case scenario. A real database workload shouldn’t be affected nearly as much, because reads and writes should be largely decoupled from one another. In my test, there are about two reads for each write with limited parallelization, and those reads do not happen until the write completes. That does happen with real-world database workloads, but very rarely. For the most part, real database read operations do not have to wait for writes to complete.
Summary
To recap:
If you’re using ONTAP <9.9.1 with FC SAN, upgrade. We’ve observed LUNs deliver 4X more IOPS at 40% lower latency.
Once you get to ONTAP 9.9.1 (or higher):
A single LUN is good for around 100K IOPS on higher-end controllers. That’s not an ONTAP limit, it’s an “all things considered” limit that is a result of ONTAP limits, host limits, network limits, typical IO sizes, etc. I’ve seen much, much better results in certain configurations, especially ESX. I’m only suggesting 100K as a rule-of-thumb.
For a single workload, a 4-LUN volume group on a single volume can hit 200K with no real tuning effort. More LUNs in that volume are desirable in some cases (especially with AIX due to its known host FC behavior), but it’s probably not worth the effort for typical SAN workloads.
If you know you’ve got a very, very write-heavy workload, you might want to split your workload into two volumes. If you’re that concerned about IOPS, you probably did that anyway, simply because you probably chose to distribute your LUNs across controllers. That’s a common practice – split each workload evenly across all controllers to achieve maximum performance, as well as guaranteed even loading across the entire cluster.
Lastly, don’t lose perspective.
It’s nice to have an AFF system with huge IOPS capabilities for the sake of consolidating lots of workloads, but I find admins obsess too much about individual workloads and targeting hypothetical performance levels that offer no real benefits.
I look at a lot of performance stats, and virtually every application and database workload I see plainly shows no storage performance bottleneck whatsoever. The performance limits are almost universally the SQL code, the application code, available raw network bandwidth, or Oracle RAC cluster contention. Storage is usually less than 5% of the problem. The spinning-disk days of spending your way out of performance problems are over.
Storage performance sizing should be about determining actual performance requirements, and then architecting the simplest, most manageable solution possible. The SAN improvements introduced in ONTAP 9.9.1 noticeably improve manageability as well as performance.
Consistency Groups in ONTAP
There’s a good reason you should care about CGs – it’s about manageability.
If you have an important application like a database, it probably involves multiple LUNs or multiple filesystems. How do you want to manage this data? Do you want to manage 20 LUNs on an individual basis, or would you prefer just to manage the dataset as a single unit?
This post is part 1 of 2. First, I will explain what we mean when we talk about consistency groups (CGs) within ONTAP.
Part II covers the performance aspect of consistency groups, including real numbers on how your volume and LUN layout affects (and does not affect) performance. It will also answer the universal database storage question, “How many LUNs do I need?” Part II will be of particular interest to long-time NetApp users who might still be adhering to out-of-date best practices surrounding performance.
Volumes vs LUNs
If you’re relatively new to NetApp, there’s a key concept worth emphasizing – volumes are not LUNs.
Other vendors use those two terms synonymously. We don’t. A Flexible Volume, also known as a FlexVol, or usually just a “volume,” is just a management container. It’s not a LUN. You put data, including NFS/SMB files, LUNs, and even S3 objects, inside of a volume. Yes, it does have attributes such as size, but that’s really just accounting. For example, if you create a 1TB volume, you’ve set an upper limit on whatever data you choose to put inside that volume, but you haven’t actually allocated space on the drives.
This sometimes leads to confusion. When we talk about creating 5 volumes, we don’t mean 5 LUNs. Sometimes customers think that they create one volume and then one LUN within that volume. You can certainly do that if you want, but there’s no requirement for a 1:1 mapping of volume to LUN. The result of this confusion is that we sometimes see administrators and architects designing unnecessarily complicated storage layouts. A volume is not a LUN.
Okay then, what is a volume?
If you go back about eighteen years, an ONTAP volume mapped to specific drives in a storage controller, but that’s ancient history now.
Today, volumes are there mostly for your administrative convenience. For example, if you have a database with a set of 10 LUNs, and you want to limit the performance for the database using a specific quality of service (QoS) policy, you can place those 10 LUNs in a single volume and slap that QoS policy on the volume. No need to do math to figure out per-LUN QoS limits. No need to apply QoS policies to each LUN individually. You could choose to do that, but if you want the database to have a 100K IOPS QoS limit, why not just apply the QoS limit to the volume itself? Then you can create whatever number of LUNs that are required for the workload.
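As a sketch of what that looks like in practice (the SVM, policy, and volume names here are made up; the commands are the standard QoS policy group workflow):

qos policy-group create -vserver svm_oracle -policy-group pg_payroll_100k -max-throughput 100000iops
volume modify -vserver svm_oracle -volume payroll_data -qos-policy-group pg_payroll_100k

Every LUN you create in payroll_data afterwards automatically falls under that same 100K IOPS ceiling.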
Volume-level management
Volumes are also related to fundamental ONTAP operations, such as snapshots, cloning, and replication. You don’t selectively decide which LUN to snapshot or replicate, you just place those LUNs into a single volume and create a snapshot of the volume, or you set a replication policy for the volume. You’re managing volumes, irrespective of what data is in those volumes.
It also simplifies how you expand the storage footprint of an application. For example, if you add LUNs to that application in the future, just create the new LUNs within the same volume. They will automatically be included in the next replication update, the snapshot schedule will apply to all the LUNs, including the new ones, and the volume-level QoS policy will now apply to IO on all the LUNs, including the new ones.
You can selectively clone individual LUNs if you like, but most cloning workflows operate on datasets, not individual LUNs. If you have an LVM with 20 LUNs, wouldn’t you rather just clone them as a single unit than perform 20 individual cloning operations? Why not put the 20 LUNs in a single volume and then clone the whole volume in a single step?
Conceptually, this makes ONTAP more complicated, because you need to understand that volume abstraction layer, but if you look at real-world needs, volumes make life easier. ONTAP customers don't buy arrays for just a single LUN, they use them for multiple workloads with LUN counts going into the tens of thousands.
There’s also another important term for a “volume” that you don’t often hear from NetApp. The term is “consistency group,” and you need to understand it if you want maximum manageability of your data.
What’s a Consistency Group?
In the storage world, a consistency group (CG) refers to the management of multiple storage objects as a single unit. For example, if you have a database, you might provision 8 LUNs, configure it as a single logical volume, and create the database. (The term CG is most often used when discussing SAN architectures, but it can apply to files as well.)
What if you want to use array-level replication to protect that database? You can't just set up 8 individual LUN replication relationships. That won't work, because the replicated data won't be internally consistent across those LUNs. You need to ensure that all 8 replicas of the source LUNs are consistent with one another, or the database will be corrupt.
This is only one aspect of CG data management. CGs are implemented in ONTAP in multiple ways. This shouldn’t be surprising – an ONTAP system can do a lot of different things. The need to manage datasets in a consistent manner requires different approaches depending on the chosen NetApp storage system architecture and which ONTAP feature we’re talking about.
Consistency Groups – ONTAP Volumes
The most basic consistency group is a volume. A volume hosting multiple LUNs is intrinsically a consistency group. I can’t tell you how many times I’ve had to explain this important concept to customers as well as NetApp colleagues simply because we’ve historically never used the term “consistency group.”
Here’s why a volume is a consistency group:
If you have a dataset and you put the dataset components (LUNs or files) into a single ONTAP volume, you can then create snapshots and clones, perform restorations, and replicate the data in that volume as a single consistent unit. A volume is a consistency group. I wish we could update every reference to volumes across all the ONTAP documentation in order to explain this concept, because if you understand it, it dramatically simplifies storage management.
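For example, a single command protects every LUN or file in that volume at the same instant (a sketch; the names are placeholders):

volume snapshot create -vserver svm_oracle -volume payroll_data -snapshot pre_upgrade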
Now, there are times where you can’t put the entire dataset in a single volume. For example, most databases use at least two volumes, one for datafiles and one for logs. You need to be able to restore the datafiles to an earlier point in time without affecting the logs. You might need some of that log data to roll the database forward to the desired point in time. Furthermore, the retention times for datafile backups might differ from log backups.
We have a solution for that, too, but first let’s talk about MetroCluster.
Consistency Groups & MetroCluster
While regular ol’ ONTAP volumes are indeed consistency groups, they’re not the only implementation of CGs in ONTAP. The need for data consistency appears in many forms. SyncMirrored aggregates are another type of CG that applies to MetroCluster.
MetroCluster is a screaming fast architecture, providing RPO=0 synchronous mirroring, mostly used for large-scale replication projects. If you have a single dataset that needs to be replicated to another site, MetroCluster probably isn’t the right choice. There would probably be simpler options.
If, however, you’re building an RPO=0 data center infrastructure, MetroCluster is awesome, because you’re essentially doing RPO=0 at the storage system layer. Since we’re replicating everything, we can do replication at the lowest level – right down at the RAID layer. The storage system doesn’t know or care about where changes are coming from, it just replicates each little write-to-drives to two different locations. It’s very streamlined, which means it’s faster and makes failovers easier to execute and manage, because you’re failing over “the storage system” in its entirety, not individual LUNs.
Here's a question, though. What if I have 20 interdependent applications and databases and datasets? If a backhoe cuts the connection between sites, is all that data at the remote site still consistent and usable? I don’t want one database to be ahead in time from another. I need all the data to be consistent.
As mentioned before, the individual volumes are all CGs unto themselves, but there’s another layer of CG, too – the SyncMirror aggregate itself. All the data on a single replicated MetroCluster aggregate makes up a CG. The constituent volumes are consistent with one another. That’s a key requirement to ensure that some of the disaster edge cases, such as rolling disasters, still yield a surviving site that has usable, consistent data and can be used for rapid data center failover. In other words, a MetroCluster aggregate is a consistency group, with respect to all the data on that aggregate, which guarantees data consistency in the event of sudden site loss.
Consistency Groups & API’s
Let’s go back to the idea of a volume as a consistency group. It works well for many situations, but what if you need to place your data in more than one volume? For example, what if you have four ONTAP controllers and want to load up all of them evenly with IO? You’ll have four volumes. You need consistent management of all four volumes.
We can handle that, too. We have yet another consistency group capability that we implement at the API level. We did this about 20 years ago, originally for Oracle ASM diskgroups. Those were the days of spinning drives, and we had some customers with huge Oracle databases that were both capacity-hungry and IOPS-hungry to the point they required multiple storage systems.
How do you get a snapshot of a set of 1000 LUNs spread across 12 different storage systems? The answer is “quite easily,” and this was literally my second project as a NetApp employee. You use our consistency group API’s. Specifically, you’d make an API call for “cg-start” targeting all volumes across various systems, then call “cg-commit” on all those storage systems. If all those cg-commit API calls report a success, you know you have a consistent set of snapshots that could be used for cloning, replication, or restoration.
You can do this with a few lines of scripting, and we have multiple management products, including SnapCenter, that make use of those APIs to perform data consistent operations.
These APIs are also part of the reason everyone, including NetApp personnel, often forget that an ONTAP volume is a consistency group. We had those APIs that had the letters “CG” in them, and everyone subconsciously started to think that this must be the ONLY way to work with consistency groups within ONTAP. That’s incorrect; the cg-start/cg-commit API calls are merely one way ONTAP delivers consistency group-based management.
Consistency Groups & SM-BC
SnapMirror Business Continuity (SM-BC) is similar to MetroCluster but provides more granularity. MetroCluster is probably the best solution if you need to replicate all or nearly all the data on your storage system, but sometimes you only want to replicate a small subset of total data.
SM-BC almost didn’t need to support any sort of “consistency group” feature. We could have scoped that feature to just single volumes. Each individual volume could have been replicated and able to be failed over as a single entity.
However, what if you needed a business continuity plan for three databases, one application server, and all four boot LUNs? Sure, you might be able to put all that data into a single volume, but it’s likely that your overall data protection, performance, monitoring, and management needs would require the use of more than one volume.
Here’s how that affects data consistency with SM-BC. Say you’ve provisioned four volumes. The key is that a business continuity plan requires all 4 of those volumes entering and exiting a consistent replication state as a single unit.
We don’t want to have a situation where the storage system is recovering from an interruption in site-to-site connectivity with one volume in an RPO=0 state, while the other three volumes are still synchronizing. A failure at that moment would leave you with mismatched volumes at the destination site. One of them would be later in time than others. That’s why we base your SM-BC relationships on CGs. ONTAP ensures those included volumes enter and exit an RPO=0 state as a single unit.
Native ONTAP Consistency Groups
Finally, ONTAP also allows you to configure advanced consistency groups within ONTAP itself. The results are similar to what you’d get with the API calls I mentioned above, except now you don’t have to install extra software like SnapCenter or write a script.
Here’s an example of how you might use ONTAP Consistency Groups:
In this example, I have an Oracle database with datafiles distributed across four volumes located on four different controllers. I often do that to guarantee the IO load is evenly distributed across every controller in the cluster. I also have my logs in three volumes, plus a volume for my Oracle binaries.
The point of the ONTAP Consistency Group feature is to let users manage applications and application components without worrying about individual LUNs and volumes. Once I create this CG (which is composed of two child CGs), I can do things like schedule snapshots of the application itself. The result is a CG snapshot of the entire application, and I can use those snapshots for cloning, restoration, or replication.
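If you want to build that structure through the REST API, the request looks roughly like the sketch below. The endpoint is real, but treat the payload as an assumption to verify against the ONTAP REST API reference for your release, especially the way existing volumes are attached to the child CGs. The hostname, credentials, SVM, and volume names are all made up.

# Sketch: define a parent CG ("oradb01") with two child CGs, one for datafiles
# and one for logs. The payload schema here is an assumption and should be
# checked against the REST API docs for your ONTAP release.
import requests
from requests.auth import HTTPBasicAuth

ONTAP = "https://cluster1.example.com"
AUTH = HTTPBasicAuth("admin", "password")

payload = {
    "name": "oradb01",
    "svm": {"name": "svm_oracle"},
    "consistency_groups": [
        {"name": "Datafiles",
         "volumes": [{"name": v} for v in
                     ["ora_data1", "ora_data2", "ora_data3", "ora_data4"]]},
        {"name": "Logs",
         "volumes": [{"name": v} for v in
                     ["ora_log1", "ora_log2", "ora_log3"]]},
    ],
}

r = requests.post(f"{ONTAP}/api/application/consistency-groups",
                  json=payload, auth=AUTH, verify=False)  # lab sketch; validate certificates in production
r.raise_for_status()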
I can also work at a more granular level. For example, I could do a traditional Oracle hot backup procedure as follows:
“alter database begin backup;”
POST /application/consistency-groups/(Datafiles)/snapshots
“alter database end backup;”
“alter system archive log current;”
POST /application/consistency-groups/(Logs)/snapshots
The result is a set of snapshots, one covering the datafiles and one covering the logs, which can be recovered using a standard Oracle recovery procedure.
Specifically, the datafiles were in backup mode when a snapshot of the first CG was taken. That’s the starting point for a restoration. I then removed the database from backup mode and forced a log switch before making the API call to create a snapshot of the log CG. The snapshot of the log CG now contains the required logs for making that datafile snapshot consistent.
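Wired together, the whole sequence might look something like this sketch. The run_sql() helper is hypothetical (in real life you’d use sqlplus or python-oracledb), the CG UUIDs and credentials are placeholders, and the snapshot payload fields should be verified against the ONTAP REST API reference for your release.

# Sketch of the hot backup flow above. The CGs are addressed by UUID, as in the
# ONTAP REST API; everything specific to a site (host, credentials, UUIDs,
# snapshot names) is illustrative.
import requests
from requests.auth import HTTPBasicAuth

ONTAP = "https://cluster1.example.com"
AUTH = HTTPBasicAuth("admin", "password")
DATAFILES_CG = "11111111-2222-3333-4444-555555555555"   # UUID of the Datafiles child CG
LOGS_CG = "66666666-7777-8888-9999-000000000000"        # UUID of the Logs child CG

def cg_snapshot(cg_uuid, snap_name):
    """POST a snapshot request against one consistency group."""
    r = requests.post(
        f"{ONTAP}/api/application/consistency-groups/{cg_uuid}/snapshots",
        json={"name": snap_name},
        auth=AUTH,
        verify=False,   # lab sketch; validate certificates in production
    )
    r.raise_for_status()

def run_sql(statement):
    """Hypothetical helper: execute one SQL statement as SYSDBA."""
    raise NotImplementedError

# 1. Put the datafiles into backup mode, then snapshot the Datafiles CG
run_sql("alter database begin backup;")
cg_snapshot(DATAFILES_CG, "hotbackup_data")

# 2. End backup mode, force a log switch, then snapshot the Logs CG
run_sql("alter database end backup;")
run_sql("alter system archive log current;")
cg_snapshot(LOGS_CG, "hotbackup_logs")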
(Note: you haven’t really needed to place an Oracle database in backup mode since 12cR1, but most DBAs are more comfortable with that additional step.)
Those two sets of snapshots constitute a restorable, clonable, usable backup. I’m not operating on LUNs or filesystems; I’m making API calls against CGs. It’s application-centric management. There’s no need to change my automation strategy as the application evolves over time and I add new volumes or LUNs, because I’m just operating on named CGs. It even works the same with SAN and file-based storage.
We’ve got all sorts of ideas about how to keep expanding this vision of application-centric storage management, so keep checking in with us.
You'll want to open a support case for this. The answer is probably buried in the SCO job logs, and you'll need someone to parse through them for you.
You're correct. They've normalized everything to a bandwidth limit. If you read carefully, you'll see statements like a volume offers "100 IOPS per GB (8K IO size)," which really means 800KB/sec per GB; they just divided that bandwidth by an 8K IO size to get 100 IOPS. They could just as easily have described it as 200 IOPS per GB at a 4K IO size. It's the same thing. Normally this is all pretty unimportant, but with databases you have a mix of IO types. Random IOs are a lot more work to process, so it's nice to be able to limit actual IOPS. In addition, a small number of huge-block sequential IOs can consume a lot of bandwidth, so it's nice to be able to limit bandwidth independently. There's more material on this at https://tv.netapp.com/detail/video/6211770613001/a-guide-to-databasing-in-the-aws-cloud?autoStart=true&page=4&q=oracle starting at about the 2:45 mark.
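Here's that normalization spelled out, using the numbers quoted above:

# The "100 IOPS per GB (8K IO size)" wording is a per-GB bandwidth cap that can
# be restated at any assumed IO size.
per_gb_limit_bytes = 100 * 8 * 1024            # 100 IOPS x 8 KiB = 819,200 B/s, roughly 800 KB/s per GB
iops_at_4k = per_gb_limit_bytes / (4 * 1024)   # same cap restated at a 4K IO size = 200 IOPS per GB
print(per_gb_limit_bytes, iops_at_4k)          # 819200 200.0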
IOPS should refer to individual IO operations. A typical ERP database might reach 10K IOPS during random IO operations. That means 10,000 discrete, individual 8K block operations per second, which works out to about 80MB/sec. When the database does a full table scan or an RMAN backup, the IOs are normally issued in 1MB chunks. The OS may break those down, but the database is trying to do 1MB IOs, which means just 80 IOPS will consume 80MB/sec of bandwidth. The end result is that a true IOPS-based QoS control throttles random IO at a much lower total bandwidth than sequential IO. That's normally okay, because storage arrays have an easier time with large-block IO; it's fine to let a host consume lots of bandwidth for sequential work.

It's easy to make a sizing error with databases when QoS is involved, and it happens on-premises too. A storage admin might not realize that during the day the database needs 10K IOPS at an 8K IO size (80MB/sec), but at night it needs 800 IOPS at a 1MB IO size (800MB/sec). With a pure bandwidth QoS limit, like 100MB/sec, you'll be safely under the limit most of the time during normal random IO, but those late-night reports and RMAN backups slam into the 100MB/sec ceiling.

That's why you really, really need snapshots with a database in the public cloud. They're the only way to avoid bulk data movement. If you have to size for RMAN bulk data transfers, you end up with storage that is 10X more powerful and expensive than you need outside the backup window.

One neat thing one of our customers did to address the Oracle full table scan problem in the public cloud was Oracle In-Memory. They spent more on the In-Memory licenses and more on RAM for the VM, but the result was dramatically less load on the storage system. That saved money, and more importantly, they were able to meet their performance targets in the public cloud. It's a perfectly obvious use case for In-Memory; it was just nice to see proof that it worked as predicted.
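Here's the sizing math from that example in a few lines:

# Same database, flat 100 MB/s cap: daytime random IO fits, nightly sequential IO does not.
CAP_MB_S = 100

daytime_mb_s = 10_000 * 8 / 1024   # 10K random IOPS at 8K is about 78 MB/s
nightly_mb_s = 800 * 1             # 800 sequential IOPS at 1MB is 800 MB/s

print(daytime_mb_s < CAP_MB_S, nightly_mb_s < CAP_MB_S)   # True False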
FlashCache actually might still help with the backup situation. The basic problem here is the QoS limit that the cloud providers place on their storage: they call it IOPS, but it's really bandwidth. We've seen a lot of customers run into this exact problem.

Say you had an ancient HDD array. You could run backups AND the usual database random IO pretty easily, because those large-block backup IOs are easy for a storage array to process. They're nice big IOs, and the array can do readahead. When we used to size HDD database storage, we'd always focus on the random IOPS because that controlled the number of drives that went into the solution. The sequential IO, like backups and many reporting operations, was almost free: you paid for the random IO, and we threw in the sequential IO at no extra cost. If you looked at the numbers, a typical HDD array might only deliver around 250MB/sec of random IO, but it could easily do 2.5GB/sec of large-block sequential IO.

Public cloud storage doesn't give you that "free" sequential IO. The providers enforce strict bandwidth controls, and the result is that customers are often surprised that everything about their database works just fine except the backup, or maybe that one late-night report with a full table scan. The day-to-day random IOPS fit within the capabilities of the backend storage, but the sequential work slams into the limits relatively easily.

FlashCache ought to ease the pressure, because the IOs serviced by the FlashCache layer never touch the backend disks. I'd recommend limiting the number of RMAN channels too, because some IO will still need to reach those backend disks.
I've seen synthetic IO tests with FlashCache on CVO, and the results are amazing, as they should be. FlashCache was a miracle for database workloads when it first came out (I was in NetApp PS at the time) because it brought down the average latency of HDDs. It works the same with CVO: the backend drives for AWS and Azure native storage are flash, but they're still a shared resource and nowhere near as fast as an actual on-premises all-flash array. FlashCache on that block of NVMe storage on the CVO instance does the same thing - it brings down the average latency.

I don't think there's a way to monitor the burst credits, but the providers publish the math for how they're calculated. You'll exhaust the burst credits pretty quickly with backups, so it's probably not going to help with that scenario. I did some tests a few years back where the burst credits were really confusing my results until I figured out what was happening.

With respect to snapshots, check out TR-4591, which includes some material on how to use plain snapshots right on ONTAP itself. SnapCenter is often the best option, but not always. What you want is a snapshot; there are multiple ways to get it.