As far as I know, the same APIs should work on all platforms within reason. Obviously you can't make a call for MetroCluster switchovers if you're not using a MetroCluster in the first place, and an NFS-related API call shouldn't work on an ASA because there are no file protocols on an ASA. Other than that, ONTAP should be ONTAP. What API call did you encounter that failed? If you tried to disable TSSE, that might have been the problem, because TSSE cannot be turned off on C-Series systems. That's documented in the overall C-Series material, but the restriction should be reiterated in the API documentation. I'll file a doc update request on that.
Active-active data center (with Oracle!)
I've worked with Oracle customers on DR solutions for 15+ years. The perfect solution would, of course, be RPO=0 and RTO=0*, but not all applications can tolerate the write latency involved in an RPO=0 synchronous solution. Sometimes you have to settle for an RPO of 15 minutes or a slightly longer RTO.
Sometimes, however, RPO=0 and RTO=0 are required because the data is really that critical.
We've been able to do this with SnapMirror active sync (formerly known as SnapMirror Business Continuity) for a while, but now we can do it in symmetric active-active mode. You can now have two clusters in two completely different sites, each serving data, with identical performance characteristics, and you don't even need to extend the SAN across sites.
This is the foundation of what customers call "active-active data center". There is no primary site and DR site. There are just two sites. Half your database is running on site A, and the other half is running on site B. Each local storage system will service all read IO from its local copy of the data. Write IO will, of course, be replicated to the opposite site before being acknowledged, because that's how synchronous mirroring works. Symmetric storage IO means symmetric database responses and symmetric application behavior.
SnapMirror active sync in active-active mode is in tech preview now with select customers. Oracle RAC is not yet a supported configuration, but there's no technical reason it shouldn't work, and I wanted to be ready for this feature to become generally available. I've been cutting power and network links for the past couple weeks, and I haven't managed to crash my database yet.
*Note: There's really no such thing as RTO=0, because it takes a certain amount of time to know whether recovery procedures are even warranted. You don't want a total disaster failover triggered just because a single IO operation didn't complete in one second. I still consider SnapMirror active sync an RTO=0 solution because the environment is already running at the opposite site. The lag in resuming operations isn't caused by the failover itself; it's that it can take 15-30 seconds, even under automation, to be sure that failover is actually required.
I'm developing reference architectures with and without a 3rd site Oracle RAC tiebreaker and plan to release some accompanying videos, but here's an overview of how it works. Take a look at the diagram and then continue reading to understand the value.
Architecture
This is a typical Oracle RAC configuration with a database called NTAP, with two instances, NTAP1 and NTAP2. The diagram might look complicated at first, but here's the key to understanding it:
SnapMirror active sync is invisible
From an Oracle and host point of view, this is just one set of LUNs on a single cluster. The replication is invisible. It's the same set of LUNs at both sites. I haven't even stretched the SAN across sites, although I could have done that if I wanted to. I'd rather not create a cross-site ISL if I don't have to.
When I installed RAC, I had a couple of hosts that each had a set of 3 LUNs to be used for quorum management. These hosts, jfs12 and jfs13, each see the same LUNs with the same serial numbers and the same data.
When I created the database, I created an 8-LUN ASM diskgroup for the datafiles and an 8-LUN ASM diskgroup for logs. It doesn't matter which host I use to make the database. They're both using the same LUNs.
Think of it as one single system with paths that happen to exist at two different sites. Any path on either cluster leads to the same LUN.
SnapMirror active sync is symmetric
Database connections can now be made to either instance. If that instance needs to perform a read, the data will be retrieved from the local drives. Writes will be replicated to the opposite site before being acknowledged, of course, so site-to-site latency needs to be as low as possible.
It doesn't matter which site you're using. Database performance is the same, unless you intentionally used different controller models with differing performance limits. This is a valid choice. Maybe you want RPO=0/RTO=0 but one of your sites is designed to be just a temporary site, and doesn't require the same storage horsepower as the other site.
SnapMirror active sync is resilient
This is the part I'm still working on documenting. There's a mediator service that acts as a heartbeat to detect controller failures. The mediator isn't an active tiebreaker service, but it's the same idea. It works as an alternate communication channel for each cluster to check the health of the opposite cluster. For example, if the cluster on site B suddenly fails, the cluster on site A will lose the ability to contact cluster B either directly or via the mediator. That allows cluster A to release the mirroring and resume operations.
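The decision logic can be sketched roughly like this. To be clear, this is only my own conceptual model of the behavior, not NetApp's actual implementation, and the function names are invented:

```python
# Conceptual sketch of mediator-assisted failure detection.
# Names and structure are hypothetical; this is not ONTAP internals.

def partner_is_dead(direct_link_ok: bool, mediator_sees_partner: bool) -> bool:
    """A cluster only declares its partner dead when BOTH channels agree.

    If only the direct replication link is down, it could be a network
    partition, so the cluster must NOT break the mirror unilaterally.
    """
    return (not direct_link_ok) and (not mediator_sees_partner)

def decide(direct_link_ok: bool, mediator_sees_partner: bool) -> str:
    if partner_is_dead(direct_link_ok, mediator_sees_partner):
        return "break mirror, resume unmirrored I/O"
    if not direct_link_ok:
        return "pause writes, wait for mediator verdict"
    return "continue synchronous replication"

print(decide(False, False))  # break mirror, resume unmirrored I/O
print(decide(False, True))   # pause writes, wait for mediator verdict
```

The key point the sketch illustrates: losing only the direct link produces a pause, not a failover. That's the split-brain protection the alternate communication channel provides.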
Overall, "it just works". For example, my initial tests involved simply cutting the power at one site. Here's what happened:
One set of paths ceased responding, while the other set of paths remained available
All write IO paused because it was no longer possible to replicate the writes
After about 30 seconds, the surviving site considered the site with the power failure truly dead and broke the mirroring so the surviving site could resume operations
The Oracle instance on the failed site continued to try to contact storage for a full 200 seconds. This is the default timeout setting with RAC. You can change it if required.
After the 200 second expiration, the Oracle instance performed a self-reboot. It does this to help protect data from corruption due to a lingering IO operation stuck in a retry loop on the host.
This also means that stalled transactions on the failed node were held for 200 seconds before being replayed on the surviving node. This is a good example of how the RTO of a storage system is not the only factor affecting failover times.
The recovery process was unexpectedly seamless:
Power was restored to the failed storage system
It took about 8 minutes to fully power up, self-test, boot, and resume clustered operations
The surviving site detected the return of the other site.
The mirror was asynchronously resynchronized to get the states of site A and site B really close together. This took about 5 minutes.
The mirror then transitioned to synchronized state
The Oracle server detected the presence of SAN paths
The Oracle RAC process, which had been delaying the boot process, found usable RAC quorum devices
The database instance came up again
That was a pleasant surprise. I expected more recovery work to be required, but it was just "turn the power back on" and everything went back to normal.
I've got more to do, including getting timings of various operations, collecting logs, tuning RAC, and especially writing up Oracle RAC quorum behavior. It's not complicated, but it's not well documented by Oracle.
Look for a lot more when the next version of ONTAP ships.
The serial_number property returned by GET /storage/luns/{uuid} is an ASCII string, and the hex serial number you see elsewhere is just that same string converted to hex. For example:

[root@jfs0 current]# echo -n '80A2z+UglTqg' | od -A n -t x1
 38 30 41 32 7a 2b 55 67 6c 54 71 67
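If you'd rather do that conversion in a script than with od, it's a one-liner in Python:

```python
# Convert an ONTAP LUN serial_number (ASCII) to its hex representation.
def serial_to_hex(serial: str) -> str:
    return serial.encode("ascii").hex()

print(serial_to_hex("80A2z+UglTqg"))  # 383041327a2b55676c547167
```

That hex string matches the od output above, byte for byte.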
This is a post about how I unexpectedly needed to use one ONTAP feature in order to test a completely different ONTAP feature. If you haven't heard of it, SVM Migrate is a high-availability feature that allows you to migrate a running storage environment from one cluster to a completely different cluster, nondisruptively.
ONTAP environment
The feature I wanted to test was SnapMirror active sync (SM-AS) running in symmetric active-active mode. We enhanced SM-AS last year to offer symmetric active-active replication. Here’s a basic diagram of what I was working with:
It’s a couple of A700 clusters with SM-AS enabled. I set up my Oracle RAC configuration, including databases and quorum drives, on the jfs_as1 and jfs_as2 SVMs. Oracle RAC is not yet supported with SM-AS in active/active mode, but I couldn’t think of a reason it shouldn’t work, and I wanted to give this a spin. The idea here is creating a single cross-site, ultra-available Oracle RAC cluster. I'll post on this later.
What’s an SVM again?
When you first set up an ONTAP system, it’s a little like VMware ESX. You’ll have an operational cluster, but it doesn’t do anything yet. You need to define a Storage Virtual Machine (SVM). It’s basically a self-contained storage personality. As with VMware, it’s about multitenancy, security, and manageability. You might only have the one SVM on your cluster, but if you want to have different SVMs serving different types of data or managed by different teams, you can do that too. For example, maybe you have a production SVM that is treated extra-carefully, but then you have a development SVM where you give your developers more control over their storage environment.
SnapMirror active sync
This isn’t the point of this post, but SnapMirror active sync (SM-AS) is a zero-RPO replication solution. When operated in active-active mode, what you have is the same data and the same LUNs available on two different systems. All reads are serviced locally. Writes obviously must be replicated to the partner cluster to maintain consistency. The result is symmetric active-active access to the same dataset.
I know how it works internally, so I was sure that simply configuring replication would result in a perfectly usable solution. The question I had was about failover. When you configure SM-AS, you also have a mediator service that manages tiebreaking and failover.
The Problem
My first test was to validate what happens when Cluster2 fails. What SHOULD happen is that replication fails and the mediator signals Cluster1 that it can resume operations unmirrored. After all, the point here is ultra high availability.
Here’s the issue – my Oracle RAC hosts are all running under VMware using VMDK files hosted on the SVM called jfs_esx. If I cut the power on Cluster2, I’m going to take out my hosts as well. I really, really didn’t want to take the time to configure a new ONTAP system and vMotion my VMDK files over.
SVM Migrate to the rescue!
I decided to give SVM Migrate a try. It’s been around since ONTAP 9.10, but I never used it before. The purpose of SVM Migrate is to replicate that entire SVM personality. There are some restrictions, but in my case I just had a 1TB NFS share hosting all my VMDKs.
Since I was working in a lab environment that I own, I figured I’d just give this a try. It was a good test of simplicity. I didn't shut anything down. All my VMs are operational and the RAC clusters are running. Will it all survive the migration? Let's find out! I don't need no documentation.
Caution: Please read the documentation. I didn’t read the documentation, but I’ve been working with ONTAP since ’95 and half my job is trying to break things.
Starting the migration
I knew the command was probably vserver something (an SVM is known as a vserver at the CLI) so I just started typing and using the tab key to see what arguments were required. It looked like I could just do this:
rtp-a700s-c01::> vserver migrate start -vserver jfs_esx -source-cluster rtp-a700s-c02

Info: To check the status of the migrate operation use the "vserver migrate show" command.
I was then pretty sure I was moving my jfs_esx SVM from cluster2 to cluster1. Then again, maybe I didn't provide a required argument or maybe there was some aspect of configuration that blocked the migration. Let's find out what happened...
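For completeness, the same operation is available through the REST API. Here's a minimal sketch of what the request might look like; I'm going from memory of the /svm/migrations endpoint that arrived alongside this feature, so treat the endpoint path and body layout as assumptions to verify against the REST docs for your ONTAP release:

```python
import json

# Hedged sketch of the REST equivalent of "vserver migrate start".
# The /api/svm/migrations endpoint and this body layout are my recollection
# of the ONTAP REST documentation, not verified output.
def build_migrate_body(svm_name: str, source_cluster: str) -> dict:
    # The POST is issued against the DESTINATION cluster, naming the source.
    return {
        "source": {
            "svm": {"name": svm_name},
            "cluster": {"name": source_cluster},
        }
    }

body = build_migrate_body("jfs_esx", "rtp-a700s-c02")
print(json.dumps(body, indent=2))
# e.g. requests.post("https://rtp-a700s-c01/api/svm/migrations",
#                    json=body, auth=("admin", password), verify=False)
```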
Monitoring
The prior command told me to run vserver migrate show to monitor, so that's what I did. I ran it a couple times.
rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       setup-configuration

rtp-a700s-c01::> vserver migrate show
                 Destination         Source
Vserver          Cluster             Cluster             Status
---------------- ------------------- ------------------- ---------------------
jfs_esx          rtp-a700s-c01       rtp-a700s-c02       transferring
Looks like it's working. It appears to have configured the destination and commenced data transfer.
SnapMirror
The most important part of the SVM Migrate operation is moving the data itself, which happens via SnapMirror. That's what the word transferring means above. The SVM Migrate operation is transferring my data. How much data do I need to move?
rtp-a700s-c02::> vol show -vserver jfs_esx jfs_esx -fields used
vserver volume  used
------- ------- -------
jfs_esx jfs_esx 536.8GB
Looks like I'll need to transfer around a half terabyte of total data. I just have the one volume in this SVM. It's a 1TB volume, but after efficiency savings it's 536GB of data.
I was monitoring the status by repeatedly running snapmirror show when I saw something odd:
rtp-a700s-c01::> snapmirror show -destination-path jfs_esx:jfs_esx -fields snapshot-progress
source-path     destination-path snapshot-progress
--------------- ---------------- -----------------
jfs_esx:jfs_esx jfs_esx:jfs_esx  175.8GB

rtp-a700s-c01::> snapmirror show -destination-path jfs_esx:jfs_esx -fields snapshot-progress
source-path     destination-path snapshot-progress
--------------- ---------------- -----------------
jfs_esx:jfs_esx jfs_esx:jfs_esx  23.09GB
What happened? Why did I go from 175GB transferred to just 23GB? The reason is I'm looking at a different SnapMirror operation, and the reason that happened was snapshots.
Snapshot transfers
I guessed that SVM Migrate had initialized the mirror, and then was transferring the individual snapshots from the source. I checked the snapshots at the destination to confirm:
rtp-a700s-c01::> snapshot show -vserver jfs_esx
---Blocks---
Vserver Volume Snapshot Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
jfs_esx jfs_esx
snapmirror.ac509ea6-fa33-11ed-ae6e-00a098f7d731_2152057340.2023-05-25_180109
90.08GB 9% 20%
smas_testing_baseline 6.53GB 1% 2%
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520100.2024-02-27_172427
133.6MB 0% 0%
snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184257
668KB 0% 0%
snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184300
5.35GB 1% 1%
nightly.2024-02-28_0105 14.40GB 1% 4%
6 entries were displayed.
rtp-a700s-c01::> snapshot show -vserver jfs_esx
---Blocks---
Vserver Volume Snapshot Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
jfs_esx jfs_esx
snapmirror.ac509ea6-fa33-11ed-ae6e-00a098f7d731_2152057340.2023-05-25_180109
90.08GB 9% 20%
smas_testing_baseline 6.53GB 1% 2%
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520100.2024-02-27_172427
133.6MB 0% 0%
snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184257
668KB 0% 0%
snapmirror.668dbd39-d590-11ee-a161-00a098f7d731_2152057427.2024-02-27_184300
5.35GB 1% 1%
nightly.2024-02-28_0105 19.26GB 2% 5%
nightly.2024-02-29_0105 33.84MB 0% 0%
7 entries were displayed.
You can see I went from 6 snapshots to 7 snapshots in just a few moments. I asked engineering, "Hey, does SVM Migrate initialize a baseline transfer of my data, and then start transferring the deltas to copy the snapshots too?" and they said, "Yup".
There were 15 snapshots on this volume, so I'm about halfway done moving them. My transfer had been running for about 10 minutes at this point.
Monitoring, again
I went back to monitoring the status, but this time I used the show-volume argument rather than show.
rtp-a700s-c01::> vserver migrate show-volume
                                            Volume   Transfer
Vserver  Volume           State    Healthy  Status                   Errors
-------- ---------------- -------- -------- ------------------------ ------
jfs_esx
         jfs_esx          online   true     Transferring             -
         jfs_esx_root     online   true     ReadyForCutoverPreCommit -
Looks like one of my volumes is fully transferred, but there's a lot of data in that jfs_esx volume, so that's still running.
After another 5 minutes or so, I got to this:
rtp-a700s-c01::> vserver migrate show-volume
                                            Volume   Transfer
Vserver  Volume           State    Healthy  Status                   Errors
-------- ---------------- -------- -------- ------------------------ ------
jfs_esx
         jfs_esx          online   true     ReadyForCutoverPreCommit -
         jfs_esx_root     online   true     ReadyForCutoverPreCommit -
Cool. All data is transferred and ready for the cutover process. If I didn't want cutover to happen automatically, I could have deferred it. There are several other options available with the vserver migrate command that I didn't know about initially because, as mentioned before, I didn't actually read the documentation.
SnapMirror Synchronous
Once all the basic data is transferred, it's time for SVM Migrate to perform the cutover. Since this is an RPO=0 migration, the underlying data must be brought into an RPO=0 synchronous replication configuration. SVM Migrate orchestrates that process, and I saw that transition occur:
rtp-a700s-c01::> vserver migrate show-volume
Volume Transfer
Vserver Volume State Healthy Status Errors
-------- ---------------- -------- --------- --------- ----------------------
jfs_esx
jfs_esx online true InSync -
jfs_esx_root online true InSync -
2 entries were displayed.
Finalization
I then went back to watching the vserver migrate show output and saw these responses:
rtp-a700s-c01::> vserver migrate show
Destination Source
Vserver Cluster Cluster Status
---------------- ------------------- ------------------- ---------------------
jfs_esx rtp-a700s-c01 rtp-a700s-c02 post-cutover
rtp-a700s-c01::> vserver migrate show
Destination Source
Vserver Cluster Cluster Status
---------------- ------------------- ------------------- ---------------------
jfs_esx rtp-a700s-c01 rtp-a700s-c02 cleanup
rtp-a700s-c01::> vserver migrate show
Destination Source
Vserver Cluster Cluster Status
---------------- ------------------- ------------------- ---------------------
jfs_esx rtp-a700s-c01 rtp-a700s-c02 migrate-complete
Thoughts
I'm impressed. I was in some early conversations about the SVM Migrate feature, but I hadn't thought about it since then.
I successfully relocated all the storage for all my VMs, nondisruptively, with a single command, and without even reading the documentation (again, please read the documentation anyway).
It was simple, and it simply worked. As it should.
Fun with automation – ONTAP Consistency Groups
There's a lot to this post. I'll cover what the heck Consistency Groups (CGs) are all about, how to automate CG operations via the REST API, how to convert existing volume snapmirrors into a CG configuration without a requirement to retransfer the whole data set, and finally how to do it all via the CLI.
Some of the content below is copied directly from https://community.netapp.com/t5/Tech-ONTAP-Blogs/Consistency-Groups-in-ONTAP/ba-p/438567. I did that in order to have all the key concepts in the same place.
Consistency Groups in ONTAP
There’s a good reason you should care about CGs – it’s about manageability.
If you have an important application like a database, it probably involves multiple LUNs or multiple filesystems. How do you want to manage this data? Do you want to manage 20 LUNs on an individual basis, or would you prefer just to manage the dataset as a single unit?
Volumes vs LUNs
If you’re relatively new to NetApp, there’s a key concept worth emphasizing – volumes are not LUNs.
Other vendors use those two terms synonymously. We don’t. A Flexible Volume, also known as a FlexVol, or usually just a “volume,” is just a management container. It’s not a LUN. You put data, including NFS/SMB files, LUNs, and even S3 objects, inside of a volume. Yes, it does have attributes such as size, but that’s really just accounting. For example, if you create a 1TB volume, you’ve set an upper limit on whatever data you choose to put inside that volume, but you haven’t actually allocated space on the drives.
This sometimes leads to confusion. When we talk about creating 5 volumes, we don’t mean 5 LUNs. Sometimes customers think that they create one volume and then one LUN within that volume. You can certainly do that if you want, but there’s no requirement for a 1:1 mapping of volume to LUN. The result of this confusion is that we sometimes see administrators and architects designing unnecessarily complicated storage layouts. A volume is not a LUN.
Okay then, what is a volume?
If you go back about eighteen years, an ONTAP volume mapped to specific drives in a storage controller, but that’s ancient history now.
Today, volumes are there mostly for your administrative convenience. For example, if you have a database with a set of 10 LUNs, and you want to limit the performance for the database using a specific quality of service (QoS) policy, you can place those 10 LUNs in a single volume and slap that QoS policy on the volume. No need to do math to figure out per-LUN QoS limits. No need to apply QoS policies to each LUN individually. You could choose to do that, but if you want the database to have a 100K IOPS QoS limit, why not just apply the QoS limit to the volume itself? Then you can create whatever number of LUNs that are required for the workload.
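As an illustration, here's roughly what applying that volume-level QoS ceiling looks like through the REST API. The volume UUID and policy name below are made up, and you should verify the exact qos field names against the REST docs for your ONTAP release:

```python
import json

# Hypothetical sketch: cap a database volume at 100K IOPS by patching the
# volume's QoS policy assignment. UUID and policy name are invented.
def build_qos_patch(policy_name: str) -> dict:
    return {"qos": {"policy": {"name": policy_name}}}

volume_uuid = "00000000-0000-0000-0000-000000000000"  # placeholder
patch_body = build_qos_patch("db_100k_iops")

# With the requests library this would look something like:
# requests.patch(f"https://cluster/api/storage/volumes/{volume_uuid}",
#                json=patch_body, auth=("admin", password), verify=False)
print(json.dumps(patch_body))
```

One PATCH on the volume, and every LUN inside it, present or future, falls under the limit.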
Volume-level management
Volumes are also related to fundamental ONTAP operations, such as snapshots, cloning, and replication. You don’t selectively decide which LUN to snapshot or replicate, you just place those LUNs into a single volume and create a snapshot of the volume, or you set a replication policy for the volume. You’re managing volumes, irrespective of what data is in those volumes.
It also simplifies how you expand the storage footprint of an application. For example, if you add LUNs to that application in the future, just create the new LUNs within the same volume. They will automatically be included in the next replication update, the snapshot schedule will apply to all the LUNs, including the new ones, and the volume-level QoS policy will now apply to IO on all the LUNs, including the new ones.
You can selectively clone individual LUNs if you like, but most cloning workflows operate on datasets, not individual LUNs. If you have an LVM with 20 LUNs, wouldn’t you rather just clone them as a single unit than perform 20 individual cloning operations? Why not put the 20 LUNs in a single volume and then clone the whole volume in a single step?
Conceptually, this makes ONTAP more complicated, because you need to understand the volume abstraction layer, but if you look at real-world needs, volumes make life easier. ONTAP customers don’t buy arrays for just a single LUN; they use them for multiple workloads with LUN counts going into the tens of thousands.
There’s also another important term for a “volume” that you don’t often hear from NetApp. The term is “consistency group,” and you need to understand it if you want maximum manageability of your data.
What’s a Consistency Group?
In the storage world, a consistency group (CG) refers to the management of multiple storage objects as a single unit. For example, if you have a database, you might provision 8 LUNs, configure it as a single logical volume, and create the database. (The term CG is most often used when discussing SAN architectures, but it can apply to files as well.)
What if you want to use array-level replication to protect that database? You can’t just set up 8 individual LUN replication relationships. That won’t work, because the replicated data won’t be internally consistent across LUNs. You need to ensure that all 8 replica LUNs are consistent with one another, or the database will be corrupt.
This is only one aspect of CG data management. CGs are implemented in ONTAP in multiple ways. This shouldn’t be surprising – an ONTAP system can do a lot of different things. The need to manage datasets in a consistent manner requires different approaches depending on the chosen NetApp storage system architecture and which ONTAP feature we’re talking about.
Consistency Groups – ONTAP Volumes
The most basic consistency group is a volume. A volume hosting multiple LUNs is intrinsically a consistency group. I can’t tell you how many times I’ve had to explain this important concept to customers as well as NetApp colleagues simply because we’ve historically never used the term “consistency group.”
Here’s why a volume is a consistency group:
If you have a dataset and you put the dataset components (LUNs or files) into a single ONTAP volume, you can then create snapshots and clones, perform restorations, and replicate the data in that volume as a single consistent unit. A volume is a consistency group. I wish we could update every reference to volumes across all the ONTAP documentation in order to explain this concept, because if you understand it, it dramatically simplifies storage management.
Now, there are times where you can’t put the entire dataset in a single volume. For example, most databases use at least two volumes, one for datafiles and one for logs. You need to be able to restore the datafiles to an earlier point in time without affecting the logs. You might need some of that log data to roll the database forward to the desired point in time. Furthermore, the retention times for datafile backups might differ from log backups.
Native ONTAP Consistency Groups
ONTAP also allows you to configure advanced consistency groups within ONTAP itself. The results are similar to what you’d get with the API calls I mentioned above, except now you don’t have to install extra software like SnapCenter or write a script.
For example, I might have an Oracle database with datafiles distributed across 4 volumes located on 4 different controllers. I often do that to ensure my IO load is guaranteed to be evenly distributed across all controllers in the entire cluster. I also have my logs in 3 different volumes, plus I have a volume for my Oracle binaries.
I can still create snapshots, create clones, and replicate that entire 4-controller configuration. All I have to do is define a consistency group. I’ll be writing more about ONTAP consistency groups in the near future, but I’ll start with an explanation of how to take existing flat volumes replicated with regular asynchronous SnapMirror and convert it into consistency group replication without having to perform a new baseline transfer.
SnapMirror -> CG SnapMirror conversion
Why might you do this? Well, let’s say you have an existing 100TB database spread across 10 different volumes and you’re protecting it with snapshots. You might also be replicating those snapshots to a remote site via SnapMirror. As long as you’ve created those snapshots correctly, you have recoverability at the remote site. The problem is you might have to perform some snaprestore operations to make that data usable.
The point of CG snapmirror is to make a replica of a multi-volume dataset where all the volumes are in lockstep with one another. That yields what I call “break the mirror and go!” recoverability. If you break the mirrors, the dataset is ready without a need for additional steps. It’s essentially the same as recovering from a disaster using synchronous mirroring. That CG snapmirror replica represents the state of your data at a single atomic point in time.
Critical note: when deleting existing SnapMirror relationships, be extremely careful with the API and CLI calls. If you use the wrong JSON with the API calls or the wrong arguments at the CLI, you will delete all common snapshots on the source and destination volumes. If that happens, you will have to perform a new baseline transfer of all data.
SnapMirror and the all-important common snapshot
The foundation of snapmirror is two volumes with the same snapshot. As long as you have two volumes with the exact same snapshot, you can incrementally update one of those volumes using the data in the other volume. The logic is basically this:
Create a new snapshot on the source.
Identify the changes between that new snapshot and the older common snapshot that exists in both the source and target volumes.
Ship the changes between those two snapshots to the target volume.
Once that’s complete, the state of the target volume now matches the content of that newly created snapshot at the source. There’s a lot of additional capabilities regarding storing and transferring other snapshots, controlling retention policies, and protecting snapshots from deletion. The basic logic is the same, though – you just need two volumes with a common snapshot.
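The update logic above can be sketched in a few lines. This is a conceptual model with made-up data structures (a "snapshot" as a dict of block number to contents), not ONTAP internals:

```python
# Conceptual model of an incremental SnapMirror update.
# A "snapshot" here is just a dict of block-number -> contents.

def incremental_update(source_new: dict, common: dict, target: dict) -> None:
    """Ship only the blocks that changed between the common snapshot and
    the new source snapshot, then apply them to the target volume."""
    # Blocks that are new or modified since the common snapshot
    delta = {blk: data for blk, data in source_new.items()
             if common.get(blk) != data}
    # Blocks freed since the common snapshot
    removed = set(common) - set(source_new)
    target.update(delta)
    for blk in removed:
        target.pop(blk, None)

common = {1: "a", 2: "b", 3: "c"}          # exists on BOTH volumes
source_new = {1: "a", 2: "B", 4: "d"}      # block 2 changed, 3 freed, 4 added
target = dict(common)                      # target starts at the common snapshot
incremental_update(source_new, common, target)
print(target == source_new)  # True
```

The only precondition is that `common` really is identical on both sides, which is exactly why losing the common snapshot forces a full baseline transfer.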
Initial configuration - volumes
Here's my current 5 volumes being replicated as 5 ordinary snapmirror replicas:
rtp-a700s-c02::> snapmirror show -destination-path jfs_svm2:jfs3*
Source Path         Destination Path         Mirror Status
------------------- ------------------------ --------------
jfs_svm1:jfs3_dbf1  jfs_svm2:jfs3_dbf1_mirr  Snapmirrored
jfs_svm1:jfs3_dbf2  jfs_svm2:jfs3_dbf2_mirr  Snapmirrored
jfs_svm1:jfs3_logs1 jfs_svm2:jfs3_logs1_mirr Snapmirrored
jfs_svm1:jfs3_logs2 jfs_svm2:jfs3_logs2_mirr Snapmirrored
jfs_svm1:jfs3_ocr   jfs_svm2:jfs3_ocr_mirr   Snapmirrored
Common snapshots
Here’s the snapshots I have on the source:
rtp-a700s-c01::> snapshot show -vserver jfs_svm1 -volume jfs3*
Vserver  Volume     Snapshot
-------- ---------- -------------------------------------
jfs_svm1 jfs3_dbf1
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520140.2024-02-23_190259
         jfs3_dbf2
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520141.2024-02-23_190315
         jfs3_logs1
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520142.2024-02-23_190257
         jfs3_logs2
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520143.2024-02-23_190258
         jfs3_ocr
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520139.2024-02-23_190256
And here’s the snapshots on my destination volumes:
rtp-a700s-c02::> snapshot show -vserver jfs_svm2 -volume jfs3*
Vserver  Volume          Snapshot
-------- --------------- -------------------------------------
jfs_svm2 jfs3_dbf1_mirr
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520140.2024-02-23_190259
         jfs3_dbf2_mirr
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520141.2024-02-23_190315
         jfs3_logs1_mirr
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520142.2024-02-23_190257
         jfs3_logs2_mirr
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520143.2024-02-23_190258
         jfs3_ocr_mirr
snapmirror.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520139.2024-02-23_190256
See the common snapshot in each volume? As long as those snapshots exist, I can do virtually anything I want to these volumes and I’ll still be able to resynchronize the replication relationships without a total retransfer of everything.
Do it with REST
The customer request was to automate the conversion process. The output below was generated with a personal toolbox of mine that issues REST API calls and prints the complete debug output. I normally script in Python.
The POC code used the following inputs:
Name of the snapmirror destination server
Pattern match for existing snapmirrored volumes
Name for the ONTAP Consistency Groups to be created
The basic steps are these:
Enumerate replicated volumes on the target system using the pattern match
Identify the name of the source volume and the source SVM hosting that volume
Delete the snapmirror relationships
Release the snapmirror destination at the source.
Define a new CG at the source
Define a new CG at the destination
Define a CG snapmirror relationship
Resync the mirror
Caution: Step 4 is the critical step. I'll keep repeating this warning in this post. By default, releasing a snapmirror relationship will delete all common snapshots. You need to use additional, non-default CLI/REST arguments to stop that from happening. If you make an error, you’ll lose your common snapshots.
In the following sections, I’ll walk you through my POC script and show you the REST conversation happening along the way.
The script
Here’s the first few lines:
#! /usr/bin/python3
import sys
sys.path.append(sys.path[0] + "/NTAPlib")
import doREST

svm1='jfs_svm1'
svm2='jfs_svm2'
The highlights are that I’m importing my doREST module and defining a couple of variables with the names of the SVMs I’m using. The SVM jfs_svm1 is the source of the snapmirror relationship, and jfs_svm2 is the destination SVM.
A note about doREST. It’s a wrapper for ONTAP APIs that is designed to package up the responses in a standard way. It also has a credential management system and hostname registry. I use this module to string together multiple calls and build workflows. It also handles calls synchronously. For calls such as a POST /snapmirror, which is asynchronous, the doREST module will read the job uuid and repeatedly poll ONTAP until the job is complete. It will then return the results. In the examples below, I’ll include the input/output of that looping behavior. If you want to know more, visit my github repo here.
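doREST's internals aren't reproduced here, but the polling pattern it implements can be sketched in a few lines. This is an assumption-laden sketch, not doREST itself: the fetch callable stands in for an authenticated requests.get() wrapper, and the job endpoint and fields match the debug output shown later in this post.

```python
import time

def wait_for_job(job_uuid, fetch, interval=1.0, timeout=60.0):
    """Poll GET /api/cluster/jobs/<uuid> until the job leaves the running state.

    'fetch' is any callable that takes a URL path and returns the decoded
    JSON body; in real use it would wrap requests.get() with credentials
    and TLS settings. Returns the final job record."""
    deadline = time.time() + timeout
    while True:
        job = fetch('/api/cluster/jobs/' + job_uuid + '?fields=state,message')
        if job['state'] not in ('queued', 'running'):
            return job
        if time.time() > deadline:
            raise TimeoutError('job ' + job_uuid + ' did not complete')
        time.sleep(interval)
```

After a 202 Accepted, the caller reads response['job']['uuid'] from the body and hands it to a helper like this one.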
You'll see I'm running it in debug mode where the API, JSON, and REST response are printed at the CLI. I've included that information to help you understand how to build your own REST workflows.
Enumerate the snapmirror relationships
If I'm going to convert a set of snapmirror relationships into a CG configuration, I'll obviously need to know which ones I'm converting.
api='/snapmirror/relationships'
restargs='fields=uuid,' + \
         'state,' + \
         'destination.path,' + \
         'destination.svm.name,' + \
         'destination.svm.uuid,' + \
         'source.path,' + \
         'source.svm.name,' + \
         'source.svm.uuid' + \
         '&query_fields=destination.path' + \
         '&query=jfs_svm2:jfs3*'
snapmirrors=doREST.doREST(svm2,'get',api,restargs=restargs,debug=2)
This code sets up the REST arguments that go with a GET /snapmirror/relationships. I’ve passed a query for a path of jfs_svm2:jfs3* which means the results will only contain the SnapMirror destinations I mentioned earlier in this post. It's a wildcard search.
Here’s the debug output that shows the REST conversation with ONTAP:
->doREST:REST:API: GET https://10.192.160.45/api/snapmirror/relationships?fields=uuid,state,destination.path,destination.svm.name,destination.svm.uuid,source.path,source.svm.name,source.svm.uuid&query_fields=destination.path&query=jfs_svm2:jfs3* ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "records": [ ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "26b40c82-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_ocr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_ocr_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/26b40c82-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "2759306a-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_logs1", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": 
"ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_logs1_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/2759306a-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "27fdd036-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_logs2", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_logs2_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": 
"jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/27fdd036-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "28a265e8-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_dbf1", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_dbf1_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/28a265e8-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, 
->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "320db78d-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "source": { ->doREST:REST:RESPONSE: "path": "jfs_svm1:jfs3_dbf2", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731", ->doREST:REST:RESPONSE: "name": "jfs_svm1", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "destination": { ->doREST:REST:RESPONSE: "path": "jfs_svm2:jfs3_dbf2_mirr", ->doREST:REST:RESPONSE: "svm": { ->doREST:REST:RESPONSE: "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054", ->doREST:REST:RESPONSE: "name": "jfs_svm2", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: }, ->doREST:REST:RESPONSE: "state": "snapmirrored", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships/320db78d-d27e-11ee-a514-00a098af9054/" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: ], ->doREST:REST:RESPONSE: "num_records": 5, ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/snapmirror/relationships?fields=uuid,state,destination.path,destination.svm.name,destination.svm.uuid,source.path,source.svm.name,source.svm.uuid&query_fields=destination.path&query=jfs_svm2:jfs3*" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Highlights:
The uuids of the snapmirror relationships are in red
Snapmirror sources are highlighted in purple
Snapmirror destinations are in blue
Delete the snapmirror relationships
for record in snapmirrors.response['records']:
    delete=doREST.doREST(svm2,'delete','/snapmirror/relationships/' + record['uuid'] + '/?destination_only=true',debug=2)
This block iterates over the records returned by the prior GET /snapmirror/relationships, extracts each uuid, and deletes all 5 of the relationships.
Caution: the destination_only=true argument is required to stop ONTAP from deleting the common snapshots. Do not overlook this parameter.
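One way to make a flag like this hard to forget is to build the endpoint through a helper that always includes it. This is a sketch of that idea; release_path is my own hypothetical name, not part of doREST or the ONTAP API.

```python
def release_path(uuid, side):
    """Build the DELETE endpoint for tearing down a snapmirror relationship
    without destroying the common snapshots.

    side='destination' is used against the destination cluster, and
    side='source' against the source cluster, matching the two teardown
    steps in this workflow. An unknown side raises KeyError rather than
    silently emitting a snapshot-destroying default."""
    flags = {'destination': 'destination_only=true',
             'source': 'source_info_only=true'}
    return '/snapmirror/relationships/' + uuid + '/?' + flags[side]
```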
->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/26b40c82-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "d905b4e3-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/d905b4e3-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/d905b4e3-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "d905b4e3-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/d905b4e3-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/2759306a-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "d9ad1f48-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/d9ad1f48-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/d9ad1f48-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { 
->doREST:REST:RESPONSE: "uuid": "d9ad1f48-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/d9ad1f48-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/27fdd036-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "da546656-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/da546656-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/da546656-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "da546656-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/da546656-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/28a265e8-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "daf9c09a-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { 
->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/daf9c09a-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/daf9c09a-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "daf9c09a-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/daf9c09a-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.45/api/snapmirror/relationships/320db78d-d27e-11ee-a514-00a098af9054/?destination_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "dba0429b-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dba0429b-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/dba0429b-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dba0429b-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dba0429b-d27e-11ee-a514-00a098af9054" 
->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200
You can see in the above output that the actual DELETE /snapmirror/relationships operation was asynchronous. The REST call returned a status of 202, which means the operation was accepted, but is not yet complete.
The doREST module then captured the uuid of the job and polled ONTAP until complete.
Release the snapmirror relationships
The next part of the script is almost identical to the prior snippet, except this time it’s doing a snapmirror release operation.
The relationship itself was deleted in the prior step, and that deletion was executed against the destination controller with the destination_only=true argument. Asynchronous SnapMirror is a pull technology, so deleting the relationship at the destination already halted further updates. We still need to de-register the destination from the source, which is what this next deletion operation does: it targets the source and includes source_info_only=true.
Caution: the source_info_only=true argument is required to stop ONTAP from deleting the common snapshots. Do not overlook this parameter.
for record in snapmirrors.response['records']:
    delete=doREST.doREST(svm1,'delete','/snapmirror/relationships/' + record['uuid'] + '/?source_info_only=true',debug=2)
->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/26b40c82-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "dc4fcade-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dc4fcade-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dc4fcade-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dc4fcade-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dc4fcade-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/2759306a-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "dcfd165f-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dcfd165f-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dcfd165f-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None 
->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dcfd165f-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dcfd165f-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/27fdd036-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "ddac905c-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/ddac905c-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/ddac905c-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "ddac905c-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/ddac905c-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/28a265e8-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "de9526a2-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: 
"_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/de9526a2-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/de9526a2-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "de9526a2-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/de9526a2-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: DELETE https://10.192.160.40/api/snapmirror/relationships/320db78d-d27e-11ee-a514-00a098af9054/?source_info_only=true ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "df43391f-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/df43391f-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/df43391f-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "df43391f-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/df43391f-d27e-11ee-a161-00a098f7d731" 
->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
At this point, the original snapmirror relationships are completely deconfigured, but the volumes still contain a common snapshot, which is all that is required to perform a resync.
Create a CG at the source
Assuming it hasn’t already been done, we’ll need to define the source volumes as a CG. The process starts by creating a mapping of source volumes to destination volumes using the information obtained when the original snapmirror data was collected.
mappings={}
for record in snapmirrors.response['records']:
    mappings[record['source']['path'].split(':')[1]] = record['destination']['path'].split(':')[1]
The mappings dictionary looks like this:
{'jfs3_ocr': 'jfs3_ocr_mirr', 'jfs3_logs1': 'jfs3_logs1_mirr', 'jfs3_logs2': 'jfs3_logs2_mirr', 'jfs3_dbf1': 'jfs3_dbf1_mirr', 'jfs3_dbf2': 'jfs3_dbf2_mirr'}
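The split-on-colon logic is easy to exercise standalone against records shaped like the GET /snapmirror/relationships output. A minimal sketch (volume_mappings is my own helper name):

```python
def volume_mappings(records):
    # Each path is an 'svm:volume' string; the volume name is the part
    # after the colon.
    mappings = {}
    for record in records:
        src = record['source']['path'].split(':')[1]
        dst = record['destination']['path'].split(':')[1]
        mappings[src] = dst
    return mappings

# A record trimmed down to the fields this helper reads
sample = [{'source': {'path': 'jfs_svm1:jfs3_ocr'},
           'destination': {'path': 'jfs_svm2:jfs3_ocr_mirr'}}]
```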
The next step is to create the consistency group using the keys from this dictionary, because the keys are the volumes at the source. Note that I’m naming the cg jfs3, which is the name of the host where this database resides.
vollist=[]
for srcvol in mappings.keys():
    vollist.append({'name':srcvol,'provisioning_options':{'action':'add'}})
api='/application/consistency-groups'
json4rest={'name':'jfs3', \
           'svm.name':'jfs_svm1', \
           'volumes': vollist}
cgcreate=doREST.doREST(svm1,'post',api,json=json4rest,debug=2)
->doREST:REST:API: POST https://10.192.160.40/api/application/consistency-groups ->doREST:REST:JSON: {'name': 'jfs3', 'svm.name': 'jfs_svm1', 'volumes': [{'name': 'jfs3_ocr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_logs1', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_logs2', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_dbf1', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_dbf2', 'provisioning_options': {'action': 'add'}}]} ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "running", ->doREST:REST:RESPONSE: "message": "Unclaimed", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "running", ->doREST:REST:RESPONSE: "message": "Unclaimed", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": 
"/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "running", ->doREST:REST:RESPONSE: "message": "Creating consistency group volume record - 3 of 5 complete.", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK ->doREST:REST:API: GET https://10.192.160.40/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "dfe481c8-d27e-11ee-a161-00a098f7d731", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/dfe481c8-d27e-11ee-a161-00a098f7d731" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Create a CG at the destination
The next step is to create a CG at the destination. The list of volumes is again taken from the mappings dictionary, except rather than the keys, I’ll use the values. Those are the snapmirror destination volumes discovered in the first step.
vollist=[]
for srcvol in mappings.keys():
    vollist.append({'name':mappings[srcvol],'provisioning_options':{'action':'add'}})
api='/application/consistency-groups'
json4rest={'name':'jfs3', \
           'svm.name':'jfs_svm2', \
           'volumes': vollist}
cgcreate=doREST.doREST(svm2,'post',api,json=json4rest,debug=2)
->doREST:REST:API: POST https://10.192.160.45/api/application/consistency-groups ->doREST:REST:JSON: {'name': 'jfs3', 'svm.name': 'jfs_svm2', 'volumes': [{'name': 'jfs3_ocr_mirr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_logs1_mirr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_logs2_mirr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_dbf1_mirr', 'provisioning_options': {'action': 'add'}}, {'name': 'jfs3_dbf2_mirr', 'provisioning_options': {'action': 'add'}}]} ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "job": { ->doREST:REST:RESPONSE: "uuid": "e25c2f6f-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/e25c2f6f-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 202 ->doREST:REASON: Accepted ->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/e25c2f6f-d27e-11ee-a514-00a098af9054?fields=state,message ->doREST:REST:JSON: None ->doREST:REST:RESPONSE: { ->doREST:REST:RESPONSE: "uuid": "e25c2f6f-d27e-11ee-a514-00a098af9054", ->doREST:REST:RESPONSE: "state": "success", ->doREST:REST:RESPONSE: "message": "success", ->doREST:REST:RESPONSE: "_links": { ->doREST:REST:RESPONSE: "self": { ->doREST:REST:RESPONSE: "href": "/api/cluster/jobs/e25c2f6f-d27e-11ee-a514-00a098af9054" ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:REST:RESPONSE: } ->doREST:RESULT: 200 ->doREST:REASON: OK
Create the consistency group mirror
To define the CG mirror, I need to build the CG snapmirror map. Order matters: I need a list of source volumes and a list of destination volumes, and ONTAP will match element X of the first list to element X of the second list. That’s how you control which volume in the source CG is replicated to which volume in the destination CG.
for record in snapmirrors.response['records']:
    mappings[record['source']['path'].split(':')[1]] = record['destination']['path'].split(':')[1]
srclist=[]
dstlist=[]
for srcvol in mappings.keys():
    srclist.append({'name':srcvol})
    dstlist.append({'name':mappings[srcvol]})
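Because ONTAP pairs the two lists by index, it's worth a quick sanity check that element X of each list really describes the same replication pair before issuing the POST. A sketch of that check (build_cg_volume_lists is my own name; it relies on Python 3.7+ dicts preserving insertion order so two passes over the same dict stay aligned):

```python
def build_cg_volume_lists(mappings):
    # Build the two consistency_group_volumes lists in lockstep so that
    # element i of srclist replicates to element i of dstlist.
    srclist = [{'name': src} for src in mappings]
    dstlist = [{'name': mappings[src]} for src in mappings]
    # Paranoia: verify the index-wise pairing before handing it to ONTAP.
    for s, d in zip(srclist, dstlist):
        assert mappings[s['name']] == d['name']
    return srclist, dstlist

srclist, dstlist = build_cg_volume_lists(
    {'jfs3_ocr': 'jfs3_ocr_mirr', 'jfs3_dbf1': 'jfs3_dbf1_mirr'})
```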
Now I can mirror the jfs3 CG on the source to the jfs3 CG on the destination.
api='/snapmirror/relationships'
json4rest={'source':{'path':'jfs_svm1:/cg/jfs3', \
                     'consistency_group_volumes' : srclist}, \
           'destination':{'path':'jfs_svm2:/cg/jfs3', \
                     'consistency_group_volumes' : dstlist}, \
           'policy':'Asynchronous'}
cgsnapmirror=doREST.doREST(svm2,'post',api,json=json4rest,debug=2)
->doREST:REST:API: POST https://10.192.160.45/api/snapmirror/relationships
->doREST:REST:JSON: {'source': {'path': 'jfs_svm1:/cg/jfs3', 'consistency_group_volumes': [{'name': 'jfs3_ocr'}, {'name': 'jfs3_logs1'}, {'name': 'jfs3_logs2'}, {'name': 'jfs3_dbf1'}, {'name': 'jfs3_dbf2'}]}, 'destination': {'path': 'jfs_svm2:/cg/jfs3', 'consistency_group_volumes': [{'name': 'jfs3_ocr_mirr'}, {'name': 'jfs3_logs1_mirr'}, {'name': 'jfs3_logs2_mirr'}, {'name': 'jfs3_dbf1_mirr'}, {'name': 'jfs3_dbf2_mirr'}]}, 'policy': 'Asynchronous'}
->doREST:REST:RESPONSE: {
->doREST:REST:RESPONSE:   "job": {
->doREST:REST:RESPONSE:     "uuid": "e304e8d8-d27e-11ee-a514-00a098af9054",
->doREST:REST:RESPONSE:     "_links": {
->doREST:REST:RESPONSE:       "self": {
->doREST:REST:RESPONSE:         "href": "/api/cluster/jobs/e304e8d8-d27e-11ee-a514-00a098af9054"
->doREST:REST:RESPONSE:       }
->doREST:REST:RESPONSE:     }
->doREST:REST:RESPONSE:   }
->doREST:REST:RESPONSE: }
->doREST:RESULT: 202
->doREST:REASON: Accepted
->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/e304e8d8-d27e-11ee-a514-00a098af9054?fields=state,message
->doREST:REST:JSON: None
->doREST:REST:RESPONSE: {
->doREST:REST:RESPONSE:   "uuid": "e304e8d8-d27e-11ee-a514-00a098af9054",
->doREST:REST:RESPONSE:   "state": "success",
->doREST:REST:RESPONSE:   "message": "success",
->doREST:REST:RESPONSE:   "_links": {
->doREST:REST:RESPONSE:     "self": {
->doREST:REST:RESPONSE:       "href": "/api/cluster/jobs/e304e8d8-d27e-11ee-a514-00a098af9054"
->doREST:REST:RESPONSE:     }
->doREST:REST:RESPONSE:   }
->doREST:REST:RESPONSE: }
->doREST:RESULT: 200
->doREST:REASON: OK
Retrieve the UUID
My final setup step will be to resync the relationship as a CG replica using the previously existing common snapshots, but to do that I need the uuid of the CG snapmirror I just created. I’ll reuse the same query as before. Strictly speaking, I don’t need all these fields for this workflow, but for the sake of consistency and future-proofing, I’ll gather all the core information about the snapmirror relationship in a single call.
Note that I’ve changed my query to jfs_svm2:/cg/jfs3. This is the syntax for addressing a CG snapmirror.
svm:/cg/[cg name]
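If you're scripting against this, the CG path syntax is simple enough to build and sanity-check with a couple of helpers. This is a minimal sketch; build_cg_path and parse_cg_path are hypothetical helper functions, not part of doREST.

```python
def build_cg_path(svm, cg_name):
    """Build a CG snapmirror path in the svm:/cg/[cg name] syntax."""
    return f"{svm}:/cg/{cg_name}"

def parse_cg_path(path):
    """Split an svm:/cg/[cg name] path into (svm, cg_name).

    Raises ValueError if the path isn't in CG syntax, e.g. a plain
    svm:volume path.
    """
    svm, sep, rest = path.partition(":")
    if not sep or not rest.startswith("/cg/"):
        raise ValueError(f"not a CG path: {path}")
    return svm, rest[len("/cg/"):]

print(parse_cg_path("jfs_svm2:/cg/jfs3"))  # → ('jfs_svm2', 'jfs3')
```

A plain volume path such as jfs_svm2:jfs3_ocr_mirr would raise ValueError, which is a handy guard when a script accepts both kinds of paths.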
api='/snapmirror/relationships'
restargs='fields=uuid,' + \
         'state,' + \
         'destination.path,' + \
         'destination.svm.name,' + \
         'destination.svm.uuid,' + \
         'source.path,' + \
         'source.svm.name,' + \
         'source.svm.uuid' + \
         '&query_fields=destination.path' + \
         '&query=jfs_svm2:/cg/jfs3'
cgsnapmirror=doREST.doREST(svm2,'get',api,restargs=restargs,debug=2)
cguuid=cgsnapmirror.response['records'][0]['uuid']
->doREST:REST:API: GET https://10.192.160.45/api/snapmirror/relationships?fields=uuid,state,destination.path,destination.svm.name,destination.svm.uuid,source.path,source.svm.name,source.svm.uuid&query_fields=destination.path&query=jfs_svm2:/cg/jfs3
->doREST:REST:JSON: None
->doREST:REST:RESPONSE: {
->doREST:REST:RESPONSE:   "records": [
->doREST:REST:RESPONSE:     {
->doREST:REST:RESPONSE:       "uuid": "e304e0fe-d27e-11ee-a514-00a098af9054",
->doREST:REST:RESPONSE:       "source": {
->doREST:REST:RESPONSE:         "path": "jfs_svm1:/cg/jfs3",
->doREST:REST:RESPONSE:         "svm": {
->doREST:REST:RESPONSE:           "uuid": "ac509ea6-fa33-11ed-ae6e-00a098f7d731",
->doREST:REST:RESPONSE:           "name": "jfs_svm1",
->doREST:REST:RESPONSE:           "_links": {
->doREST:REST:RESPONSE:             "self": {
->doREST:REST:RESPONSE:               "href": "/api/svm/peers/2fc4ddfd-fb05-11ed-993a-00a098af9054"
->doREST:REST:RESPONSE:             }
->doREST:REST:RESPONSE:           }
->doREST:REST:RESPONSE:         }
->doREST:REST:RESPONSE:       },
->doREST:REST:RESPONSE:       "destination": {
->doREST:REST:RESPONSE:         "path": "jfs_svm2:/cg/jfs3",
->doREST:REST:RESPONSE:         "svm": {
->doREST:REST:RESPONSE:           "uuid": "ca77cf7f-fa33-11ed-993a-00a098af9054",
->doREST:REST:RESPONSE:           "name": "jfs_svm2",
->doREST:REST:RESPONSE:           "_links": {
->doREST:REST:RESPONSE:             "self": {
->doREST:REST:RESPONSE:               "href": "/api/svm/svms/ca77cf7f-fa33-11ed-993a-00a098af9054"
->doREST:REST:RESPONSE:             }
->doREST:REST:RESPONSE:           }
->doREST:REST:RESPONSE:         }
->doREST:REST:RESPONSE:       },
->doREST:REST:RESPONSE:       "state": "snapmirrored",
->doREST:REST:RESPONSE:       "_links": {
->doREST:REST:RESPONSE:         "self": {
->doREST:REST:RESPONSE:           "href": "/api/snapmirror/relationships/e304e0fe-d27e-11ee-a514-00a098af9054/"
->doREST:REST:RESPONSE:         }
->doREST:REST:RESPONSE:       }
->doREST:REST:RESPONSE:     }
->doREST:REST:RESPONSE:   ],
->doREST:REST:RESPONSE:   "num_records": 1,
->doREST:REST:RESPONSE:   "_links": {
->doREST:REST:RESPONSE:     "self": {
->doREST:REST:RESPONSE:       "href": "/api/snapmirror/relationships?fields=uuid,state,destination.path,destination.svm.name,destination.svm.uuid,source.path,source.svm.name,source.svm.uuid&query_fields=destination.path&query=jfs_svm2:/cg/jfs3"
->doREST:REST:RESPONSE:     }
->doREST:REST:RESPONSE:   }
->doREST:REST:RESPONSE: }
->doREST:RESULT: 200
->doREST:REASON: OK
Resync
Now I’m ready to resync with a PATCH operation. I’ll take the first record from the prior operation and extract the uuid. If I were doing this in production code, I’d validate the results to ensure that the query returned one and only one record. That ensures I really do have the uuid for the CG I created.
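A minimal sketch of that validation check: extract_cg_uuid is a hypothetical helper, and the response shape matches the parsed JSON body returned by the relationship query.

```python
def extract_cg_uuid(response):
    """Return the relationship uuid from a snapmirror relationship query,
    insisting that the query matched exactly one record.

    `response` is the parsed JSON body from GET /api/snapmirror/relationships.
    """
    records = response.get('records', [])
    if len(records) != 1:
        raise RuntimeError(f"expected exactly 1 relationship, found {len(records)}")
    return records[0]['uuid']
```

Failing loudly here is cheaper than PATCHing the wrong relationship if the destination-path query ever matches zero or multiple records.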
api='/snapmirror/relationships/' + cguuid
json4rest={'state':'snapmirrored'}
cgresync=doREST.doREST(svm2,'patch',api,json=json4rest,debug=2)
->doREST:REST:API: PATCH https://10.192.160.45/api/snapmirror/relationships/e304e0fe-d27e-11ee-a514-00a098af9054
->doREST:REST:JSON: {'state': 'snapmirrored'}
->doREST:REST:RESPONSE: {
->doREST:REST:RESPONSE:   "job": {
->doREST:REST:RESPONSE:     "uuid": "e3b577a8-d27e-11ee-a514-00a098af9054",
->doREST:REST:RESPONSE:     "_links": {
->doREST:REST:RESPONSE:       "self": {
->doREST:REST:RESPONSE:         "href": "/api/cluster/jobs/e3b577a8-d27e-11ee-a514-00a098af9054"
->doREST:REST:RESPONSE:       }
->doREST:REST:RESPONSE:     }
->doREST:REST:RESPONSE:   }
->doREST:REST:RESPONSE: }
->doREST:RESULT: 202
->doREST:REASON: Accepted
->doREST:REST:API: GET https://10.192.160.45/api/cluster/jobs/e3b577a8-d27e-11ee-a514-00a098af9054?fields=state,message
->doREST:REST:JSON: None
->doREST:REST:RESPONSE: {
->doREST:REST:RESPONSE:   "uuid": "e3b577a8-d27e-11ee-a514-00a098af9054",
->doREST:REST:RESPONSE:   "state": "success",
->doREST:REST:RESPONSE:   "message": "success",
->doREST:REST:RESPONSE:   "_links": {
->doREST:REST:RESPONSE:     "self": {
->doREST:REST:RESPONSE:       "href": "/api/cluster/jobs/e3b577a8-d27e-11ee-a514-00a098af9054"
->doREST:REST:RESPONSE:     }
->doREST:REST:RESPONSE:   }
->doREST:REST:RESPONSE: }
->doREST:RESULT: 200
->doREST:REASON: OK
Done. I can now see a healthy CG snapmirror relationship.
rtp-a700s-c02::> snapmirror show -destination-path jfs_svm2:/cg/jfs3

Source Path: jfs_svm1:/cg/jfs3
Destination Path: jfs_svm2:/cg/jfs3
Relationship Type: XDP
Relationship Group Type: consistencygroup
SnapMirror Schedule: -
SnapMirror Policy Type: mirror-vault
SnapMirror Policy: Asynchronous
Tries Limit: -
Throttle (KB/sec): unlimited
Mirror State: Snapmirrored
Relationship Status: Idle
File Restore File Count: -
File Restore File List: -
Transfer Snapshot: -
Snapshot Progress: -
Total Progress: -
Percent Complete for Current Status: -
Network Compression Ratio: -
Snapshot Checkpoint: -
Newest Snapshot: snapmirrorCG.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520139.2024-02-23_190812
Newest Snapshot Timestamp: 02/23 19:09:12
Exported Snapshot: snapmirrorCG.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520139.2024-02-23_190812
Exported Snapshot Timestamp: 02/23 19:09:12
Healthy: true
Unhealthy Reason: -
Destination Volume Node: -
Relationship ID: e304e0fe-d27e-11ee-a514-00a098af9054
Current Operation ID: -
Transfer Type: -
Transfer Error: -
Current Throttle: -
Current Transfer Priority: -
Last Transfer Type: resync
Last Transfer Error: -
Last Transfer Size: 99.81KB
Last Transfer Network Compression Ratio: 1:1
Last Transfer Duration: 0:1:5
Last Transfer From: jfs_svm1:/cg/jfs3
Last Transfer End Timestamp: 02/23 19:09:17
Progress Last Updated: -
Relationship Capability: 8.2 and above
Lag Time: 3:24:1
Identity Preserve Vserver DR: -
Volume MSIDs Preserved: -
Is Auto Expand Enabled: true
Backoff Level: -
Number of Successful Updates: 0
Number of Failed Updates: 0
Number of Successful Resyncs: 1
Number of Failed Resyncs: 0
Number of Successful Breaks: 0
Number of Failed Breaks: 0
Total Transfer Bytes: 102208
Total Transfer Time in Seconds: 65
FabricLink Source Role: -
FabricLink Source Bucket: -
FabricLink Peer Role: -
FabricLink Peer Bucket: -
FabricLink Topology: -
FabricLink Pull Byte Count: -
FabricLink Push Byte Count: -
FabricLink Pending Work Count: -
FabricLink Status: -
I would still need to ensure I have the correct snapmirror schedules and policies, but those are essentially the same procedures used for regular volume-based asynchronous snapmirror. The primary difference is that you reference the paths, where necessary, using the svm:/cg/[cg name] syntax. Start with https://docs.netapp.com/us-en/ontap/data-protection/create-replication-job-schedule-task.html for those details.
CLI procedure
If you’re using ONTAP 9.14.1 or higher, you can do everything via the CLI or System Manager, too.
Delete the existing snapmirror relationships
rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_ocr_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_ocr_mirr".

rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_dbf1_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_dbf1_mirr".

rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_dbf2_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_dbf2_mirr".

rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_logs1_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_logs1_mirr".

rtp-a700s-c02::> snapmirror delete -destination-path jfs_svm2:jfs3_logs2_mirr
Operation succeeded: snapmirror delete for the relationship with destination "jfs_svm2:jfs3_logs2_mirr".
Release the snapmirror destinations
Don’t forget the "-relationship-info-only true"!
rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_ocr_mirr -relationship-info-only true
[Job 4984] Job succeeded: SnapMirror Release Succeeded

rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_dbf1_mirr -relationship-info-only true
[Job 4985] Job succeeded: SnapMirror Release Succeeded

rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_dbf2_mirr -relationship-info-only true
[Job 4986] Job succeeded: SnapMirror Release Succeeded

rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_logs1_mirr -relationship-info-only true
[Job 4987] Job succeeded: SnapMirror Release Succeeded

rtp-a700s-c01::> snapmirror release -destination-path jfs_svm2:jfs3_logs2_mirr -relationship-info-only true
[Job 4988] Job succeeded: SnapMirror Release Succeeded
Create a CG at the source
rtp-a700s-c01::> consistency-group create -vserver jfs_svm1 -consistency-group jfs3 -volumes jfs3_ocr,jfs3_dbf1,jfs3_dbf2,jfs3_logs1,jfs3_logs2
  (vserver consistency-group create)
[Job 4989] Job succeeded: Success
Create a CG at the destination
rtp-a700s-c02::> consistency-group create -vserver jfs_svm2 -consistency-group jfs3 -volumes jfs3_ocr_mirr,jfs3_dbf1_mirr,jfs3_dbf2_mirr,jfs3_logs1_mirr,jfs3_logs2_mirr
  (vserver consistency-group create)
[Job 5355] Job succeeded: Success
Create the CG snapmirror relationships
rtp-a700s-c02::> snapmirror create -source-path jfs_svm1:/cg/jfs3 -destination-path jfs_svm2:/cg/jfs3 -cg-item-mappings jfs3_ocr:@jfs3_ocr_mirr,jfs3_dbf1:@jfs3_dbf1_mirr,jfs3_dbf2:@jfs3_dbf2_mirr,jfs3_logs1:@jfs3_logs1_mirr,jfs3_logs2:@jfs3_logs2_mirr
Operation succeeded: snapmirror create for the relationship with destination "jfs_svm2:/cg/jfs3".
Perform the resync operation
rtp-a700s-c02::> snapmirror resync -destination-path jfs_svm2:/cg/jfs3
Operation is queued: snapmirror resync to destination "jfs_svm2:/cg/jfs3".
Done!
rtp-a700s-c02::> snapmirror show -destination-path jfs_svm2:/cg/jfs3

Source Path: jfs_svm1:/cg/jfs3
Destination Path: jfs_svm2:/cg/jfs3
Relationship Type: XDP
Relationship Group Type: consistencygroup
SnapMirror Schedule: -
SnapMirror Policy Type: mirror-vault
SnapMirror Policy: MirrorAndVault
Tries Limit: -
Throttle (KB/sec): unlimited
Mirror State: Snapmirrored
Relationship Status: Idle
File Restore File Count: -
File Restore File List: -
Transfer Snapshot: -
Snapshot Progress: -
Total Progress: -
Percent Complete for Current Status: -
Network Compression Ratio: -
Snapshot Checkpoint: -
Newest Snapshot: snapmirrorCG.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520144.2024-02-26_005106
Newest Snapshot Timestamp: 02/26 00:52:06
Exported Snapshot: snapmirrorCG.ca77cf7f-fa33-11ed-993a-00a098af9054_2161520144.2024-02-26_005106
Exported Snapshot Timestamp: 02/26 00:52:06
Healthy: true
Unhealthy Reason: -
Destination Volume Node: -
Relationship ID: 15f75947-d441-11ee-a514-00a098af9054
Current Operation ID: -
Transfer Type: -
Transfer Error: -
Current Throttle: -
Current Transfer Priority: -
Last Transfer Type: resync
Last Transfer Error: -
Last Transfer Size: 663.3KB
Last Transfer Network Compression Ratio: 1:1
Last Transfer Duration: 0:1:5
Last Transfer From: jfs_svm1:/cg/jfs3
Last Transfer End Timestamp: 02/26 00:52:11
Progress Last Updated: -
Relationship Capability: 8.2 and above
Lag Time: 0:0:21
Identity Preserve Vserver DR: -
Volume MSIDs Preserved: -
Is Auto Expand Enabled: true
Backoff Level: -
Number of Successful Updates: 0
Number of Failed Updates: 0
Number of Successful Resyncs: 1
Number of Failed Resyncs: 0
Number of Successful Breaks: 0
Number of Failed Breaks: 0
Total Transfer Bytes: 679208
Total Transfer Time in Seconds: 65
FabricLink Source Role: -
FabricLink Source Bucket: -
FabricLink Peer Role: -
FabricLink Peer Bucket: -
FabricLink Topology: -
FabricLink Pull Byte Count: -
FabricLink Push Byte Count: -
FabricLink Pending Work Count: -
FabricLink Status: -
If the data isn't already encrypted or compressed, then 3:1 is about the median. As Dave said, there's also a lot of "it depends". We evaluated some internal production datafiles here at NetApp, taken at random, and we found between 2:1 and 6:1 efficiency. We've also had customers with a lot of datafiles that didn't have data in their blocks, and that sort of data gets about 80:1 efficiency because all that actually gets stored is the datafile block header/trailer. We also had a support case a while back where efficiency was basically 1:1. The data wasn't compressed; it was just a massive index of flat files stored elsewhere. It was an extremely efficient way to store data that was also super-random. Conceptually, it was like compressed data, but it wasn't "compression" as we know it.
Could you provide a snippet of the REST API call you're using to do the commit? Also, have you considered autocommit, where you don't need to invoke SnapLock at all? The file is simply committed for the required time after being written. Obviously that wouldn't work if you use different retention times; I was only wondering if you'd considered it.
As a general rule, POST is for creating something or executing an operation, PATCH is for changing something, and GET just retrieves information. I tested this now:

GET https://10.192.160.45/api/storage/volumes?fields=uuid,size,svm.name,svm.uuid,clone.split_initiated,clone.split_complete_percent,clone.split_estimate,nas.path,aggregates,type&name=*&svm.name=jfs_svm2

and for each volume it returned what you'd expect on a system with no split operations in progress:

->doREST:REST:RESPONSE:   "clone": {
->doREST:REST:RESPONSE:     "split_estimate": 3266568192,
->doREST:REST:RESPONSE:     "split_initiated": false

The split_estimate is misleading. It's the amount of used space on the volume to be split. It does NOT mean that amount of data will be copied or consumed after the split. The space consumption after a split depends on the space allocation policies for the volume. If you're in a fully thin-provisioned configuration, splitting a clone requires no additional space. If you're thick provisioned, the split clone would allocate its full size on the aggregate for itself.
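If you're polling that GET from a script, interpreting the clone fields could look something like this minimal sketch. It's pure Python over the parsed JSON; the field names match the GET above, and clone_split_status plus the sample record are illustrative.

```python
def clone_split_status(volume):
    """Summarize clone split state from one record of a
    GET /api/storage/volumes response that includes the clone.* fields."""
    clone = volume.get('clone', {})
    if clone.get('split_initiated'):
        pct = clone.get('split_complete_percent', 0)
        return f"{volume['name']}: split in progress, {pct}% complete"
    est = clone.get('split_estimate')
    if est is None:
        return f"{volume['name']}: not a clone"
    return f"{volume['name']}: no split in progress, estimate {est} bytes"

# Example record shaped like the response above (names are made up)
vol = {'name': 'jfs3_dbf1_clone',
       'clone': {'split_estimate': 3266568192, 'split_initiated': False}}
print(clone_split_status(vol))
```

Remember the caveat above: the "estimate" is the used space of the volume being split, not the extra space a split will consume.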
"volume clone split show" is the CLI command. For REST, you can do a GET /storage/volumes and retrieve fields like: clone.split_complete_percent clone.split_estimate clone.split_initiated
Certification just came through a few days ago. In principle, A-Series and C-Series should perform about the same because SAP HANA is all about sequential IO. You'll see a random read latency increase with C-Series, but there's only a very small difference between A-Series and C-Series with sequential IO. It's not zero difference, but it's negligible.
It might take a few more days until the certification is listed on SAP’s webpage: https://www.sap.com/dmc/exp/2014-09-02-hana-hardware/enEN/#/solutions?filters=storage.
The C-Series will definitely have higher latency. A C250 should be able to deliver more total IOPS than the A200, but the latency of the individual IO operations would be higher. You're correct, it's the differing media.
Whether you notice the difference depends on the workload. A lot of virtualization projects really don't need 150us of latency from an A-Series. A lot of VMware footprints are still using FAS with spinning drives, and they work well. If, however, you're hosting a database on those VMDK's, then C-Series latency might cause problems.
A-Series and C-Series
I've seen an extraordinary amount of interest in the C-Series systems for all sorts of workloads. I'm the Big Scary Enterprise Workload guy, which means I want proof before I recommend anything to a customer. So, I reached out to the Workload Engineering team and got some realistic test data that demonstrates what it can do.
If you want to read all the low-level details about the A-Series and C-Series architectures and their performance characteristics, there are a couple of hyperlinks below.
Before I get into the numbers, I wanted to recap my personal view of "What is C-Series?", or, to phrase it better…
Why is C-Series?
For a long time, solid-state storage arrays just used "Flash". For a brief time, it appeared the solid-state market was going to divide into Flash media and 3D XPoint (3DXP) media, a much faster solid-state technology, but 3DXP couldn't break out of its niche.
Instead, we've seen the commercial Flash market divide into SLC/TLC drives and QLC drives. Without going into details, SLC/TLC are the faster and more durable options aimed at high-speed storage applications, whereas QLC is somewhat slower and less durable* but also less expensive, and is aimed at capacity-centric, less IO-intensive storage applications.
*Note: Don't get misled on this topic. The durability difference between TLC and QLC might be important if you're purchasing a drive for your personal laptop, but the durability is essentially identical once the drives are inserted into ONTAP. ONTAP RAID technology is still there protecting data against media failures. Furthermore, ONTAP WAFL technology distributes inbound write data to free blocks across multiple drives. This minimizes overwrites of the individual cells within the drive, which maximizes the drive's useful life. Also, NetApp support agreements that cover drive failures include drive replacement for SSDs that have exhausted their write cycles.
The result of the market changes is that NetApp now offers the A-Series for high-speed, latency sensitive databases or IO-intensive VMware estates, while the C-Series is for less latency-sensitive and more capacity-centric workloads.
That's easy enough to understand in principle, but it's not enough for DBAs, virtualization admins, and storage admins to make decisions. They want to see the numbers…
The Numbers
What makes this graph so compelling is its simplicity.
The IOPS capabilities are comparable. Yes, the C-Series saturates a touch sooner than the A-Series, but it's very close.
As expected, the C-Series is showing higher latency, but it's very consistent and also much closer to A-Series performance than it is to hybrid/spinning disk solutions.
The workload is an Oracle database issuing a roughly 80%/20% random-read, random-write split. We graph total IOPS against read latency because read latency is normally the most important factor with real-world workloads.
The reason we use Oracle databases is twofold. First, we've got thousands and thousands of controllers servicing Oracle databases, so it's an important market for us. Second, Oracle databases are an especially brutal, latency-sensitive workload with lots of interdependencies between different IO operations. If anything is wrong with storage behavior, you should see it in the charts. You can also extrapolate these results to lots of different workloads beyond databases.
We are also graphing the latency as seen from the database itself. The timing is based on the elapsed time between the application submitting an IO and the IO being reported as complete. This means we're measuring the entire storage path from the OS, through the network, into the storage system, back across the network to the OS, and up to the application layer.
I'd also like to point out that the performance of the A-Series is amazing. 900K IOPS without even breaching 1ms of latency runs circles around the competition, but I've posted about that before. This post is focusing on C-Series.
Note: These tests all used ONTAP 9.13.1, which includes some significant performance improvements for both A-Series and C-Series.
Write latency
Obviously write latency is also important, but the A-Series and C-Series both use the same write logic from the point of view of the host. Write operations commit to the mirrored, nonvolatile NVRAM journal. Once the data is in NVRAM, the write is acknowledged and the host continues. The write to the drive layer comes much later.
Want proof?
This is a graph of total IOPS versus the write latency. Note that the latency is reported in microseconds.
You can see your workload's write latency is mostly unaffected by the choice of A-Series or C-Series. As the IOPS increase toward the saturation point, the write latency on C-Series increases more quickly than A-Series as a result of the somewhat slower media in use, but keep this in perspective. Most real-world workloads run as expected so long as write latency remains below 500µs. Even 1ms of write latency is not necessarily a problem, even with databases.
Caching
These tests used 10TB of data within the database (the database itself was larger, but we're accessing 10TB during the test runs). This means the test results above do include some cache hits on the controller, which reflects how storage is used in the real world. There will be some benefit from caching, but nearly all IO in these tests is being serviced from the actual drives.
We also run these tests with storage efficiency features enabled, using a working set with a reasonable level of compressibility. Unrealistically compressible test data can skew the results in the same way that a tiny, unrealistically cacheable working set can.
The reason I want to point this out is that customers considering C-Series need to understand that not all IO latency is affected. The higher latencies only appear with read operations that actually require a disk IO. Reads that can be serviced by onboard cache should be measurable in microseconds, as is true with the A-Series. This is important because all workloads, especially databases, include hot blocks of data that require ultra-low latency. Access times for cached blocks should be essentially identical between A-Series and C-Series.
Sequential IO
Sequential IO is much less affected by the drive type than random IO. The reason is that sequential IO involves both readahead and larger blocks. That means the storage system can start performing read IO to the drives before the host even requests the data, and there are far fewer (but larger) IO operations happening on the backend drives.
On the whole, if you're doing sequential IO you should see comparable performance with A-Series, C-Series and even old FAS arrays if they have enough drives.
We compared A-Series to C-Series and saw a peak sequential read throughput of about 30GB/sec with the A-Series and about 27GB/sec with the C-Series. These numbers were obtained with synthetic tools. It's difficult to perform a truly representative sequential IO test from a database because of the configuration requirements. You'd need an enormous number of FC adapters to push a single A800 or C800 controller to its limit, and it's difficult to get a database to try to read 30GB/sec in the first place.
As a practical matter, few workloads are purely sequential IO, but tasks such as Oracle RMAN backups or database full table scans should perform about the same on both A-Series and C-Series. The limiting factor should normally be the available network bandwidth, not the storage controller.
Summary: A-Series or C-Series?
It's all ONTAP; it's just about the media. If you have a workload that genuinely requires consistent latency down in the 100s of µs range, then choose A-Series. If your workload can accept 2ms (or so) of read latency (and remember, cache hits and write IO are much faster), then look at C-Series.
As a general principle, think about whether your workload is about computational speed or is about the end-user experience. A bank performing end-of-day account reconciliation probably needs the ultra-low latency of the A-Series. In contrast, a CRM database is usually about the end users. If you're updating customer records, you probably don't care if it takes an extra 2ms to retrieve a block of data that contains the customer contact information.
You can also build mixed clusters and tier your databases between A-Series and C-Series as warranted. It's all still ONTAP, and you can nondisruptively move your workloads between controllers.
Finally, the C-Series is an especially good option for upgrading legacy spinning drive and hybrid arrays. Spinning drive latencies are typically around 8-10ms, which means C-Series is about 4X to 5X faster in terms of latency. If you're looking at raw IOPS, there's no comparison. A spinning drive saturates around 120 IOPS/drive. You would need about 6000 spinning drives to reach the 800,000 IOPS delivered by just 24 QLC drives as shown in these tests. It's a huge improvement in terms of performance, power/cooling requirements, and costs, but at a lower price point than arrays using SLC or TLC drives.
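The drive-count arithmetic above is easy to check. This is a back-of-the-envelope sketch using the numbers from the paragraph (120 IOPS per spinning drive, 800,000 IOPS from 24 QLC drives):

```python
import math

SPINNING_IOPS_PER_DRIVE = 120   # rough saturation point for one spinning drive
TARGET_IOPS = 800_000           # delivered by just 24 QLC drives in these tests

# Number of spinning drives needed to match the QLC result
drives_needed = math.ceil(TARGET_IOPS / SPINNING_IOPS_PER_DRIVE)
print(drives_needed)  # → 6667, i.e. thousands of HDDs vs. 24 QLC drives
```

That's where the "about 6000 spinning drives" figure comes from.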
If you want to learn more, we published the following two technical reports today:
Oracle Performance on AFF A-Series and C-Series
Virtualized Oracle Performance on AFF A-Series and C-Series
Bonus Chart:
If you're wondering how A-Series and C-Series compare under virtualization, here it is. It's not easy building a truly optimized 4-node virtualized RAC environment, and we probably could have tuned this better to reduce the overhead from ESX and VMDK's, but the results are still outstanding and, more importantly, consistent with the bare metal tests. The latency is higher with C-Series, but the IOPS levels are comparable to A-Series.
We also did this test with FCP, not NVMe/FC, because most VMware customers are using traditional FCP. The protocol change is the primary reason for the lower maximum IOPS level.
That KB article is incorrect. I just put in a request to remove it. SM-S operates in one of two modes: sync and strict sync. The target customer for the "sync" option is a customer who wants RPO=0 but does NOT want operations to screech to a halt if the replication link is lost. That applies to most customers. Some customers, however, want 100% guaranteed RPO=0. That's why we also included the option for strict sync. If a write cannot be replicated, an IO error is returned to the host OS, which typically results in an application shutdown. In both cases, the timeout is around 15 seconds, but the precise number varies a little, and I believe there are some options for tuning. In normal operations, you have RPO=0 synchronous mirroring. If you lose connectivity to the remote site for more than 15 seconds, then regular SM-S will carry on accepting writes in a broken state and resync when it gets the opportunity, while StrictSync will throw an error. I believe the field you've asked about indicates the time since an SM-S Sync (not StrictSync) replica lost sync.
NFS has been around for decades as the premier networked, clustered filesystem. If you're a unix/linux user, and you're storing a lot of files, you're probably using NFS right now, especially if you need multiple hosts accessing the same data.
If you're looking for high-performance NFS, NetApp's implementation is the best in the business. A lot of NetApp's market share was built on ONTAP's unique ability to deliver fast, easy-to-manage NFS storage for Oracle database workloads. It's an especially nice solution for Oracle RAC because it's an inherently clustered filesystem. The connected hosts are just reading and writing files. The actual filesystem management lives on the storage system itself. All the NFS clients on the hosts see the same logical data.
The NFSv3 specification was published in 1995, and that's still the version almost everyone is using today. You can store a huge number of files, it's easy to configure, and it's super-fast. There really wasn't much to improve, and as a result v3 has been the dominant version for decades.
Note: I originally wrote this post for Oracle database customers moving from NFSv3 to NFSv4, but it morphed into a more general explanation of the practical difference between managing NFSv3 storage and managing NFSv4 storage. Any sysadmin using NFS should understand the differences in protocol behavior.
Why NFSv4?
So, why is everyone increasingly looking at NFSv4?
Sometimes it's just perception. NFSv4 is newer, and 'newer' is often seen as 'better'. Most customers I see who are either migrating to NFSv4 or choosing NFSv4 for a new project honestly could have used either v3 or v4 and wouldn't notice a difference between the two. There are exceptions, though. There are subtle improvements in NFSv4 that sometimes make it a much better option, especially in cloud deployments.
This post is about the key practical differences between NFSv3 and NFSv4. I'll cover security improvements, changes in networking behavior, and changes in the locking model. It's especially critical you understand the section NFSv4.1 Locks and Leases. NFSv4 is significantly different from NFSv3. If you're running an application like an Oracle database over NFSv4, you need to change your management practices if you want to avoid accidentally crashing your database.
What this post is not
This is not a re-hash of the ONTAP NFS best practices. You can find that information here, https://www.netapp.com/media/10720-tr-4067.pdf.
NFSv4 versions
If someone says “NFSv4” they're usually referring to NFSv4.1. That’s almost certainly the version you’ll be using.
The first release of NFSv4, which was version 4.0, worked fine, but the NFSv4 protocol was designed to expand and evolve. The primary version you’ll see today is NFSv4.1. For the most part, you don't have to think about the improvements in NFSv4.1. It just works better than NFSv4.0 in terms of performance and resiliency.
For purposes of this post, when I write NFSv4, just assume that I’m talking about NFSv4.1. It’s the most widely adopted and supported version. (NetApp has support for NFSv4.2, but the primary difference is we added support for labeled NFS, a security feature that most customers haven’t implemented.)
NFSv4 features
The most confusing part about the NFSv4 specification is the existence of optional features. The NFSv3 spec was quite rigid. A given client or server either supported NFSv3 or did not support NFSv3. In contrast, the NFSv4 spec is loaded with optional features.
Most of these optional NFSv4 features are disabled by default in ONTAP because they're not commonly used by sysadmins. You probably don't need to think about them, but there are some applications on the market that specifically require certain capabilities for optimum performance. If you have one of these applications, there should be a section in the documentation covering NFS that will explain what you need from your storage system and which options should be enabled.
If you plan to enable one of the options (delegations is the most commonly used optional feature), test it first and make sure your OS's NFS client fully supports the option and it's compatible with the application you're using. Some of the advanced features can be revolutionary, but only if the OS and application make use of those features. For more information on optional features, refer to the TR referenced above.
Again, it's rare you'll run into any issues. For most users, NFSv4 is NFSv4. It just works.
Exception #1
NFSv4.1 introduced a feature called parallel NFS (pNFS) which is a significant feature with broad appeal for a lot of customers. It separates the metadata path from the data path, which can simplify management and improve performance in very large scale environments.
For example, let's say you have a 20-node cluster. You could enable the pNFS feature, configure a data interface on all 20 nodes, and then mount your NFSv4.1 filesystems from one IP in the cluster. That IP becomes the control path. The OS will then retrieve the data path information and choose the optimal network interface for data traffic. The result is you can distribute your data all over the entire 20-node cluster and the OS will automatically figure out the correct IP address and network interface to use for data access. The pNFS feature is also supported by Oracle's direct NFS client.
pNFS is not enabled by default. NetApp has supported it for a long time, but at the time of the release some OS's had a few bugs. We didn't want customers to accidentally use a feature that might expose them to OS bugs. In addition, pNFS can silently change the network paths in use to move data around, which could also cause confusion for customers. It was safer to leave pNFS disabled so customers know for sure whether it's being used within their storage network.
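If you do want to experiment with pNFS, the workflow looks roughly like the sketch below. The vserver name, IP address, and mount point are placeholders I made up for illustration; check your ONTAP release documentation for the exact option names before relying on this.

```shell
# On ONTAP (vserver name "vs1" is a placeholder):
#   vserver nfs modify -vserver vs1 -v4.1 enabled -v4.1-pnfs enabled

# On the Linux client, mount any one data LIF in the cluster with NFSv4.1:
mount -t nfs -o vers=4.1 192.168.0.10:/oradata /mnt/oradata

# Once pNFS layouts are in use, LAYOUTGET operation counters should show
# nonzero activity in the client's mount statistics:
grep LAYOUTGET /proc/self/mountstats
```

Remember to test against your specific OS and application combination first, as discussed above.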
"Upgrading" from NFSv3 to NFSv4
Don't think of this as an upgrade. NFSv4 isn't better than NFSv3, NFSv4 is merely different. Whether you get any benefits from those differences depends on the application.
For example - locking. NFSv3 has some basic locking capabilities, but it's essentially an honor system lock. NFSv3 locks aren't enforced by the server. NFSv3 clients can ignore locks. In contrast, NFSv4 servers, including ONTAP, must honor and enforce locks.
That opens up new opportunities for applications. For example, IBM WebSphere and Tibco offer clusterable applications where locking is important. There's nothing stopping those vendors from writing application-level logic that tracks and controls which parts of the application are using which files, but that requires work. NFSv4 can do that work too, natively, right on the storage system itself. NFSv4 servers track the state of open and locked files, which means you can build clustered applications where individual files can be exclusively locked for use by a specific process. When that process is done with the file, it can release the lock and other processes can acquire the lock. The storage system enforces the locking.
That's a cool feature, but do you need any of that? If you have an Oracle database, it's mostly just doing reads and writes of various sizes and that's all. Oracle databases already manage locking and file access synchronization internally. NetApp does a lot of performance testing with real Oracle databases, and we're not seeing any significant performance difference between NFSv3 and NFSv4. Oracle simply hasn't coded their software to make use of the advanced NFSv4 features.
NFS through a firewall
While the choice of NFS version rarely matters to the applications you're running, it does affect your network infrastructure. In particular, it's much easier to run NFSv4 across a firewall.
With NFSv4, you have a single target port (2049) and the NFSv4 clients are required to renew leases on files and filesystems on a regular basis (more on leases below). This activity keeps the TCP session active. You can normally just open port 2049 through the firewall and NFSv4 will work reliably.
In contrast, NFSv3 is often impossible to run through a firewall. Among the problems experienced by customers trying to make it work is NFSv3 filesystems hanging for up to 30 minutes or more. The problem is that firewalls are almost universally configured to drop a network packet that isn't part of a known TCP session. If you have a lot of NFSv3 filesystems, one of them will probably have quiet periods where the TCP session has low activity. If your TCP session timeout limit on the firewall is set to 15 minutes, and an NFSv3 filesystem is quiet for 15 minutes, the firewall will mark the TCP session as stale and cease passing its packets.
Even worse, it will probably drop them.
If the firewall rejected the packets, that would prompt the client to open a new session, but that's not how firewalls normally work. They'll silently drop the packets. You don't usually want a firewall rejecting a packet because that tells an intruder that the destination exists. Silently dropping an invalid packet is safer because it doesn't reveal anything about the other side of the firewall.
The result of silent packet drops with NFSv3 is the client will hang while it tries to retransmit packets over and over and over. Eventually it gives up and will open a fresh TCP session. The firewall will register the new TCP session and traffic will resume, but in the interim your OS might have been stalled out for 5, 10, 20 minutes or more. Most firewalls can't be configured to avoid this situation. You can increase the allowable timeout for an inactive TCP session, but there has to be some kind of timeout with a fixed number of seconds.
We've had a few customers write scripts that did a repeated "stat" on an NFSv3 mountpoint in order to ensure there's enough network activity on the wire to prevent the firewall from closing the session. This is okay as a one-off hack, but it's not something I'd want to rely on for anything mission-critical and it doesn't scale well.
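For the curious, the hack amounts to nothing more than a cron entry along these lines. The mount point path is a placeholder, and again, I wouldn't rely on this for anything mission-critical.

```shell
# Cron entry: stat the NFSv3 mount point every 5 minutes so the firewall
# always sees traffic on the TCP session before its idle timeout expires.
# "/mnt/nfsvol" is a placeholder path.
*/5 * * * * /usr/bin/stat /mnt/nfsvol > /dev/null 2>&1
```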
Even if you could increase the timeouts for NFSv3, how do you know which ports to open and ensure they're correctly configured on the firewall? You've got 2049 for NFS, 111 for portmap, 635 for mountd, 4045 for NLM, 4046 for NSM, 4049 for rquota…
NFSv4 works better because there's just a single target port, and the "heartbeat" of lease renewal keeps the TCP session alive.
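The firewall rule difference is easy to see in a sketch. The addresses and rule style below are illustrative (your firewall syntax will differ), but the contrast in port counts is the point:

```shell
# NFSv4: a single rule covering the one well-known port.
iptables -A FORWARD -p tcp -d 192.168.0.10 --dport 2049 -j ACCEPT

# NFSv3 would need this plus the whole supporting cast:
# iptables -A FORWARD -p tcp -d 192.168.0.10 --dport 111 -j ACCEPT   # portmap
# iptables -A FORWARD -p tcp -d 192.168.0.10 --dport 635 -j ACCEPT   # mountd
# ...and 4045 (NLM), 4046 (NSM), 4049 (rquota), plus UDP variants.
```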
NFS Security
NFSv4 is inherently more secure than NFSv3. For example, NFSv4 security is normally based on usernames, not user ID's. The result is it's more difficult for an intruder to spoof credentials to gain access to data on an NFSv4 server. You can also easily tell which clients are actively using an NFSv4 filesystem. It's often impossible to know for sure with NFSv3. You might know a certain client mounted a filesystem at some point in the past, but are they still using the files? Is the filesystem still mounted now? You can't know for sure with NFSv3.
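As an aside, recent ONTAP releases can report active client connections directly. A minimal sketch (the vserver name is a placeholder, and the available fields vary by release):

```shell
# Show which NFS clients are actively connected to this vserver.
vserver nfs connected-clients show -vserver vs1
```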
NFS Security - Kerberos
NFSv4 also includes options to make it even more secure. The primary security feature is Kerberos. You have three options -
krb5 - secure authentication
krb5i - data integrity
krb5p - privacy
In a nutshell, basic krb5 security means better, more secure authentication for NFS access. It's not encryption per se, but it uses an encrypted process to ensure that whoever is accessing an NFS resource is who they claim to be. Think of it as a secure login process where the NFS client authenticates to the NFS server.
If you use krb5i, you add a validation layer to the payload of the NFS conversation. If a malicious middleman gained access to the network layer and tried to modify the data in transit, krb5i would detect and stop it. The intruder may still be able to read data from the conversation, but they won't be able to tamper with it undetected.
If you're concerned about an intruder being able to read network packets on the wire, you can go all the way to krb5p. The letter p in krb5p means privacy. It delivers complete encryption.
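On the Linux side, the three flavors map directly to the `sec=` mount option. The server address and paths below are placeholders, and the client must already be configured for the Kerberos realm before any of these will work:

```shell
# Kerberos-secured NFSv4.1 mounts, one per security flavor:
mount -t nfs -o vers=4.1,sec=krb5  192.168.0.10:/vol1 /mnt/krb5   # authentication only
mount -t nfs -o vers=4.1,sec=krb5i 192.168.0.10:/vol1 /mnt/krb5i  # adds integrity checking
mount -t nfs -o vers=4.1,sec=krb5p 192.168.0.10:/vol1 /mnt/krb5p  # adds full privacy (encryption)
```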
In the field, few administrators use these options for a simple reason - what are the odds a malicious intruder is going to gain access to a data center and start snooping on IP packets on the wire? If someone was able to do that, they'd probably be able to get actual login credentials to the database server itself. They'd then be able to freely access data as an actual user.
With increased interest in cloud, some customers are demanding that all data on the wire be encrypted, no exceptions, ever, and they're demanding krb5p. They don't necessarily use it across all NFS filesystems, but they want the option to turn it on. This is also an example of how NFSv4 security is superior to NFSv3. While some NFSv3 operations could be krb5p encrypted, not all NFSv3 functions could be "kerberized". NFSv4, however, can be 100% encrypted.
NFSv4 with krb5p is still not generally used because the encryption/decryption work has overhead. Latency will increase and maximum throughput will drop. Most databases would not be affected to the point users would notice a difference, but it depends on the IO load and latency sensitivity. Users of a very active database would probably experience a noticeable performance hit with full krb5p encryption. That's a lot of CPU work for both the OS and the storage system. CPU cycles are not free.
NFS Security - Private VLANs
If you're genuinely concerned about network traffic being intercepted and decoded in-transit, I would recommend looking at all available options. Yes, you could turn on krb5p, but you could also isolate certain NFS traffic to a dedicated switch. Many switches support private VLANs where individual network ports can communicate with the storage system, but all other port-to-port traffic is blocked. An outside intruder wouldn't be able to intercept network traffic because there would be no other ports on the logical network. It's just the client and the server. This option mitigates the risk of an intruder intercepting traffic without imposing a performance overhead.
NFS Security - IPSec
In addition, you may want to consider IPSec. Any network administrator should know IPSec already, and it's been part of OSs for years. It's a lot like the VPN client you have on your PC, except it's used by server OSs and network devices.
As an ONTAP example, you can configure an IPsec endpoint on a Linux OS and an IPsec endpoint on ONTAP, and subsequently all IP traffic will use that IPsec tunnel for communication. The protocol doesn't really matter (although I wouldn't recommend using krb5p over IPsec; there's no need to re-encrypt already encrypted traffic). NFS should perform about the same under IPsec as it would with krb5p, and in some environments IPsec is easier to configure than krb5p.
Note: You can also use IPsec with NFSv3 if you need to secure an NFS connection and NFSv4 is not yet an option for you.
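A rough sketch of the ONTAP side of such a setup is below. The vserver name, policy name, and addresses are placeholders, and the exact syntax varies by ONTAP release, so treat this as an outline rather than a recipe:

```shell
# Enable IPsec cluster-wide, then create a policy protecting traffic
# between one client and one LIF (all names/addresses are placeholders):
security ipsec config modify -is-enabled true
security ipsec policy create -vserver vs1 -name nfs-ipsec \
    -local-ip-subnets 192.168.0.10/32 -remote-ip-subnets 192.168.0.20/32
# The Linux client needs a matching IPsec policy, e.g. via strongSwan or libreswan.
```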
NFS Security - Application Layer
Applications can encrypt data too.
For example, if you're an Oracle database user, consider encryption at the database layer. That also delivers encryption of data on the wire, plus one additional benefit - the backups are encrypted. A lot of the data leaks you read about are a result of someone leaving an unprotected backup in an insecure location. Oracle's Transparent Data Encryption (TDE) encrypts the tablespaces themselves, which means a breach of the backup location will yield only data that is still encrypted. As long as the Oracle Wallet data, which contains the decryption keys, is not stored with the backups themselves, that backup data is still secured.
Additionally, TDE scales better. The encryption/decryption work is distributed across all your database servers, which means more CPU's sharing in the work. In addition, and unlike krb5p encryption, TDE incurs zero overhead on the storage system itself.
NFSv4.1 Locks and Leases
In my opinion, this is the most important section of this post. If you don't understand this topic, you're likely to accidentally crash your database.
NFSv3 is stateless. That effectively means that the NFS server (ONTAP) doesn't keep track of which filesystems are mounted, by whom, or which locks are truly in place. ONTAP does have some features that will record mount attempts so you have an idea which clients may be accessing data, and there may be advisory locks present, but that information isn't guaranteed to be 100% complete. It can't be complete, because tracking NFS client state is not part of the NFSv3 standard.
In contrast, NFSv4 is stateful. The NFSv4 server tracks which clients are using which filesystems, which files exist, which files and/or regions of files are locked, etc. This means there needs to be regular communication between an NFSv4 client and the NFSv4 server to keep the state data current.
The most important states being managed by the NFS server are NFSv4 Locks and NFSv4 Leases, and they are very much intertwined. You need to understand how each works by itself, and how they relate to one another.
Locking
With NFSv3, locks are advisory. An NFS client can still modify or delete a "locked" file. An NFSv3 lock doesn't expire by itself, it must be removed. This creates problems. For example, if you have a clustered application that creates NFSv3 locks, and one of the nodes fails, what do you do? You can code the application on the surviving nodes to remove the locks, but how do you know that's safe? Maybe the "failed" node is operational, but isn't communicating with the rest of the cluster?
With NFSv4, locks have a limited duration. As long as the client holding the locks continues to check in with the NFSv4 server, no other client is permitted to acquire those locks. If a client fails to check in with the NFSv4 server, the locks eventually get revoked by the server and other clients will be able to request and obtain locks.
Now we have to add a layer - leases. NFSv4 locks are associated with an NFSv4 lease.
Leases
When an NFSv4 client establishes a connection with an NFSv4 server, it gets a lease. If the client obtains a lock (there are many types of locks) then the lock is associated with the lease.
This lease has a defined timeout. By default, ONTAP will set the timeout value to 30 seconds:
EcoSystems-A200-B::*> nfs server show -vserver jfsCloud4 -fields v4-lease-seconds
vserver   v4-lease-seconds
--------- ----------------
jfsCloud4 30
This means that an NFSv4 client needs to check in with the NFSv4 server every 30 seconds to renew its leases.
The lease is automatically renewed by any activity, so if the client is doing work there's no need to perform additional operations. If an application becomes quiet and is not doing real work, it needs to perform a sort of keep-alive operation (called a SEQUENCE) instead. It's essentially just saying "I'm still here, please refresh my leases."
Question: What happens if you lose network connectivity for 31 seconds?
NFSv3 is stateless. It's not expecting communication from the clients. NFSv4 is stateful, and once that lease period elapses, the lease expires, the locks are revoked, and the locked files are made available to other clients.
With NFSv3, you could move network cables around, reboot network switches, make configuration changes, and be fairly sure that nothing bad would happen. Applications would normally just wait patiently for the network connection to work again. Many applications would wait until the end of time, but even an application like Oracle RAC allowed for a 200 second loss of storage connectivity by default. I've personally powered down and physically relocated NetApp storage systems that were serving NFSv3 shares to various applications, knowing that everything would just freeze until I completed my work and work would resume when I put the system back on the network.
With NFSv4, you have 30 seconds (unless you've increased the value of that parameter within ONTAP) to complete your work. If you exceed that, your leases time out. Normally this results in application crashes.
Example: Network failure with an Oracle Database using NFSv4
If you have an Oracle database, and you experience a loss of network connectivity (sometimes called a "network partition") that exceeds the lease timeout, you will crash the database.
Here's an example of what happens in the Oracle alert log if this happens:
2022-10-11T15:52:55.206231-04:00
Errors in file /orabin/diag/rdbms/ntap/NTAP/trace/NTAP_ckpt_25444.trc:
ORA-00202: control file: '/redo0/NTAP/ctrl/control01.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 1
Additional information: 4294967295
2022-10-11T15:52:59.842508-04:00
Errors in file /orabin/diag/rdbms/ntap/NTAP/trace/NTAP_ckpt_25444.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/redo1/NTAP/ctrl/control02.ctl'
ORA-27061: waiting for async I/Os failed
If you look at the syslogs, you should see several of these errors:
Oct 11 15:52:55 jfs0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Oct 11 15:52:55 jfs0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Oct 11 15:52:55 jfs0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
The log messages are usually the first sign of a problem, other than the application freeze. Typically, you see nothing at all during the network outage because processes and the OS itself are blocked attempting to access the NFS filesystem.
The errors appear after the network is operational again. In the example above, once connectivity was reestablished, the OS attempted to reacquire the locks, but it was too late. The lease had expired and the locks were removed. That results in an error that propagates up to the Oracle layer and causes the message in the alert log. You might see variations on these patterns depending on the version and configuration of the database.
There's nothing stopping vendors from writing software that detects the loss of locks and reacquires the file handles, but I'm not aware of any vendor who has done that.
In summary, NFSv3 tolerates network interruption, but NFSv4 is more sensitive and imposes a defined lease period.
Now, what if a 30 second timeout isn't acceptable? What if you manage a dynamically changing network where switches are rebooted or cables are relocated and the result is the occasional network interruption? You could choose to extend the lease period, but whether you want to do that requires an explanation of NFSv4 grace periods.
NFSv4 grace periods
Remember how I said that NFSv3 is stateless, while NFSv4 is stateful? That affects storage failover operations as well as network interruptions.
If an NFSv3 server is rebooted, it's ready to serve IO almost instantly. It was not maintaining any sort of state about clients. The result is that an ONTAP takeover operation often appears to be close to instantaneous. The moment a controller is ready to start serving data it will send an ARP to the network that signals the change in topology. Clients normally detect this almost instantly and data resumes flowing.
NFSv4, however, will produce a brief pause. Neither NetApp nor OS vendors can do anything about it - it's just part of how NFSv4 works.
Remember how NFSv4 servers need to track the leases, locks, and who's using what? What happens if an NFS server panics and reboots, or loses power for a moment, or is restarted during maintenance activity? The lease/lock and other client information is lost. The server needs to figure out which client is using what data before resuming operation. This is where the grace period comes in.
Let's say you suddenly power cycle your NFSv4 server. When it comes back up, clients that attempt to resume IO will get a response that essentially says, "Hi there, I have lost lease/lock information. Would you like to re-register your locks?"
That's the start of the grace period. It defaults to 45 seconds on ONTAP:
EcoSystems-A200-B::> nfs server show -vserver jfsCloud4 -fields v4-grace-seconds
vserver   v4-grace-seconds
--------- ----------------
jfsCloud4 45
The result is that, after a restart, a controller will pause IO while all the clients reclaim their leases and locks. Once the grace period ends, the server will resume IO operations.
Lease timeouts vs grace periods
The grace period and the lease period are connected. As mentioned above, the default lease timeout is 30 seconds, which means NFSv4 clients must check in with the server at least every 30 seconds or they lose their leases and, in turn, their locks. The grace period exists to allow an NFS server to rebuild lease/lock data, and it defaults to 45 seconds. ONTAP requires the grace period to be at least 15 seconds longer than the lease period, which guarantees that clients designed to renew their leases at least every 30 seconds have the opportunity to do so after a restart.
As I asked above:
Now, what if a 30 second timeout isn't acceptable? What if you manage a dynamically changing network where switches are rebooted or cables are relocated and the result is the occasional network interruption? You could choose to extend the lease period, but whether you want to do that requires an explanation of NFSv4 grace periods.
If you want to increase the lease timeout to 60 seconds in order to withstand a 60 second network outage, you're going to have to increase the grace period to at least 75 seconds. ONTAP requires it to be 15 seconds higher than the lease period. That means you're going to experience longer IO pauses during controller failovers.
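The arithmetic is simple enough to sketch. The vserver name in the commented ONTAP command below is a placeholder:

```shell
# ONTAP requires: grace period >= lease period + 15 seconds.
lease=60
grace=$((lease + 15))
echo "v4-lease-seconds=$lease requires v4-grace-seconds>=$grace"

# The corresponding ONTAP change would look something like
# (vserver name "vs1" is a placeholder):
#   nfs server modify -vserver vs1 -v4-lease-seconds 60 -v4-grace-seconds 75
```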
This shouldn't normally be a problem. Typical users only update ONTAP controllers once or twice per year, and unplanned failovers due to hardware failures are extremely rare. Also, let's be realistic, if you had a network where a 60-second network outage was a concerning possibility, and you needed to raise the lease timeout to 60 seconds, then you probably wouldn't object to rare storage system failovers resulting in a 75-second pause either. You've already acknowledged you have a network that's pausing for 60+ seconds rather frequently.
You do, however, need to be aware that the NFSv4 grace period exists. I was initially confused when I noted IO pauses on Oracle databases running in the lab, and I thought I had a network problem that was delaying failover, or maybe storage failover was slow. NFSv3 failover was virtually instantaneous, so why isn't NFSv4 just as quick? That's how I learned about the real-world impact of NFSv4 lease periods and NFSv4 grace periods.
Deep Dive - ONTAP lease/lock monitoring
If you really, really want to see what's going on with leases and locks, ONTAP can tell you.
The commands and output can be confusing because there are two ways to look at NFSv4 locks:
The NFSv4 server needs to know which NFSv4 clients currently own NFSv4 locks
The NFSv4 server needs to know which NFSv4 files are currently locked by an NFSv4 client.
The end result is the networking part of ONTAP needs to maintain a list of NFSv4 clients and which NFSv4 locks they hold. Meanwhile, the data part of ONTAP also needs to maintain a list of open NFSv4 files and which NFSv4 locks exist on those files. In other words, NFSv4 locks are indexed by the client that holds them, and NFSv4 locks are indexed by the file they apply to.
Note: I've simplified the screen shots below a little so they're not 200 characters wide and 1000 lines long. Your ONTAP output will have extra columns and lines.
If you want to get NFSv4 locking data from the NFSv4 client point of view, you use the vserver locks show command. It accepts various arguments and filters.
Here's an example of what's locked on one of the datafile volumes on one of my Oracle lab systems:
EcoSystems-A200-A::vserver locks*> vserver locks show -volume jfs0_oradata0 -fields volume,path,lockid,client-id
volume        path                             lockid                               client-id
------------- -------------------------------- ------------------------------------ ----------------
jfs0_oradata0 /jfs0_oradata0/NTAP/system01.dbf 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/system01.dbf 721e4ce8-e6e3-4011-b8cc-7cea6e53661b 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/sysaux01.dbf bb7afcdb-6f8c-4fea-b47d-4a161cd45ceb 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/sysaux01.dbf 2eacf804-7209-4678-ada5-0b9cdefceee0 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/users01.dbf  693d3bb8-aed5-4abd-939b-2fdb8af54ae6 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/users01.dbf  a7d24881-b502-40b6-b264-7414df8a98f5 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS000.dbf  1a33008c-573b-4ab7-ae87-33e9b5891e6a 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS000.dbf  b6ef3873-217a-46e3-bdc7-5703fb6c82f4 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS004.dbf  fef3204b-406c-4f44-a02b-d14adaba807c 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS004.dbf  9f9f737b-52de-4d7a-b169-3ba15df8bcc5 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS008.dbf  b322f896-1989-43ab-9d83-eaa2850f916a 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS008.dbf  cd33d350-ff79-4e29-8e13-f64ed994bc4e 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS012.dbf  e4a54f25-5290-4da3-9a93-28c4ea389480 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS012.dbf  f3faed7f-3232-46f4-a125-4d2ad8059bc4 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS016.dbf  be7ad0d4-bb70-45a8-85b5-45edcb626487 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS016.dbf  ce26918c-8a44-4d02-8c41-fafb7e5d2954 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS020.dbf  47517938-b944-4a0b-a9e8-960b721602f4 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS020.dbf  2808307d-46c9-4afa-af2a-bb13f0908ea3 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS024.dbf  f21b6f26-0726-4405-9bac-d9e680baa4df 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS024.dbf  0a95f55b-3dfa-45db-8713-c5ad717441ae 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS028.dbf  a0196191-4012-4615-b2fd-dda0ce2d7c3f 0100000028aa6c80
jfs0_oradata0 /jfs0_oradata0/NTAP/IOPS028.dbf  fc769b9d-0fff-4e74-944a-068b82702fd1 0100000028aa6c80
The first time I used this command, I immediately asked, "Hey, where's the lease data? How many seconds are left on the lease for those locks?" That information is held elsewhere. Since an NFSv4 file might be the target of multiple locks with different lease periods, and the NFSv4 server needs to enforce locks, the NFSv4 server tracks the detailed locking data down at the file level. You get that data with vserver locks nfsv4 show. Yes, it's almost the same command.
In other words, the vserver locks show command tells you which locks exist. The vserver locks nfsv4 show command tells you the details about a lock.
Let's take the first line in the above output:
EcoSystems-A200-A::vserver locks*> vserver locks show -volume jfs0_oradata0 -fields volume,path,lockid,client-id
volume        path                             lockid                               client-id
------------- -------------------------------- ------------------------------------ ----------------
jfs0_oradata0 /jfs0_oradata0/NTAP/system01.dbf 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 0100000028aa6c80
If I want to know how many seconds are left on that lock, I can run this command:
EcoSystems-A200-A::*> vserver locks nfsv4 show -vserver jfsCloud3 -lock-uuid 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2
There are no entries matching your query.
Wait, why didn't that work?
The reason is that I'm using a 2-node cluster. The NFSv4 client-centric command (vserver locks show) shows me locking information up at the network layer. The NFSv4 server spans all ONTAP controllers in the cluster, so this command will look the same on all controllers. Individual file management, however, is handled by the controller that owns the drives. That means the low-level locking information is available only on a particular controller.
Here are the individual controllers in my HA pair:
EcoSystems-A200-A::*> network int show
            Logical    Status     Network            Current            Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node               Port    Home
----------- ---------- ---------- ------------------ ------------------ ------- ----
EcoSystems-A200-A
            A200-01_mgmt1
                       up/up      10.63.147.141/24   EcoSystems-A200-01 e0M     true
            A200-01_mgmt2
                       up/up      10.63.147.142/24   EcoSystems-A200-02 e0M     true
If I ssh into the cluster, and the management IP is currently hosted on EcoSystems-A200-01, then the command vserver locks nfsv4 show will only look at NFSv4 locks that exist on the files that are owned by that controller.
If I open an ssh connection to 10.63.147.142 then I'll be able to view the NFSv4 locks for files owned by EcoSystems-A200-02:
EcoSystems-A200-A::*> vserver locks nfsv4 show -lock-uuid 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2
Logical
Interface   Lock UUID                            Lock Type
----------- ------------------------------------ ------------
jfs3_nfs2   72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 share-level
This is where I can see the lease data:
EcoSystems-A200-A::*> vserver locks nfsv4 show -lock-uuid 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 -fields lease-remaining
lif       lock-uuid                            lease-remaining
--------- ------------------------------------ ---------------
jfs3_nfs1 72ae1cb3-7a7c-48c6-aaa5-5b3ba5b78ae2 9
This particular system is set to a lease-seconds of 10. There's an active Oracle database, which means it's constantly performing IO, which in turn means it's constantly renewing the lease. If I cut the power on that host, you'd see the lease-remaining field count down to 0 and then disappear as the leases and associated locks expire.
The chance of anyone needing to go into these diag-level details is close to zero, but I was troubleshooting an Oracle dNFS bug related to leases and got to know all these commands. I thought it was worth writing up in case someone else ended up working on a really obscure problem.
So, that's the story on NFSv4. The top takeaways are:
NFSv4 isn't necessarily any better than NFSv3. Use whatever makes sense to you.
NFSv4 includes multiple security enhancements, but there are also other ways to secure an NFS connection.
NFSv4 is way, WAY easier to run through a firewall than NFSv3.
NFSv4 is much more sensitive to network interruptions than NFSv3, and you may need to tune the ONTAP NFS server.
NFSv4 Bonus Tip:
If you're playing with NFSv4, don't forget the domain. This is also documented in the big NFS TR linked above, but I missed it the first time through, and I've seen customers miss this as well. It's confusing because if you forget it, there's a good chance that NFSv4 will MOSTLY work, but you'll have some strange behavior with permissions.
Here's how my ONTAP systems are configured in my lab:
EcoSystems-A200-A::> nfs server show -vserver jfsCloud3 -fields v4-id-domain
vserver   v4-id-domain
--------- ------------
jfsCloud3 jfs.lab
and this is on my hosts:
[root@jfs0 ~]# more /etc/idmapd.conf | grep Domain
Domain = jfs.lab
They match. If I'd forgotten to update the default /etc/idmapd.conf, weird things would have happened.
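If you want to script a quick sanity check, it can look like the sketch below. The file path and the server-side value are illustrative; pull your real value from the nfs server show output and your real config from /etc/idmapd.conf.

```shell
# Write a sample client config purely for illustration.
cat > /tmp/idmapd.conf <<'EOF'
[General]
Domain = jfs.lab
EOF

server_domain="jfs.lab"   # from: nfs server show -fields v4-id-domain
client_domain=$(awk -F' *= *' '/^Domain/ {print $2}' /tmp/idmapd.conf)

if [ "$client_domain" = "$server_domain" ]; then
    echo "NFSv4 id domains match: $client_domain"
else
    echo "MISMATCH: client=$client_domain server=$server_domain"
fi
```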
Sorry about the delay responding, I was out of the office for a few weeks. The queuing behavior is an internal architectural detail of the version of ONTAP and the controller I was using, and the behavior does vary from version to version. I could probably find more details from the engineering team, but it changes often. The reason is that ONTAP is designed to service multiple workloads in a manageable, predictable way, and they're always tuning and improving something. Sometimes when you run a single, synthetic, IO-intensive workload you run into unusual patterns, and this happens to be one of them.

Spreading my writes across two volumes slightly improved parallelism, which meant the write latency was a little lower. I was using SLOB to drive the database, which meant many of the read IO's could not happen until the preceding write IO completed. That almost never happens with a regular database; datafile reads and datafile writes are largely independent of one another. With SLOB, I have that dependency. The result is that the tiny difference in write latency gets magnified, and it makes the graphs look odd. The performance difference between 1 and 2 volumes isn't nearly as significant in the real world as that graph suggests.

I almost didn't add that graph, but we do have a lot of customers who host just one Big Huge Database on a single storage system, and they truly want the maximum possible performance. Every microsecond counts. Almost nobody will notice a 10µs write latency difference, but if you do, it's worth spreading a database across volumes.
... View more
How many LUNs do I need?
(notes on Consistency Groups in ONTAP)
This post is part 2 of 2. In the prior post, I explained consistency groups and how they exist in ONTAP in multiple forms. I’ll now explain the connection between consistency groups and performance, then show you some basic performance envelopes of individual LUNs and volumes.
Those two things may not seem connected at first, but they are. It all comes down to LUNs.
LUNs are accessed using the SCSI protocol, which has been around for over 40 years and is showing its age. The tech industry has worked miracles improving LUN technology over the years, but there are a lot of limits related to host OSs, drivers, HBAs, storage systems, and drives that limit the performance of a single LUN.
The end result is this – sometimes you need more than one LUN to host your dataset in order to get optimum performance. If you want to take advantage of the advanced features of a modern storage array, you’ll need to manage those multiple LUNs together, as a unit. You’ll need a consistency group.
My previous post explained how ONTAP delivers consistency group management. This post explains how you figure out just how many LUNs you might need in that group, and how to ensure you have the simplest, most easily managed configuration.
Note: There are a lot of performance numbers shown below. They do NOT represent maximum controller performance. I did not do any tuning at all beyond implementing basic best practices. I created some basic storage configurations, configured a database, and ran some tests to illustrate my point. That's it.
ONTAP < 9.9.1
First, let’s go back a number of years to the ONTAP versions prior to the 9.9.1 release. There were some performance boundaries and requirements that contributed to the need for multiple LUNs in order to get optimum performance.
Test Procedure
To compare performance with differing LUN/volume layouts, I wrote a script that built an Oracle database for each of the following configurations:
1 LUN in a single volume
2 LUNs in a single volume
4 LUNs in a single volume
8 LUNs in a single volume
16 LUNs in a single volume
16 LUNs across 2 volumes
16 LUNs across 4 volumes
16 LUNs across 8 volumes
16 LUNs across 16 volumes
Warning: I normally ban anyone from using the term “IOPS” in my presence without providing a definition, because “IOPS” has a lot of different meanings. What’s the block size? Sequential or random ratio? Read/write mix? Measured from where? What’s my latency cutoff? All that matters.
In the graphs below, IOPS refers to random reads, using 8K blocks, as measured from the Oracle database. Most tests used 100% reads.
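To make that definition concrete, here's a tiny sketch of the block-size arithmetic (the function name is mine). The same bandwidth can be described as wildly different IOPS figures depending on the block size you assume:

```python
def iops_to_mbps(iops, block_size_kb):
    """Bandwidth in MB/sec implied by an IOPS figure at a fixed block size."""
    return iops * block_size_kb / 1024

# Roughly the same bandwidth, described two very different ways:
print(iops_to_mbps(100_000, 8))   # 100K IOPS of 8K random IO -> 781.25 MB/sec
print(iops_to_mbps(800, 1024))    # 800 IOPS of 1MB sequential IO -> 800.0 MB/sec
```

That's why an IOPS number without a block size, read/write mix, and latency cutoff doesn't tell you much.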
I used SLOB2 for driving the workload. The results shown below are not the theoretical storage maximums, they're the result of a complicated test using an actual Oracle database where a lot of IO has interdependencies on other IO. If you used a synthetic tool like fio, you’d see higher IOPS.
The question was - “How many LUNs do I need?” These tests used *one* volume. Multiple LUNs, but one volume. Let’s say you have a database. How many LUNs did you need in that LVM or Oracle ASM volume group to support your workload? What is the expected performance? Here’s the answer to that question when using a single AFF8080 controller prior to ONTAP 9.9.1.
There are three important takeaways from that test:
A single LUN hit the wall at about 35K IOPS.
A single volume hit the wall at about 115K IOPS.
The sweet spot for LUN count in a single volume was about 8, but there was some benefit going all the way to 16.
To rephrase that:
If you had a single workload that didn’t need more than 35K IOPS, just drop it on a single LUN.
If you had a single workload that didn’t need more than 115K IOPS, just drop it on a single volume, but distribute it across 8 LUNs.
If you had more than 115K IOPS, you would have needed more than one volume.
That’s all <9.9.1 performance data, so let’s see what improved in 9.9.1 and how it erased a lot of those prior limitations and vastly simplified consistency group architectures.
ONTAP >= 9.9.1
Threading is important for modern storage arrays, because they are primarily used to support multiple workloads. On occasion, we’ll see a single database hungry enough to consume the full performance capabilities of an A900 system, but usually we see dozens of databases hosted per storage system.
We have to strike a balance between providing good performance to individual workloads while also supporting lots of independent workloads in a predictable, manageable way. Without naming names, there are some competitors out there whose products offer impressive performance with a single workload but suffer badly from poor and unpredictable performance with multiple workloads. One database starts stepping on another, different LUNs outcompete others for attention from the storage OS, and things get bad. One of the ways storage systems manage multiple workloads is through threading, where work is divided into queues that can be processed in parallel.
ONTAP 9.9.1 included many improvements to internal threading. In prior versions, SAN IO was essentially being serviced in per-volume queues. Normally, this was not a problem. Controllers would be handling multiple workloads running with a lot of parallelism, all the queues stayed busy all the time, and it was easy for customers to reach platform maximum performance.
Most of my work is in the database space, and we’d often have the One Big Huge Giant Database challenge. I’ve architected systems where a single database ate the maximum capabilities of 12, yes twelve, controllers. If you only have one workload, it can be difficult to create a configuration that keeps all those threads busy and processing IO all the time. You had to be careful to avoid having one REALLY busy queue while others sat idle; otherwise you left potential performance on the table and would not get maximum controller performance.
Those concerns are 99% gone as of 9.9.1. There are still threaded operations, of course, but overall, the queues that led to those performance concerns don’t exist anymore. ONTAP services SAN IO more like a general pool of FC operations, spread across all the CPUs all the time.
To illustrate, let’s start with the same set of tests I showed for <9.9.1, with a single volume and varying numbers of LUNs in the diskgroup:
I see four important takeaways here:
A single LUN yields about 4X more IOPS than before.
A single LUN not only delivers 4X more IOPS, but the latency is also about 40% lower.
A single volume (with 8 LUNs) yields about 2X more IOPS
A single volume (with 8 LUNs) delivers 2X more IOPS and with 40% lower latency.
That might seem simple, but there are a lot of implications to those four points. Here are some of the things you need to understand.
ONTAP is only part of the picture
The graph above does show that two LUNs are faster than one LUN, but it doesn’t say why. It’s not really ONTAP that is the limiting factor; it’s the SCSI protocol itself. Even if ONTAP were infinitely fast, serving FC LUNs at 0µs of latency, it can’t service IO it hasn’t received.
You also have to think about the host-side limits. Hosts have queues of their own: per-LUN queues, per-path queues, and HBA queues. You still need some parallelism at the host level to get maximum performance.
In the tests above, you can see incremental improvements in performance as we bring more LUNs into play. I’m sure some of the benefits are a result of ONTAP parallelizing work better, but that’s only a small part of it. Most of the benefits flow from having more LUNs driven by the OS itself.
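One way to reason about the host-side parallelism requirement is Little's law: outstanding IOs = IOPS × latency. This is a rough sketch with an assumed per-LUN queue depth, not a tuning recommendation:

```python
import math

def luns_needed(target_iops, latency_s, per_lun_queue_depth):
    """Little's law: outstanding IOs = IOPS x latency. An assumed per-LUN
    queue depth then bounds how many LUNs the host needs to sustain it."""
    outstanding = target_iops * latency_s
    return math.ceil(outstanding / per_lun_queue_depth)

# e.g. to sustain 200K IOPS at 500us of latency with a common default
# per-LUN queue depth of 32, the host needs at least 4 LUNs:
print(luns_needed(200_000, 0.0005, 32))  # -> 4
```

The real picture involves path and HBA queues too, but the principle is the same: not enough LUNs means not enough outstanding IO to keep the array busy.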
The reason I wanted to explain this is because we have a lot of support cases about performance that aren’t exactly complaints, but are instead more like “Why isn’t my database faster than it is?” There’s always a bottleneck somewhere. If there wasn’t, all storage operations would complete in 0 microseconds, database queries would complete in 0 milliseconds, and servers would boot in 0 seconds.
We often discover that whatever the performance bottleneck might be, it ain’t ONTAP. The performance counters show the controller is nowhere near any limits, and in many cases, ONTAP is outright bored. The limit is usually up at the host. In my experience, the #1 cause of SAN performance complaints is an insufficient number of LUNs at the host OS layer. We therefore advise the customer to add more LUNs, so they can increase the parallelism through the host storage stack.
Yes, LUNs simply got faster
A lot of customers had single-LUN workloads that suddenly became a lot faster, because they updated to 9.9.1 or higher. Maybe it was a boot LUN that got faster and now patching is peppier. Maybe there was an application on a single LUN that included an embedded database, and now that application is suddenly a lot more responsive.
A volume of LUNs got faster too
Previously, I maxed out SAN IOPS in a single volume at about 115K IOPS. The limit roughly doubled to 240K IOPS in 9.9.1. That’s a big increase. IO-intensive workloads that previously required multiple volumes can be consolidated to a single volume. That means simpler management. You can create a single snapshot, clone a single volume, set a single QoS policy, or configure a single SnapMirror replication relationship.
Even if you don’t need the extra IOPS, you still get better performance
The latency dropped, too. Even a smaller database that only required 25K IOPS and was happily running on a single volume prior to 9.9.1 should see noticeably improved performance, because the response times of those individual 25K IOPS got better. Application response times get better, queries complete faster, and end users get happier.
How Many Volumes Do I Need?
I’d like to start by saying there is no best practice suggesting the use of one LUN per volume. I don’t know for sure where this idea originated, but I think it came from a very old performance benchmark whitepaper that included a 1:1 LUN:Volume ratio.
As mentioned above, it used to be important to distribute a workload across volumes in some cases, but it mostly only applied to single-workload configurations. If we were setting up a 10-node Oracle RAC cluster, and we wanted to push performance to the limit and get every possible IOP at the lowest possible latency, then we’d need perhaps 16 volumes per controller. There were often only a small number of LUNs on the system as a whole, so we may have used a 1:1 LUN:Volume ratio.
We didn’t HAVE to do that, and it’s in no way a best practice. We often just wanted to squeeze out a few extra percentage points of performance.
Also, don’t forget that there’s no value in unneeded performance. Configure what you need. If you only need 80K IOPS, do yourself a favor and configure a 2-LUN or perhaps 4-LUN diskgroup. It’s not hard to create more LUNs if you need them, but why do that? Why create unnecessary storage objects to manage? Why clutter up the output of commands like “lun show” with extra items that aren’t providing value? I often use the post office as an analogy – a 200MPH vehicle is faster than a 100MPH vehicle, but the neighborhood postal carrier won’t get any benefit from that extra performance.
If you have an unusual management need where one-LUN-per-volume makes more sense, that’s fine, but you have more things to manage, too. Look at the big picture and decide what’s best for you.
Want proof that multiple volumes don’t help? Check this out.
It’s the same line! In this example, I created a 16-LUN volume group and compared performance between configurations where those 16 LUNs were in a single volume, 2 volumes, 4, 8, and 16. There’s literally no difference, nor should there be. As mentioned above, ONTAP SAN processing as of 9.9.1 does not care whether the underlying LUNs are located in different volumes. The FC IO is processed as a common pool of FC IO operations.
Things get a little different when you introduce writes, because there still is some queuing behavior related to writes that may be important to you.
Write IO processing
If you have heavy write IO, you might want more than one volume. The graphs below illustrate the basic concepts, but these are synthetic tests. In the real world, especially with databases, you get different patterns of IO interdependencies.
For example, picture a banking database used to support online banking activity by customers. That will be mostly concurrent activity where a little extra latency doesn’t matter. If you need to withdraw money at the ATM, would you care if it took 2.001 seconds rather than the usual 2 seconds?
In contrast, if you have a banking database used for end-of-day processing, you have dependencies. Read #352 might only occur after read #351 has completed. A small increase in latency can have a ripple effect on the overall workload.
The graphs below show what happens when one IO depends on a prior IO and latency increases. It’s also a borderline worst-case scenario.
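The ripple effect is easy to quantify for the fully serial case. This sketch uses hypothetical numbers, not data from my tests, but it shows why even a small latency change matters when every IO waits on the one before it:

```python
def serial_runtime_s(io_count, latency_us):
    """If each IO can only start after the previous one completes,
    total runtime is simply io_count x latency (no parallelism to hide it)."""
    return io_count * latency_us / 1_000_000

# 1 million strictly dependent IOs: doubling latency from 150us to 300us
# doubles the runtime, even though both latencies are objectively fast.
print(serial_runtime_s(1_000_000, 150))  # 150.0 seconds
print(serial_runtime_s(1_000_000, 300))  # 300.0 seconds
```

With concurrent IO, extra latency is absorbed by parallelism; with dependent IO, it lands directly on the job's total runtime.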
First, let’s look at a repeat of my first 9.9.1 test, but this time I’m doing 70% reads and 30% writes. What happens?
The maximum measured IOPS dropped. Why? The reason is that writes are more expensive to complete than reads for a storage array. Obviously, platform maximums will be reduced as write IO becomes a larger and larger percentage of the IO, but this is just one volume. I’m nowhere near controller maximums. Performance remains awesome. I’m at about 150µs latency for most of the curve, and even at 100K IOPS, I’m only at 300µs of latency. That’s great, but it is slower than the 100% read IOPS test.
What you’re seeing is the result of read IOPS getting held back by the write IOPS. There were more IOPS available to my database from this volume, but they weren’t consumed, because my database was waiting on write IO to complete. The result is that the total IOPS dropped quite a bit.
Multi-Volume write IOPS
Here’s what happens when I spread these LUNs across two volumes.
Looks weird, doesn’t it? Why would 2 volumes be 2X as fast as a single volume, and why would 2, 4, 8, and 16 volumes perform about the same?
The reason is that ONTAP is establishing queues for writes. If I want to maximize write IOPS, I’m going to need more queues, which will require more volumes. The exact behavior can change between configurations and platforms, so there’s no true best practice here. I’m just calling out the potential need to spread your database across more than one volume.
Key takeaways:
If I have 16 LUNs, there is literally no benefit to splitting them amongst multiple volumes with a 100% read workload. Look at that earlier graph. The datasets all graphed as a single line.
With a 70% read workload, there was a big improvement going from 1 volume to 2, but nothing further after that. That’s because, in my configuration, there are two queues for write processing within ONTAP. Two volumes are no different from 3 or 4 or 5 in terms of keeping those queues busy.
I also want to repeat – the graphs are the worst-case scenario. A real database workload shouldn’t be affected nearly as much, because reads and writes should be largely decoupled from one another. In my test, there are about two reads for each write with limited parallelization, and those reads do not happen until the write completes. That does happen with real-world database workloads, but very rarely. For the most part, real database read operations do not have to wait for writes to complete.
Summary
To recap:
If you’re using ONTAP <9.9.1 with FC SAN, upgrade. We’ve observed LUNs deliver 4X more IOPS at 40% lower latency.
Once you get to ONTAP 9.9.1 (or higher):
A single LUN is good for around 100K IOPS on higher-end controllers. That’s not an ONTAP limit, it’s an “all things considered” limit that is a result of ONTAP limits, host limits, network limits, typical IO sizes, etc. I’ve seen much, much better results in certain configurations, especially ESX. I’m only suggesting 100K as a rule-of-thumb.
For a single workload, a 4-LUN volume group on a single volume can hit 200K with no real tuning effort. More LUNs in that volume are desirable in some cases (especially with AIX due to its known host FC behavior), but it’s probably not worth the effort for typical SAN workloads.
If you know you’ve got a very, very write-heavy workload, you might want to split your workload into two volumes. If you’re that concerned about IOPS, you probably did that anyway, simply because you probably chose to distribute your LUNs across controllers. That’s a common practice – split each workload evenly across all controllers to achieve maximum performance, as well as guaranteed even loading across the entire cluster.
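The takeaways above can be turned into a rough starting point. This is a sketch; the 100K-IOPS-per-LUN figure is the rule of thumb from this post, not an ONTAP limit, and the function name is mine:

```python
import math

def suggested_lun_count(required_iops, per_lun_iops=100_000, minimum=1):
    """Rough sizing sketch using the ~100K-IOPS-per-LUN rule of thumb.
    Size for what you actually need, not a hypothetical maximum."""
    return max(minimum, math.ceil(required_iops / per_lun_iops))

print(suggested_lun_count(80_000))   # 1 LUN covers it (a few more adds headroom)
print(suggested_lun_count(350_000))  # 4 LUNs
```

As with any rule of thumb, measure before and after; the right answer depends on the host OS, the IO pattern, and the controller model.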
Lastly, don’t lose perspective.
It’s nice to have an AFF system with huge IOPS capabilities for the sake of consolidating lots of workloads, but I find admins obsess too much about individual workloads and targeting hypothetical performance levels that offer no real benefits.
I look at a lot of performance stats, and virtually every application and database workload I see plainly shows no storage performance bottleneck whatsoever. The performance limits are almost universally the SQL code, the application code, available raw network bandwidth, or Oracle RAC cluster contention. Storage is usually less than 5% of the problem. The spinning-disk days of spending your way out of performance problems are over.
Storage performance sizing should be about determining actual performance requirements, and then architecting the simplest, most manageable solution possible. The SAN improvements introduced in ONTAP 9.9.1 noticeably improve manageability as well as performance.
... View more
Consistency Groups in ONTAP
There’s a good reason you should care about CGs – it’s about manageability.
If you have an important application like a database, it probably involves multiple LUNs or multiple filesystems. How do you want to manage this data? Do you want to manage 20 LUNs on an individual basis, or would you prefer just to manage the dataset as a single unit?
This post is part 1 of 2. First, I will explain what we mean when we talk about consistency groups (CGs) within ONTAP.
Part II covers the performance aspect of consistency groups, including real numbers on how your volume and LUN layout affects (and does not affect) performance. It will also answer the universal database storage question, “How many LUNs do I need?” Part II will be of particular interest to long-time NetApp users who might still be adhering to out-of-date best practices surrounding performance.
Volumes vs LUNs
If you’re relatively new to NetApp, there’s a key concept worth emphasizing – volumes are not LUNs.
Other vendors use those two terms synonymously. We don’t. A Flexible Volume, also known as a FlexVol, or usually just a “volume,” is just a management container. It’s not a LUN. You put data, including NFS/SMB files, LUNs, and even S3 objects, inside of a volume. Yes, it does have attributes such as size, but that’s really just accounting. For example, if you create a 1TB volume, you’ve set an upper limit on whatever data you choose to put inside that volume, but you haven’t actually allocated space on the drives.
This sometimes leads to confusion. When we talk about creating 5 volumes, we don’t mean 5 LUNs. Sometimes customers think that they create one volume and then one LUN within that volume. You can certainly do that if you want, but there’s no requirement for a 1:1 mapping of volume to LUN. The result of this confusion is that we sometimes see administrators and architects designing unnecessarily complicated storage layouts. A volume is not a LUN.
Okay then, what is a volume?
If you go back about eighteen years, an ONTAP volume mapped to specific drives in a storage controller, but that’s ancient history now.
Today, volumes are there mostly for your administrative convenience. For example, if you have a database with a set of 10 LUNs, and you want to limit the performance for the database using a specific quality of service (QoS) policy, you can place those 10 LUNs in a single volume and slap that QoS policy on the volume. No need to do math to figure out per-LUN QoS limits. No need to apply QoS policies to each LUN individually. You could choose to do that, but if you want the database to have a 100K IOPS QoS limit, why not just apply the QoS limit to the volume itself? Then you can create whatever number of LUNs that are required for the workload.
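Here's a small sketch of why the volume-level policy is easier. The numbers and function names are made up, but the math shows how per-LUN caps strand headroom when the load is uneven:

```python
def per_lun_throttled_iops(demand_per_lun, per_lun_limit):
    """Per-LUN QoS: each LUN is capped individually, even if siblings are idle."""
    return sum(min(demand, per_lun_limit) for demand in demand_per_lun)

def volume_throttled_iops(demand_per_lun, volume_limit):
    """Volume-level QoS: one cap shared by every LUN in the volume."""
    return min(sum(demand_per_lun), volume_limit)

# 10 LUNs and a 100K IOPS budget, but the load is uneven: one hot LUN.
demand = [40_000] + [2_000] * 9
print(per_lun_throttled_iops(demand, 10_000))    # 28,000 - hot LUN throttled
print(volume_throttled_iops(demand, 100_000))    # 58,000 - full demand served
```

With ten individual 10K caps, the hot LUN gets throttled while the other nine LUNs waste their allocation; the single volume-level policy serves the whole demand well under the same 100K budget.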
Volume-level management
Volumes are also related to fundamental ONTAP operations, such as snapshots, cloning, and replication. You don’t selectively decide which LUN to snapshot or replicate, you just place those LUNs into a single volume and create a snapshot of the volume, or you set a replication policy for the volume. You’re managing volumes, irrespective of what data is in those volumes.
It also simplifies how you expand the storage footprint of an application. For example, if you add LUNs to that application in the future, just create the new LUNs within the same volume. They will automatically be included in the next replication update, the snapshot schedule will apply to all the LUNs, including the new ones, and the volume-level QoS policy will now apply to IO on all the LUNs, including the new ones.
You can selectively clone individual LUNs if you like, but most cloning workflows operate on datasets, not individual LUNs. If you have an LVM with 20 LUNs, wouldn’t you rather just clone them as a single unit than perform 20 individual cloning operations? Why not put the 20 LUNs in a single volume and then clone the whole volume in a single step?
Conceptually, this makes ONTAP more complicated, because you need to understand that volume abstraction layer, but if you look at real-world needs, volumes make life easier. ONTAP customers don’t buy arrays for just a single LUN; they use them for multiple workloads with LUN counts going into the tens of thousands.
There’s also another important term for a “volume” that you don’t often hear from NetApp. The term is “consistency group,” and you need to understand it if you want maximum manageability of your data.
What’s a Consistency Group?
In the storage world, a consistency group (CG) refers to the management of multiple storage objects as a single unit. For example, if you have a database, you might provision 8 LUNs, configure it as a single logical volume, and create the database. (The term CG is most often used when discussing SAN architectures, but it can apply to files as well.)
What if you want to use array-level replication to protect that database? You can’t just set up 8 individual LUN replication relationships. That won’t work, because the replicated data won’t be internally consistent across LUNs. You need to ensure that all 8 replica LUNs are consistent with one another, or the database will be corrupt.
This is only one aspect of CG data management. CGs are implemented in ONTAP in multiple ways. This shouldn’t be surprising – an ONTAP system can do a lot of different things. The need to manage datasets in a consistent manner requires different approaches depending on the chosen NetApp storage system architecture and which ONTAP feature we’re talking about.
Consistency Groups – ONTAP Volumes
The most basic consistency group is a volume. A volume hosting multiple LUNs is intrinsically a consistency group. I can’t tell you how many times I’ve had to explain this important concept to customers as well as NetApp colleagues simply because we’ve historically never used the term “consistency group.”
Here’s why a volume is a consistency group:
If you have a dataset and you put the dataset components (LUNs or files) into a single ONTAP volume, you can then create snapshots and clones, perform restorations, and replicate the data in that volume as a single consistent unit. A volume is a consistency group. I wish we could update every reference to volumes across all the ONTAP documentation in order to explain this concept, because if you understand it, it dramatically simplifies storage management.
Now, there are times where you can’t put the entire dataset in a single volume. For example, most databases use at least two volumes, one for datafiles and one for logs. You need to be able to restore the datafiles to an earlier point in time without affecting the logs. You might need some of that log data to roll the database forward to the desired point in time. Furthermore, the retention times for datafile backups might differ from log backups.
We have a solution for that, too, but first let’s talk about MetroCluster.
Consistency Groups & MetroCluster
While regular ol’ ONTAP volumes are indeed consistency groups, they’re not the only implementation of CGs in ONTAP. The need for data consistency appears in many forms. SyncMirrored aggregates are another type of CG that applies to MetroCluster.
MetroCluster is a screaming fast architecture, providing RPO=0 synchronous mirroring, mostly used for large-scale replication projects. If you have a single dataset that needs to be replicated to another site, MetroCluster probably isn’t the right choice. There would probably be simpler options.
If, however, you’re building an RPO=0 data center infrastructure, MetroCluster is awesome, because you’re essentially doing RPO=0 at the storage system layer. Since we’re replicating everything, we can do replication at the lowest level – right down at the RAID layer. The storage system doesn’t know or care where changes are coming from; it just replicates each write to drives in two different locations. It’s very streamlined, which means it’s faster and makes failovers easier to execute and manage, because you’re failing over “the storage system” in its entirety, not individual LUNs.
Here's a question, though. What if I have 20 interdependent applications and databases and datasets? If a backhoe cuts the connection between sites, is all that data at the remote site still consistent and usable? I don’t want one database to be ahead in time from another. I need all the data to be consistent.
As mentioned before, the individual volumes are all CGs unto themselves, but there’s another layer of CG, too – the SyncMirror aggregate itself. All the data on a single replicated MetroCluster aggregate makes up a CG. The constituent volumes are consistent with one another. That’s a key requirement to ensure that some of the disaster edge cases, such as rolling disasters, still yield a surviving site that has usable, consistent data and can be used for rapid data center failover. In other words, a MetroCluster aggregate is a consistency group, with respect to all the data on that aggregate, which guarantees data consistency in the event of sudden site loss.
Consistency Groups & APIs
Let’s go back to the idea of a volume as a consistency group. It works well for many situations, but what if you need to place your data in more than one volume? For example, what if you have four ONTAP controllers and want to load up all of them evenly with IO? You’ll have four volumes. You need consistent management of all four volumes.
We can handle that, too. We have yet another consistency group capability that we implement at the API level. We did this about 20 years ago, originally for Oracle ASM diskgroups. Those were the days of spinning drives, and we had some customers with huge Oracle databases that were both capacity-hungry and IOPS-hungry to the point they required multiple storage systems.
How do you get a snapshot of a set of 1000 LUNs spread across 12 different storage systems? The answer is “quite easily,” and this was literally my second project as a NetApp employee. You use our consistency group APIs. Specifically, you’d make an API call for “cg-start” targeting all volumes across the various systems, then call “cg-commit” on all those storage systems. If all those cg-commit API calls report success, you know you have a consistent set of snapshots that can be used for cloning, replication, or restoration.
You can do this with a few lines of scripting, and we have multiple management products, including SnapCenter, that make use of those APIs to perform data consistent operations.
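The scripting pattern looks roughly like this. It's a sketch with a stubbed-out API client (a real script would make authenticated ZAPI or REST calls to each cluster), so it shows the two-phase shape of the workflow, not working automation:

```python
# Records every call so the sequencing is visible; a real implementation
# would send these to the storage systems.
calls = []

def call_api(system, api, **kwargs):
    """Stub standing in for a real ONTAP API client."""
    calls.append((system, api))
    return {"status": "passed"}

def cg_snapshot(systems, volumes_by_system, snapshot_name):
    # Phase 1: establish a consistency point on every system.
    for system in systems:
        call_api(system, "cg-start",
                 volumes=volumes_by_system[system], snapshot=snapshot_name)
    # Phase 2: commit the snapshots. Only if every commit succeeds do we
    # have a consistent set usable for cloning, replication, or restore.
    results = [call_api(system, "cg-commit", snapshot=snapshot_name)
               for system in systems]
    return all(result["status"] == "passed" for result in results)

ok = cg_snapshot(["cluster1", "cluster2"],
                 {"cluster1": ["dbvol1"], "cluster2": ["dbvol2"]},
                 "nightly")
print(ok)
```

The important property is the fence: no commit happens anywhere until every system has started its consistency point, which is what makes the resulting snapshots consistent with one another.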
These APIs are also part of the reason everyone, including NetApp personnel, often forget that an ONTAP volume is a consistency group. We had those APIs that had the letters “CG” in them, and everyone subconsciously started to think that this must be the ONLY way to work with consistency groups within ONTAP. That’s incorrect; the cg-start/cg-commit API calls are merely one way ONTAP delivers consistency group-based management.
Consistency Groups & SM-BC
SnapMirror Business Continuity (SM-BC) is similar to MetroCluster but provides more granularity. MetroCluster is probably the best solution if you need to replicate all or nearly all the data on your storage system, but sometimes you only want to replicate a small subset of total data.
SM-BC almost didn’t need to support any sort of “consistency group” feature. We could have scoped that feature to just single volumes. Each individual volume could have been replicated and able to be failed over as a single entity.
However, what if you needed a business continuity plan for three databases, one application server, and all four boot LUNs? Sure, you might be able to put all that data into a single volume, but it’s likely that your overall data protection, performance, monitoring, and management needs would require the use of more than one volume.
Here’s how that affects data consistency with SM-BC. Say you’ve provisioned four volumes. The key is that a business continuity plan requires all 4 of those volumes entering and exiting a consistent replication state as a single unit.
We don’t want to have a situation where the storage system is recovering from an interruption in site-to-site connectivity with one volume in an RPO=0 state, while the other three volumes are still synchronizing. A failure at that moment would leave you with mismatched volumes at the destination site. One of them would be later in time than others. That’s why we base your SM-BC relationships on CGs. ONTAP ensures those included volumes enter and exit an RPO=0 state as a single unit.
Native ONTAP Consistency Groups
Finally, ONTAP also allows you to configure advanced consistency groups within ONTAP itself. The results are similar to what you’d get with the API calls I mentioned above, except now you don’t have to install extra software like SnapCenter or write a script.
Here’s an example of how you might use ONTAP Consistency Groups:
In this example, I have an Oracle database with datafiles distributed across 4 volumes located on 4 different controllers. I often do that to ensure my IO load is guaranteed to be evenly distributed across all controllers in the entire cluster. I also have my logs in 3 different volumes, plus I have a volume for my Oracle binaries.
The point of the ONTAP Consistency Group feature is to enable users to manage applications and application components, and not worry about LUNs and individual volumes. Once I add this CG (which is composed of two child CGs), I can do things like schedule snapshots for the application itself. The result is a CG snapshot of the entire application. I can now use those snapshots for cloning, restoration or replication.
I can also work at a more granular level. For example, I could do a traditional Oracle hot backup procedure as follows:
“alter database begin backup;”
POST /application/consistency-groups/(Datafiles)/snapshots
“alter database end backup;”
“alter database archive log current;”
POST /application/consistency-groups/(Logs)/snapshots
The result of that is a set of volume snapshots, one of the datafiles and one of the logs, which are recoverable using a standard Oracle backup procedure.
Specifically, the datafiles were in backup mode when a snapshot of the first CG was taken. That’s the starting point for a restoration. I then removed the database from backup mode and forced a log switch before making the API call to create a snapshot of the log CG. The snapshot of the log CG now contains the required logs for making that datafile snapshot consistent.
(Note: Since 12cR1 you don’t really have to place an Oracle database in backup mode, but most DBAs are more comfortable with that additional step.)
Those two sets of snapshots constitute a restorable, clonable, usable backup. I’m not operating on LUNs or filesystems; I’m making API calls against CGs. It’s application-centric management. There’s no need to change my automation strategy as the application evolves over time and I add new volumes or LUNs, because I’m just operating on named CGs. It even works the same with SAN and file-based storage.
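The hot-backup sequence above is easy to script. The sketch below just assembles the ordered steps rather than executing them; the SQL strings come straight from the procedure, the REST path follows the POST lines shown, and the cluster address is a hypothetical placeholder. Note that in practice the CG would be addressed by its UUID rather than the (Datafiles)/(Logs) names used as shorthand above.

```python
BASE = "https://cluster.example.com/api"

def snapshot_url(cg):
    # Build the CG snapshot endpoint for a consistency group.
    # (A real call would use the CG's UUID, looked up by name first.)
    return f"{BASE}/application/consistency-groups/{cg}/snapshots"

def hot_backup_steps():
    # Return the ordered steps as (kind, payload) tuples so the orchestration
    # logic is visible without needing a live Oracle/ONTAP environment.
    return [
        ("sql",  "alter database begin backup"),
        ("rest", snapshot_url("Datafiles")),
        ("sql",  "alter database end backup"),
        ("sql",  "alter database archive log current"),
        ("rest", snapshot_url("Logs")),
    ]

for kind, payload in hot_backup_steps():
    print(kind, payload)
```

A wrapper would run the "sql" steps through sqlplus or a database driver and POST the "rest" steps with proper authentication.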
We’ve got all sorts of ideas about how to keep expanding this vision of application-centric storage management, so keep checking in with us.
... View more
You'll want a support case for this. The answer is probably buried in the SCO job logs, and you'll need someone to parse through them for you.
... View more
You're correct. They've normalized a bandwidth limit. If you read carefully, you'll see statements like a volume offers "100 IOPS per GB (8K IO size)", which really means 800KB/sec per GB; they divided that bandwidth by the 8K IO size to get 100 IOPS. They could just as well have described it as 200 IOPS per GB (4K IO size). It's the same thing. Normally this is all pretty unimportant, but with databases you have a mix of IO types. Random IOs are a lot more work to process, so it's nice to be able to limit actual IOPS. In addition, a small number of huge-block sequential IOs can consume a lot of bandwidth, so it's nice to be able to limit bandwidth independently. There's some more material on this at this link https://tv.netapp.com/detail/video/6211770613001/a-guide-to-databasing-in-the-aws-cloud?autoStart=true&page=4&q=oracle starting at about the 2:45 mark.
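The normalization above is easy to verify with a couple of lines of arithmetic: the same 800KB/sec-per-GB bandwidth limit yields 100 "IOPS" at an 8K IO size or 200 "IOPS" at a 4K IO size.

```python
# Restating a bandwidth-per-GB limit as "IOPS per GB" at different IO sizes.
def iops_per_gb(bandwidth_kb_per_gb, io_size_kb):
    return bandwidth_kb_per_gb / io_size_kb

bandwidth = 100 * 8                 # 100 IOPS * 8KB = 800KB/sec per GB
print(iops_per_gb(bandwidth, 8))    # 100.0 at an 8K IO size
print(iops_per_gb(bandwidth, 4))    # 200.0 at a 4K IO size
```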
... View more
IOPS should refer to individual IO operations. A typical ERP database might reach 10K IOPS during random IO operations. That means 10,000 discrete, individual 8K block operations per second. If you do the math, that's also about 80MB/sec. When the database does a full table scan or an RMAN backup, the IOs are normally in 1MB chunks. The OS may break them down, but the database is trying to do 1MB IOs. That means just 80 IOPS will consume 80MB/sec of bandwidth. The end result is that a true IOPS-based QoS control will throttle random IO at a much lower total bandwidth than sequential IO. That's normally okay, because storage arrays have an easier time with large-block IO, so it's fine to allow a host to consume lots of bandwidth.

It's easy to make a sizing error with databases when QoS is involved, and it happens on-premises too. A storage admin may not realize that during the day the database needs 10K IOPS at an 8K IO size (80MB/sec), but at night it needs 800 IOPS at a 1MB IO size (800MB/sec). With a pure bandwidth QoS limit, like 100MB/sec, you'll be safely under the limit most of the time while the database is doing normal random IO, but those late-night reports and RMAN backups slam into the 100MB/sec limit.

That's why you really, really need snapshots with a database in the public cloud. It's the only way to avoid bulk data movement. If you have to size for RMAN bulk data transfers, you end up with storage that is 10X more powerful and expensive than you need outside of the backup window.

One neat thing one of our customers did to fix the Oracle full table scan problem in the public cloud was Oracle In-Memory. They spent more for the In-Memory licenses, and they spent more for the RAM in the VM, but the result was dramatically less load on the storage system. That saved money, but more importantly, they were able to meet their performance targets in the public cloud. It's a perfectly obvious use case for In-Memory; it was just nice to see proof that it worked as predicted.
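The day-versus-night sizing mismatch described above works out like this (the 100MB/sec cap is the hypothetical figure from the example):

```python
# Bandwidth consumed by the two workload profiles against a 100MB/sec QoS cap.
def bandwidth_mb(iops, io_size_kb):
    return iops * io_size_kb / 1024   # MB/sec

daytime = bandwidth_mb(10_000, 8)     # 10K random IOPS at 8K blocks
nightly = bandwidth_mb(800, 1024)     # 800 IOPS at 1MB (RMAN / table scans)
cap = 100                             # MB/sec bandwidth QoS limit

print(daytime, nightly)               # ~78 vs 800 MB/sec
print(daytime < cap, nightly < cap)   # daytime fits; nights slam the limit
```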
... View more
FlashCache actually might still help with the backup situation. The basic problem here is the QoS limit that the cloud providers placed on their storage. They call it IOPS, but it's really bandwidth. We've seen a lot of customers run into this exact problem.

Let's say you had an ancient HDD array. You could do backups AND the usual database random IO pretty easily, because those large-block backup IOs are easy for a storage array to process. They're nice big IOs, and the array can do readahead. When we used to size HDD database storage, we'd always focus on the random IOPS because that controlled the number of drives that went into the solution. The sequential IO, like backups and many reporting operations, was almost free. You paid for random IO, and we threw in the sequential IO at no cost. If you looked at the numbers, a typical HDD array might be able to provide 250MB/sec of random IO but could easily do 2.5GB/sec of large-block sequential IO.

Public cloud storage doesn't give you that "free" sequential IO. There are strict bandwidth controls, and the result is that customers are often surprised that everything about their database works just fine with the exception of the backup, or maybe that one late-night report that included a full table scan. The day-to-day random IOPS fit within the capabilities of the backend storage, but the sequential IO work slams into the limits relatively easily.

FlashCache ought to ease the pressure there, because the IOPS serviced by the FlashCache layer won't touch the backend disks. I'd recommend limiting the number of RMAN channels too, because some IO will still need to reach those backend disks.
... View more
I've seen synthetic IO tests with FlashCache on CVO, and the results are amazing, as they should be. FlashCache was a miracle for database workloads when it first came out (I was in NetApp PS at the time) because it brought down the average latency of HDDs. It works the same with CVO. The backend drives for AWS and Azure native storage are flash, but they're still a shared resource, and they're not nearly as fast as an actual on-premises all-flash array. FlashCache on that block of NVMe storage on the CVO instance works the same way - it brings down the average latency.

I don't think there's a way to monitor the burst credits, but the providers publish the math for how they're calculated. You'll exhaust the burst credits pretty quickly with backups, so it's probably not going to help in that scenario. I did some tests a few years back where the burst credits were really confusing my results until I figured out what was happening.

With respect to snapshots, check out TR-4591, which includes some material on how to use plain snapshots right on ONTAP itself. SnapCenter is often the best option, but not always. What you want is a snapshot; there are multiple ways to get it.
... View more
Ah, that explains a lot. The IOPS limits on the underlying disks for CVO are based on bandwidth, not actual IO operations. It's fairly easy to size CVO for the day-to-day needs of an Oracle database, but when it's time for backups, you can easily overwhelm those disks. It's not a CVO limitation; the same thing happens with databases directly on Azure or AWS drives. 5000 IOPS at an 8K block size might be plenty for normal database operations, but an RMAN operation can consume all 5K IOPS and starve the database. The best option is really to stop moving backup data around and rely on snapshots for in-place backup and restore instead. If that's not an option, you might be stuck increasing the IOPS capabilities of the backend disks to handle the streaming backup workloads.
... View more
Most DBAs just use ASM's native rebalancing ability. They create new LUNs of equivalent size, add them to the current diskgroup, and then start dropping the old LUNs. Once the rebalance is complete, disconnect the old LUNs. If any of those LUNs host grid services, you may need to run an additional command or two.
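That flow can be sketched as the SQL a DBA would run; the helper below just generates the statement. The diskgroup and disk names are hypothetical, and the add-and-drop-in-one-statement form is used so ASM performs a single rebalance pass. (In ASM, disks are added by path but dropped by their ASM disk name.)

```python
# Generate the ALTER DISKGROUP statement for a LUN migration via rebalance.
def migrate_diskgroup_sql(diskgroup, new_disks, old_disks, power=4):
    add = " ".join(f"ADD DISK '{d}'" for d in new_disks)    # paths
    drop = " ".join(f"DROP DISK {d}" for d in old_disks)    # ASM disk names
    return (f"ALTER DISKGROUP {diskgroup} {add} {drop} "
            f"REBALANCE POWER {power};")

print(migrate_diskgroup_sql(
    "DATA",
    new_disks=["/dev/mapper/new_lun1", "/dev/mapper/new_lun2"],
    old_disks=["DATA_0000", "DATA_0001"],
))
```

Progress can be watched in v$asm_operation; once the rebalance finishes, the old LUNs can be disconnected from the host.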
... View more