LUNs of TESTW_SAPDBVOL01 are presented to the igroup of the TESTW DB server, then:
vol clone split start TESTW_SAPDBVOL01
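For context, a minimal sketch of the surrounding 7-Mode clone workflow (the clone volume name, LUN path, base snapshot, and igroup name are illustrative; only TESTW_SAPDBVOL01 comes from the thread):

```
# Create a FlexClone off the SnapMirror destination volume,
# map its LUN to the DB host, then split the clone off its parent.
vol clone create TESTW_SAPDBVOL01_clone -b TESTW_SAPDBVOL01 base_snap
lun map /vol/TESTW_SAPDBVOL01_clone/lun0 TESTW_igroup
vol clone split start TESTW_SAPDBVOL01_clone

# Monitor how far the background split has progressed
vol clone split status TESTW_SAPDBVOL01_clone
```

Until `vol clone split status` reports completion, the base snapshot on the parent remains locked.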
We currently have 2 DBs (total volume size is around 6 TB) moved to this pair of NetApps, with 10 more scheduled. Eventually there will be over 30 TB on PROD that will need to be cloned.
Currently the 6 TB of volumes (3 TB on each controller) takes 2 days to finish.
If I run 2 VOL CLONE SPLIT operations on each controller at the same time on the current DBs, completion takes 4 days!
The solution works beautifully, but my problem is that as more and more databases are added to this NetApp solution, the VOL CLONE SPLIT process takes a very long time to complete; with 2 VOL CLONE SPLIT processes running at the same time on each controller, the completion time increases tremendously. We have to VOL CLONE SPLIT so we can delete the reference snapshot from PROD, otherwise our snapshot reserves there will fill up. We also cannot delete the reference snapshot of the clones on the PROD NetApp, as doing so would cause SnapMirror updates to fail while the reference snapshot still exists on the BACKUP/Non-Prod NetApp. And the reference snapshot on the Non-Prod NetApp cannot be deleted until my VOL CLONE SPLIT process is done.
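The dependency chain described above can be verified on the console; a sketch, with illustrative snapshot names and output (exact columns vary by release):

```
# On the BACKUP/Non-Prod controller: list snapshots in the parent volume
snap list TESTW_SAPDBVOL01

#   date          name
#   ------------  ----------------------------------------
#   Jan 10 02:00  clone_base.1 (busy,vclone,snapmirror)
#
# "busy,vclone" = locked by an unsplit FlexClone; cannot be deleted
# until the split completes. "snapmirror" = the mirror relationship
# still references it, so deleting it on PROD would break updates.
```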
Our PROD NetApp is SSD, so we can't allocate a very large snapshot reserve to capture all the snapshot deltas and allow continued SnapMirror updates while the vol clone split processes run on the BACKUP/Non-Prod NetApp.
So my question is: Is there a way to have the VOL CLONE SPLIT process done faster?
Could I use VOL COPY or NDMPCOPY instead? (I can probably tell the client that their DBs will now need to be down until a VOL COPY / NDMPCOPY (to a different aggregate???) is done. They will lose the luxury of having their DB clones immediately available.) We're also planning to have a 10GbE SnapMirror pipe between the PROD and BACKUP/Non-Prod NetApps and just use SnapMirror to do the above -- but then again, the clones will no longer be instantly mountable... There will now have to be a wait until after a SnapMirror update/break is done, or a VOL COPY of the SnapMirror volume to an independent FlexVol, likely on a different aggregate...
vol copy and ndmpcopy will be slower. Split will still be fastest.
Based upon the above it sounds like BACKUP/Non-Prod is all SATA and we are likely spindle bound. I would suggest gathering some perf data during your split operations. If you just want to start small, start a logged ssh session in advanced mode during a split operation and gather:
sysstat -x 1 (for 15 minutes)
From there I assume we'll see that there just aren't enough spindles to handle the request; if not, we can start to narrow it down a little and move on to perfstats with NGS assistance.
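The suggested capture could look like this on the console (a sketch; the column names are from typical `sysstat -x` output):

```
priv set advanced
sysstat -x 1
# Leave running ~15 minutes inside the logged ssh session, then Ctrl-C.
# Key columns to watch:
#   "Disk util" -- sustained values near 100% on the SATA aggregate
#                  point at a spindle bottleneck
#   "CP ty"     -- back-to-back consistency points (B) also indicate
#                  the disks cannot keep up
priv set admin
```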
The 10GbE SnapMirror upgrade will be nice in that it will allow you to start the split process much sooner than before, but until the split is complete you will still be limited and unable to remove the source snapshot.
I made the mistake of putting my personal information out here once, so send me a private message and I'll send my personal contact data if you would like.
We were told by our post-sales engineer that VOL CLONE SPLIT is meant to be slow and gingerly take its time becoming a FlexVol, so the mounted LUNs under the volume continue to perform optimally as if nothing is going on in the background. I have already tried prioritisation of the vol, but no dice. It is really slow. The larger the vol and the more concurrent vol clone splits one has, the slower the time to complete. So I thought vol copy/ndmpcopy, or even doing fresh snapmirrors/breaks, would be a whole lot faster (with sacrifices of course -- meaning the clones will not be instantly usable).
Well, vol copy might be faster, but I'd have to try it out first. ndmpcopy will not be faster. The reason for my earlier statements was just what you pointed out again: the more splits you have going on, the slower it is, leading me to think there is a spindle limit there.
But even with just 1 VOL CLONE SET being split, it still takes a fair amount of time -- doing a fresh snapmirror and breaking that relationship will still be considerably faster than a vol clone split.
I have not done this but do you think it is possible to vol copy the snapmirror volume to another aggregate?
Can you recommend other solutions/possibilities? It is definite now that we're getting a 10GbE link between the two 3270s.
One option for our weekly clone requirements would be to no longer do a vol clone split but to leave those DB copies as unsplit FlexClone volumes (and just deal with the snapshot reserves on the production NetApp). The other DB copies, which will not be re-cloned/refreshed for a month, could be done either via vol copy as proposed (if it works) or snapmirror/break using the 10GbE pipe -- "and just accept the fact" that there will now be WAIT times for some of the DB clones to be available...
So it is possible to perform a vol copy off of a snapmirror destination (once the init finishes). I would suggest you base your vol copy on a snapshot that is present in the destination snapmirror volume, since you are using LUNs. You can also use vol copy between aggregates or controllers (though I would suggest intra-controller for speed) if you need to.
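A sketch of that vol copy in 7-Mode (destination volume, aggregate, size, and snapshot name are illustrative; the destination must exist and be restricted before the copy):

```
# Create and restrict a destination volume at least as large as the source
vol create TESTW_SAPDBVOL01_copy aggr_dest 3t
vol restrict TESTW_SAPDBVOL01_copy

# Copy from a consistent snapshot in the snapmirror destination volume
vol copy start -s snapmirror_base_snap TESTW_SAPDBVOL01 TESTW_SAPDBVOL01_copy
vol copy status

# Bring the independent copy online once the copy completes
vol online TESTW_SAPDBVOL01_copy
```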
The snapmirror init/break cycle over 10GbE is a solution, but instead of having a single source and destination for each database, consider a one-to-many setup where DB1 is mirrored twice: once for your actual DR relationship and once for your clone process. The DR relationship would remain intact at all times, so that in case of an actual disaster you don't impact the RTO/RPO requirements. What will also work is to set up a cascade and then break/resync the tail of the cascade: DB1 on prod mirrors to DB1_DR on backup, and another mirror goes from DB1_DR on backup to DB1_clone on backup. DB1_DR would always remain intact, and DB1_clone would be the one broken off each time. Both of these solutions mean the clones will not be available until the init or resync completes, as you already understand. What worries me about both solutions is scalability as you add more 3 TB DBs. Since 3 TB is much more than a 10GbE link can move quickly, if you go down this road I might suggest the cascade solution, as it will likely be faster and reduce the impact on your 10GbE link.
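The cascade could be sketched like this (a hedged example; filer names and schedules are illustrative, only the DB1/DB1_DR/DB1_clone naming comes from the post):

```
# /etc/snapmirror.conf on the BACKUP controller
# Stage 1: DR mirror from prod -- kept intact at all times
prodfiler:DB1       backupfiler:DB1_DR     - 0 2 * *
# Stage 2: cascade from the DR copy to the clone volume
backupfiler:DB1_DR  backupfiler:DB1_clone  - 0 4 * *
```

```
# Weekly refresh cycle on the BACKUP controller
snapmirror break DB1_clone                    # clone becomes writable for the DB host
# ... clients use DB1_clone ...
snapmirror resync -S backupfiler:DB1_DR DB1_clone   # re-attach for the next refresh
```

The resync only transfers the changed blocks since the common snapshot, which is why the cascade should be lighter on the 10GbE link than repeated full inits.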
As you stated for the weekly clones, if you can absorb the change in the snap reserve/fractional reserve on the prod system, that would offer you the greatest storage efficiency and speed; otherwise I think you'll have to go back to the snapmirror or vol copy scenario.
I have a customer with the same problem: 3 TB SAP ERP on a FAS3240AE MetroCluster. Vol clone split takes almost a week. We tried "priv set diag; wafl scan speed 1000" but it won't go over a few megabytes per second.
Is there any hidden setting to speed up the process? We tried a filer internal vol copy/snapmirror which gave us way better speeds.
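The filer-internal snapmirror mentioned above could look like this (a sketch with illustrative volume and aggregate names; it yields an independent copy rather than a split clone):

```
# Intra-controller volume SnapMirror as a faster alternative to split
vol create DB1_copy aggr2 3t
vol restrict DB1_copy
snapmirror initialize -S filer:DB1 filer:DB1_copy
snapmirror status DB1_copy
snapmirror break DB1_copy   # DB1_copy is now an independent, writable FlexVol
```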
Sure, not a bad idea to do so, but as with snapmirror we have a throttle option, and for the usual WAFL scanners we can increase wafl scan speed, yet it seems we can do nothing for vol clone split. Imagine a vol clone split after a restore due to an inconsistent database: "hey, everything is fine now, it will just take a week to get everything back to 100% normal..."