
Snapmirror Destination Performance Expectations?

GregKorten

If I have the following, with a production source at Site A and a disaster recovery destination at Site B:

  • Sufficient bandwidth between Site A and Site B to replicate changed data
  • A supported, documented replication interval for SnapMirror
  • Identical NetApp array configurations in each site (FAS model, disk layout, aggregates, etc.)
  • Like-for-like host hardware configurations and access protocols in both sites
  • Only read-only flexible volume mirrors of Site A in the destination Site B (cold site)

Can and should the Site B array be expected to perform the same as Site A did for the mirrored flexible volumes when it is brought online and subjected to the same production workload minutes after issuing a "snapmirror break"?


radek_kubka

"Can and should the Site B array be expected to perform the same as what was experienced in Site A"

Hi Greg,

I would say that this statement is *almost* true.

Oddly enough, Site B should perform slightly better, because the filer won't be using CPU cycles for SnapMirror updates (unless you reverse the SnapMirror relationship after the fail-over).

Regards,

Radek

amiller_1

Complete agreement with Radek.

The only case where I'd expect performance to be different is if you're using lower-end hardware and/or disk config at the destination site (as many customers do, given it's a good way to save money -- a 2040 is often a "DR partner" for a 3140, for instance, especially given that all the external disk on a 2040 could be moved to another model later if an upgrade were required).

GregKorten

My reason for posing this question is in line with your comments...

I was replicating 4 of 10 virtual machines from a FAS2050 (underlying aggregate of 17 x 300 GB 15K SAS drives) to a destination FAS2050 (underlying aggregate of 10 x 500 GB 7.2K SATA drives). We were only replicating a portion of the workload, and performance was acceptable during initial failover tests. The 4 replicated virtual machines are all configured with VMware RDM LUNs. Each VM has all of its Windows LUNs in a dedicated NetApp flexible volume. Replication used the default 1-minute SnapMirror interval. Everything seemed perfectly fine: the majority of SnapMirror snapshots were transferring in under 20 seconds.

Then, after a good 3 months of what seemed to be trouble-free replication, we did a full test and brought up all 4 of the DR virtual machines minutes after issuing a snapmirror break command. Performance was horrible; the 4th VM took 45-plus minutes to boot. I came to learn later, after hours of working with support, that one of my larger flexible volumes was "deswizzling". What was disturbing was that after keeping the SnapMirror relationship broken and letting the deswizzler run for 24+ hours, performance on the destination was acceptable again, which led me to believe the underlying physical SATA storage was adequate for the workload with normal fast-path reads.

My own forensic work found that the deswizzling scan was always working on the oldest snapshot in the replicated snapshot schedule. The scan would never finish on a larger 1.99 TB flexible volume, and as snapshots aged and were deleted, the destination became more "swizzled" and subject to slow-path reads. I believe this was the root cause of the degraded performance experienced that day.

I should add that NetApp, during the sales process, boasted about replicating from prime FC/SAS to SATA. Just hope you don't really need to use the destination any time soon after failing over.

We are now working on a plan to increase the replication interval and match disk types so the aggregates are identical, and I am trying to get confirmation that the dirty little deswizzler will not rear its ugly head again.

radek_kubka

Hmm, the whole story looks a bit worrying from my perspective, as FC to SATA replication is something I used to recommend on a regular basis.

I've found this on NOW:
https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb16746

The message at the bottom of the page says: “Only online volumes will be deswizzled.” Does it mean the deswizzler will kick in only after the SnapMirror relationship is broken off, and hence real-life fail-over time should include contingency for deswizzling to start and finish its job?

Can anyone more experienced in this matter add their 2 cents?

Regards,
Radek

GregKorten

No... SnapMirror targets are online, they are just read-only. I see plenty of deswizzling on the destination regularly without issuing a break.

NetApp has told me many things affect the length of time it takes for a deswizzler scan to complete:

  • Size of the volume
  • Number of snapshots
  • Differences between deltas
  • Biggest file size on the volume (I am using LUNs, so my file sizes are big)
  • Value of maxfiles

Things to do to lessen the amount of work that deswizzling needs to do:

  • Delete unnecessary snapshots
  • Disable scheduled snapshots until deswizzling finishes (in my case this takes over 24 hours)
  • Less frequent SnapMirror updates (I changed to 1 hour and we are still behind)

Check the destination volumes using the following commands, and match the snapshot being "deswizzled" with the snapshots in your destination snapshot list.

filer*> priv set advanced

filer*> wafl scan status

Volume <YOUR Destination FLEXVOL>:
Scan id                   Type of scan     progress
1438042    container block reclamation     block 6556 of 22845
1438043             volume deswizzling     snap 202,inode 100 of 33554409. level 1 of normal files. Totals: Normal files: L1:712/115608 L2:0/139570 L3:0/95607 L4:0/0     Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0
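
As a rough aid for reading that deswizzling line, here is a minimal Python sketch that pulls out the snapshot id and the per-level counters. It assumes each `Lx:a/b` pair means "a of b processed at that indirect level", which nothing in this thread confirms, and the function name `deswizzle_progress` is made up for illustration.

```python
import re

# Sample line copied from the `wafl scan status` output above.
SCAN_LINE = (
    "1438043             volume deswizzling     snap 202,inode 100 of 33554409. "
    "level 1 of normal files. Totals: Normal files: L1:712/115608 L2:0/139570 "
    "L3:0/95607 L4:0/0     Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0"
)


def deswizzle_progress(line):
    """Return (snap_id, {level: (done, total)}) for the 'Normal files' totals.

    Assumption: 'Lx:a/b' is read as a processed out of b; this is a guess
    at the output format, not documented behaviour.
    """
    snap = re.search(r"snap (\d+)", line)
    normal = re.search(r"Normal files:(.*?)Inode file:", line)
    if not snap or not normal:
        return None
    levels = {
        name: (int(done), int(total))
        for name, done, total in re.findall(r"(L\d):(\d+)/(\d+)", normal.group(1))
    }
    return int(snap.group(1)), levels


snap_id, levels = deswizzle_progress(SCAN_LINE)
for name, (done, total) in levels.items():
    # Guard against empty levels such as L4:0/0.
    pct = 100.0 * done / total if total else 0.0
    print(f"snap {snap_id} {name}: {done}/{total} ({pct:.1f}%)")
```

Running this against periodic captures of the scan status would show whether the counters advance between SnapMirror updates or keep starting over.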

Match this snapshot up with the oldest snapshot on the destination filer.

filer*>  snap status <YOUR  Destination FLEXVOL>

90     complete   Nov 26 23:30    100135     7.3 19744 snap_s1.12
26     complete   Nov 26 22:30    115842     7.3 19744 snap_s1.13
234     complete   Nov 26 21:30    106510     7.3 19744 snap_s1.14
201     complete   Nov 26 20:31    383143     7.3 19744 snap_s1.15
190     complete   Nov 26 20:01     99687     7.3 19744 snap_s1.16
138     complete   Nov 26 19:01    145444     7.3 19744 snap_s1.17
107     complete   Nov 26 18:30    184205     7.3 19744 snap_s1.18
49     complete   Nov 26 17:30    263308     7.3 19744 snap_s1.19
254     complete   Nov 26 16:30    275845     7.3 19744 snap_s1.20
202     complete   Nov 26 15:30    271557     7.3 19744 snap_s1.21

If deswizzling is always running on the oldest snapshot in your schedule, I would bet you have a flexible volume that is "swizzled" and subject to slow-path reads.
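
The matching step can be sketched in Python along these lines. This is only an illustration under assumptions: the first column of the `snap status` listing is taken to be the snapshot id, the date has no year so a placeholder year is supplied purely to order the timestamps, and `oldest_snapshot` is a hypothetical helper name.

```python
from datetime import datetime

# Snapshot listing copied from the `snap status` output above.
SNAP_STATUS = """\
90     complete   Nov 26 23:30    100135     7.3 19744 snap_s1.12
26     complete   Nov 26 22:30    115842     7.3 19744 snap_s1.13
234     complete   Nov 26 21:30    106510     7.3 19744 snap_s1.14
201     complete   Nov 26 20:31    383143     7.3 19744 snap_s1.15
190     complete   Nov 26 20:01     99687     7.3 19744 snap_s1.16
138     complete   Nov 26 19:01    145444     7.3 19744 snap_s1.17
107     complete   Nov 26 18:30    184205     7.3 19744 snap_s1.18
49     complete   Nov 26 17:30    263308     7.3 19744 snap_s1.19
254     complete   Nov 26 16:30    275845     7.3 19744 snap_s1.20
202     complete   Nov 26 15:30    271557     7.3 19744 snap_s1.21
"""

# Snapshot id reported on the "volume deswizzling" line of `wafl scan status`.
DESWIZZLING_SNAP_ID = 202


def oldest_snapshot(listing, year=2000):
    """Return (snap_id, name) of the entry with the earliest timestamp.

    `year` is a placeholder (the listing carries no year) used only so the
    timestamps can be compared with each other.
    """
    entries = []
    for row in listing.splitlines():
        fields = row.split()
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip headers or blank lines
        stamp = datetime.strptime(
            f"{year} {fields[2]} {fields[3]} {fields[4]}", "%Y %b %d %H:%M"
        )
        entries.append((stamp, int(fields[0]), fields[-1]))
    if not entries:
        return None
    _, snap_id, name = min(entries)
    return snap_id, name


oldest_id, oldest_name = oldest_snapshot(SNAP_STATUS)
verdict = "match" if oldest_id == DESWIZZLING_SNAP_ID else "no match"
print(f"deswizzler on snap {DESWIZZLING_SNAP_ID}; oldest is {oldest_name} (id {oldest_id}): {verdict}")
```

If repeated checks keep reporting a match on the oldest snapshot, that is consistent with the scanner never working its way up the schedule.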

Take a look at this bug report...

Bug Severity 3 - Serious inconvenience 

184892

The deswizzle scanner is run on flexible volume snapmirror destination volumes. Its time-to-completion is highly dependent upon volume configuration, the dataset contained, and concurrent system load. It may be found that the snapmirror schedule is too fast for the scanner to keep up, resulting in a failure to reconstruct metadata necessary to serve data quickly and efficiently.

The only effect of failing to complete the scanner is reduced performance.

alapati

All,

First of all, 1-minute updates, while technically possible, are not recommended at all. If you need a low RPO of under a minute or under 3 minutes, I would recommend SnapMirror Semi-Sync.

On to deswizzling..

Deswizzling is like any other WAFL background scanner. In ideal scenarios, it does its job and then goes away. It re-establishes the metadata between PVBN and VVBN (see the SnapMirror Advanced Topics training on the field portal for more info). It may affect read performance on the VSM destination. It does not impact the source system in any way. The scanner is kicked off after a SnapMirror update, and it works from the oldest snapshot. If you have a very frequent schedule, and if you have large volumes and lots of files (many indirects), it may seem like deswizzling is never complete.
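
To make the "never complete" effect concrete, here is a toy Python model, not a description of how WAFL actually schedules the scanner. It assumes the scan starts over at the oldest snapshot after every update and that each retained snapshot costs a fixed number of seconds to deswizzle; both assumptions, the function names, and the numbers in the example are hypothetical.

```python
def snapshots_deswizzled_per_cycle(update_interval_s, seconds_per_snapshot, retained_snapshots):
    """Snapshots the scanner gets through before the next update arrives.

    Toy assumptions: the scan restarts at the oldest snapshot after each
    SnapMirror update, and every snapshot costs the same fixed time.
    """
    if seconds_per_snapshot <= 0:
        return retained_snapshots
    return min(retained_snapshots, int(update_interval_s // seconds_per_snapshot))


def scanner_keeps_up(update_interval_s, seconds_per_snapshot, retained_snapshots):
    """True if every retained snapshot is covered within one update interval."""
    covered = snapshots_deswizzled_per_cycle(
        update_interval_s, seconds_per_snapshot, retained_snapshots
    )
    return covered >= retained_snapshots


# Hypothetical numbers: a small volume that deswizzles in a couple of seconds,
# and a large one that needs hours per snapshot.
small = scanner_keeps_up(update_interval_s=60, seconds_per_snapshot=2, retained_snapshots=10)
large = scanner_keeps_up(update_interval_s=3600, seconds_per_snapshot=9000, retained_snapshots=10)
print(f"small volume keeps up: {small}")
print(f"large volume keeps up: {large}")
```

Under these assumptions, a volume whose per-snapshot cost exceeds the update interval never gets past the oldest snapshot, however long replication runs.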

I always advise the following on deswizzling:

1. If the concern is primarily the fact that it is always running but not causing any performance issues, then leave it alone. It's like any other background WAFL scanner.

2. If the deswizzle scanner is determined to be the culprit causing read performance issues on the VSM destination, first make sure you are on 7.2.4 or later. If you are still having issues caused by deswizzling, then please contact me (Srinath Alapati), Jean Banko (SnapMirror product manager), and Quinn Summers (WAFL product manager).

Srinath

radek_kubka

Hi Srinath,

Thanks a million for your response - this makes the whole story much clearer!

"1-minute updates while technically possible are not recommended at all. If you need low RPO under a minute or under 3 minutes, I would recommend SnapMirror Semi-Sync."

Full agreement on this one. Somehow I missed that the environment in question does such frequent SnapMirror updates, and this obviously may impact the overall end-to-end efficiency.

I personally think that whenever an RPO below 15 minutes is required, this is a good justification for a MetroCluster setup. And actually not that many customers will weep if they lose 15 minutes of their data in an actual disaster scenario; typically they will have more important things to be worried about, like access to their premises, contacting staff, sorting out remote connectivity, etc.

Regards,
Radek

GregKorten

Srinath,

Thank you for the reply.

We are having no performance issues with the source, only the destination. During pre-sales we were told that the 1-minute SnapMirror schedule was achievable with adequate bandwidth for the change rate. In my case I am able to replicate and transfer deltas to the destination without issue on all 4 of my replicated source SMVs in less than 20 seconds. 3 of the 4 volumes deswizzle very quickly (in a second or two). We have discontentedly come to terms with the increased RPO.

I am currently replicating the largest SMV (1.99 TB) every hour, and the deswizzler is still always working on the oldest snapshot in the schedule and unable to work its way up the schedule before the next replicated snapshot is applied. I am not concerned that the scan is running; my concern is that if the scan is never able to complete, my destination SMV will become more and more “swizzled” as the data changes over time. There is evidence to support this: if I use the destination SMV shortly after issuing a snapmirror break, the performance is horrible, yet if I issue the break and wait 24+ hours for the deswizzler scan to complete, the performance is noticeably better.

What is the "official" shortest supported SnapMirror interval? If one minute is not recommended, it should not be the default, and NetApp should be more transparent about expected SnapMirror RPOs.

Your help would be greatly appreciated... I need to KNOW that minutes after issuing a break my snapmirrored workload will perform similarly to what was experienced in production.

FAS2050s running ONTAP 7.3.1 P3

Greg

alapati

Greg,

We support 1-min updates. As volumes get larger, and data change rates get larger, keeping the SnapMirror schedule on target might not be possible. For these reasons, we do not recommend 1-min updates. I cover some of these aspects in TR3446 (SnapMirror Best Practices Guide). I will make sure to make it clearer. In general:

0 RPO: SnapMirror Sync (or MetroCluster)

< 3-5 minutes: SnapMirror Semi-Sync or SnapMirror Async in some scenarios

> 3-5 minutes: SnapMirror Async
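
Those tiers can be written down as a small lookup, sketched here in Python. The "3-5 minutes" boundary above is a range rather than a hard number, so the cut-off is left as a parameter; the default of 5 minutes and the function name `suggested_replication_mode` are assumptions for illustration only.

```python
def suggested_replication_mode(rpo_minutes, async_threshold_minutes=5.0):
    """Map a target RPO (in minutes) to the replication tiers listed above.

    `async_threshold_minutes` stands in for the fuzzy "3-5 minutes" boundary;
    5.0 is an assumed default, not an official figure.
    """
    if rpo_minutes < 0:
        raise ValueError("RPO cannot be negative")
    if rpo_minutes == 0:
        return "SnapMirror Sync (or MetroCluster)"
    if rpo_minutes < async_threshold_minutes:
        return "SnapMirror Semi-Sync (or SnapMirror Async in some scenarios)"
    return "SnapMirror Async"


for rpo in (0, 1, 60):
    print(f"RPO {rpo} min -> {suggested_replication_mode(rpo)}")
```

Passing `async_threshold_minutes=3.0` gives the stricter reading of the same guidance.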

Hope that answers your question.

About deswizzling on the large volumes: are there any bottlenecks on the destination that are potentially limiting the resources deswizzling is getting? BTW, deswizzling only affects flexible volumes and volume SnapMirror. It does not affect qtree SnapMirror.
