ONTAP Discussions
If I have the following... A Source production Site A and Disaster Recovery destination Site B
· sufficient bandwidth between Site A and Site B to replicate changed data
· supported documented replication interval for Snapmirror
· identical Netapp Array configurations in each Site – FAS Model – Disk Layout – Aggregates, Etc.
· Like for Like Host hardware configurations and access protocols in both Sites
· only read-only flexvolumes mirrors of Site A in the destination Site B (Cold site)
Can and should the Site B Array be expected to perform the same as what was experienced in Site A for the mirrored Flexvolumes, when brought online and subjected to the same production workload minutes after issuing a “snapmirror break”?
Can and should the Site B Array be expected to perform the same as what was experienced in Site A
Hi Greg,
I would say that this statement is *almost* true.
Oddly enough, Site B should perform slightly better, because the filer won't be using CPU cycles for SnapMirror updates (unless you reverse the SnapMirror relationship after the fail-over).
Regards,
Radek
Complete agreement with Radek.
The only case where I'd expect performance to be different is if you're using lower hardware and/or disk config at the destination site (as many customers do given it's a good way to save money -- a 2040 is often a "DR partner" for a 3140 for instance...especially given that all the external disk on a 2040 could be placed on another model later if an upgrade were required).
My reason for posing this question is in line with your comments...
I was replicating 4 of 10 Virtual Machines from a FAS2050 (underlying aggregate of 17 x 300 GB SAS 15K drives) to a destination FAS2050 (underlying aggregate of 10 x 500 GB SATA 7.2K drives). We were only replicating a portion of the workload, and performance was acceptable during initial failover tests. The 4 replicated Virtual Machines are all configured with VMware RDM LUNs. Each VM has all of its Windows LUNs in a dedicated NetApp flexvol. Replication was using the default 1-minute snapmirror interval... Everything seemed perfectly fine... the majority of snapmirror snaps were transferring in under 20 seconds. Then, after a good 3 months of what seemed to be trouble-free replication, we did a full test and brought up all 4 of the DR virtual machines minutes after issuing a snapmirror break command. Performance was horrible; the 4th VM took 45-plus minutes to boot. I came to learn later, after hours working with support, that one of my larger flex volumes was "deswizzling". What was disturbing was that after keeping the snapmirror broken and letting the deswizzler run for 24+ hours, the performance was again acceptable on the destination, which led me to believe the underlying physical SATA storage with normal fast-path reads was adequate for the workload. My own forensic work found that the "deswizzling" scan was always working on the oldest snap in the replicated snap schedule. The deswizzler scan would never finish on a larger 1.99 TB flexvol, and as snaps were aged and deleted the destination became more "swizzled" and subject to slow-path reads. I believe this was the root cause of the degraded performance experienced that day.
I should add that NetApp during the sales process boasted about replicating from prime FC/SAS to SATA; just hope you don't really need to use the destination any time soon after failing over.
We are now working on a plan to increase the replication interval and match disk type so the aggregates are identical, and I am trying to get confirmation that the dirty little deswizzler will not rear its ugly head again.
Hmm, the whole story looks a bit worrying from my perspective, as FC to SATA replication is something I used to recommend on a regular basis.
I've found this on NOW:
https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb16746
The message at the bottom of the page says: “Only online volumes will be deswizzled.” Does it mean the deswizzler will kick in only after the SnapMirror relationship is broken off, and hence real-life fail-over time should include contingency for deswizzling to start and finish its job?
Can anyone more experienced in this matter add their 2 cents?
Regards,
Radek
No... Snapmirror'd targets are online, they are just read-only. I see plenty of deswizzling on the destination regularly without issuing a break.
NetApp has told me many things impact the length of time it takes for a deswizzler scan to complete.
Things to do to lessen the amount of work that deswizzling needs to do:
Check the destination volumes using the following commands, and match the snap being "deswizzled" against the snaps in your destination snap list.
filer*> priv set advanced
filer*> wafl scan status
Volume <YOUR Destination FLEXVOL>:
Scan id Type of scan progress
1438042 container block reclamation block 6556 of 22845
1438043 volume deswizzling snap 202,inode 100 of 33554409. level 1 of normal files. Totals: Normal files: L1:712/115608 L2:0/139570 L3:0/95607 L4:0/0 Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0
Match this snap ID (202 in the example above) with the oldest snap on the destination filer.
filer*> snap status <YOUR Destination FLEXVOL>
90 complete Nov 26 23:30 100135 7.3 19744 snap_s1.12
26 complete Nov 26 22:30 115842 7.3 19744 snap_s1.13
234 complete Nov 26 21:30 106510 7.3 19744 snap_s1.14
201 complete Nov 26 20:31 383143 7.3 19744 snap_s1.15
190 complete Nov 26 20:01 99687 7.3 19744 snap_s1.16
138 complete Nov 26 19:01 145444 7.3 19744 snap_s1.17
107 complete Nov 26 18:30 184205 7.3 19744 snap_s1.18
49 complete Nov 26 17:30 263308 7.3 19744 snap_s1.19
254 complete Nov 26 16:30 275845 7.3 19744 snap_s1.20
202 complete Nov 26 15:30 271557 7.3 19744 snap_s1.21
If deswizzling is always running on the oldest snap in your schedule, I would bet you have a flexvolume that is "swizzled" and subject to slow-path reads.
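The manual comparison above can be scripted. The following is only a rough sketch in Python: it assumes the two outputs look exactly like the text pasted in this post (the column layout may well differ between Data ONTAP releases), and the function names, the regular expression, and the placeholder year are all made up for illustration.

```python
import re
from datetime import datetime

# Sample text copied from the post above. The layout is an assumption;
# verify it against the output of your own filer before relying on this.
SCAN_STATUS = """\
1438042 container block reclamation block 6556 of 22845
1438043 volume deswizzling snap 202,inode 100 of 33554409. level 1 of normal files.
"""

SNAP_STATUS = """\
90 complete Nov 26 23:30 100135 7.3 19744 snap_s1.12
26 complete Nov 26 22:30 115842 7.3 19744 snap_s1.13
234 complete Nov 26 21:30 106510 7.3 19744 snap_s1.14
201 complete Nov 26 20:31 383143 7.3 19744 snap_s1.15
190 complete Nov 26 20:01 99687 7.3 19744 snap_s1.16
138 complete Nov 26 19:01 145444 7.3 19744 snap_s1.17
107 complete Nov 26 18:30 184205 7.3 19744 snap_s1.18
49 complete Nov 26 17:30 263308 7.3 19744 snap_s1.19
254 complete Nov 26 16:30 275845 7.3 19744 snap_s1.20
202 complete Nov 26 15:30 271557 7.3 19744 snap_s1.21
"""


def deswizzling_snap_id(scan_text):
    """Return the snap ID the deswizzle scan is working on, or None."""
    match = re.search(r"volume deswizzling\s+snap\s+(\d+)", scan_text)
    return int(match.group(1)) if match else None


def oldest_snap(snap_text, assumed_year=2000):
    """Return (snap_id, snap_name) of the oldest snap in the listing.

    The listing carries no year, so a placeholder year is used purely for
    ordering; a listing that spans New Year would be ordered incorrectly.
    """
    rows = []
    for line in snap_text.splitlines():
        fields = line.split()
        # Expected columns: id, state, month, day, time, three numeric
        # columns, and the snap name (9 fields in total).
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        stamp = datetime.strptime(
            f"{assumed_year} {fields[2]} {fields[3]} {fields[4]}",
            "%Y %b %d %H:%M",
        )
        rows.append((stamp, int(fields[0]), fields[8]))
    if not rows:
        return None
    _, snap_id, name = min(rows)
    return snap_id, name


if __name__ == "__main__":
    scanning = deswizzling_snap_id(SCAN_STATUS)
    oldest = oldest_snap(SNAP_STATUS)
    print("deswizzling snap:", scanning)
    print("oldest snap:", oldest)
    print("scan is on the oldest snap:", oldest is not None and scanning == oldest[0])
```

If repeated samples keep reporting that the scan is on the oldest snap, that matches the pattern described above, although it is only an indicator and not proof of slow-path reads.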
Take a look at this bug report...
Bug Severity 3 - Serious inconvenience
184892
The deswizzle scanner is run on flexible volume snapmirror destination volumes. Its time-to-completion is highly dependent upon volume configuration, the dataset contained, and concurrent system load. It may be found that the snapmirror schedule is too fast for the scanner to keep up, resulting in a failure to reconstruct metadata necessary to serve data quickly and efficiently. The only effect of failing to complete the scanner is reduced performance.
All,
First of all, 1-minute updates while technically possible are not recommended at all. If you need low RPO under a minute or under 3 minutes, I would recommend SnapMirror Semi-Sync.
On to deswizzling..
Deswizzling is like any other WAFL background scanner. In ideal scenarios, it does its job and then goes away. It re-establishes the metadata between PVBN and VVBN (see SnapMirror Advanced Topics training on the field portal for more info). It may affect read performance on the VSM destination. It does not impact the source system in any way. The scanner is kicked off after a SnapMirror update and it works from the oldest snapshot. If you have a very frequent schedule, AND if you have large volumes and lots of files (many indirects), it may seem like deswizzling is never complete.
I always advise the following on deswizzling:
1. If the concern is primarily the fact that it is always running but not causing any performance issues, then leave it alone. It's like any other background WAFL scanner.
2. If the deswizzle scanner is determined to be the culprit causing read performance issues on the VSM destination, first make sure you are on 7.2.4 or later. If you are still having issues caused by deswizzling, then please contact me (Srinath Alapati), Jean Banko (SnapMirror product manager), and Quinn Summers (WAFL product manager).
Srinath
Hi Srinath,
Thanks a million for your response - this makes the whole story much clearer!
1-minute updates while technically possible are not recommended at all. If you need low RPO under a minute or under 3 minutes, I would recommend SnapMirror Semi-Sync.
Full agreement on this one - somehow I missed that the environment in question runs such frequent SnapMirror updates, and this obviously may impact the overall end-to-end efficiency.
I personally think that whenever RPO below 15 minutes is required, then this is a good justification for MetroCluster setup. And actually not that many customers will weep if they lose 15 minutes of their data in an actual disaster scenario - typically they will have more important things to be worried about, like access to their premises, contacting staff, sorting out remote connectivity, etc.
Regards,
Radek
Srinath,
Thank you for the reply.
We are having no performance issues with the source, only the destination. During pre-sales we were told that the 1-minute snapmirror schedule was achievable with adequate bandwidth for the change rate. In my case I am able to replicate and transfer deltas without issue to the destination on all 4 of my replicated source SMVs in less than 20 seconds. 3 of the 4 volumes deswizzle very quickly (in a second or two). We have discontentedly come to terms with the increased RPO. I am currently replicating the largest SMV (1.99 TB) every hour, and the deswizzler is still always working on the oldest snap in the schedule and unable to work up the schedule before the next replicated snap is applied. I am not concerned that the scan is running; my concern is that if the scan is never able to complete, my destination SMV will become more and more “swizzled” as the data changes over time. There is evidence to support this. If I use the destination SMV shortly after issuing a snapmirror break the performance is horrible, yet if I issue the break and wait 24+ hours for the deswizzler scan to complete the performance is noticeably better.
What is the "official" shortest supported snapmirror interval? If one minute is not recommended it should not be the default, and NetApp should be more transparent about expected snapmirror RPOs.
Your help would be greatly appreciated... I need to KNOW that minutes after issuing a break my snapmirrored workload will perform similarly to what was experienced in production.
FAS2050s running ONTAP 7.3.1 P3
Greg
Greg,
We support 1-min updates. As volumes get larger, and data change rates get larger, keeping the SnapMirror schedule on target might not be possible. For these reasons, we do not recommend 1-min updates. I cover some of these aspects in TR3446 (SnapMirror Best Practices Guide). I will make sure to make it clearer. In general:
0 RPO: SnapMirror Sync (or MetroCluster)
< 3-5 minutes: SnapMirror Semi-Sync or SnapMirror Async in some scenarios
> 3-5 minutes: SnapMirror Async
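The rule of thumb above can be written down as a small lookup. This is only an illustrative sketch in Python: the function name and the `low`/`high` parameters are made up, and the way the fuzzy "3-5 minutes" band is handled (both options returned) is an assumption, since the guidance above gives no hard boundary.

```python
def suggest_snapmirror_mode(rpo_minutes, low=3.0, high=5.0):
    """Map a target RPO in minutes to the replication options listed above.

    low and high bracket the "3-5 minutes" band from the post; inside the
    band this sketch simply returns both candidates (an assumption).
    """
    if rpo_minutes < 0:
        raise ValueError("RPO cannot be negative")
    if rpo_minutes == 0:
        return ["SnapMirror Sync", "MetroCluster"]
    if rpo_minutes < low:
        return ["SnapMirror Semi-Sync", "SnapMirror Async (in some scenarios)"]
    if rpo_minutes > high:
        return ["SnapMirror Async"]
    # Between low and high the guidance overlaps, so offer both.
    return ["SnapMirror Semi-Sync", "SnapMirror Async"]


if __name__ == "__main__":
    for rpo in (0, 1, 4, 60):
        print(rpo, "min ->", suggest_snapmirror_mode(rpo))
```

For the hourly schedule mentioned earlier in the thread, `suggest_snapmirror_mode(60)` falls in the plain SnapMirror Async bucket.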
Hope that answers your question.
About deswizzling on the large volumes, are there any bottlenecks on the destination that are potentially limiting the resources deswizzling is getting? BTW, deswizzling only affects flexible volumes AND volume SnapMirror. It does not affect qtree SnapMirror.