ONTAP Discussions

Deswizzling & SnapMirror

ashleycook

Hi all,

Firstly, sorry that this is a bit long-winded, but I wanted to demonstrate the process I’ve been following. I’ve got an ongoing query with NetApp support at the moment but am making little progress; what I’m really looking to do is confirm whether there is an issue here or not.

To give you a bit of background, the ‘issue’ as I perceive it is a slow SnapMirror transfer. Initial investigations focused on the vif/ifgrp and switches; the configuration consists of two interfaces running in multimode. Having engaged our network team, I’m fairly content that the vif configuration and the supporting configuration on the switches are correct.
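
For anyone wanting to rule out the same thing, the obvious filer-side checks are along these lines; the interface group name ‘ifgrp1’ is just a placeholder for ours (on releases prior to 8.0 the equivalent command is ‘vif status’):

DR-FILERA> ifgrp status ifgrp1
DR-FILERA> ifstat -a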

Prior to engaging NetApp support, I ran through some tests in conjunction with third-party support. These tests were aimed at identifying whether there was a bottleneck at the network, CPU or disk level. To test this, data was copied between filers and locally within a filer using ‘vol copy’. Essentially, I was looking for variance in the copy rate, both inter- and intra-filer, which might indicate slowness from disk.
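
For reference, the copies were done with ‘vol copy’; a minimal sketch of the sort of command involved, with placeholder volume names (the destination volume has to exist and be restricted first, and copies between filers also need rsh access configured between them):

Vault-FilerA> vol restrict copytest
Vault-FilerA> vol copy start PROD-FILERA:sourcevol copytest
Vault-FilerA> vol copy status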

The findings of this exercise were as follows:

  • Vault-FilerA to Vault-FilerB = 80 MB/s
  • Prod-FilerA to Vault-FilerA = 8 MB/s
  • DR-FilerA to Vault-FilerA = 2 MB/s
  • Prod-FilerA to Prod-FilerA = 8 MB/s
  • DR-FilerA to DR-FilerA = 2 MB/s

I’ve rounded those figures off a little and there was some variance between filers, but that is the gist of it.

Now, the thinking is that the ‘slowness’ is down to the DR filers. The question is: what is causing it? The initial suspicion is the deswizzling workload, as we run quite an aggressive SnapMirror schedule (asynchronous, on a 3-minute schedule).
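
For context, the schedule itself is just a cron-style entry in /etc/snapmirror.conf on the destination. The entry below is illustrative rather than a copy of our actual file; the minute list spells out every third minute, as I don’t believe the file accepts step syntax:

PROD-FILERA:netappvol_myvolume DR-FILERA:sm_myvolume - 0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48,51,54,57 * * *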

I’m in a bit of a quandary at this point as information appears thin on the ground, hence the ticket with NetApp support.

My queries at the moment are essentially:

  • Has the deswizzling ever completed on our large volumes?
  • Are checkpoints functioning during the wafl scan? When are they taken? What is the trigger?
  • How will the system perform if we invoke DR? I fear it may have a great deal of work to complete the deswizzling and until it does so we won’t achieve comparable performance to the production environment.
  • How do we overcome this without a reduction in the RPO? Would semi-synchronous SnapMirror win us anything?

Focusing on one of the SnapMirror destination volumes for the moment, it looks something like this:

SnapMirror is currently idle

DR-FILERA*> snapmirror status sm_myvolume

Snapmirror is on.

Source                               Destination                  State          Lag        Status

PROD-FILERA:netappvol_myvolume  DR-FILERA:sm_myvolume  Snapmirrored 00:01:13   Idle

Current snapshots (note that the oldest snapshot at this point is approaching 24 hours)

DR-FILERA*> snap status sm_myvolume

Volume sm_myvolume (cleaning summary map)

snapid  status     date           ownblks release fsRev name

------  ------     ------------   ------- ------- ----- --------

   116  creating   May 02 15:54        0   8.0  21057  DR-FILERA(1574215916)_sm_myvolume.69540 (no map)
   115  creating   May 02 15:51     1096   8.0  21057  DR-FILERA(1574215916)_sm_myvolume.69539 (no map)
    37  creating   May 02 12:01   118643   8.0  21057  sv_hourly.0 (no map)
   207  complete   May 02 08:00   118914   8.0  21057  sv_hourly.1
   215  complete   May 01 20:01   140135   8.0  21057  sv_hourly.2
   131  complete   May 01 16:01    73680   8.0  21057  sv_hourly.3

A WAFL scan is running container block reclamation and volume deswizzling. The deswizzling is working on the oldest snapshot (snap 131, i.e. sv_hourly.3).

DR-FILERA*> wafl scan status sm_myvolume

Volume sm_myvolume:

Scan id                   Type of scan     progress

5081724    container block reclamation     block 106 of 4909

5081725             volume deswizzling     snap 131, inode 97 of 32781. level 1 of normal files. Totals: Normal files: L1:0/14245 L2:0/38304 L3:0/38521 L4:0/38521   Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0
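
For anyone trying to read that status line, my understanding (based on the ‘how to read status’ thread under Further Reading below) is that ‘snap 131, inode 97 of 32781’ means the scanner is working through snapshot id 131 and is on inode 97 of 32,781, while totals such as ‘L1:0/14245’ show how many indirect blocks at each level have been deswizzled out of the total at that level. In other words, at this point essentially none of this snapshot has been completed.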

SnapMirror update is running.

DR-FILERA*> snapmirror status sm_myvolume

Snapmirror is on.

Source                               Destination                  State          Lag        Status

PROD-FILERA:netappvol_myvolume  DR-FILERA:sm_myvolume  Snapmirrored 00:03:28   Transferring

SnapMirror update has completed

DR-FILERA*> snapmirror status sm_myvolume

Snapmirror is on.

Source                               Destination                  State          Lag        Status

PROD-FILERA:netappvol_myvolume  DR-FILERA:sm_myvolume  Snapmirrored 00:00:49   Idle

Has container block reclamation restarted?

DR-FILERA*> wafl scan status sm_myvolume

Volume sm_myvolume:

Scan id                   Type of scan     progress

5081772    container block reclamation     block 17 of 4909

5081773             volume deswizzling     snap 131, inode 97 of 32781. level 1 of normal files. Totals: Normal files: L1:0/2966 L2:0/17973 L3:0/18086 L4:0/18086    Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0

Again, sorry for so much information, but I’m at a loss and would like to understand what is going on here. To me it looks as though deswizzling isn’t making any progress, as it is still crunching a snapshot which by this point is nearly 24 hours old. Information on this is so scarce that I’m finding it hard to confirm or debunk the theory.

I’d welcome any thoughts you may have.

Thanks,

Ashley

Further Reading

What is deswizzler or deswizzling?

https://kb.netapp.com/support/index?page=content&id=3011866

Deswizzle – how to read status

https://communities.netapp.com/message/55878

Snapmirror Destination Performance Expectations?

https://communities.netapp.com/message/19931#19931

3 REPLIES

ashleycook

Thought I would update this as I’ve received some feedback from NetApp support. I won’t quote verbatim; hopefully a summary will suffice. My summary of their feedback is presented in quotes, with my comments following.

Has the deswizzling ever completed on our large volumes?

“After a snapmirror update the deswizzler will start from the oldest snapshot. However, it will skip all the snapshots already deswizzled. Checkpointing is on snapshot granularity. Partially-deswizzled snapshots will be deswizzled somewhat faster because there is less work to be done.”

“With an aggressive snapmirror schedule and a large number of snapshots in a volume, there is very low likelihood of deswizzler ever completing.”

On the volumes which actually contain real data, I've only ever seen it processing the oldest snapshot. This leads me to the conclusion that it is still deswizzling that snapshot. At any given time, the oldest snapshot is going to be approaching at least 20 hours old.

We don't have what I consider to be a particularly large number of snapshots per volume; there are four snapshots taken throughout the day, at 08:00, 12:00, 16:00 and 20:00.

Are checkpoints functioning during the wafl scan? When are they taken? What is the trigger?

“Reaffirmed that checkpoints are used during the WAFL scan, though the information on how/when they are taken is not public facing.”

How will the system perform if we invoke DR? I fear it may have a great deal of work to complete the deswizzling and until it does so we won’t achieve comparable performance to the production environment.

“The deswizzling process will continue running until it completes. The process runs as a low priority, as such it will give up CPU cycles to higher priority processes. Whilst this process is running it may appear that the CPU is running at 100%.”

What concerns me about this is that the system won’t have fast-path access to the data until deswizzling does complete.
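
On the DR-invocation point, one knob worth being aware of is the scanner speed; it can at least be inspected in advanced privilege, though I’m noting this as something I haven’t changed and would only touch under NetApp guidance:

DR-FILERA> priv set advanced
DR-FILERA*> wafl scan speed
DR-FILERA*> priv set admin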

How do we overcome this without a reduction in the RPO? Would semi-synchronous SnapMirror win us anything?

Still an outstanding query.

My current line of thinking, based on the feedback I’ve received so far, is that the deswizzling process is not completing on our oldest snapshots (20 to 24 hours old) before they are deleted. This raises the question: why is this process running if it is never going to complete? Once the oldest snapshot is deleted, processing must start from scratch on the next in line.

If we aren’t really winning anything by deswizzling the SnapMirror destination, then we might stand to gain more by stopping this process, potentially increasing the throughput of our SnapMirror transfers. I appreciate that the process will need to run eventually, but if we only started it manually when we broke a SnapMirror relationship, would we be any worse off? Is it even possible to stop it?

I’d also like to understand whether we would gain anything by reducing the frequency of the SnapMirror schedule, or even by moving to semi-synchronous. TR-3446 does imply that 3 minutes is an acceptable schedule for asynchronous SnapMirror, though it is on the cusp.
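
For what it’s worth, my understanding is that moving a relationship to semi-synchronous is done by replacing the cron-style schedule fields in /etc/snapmirror.conf with the semi-sync keyword, something like the illustrative entry below; whether it would actually help with the post-processing load is exactly what I’m unsure about:

PROD-FILERA:netappvol_myvolume DR-FILERA:sm_myvolume - semi-sync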

Would we gain anything by having a break in our replication schedule? For example, if we maintained 3 minute scheduling during the working day, but stopped snapmirror updates during the evening?
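
As an illustration of that idea, the hour field of the snapmirror.conf schedule could be limited to working hours so that updates simply stop in the evening. This is purely a sketch using the same placeholder relationship as above, keeping 3-minute updates between 08:00 and 18:59 and nothing overnight:

PROD-FILERA:netappvol_myvolume DR-FILERA:sm_myvolume - 0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48,51,54,57 8,9,10,11,12,13,14,15,16,17,18 * *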

ashleycook

Just an update for completeness in case anyone else runs into this situation.

The SnapMirror schedule on our volumes was modified so that, for two weeks, the overnight frequency was reduced to hourly. This appeared to alleviate the pressure somewhat, and the result was that most of the SnapMirror post-processing backlog which had developed cleared relatively quickly.

In our case the issue seems to be that the deswizzling process simply cannot work quickly enough between updates to make any real progress. For example:

   %/used       %/total  date          name
----------  ----------  ------------  --------  
0% ( 0%)    0% ( 0%)  Jun 29 14:00  FILERA(1234567890)_myvolume.12345 (snapmirror)  
0% ( 0%)    0% ( 0%)  Jun 29 12:00  hourly.0  
1% ( 0%)    0% ( 0%)  Jun 29 08:00  hourly.1  
2% ( 1%)    1% ( 1%)  Jun 28 20:00  hourly.2  
2% ( 0%)    1% ( 0%)  Jun 28 16:00  hourly.3

Deswizzling is running on hourly.3; every 3 minutes it has to pause to allow a SnapMirror update. After the update, deswizzling resumes from the last checkpoint, and in my case I think that by the time it gets back to the checkpoint and continues processing it doesn't make enough headway to achieve any real progress. The whole cycle then begins again when the oldest snapshot is deleted, because insufficient progress was made beforehand.

What also helped was retaining more snapshots: keeping two nightly snapshots gave the scanner more time to work on each one before it was deleted, by virtue of the extended retention.
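
For completeness, the sv_hourly names suggest our copies come from the protection (SnapVault) schedules rather than plain volume snapshots, but for a source volume whose snapshots are driven by the standard schedule, retaining two nightlies alongside four daytime snapshots would look roughly like this (volume name as per the earlier example):

PROD-FILERA> snap sched netappvol_myvolume 0 2 4@8,12,16,20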

Generally, once the backlog was reduced there was a drop in CPU load and disk utilisation on the SnapMirror destination.

I'm reassured that the backlog did clear quite quickly when the schedule was reduced, so I'm not too concerned at the moment: if it becomes necessary to run production from the SnapMirror destination, the backlog should clear itself. While there may be some performance degradation in that situation, it is expected to clear relatively quickly.

It seems to be very much a balance of performance vs. RPO. With a reduced SnapMirror schedule there is a noticeable performance improvement on the storage controller: read latency improves substantially (from over 40 ms to around 15 ms) and the effect appears to be reflected in the end-user experience.

craigbeckman

Hi Ashley,

Thanks for sharing your experience.

A 3-minute SnapMirror schedule is VERY aggressive indeed!!!

I ran into the same deswizzling issue when we upgraded our FAS6210s to DOT 8.1.

Our SnapMirror schedule is every 2 hours, so deswizzling completed about an hour after each update.

I was noticing much higher CPU (100%) and a slower console than when I was on 8.0.2.

Since I upgraded to DOT 8.1.1, deswizzling has been slightly quicker to finish and appears to consume a little less CPU.

Only been a few days so maybe too early to tell.

Craig
