Solved: performance issue with snapmirror and snapshots on target aggregate

Stefan-Reitmeier · ‎2015-07-29

Hi all,

Does anyone here have experience with SnapMirror and larger amounts of data?

At the moment we SnapMirror about 100 TB of data, distributed over about 10 volumes, to a SATA aggregate on a second filer with 85 4 TB SATA disks (5 RAID groups of 17 disks each). The source is a FAS 8040, the target a FAS 8020, both running cDOT 8.3P1. We have already moved all other workload off the target aggregate, so it hosts only SnapMirror targets.

On the source side we take 1 snapshot per day and keep 14 snapshots. SnapMirror runs once per day. From counting snapshots I would say the daily change rate is 2-2.5 TB across all volumes.

SnapMirror itself works fine and finishes in 2-3h or less, but container block reclamation and deswizzling are totally killing the aggregate on the target side. We see a continuous read load of 30MB, and disk utilization on all disks except the parity disks is 90-100%.

At first we planned a 4h snapshot schedule, but that is just not possible. At the moment we have disabled deswizzle, and we get to a point where, if we are lucky, the target aggregate load drops in the night just before the next SnapMirror kicks in.

We are quite new to NetApp, but it sounds ridiculous that you need so much IO for just a plain replication and some snapshots.

Do you have any experience with snapshots and SnapMirror on SATA disks? I think snapshots and SnapMirror on NetApp are very resource demanding. It is true that creating snapshots on NetApp is super efficient and instant, but as soon as a snapshot has to be deleted, container block reclamation kicks in and takes a large amount of disk resources. The same goes for SnapMirror: it is really cool and stable, but deswizzling (the logical to physical block mapping) on large data sets affects SnapMirror target performance heavily.

Best wishes,

Stefan
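As a rough sanity check on the figures in the post above (about 100 TB in total, 2-2.5 TB daily change, transfer finished in 2-3h), the following sketch works out the daily change fraction and the average transfer throughput. It assumes decimal units (1 TB = 10^12 bytes); the function names are made up for illustration and are not part of any NetApp tooling.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes


def change_fraction(changed_tb: float, total_tb: float) -> float:
    """Fraction of the data set that changes per replication cycle."""
    return changed_tb / total_tb


def avg_throughput_mb_s(changed_tb: float, hours: float) -> float:
    """Average throughput needed to move changed_tb within the given hours."""
    return changed_tb * TB / (hours * 3600) / MB


# Figures taken from the post; the bounds are the quoted ranges.
low = avg_throughput_mb_s(2.0, 3.0)    # smallest change, longest window
high = avg_throughput_mb_s(2.5, 2.0)   # largest change, shortest window

print(f"daily change: {change_fraction(2.0, 100):.1%} to {change_fraction(2.5, 100):.1%}")
print(f"average transfer rate: {low:.0f} to {high:.0f} MB/s")
```

With these assumptions the transfer itself averages a few hundred MB/s for a couple of hours, which fits the observation that the replication is not the bottleneck and the background work afterwards is.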

deepuj · ‎2015-07-30

Hi Stefan,

A couple of questions from my side:

1) Is the disk utilization always high, or only during the SnapMirror schedule?
2) Are there any volume reallocation schedules running?
3) Is aggregate free space reallocation "ON" on the DR site aggregate?
4) Are you taking any backups from the target site?

Thanks


Stefan-Reitmeier · ‎2015-07-31

Hi deepuj,

thanks for your answer.

Disk util is high after SnapMirror, when block reclamation and deswizzle kick in. Depending on the size of the deleted snapshots and of the SnapMirror transfer, it sometimes takes until the next SnapMirror, and then we have continuous high disk load for multiple days.

Reallocate is not running. We use aggregate reallocate no_redirect on source volumes.

Backups are only taken on source side.

Best wishes,

Stefan

Darkstar · ‎2015-07-31

This is not related to anything with reallocation. It's simply the deswizzler, which has to run through the full volume and update the PVBN references. That involves a lot of metadata reads, which can result in severe cache thrashing. PAM cards on the SnapMirror destination help A LOT with such a workload.

An alternative would be to disable the deswizzler, but then access to the secondary data will be potentially slow(er), because every read of a block has to go through the VVBN->PVBN mapping, which is one metadata file. That isn't a big deal usually, but in a DR or migration scenario, when the destination becomes active at some point, you probably don't want to have that extra level of indirection. You can re-start the deswizzler manually later, but then again it will take a loooong time to complete.
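To make the extra level of indirection concrete, here is a toy model in Python. It is based only on the description in the post above and is not how WAFL is actually implemented; `BlockPointer`, `ToyVolume` and the lookup counter are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BlockPointer:
    vvbn: int                   # virtual block number inside the volume
    pvbn: Optional[int] = None  # physical block number; None = not yet deswizzled


class ToyVolume:
    """Toy model: a container map (VVBN -> PVBN) plus a list of block pointers."""

    def __init__(self, container_map: Dict[int, int], pointers: List[BlockPointer]):
        self.container_map = container_map
        self.pointers = pointers
        self.map_lookups = 0  # counts the extra metadata lookups

    def resolve(self, ptr: BlockPointer) -> int:
        """Return the physical block for a pointer."""
        if ptr.pvbn is not None:
            return ptr.pvbn       # fast path: pointer already carries the PVBN
        self.map_lookups += 1     # slow path: go through the mapping file
        return self.container_map[ptr.vvbn]

    def deswizzle(self) -> int:
        """Walk every pointer once and fill in the missing PVBNs."""
        updated = 0
        for ptr in self.pointers:
            if ptr.pvbn is None:
                ptr.pvbn = self.resolve(ptr)
                updated += 1
        return updated


vol = ToyVolume({1: 101, 2: 202, 3: 303}, [BlockPointer(v) for v in (1, 2, 3)])
before = [vol.resolve(p) for p in vol.pointers]  # every read pays a map lookup
fixed = vol.deswizzle()                          # one full pass over the volume
lookups_after_scan = vol.map_lookups
after = [vol.resolve(p) for p in vol.pointers]   # reads are now direct
```

In this model the scan costs one lookup per pointer, which is why a full pass over a large volume is expensive, while skipping it shifts that cost onto every later read.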

Stefan-Reitmeier · ‎2015-08-07

Hi AdvUniMD,

we tested Flash Cache this week, and yes, it is a huge improvement. SnapMirror and snapshots are still performance killers with volumes of 30 TB+ due to deswizzle and container block reclamation, but Flash Cache helps.

Best wishes,

Stefan

RPHELANIN · ‎2015-08-10

What's the replication schedule? Deswizzling may not have completed between mirrors...

Stefan-Reitmeier · ‎2015-08-10

Hi RPHELANIN,

the schedule is 24h. Yes, from time to time deswizzle does not finish, but 24h is our maximum; we had planned with 4h. I think that is just impossible with NL-SAS disks unless you do not change data on the source :-). We cross-checked deswizzle and container block reclamation by disabling each of them in turn. Most of the load is produced by container block reclamation.

The positive impact of Flash Cache is lower on deswizzle than on container block reclamation.

I think most of NetApp's internal workload is sized for 10k and 15k drives. If we compare the IO per GB of a 10k 900 GB drive with that of a 7k 4 TB drive, we have a ratio of nearly 10:1. Mechanisms like reclaiming blocks or mapping virtual to physical blocks seem to produce too much load for NL-SAS drives. On the other hand, deduplication and compression work fine and produce acceptable disk load. Nevertheless we disabled them because they produce too many changed blocks for SnapMirror.

Best wishes,

Stefan
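The IO-per-GB comparison in the post above can be reproduced with a short calculation. The per-drive IOPS values below are common rules of thumb chosen for this sketch (about 140 IOPS for a 10k drive, about 75 for a 7.2k drive); they are assumptions, not measurements from this system or NetApp sizing data.

```python
def iops_per_gb(drive_iops: float, capacity_gb: float) -> float:
    """Random IO capability per GB of capacity for a single drive."""
    return drive_iops / capacity_gb


# Assumed rule-of-thumb values, not vendor figures.
ASSUMED_IOPS_10K = 140.0
ASSUMED_IOPS_7K2 = 75.0

sas_density = iops_per_gb(ASSUMED_IOPS_10K, 900)    # 10k, 900 GB
sata_density = iops_per_gb(ASSUMED_IOPS_7K2, 4000)  # 7.2k, 4 TB
ratio = sas_density / sata_density

print(f"10k 900 GB: {sas_density:.4f} IOPS/GB")
print(f"7.2k 4 TB:  {sata_density:.4f} IOPS/GB")
print(f"ratio:      {ratio:.1f}:1")
```

With these assumed figures the ratio comes out at roughly 8:1, the same order of magnitude as the nearly 10:1 quoted in the post; the exact value depends on which per-drive IOPS numbers you plug in.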