I am currently using a 35 Mb/s dedicated pipe to replicate my SnapMirror traffic. I have three NearStores (3070s) at the source and their partners at the destination.
When I have multiple SnapMirror transfers occurring, I can fill the pipe; each of the three filers uses about one-third of the available bandwidth.
Storage Guy is happy.
However, when jobs are running on only one filer, it still uses only that one-third (about 10 Mb/s) of the available bandwidth; roughly 20 Mb/s sits idle.
I have no bandwidth throttling turned on.
In short, why is SnapMirror not "throttling up" when there is available bandwidth to handle it? And if jobs start up on the other filers, how can I ensure that one filer does not hog the pipe?
It's not just bandwidth; latency can slow things down too. The TCP window size for SnapMirror is tunable and may help. Here's an article that describes how to set it properly for your environment.
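To see why the window size matters, here is a minimal sketch of the bandwidth-delay product calculation, using the 35 Mb/s link from the question and an assumed 50 ms round-trip time (the RTT is hypothetical; plug in your own):

```python
# Hypothetical numbers for illustration: the 35 Mb/s WAN pipe from the
# question, with an assumed 50 ms round-trip time. The bandwidth-delay
# product is roughly how many bytes must be "in flight" before a single
# TCP stream can keep the pipe full.

def bandwidth_delay_product(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to saturate the link."""
    return bandwidth_bps * rtt_s / 8

link_bps = 35_000_000      # 35 Mb/s pipe (from the question)
rtt = 0.050                # 50 ms round trip (assumption)

bdp = bandwidth_delay_product(link_bps, rtt)
print(f"Required window: {bdp / 1024:.0f} KiB")

# If the configured window is smaller than the BDP, a single stream
# tops out at window / RTT regardless of raw link speed:
window_bytes = 64 * 1024   # a classic 64 KiB default window
max_bps = window_bytes * 8 / rtt
print(f"64 KiB window caps one stream at {max_bps / 1e6:.1f} Mb/s")
```

Note how a 64 KiB window over a 50 ms path caps a single stream at roughly 10 Mb/s, which is suspiciously close to what you're seeing per filer; that's why the article on tuning the window is worth a read.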
The "problem" you are trying to solve is one that, in essence, doesn't exist. By that I mean there is a total speed limit for any single SnapMirror transfer. There are some things on the edges you can do to maximize available speed, but whether in 7-Mode or in cDOT, any given transfer is not going to max out any given pipe.
Consider first that there is a maximum speed that can be achieved. A SnapMirror transfer is, for all intents and purposes, a single-threaded long sequential read and write, and that has to happen at both ends. You could suspend all other access to the source and still not exceed the maximum sequential disk read speed. What is the great equalizer among storage arrays? Long sequential reads. Ultimately they all drop to the level of disk performance you can get.
Then, too, you have to push that across a wire. SnapMirror isn't a streaming protocol; it's a standard TCP exchange protocol. It isn't as bad as, say, writing to a tape drive, but it is still a handshake-based protocol with TCP windowing in play. I cite the tape drive example because it's real world: a difference of 0.1 ms (yes, 100 microseconds) can cut maximum LTO tape drive performance to a third of what it was, due to waiting for packet handshakes. TCP is a little more forgiving in that you can have multiple packets in flight before the acknowledgement comes back, but drop one and the retransmit penalty is huge. Streaming protocols by design don't care if they drop a little, or they have alternate retry mechanisms that don't depend on the TCP layer.
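The tape-drive sensitivity above can be sketched with a simple stop-and-wait model, where each block must be acknowledged before the next is sent (block size and RTT values are hypothetical, chosen only to show the shape of the effect):

```python
# Stop-and-wait model: throughput = block_size / round_trip_time,
# because nothing else moves until the acknowledgement comes back.
# Block size and RTTs are hypothetical illustration values.

def stop_and_wait_bps(block_bytes: int, rtt_s: float) -> float:
    """Throughput of a strict request/acknowledge exchange."""
    return block_bytes * 8 / rtt_s

block = 64 * 1024  # 64 KiB per exchange (assumption)

for rtt in (0.0001, 0.0002, 0.0004):  # 0.1, 0.2, 0.4 ms round trips
    gbps = stop_and_wait_bps(block, rtt) / 1e9
    print(f"RTT {rtt * 1000:.1f} ms -> {gbps:.2f} Gb/s")
```

Each extra 0.1 ms of round trip halves the throughput in this model; TCP's multiple-packets-in-flight behavior softens this, but the underlying sensitivity to latency is the same.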
Consider as well a typical server. Without specialized software and hardware, a Windows server, for instance, used to be doing really well for itself when it hit around 35% of a 1 Gb link. Granted, the hardware has improved, but steadily maxing out any link still takes a system well tuned around that particular workload. SnapMirror is similar. It isn't the highest-priority workload on your source or target nodes; direct service to user data gets higher priority. Compressing the stream slows it down even more.
One more real-world equivalent case. In any decent-sized or enterprise environment, what workload traditionally fills a network or SAN link to capacity? Backup. What do backup servers have in common with this discussion? A highly tuned, single-purpose application that multiple endpoints each send to in smaller amounts. Multiple threads combine to fill the pipes.
Sure, I suppose if NetApp let you make SnapMirror the top dog, you'd get better single-stream performance. But at what cost to the rest of the environment?
So what to do? Understand that there is a speed limit for any single SnapMirror transfer, due to prioritization, network switching, the routes between storage nodes at both ends, all that stuff. Tune everything in the middle so that storage is the only governor at both ends. Tune storage so that single workloads with high change rates are optimized to balance between sequential reads (SnapMirror) and the standard daily workload. Then find out how much a single thread can get, and do the math.

- Perhaps it's best to break one really big LUN into two or three on separate volumes, if possible, so you can mirror with multiple threads for more total bandwidth.
- Perhaps you need to run multiple CIFS/NFS shared volumes so it isn't one really big volume pushing down the daily change rate.
- Perhaps you'll need to run updates more frequently so you don't get behind a massive update but can parse it out in smaller chunks; missing one of those scheduled events may matter far less than missing a whole day's event.
- Perhaps you just have to tune other workloads off the source to push a little bit more.

And finally, if filling that pipe is the overriding concern, know that you'll have to plan to run multiple threads somehow, someway, if a single thread doesn't hit it.
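"Do the math" here is simple division: given the per-thread ceiling observed in the question (about 10 Mb/s), here is a sketch of how many concurrent transfers it takes to saturate the 35 Mb/s pipe:

```python
import math

# Figures taken from the question: a 35 Mb/s pipe and a measured
# single-transfer ceiling of about 10 Mb/s per filer.
link_mbps = 35.0
per_thread_mbps = 10.0

threads_needed = math.ceil(link_mbps / per_thread_mbps)
print(f"{threads_needed} concurrent transfers to saturate the link")

# With 3 filers each running one job, 3 x 10 = 30 Mb/s of the 35 Mb/s
# pipe is in use -- which is exactly why the pipe only fills when all
# three are transferring at once.
```

The same arithmetic works per filer: splitting one big volume into several smaller relationships multiplies the per-filer thread count, not the per-thread speed.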
One last real-world example. A cluster-mode 6290 (which does better than 7-Mode) pushing 1000 miles over a 10 Gb link to an 8060: on the initial replication, single thread, from and to high-performance disk at both ends, the best I could do was 1.5 Gb/s per link. Which was fine. I can fill the pipe with ease when my updates fire, but then I'm replicating several hundred volumes from 8 separate nodes across that 10 Gb link. My "capacity" disks on lower-class nodes actually fill the pipe better, because they push more threads in total and they aren't as performance-bound to clients as my performance clusters are.
I hope this helps.
Lead Storage Engineer
Huron Legal | Huron Consulting Group
NCDA, NCIE - SAN Clustered, Data Protection
Kudos and accepted solutions are always appreciated.