Network and Storage Protocols

SnapMirror performance over WAN


I have multiple clients that are experiencing slow replication over WAN links despite having large pipes.

One client has a 150 Mbit MPLS network. We are only able to push about 4 MB/sec per SnapMirror relationship.

We have installed a Riverbed WAN optimizer but that did not help.

The latency on the link is about 45ms.

The network team reports the link is less than 30% utilized most of the time.

I'm looking for examples of SnapMirror performance over large pipes (>100 Mbit) with high latency (>30 ms) for comparison.

We've been chasing this for a couple of months and are no closer to figuring out why the replication is slow.
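For what it's worth, the bandwidth-delay-product arithmetic on the numbers above suggests a TCP window limit rather than a bandwidth limit. This is a rough sketch only; the window sizes are inferred from the observed rate, not measured:

```python
# Rough sanity check: is ~4 MB/s per relationship a TCP window limit?
link_bps = 150_000_000            # 150 Mbit/s MPLS link
rtt_s = 0.045                     # 45 ms round-trip latency

# Bytes that must be in flight to fill the pipe (bandwidth-delay product)
bdp_bytes = (link_bps / 8) * rtt_s
print(f"BDP: {bdp_bytes / 1024:.0f} KiB")                    # ~824 KiB

# The effective window implied by the observed ~4 MB/s per relationship
observed_Bps = 4 * 1024 * 1024
implied_window = observed_Bps * rtt_s
print(f"Implied window: {implied_window / 1024:.0f} KiB")    # ~184 KiB
```

If the effective window per connection really is stuck near 184 KiB, each relationship tops out well below the ~824 KiB of in-flight data the link could carry, which would also match the under-30% utilization the network team reports.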





Hi Chuck ... this reminds me of a customer we interviewed for Tech OnTap years ago, though in that case WAN accelerator technology addressed their issue.

I'm not in engineering or support, but is there any possibility the WAN routing distance might be part of the issue? See below:

"We had adequate bandwidth, but because of the physical cabling limitations with the WAN routing it was actually a longer distance from Reno to Las Vegas than it would be from San Francisco to New York," explains Defilippi. "We were experiencing an average latency of 28 milliseconds, which caused the mirroring process to take upward of three hours and consume more than half of our bandwidth."



Hi ~ I know this one, as we had the same problem two months ago.

1) We SnapMirror over a 1Gb WAN link, and throughput died once the network team turned on the Riverbed "Steelhead" boxes at each side of the link. They did not tell me they had done this, and I spent many happy hours with pktt and Wireshark working out what had happened. The Steelhead was compressing the data by about 70-80% but was killing throughput due to a TCP windowing problem: the filers detected the altered packet headers and rejected the packets. Turning off the Steelhead for SnapMirror traffic solved the issue.

2) Slow throughput Mk II. At another site with a different set of filers, we had enabled link aggregation (IEEE 802.3ad) on the filer heads but cabled them into two different core switches. iSCSI and FC had no issues, but CIFS traffic was also slow; nobody noticed that, only the SnapMirror problem. When we had a closer look at the LAN: ~4,000 errors per second in the Cisco logs! We changed the VIF to active/passive. Issue solved.

3) Slow throughput Mk III. At another site with yet another set of filers, the problem was two routes available between the filers: a 2Mb pipe and a 90Mb pipe. The traffic was going down the 2Mb pipe even though we believed (and the network team 'confirmed') it was going down the 90Mb pipe. Again, pktt and Wireshark showed the truth. We had to edit the snapmirror.conf file to route the traffic over the correct interfaces.

Let me know which is your issue and I will send more details on the solution. Good luck
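For case 3, the snapmirror.conf change was along these lines. The names, IPs, and schedule below are made up for illustration; check the na_snapmirror.conf man page for your ONTAP version:

```
# /etc/snapmirror.conf on the destination filer
# Name a connection pinned to the IPs on the 90Mb path
conn_90mb = multi(10.90.0.1, 10.90.0.2)

# Reference the connection name instead of the source filer hostname
conn_90mb:src_vol dst_filer:dst_vol - 0 23 * *
```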


Situation 1 is our configuration.

The SnapMirror had been running successfully for several months. I was called in when the volumes lost synchronization.

In reviewing the logs, it appeared the transfer rate had dropped significantly a few months earlier. The only change anyone could point to was the installation of the Steelheads.

We put a rule in the Steelheads to bypass the SnapMirror traffic, but that did not help.

The steelheads are installed in-path so the option to remove them is not available.

Any additional details you can offer are appreciated. Did you have a case with Riverbed that we can reference?




Happy to send details to your email address. Can you please email me via the community message system and I will send over the case number.

The issue was solved by adding a rule to the Steelhead to pass through SnapMirror traffic from source {filer IP} to the destination IP over TCP port 10566. This needs to be on the destination-site Steelhead; we mirror both ways, so we have the rule on both units. The rule is only applied to NEW SnapMirror connections. We run async, so we quiesced the 'stuck' SnapMirrors and then set them running again. All good.
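From memory, the pass-through rule looked roughly like the following RiOS CLI. The IPs and rule number are placeholders, and the exact syntax may vary by RiOS version, so treat this as a sketch and verify against the Riverbed CLI reference:

```
steelhead (config) # in-path rule pass-through srcaddr 10.1.1.10/32 dstaddr 10.2.1.10/32 dstport 10566 rulenum 1
```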

If you use pktt on the filer you can capture a packet trace; use Wireshark to analyze it. Look for duplicated packets ('[TCP Dup ACK]' indicators) and TCP windowing issues.
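For reference, a typical 7-Mode capture session looks something like this (the interface name and output path are examples; check the pktt man page for your version):

```
filer> pktt start e0a -d /etc/log    # start capturing on interface e0a, writing under /etc/log
filer> pktt dump e0a                 # flush buffered packets out to the trace file
filer> pktt stop e0a                 # stop the capture
```

Open the resulting .trc file in Wireshark and try the display filters `tcp.analysis.duplicate_ack` and `tcp.analysis.zero_window`, plus `tcp.port == 10566` to isolate the SnapMirror stream.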



I see that Bendon's woes were caused by an issue with Steelheads and SnapMirror. Does anyone out there have first-hand experience with how well (or poorly) SnapMirror performs with SilverPeak WAN boxes?



Sorry, I have not come across them.


Has anyone tested Riverbeds with and without the SnapMirror compression feature that's in ONTAP 7.3.2?

If you have, I'd love to hear from you and see your results...


We recently upgraded to 7.3.2P3 and did not notice any issues with our Steelhead and SnapMirror traffic. We haven't turned on compression yet and don't plan to, as we run a FAS2050 and CPU is already highly utilized.