Snapmirror running slow - 2MB/s over 10GbE

ed_symanzik
30,788 Views

I am seeing unhealthy snapmirror times; transfer rates on the order of 2MB/s over 10GbE.

 

last-transfer-size     last-transfer-duration
631.6GB                86:35:23
11.16GB                2:12:48
755.5GB                91:32:48
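
As a quick sanity check on these figures, the effective rate of each transfer can be worked out from its size and duration. This is only a sketch: it assumes the sizes are binary gigabytes (1 GB = 1024 MB) and the durations are h:mm:ss, and the helper names are made up for illustration.

```python
def duration_seconds(hms: str) -> int:
    """Convert an h:mm:ss string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def rate_mib_per_s(size_gib: float, hms: str) -> float:
    """Average rate in MiB/s, assuming the size is in binary gigabytes."""
    return size_gib * 1024 / duration_seconds(hms)


# Size/duration pairs copied from the snapmirror output above.
transfers = [(631.6, "86:35:23"), (11.16, "2:12:48"), (755.5, "91:32:48")]

for size_gib, hms in transfers:
    print(f"{size_gib:>8} GB in {hms:>9}: {rate_mib_per_s(size_gib, hms):.2f} MiB/s")
```

All three transfers work out to somewhere between about 1.4 and 2.4 MiB/s, which matches the roughly 2MB/s quoted.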

 

Schedule is to run every 10 minutes.

 

The network is clean.  ifstat has not recorded an error in months. 

No duplex mismatches and negotiated speed is 10000.

Network is not maxed out, hits 80% at times.

Jumbo frames are enabled throughout. 

CPU load is low on a pair of FAS6220's.

Systems using the storage get good performance.

Throttle is unlimited.

 

What else do I check?

NetApp_SEAL
13,329 Views

Also - did I read that correctly? Three switches?

So what's the data path? Node A --> Switch 1 --> Switch 2 --> Switch 3 --> Node B?

For SnapMirror, are you using dedicated Intercluster ports or Intercluster LIFs on Data ports? Using interface group with VLAN tagging?

You're going from Node A to Node B in the same cluster, correct? So that would be INTRAcluster vs. INTERcluster?

ed_symanzik
11,039 Views

Did a test over the weekend of INTRAcluster snapmirror, node1 volume snapmirrored to node2.  1.1TB transferred in 3.5 days  ~3.6MB/s.

mark_schuren
13,329 Views

You say it now works like a charm - may I ask what snapmirror throughput do you get over 10GE?

I'm having problems getting snapmirror dp or xdp to be fast in 10gigE environments. Would like to hear what speeds others are seeing.

And maybe you have another idea regarding my post on same topic: https://communities.netapp.com/thread/33682 ...? Thanks.

NetApp_SEAL
12,803 Views

So in my most recent scenario, and after some additional testing, here's what I can see. Mind you, I'm still waiting on the client's network team to validate the results from the latest transfer. Keep in mind that this testing is over a 10 Gbps WAN link, not a local LAN. Latency is around 40 ms. The last test was done with MTU 9000 end-to-end. A test earlier in the day with MTU 1500 yielded approximately the same transfer rates and times, give or take a few minutes. Granted, the bump from MTU 1500 to MTU 9000 is leveraged far more efficiently by things that take better advantage of the larger frames, like iSCSI, so I don't expect a drastic difference between MTU 1500 and MTU 9000 for SnapMirror traffic.

Topology: Site A Cluster --> 5K --> 7K --> 7K --> 5K --> Site B Cluster

Using InterCluster SnapMirror over intercluster LIFs (one per node) on a 2-port 10 Gb ifgrp (data ports) VPC using LACP

Separate VLAN per site (so routed VLAN, not extended via OTV)

As a test earlier, I set up a SnapMirror for a 1 TB volume from Site A (Prod) to Site B (DR).

If I'm calculating things correctly (and again, if someone can validate if this is not correct, please call it out),

Theoretical limit of a 10 Gbps link is 1.25 GBps
Using the calculation of 10,000 Mbps / 8 = 1,250 MBps

Transferred 1 TB of data from Site A to Site B in 1 hour

Using the calculation of 1 TB = 8,796,093,022,208 bits (2^40 bytes x 8)

Using the calculation of 1 hour = 3,600 seconds

So then (if that's all correct), 8,796,093,022,208 bits / 3,600 seconds = 2,443,359,173 bits per second

Convert 2,443,359,173 bits per second: 291.27 MBps in binary units (MiB/s) = 305.4 MBps in decimal units = 2.44 Gbps = 0.305 GBps

So at 0.305 GBps, and a theoretical line rate for 10 Gbps of 1.250 GBps, that translates to about 24.4% efficiency. (Dividing the binary figure of 0.284 GiB/s by the decimal 1.250 GBps gives 22.72%, but that mixes units.) Granted, this could be expected, due to additional configuration issues in the data path, or something else I haven't come across yet.

(I'm sure there's a quicker way to calculate that math, this is just how I walked through it in my head and on paper to get a better understanding)
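
For anyone who wants to re-run the numbers, here is a minimal sketch of the same calculation. It assumes "1 TB" means 2^40 bytes and that the transfer took exactly one hour; the function name is made up for illustration. It reports the rate in both binary and decimal units, since mixing the two shifts the efficiency figure by a couple of percentage points.

```python
def transfer_stats(size_bytes: int, seconds: float, link_bps: float = 10e9) -> dict:
    """Average throughput of a transfer and its share of the link's line rate."""
    bits_per_s = size_bytes * 8 / seconds
    return {
        "bits_per_s": bits_per_s,
        "MiB_per_s": size_bytes / seconds / 2**20,   # binary megabytes
        "MB_per_s": size_bytes / seconds / 1e6,      # decimal megabytes
        "Gbit_per_s": bits_per_s / 1e9,              # decimal gigabits
        "efficiency_pct": 100 * bits_per_s / link_bps,
    }


# Assumption: 1 TB = 2**40 bytes, moved in exactly 3,600 seconds.
stats = transfer_stats(2**40, 3600)
for name, value in stats.items():
    print(f"{name:>15}: {value:,.2f}")
```

With consistent decimal units on both sides, the share of line rate comes out a little above 24%.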

Main thing for me here is that I'm trying to gauge how long some other, larger transfers are going to take (say, 10-15 TB or so) and to determine if the process is working as expected according to client expectations (again, not knowing if there might be some network-related issues they need to examine to improve efficiency, or if they would get the same results running something like iPerf). The client has never had to push this much data across the WAN before, so it's essentially their first real test of high throughput on the links (they actually have two redundant 10 Gbps links, but they are there for exactly that - redundancy - and not performance). Question is - can the Production cluster at Site A really PUSH total expected line rate with the transfers (most likely not), and so are these results desirable? Here, in this scenario, I would think so. Will I see any queuing or drops? I don't think I will, but we'll see.
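
A rough way to project the larger transfers is to scale linearly from the measured rate. The sketch below assumes the ~291 MiB/s average holds for the whole transfer and that sizes are binary terabytes; `transfer_hours` is a hypothetical helper, not anything from ONTAP.

```python
def transfer_hours(size_tib: float, rate_mib_per_s: float) -> float:
    """Hours needed to move size_tib at a constant rate_mib_per_s."""
    size_mib = size_tib * 1024 * 1024
    return size_mib / rate_mib_per_s / 3600


MEASURED_RATE = 291.27  # MiB/s, the average from the 1 TB test

for size_tib in (10, 15):
    hours = transfer_hours(size_tib, MEASURED_RATE)
    print(f"{size_tib} TB at {MEASURED_RATE} MiB/s: about {hours:.1f} hours")
```

Since the 1 TB test took one hour, this is simply about one hour per TB, provided nothing else on the link or the disks changes.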

So to the OP's point (and maybe to answer your question as well), he was seeing only about 2 MBps over a 10 Gb link, and if that's a local LAN link, the efficiency should be much higher. In this case, I achieved 291.27 MBps over a 10 Gb WAN link at a far lower (expected) efficiency rate.

mark_schuren
12,803 Views

I think ~290MByte/s (average?) sounds better than my own experiences. Maybe you max out the source or destination disks / aggr at some point during the transfer?

By the way, have you observed a more or less "constant stream", or do you see peaky throughput (per second / per minute...)?

See my thoughts regarding tcp window sizes: https://communities.netapp.com/message/135397
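
To illustrate why window size matters on a long link, here is a minimal sketch of the standard bandwidth-delay product arithmetic. It assumes the ~40 ms latency quoted earlier is a round-trip time, and the 2 MiB window is a hypothetical value picked only to show the scale of the effect, not an actual ONTAP setting.

```python
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_bps * rtt_s / 8


def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound for a single TCP stream limited only by its window size."""
    return window_bytes * 8 / rtt_s


RTT_S = 0.040      # assumption: the ~40 ms latency is round-trip
LINK_BPS = 10e9    # 10 Gbit/s link

print(f"window needed to fill the link: {bdp_bytes(LINK_BPS, RTT_S) / 1e6:.0f} MB")

# Hypothetical window size, chosen only for illustration.
window = 2 * 2**20
ceiling = max_throughput_bps(window, RTT_S)
print(f"ceiling with a 2 MiB window: {ceiling / 1e6:.0f} Mbit/s")
```

Under these assumptions a single stream needs tens of megabytes in flight to fill a 10 Gbit/s link at 40 ms, so a small window caps throughput well below line rate no matter how clean the network is.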

Cheers,

Mark

NetApp_SEAL
12,803 Views

Mark - I'll certainly dig into that note you call out in your other thread about the cluster-level LAN / WAN buffer size and see if the changes result in any throughput increase.

Regarding maxing out source or destination disks / aggrs during the transfer - definitely doesn't seem to be the case. Extremely low utilization of both (at both sites) when SnapMirror is running. I had disabled the reallocation settings at both sites as well to test (per a post in your other thread).

The stream when transferring has been incredibly constant, despite the low efficiency across the WAN. Handing that part over to the client's LAN / WAN team at this point, as from a NetApp perspective, everything looks to be working properly, and certainly better than when initially implemented.

I have another client with some similar issues over 10 Gb that I will shift focus to next and apply some of these same things.

cedric_renauld
11,037 Views

Hello,

I'm joining this thread late, after several replies. Is it already closed?

If not: some time ago I had the same problem, and the root cause turned out to be the default router.

Could you send us the routing table, plus the names and IPs used for the SnapMirror relationship?

Thanks

ed_symanzik
11,013 Views

na-adm
    c35.8.5.0/26
        35.8.5.0/26 cluster-mgmt 20
na-adm-01
    c169.254.0.0/16
        169.254.0.0/16 cluster 30
    i172.16.34.0/24
        172.16.34.0/24 intercluster 40
    n35.8.5.0/26
        35.8.5.0/26 node-mgmt 10
na-adm-02
    c169.254.0.0/16
        169.254.0.0/16 cluster 30
    i172.16.34.0/24
        172.16.34.0/24 intercluster 40
    n35.8.5.0/26
        35.8.5.0/26 node-mgmt 10

 

Routing
Vserver Group Destination Gateway Metric
--------- --------- --------------- --------------- ------
na-adm
    c35.8.5.0/26
        0.0.0.0/0 35.8.5.1 20
na-adm-01
    n35.8.5.0/26
        0.0.0.0/0 35.8.5.1 10
na-adm-02
    n35.8.5.0/26
        0.0.0.0/0 35.8.5.1 10

 

na-cc
    c35.8.5.0/26
        35.8.5.0/26 cluster-mgmt 20
na-cc-01
    c169.254.0.0/16
        169.254.0.0/16 cluster 30
    i172.16.34.0/24
        172.16.34.0/24 intercluster 40
    n35.8.5.0/26
        35.8.5.0/26 node-mgmt 10
na-cc-02
    c169.254.0.0/16
        169.254.0.0/16 cluster 30
    i172.16.34.0/24
        172.16.34.0/24 intercluster 40
    n35.8.5.0/26
        35.8.5.0/26 node-mgmt 10

 

Routing
Vserver Group Destination Gateway Metric
--------- --------- --------------- --------------- ------
na-cc
    c35.8.5.0/26
        0.0.0.0/0 35.8.5.1 20
na-cc-01
    n35.8.5.0/26
        0.0.0.0/0 35.8.5.1 10
na-cc-02
    n35.8.5.0/26
        0.0.0.0/0 35.8.5.1 10

ed_symanzik
11,014 Views

It doesn't look like my reply was accepted, but I posted the routing tables earlier in the thread.

MK
8,783 Views

Did this ever get resolved? I am having the same issue.
