ONTAP Discussions

Snapmirror running slow - 2MB/s over 10GbE

ed_symanzik

I am seeing unhealthy snapmirror times; transfer rates on the order of 2MB/s over 10GbE.

 

last-transfer-size     last-transfer-duration

      631.6GB            86:35:23              

      11.16GB            2:12:48               

      755.5GB            91:32:48            

 

Schedule is to run every 10 minutes.

 

The network is clean.  ifstat has not recorded an error in months. 

No duplex mismatches and negotiated speed is 10000.

The network is not maxed out; it hits 80% utilization at times.

Jumbo frames are enabled throughout. 

CPU load is low on a pair of FAS6220's.

Systems using the storage get good performance.

Throttle is unlimited.

 

What else do I check?


cedric_renauld

Hello,

 

Hmm, I'm picking this post up after a few replies - has your thread been closed?

If not, a while ago I had the same problem, and the root cause turned out to be the default router...

 

Could you send us the routing table and the names and IP addresses used for the SnapMirror relationship?

Thanks

It doesn't look like my reply was accepted, but I posted the routing tables earlier in the thread.

MK

Did this ever get resolved?  I am having the same issue.

Routing
Vserver Group Subnet Role Metric
--------- --------- --------------- ------------ -------
na-adm
    c35.8.5.0/26
        35.8.5.0/26 cluster-mgmt 20
na-adm-01
    c169.254.0.0/16
        169.254.0.0/16 cluster 30
    i172.16.34.0/24
        172.16.34.0/24 intercluster 40
    n35.8.5.0/26
        35.8.5.0/26 node-mgmt 10
na-adm-02
    c169.254.0.0/16
        169.254.0.0/16 cluster 30
    i172.16.34.0/24
        172.16.34.0/24 intercluster 40
    n35.8.5.0/26
        35.8.5.0/26 node-mgmt 10

 

Routing
Vserver Group Destination Gateway Metric
--------- --------- --------------- --------------- ------
na-adm
    c35.8.5.0/26
        0.0.0.0/0 35.8.5.1 20
na-adm-01
    n35.8.5.0/26
        0.0.0.0/0 35.8.5.1 10
na-adm-02
    n35.8.5.0/26
        0.0.0.0/0 35.8.5.1 10

 

Routing
Vserver Group Subnet Role Metric
--------- --------- --------------- ------------ -------
na-cc
    c35.8.5.0/26
        35.8.5.0/26 cluster-mgmt 20
na-cc-01
    c169.254.0.0/16
        169.254.0.0/16 cluster 30
    i172.16.34.0/24
        172.16.34.0/24 intercluster 40
    n35.8.5.0/26
        35.8.5.0/26 node-mgmt 10
na-cc-02
    c169.254.0.0/16
        169.254.0.0/16 cluster 30
    i172.16.34.0/24
        172.16.34.0/24 intercluster 40
    n35.8.5.0/26
        35.8.5.0/26 node-mgmt 10

 

Routing
Vserver Group Destination Gateway Metric
--------- --------- --------------- --------------- ------
na-cc
    c35.8.5.0/26
        0.0.0.0/0 35.8.5.1 20
na-cc-01
    n35.8.5.0/26
        0.0.0.0/0 35.8.5.1 10
na-cc-02
    n35.8.5.0/26
        0.0.0.0/0 35.8.5.1 10

NetApp_SEAL

Question - can you post an output of the following:

- Traceroute from source to destination IC LIF

- routing-group show

- routing-group route show

I'm just curious

So I've been battling this same issue for a client for a large part of the day today.

FAS8040 at the source, FAS3220 at the destination. Both sites use 10 Gb interface groups for data with respective VLAN tagging, and the intercluster LIFs sit on that same data interface group rather than a dedicated port (the client doesn't have the proper SFPs for the other UTA2 ports to dedicate one - maybe in the future). The client's network admin was only seeing about 1 Gb of utilization over the 10 Gb link between sites. So I was like... yeah... great... I've seen this one before.

I ran into two issues:

1) Routing group configuration

2) Jumbo Frames configuration

When the node intercluster LIFs were created, the respective routing groups got created with them (but not the routes). The VLAN isn't spanned between sites, so I needed to create the proper routes. Did that. Can ping fine. Yay! Still had issues. So I ran a traceroute from the IC LIF on Node A at the source site to the IC LIF on Node B at the destination site. Even with the routing group (and route) in place, the first hop at each site was the node / cluster management VLAN on the respective node (which was interesting to me, since management is on a different physical port and a different subnet).

The route for inter-cluster was using a destination of 0.0.0.0/0 (but with the proper gateway - the SVI of the SnapMirror VLAN). Decided to delete the route and create a new one. So, I went old skool 7-Mode static route on it and set the proper destination (the subnet of the destination site) and the same gateway (the local SnapMirror VLAN SVI). Now the trace comes back clean and the first hop is the SnapMirror VLAN SVI. w00t!

Still won't work.

Turns out there WEREN'T jumbo frames all the way from source to destination. The topology has 5Ks at each site and two 7Ks in between the sites over two 10 Gb links. I believe the network admin forgot to set the per-VLAN MTU for the SnapMirror VLAN on the 7Ks to 9000. I'll get him to fix that tomorrow.

So I went back and set the MTU to 1500 on the VLAN interfaces on each node, then aborted / restarted the SnapMirror relationships (they seemed hung). Now it works like a charm.

Hope to have some metrics to compare against in the AM to validate.

Hope this helps. Let me know!

Cheers,
Trey

mark_schuren

You say it now works like a charm - may I ask what SnapMirror throughput you get over 10GbE?

I'm having problems getting SnapMirror DP or XDP to run fast in 10GbE environments. I would like to hear what speeds others are seeing.

And maybe you have another idea regarding my post on same topic: https://communities.netapp.com/thread/33682 ...? Thanks.

NetApp_SEAL

So in my most recent scenario, and after some additional testing, here's what I can see. Mind you, I'm still waiting on the client's network team to validate the results from the latest transfer. And keep in mind that this testing is over a 10 Gbps WAN link, not a local LAN; latency is around 40 ms. The last test was done with MTU 9000 end-to-end. (A test earlier in the day with MTU 1500 yielded approximately the same transfer rates / times, give or take a few minutes. Granted, the bump from 1500 to 9000 MTU is leveraged far more by traffic that takes better advantage of the larger payload per frame, like iSCSI, so I don't expect a drastic difference between 1500 MTU and 9000 MTU for SnapMirror traffic.)

Topology: Site A Cluster --> 5K --> 7K --> 7K --> 5K --> Site B Cluster

Using InterCluster SnapMirror over intercluster LIFs (one per node) on a 2-port 10 Gb ifgrp (data ports) VPC using LACP

Separate VLAN per site (so routed VLAN, not extended via OTV)

As a test earlier, I set up a SnapMirror for a 1 TB volume from Site A (Prod) to Site B (DR).

If I'm calculating things correctly (and again, if someone can validate that this is not correct, please call it out):

The theoretical limit of a 10 Gbps link is 1.25 GBps
(10,000,000,000 bits / 8 = 1,250,000,000 bytes per second)

Transferred 1 TB of data from Site A to Site B in 1 hour

Using the calculation of 1 TB = 8,796,093,022,208 bits

Using the calculation of 1 hour = 3,600 seconds

So then (if that's all correct), 8,796,093,022,208 bits / 3,600 seconds ≈ 2,443,359,173 bits per second


Converting 2,443,359,173 bits per second to bytes gives ≈ 291.27 MBps = 2.275 Gbps = 0.284 GBps

So at 0.284 GBps against a theoretical line rate of 1.250 GBps for 10 Gbps, that translates to 22.72% efficiency. Granted, that could be expected, due to additional configuration issues in the data path or something else I haven't come across yet.

(I'm sure there's a quicker way to calculate that math, this is just how I walked through it in my head and on paper to get a better understanding)
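
If anyone wants to sanity-check this kind of math quickly, here's a minimal Python sketch of the same arithmetic (the 1 TB / 1 hour figures are from the test above; the 12 TB size is just a hypothetical larger baseline). The small differences from the numbers above come down to binary vs. decimal prefixes.

# Transfer-rate arithmetic: observed throughput, efficiency vs. 10 Gbps line
# rate, and a rough ETA for a larger transfer at the same observed rate.
data_bytes = 1 * 1024**4        # 1 TB transferred (binary, i.e. TiB)
duration_s = 1 * 3600           # 1 hour
line_rate_Bps = 10e9 / 8        # 10 Gbps link = 1.25e9 bytes/s theoretical

throughput_Bps = data_bytes / duration_s
print(f"throughput: {throughput_Bps / 1024**2:.2f} MiB/s "
      f"({throughput_Bps * 8 / 1e9:.2f} Gbps)")
print(f"efficiency: {throughput_Bps / line_rate_Bps:.1%} of line rate")

# Rough ETA for a hypothetical larger transfer (e.g. 12 TB) at the same rate
big_bytes = 12 * 1024**4
print(f"12 TB ETA : {big_bytes / throughput_Bps / 3600:.1f} hours")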

The main thing for me here is that I'm trying to gauge how long some other, larger transfers are going to take (say, 10-15 TB or so) and to determine whether the process is working as expected according to client expectations (again, not knowing whether there are network-related issues they need to examine to improve efficiency - for instance, if they run something like iPerf, do they get the same results?). The client has never had to push this much data across the WAN before, so it's essentially their first real test of high throughput on the links (they actually have two redundant 10 Gbps links, but that's more for redundancy than performance). The question is - can the Production cluster at Site A really PUSH the total expected line rate with these transfers (most likely not), and are these results reasonable? Here, in this scenario, I would think so. Will I see any queuing or drops? I don't think I will, but we'll see.

So to the OP's point (and maybe to answer your question as well) - he was seeing only about 2 MBps over a 10 Gb link, and if that's a local LAN link, the efficiency should be much higher. In this case, I achieved 291.27 MBps over a 10 Gb WAN link at a far lower (and expected) efficiency rate.

mark_schuren

I think ~290 MByte/s (average?) sounds better than my own experience. Maybe you max out the source or destination disks / aggregates at some point during the transfer?

By the way, have you observed a more or less "constant stream", or do you see peaky throughput (per second / per minute...)?

See my thoughts regarding tcp window sizes: https://communities.netapp.com/message/135397

Cheers,

Mark

NetApp_SEAL

Mark - I'll certainly dig into that note you call out in your other thread about the cluster-level LAN / WAN buffer size and see if the changes result in any throughput increase.

Regarding maxing out source or destination disks / aggrs during the transfer - that definitely doesn't seem to be the case. Extremely low utilization of both (at both sites) when SnapMirror is running. I had also disabled the reallocation settings at both sites to test (per a post in your other thread).

The stream when transferring has been incredibly constant, despite the low efficiency across the WAN. Handing that part over to the client's LAN / WAN team at this point, as from a NetApp perspective, everything looks to be working properly, and certainly better than when initially implemented.

I have another client with some similar issues over 10 Gb that I will shift focus to next and apply some of these same things.

ed_symanzik

I have two nodes dual connected to the same two switches, so traceroute shows one hop.

na-adm::> net routing-groups show -vserver na-adm-*

  (network routing-groups show)

          Routing

Vserver   Group     Subnet          Role         Metric

--------- --------- --------------- ------------ -------

na-adm-01

          c169.254.0.0/16

                    169.254.0.0/16  cluster           30

          i172.16.34.0/24

                    172.16.34.0/24  intercluster      40

          n35.8.5.0/26

                    35.8.5.0/26     node-mgmt         10

na-adm-02

          c169.254.0.0/16

                    169.254.0.0/16  cluster           30

          i172.16.34.0/24

                    172.16.34.0/24  intercluster      40

          n35.8.5.0/26

                    35.8.5.0/26     node-mgmt         10

6 entries were displayed.

na-adm::> net routing-groups route show -vserver na-adm-*

  (network routing-groups route show)

          Routing

Vserver   Group     Destination     Gateway         Metric

--------- --------- --------------- --------------- ------

na-adm-01

          n35.8.5.0/26

                    0.0.0.0/0       35.8.5.1        10

na-adm-02

          n35.8.5.0/26

                    0.0.0.0/0       35.8.5.1        10

2 entries were displayed.

The big weirdness at this point is that outgoing jumbo pings fail but incoming jumbo pings succeed. Jumbos also work across the cluster switches.

NetApp_SEAL

Ok, so this is local Layer 2, correct? Not routing over a WAN to another site, so therefore, no routing required?

ed_symanzik

Correct.  The snapmirror is three switches away, but no router.  The jumbo ping shows up node-switch-node.

NetApp_SEAL

Also - did I read that correctly? Three switches?

So what's the data path? Node A --> Switch 1 --> Switch 2 --> Switch 3 --> Node B?

For SnapMirror, are you using dedicated Intercluster ports or Intercluster LIFs on Data ports? Using interface group with VLAN tagging?

You're going from Node A to Node B in the same cluster, correct? So that would be INTRAcluster vs. INTERcluster?

ed_symanzik

Did a test over the weekend of INTRAcluster snapmirror, with a node1 volume snapmirrored to node2.  1.1 TB transferred in 3.5 days, ~3.6 MB/s.

NetApp_SEAL

Have you checked the SnapMirror TCP window size?

node run -node * options snapmirror.window_size

If it is not set at the recommended value for 10 Gb, set it to 8388608:

node run -node * options snapmirror.window_size 8388608

(Source: https://library.netapp.com/ecmdocs/ECMM1278318/html/onlinebk/protecting/task/t_oc_prot_sm-adjust-tcp-window-size.html)
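
For context on why that window size matters over a high-latency link: a single TCP stream tops out at roughly window size / round-trip time. Here's a minimal Python sketch assuming the ~40 ms WAN RTT discussed earlier in the thread (the 2 MiB value is just an illustrative smaller window, not a claim about any ONTAP default):

# Bandwidth-delay product check: max throughput of one TCP stream for a
# given window size and RTT, plus the window needed to fill 10 Gbps.
def max_gbps(window_bytes: int, rtt_s: float) -> float:
    return window_bytes * 8 / rtt_s / 1e9

for window in (2 * 1024**2, 8388608):   # illustrative 2 MiB vs. recommended 8 MiB
    print(f"window {window:>8} B @ 40 ms RTT -> {max_gbps(window, 0.040):.2f} Gbps max")

print(f"window needed to fill 10 Gbps @ 40 ms: {int(10e9 / 8 * 0.040):,} bytes")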

Another thing I could think to check would be the ingress/egress of the ports. Perhaps some QoS policy issues at the network layer (probably unlikely, but worth looking into - I've seen cases where a QoS policy set at the switch level choked bandwidth to death for Ethernet traffic).

DOMINIC_WYSS

Are you using jumbo frames everywhere or just on the SnapMirror interfaces? Make sure every switch between the source and destination filer is configured correctly for jumbo frames.

What size are the jumbo frames? If the switches and ONTAP are both configured to 9000 and you are using VLANs, then either set the switches bigger or ONTAP smaller.
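
To make that mismatch concrete, here's a small Python sketch of the on-the-wire frame size for a 9000-byte IP MTU with and without an 802.1Q VLAN tag (keep in mind that whether a given switch's MTU setting counts the Ethernet header at all varies by platform, so check the vendor docs):

# On-the-wire Ethernet frame size for a given IP MTU, untagged vs. VLAN-tagged.
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
VLAN_TAG = 4      # 802.1Q tag
FCS = 4           # frame check sequence

def frame_size(ip_mtu: int, tagged: bool) -> int:
    return ip_mtu + ETH_HEADER + (VLAN_TAG if tagged else 0) + FCS

print(frame_size(9000, tagged=False))   # 9018 - the switch MTU mentioned later in the thread
print(frame_size(9000, tagged=True))    # 9022 - tagged frames need a little more headroom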

Check your routing table (route -s). If you have e0M on the same subnet as another interface, the default gateway route sometimes goes over it (and that's 100 Mbit!). Then you need to redesign your network config...

ed_symanzik

Jumbo frames everywhere.  VLAN'd correctly.  ONTAP configured to 9000 MTU and the switches to something larger (9018, I think).

We did find an odd route and removed it, though switch statistics showed that it wasn't being used, and the situation has not improved (up to 7 days on the current snapshot).

DOMINIC_WYSS

Just to be sure, I would temporarily set the MTU to the standard 1500 (just on the source and destination filers).

We had a customer with jumbo frame problems on 10GbE, and it was an issue on the switches - they just couldn't handle jumbos correctly at 10GbE.

Another thing may be flow control; we always disable it on 10GbE: ifconfig eXa flowcontrol none. I've seen at some customers that NetApp Professional Services also disables it on all 10GbE interfaces.

ed_symanzik

Set MTU to 1500.  No change in performance.

ed_symanzik

Can flow control and the MTU be changed without disruption?

With what switches did you see the jumbo/10GbE issue?
