ONTAP Discussions

Snapmirror running slow - 2MB/s over 10GbE

I am seeing unhealthy snapmirror times; transfer rates on the order of 2MB/s over 10GbE.

 

last-transfer-size     last-transfer-duration
631.6GB                86:35:23
11.16GB                2:12:48
755.5GB                91:32:48
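As a sanity check on the "2MB/s" figure, the average rate can be worked out from the sizes and durations in the table. This is only a rough sketch: it assumes the durations are H:MM:SS and that 1 GB means 1024 MB, neither of which is stated in the output. The helper names are made up for illustration.

```python
def parse_duration(hms: str) -> int:
    """Convert an H:MM:SS string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def rate_mb_per_s(size_gb: float, duration: str) -> float:
    """Average transfer rate in MB/s, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / parse_duration(duration)


# Sizes and durations copied from the table above.
transfers = [(631.6, "86:35:23"), (11.16, "2:12:48"), (755.5, "91:32:48")]

for size_gb, duration in transfers:
    rate = rate_mb_per_s(size_gb, duration)
    print(f"{size_gb:>7} GB in {duration:>8} -> {rate:.2f} MB/s")
```

Under those assumptions the three transfers come out at roughly 1.4 to 2.3 MB/s.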

 

Schedule is to run every 10 minutes.

 

The network is clean.  ifstat has not recorded an error in months. 

No duplex mismatches and negotiated speed is 10000.

Network is not maxed out, hits 80% at times.

Jumbo frames are enabled throughout. 

CPU load is low on a pair of FAS6220s.

Systems using the storage get good performance.

Throttle is unlimited.

 

What else do I check?

Re: Snapmirror running slow - 2MB/s over 10GbE

Are you using jumbo frames everywhere or just on the snapmirror interfaces? Make sure every switch between the source and destination filer is configured correctly for jumbo frames.

Which size are the jumbo frames? If the switches and ONTAP are both configured to 9000 and you are using VLANs, then either set the switches bigger or ONTAP smaller, because the 802.1Q VLAN tag adds 4 bytes to every frame.
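To make the size question concrete, here is a small sketch of the on-the-wire Ethernet frame size for a given MTU, with and without an 802.1Q tag. The header sizes are standard Ethernet values. What is NOT covered is how a particular switch counts its configured "MTU" (payload only, or including headers), which differs between vendors and has to be checked in the switch documentation.

```python
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
VLAN_TAG = 4     # 802.1Q tag inserted into tagged frames
FCS = 4          # frame check sequence trailer


def frame_size(mtu: int, tagged: bool) -> int:
    """Bytes in a full Ethernet frame carrying `mtu` bytes of payload."""
    return mtu + ETH_HEADER + (VLAN_TAG if tagged else 0) + FCS


for mtu in (1500, 9000):
    untagged = frame_size(mtu, tagged=False)
    tagged = frame_size(mtu, tagged=True)
    print(f"MTU {mtu}: untagged frame {untagged} bytes, tagged frame {tagged} bytes")
```

A 9000-byte MTU therefore means 9018-byte frames untagged and 9022-byte frames once a VLAN tag is added.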

Check your routing table (route -s). If e0M is on the same subnet as another interface, the default gateway routing sometimes goes over it (and that's 100Mbit!). Then you need to redesign your network config...

Re: Snapmirror running slow - 2MB/s over 10GbE

Jumbo frames everywhere.  VLAN'd correctly.  ONTAP is configured to 9000 MTU and the switches to something larger (9018, I think).

We did find an odd route and removed it. Switch statistics showed that it wasn't being used, though, and the situation has not improved (up to 7 days on the current snapshot).

Re: Snapmirror running slow - 2MB/s over 10GbE

What is the latency between sites?

Re: Snapmirror running slow - 2MB/s over 10GbE

Just to be sure, I would temporarily set the MTU to the standard 1500 (just on the source and destination filers).

We had a customer with jumbo frame problems on 10GbE and it turned out to be an issue on the switches. They just couldn't handle jumbos correctly with 10GbE.

Another thing may be flow control; we always disable it on 10GbE: ifconfig eXa flowcontrol none. I've seen at some customers that NetApp Professional Services also disables it on all 10GbE interfaces.

Re: Snapmirror running slow - 2MB/s over 10GbE

I echo the above. I've also seen instances where a snapmirror appears "stuck". When everything checks out and the configurations are correct, I've had to abort the snapmirror and start it over, and then everything worked fine.

Re: Snapmirror running slow - 2MB/s over 10GbE

Question - can you post an output of the following:

- Traceroute from source to destination IC LIF

- routing-group show

- routing-group route show

I'm just curious.

So I've been battling this same issue for a client for a large part of the day today.

FAS8040 at source, FAS3220 at destination. Both sites using 10 Gb interface groups for data with respective VLAN tagging. Using the same data interface group vs. a dedicated port (client doesn't have the proper SFPs for the other UTA2 ports to dedicate - maybe in the future). The client's network admin was only seeing about 1 Gb utilization over the 10 Gb link between sites. So I was like..yeah...great...I've seen this one before.

I ran into two issues:

1) Routing group configuration

2) Jumbo Frames configuration

When the node intercluster LIFs got created, it created the respective routing groups (but not routes). The VLAN isn't spanned, so I needed to create the proper routes. Did that. Can ping fine. Yay! Still had issues. So I ran a traceroute from the IC LIF on Node A at the source site to the IC LIF on Node B at the destination site. Even with the routing group (and route) in place, the first hop at each site was to the node / cluster management VLAN on the respective node, which was interesting to me since management is on 1) a different physical port and 2) a different subnet.

The route for inter-cluster was using a destination of 0.0.0.0/0 (but with the proper gateway - the SVI of the SnapMirror VLAN). Decided to delete the route and create a new one. So, I went old skool 7-Mode static route on it and set the proper destination (the subnet of the destination site) and the same gateway (the local SnapMirror VLAN SVI). Now the trace comes back clean and the first hop is the SnapMirror VLAN SVI. w00t!
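For readers wondering why a subnet-specific static route behaves differently from a 0.0.0.0/0 route, here is a generic longest-prefix-match sketch. It is NOT a model of ONTAP's routing groups, which select routes per LIF. All addresses are hypothetical documentation ranges (192.0.2.0/24 standing in for the remote site, 198.51.100.1 for a management gateway, 203.0.113.1 for the SnapMirror VLAN SVI), and `pick_route` is an illustrative helper, not an ONTAP command.

```python
import ipaddress


def pick_route(destination: str, routes: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the (network, gateway) pair with the longest matching prefix."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(network), gateway)
        for network, gateway in routes
        if dest in ipaddress.ip_network(network)
    ]
    if not matches:
        raise LookupError(f"no route to {destination}")
    network, gateway = max(matches, key=lambda match: match[0].prefixlen)
    return str(network), gateway


# Hypothetical routing table: a default route plus one specific static route.
routes = [
    ("0.0.0.0/0", "198.51.100.1"),    # default route (hypothetical mgmt gateway)
    ("192.0.2.0/24", "203.0.113.1"),  # remote-site subnet via hypothetical SVI
]

print(pick_route("192.0.2.25", routes))  # matches the /24, so the SVI is used
print(pick_route("8.8.8.8", routes))     # only the default route matches
```

In a plain longest-prefix table, the more specific destination always wins over any default route, which leaves no ambiguity about the first hop for the remote subnet.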

Still won't work.

Turns out that there WEREN'T Jumbo Frames all the way from source to destination. The topology has 5Ks at each site then two 7Ks in between the sites over to 10 Gb links. I believe the network admin forgot to set the per-VLAN MTU size for the SnapMirror VLAN on the 7Ks to 9000. I'll get him to fix that tomorrow.

So I went back and set the MTU to 1500 on the VLAN interfaces on each node. Aborted / restarted the SnapMirror relationships (they seemed hung). Now it works like a charm.

Hope to have some metrics to compare against in the AM to validate.

Hope this helps. Let me know!

Cheers,
Trey

Re: Snapmirror running slow - 2MB/s over 10GbE

64 bytes from 172.16.20.33: icmp_seq=1 ttl=64 time=0.196 ms

In getting this number I did discover that the NetApp ping has packet-size and disallow-fragmentation options.

na-adm::> net ping -node na-adm-01 -destination 172.16.34.9 -disallow-fragmentation true -packet-size 2000 -record-route -show-detail -verbose
  (network ping)
ping: sendto: Message too long

Double-checked with our Network Team and jumbo frames are set.  Double-checked the NetApp and jumbo frames are set on the physical ports, the ifgrp, and the vlan.
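For anyone repeating the don't-fragment ping test, a sketch of the arithmetic: the largest ICMP echo payload that fits in one IPv4 packet is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. The binary search below finds the largest size that passes, given some probe function. The probe here is a stand-in that simulates a 1500-byte path; in practice it would wrap a real ping with fragmentation disallowed. Whether `-packet-size` counts payload only or the whole packet is not stated in this thread, so payload is assumed.

```python
from typing import Callable

IP_HEADER = 20   # IPv4 header without options
ICMP_HEADER = 8  # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def largest_passing_payload(probe: Callable[[int], bool],
                            low: int = 0, high: int = 8972) -> int:
    """Binary-search the largest payload size for which probe(size) is True.

    Assumes probe(low) succeeds and that success is monotonic in size.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def fake_probe(size: int) -> bool:
    """Stand-in for a real DF ping: simulates a path limited to MTU 1500."""
    return size <= max_ping_payload(1500)


print(max_ping_payload(9000))               # payload limit for a 9000 MTU
print(largest_passing_payload(fake_probe))  # limit found on the simulated path
```

With a true 9000-byte path, payloads up to 8972 bytes should pass, so a 2000-byte ping being rejected points at a hop or interface with a smaller MTU.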

Re: Snapmirror running slow - 2MB/s over 10GbE

Can flow control and the MTU be changed without disruption?

With what switches did you see the jumbo/10GbE issue?

Re: Snapmirror running slow - 2MB/s over 10GbE

An MTU change takes effect in real time without any disruption.

Disabling flow control bounces the port (down/up), which causes a minimal disruption of a few seconds if you don't have an ifgrp with two ports. To avoid that, you could migrate the LIF to the other node and back again.

We had the issue with Brocade ICX switches.

Re: Snapmirror running slow - 2MB/s over 10GbE

I have two nodes dual connected to the same two switches, so traceroute shows one hop.

na-adm::> net routing-groups show -vserver na-adm-*
  (network routing-groups show)
          Routing
Vserver   Group     Subnet          Role         Metric
--------- --------- --------------- ------------ -------
na-adm-01
          c169.254.0.0/16
                    169.254.0.0/16  cluster           30
          i172.16.34.0/24
                    172.16.34.0/24  intercluster      40
          n35.8.5.0/26
                    35.8.5.0/26     node-mgmt         10
na-adm-02
          c169.254.0.0/16
                    169.254.0.0/16  cluster           30
          i172.16.34.0/24
                    172.16.34.0/24  intercluster      40
          n35.8.5.0/26
                    35.8.5.0/26     node-mgmt         10
6 entries were displayed.

na-adm::> net routing-groups route show -vserver na-adm-*
  (network routing-groups route show)
          Routing
Vserver   Group     Destination     Gateway         Metric
--------- --------- --------------- --------------- ------
na-adm-01
          n35.8.5.0/26
                    0.0.0.0/0       35.8.5.1        10
na-adm-02
          n35.8.5.0/26
                    0.0.0.0/0       35.8.5.1        10
2 entries were displayed.

The big weirdness at this point is that outgoing jumbo pings fail but incoming jumbo pings succeed. Jumbos also work across the cluster switches.
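Since only the node-mgmt routing group has a route, it can help to check which of the routing-group subnets a given destination falls into. The sketch below does that subnet arithmetic for the two addresses pinged earlier in the thread. It only tests membership; it does not model how ONTAP chooses a source LIF or route, and `on_link_group` is an illustrative name.

```python
import ipaddress
from typing import Optional

# Routing groups and subnets copied from the output above.
routing_groups = {
    "c169.254.0.0/16": "169.254.0.0/16",
    "i172.16.34.0/24": "172.16.34.0/24",
    "n35.8.5.0/26": "35.8.5.0/26",
}


def on_link_group(address: str) -> Optional[str]:
    """Return the routing group whose subnet contains `address`, if any."""
    ip = ipaddress.ip_address(address)
    for name, subnet in routing_groups.items():
        if ip in ipaddress.ip_network(subnet):
            return name
    return None


for address in ("172.16.34.9", "172.16.20.33"):
    print(address, "->", on_link_group(address))
```

172.16.34.9 sits inside the intercluster subnet, so it needs no gateway, while 172.16.20.33 is in none of the listed subnets, and the only route that could cover it is the 0.0.0.0/0 entry in the node-mgmt group.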

Re: Snapmirror running slow - 2MB/s over 10GbE

Ok, so this is local Layer 2, correct? Not routing over a WAN to another site, so therefore, no routing required?

Re: Snapmirror running slow - 2MB/s over 10GbE

Correct.  The snapmirror is three switches away, but no router.  The jumbo ping shows up node-switch-node.

Re: Snapmirror running slow - 2MB/s over 10GbE

Have you checked the SnapMirror TCP window size?

node run -node * options snapmirror.window_size

If it is not set to the value recommended for 10 Gb, set it to 8388608:

node run -node * options snapmirror.window_size 8388608

(Source: https://library.netapp.com/ecmdocs/ECMM1278318/html/onlinebk/protecting/task/t_oc_prot_sm-adjust-tcp-window-size.html)
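To put the window size in context, here is a rough sketch of the bandwidth-delay product and of the throughput ceiling a single TCP connection gets from a given window (window divided by round-trip time). It uses the 0.196 ms ping time posted earlier in the thread and assumes that ping reflects the SnapMirror path. It also ignores packet loss, protocol overhead, and multiple streams, so treat the numbers as order-of-magnitude only.

```python
def bdp_bytes(link_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return link_bits_per_s / 8 * rtt_s


def window_limited_mb_per_s(window_bytes: float, rtt_s: float) -> float:
    """Throughput ceiling (MB/s, 1 MB = 1e6 bytes) for one TCP connection."""
    return window_bytes / rtt_s / 1e6


rtt = 0.196e-3    # seconds, from the ping output earlier in the thread
link = 10e9       # 10GbE in bits per second
window = 8388608  # the window size suggested above, in bytes

print(f"BDP at this RTT: {bdp_bytes(link, rtt):.0f} bytes")
print(f"Ceiling with an {window}-byte window: {window_limited_mb_per_s(window, rtt):.0f} MB/s")
print(f"Window that would cap at 2 MB/s: {2e6 * rtt:.0f} bytes")
```

Under these simplifying assumptions, a window-limited ceiling of 2 MB/s at a sub-millisecond round trip would require a window of only a few hundred bytes, so the window size alone is unlikely to explain the observed rate on a local link. It matters far more over high-latency WAN paths.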

Another thing I could think to check would be the ingress/egress of the ports. Perhaps some QoS policy issues at the network layer (probably unlikely but worth looking into - I've seen cases where someone would set QoS at the switch level where bandwidth was getting choked to death for Ethernet traffic).

Re: Snapmirror running slow - 2MB/s over 10GbE

Also - did I read that correctly? Three switches?

So what's the data path? Node A --> Switch 1 --> Switch 2 --> Switch 3 --> Node B?

For SnapMirror, are you using dedicated Intercluster ports or Intercluster LIFs on Data ports? Using interface group with VLAN tagging?

You're going from Node A to Node B in the same cluster, correct? So that would be INTRAcluster vs. INTERcluster?

Re: Snapmirror running slow - 2MB/s over 10GbE

You say it now works like a charm - may I ask what snapmirror throughput do you get over 10GE?

I'm having problems getting snapmirror dp or xdp to be fast in 10gigE environments. Would like to hear what speeds others are seeing.

And maybe you have another idea regarding my post on same topic: https://communities.netapp.com/thread/33682 ...? Thanks.
