2014-08-15 12:41 PM - last edited on 2015-01-16 09:11 AM by allison
I am seeing unhealthy SnapMirror times: transfer rates on the order of 2 MB/s over 10GbE.
The schedule is to run every 10 minutes.
The network is clean; ifstat has not recorded an error in months.
There are no duplex mismatches, and the negotiated speed is 10000.
The network is not maxed out; it peaks around 80% at times.
Jumbo frames are enabled throughout.
CPU load is low on a pair of FAS6220s.
Systems using the storage get good performance.
The throttle is unlimited.
What else should I check?
2014-08-16 12:16 PM
Are you using jumbo frames everywhere, or just on the SnapMirror interfaces? Make sure every switch between the source and destination filer is configured correctly for jumbo frames.
What size are the jumbo frames? If both the switches and ONTAP are configured to 9000 and you are using VLANs, then either set the switch MTU larger or the ONTAP MTU smaller, to leave room for the VLAN tag.
Check your routing table (route -s). If you have e0M on the same subnet as another interface, the default-gateway traffic sometimes goes over it (and that's 100 Mbit!). If so, you need to redesign your network config...
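To make the frame-size point concrete, here is the arithmetic as a quick sketch (plain Python, just the byte counts; note that vendors differ on whether a "9000 MTU" setting counts the L3 payload or the whole Ethernet frame, so check your switch's convention):

```python
# Jumbo-frame sizing arithmetic: why a 9000-byte ONTAP MTU needs
# a larger frame size on the switches once VLAN tagging is involved.
ONTAP_MTU = 9000   # L3 payload size configured on the filer
ETH_HEADER = 14    # dst MAC + src MAC + EtherType
FCS = 4            # Ethernet frame check sequence
DOT1Q_TAG = 4      # extra bytes added by 802.1Q VLAN tagging

untagged_frame = ONTAP_MTU + ETH_HEADER + FCS   # full frame, no VLAN tag
tagged_frame = untagged_frame + DOT1Q_TAG       # full frame with VLAN tag

print(untagged_frame, tagged_frame)  # -> 9018 9022
```

So a switch that counts whole frames needs at least 9022 (many platforms just use 9216) to carry VLAN-tagged 9000-byte payloads.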
2014-08-18 02:29 PM
Jumbo frames everywhere. VLAN'd correctly. ONTAP is configured to a 9000 MTU and the switches to something larger (9018, I think).
We did find an odd route and removed it. Switch statistics showed that it wasn't being used, though, and the situation has not improved (up to 7 days behind on the current snapshot).
2014-08-18 10:03 PM
Just to be sure, I would temporarily set the MTU to the standard 1500 (just on the source and destination filers).
We had a customer with jumbo frame problems on 10GbE, and it turned out to be an issue on the switches: they just couldn't handle jumbos correctly at 10GbE.
Another thing may be flowcontrol; we always disable it on 10GbE: ifconfig eXa flowcontrol none. I've seen at some customers that NetApp Professional Services also disables it on all 10GbE interfaces.
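A minimal sketch of that flowcontrol change (7-Mode syntax; e2a is an illustrative port name, and remember this briefly bounces the port):

```
# Disable send/receive flowcontrol on the 10GbE port (port name is illustrative)
ifconfig e2a flowcontrol none

# 7-Mode ifconfig changes don't survive a reboot;
# add the same line to /etc/rc to make it persistent.
```

Run `ifconfig e2a` afterwards to confirm the setting took effect.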
2014-08-19 04:40 AM
I echo the above... I've also seen instances where a SnapMirror appears "stuck." When everything checks out and the configurations are correct, I've had to abort the SnapMirror and restart it, and then everything worked fine.
2014-08-19 09:29 PM
Question - can you post the output of the following:
- Traceroute from source to destination IC LIF
- routing-group show
- routing-group route show
I'm just curious
So I've been battling this same issue for a client for a large part of the day today.
FAS8040 at source, FAS3220 at destination. Both sites using 10 Gb interface groups for data with respective VLAN tagging. Using the same data interface group vs. a dedicated port (client doesn't have the proper SFPs for the other UTA2 ports to dedicate - maybe in the future). The client's network admin was only seeing about 1 Gb utilization over the 10 Gb link between sites. So I was like..yeah...great...I've seen this one before.
I ran into two issues:
1) Routing group configuration
2) Jumbo Frames configuration
When the node intercluster LIFs got created, it created the respective routing groups (but not routes). The VLAN isn't spanned, so I needed to create the proper routes. Did that. Can ping fine. Yay! Still had issues. So I ran a traceroute from the IC LIF on Node A at the source site to the IC LIF on Node B at the destination site. Even with the routing group (and route) in place, the first hop at each site was to the node / cluster management VLAN on the respective node (which was interesting to me, since management is on (1) a different physical port and (2) a different subnet).
The route for inter-cluster was using a destination of 0.0.0.0/0 (but with the proper gateway - the SVI of the SnapMirror VLAN). Decided to delete the route and create a new one. So, I went old skool 7-Mode static route on it and set the proper destination (the subnet of the destination site) and the same gateway (the local SnapMirror VLAN SVI). Now the trace comes back clean and the first hop is the SnapMirror VLAN SVI. w00t!
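For anyone hitting the same thing, the fix looked roughly like this (clustered ONTAP 8.x routing-groups syntax from memory, and it varies a bit between releases; the vserver name, routing-group name, and subnets below are made up for illustration):

```
::> network routing-groups route show -vserver clus1

::> network routing-groups route delete -vserver clus1
      -routing-group i192.168.50.0/24 -destination 0.0.0.0/0

::> network routing-groups route create -vserver clus1
      -routing-group i192.168.50.0/24
      -destination 192.168.60.0/24 -gateway 192.168.50.1
```

The key change is the -destination: the remote site's intercluster subnet instead of 0.0.0.0/0, with the same local SnapMirror VLAN SVI as the gateway.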
Still won't work.
Turns out that there WASN'T jumbo frames all the way from source to destination. The topology has Nexus 5Ks at each site, then two Nexus 7Ks in between the sites over the 10 Gb links. I believe the network admin forgot to set the per-VLAN MTU size for the SnapMirror VLAN on the 7Ks to 9000. I'll get him to fix that tomorrow.
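If it helps, the jumbo MTU on the 7Ks is set on the interfaces carrying that VLAN, roughly like this (NX-OS sketch from memory; the VLAN number and port are illustrative, and the exact method differs by platform and release, so verify against your gear first):

```
! Nexus 7K sketch -- verify against your NX-OS release
interface Vlan60            ! the SnapMirror VLAN SVI (number is illustrative)
  mtu 9216
interface Ethernet1/1       ! the inter-site trunk ports also need jumbo MTU
  mtu 9216
```

Every hop in the path has to carry the jumbo frame, not just the endpoints.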
So, as a workaround, I went back and set the MTU to 1500 on the VLAN interfaces on each node. Aborted / restarted the SnapMirror relationships (they seemed hung). Now it works like a charm.
Hope to have some metrics to compare against in the AM to validate.
Hope this helps. Let me know!
2014-08-21 06:09 AM
64 bytes from 172.16.20.33: icmp_seq=1 ttl=64 time=0.196 ms
In getting this number I did discover that the NetApp ping has packet-size and disallow-fragmentation options.
na-adm::> net ping -node na-adm-01 -destination 172.16.34.9 -disallow-fragmentation true -packet-size 2000 -record-route -show-detail -verbose
ping: sendto: Message too long
Double-checked with our Network Team and jumbo frames are set. Double-checked the NetApp and jumbo frames are set on the physical ports, the ifgrp, and the vlan.
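One thing worth noting: "sendto: Message too long" usually means the local stack refused the packet because the egress interface's MTU was too small, so it may be worth confirming which LIF the ping actually leaves from. Since the ONTAP ping takes -packet-size and -disallow-fragmentation, you can also bisect the largest size that still gets through to find where the path MTU drops. A small sketch of the search logic (plain Python; probe() is a hypothetical stand-in for a wrapper that runs the ping and returns True on success):

```python
def find_path_mtu(probe, lo=64, hi=9000):
    """Binary-search the largest packet size for which probe(size) succeeds.

    probe(size) -> bool, e.g. a wrapper around
    `net ping -destination <ip> -disallow-fragmentation true -packet-size <size>`.
    Returns None if even the smallest probe fails.
    """
    if not probe(lo):
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2   # round up so the search always makes progress
        if probe(mid):
            lo = mid               # mid fits: the answer is at least mid
        else:
            hi = mid - 1           # mid is too big: the answer is below mid
    return lo

# Example with a simulated path that passes payloads up to 1472 bytes
# (a 1500 MTU minus 20 bytes IP header and 8 bytes ICMP header):
print(find_path_mtu(lambda size: size <= 1472))  # -> 1472
```

If the result comes back around 1472 instead of the jumbo range, some hop (or the egress interface itself) is still at a 1500 MTU.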
2014-08-21 07:06 AM
The MTU change is real-time, without any disruption.
Disabling flowcontrol will do a port down/up, with a minimal disruption of a few seconds if you don't have an ifgrp with two ports. You could do a LIF migrate to the other node and back again.
We had the issue with Brocade ICX switches.