ONTAP Discussions

Snapmirror replication suddenly stopped working

PIPERONTAP
9,817 Views

We have snapmirror replication set up on a metro-E network. We also have a point-to-point network which is being optimized by two Riverbed Steelheads. Here is what I don't get: Snapmirror replication broke the minute we enabled the Steelhead appliances. I ran packet traces and confirmed that the filers are talking to each other over the metro-E, and that no snapmirror or control traffic is passing over the point-to-point (only CIFS).

I opened a case with NetApp support. The only suggestion I got was to change my MTU on the filers. The MTU is currently set to 9000 on the iSCSI VIF and that's what we're using for snapmirror replication. Again, this worked perfectly fine until the Steelheads were enabled on a different network.
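
One way to test the MTU theory without changing anything on the filers is to ping across the replication path with fragmentation disallowed and a payload that exactly fills a 9000-byte frame. A minimal sketch of the arithmetic, assuming plain IPv4 with a 20-byte header (no IP options) and an 8-byte ICMP header; the function name is made up for illustration:

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 with no options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an IPv4 ICMP echo")
    return payload


print(max_ping_payload(9000))  # 8972
print(max_ping_payload(1500))  # 1472
```

If a don't-fragment ping with the larger payload fails between the two replication interfaces while the smaller one succeeds, something on the path is probably not passing jumbo frames.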

Here are the errors I'm seeing:

On my destination filer:

recover.abort.ROOLR:notice]: The abort event, snapmirror: Cannot Init NTM, aborting , is just notified.

On my source filer:

replication.src.err:error]: SnapMirror: source transfer from x to x : transfer failed.

Has anyone else seen anything like this?

17 REPLIES

AGUMADAVALLI
9,754 Views

This is not a NetApp SnapMirror issue at all; it comes down to your Steelhead appliance, which you need to tune to work with your network. Use telnet to make sure it can talk on all of the ports SnapMirror needs.

thank you,

AK G

PIPERONTAP
9,755 Views

AK G, the snapmirror traffic isn't traversing the Steelheads. It's on a separate network.

scottgelb
9,755 Views

Does traceroute from the target to the source, and vice versa, go over the expected network? Routing in 7-Mode can be interesting, and traffic can take unexpected routes unless you add a route (route add net) for the dedicated SnapMirror network. You have probably already checked, but it is worth confirming that the traffic is going over the expected network; if it is not, a route add net should fix it.

PIPERONTAP
9,755 Views

Yes, the traffic is flowing just as expected.

scottgelb
9,755 Views

Is options snapmirror.allow set to a hostname that resolves to a different IP address? Does changing the setting to "*" or "all" let it work? That is just a test; afterward you can put in the IP of the source controller.

PIPERONTAP
9,755 Views

I have snapmirror.allow set to *. The filers are talking to each other on the correct network and IPs, but the initial transfer fails with a generic network error. Not surprisingly, NetApp support has been of no help. All the tech wants to do is close the ticket. That's why I'm posing the question here.

christin
9,755 Views

Hi Ben,

I'm sorry to hear that you are not able to resolve your issue with NetApp Support. Please send me a private message with the case number and I will look into it.

Thanks,

Christine

PIPERONTAP
9,754 Views

The NetApp tech has gone silent. He intimated that snapmirror is sensitive to any network delays or problems. I know for a fact that this is false. I have run snapmirror on saturated, high-latency networks with no problem. Either the tech I got assigned to doesn't know the product, or Snapmirror is just not ready for prime-time. At this point, I'm starting to suspect both.

scottgelb
9,754 Views

Sorry to hear you didn't get a response. It may be worth calling back, asking for the Duty Manager, and escalating. Most cases get handled just fine, but as NetApp grew to 6 billion there can be some growing pains. SnapMirror has been in use at many of our customers for many years (I have been doing this for about 12 years), and I can vouch for its production quality. Hopefully support can escalate and quickly solve this issue. The NetApp and/or VAR team may be able to help escalate as well, and start packet traces that escalations can review to see the issue.

PIPERONTAP
7,657 Views

Then there's no excuse for a tech not to have enough knowledge of Snapmirror to assist in troubleshooting. We pay for support to get support, not to waste time hounding NetApp just to get to a person who actually knows enough to help. The tech's name is David Cook. The takeaway here is that you get better support from NetApp forums than you do from NetApp support. My suggestion is to save your money and pay a third-party for support.

allison
7,657 Views

Ben, it was another user, not a NetApp employee, who marked your response for moderation. I have sent you a message explaining why the comment was flagged. Please let me know if you have any other questions or concerns.

We do appreciate and listen to our customers' feedback.

greghaa69
7,657 Views

I know this is old, but I am in a similar boat. We have filers connected over two 100 Mbit WAN connections. I set up all of our SnapMirror relationships from a number of prod filers in California to a set of DR filers in Atlanta. The problem I am seeing is random. Some volumes will initialize and stay synced without any issue. Other volumes will start to transfer, get some of the way through, and then just hang. They sit in the "transferring" state forever until I kill the transfer. NetApp support has told me it is a bandwidth issue, but I successfully transferred a 40 GB volume in about 2 hours. Yet other volumes on the same filer will not transfer. I have been through 3 or 4 techs now and they cannot solve it.
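
For reference, the transfer that did complete can be turned into an average rate and compared with the link speed. A rough sketch, assuming decimal units (1 GB = 10^9 bytes) and ignoring protocol overhead; the function name is illustrative only:

```python
def average_mbit_per_s(gigabytes: float, hours: float) -> float:
    """Average transfer rate in Mbit/s for a given volume size and duration."""
    bits = gigabytes * 1e9 * 8   # assumption: decimal gigabytes
    seconds = hours * 3600
    return bits / seconds / 1e6


rate = average_mbit_per_s(40, 2)
print(f"{rate:.1f} Mbit/s")  # 44.4 Mbit/s
```

That is a little under half of one 100 Mbit link, which suggests the WAN can sustain replication traffic and that raw bandwidth alone may not explain the volumes that hang.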

Anyone here have any ideas?

AGUMADAVALLI
7,657 Views

Hi there,

You need to apply the network throttle to replication and SnapMirror; it will fix everything for you:

options replication

snapmirror.throttle

set the snapmirror windowsize

You should rock with the above settings.
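
A common starting point for the window size is the bandwidth-delay product of the link. A minimal sketch, assuming a 100 Mbit/s link and a 60 ms round-trip time; the RTT is a placeholder, not a value measured in this thread, so substitute your own ping result:

```python
def window_size_bytes(link_mbit_per_s: float, rtt_ms: float) -> int:
    """Bandwidth-delay product in bytes: link rate times round-trip time."""
    bits_per_second = link_mbit_per_s * 1_000_000
    rtt_seconds = rtt_ms / 1000
    return round(bits_per_second * rtt_seconds / 8)


# Assumed values: 100 Mbit/s link, 60 ms RTT (placeholder)
print(window_size_bytes(100, 60))  # 750000
```

A window much smaller than this can cap throughput on a long-distance link, but it would not by itself explain a transfer that stops completely.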

Thank you

AK G

greghaa69
7,657 Views

Yeah, I have done all those things. It doesn't help. It works on the volumes that already work, but not on the ones that won't transfer.

Thanks

AGUMADAVALLI
7,657 Views

Quiesce the existing SnapMirror destinations.

Turn SnapMirror off and turn it back on after a few minutes, on both the source and the destination.

Initialize the ones that are not transferring before resuming the SnapMirrors.

thank you,

AK G

davidsimeone
6,082 Views

Hey Greg, is Riverbed present in your case?

jeras
6,081 Views

Found this in an old thread here in the Communities section (did a search for "Steelhead"):

The issue was solved by adding a rule to the Steelhead to pass SnapMirror traffic from source {filer IP} to destination IP over TCP port 10566. This needs to be on the destination-site Steelhead. We mirror both ways, so we have the rule on both units. The rule will only be applied to NEW SnapMirror traffic. We run async, so we quiesced the 'stuck' SnapMirrors and then set them running again. All good.
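
Before and after adding such a rule, it can help to confirm that a TCP connection to the source on port 10566 can be opened at all. A minimal sketch, meant to be run from a host on the destination side of the replication network; the address in the comment is a placeholder from a documentation range, and a successful connect only shows the port is reachable from that host, not that a transfer from the filer will complete:

```python
import socket

SNAPMIRROR_PORT = 10566  # port named in the quoted fix above


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (placeholder address, replace with the source filer's IP):
# tcp_port_open("192.0.2.10", SNAPMIRROR_PORT)
```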

Hope this helps!

Ken
