
SnapMirror overrunning controller

BrettEdelen

I just ran into an odd issue that I thought I would share.

My environment: V3240 IOXM HA pair with both FC and 10GbE, serving VMware datastores plus Linux and PostgreSQL workloads.

In addition to native SAS (3Gb/s) shelves, I have virtualized two IBM DS4800s for workload isolation.

I set up a SnapMirror relationship between two aggregates. The two aggregates were owned by different controllers and were each on a different virtualized DS array.

My goal was to replicate the data over to the 2nd virtualized array, to begin the new isolated environment.
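
For anyone wanting to reproduce the setup, this is roughly how the relationship was defined; a minimal 7-mode sketch with made-up filer and volume names, not my exact config:

    # /etc/snapmirror.conf on the destination controller (hypothetical names)
    # source:volume        destination:volume    options  schedule (min hr dom dow)
    ctrlA:vol_ds4800_src   ctrlB:vol_ds4800_dst  -        - - - -

    # baseline transfer, run on the destination controller
    ctrlB> snapmirror initialize -S ctrlA:vol_ds4800_src ctrlB:vol_ds4800_dst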

When I fired up the snapmirror, everything went south.

Disk Util went to 100% and stayed there (sysstat), and all of the datastores showed huge latency spikes. In addition, the VMs were responding very slowly.

Confused, I aborted the SnapMirror.
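
For reference, the saturation was visible in sysstat, and the transfer was killed from the CLI; again a sketch with the same hypothetical names:

    # watch CPU and disk utilization at 1-second intervals
    ctrlB> sysstat -x 1

    # kill the running transfer (destination path is made up)
    ctrlB> snapmirror abort ctrlB:vol_ds4800_dst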

I then tried using SnapMirror compression, with exactly the same results.

I had thoughts about RAID group size, etc.

In the end, something in System Manager caught my eye: 'unlimited bandwidth'.

We are not using an isolated interface for SnapMirror (all Ethernet traffic goes over the 10GbE link).

I decided to limit the SnapMirror relationship to 1 Gb/s, and everything is working fine.

I plan on playing with the throttle a little to see how hard I can push it, but for now, 1 Gb/s is OK for two concurrent SnapMirrors.
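
If you prefer the CLI to System Manager, the same limit can be set with the kbs= option in /etc/snapmirror.conf; note that kbs is in kilobytes per second, so 1 Gb/s works out to roughly 125000. A sketch with the same made-up names:

    # /etc/snapmirror.conf on the destination, throttled to ~1 Gb/s
    ctrlA:vol_ds4800_src   ctrlB:vol_ds4800_dst  kbs=125000  - - - -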

My theory is that the array did its math against the theoretical bandwidth of the 10GbE link and pushed so much data that the controller got overrun.

I just wanted to post this in case someone else runs into it, so they have a quick answer.

6 REPLIES

bsti

That's really strange behavior, Brett.  What version of ONTAP? 

I don't have 10GbE in my environment, so perhaps that's a different ballgame, but if anything I've been frustrated with the exact opposite behavior.

In my case, I have two 6000-series controllers at my DR site that are SnapMirror targets but do nothing outside of that.  In one case, I wanted to move a large, 6+ TB volume from one controller to the other.  Long story short, because of the high CPU load on the source controller from frequent SnapMirror updates and lots of snapshots (deswizzling), I could not make SM push data any faster than 20-30 MBps on two GbE links!  It was being throttled because the filer was busy doing other things, and it didn't matter that I wanted the SM transfer to be top priority (there is no way for me to control that).  In essence, I found SnapMirror did not always perform the way I wanted because ONTAP throttles it so as not to overwhelm the controller.

Now I hear the story you just told... 

Thanks for posting that.  It's interesting to see that 10GbE either completely breaks ONTAP's SM throttling or at least doesn't throttle it very well, which to me is actually preferred.  You can now manually control the speed at which it replicates.

One last thing to ask:  Is this an async SM relationship?

BrettEdelen

I am on 8.0.3P2.

I didn't specify sync SnapMirror, but the controllers are literally separated by 5 m 10GbE cables; I can't remember if it does sync by default when they are that close.

It is odd behavior. I am slowly cranking up the throttle now to see where the breaking point is (I'm up to 4096 now).
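
For anyone following along: the limit on a transfer that's already running can also be changed from the CLI with snapmirror throttle (value in KB/s; the destination name below is made up). For example, roughly 4 Gb/s is about 500000 KB/s:

    ctrlB> snapmirror throttle 500000 ctrlB:vol_ds4800_dst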

BrettEdelen

Update: at 5 Gb/s, my disk util spent more time at 100% than not, so I backed it down to 4 Gb/s.

There, I see disk util going to 100% in short 2-5 second bursts every 20 seconds or so.

bsti

Ah, my apologies then. I'm referring to volume SnapMirror.  It's interesting to see your results though.  Thanks for posting them.

BrettEdelen

Nope, my error; I was doing a million things.

This is a volume SnapMirror between HA filer controllers.

I did some more investigation.

Although my throttle is set to 4 Gb/s, I am only seeing 20 Mbps-ish over the interface, yet the controller performance tanks when I open up the throttle.

BrettEdelen

Update to this.

I created two new SnapMirrors between controllers to virtualized aggregates, and could raise the throttle to 10 Gb/s on both without affecting anything.

So, I cannot recreate the original symptom.
