ONTAP Discussions
I'm trying to run snapmirrors from a customer's filer to a vfiler at one location, and then snapmirror from the secondary's base filer to another filer out of state. This works some of the time, but often the primary deletes the snapshot that the tertiary depends on. That breaks the primary to secondary mirror: the secondary can't delete that snapshot because of the lock the tertiary puts on it.
step 1:
filer1->snap.2
filer2->snap.2, snap.1
filer3->snap.2, snap.1
step 2:
filer1->snap.3
filer2->snap.3, snap.2 (can't delete this) goes into pending state
filer3->snap.2, snap.1 goes into pending state
I'm mirroring from the customer to a vfiler for obvious security reasons. That vfiler only has net access to the customer's site. It doesn't have an interface that can reach the tertiary location. We also want to handle all data center to data center replication traffic separately. Is it necessary for the primary filer to be 'aware' of the tertiary copy in order to set a lock on the snapshot? Is there a way to force the primary to keep more than the latest snapmirror snapshot?
The mirrors happen every 2 hours and I think all works fine as long as the transfers complete within that time frame. Should I stagger the tertiary start time to allow the secondary to finish or just let the tertiary retry until the secondary has completed its update?
Here's what I've deduced and tested:
The tertiary volume needs a snapshot that the secondary has but the primary does not. There doesn't appear to be a way to keep snapmirror from deleting previous snapshots on the primary. So the trick is to get to them first.
After initialization of all mirrors, the snapshots look like this:
primary: v.2
secondary: v.2, v.1
tertiary: v.2, v.1
Make sure that the tertiary updates before the secondary. How to do this? I don't have a foolproof method, but right now I'm scheduling the tertiary mirror just before the secondary so that it locks the secondary snapshot. This puts the secondary mirror into Pending state, but as soon as the tertiary is done, the secondary retries and then gets updated from the primary. The snapshot the tertiary depended on now moves to the #2 slot on the secondary. Next cycle, the tertiary can hit that snapshot to update itself to where the secondary is before the primary gets rid of the next snapshot.
So once things get rolling:
tertiary pulls an update and either matches the secondary or finds no changes. tertiary now dependent on v.2
secondary updates and can toss v.1 because tertiary has v.2
secondary now has v.3 and v.2
tertiary updates and pulls v.3 and tosses v.1
etc.
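The ordering above can be sketched as a toy model. This is only an illustration of the behaviour described in this thread, not of how ONTAP implements it. The function name `secondary_update` is made up, and the model assumes the secondary keeps just its two newest snapshots while the tertiary locks the newest snapshot it shares with the secondary.

```python
# Toy model of the snapshot lock described in this thread.
# Assumption: the secondary keeps only its two newest snapshots, and the
# tertiary locks the newest snapshot it has in common with the secondary.

def secondary_update(primary_latest, secondary, tertiary_base):
    """Return the secondary's snapshot list after an update,
    or None if the update has to wait (Pending)."""
    merged = sorted(set(secondary) | {primary_latest})
    kept = merged[-2:]                 # newest two snapshots survive
    dropped = set(merged) - set(kept)  # everything older would be deleted
    if tertiary_base in dropped:
        return None                    # locked snapshot: cannot delete it
    return kept

print(secondary_update(3, [1, 2], tertiary_base=2))  # [2, 3]
print(secondary_update(4, [2, 3], tertiary_base=2))  # None
print(secondary_update(4, [2, 3], tertiary_base=3))  # [3, 4]
```

The second call is the stuck case: the tertiary is still on v.2, so the secondary cannot move on to v.4 until the tertiary has pulled v.3.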
If something interrupts the tertiary transfer, things can still get hairy. I had been breaking the tertiary mirror to get the secondary caught up, but that can lead to a re-initialization. I found that breaking the secondary was better.
primary v.3
secondary v.3 and v.2
something prevents the tertiary from updating for a cycle or more
tertiary v.2 and v.1
secondary can't update because the tertiary needs v.2
break the secondary and update tertiary
This looks all wrong because the secondary is now writable and it gets its own snapshot for this mirror
secondary v.1 (new one), v.3 and v.2
tertiary ends up with v.1 (new), v.3 (this is the one we need) and v.2
resync secondary mirror. This gives secondary v.3 and v.2 and tosses v.1(new)
break and resync tertiary. This gives tertiary v.3 and v.2 and we are back in sync.
Does this make any sense at all?
Cascaded snapmirror uses a soft lock and works without timing this. But this looks like something that breaks when not using vfiler0, which you can't use in your case.
Do you have a case open? It sounds like a bug or feature not extended when cascaded through a non-default vfiler.
I have opened a ticket about this. I did find a scheme that works, at least for now.
Primary -> Secondary every 2 hours
Secondary -> Tertiary every hour
This allows the Tertiary to sync up with the Secondary before that one updates from the Primary again. When they collide every other hour one and/or the other goes into Pending but catches up within an hour. I haven't tested network outages of any link lasting over an hour.
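To see which hours collide under this scheme, here is a small sketch. It assumes both schedules start on the hour and that the 2-hour schedule runs on even hours; the post doesn't give the actual start times, and the helper `start_hours` is made up for illustration.

```python
# Which hours do the two schedules start together?
# Assumption: both run at minute 0, and the 2-hour schedule uses even hours.

def start_hours(interval_hours, day_hours=24):
    """Hours of the day at which a mirror with this interval starts."""
    return list(range(0, day_hours, interval_hours))

secondary_hours = start_hours(2)   # Primary -> Secondary
tertiary_hours = start_hours(1)    # Secondary -> Tertiary

collisions = sorted(set(secondary_hours) & set(tertiary_hours))
clear = sorted(set(tertiary_hours) - set(secondary_hours))

print("both start:", collisions)    # even hours
print("tertiary only:", clear)      # odd hours
```

Under those assumptions, the tertiary gets an uncontested run on every odd hour, which is the window it uses to catch up with the secondary.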