I've been working on a PowerShell script that performs several quiesce/break/resync operations against multiple volumes on a set of filers. I've found that issuing the corresponding PowerShell cmdlet for each of these operations takes a while to return, so if I process the volumes serially it can take several minutes to get through them all. I know I can issue the commands as background jobs in PowerShell, but I'm not sure whether that's a good idea. Running these commands natively at the command prompt also takes a while to return, which makes me think these operations may be best done serially. To be clear, I am issuing each command across a set of volumes...so...for example, "quiesce volumes 1-7". I'm not trying to issue all three commands at once against a single volume. That being said, are there any cautions or recommendations for issuing these commands on several volumes at once using background jobs?
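For context, the pattern I'm considering looks something like the sketch below. It assumes the 7-Mode DataONTAP PowerShell Toolkit cmdlets; the filer and volume names are placeholders, and each background job has to re-import the module and reconnect because `Start-Job` runs in a separate process:

```powershell
# Sketch only: assumes the DataONTAP PowerShell Toolkit (7-Mode "Na" cmdlets)
# and that Connect-NaController has already been run in the parent session.
$volumes = 'vol1','vol2','vol3'    # placeholder volume list

# Serial version (what I do today) -- each call blocks for several seconds:
foreach ($vol in $volumes) {
    Invoke-NaSnapmirrorQuiesce -Destination "filer2:$vol"
}

# Background-job version I'm considering:
$jobs = foreach ($vol in $volumes) {
    Start-Job -ArgumentList $vol -ScriptBlock {
        param($v)
        # Each job is a fresh process: re-import the module and reconnect
        # (in the real script the credentials would be passed in as well).
        Import-Module DataONTAP
        Connect-NaController filer2 | Out-Null
        Invoke-NaSnapmirrorQuiesce -Destination "filer2:$v"
    }
}
Wait-Job $jobs | Receive-Job
```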
This is for a script that can be used multiple times. We have a primary NetApp (snapmirror source) and a secondary NetApp for disaster recovery (snapmirror destination). The script automatically flips the relationship (source/destination) between filers. So if, for example, you needed to physically move the primary filers from one rack to another, one could run this script, which quiesces data to the source filers and then flips the snapmirror relationship so that the old destination is now the new source filer (it also makes updates so that clients go to the correct filers). Move the primary filers to their new location, bring up the original source filers, and let the snapmirrors transfer any changed data to them. Then run the script again to flip the relationship back to the primary filers. (I hope that makes sense.)
That being said, I've found that one of the slowest portions of the script happens when I perform one of these actions on the set of volumes. So I say, "for each snapmirror in the list of snapmirrors, quiesce the snapmirror". Each issuance of the quiesce command takes several "beats" to complete. Same for break and resync.
The core of my question really is whether it is wise/acceptable to issue multiple quiesce (or break or resync) commands at once (one for each of the volumes), or if this is a process that is best left for serial processing (one volume at a time).
Thank you for the additional information. That helps considerably.
I suspect there will be very little gain from running the commands in the background. Data ONTAP divides applications into pools, and each pool is single threaded. I believe that SnapMirror tasks all exist within the same application pool. In other words, if you issue 10 snapmirror quiesce commands in parallel (or in the background), the NetApp will still process the commands one at a time.
(At least, this is my understanding of the Data ONTAP internals. Hopefully someone from NetApp will correct me if I am wrong.)
I would also recommend that the snapmirror resync be accomplished in a separate script. In the event of an actual disaster, communication would likely be down between the NetApps and the resync commands would fail anyway.
For disaster simulations, I would also suggest that you add a "snap create" and a "snapmirror update" to the process to help avoid data loss during simulations. IIRC, snapmirror quiesce allows already-started update tasks to complete but does not issue a snapmirror update command. Hence, any data after the most recent common snapshot would be lost when the resync command reverses the source and destination volumes. (This has the disadvantage that it causes disaster simulations to occur differently from actual disasters.)
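To make that concrete, a rough sketch of the two steps I mean, using the 7-Mode toolkit cmdlets (the snapshot name, volume, and filer paths are placeholders, and the exact parameter names may differ slightly in your toolkit version):

```powershell
# Sketch: capture a fresh snapshot on the source, then push it to the
# destination before quiescing, so the resync has a recent common snapshot.

# On the source filer (snapshot name is a placeholder):
New-NaSnapshot vol1 pre-dr-flip

# On the destination filer, transfer everything up to that snapshot:
Invoke-NaSnapmirrorUpdate -Destination "filer2:vol1"
```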
If you can share information about what you have learned from your DR simulations, I would very much like to hear more on this topic.
The script prompts the user to find out if the source filers are available before starting the transition. So, the resync and snapshot cleanup code is only executed if the source filers are available. I thought about writing a second clean up script to be executed after a “source not available” flip, but I didn’t think it was worth the trouble, as that process can be done at leisure manually (since there is no client impact).
The script also has the snapmirror update option built in as well (assuming the source filer is avail). I didn’t do a snap create, as I wasn’t interested in creating a long term snapshot at that point, I just wanted to make sure any new data was transferred to the remote filers. The snapmirror update creates a new snapmirror snapshot to transfer, and that was good enough for me.
The overall process for the script is...
* Prompt the user for information
- Find out which filers to flip
- Ask if the source filers are avail
- Ask if they want to catch up any async snapmirrors
- Get credentials for the filers
* Do some sanity checks
- Import the DataONTAP module
- Make sure we can connect to each of the filers
- Do some DNS checks (that’s what we use to swing clients from one filer to another)
- Automatically determine which filers are source and destination (using snapmirror status)...ask the user to confirm
- Check status of CIFS (our client connection protocol)
* Present user w/ list of actions and confirm start
* Backup the snapmirror.conf file on all filers
* If source avail and user answered to sync asyncs, do some “catch up” on any async snaps (snapmirror update)
* Stop CIFS (cuts off all client access) and update DNS servers to point to new source (to allow time to propagate)
* Perform final snapmirror update of async snaps and allow time for semi-sync snaps to complete transition of any new data
* Quiesce and break snapmirrors
* Update snapmirror schedules to reflect the flip if source avail, otherwise just delete the snapmirror schedule (use the backup from earlier to recreate manually)
* If the source is avail...
- Resync the snapmirrors to new destination
- Release the old snapmirror relationships
- Start CIFS to allow client access (should have been enough time for DNS to propagate)
- Clean up Remnant snaps
* If the source is not avail...
- Remove old snapmirror schedules
- Start CIFS
* Print out a recap to the user of what has been done and any cleanup that needs to be done manually
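The core of the flip (quiesce, break, then resync in the reverse direction) boils down to something like the sketch below. Filer and volume names are placeholders, error handling is omitted, and the cmdlet names assume the 7-Mode DataONTAP toolkit:

```powershell
# One snapmirror relationship being flipped (names are placeholders):
$src = 'filerA:vol1'   # old source
$dst = 'filerB:vol1'   # old destination

# Run against the destination filer:
Invoke-NaSnapmirrorQuiesce -Destination $dst   # let in-flight transfers finish
Invoke-NaSnapmirrorBreak   -Destination $dst   # make the destination writable

# Reverse the relationship: the old source becomes the new destination.
# Run against the old source filer (only when the source is available):
Invoke-NaSnapmirrorResync -Destination $src -Source $dst
```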
When I started writing the script, I thought it would be a lot easier than it was. I kept finding new things that could go wrong and adding steps to make sure it went well (feature creep). I'd say the biggest thing I learned is that even though you test the script multiple times, unexpected things can still go wrong (like issuing a command and finding the volume is not ready for it, when it had worked fine the previous ten times). So I made sure that every command had error checking, even if I thought there was no way it could fail. I built a little snippet of code that I could reuse to prompt the user about the error and ask if they want to try again (recommended) or exit the script.
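The reusable error-prompt idea can be sketched like this (a hypothetical simplified helper, not the exact code from the script; the quiesce call is just an example of a wrapped command):

```powershell
# Prompt the user after a failure: retry the command or abort the script.
function Confirm-Retry {
    param([string]$Message)
    Write-Warning $Message
    while ($true) {
        $answer = Read-Host "Try again (recommended) or exit the script? [R/X]"
        switch ($answer.ToUpper()) {
            'R' { return }      # caller loops and retries
            'X' { exit 1 }      # abandon the script
        }
    }
}

# Usage: wrap each filer command in a retry loop.
do {
    $done = $true
    try {
        Invoke-NaSnapmirrorQuiesce -Destination "filer2:vol1"
    } catch {
        Confirm-Retry "Quiesce failed: $_"   # exits the script on 'X'
        $done = $false                       # user chose to retry
    }
} until ($done)
```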
Another little gotcha: if you're working w/ the virtual NetApps for testing, make sure you change the serial number of your virtual filers, otherwise the resync will never work.
Let me know if you have any specific questions. I’ll try to answer them as best I can.