Solved: DB2, Snapcreator and DRP

pierrek · ‎2012-01-20

Hi,

I have installed SC 3.5 at a customer yesterday to backup DB2 databases running on AIX (over FCP). The Customer has followed our best practices and created 2 separate volume groups per DB2 database, one VG for db2 datafiles and another VG for db2 active logs and archived logs. Disaster Recovery is an key element in the solution, so Volume SnapMirror has been used to replicate all db2 volumes to a remote datacenter:

(source db2 datafile volume on spso002) --> VSM --> (destination db2 datafile volume on spso003)

(source db2 log volume on spso002) --> VSM --> (destination db2 log volume on spso003)

SC 3.5 db2 module has been used to schedule consistent backups (by write suspend) of the databases (snapshots are created by SDU at the vg levels) , and trigger VSM updates after each hourly backup. Here is my config file as stored on the SC server (a SC agent is running on the AIX DB2 server):

CONFIG_TYPE=STANDARD

SNAME=hourly_db2_snap

SNAP_TIMESTAMP_ONLY=N

VOLUMES=spso002:vol_bemanp01_vgsap01,vol_bemanp01_vgsap01log

NTAP_SNAPSHOT_RETENTIONS=hourly:72

NTAP_USERS=spso003:root/53616c7465645f5f9d95c3815e32e476060ad0cbac5eb9b6548415fbf2b35b9b;spso002:root/53616c7465645f5fa95cfeb8b793cb263004841e394942c5b9e6d38bfc147cac

NTAP_PWD_PROTECTION=Y

TRANSPORT=HTTPS

PORT=443

LOG_NUM=30

LOG_TRACE_ENABLE=Y

SC_TMP_DIR=

NTAP_SNAPSHOT_RETENTION_AGE=3

SNAPDRIVE=Y

SNAPDRIVE_DISCOVERY=N

NTAP_SNAPSHOT_DISABLE=N

NTAP_SNAPSHOT_NODELETE=N

NTAP_CONSISTENCY_GROUP_SNAPSHOT=Y

NTAP_CONSISTENCY_GROUP_TIMEOUT=medium

NTAP_CONSISTENCY_GROUP_WAFL_SYNC=N

NTAP_SNAPSHOT_DELETE_BY_AGE_ONLY=N

NTAP_SNAPSHOT_DEPENDENCY_IGNORE=N

NTAP_SNAPSHOT_RESTORE_AUTO_DETECT=N

NTAP_SNAPMIRROR_UPDATE=Y

NTAP_SNAPMIRROR_CASCADING_UPDATE=N

SNAPMIRROR_VOLUMES=spso002:vol_bemanp01_vgsap01,vol_bemanp01_vgsap01log

SNAPMIRROR_CASCADING_VOLUMES=

NTAP_SNAPMIRROR_WAIT=60

NTAP_SNAPMIRROR_USE_SNAPSHOT=Y

NTAP_SNAPMIRROR_MAX_TRANSFER=

SC_AGENT=10.45.7.21:9090

SC_AGENT_TIMEOUT=900

SC_AGENT_UNQUIESCE_TIMEOUT=900

SC_CLONE_TARGET=

SC_AGENT_WATCHDOG_ENABLE=Y

SC_AGENT_LOG_ENABLE=Y

APP_NAME=db2

DB2_DATABASES=XD1:db2xd1

DB2_CMD=db2

NTAP_SNAPSHOT_CREATE_CMD01=/usr/bin/snapdrive snap create -vg vgsap01 vgsap01log -snapname %SNAME-%SNAP_TYPE_recent

######################################################################################################################

I have 2 questions about the above setup.

1) First, may I use the generic archive log management functionality built into SC to clean up the db2 archived logs older than my backup retention period, and may I receive an example of a db2 config file where this functionality is used? Is it tested with DB2, supported, ...?

##########################################################################################

# ARCHIVE_LOG_ENABLE - (Y|N) Setting which Enables Archive Log Management #

# (deletion of old archive logs) #

# #

# ARCHIVE_LOG_RETENTION - Retention in Days for how long archive logs should be kept #

# #

# ARCHIVE_LOG_DIR - Path to where the archive logs are stored #

# #

# ARCHIVE_LOG_EXT - File Extension for the archive logs, must be <something>.<extension> #

# ie: 109209011.log in which case you would enter "log" #

# #

# ARCHIVE_LOG_RECURSIVE_SEARCH - (Y|N) Setting which allows for managing archive logs #

# within subdirs. If archive logs are located under #

# subdirs this option should be used. #

##########################################################################################

2) My second question is about the db2 active and archived log snapmirror replication to the remote site. Say that I schedule one snapshot of the data and log volume groups every hour using SC.

This means that SC will trigger a VSM update to the remote site and replicate the consistent snapshots of my datafiles and logs to the remote site every hour. Now if I want to reduce my RPO in case of disaster to let's say 10 minutes, is it valid to run a crontab script every 10 minutes (from the DB2 AIX host) that will issue a couple of sync (to flush filesystem buffers to disks) then remotely (ssh to the snapmirror secondary filer) execute a snapmirror update of the db2 log volume?

In case of a DB2 disaster, I would basically do the following operations on the remote site:

1) snapmirror break + vol snap restore of the db2 datafile volume to the latest (1 hour old max) SC created (database consistent) snapshot

2) snapmirror break of the db2 log volume (10 minutes old max)

3) import vgs,...

4) start the db2 database at the remote site in a roll-forward pending state: "db2inidb [database] as mirror"

5) reapply the archived (and active?) logs and perform roll-forward recovery: db2 “rollforward database [database] to end of logs and complete”

Is this scenario valid? If I refer to Oracle, the online redo logs should not be part of a DR scenario (should not be replicated/reapplied), and the DRP oracle database should be opened with the "resetlogs" option, which will create new online redo logs. What about DB2 here? Will the db2 rollforward command try to apply transactions from replicated active logs as well? Is it recommended? Should the active logs be deleted before rollforward is started?

Thanks a lot in advance for any advice or comment.

Pierre

oommen · ‎2012-01-20

Yep, I see that as a problem - as "write suspend" is a dead stop of I/O. But SC with CG (Consistency Group (does the fencing of volumes during snap operation), type medium) will ensure snapshot gets done within seven seconds or timeout and put the database into a normal state. At the end it depends upon how busy the database is. In the past we have asked IBM to do the snaps (with CG) without "write suspend" for the LOGS if the LOGS are stored on just one controller, IBM doesn't recommend it.

You might run into a Snap limit for LOGS volume, which will force to go for a lower retention just for the LOGS.

Regards,

Bobby

View solution in original post

ktenzer · ‎2012-01-20

Hi Pierre,

I will try and answer your questions. I have also sent this to our DB2 expert at NetApp Bobby Oommen so he might chip in as well to elaborate on anything I mention.

1) Archive log management in SC is generic, so it works for any plug-in. It is supported for all plug-ins supported by SC. SC will simply delete files based on a retention in a directory. If the scAgent is used it would delete logs on scAgent, if no scAgent then on scServer.

The recommended best practice for archive logs is to combine it with SnapVault, SnapMirror, or Protection Manager. SC will only delete archive logs on primary and will do so at the end after SnapVault, SnapMirror (only if you set certain option), or PM. Idea is you first want to take snapshot, then vault snapshot ie: archive logs, then delete on primary based on retention. Deletion on secondary is handled through snapshot management.

You can also simply delete archive logs without SV, SM, or PM just means either you have archive log retention only on primary or you are using something else to provide longer retention of logs.

An example archive log mgmt options are:

Assuming I want to retain 5 days of archive logs on primary and assuming the logs end with *.log

ARCHIVE_LOG_ENABLE=Y

ARCHIVE_LOG_RETENTION=5

ARCHIVE_LOG_DIR=/path/to/archiveLogDir

ARCHIVE_LOG_EXT=log

ARCHIVE_LOG_RECURSIVE_SEARCH=N

If archive logs are nested so under sub directories you would want to set ARCHIVE_LOG_RECURSIVE_SEARCH=Y, this is however uncommon so it should be set to ARCHIVE_LOG_RECURSIVE_SEARCH=N.

2) Ok so you are using SnapMirror. I would actually use SC to do this, you can actually force SC to wait for snapmirror update to complete before deleting archive logs, this I think is the key. Here is option to set:

NTAP_CLONE_SECONDARY=Y - This option is a bit misleading it was developed to clone a snapmirror destination volume. However just enabling this means SC will wait for SM update to complete before proceeding, like it does for SV and PM.

You can also have two separate configs for SC one for logs and one for data if you want to do logs more frequently or you can have them in same config or you could have mix, 1 config with logs and data and 1 with just logs. There are endless possibilities.

As for DB2 questions about recovery I will let Bobby answer those but I believe what you described above is valid.

Regards,

Keith

pierrek · ‎2012-01-20

Hi Keith,

Thanks for your answer, your description of the NTAP_CLONE_SECONDARY parameter is very instructive.

About having a mix of SC configs as you suggest (one config for data + log, another config for logs only that is run more frequently), I see the following problems:

a) Say that I want a backup retention of 3 days, and I run my data+log config every hour, I have a total of 72 snapshots on my data and log volumes. To achieve a RPO of 10 minutes, the plan is to replicate the log volume to the remote site every 10 minutes. If a separate "log only" SC config is used for this, I would hit the 255 snapshot per volume limit for the log volume for a 3 day backup retention. This is why I proposed to use VSM outside of SC to achieve the 10 minute RPO.

b) If I have a separate config for the db2 logs only, should it be of type db2? If so, my concern is that I want to avoid a "write suspend" on the db2 database every 10 minutes. A write suspend is not transparent to users (unlike the backup mode in Oracle). May I use a non db2 config for the db2 log volume?

Thanks for your clarifications on this.

Pierre

ktenzer · ‎2012-01-20

Hi Pierre,

a) I understand, this is where a snapmirror and snapvault construct could be interesting. SnapMirror gives you the RPO of 10 minutes but SnapVault gives you backup granularity. You could have two configs 1) snapmirror 2) snapvault to accomplish different things. Just thinking outside of the box. The goal being having fewer snapshots on primary and more on snapvault/secondary but while maintaining RPO snapmirror provides.

b) If you have separate config for db2 and just care about backing up logs it shouldnt be type db2, that adds no value, APP_NAME should be commented out or blank and the DB2 parameters dont need to be set. Log files do not need quiescing, they are already in a consistent state and can be backed up as such.

The key is you need log files to roll forward from snapshot of data files, so goal is to ensure you have them, and thus backup them up frequently while also eliminating need to have long retentions on primary storage which takes up space and that costs $$.

Regards,

Keith

ktenzer · ‎2012-01-20

Additionally you can also delete archive logs independently of backup, --action arch. This could be useful if you build a wrapper script to drive the workflow in a unique way.

Regards,

Keith

oommen · ‎2012-01-20

Thanks Keith.

Question 1 : For archival you can use SC and configure as Keith explained above. If you would like to use TSM, you can do that as well. I have several customers using TSM where the USEREXIT will take care of the LOG cleanup.

Question 2 : If you are looking for a shorter RPO, yes, you can SM just the db2log volume. But you need to use SC to make sure database is put into a consistent state and take the snapshot of just the LOG volume. You can do this using a second configuration file.

In the event of a DR, you would break the SM relationship (Data and Logs) as you mentioned, and do the snap restore of just the data volume using the latest SC snapshot. Once the VG's are active, let DB2 do the ROLLFORWARD (RF) operation to the desired point-in-time (PIT). RF will read the LOGS from the online LOG path by default. But there could be scenario where you might have to use the OVERFLOW path parameter of RF command to access archive LOGS. I wouldn't delete any LOGS from the active LOG path.

Note : SDU is not a requirement in this case as there is no File system Buffering involved here. At the DB2 tablespace level its been turned off. After a restore you will use the varyoff, varyon VG commands to refresh the FS.

Please let me know if you have any questions.

Regards,

Bobby

pierrek · ‎2012-01-20

Bobby,

Thanks a lot for the very clear answer. But can you please have a short comment about this question:

If I have a separate config for the db2 logs only, should it be of type db2? My understanding is that a write suspend is not transparent to users (unlike the backup mode in Oracle). Is a write suspend acceptable on a SAP production environment every 10 minutes?

Kind regards

oommen · ‎2012-01-20

Yep, I see that as a problem - as "write suspend" is a dead stop of I/O. But SC with CG (Consistency Group (does the fencing of volumes during snap operation), type medium) will ensure snapshot gets done within seven seconds or timeout and put the database into a normal state. At the end it depends upon how busy the database is. In the past we have asked IBM to do the snaps (with CG) without "write suspend" for the LOGS if the LOGS are stored on just one controller, IBM doesn't recommend it.

You might run into a Snap limit for LOGS volume, which will force to go for a lower retention just for the LOGS.

Regards,

Bobby

pierrek · ‎2012-01-20

Bobby,

If I use SDU anyway (with NTAP_SNAPSHOT_CREATE_CMD01=/usr/bin/snapdrive snap create -vg vgsap01log -snapname %SNAME-%SNAP_TYPE_recent in my config file) to trigger the snapshot of the log VG, am I right to assume that a CG snapshot will be created as well? (In this case all the luns in my log VG are on the same controller). And my db2 logs should be consistent without the "write suspend" constraint. You know my background is mostly Oracle, that's why it's still unclear to me why we should quiesce the DB while we take a snapshot of the transaction logs.

Kind regards

Pierre

ktenzer · ‎2012-01-20

Hi Pierre,

SDU to my knowledge will not do a CG snapshot or at least not allow you to configure it. The only case where it does a CG snapshot is SMO w/ASM on Unix only.

CG snapshot in this case does not buy you anything from consistency perspective, DB is in write suspend and that is all you need, no advantage of CG vs serial snapshots on per volume basis.

However what CG does provide is a timeout mechanism. If the snapshot for whatever reason takes a certain amount of time which you can configure (urgent=2 sec, medium=8 sec, relaxed=20 sec), the backup will fail and SC will return database to normal state, to a write resume in this case. This could be something useful in this case, not sure. With that said SDU may be providing other value here (file system consistency imho still is nice to have) so you need to weigh the pros and cons, ultimately it is your decision to make.

Hope this helps

Keith

pierrek · ‎2012-01-20

Keith,

Thanks a lot for the time and enthusiasm you spend to help this community's members. It's great to post questions here

Kind regards

Pierre

ktenzer · ‎2012-01-20

Thank you for taking part in it, this is a two way process and that is what makes it all work

Keith