Data Backup and Recovery

SnapCreator Agent hangs when running command

pilkar
17,866 Views

Hi,

 

I have noticed "interesting" behaviour of SC Agent (tried on 4.1.1 and 4.1.2) when running pre/post exec scripts on agent side. I'm running Oracle DB 12R1 on RHEL/OEL 6.6, 6.7 guest on OVM 3.3.3 hosts and want to clone from the production guest/DB to the test one. Everything is working as expected, but...

Steps are:

SC Umount Action - everything OK:

1. Test Host via SC PRE_MOUNT_CMD on Clone Target side - Stop Oracle DB

2. Test Host via SC PRE_MOUNT_CMD on Clone Target side - Umount ASM Disk Groups for Data and FRA/archive logs

3. SC Server - destroy clone

 

SC Mount Action - here are troubles:

1. SC Server - clone create, present clone to "Clone Target" host

2. Test Host via SC POST_MOUNT_CMD on Clone Target side - Mount ASM Disk Groups for Data and FRA/archive logs

3. Test Host via SC POST_MOUNT_CMD on Clone Target side - Oracle DB recovery, resetlogs and open database

 

The Step 3 in Mount action is critical. Sometimes it's successfull, sometime fails. I have noticed the problem is caused by invoking commands thoung "su", for example:

su - oracle -c "${SC_OSAP_PATH}/oradb_recover.sh ${SC_OSAP_PATH}/oradb_resetlogs.sql $ORACLE_SID $ORACLE_HOME $ORACLE_BASE"

For the whole script please see attachment.

 

Even if the action on "clone target" host is finished successfully at the background, SC Agent reports an error after SC Agent Timeout is reached:

[2015-10-29 14:22:22,344] ERROR: [X.X.X.X:9090 (4.1.1.1)] Command [/opt/NetApp/OSAP/sc_oradb_recovery.sh recover ] failed with return code [-1] and stderr [] and stdout [Failed to execute command: /opt/NetApp/OSAP/sc_oradb_recovery.sh recover

 

And the most interesting thing is, that this action is sometimes successfull! Smiley Frustrated

 

I have found a scenario. when the same procedure works... If I run the "critical" script (in this case sc_oradb_recovery.sh) on target host (via ssh/tty) manually and then let's say in max. 5 minutes run these mount/umount actions from SC Server (actually sc_oradb_recovery.sh or any other script invoking "su" commands from SC Server via SC Agent), everything work perfectly!

[2015-10-30 10:05:50,805] INFO: [X.X.X.X:9090 (4.1.1.1)] Executing Post mount command [/opt/NetApp/OSAP/sc_oradb_recovery.sh mount wait 3] on 192.168.8.103
[2015-10-30 10:05:50,805] INFO: [X.X.X.X:9090 (4.1.1.1)] Executing Post mount command [/opt/NetApp/OSAP/sc_oradb_recovery.sh recover ] on 192.168.8.103
[2015-10-30 10:05:50,805] INFO: Post mount completed successfully

Then I try once again the same actions let's say after 30 mins. and the result is an error mentioned above... What is this "magic" timeout or why the SC Agent cannot handle some of child processes properly (or it's exit codes), it's a mistery for me...

[2015-10-30 10:53:12,238] INFO: [X.X.X.X:9090 (4.1.1.1)] Executing Post mount command [/opt/NetApp/OSAP/sc_oradb_recovery.sh mount wait 3] on 192.168.8.103
[2015-10-30 10:53:12,238] INFO: [X.X.X.X:9090 (4.1.1.1)] Executing Post mount command [/opt/NetApp/OSAP/sc_oradb_recovery.sh recover ] on 192.168.8.103
[2015-10-30 10:53:12,238] ERROR: [X.X.X.X:9090 (4.1.1.1)] Command [/opt/NetApp/OSAP/sc_oradb_recovery.sh recover ] failed with return code [-1] and stderr [] and stdout [Failed to execute command: /opt/NetApp/OSAP/sc_oradb_recovery.sh recover
Reason: null]

 

It does not metter how long the "su" process is execuded in the script, it can last one second or even several minutes, the behaviour (hanging) of SC Agent is the same - it waits for SC Agent Timeout and job fails. I have noticed these problem only when invoking "su" commands in scripts. Sometimes it hangs at the first "su" command, sometimes at the lasts one in the script. Unfortunatelly some actions (plsql script) need to be executed as oracle user, so there is no other way than "su".

 

Do you have any ideas what could be the problem? There are all involved files zipped in the attachment, just rename file from .txt to .zip

Thanks for any help in advance!

 

Rostislav

 

8 REPLIES 8

steiner
17,857 Views

Are you piping the output of sc_oradb_recovery.sh anywhere? Also, can you include a copy of that script?

pilkar
17,846 Views

All filles (configuration, logs, scripts) are included in the attachment - just rename the file "files.txt" to "files.zip". ZIP attachment is not allowed, so I have uploaded faked one...

You can find script file in clone_target_scripts\sc_oradb_recovery.sh and the script output (the successfull attempt) in logs\scServer\ORA_mount_start.debug.20151030100401.log 

The same mount action executed 40 minutes later (failed) can be found in logs\scServer\ORA_mount_start.debug.20151030104950.log.

 

regards,

Rostislav

steiner
17,822 Views

The timeouts with SC suggest that one of the scripts themselves is stalling on mounting an ASM diskgroup, running a recovery, or something like that. Depending on what happened, the SC debug logs might contain the SQL output that explains the failure, or you might have to modify some of your shell and SQL scripts to log to a local file. 

 

The only logs we have are this:

 

Failed to execute command: /opt/NetApp/OSAP/sc_oradb_recovery.sh recover
Reason: null]

 

Which I believe means SC got no response from the script. There's probably an error message in the output of that script, but it was still sitting in the buffer when SC timed the operation out. You'll need to get logs from the various "su - oracle" operations, plus add some lines like "echo recovery.controlfile >> /tmp/logfile" to sc_oradb_recovery.sh so you can figure out precisely where it's stalling.

pilkar
17,816 Views

The main catch is that the recovery process (sc_oradb_recovery.sh recover) finish successfully every time. SC Agent means that the script is still running even if it has been successfully finished. From SC Agent point of view it looks like the script is running (or is stalling somewhere in ora recovery process) and then of course fails on timeout.

 

There are two ORA alert logs attached bellow - both were completed successfully (from oracle db point of view) - the database has been recovered and opened. The first case was successfull for SC Agent, the second one was failed for SC Agent with reason: Null. Scripts are running forthe same amount of time to recover and open database, so there is no logic to cause SC Agent timeout failure.

I was of course really curious what's going on on Oracle side, why SC Agent hangs till timeout... But I found out that the script works as expected every time, but the SC Agent has sometimes troubles with child processes started by "su".

steiner
17,811 Views

I can't think of any reason why SC agents would have problems with su commands. Almost all of the plugins do an "su" regularly. There must be something blocking it from exiting. Can you pipe stdout to a temporary file within the su scripts? Maybe that will indicate what stalled the exit. 

pilkar
17,795 Views

Please see attached log files. The same result - postexec script finished OK (orarecovery_script_output.txt), SC Agent failed on timeout. I also cannot find any reasonable cause why it does not handle child process exit code properly...

 

The most interesting thing is that if I run the scenarion "by-hand" (cloning with SC without running post-exec from SC and then running the post-exec script from command line manually) and then run the mount SC config (clone + postexec db recovery), it's successfull too! The only condition for "success" is lets say 5-10 minutes timeout after running post-exec from cmd line... After more then 30 mins. SC config mount fails... Unbelievable, isn't it?

 

The main question is - why is completely the same scenario (SC config) sometimes successfull and sometimes unsuccessfull if everything (all factors included in SC config process) is the same?!

I have no more ideas what to try...

pilkar
17,781 Views

I have been trying last several hours to find out what is causing this weird behaviour and I think I have found at least the part of the cause...

 

When I run post-exec command from cmd line, everything works. When I try to catch script output, it hangs exactly for 4mins. and 7 seconds.

The same is needed also for SC Agent to finish the job - I have increased agent_timeout to 600 (for testing purposes) and the result was always succesfull then - SC Agent was waiting 4mins.7secs after finishing the post-exec script action... During this time period of "hanging" the script is still running (can be seen on "ps" list).

 

So, this is my actual conclusion:

1. if I run script from cmd line (let's say "./sc_oradb_recovery.sh"), everything works as expected and correctly

2. if I run script from cmd line with output forwarding ("./sc_oradb_recovery.sh 2>&1 > out.log"), it's OK too.

3. if I run script from cmd line with output tee-ing ("./sc_oradb_recovery.sh 2>&1 | tee out.log"), it hangs for 4mins.7secs as mentioned above.

4. if I run script from SC config (as SC Agent post-exec call), it's hangs as mentioned in step 3.

 

The interesting thing is that if I run step 1 and then in a short time run step 3 (or 4), it's succesfully finished without hanging for 4mins 7secs...

If I try to run step 3 (or 4) once again after several minutes (without running step 1 or 2 meanwhile), the result is hanging again.

 

I have found a bash bug that describes quite similar tee behaviour and it seems scAgent is influenced by this too. At this moment the only solution (or workaround) is increasing of sc_agent_timeout value in config.

 

 

 

pilkar
17,761 Views

I have continued exploring this isssue, but have not found any solution.

The only thing I have found is that the problem (script output hangs) appears only when stderr is redirected together with tee-ing stdout...

 

If I do: "./sc_oradb_recovery.sh 2>&1 | tee out.log" - the script hangs for several minutes.

Other variants ( "./sc_oradb_recovery.sh 2>&1 > out.log" or "./sc_oradb_recovery.sh | tee out.log") are OK...

Public