Active IQ Unified Manager Discussions

OnCommand 5.0 SnapManager jobs abnormally terminating

michaeldparker
3,350 Views

Hi All,

We rely heavily on the various SnapManager/Snap Creator products integrated with OnCommand to back up our databases and VMs.  Ever since upgrading to 5.0, I have been getting jobs in OnCommand that fail.  If you look at the "Complete jobs steps" in OnCommand, the 1st step always shows "Job Terminated abnormally" and the 2nd step is "End".  For instance, this AM, SnapManager for SQL started on one of my SQL servers.  From the SnapManager side, SMSQL thinks the job was completely successful because it initiated the snaps and passed the replication portion off to OnCommand.  As soon as OnCommand received the job, the job terminated abnormally.  This happens very irregularly, but it seems that everything will work fine if I am able to reboot the OnCommand server every two or three days.  Thinking that maybe something went wrong in the upgrade of the server from DFM 4.0.2 a few weeks ago, I wiped my server clean, rebuilt it from scratch, and imported the old database so that I didn't lose my configuration and past data.  This did not help at all.

If anyone has some thoughts on how to resolve this, I'd appreciate it; I need some stability on the server before the holidays hit.

Thanks

Michael

5 REPLIES

adaikkap
Hi Michael,

Is this a Linux or Windows DFM server? How many jobs run per day on average on your DFM server? 500? 1000?

Regards

adai

michaeldparker
Probably around 20 protection jobs a day.  I found bug ID 315578.  I wouldn't think that we'd be affected by this, but I made the method 1 change anyway to see if it helps.  So far the server has been running 2 days with no problems, but usually it will run 3 or 4 days before the server needs rebooting again.

Thanks

adaikkap
Can you let me know if it's Linux or Windows?

Regards

adai

michaeldparker
The OS is Windows 2008 R2, running as a VM with 2 vCPUs and 6 GB of RAM.

Thanks

Michael

michaeldparker
Well, it has been a week and a half, going on 2 weeks, and this is the longest that OnCommand has been stable.  I hope this is resolved.  I found this bug ID and implemented method 1.

Michael

BugID    315578

When running on Windows, Protection Manager jobs could sometimes remain in "queued" state for an extended period of time. In DataFabric Manager (DFM) 3.8 or later, these jobs will eventually be marked as failed with a message saying that "Job terminated abnormally". The reason for this and steps to work around this problem are described below.

Due to resource limitations in the Windows operating system, the DataFabric Manager scheduler service can't start more than a certain number of child processes at a time. After the resource limit is reached, the child process fails to start and there is no trace left of that process.

If DataFabric Manager 3.8 or later is encountering this problem, it prints an error message in the dfmscheduler.log file saying "Process <id> failed to start job ... ". In that case, you can increase the resources available to Windows services in one or more of the following ways.
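Before trying any of the workarounds, it may help to confirm that the scheduler log really contains this message. Here is a minimal sketch in Python, assuming only the message text quoted above; the rest of the log line layout is unknown, and `find_failed_starts` is a hypothetical helper name, not part of any DFM tooling.

```python
import re
from typing import Iterable, List

# Based on the message quoted in the bug text:
#   Process <id> failed to start job ...
# Everything else about the log line format is an assumption.
FAILED_START = re.compile(r"Process\s+(\S+)\s+failed to start job", re.IGNORECASE)


def find_failed_starts(lines: Iterable[str]) -> List[str]:
    """Return the process ids from lines that report a failed job start."""
    ids = []
    for line in lines:
        match = FAILED_START.search(line)
        if match:
            ids.append(match.group(1))
    return ids
```

To use it, open dfmscheduler.log from wherever your install keeps its logs and pass the file object to `find_failed_starts`; an empty result suggests this bug is not what you are hitting.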

Use a dedicated machine for running DataFabric Manager
======================================================

System resources are shared by all the services installed on the system. Even if you don't start some of the services, Windows still has to allocate resources for them. Do not install any other applications on the system that is used for running DataFabric Manager. That will increase the resources available to DataFabric Manager services.

Increase the resources available to the services
================================================

There are 2 ways you can do this. The second method requires registry modification and a reboot.

Method 1: Allow DataFabric Manager scheduler service to interact with the desktop
---------

- Go to Control Panel --> Services and select DataFabric Manager Scheduler service.
- Right click and open properties panel.
- Click on the check box "Allow service to interact with the desktop".
- Restart DataFabric Manager Scheduler service.

This method works because Windows uses different resource allocation for the services that interact with the desktop.
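For anyone who would rather script Method 1 than click through the Services panel, the same setting can probably be applied with the standard Windows `sc config` command. The sketch below only builds the argument list; it does not run anything. The service's internal name is not given in the bug text, so `service_name` is a placeholder you would have to look up yourself (for example with `sc query`), and `build_interactive_command` is a hypothetical helper. As far as I know, the interactive flag is only accepted for services that run under the LocalSystem account.

```python
from typing import List


def build_interactive_command(service_name: str) -> List[str]:
    """Build an `sc config` argument list that marks a service as interactive.

    sc.exe expects each option name ("type=") and its value ("own",
    "interact") as separate tokens, which is why they are split here.
    """
    if not service_name:
        raise ValueError("service_name must not be empty")
    return ["sc", "config", service_name, "type=", "own", "type=", "interact"]
```

The returned list can be handed to `subprocess.run(..., check=True)` from an elevated prompt; the scheduler service still needs a restart afterwards, as in the manual steps.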

Method 2: Modify Windows registry to increase resources available to the non-interactive services
---------

Reference: http://support.microsoft.com/kb/184802

Refer to the above article from the Microsoft knowledge base to increase resources available to each service. The following instructions are for reference only. Modification of the registry is dangerous and should be done by qualified personnel only.

- Start Windows registry editor.
- Go to:
   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows
- Modify value for key name = Windows
- Usually value looks like: %SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,768 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off MaxRequestThreads=16
- Change "SharedSection=1024,20480,768" to "SharedSection=1024,20480,4096"
   The third value determines resources available for non-interactive services.
   Therefore, we need to increase it.
- Reboot.
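Since the registry value is one long string and only the third SharedSection number should change, it is easy to mistype when editing by hand. Here is a minimal sketch of the string edit alone, assuming the value has the three-part `SharedSection=a,b,c` form shown above. It does not touch the registry; `raise_noninteractive_heap` is a hypothetical helper name, and 4096 is simply the figure from the bug text.

```python
import re

# Matches the three comma-separated numbers after "SharedSection=".
SHARED_SECTION = re.compile(r"SharedSection=(\d+),(\d+),(\d+)")


def raise_noninteractive_heap(value: str, new_size: int = 4096) -> str:
    """Return the registry string with the third SharedSection number replaced.

    The first two numbers and the rest of the string are left untouched.
    """
    def replace_third(match: "re.Match[str]") -> str:
        return f"SharedSection={match.group(1)},{match.group(2)},{new_size}"

    new_value, count = SHARED_SECTION.subn(replace_third, value, count=1)
    if count == 0:
        raise ValueError("no three-part SharedSection entry found")
    return new_value
```

Paste the current value in, compare the result against the original, and only then put it into the registry editor; the reboot is still required.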
