2011-11-07 06:49 AM
We rely heavily on the various SnapManager/Snap Creator products integrated with OnCommand to back up our databases and VMs. Ever since upgrading to 5.0, I have been getting jobs in OnCommand that fail. If you look at the "Complete jobs steps" in OnCommand, the first step always shows "Job terminated abnormally" and the second step is "End". For instance, this morning SnapManager for SQL started on one of my SQL servers. From the SnapManager side, SMSQL thinks the job was completely successful because it initiated the snaps and passed the replication portion off to OnCommand. As soon as OnCommand received the job, the job terminated abnormally. This happens very irregularly, but everything seems to work fine if I reboot the OnCommand server every two or three days. Thinking that maybe something went wrong in the upgrade of the server from DFM 4.0.2 a few weeks ago, I wiped my server clean, rebuilt it from scratch, and imported the old database so that I didn't lose my configuration and past data. This did not help at all.
If anyone has some thoughts on how to resolve this, I'd appreciate it; I need some stability on the server before the holidays hit.
2011-11-08 01:15 PM
Probably around 20 protection jobs a day. I found bug ID 315578. I wouldn't think that we'd be affected by this, but I made the method 1 change anyway to see if it helps. So far the server has been running for 2 days with no problems, but usually it will run 3 or 4 days before it needs rebooting again.
2011-11-16 04:48 AM
Well, it has been a week and a half, going on 2 weeks, and this is the longest that OnCommand has been stable. I hope this is resolved. I found this bug ID and implemented method 1.
When running on Windows, Protection Manager jobs could sometimes remain
in "queued" state for an extended period of time.
In DataFabric Manager (DFM) 3.8 or later, these jobs will eventually be marked
as failed with a message saying that "Job terminated abnormally". The reason
for this and steps to work around this problem are described below.
Due to resource limitations in the Windows operating system, the DataFabric Manager scheduler service
can't start more than a certain number of child processes at a time. After
the resource limit is reached, a child process fails to start and there is
no trace left of that process.
If DataFabric Manager 3.8 or later is encountering this problem, it prints an error message in
the dfmscheduler.log file saying "Process <id> failed to start job ... ". In that
case, you can increase the resources available to Windows services in one
or more of the following ways.
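Before trying the workarounds, one way to check for this message is to scan the scheduler log for the phrase quoted above. This is a minimal Python sketch, not a NetApp tool: the function names are made up for illustration, the log path has to be supplied by you because it depends on where DataFabric Manager is installed, and it matches only the fixed phrase "failed to start job" since the rest of the line format is not shown here.

```python
import sys
from pathlib import Path

# Fixed phrase taken from the error message quoted in the bug text.
MARKER = "failed to start job"

def find_failed_starts(lines):
    """Return the lines that contain the scheduler's start-failure phrase."""
    return [line.rstrip("\n") for line in lines if MARKER in line]

def scan_log(path):
    """Read a log file and return its matching lines (empty list if none)."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return find_failed_starts(text.splitlines())

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_dfm_log.py <path to dfmscheduler.log>
    for hit in scan_log(sys.argv[1]):
        print(hit)
```

If this prints nothing, the jobs may be failing for some other reason than the process limit described here.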
Use a dedicated machine for running DataFabric Manager
System resources are shared by all the services installed on the system. Even
if you don't start some of the services, Windows still has to allocate resources
for them. Do not install any other applications on the system that is used
for running DataFabric Manager. That will increase the resources available to DataFabric Manager services.
Increase the resources available to the services
There are 2 ways you can do this. The second method requires registry
modification and reboot.
Method 1: Allow DataFabric Manager scheduler service to interact with the desktop
- Go to Control Panel --> Services and select DataFabric Manager Scheduler service.
- Right click and open properties panel.
- Click on the check box "Allow service to interact with the desktop".
- Restart DataFabric Manager Scheduler service.
This method works because Windows uses a different resource allocation for
services that interact with the desktop.
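The same checkbox can also be set from an elevated command prompt with the built-in Windows `sc` tool, which accepts `type= own type= interact` for services that run under the LocalSystem account. The Python sketch below only builds the command lines and prints them; it does not run anything. The function name is hypothetical and the service name is a placeholder, so look up the real name of the scheduler service in the Services panel first.

```python
def interactive_service_commands(service_name):
    """Build sc commands that enable desktop interaction and restart a service."""
    return [
        # sc expects "type=" and its value as separate arguments.
        ["sc", "config", service_name, "type=", "own", "type=", "interact"],
        ["sc", "stop", service_name],
        ["sc", "start", service_name],
    ]

if __name__ == "__main__":
    # "ExampleSchedulerService" is a placeholder, not the real service name.
    for cmd in interactive_service_commands("ExampleSchedulerService"):
        print(" ".join(cmd))
```

Note that `sc stop` returns before the service has fully stopped, so the start command may need to be retried after a few seconds.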
Method 2: Modify the Windows registry to increase resources available to the non-interactive services
Refer to the above article from the Microsoft knowledge base to increase
resources available to each service. The following instructions are for reference
only. Modification of the registry is dangerous and should be done by qualified
- Start the Windows registry editor
- Go to:
- Modify value for key name = Windows
- Usually value looks like: %SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,768 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off MaxRequestThreads=16
- Change "SharedSection=1024,20480,768" to "SharedSection=1024,20480,4096"
The third value determines resources available for non-interactive services.
Therefore, we need to increase it.
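Since the value is long and easy to mistype, the change to the third SharedSection number can be worked out on a copy of the string first. This is a rough Python sketch under stated assumptions: the function name is hypothetical, the 4096 default comes from the example above, and the code only transforms text. It does not read or write the registry.

```python
import re

def raise_noninteractive_heap(value, new_kb=4096):
    """Return value with the third SharedSection number raised to new_kb.

    The number is never lowered: if it is already larger, it is kept.
    """
    def bump(match):
        first, second, third = match.group(1), match.group(2), int(match.group(3))
        return f"SharedSection={first},{second},{max(third, new_kb)}"

    new_value, count = re.subn(r"SharedSection=(\d+),(\d+),(\d+)", bump, value)
    if count != 1:
        raise ValueError("expected exactly one SharedSection=a,b,c entry")
    return new_value

if __name__ == "__main__":
    # Shortened sample; paste the full value from your own registry instead.
    print(raise_noninteractive_heap("Windows=On SharedSection=1024,20480,768"))
```

Pasting the result back into the registry editor is still a manual step, and the reboot mentioned earlier is still required.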