OnCommand 5.0 Jobs failing with Job Terminated abnormally

michaeldparker · ‎2011-08-31

Hi All,

I upgraded my DFM 4.0.2 to OnCommand 5.0 last week and thought it was successful, but on two different nights now I have had multiple dataset protection jobs fail. The first time it occurred, every job from Sunday failed. Since my install was new, I performed a reboot on Monday morning and bumped my memory up to 6GB; I had been noticing memory consumption was high and I couldn't remember if I had rebooted after the upgrade.

Now, just last night (Tuesday night), I had multiple dataset protection jobs fail, but not all. This morning, I was able to manually start a new job on one dataset that needed to be updated and it seems to be running fine. I attempted to start a few more and they sat as queued for a while before finally failing. If I look at the Completed jobs steps window, all failed jobs have only two entries. The first entry merely states Progress, with Job terminated abnormally for the description. The second step states End.

My OnCommand 5.0 server is running as a VM on ESX 4.1. The OS of the VM is Windows 2008 R2 Enterprise, so my install is 64-bit of course. This is the same VM that I had been running 4.0.2 on, as I performed an in-place upgrade. I have given the OS 2 CPUs and 6GB of RAM. I noticed that even at 6GB, Task Manager shows my memory usage staying steady at 5.6GB.

Thanks in advance for the help. Please let me know if you need any further information. I'm going to open a ticket with support as well and I will update this post if support has any resolutions.

Thanks

Mike

michaeldparker · ‎2011-08-31

Update:

My single job completed. After it completed, I attempted to start another job which started fine and is running. While it was running, I attempted to start another job which stayed queued, then failed after a few minutes.

Thanks

kvishal · ‎2011-08-31

By default, the scheduler runs only 100 scheduled backup jobs at a time. If a job remains in the queue for more than an hour, it is dropped. You can control those settings using the dpMaxRunningJobs and dpScheduledJobExpiration global options. The default values are 100 jobs and 1 hour respectively.

Is your job long running? Is it taking more than an hour to finish?

Thanks & Regards

--

VK

Perf & Scal QA

MEI

Create more value than you capture ..

michaeldparker · ‎2011-08-31

I do have some long-running jobs that can take from one to perhaps two or three hours, but I never have more than 2 or 3 jobs running at any given point in time. Our environment is not big enough to ever have to worry about 100 jobs at a time. In addition, the jobs that I've seen in the queue as queued are usually in that state for no more than 5 to 10 minutes before terminating abnormally.

Thanks for the input.

adaikkap · ‎2011-09-02

Are all the services running? Check using the dfm service list CLI command.

Also check for any errors in the log folder, under <installdir>/NetApp/DFM/Log

Can you also get the job detail output for the jobs? dfpm job detail <jobid>

Regards

adai

michaeldparker · ‎2011-09-02

Hmm ... Thanks. Since my last failures, it is running stable now. I think I'll probably see this problem in another day or two. When it happens, I'll check the services and the error log.

Thanks

tamamy · ‎2011-09-06

Hello Adaikkappan,

What can I learn from the output of "dfpm job detail"?

There is no explanation or description of the failure, only "job terminated abnormally".

My customer is experiencing these kinds of failures from time to time (nothing consistent) and I can't find the reason.

I would appreciate any reply.

michaeldparker · ‎2011-09-12

Hi Adai,

I'm seeing more job failures lately. I think rebooting my DFM server helps, but I can't say for certain because this is happening sporadically. I have checked the output of dfpm job detail and unfortunately it is not very helpful either. It merely states terminated abnormally. I have also checked the log directory you specified, but there is nothing meaningful that I can see. Any further thoughts on why this might happen or how to troubleshoot? I have a ticket open, but it's unfortunately going absolutely nowhere fast.

Thanks

Michael

kvishal · ‎2011-09-12

Hi Michael,

I am not quite sure, but I suspect this could be due to semaphores getting leaked or exhausted. Because of this, DFM will not be able to connect to the DB. Could you see if there are any errors in the DB logs? Also, could you track your semaphores?

On reboot, the semaphores may be getting released and becoming available, so jobs work fine until they get exhausted again.

You can use ipcs -s | wc -l to see the number of semaphores in use. You can also track your service list to see if the DB service goes down.
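Note that ipcs -s | wc -l also counts the banner, the column header, and blank lines, so the raw number runs slightly high; what matters is whether it keeps climbing. A minimal Python sketch that counts only the data rows is below. It assumes the usual Linux ipcs -s layout, where each semaphore array row starts with a hexadecimal key; the function name and the sample text are illustrative, not taken from a DFM server.

```python
def count_semaphore_arrays(ipcs_output: str) -> int:
    """Count semaphore arrays in the text printed by `ipcs -s` on Linux."""
    count = 0
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hexadecimal key such as 0x00000000;
        # the banner, the column header, and blank lines do not.
        if fields and fields[0].startswith("0x"):
            count += 1
    return count


# Illustrative sample only, not output captured from a real server.
SAMPLE = """
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 98304      root       600        1
0x00000000 131073     root       600        1
"""

print(count_semaphore_arrays(SAMPLE))
```

On a live Linux host, the text could come from subprocess.run(["ipcs", "-s"], capture_output=True, text=True).stdout, sampled every few minutes to see whether the count grows between reboots.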

Regards

Vishal

michaeldparker · ‎2011-09-13

Hi Vishal,

Thanks very much for the reply. I started digging through the various logs a little deeper today, and in DFMScheduler.log I saw the error below repeated numerous times with different process IDs. This particular error did not begin until the first job failure on the 28th.

Sep 06 18:16:32 [DFMScheduler:ERROR]: [1628:0x6c8]: Process 5136 failed to start job 10338, elapsed = 601, is_running = 0, StartedTS: '', ScheduledTS: '2011-09-06 18:06:27.000000', ContinueReqTS: ''.

Sep 06 18:16:32 [DFMScheduler:ERROR]: [1628:0x6c8]: Process 5136 started to run job 10338 but has ended before 601 seconds with is_regulated_request = 0 and continue_request = 0.The system may have run out of resources for running this job.
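To see how often this happens and which jobs are affected, the "failed to start job" lines can be pulled out of the log with a short script. This is only a sketch: the regular expression is written against the two lines quoted above, and the names FAILED_START and failed_starts are made up for illustration. Other DFM versions may format the line differently.

```python
import re

# Pattern derived from the "failed to start job" line quoted above.
FAILED_START = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[DFMScheduler:ERROR\]: "
    r"\[[^\]]+\]: Process (?P<pid>\d+) failed to start job (?P<job>\d+), "
    r"elapsed = (?P<elapsed>\d+)"
)


def failed_starts(lines):
    """Yield (timestamp, pid, job_id, elapsed_seconds) for each matching line."""
    for line in lines:
        m = FAILED_START.match(line)
        if m:
            yield m["stamp"], int(m["pid"]), int(m["job"]), int(m["elapsed"])


# The two log lines from this post, used as test input.
SAMPLE = [
    "Sep 06 18:16:32 [DFMScheduler:ERROR]: [1628:0x6c8]: Process 5136 failed "
    "to start job 10338, elapsed = 601, is_running = 0, StartedTS: '', "
    "ScheduledTS: '2011-09-06 18:06:27.000000', ContinueReqTS: ''.",
    "Sep 06 18:16:32 [DFMScheduler:ERROR]: [1628:0x6c8]: Process 5136 started "
    "to run job 10338 but has ended before 601 seconds with "
    "is_regulated_request = 0 and continue_request = 0.The system may have "
    "run out of resources for running this job.",
]

for entry in failed_starts(SAMPLE):
    print(entry)
```

Passing an open file object for DFMScheduler.log instead of SAMPLE works the same way, since iterating a file yields its lines. The elapsed value of 601 seconds is roughly ten minutes, which lines up with how long the jobs sat queued and is well under the one-hour default expiration mentioned earlier in the thread.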

Also in the same log, I saw the error below, which occurred the day after I upgraded my DFM server.

Aug 25 13:08:54 [DFMScheduler:DEBUG]: [1692:0xbc8]: Process crashed: Wrote dump at 'dfmscheduler-73de6a34-2011_08_25-13_08_51.dmp'

In sybase.log, the only error that I saw repeated routinely is below, but it was occurring prior to the upgrade of my DFM servers, so I don't think it means much in relation to my current problem:

Disconnecting shared memory client, process id not found

Please let me know if any of this means anything to you. The ipcs command that you mention seems to be only available on Linux. My server is Windows.

Thanks for the help.

Michael

kvishal · ‎2011-09-13

Are your scheduler services up and running? It seems the scheduler has crashed and dumped core.

michaeldparker · ‎2011-09-14

Thanks for the reply. I think I might have the problem resolved, at least I am hopeful anyway. I was working the ticket with RTP yesterday and we realized that although I was giving the server 6GB of RAM in VMware, I had the limit inadvertently set to 2GB. Once we adjusted that, the server seemed more responsive and no errors have occurred yet. I’ll watch it a few more days and see how it does.
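For anyone checking the same thing, the mismatch comes down to simple arithmetic: the guest OS still sees the configured 6GB, but the host will only back it with physical RAM up to the limit. A rough Python sketch of that check is below. The function names are made up for illustration, and treating -1 as "no limit" is an assumption based on how vSphere normally reports an unlimited setting.

```python
UNLIMITED = -1  # assumption: vSphere reports -1 when no memory limit is set


def effective_memory_mb(configured_mb: int, limit_mb: int = UNLIMITED) -> int:
    """Physical memory the host will back the VM with, in MB."""
    if limit_mb == UNLIMITED:
        return configured_mb
    return min(configured_mb, limit_mb)


def limit_below_configured(configured_mb: int, limit_mb: int = UNLIMITED) -> bool:
    """True when the limit leaves the guest with less than it was given."""
    return effective_memory_mb(configured_mb, limit_mb) < configured_mb


# The values from this thread: 6GB configured, limit left at 2GB.
print(effective_memory_mb(6144, 2048))
print(limit_below_configured(6144, 2048))
```

With the limit at 2GB, anything the guest used beyond that had to be ballooned or swapped by the host, which would fit with Task Manager reporting 5.6GB in use while the server stayed sluggish.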

Thanks