Protection Manager SnapVault Job Hung

We have a Protection Manager dataset whose backups have failed for 7 days.  Each day the job says it is waiting on a previous job, #103885.  The Protection Manager GUI indicates job #103885 completed successfully.  I had the option to cancel job #103885.  Two hours after I issued the cancellation through the GUI, the job still indicates it is processing the abort.

Is there a backdoor method to kill the job?

I need help understanding what caused this job to hang and how to kill it.  I do not want to restart the dfm services.


Here is a snapvault status from one of the qtrees in the dataset.  The filer sees the relationship as a normal "snapvaulted" state:

[root@trulsut0001 ~]# rsh truanap0004 snapvault status -l /vol/dfpm_tr02_vol19_SV_1236712977_2946468784/gpho_1107
Snapvault secondary is ON.

Source:                 172.21.124.11:/vol/tr02_vol19/gpho.1107
Destination:            truanap0004:/vol/dfpm_tr02_vol19_SV_1236712977_2946468784/gpho_1107
Status:                 Idle
Progress:               -
State:                  Snapvaulted
Lag:                    176:10:39
Mirror Timestamp:       Wed Jul 14 05:02:33 EDT 2010
Base Snapshot:          truanap0004(0118064306)_dfpm_tr02_vol19_SV_1236712977_2946468784-base.0
Current Transfer Type:  -
Current Transfer Error: -
Contents:               Replica
Last Transfer Type:     Update
Last Transfer Size:     12 KB
Last Transfer Duration: 00:26:45
Last Transfer From:     172.21.124.11:/vol/tr02_vol19/gpho.1107


[pozezac@trulvop0001 ~]$ dfpm job list 103885
Job Id Job State     Job Description
------ ------------- ------------------------------------------------------------
103885 aborting      Back up data from node Primary data to node Backup of dataset tr02_vol19 (26919) with daily retention
[pozezac@trulvop0001 ~]$

Here are the first and last events from job 103885 that PM claims is still running:

Job Id:                    103885
Job State:                 aborting
Job Description:           Back up data from node Primary data to node Backup of dataset tr02_vol19 (26919) with daily retention
Job Type:                  remote_backup
Job Status:                success
Bytes Transferred:         79045287936
Dataset Name:              tr02_vol19
Dataset Id:                26919
Object Name:               tr02_vol19
Object Id:                 26919
Policy Name:               Back up_05:00
Policy Id:                 33831
Started Timestamp:         14 Jul 2010 05:00:06
Abort Requested Timestamp: 21 Jul 2010 09:49:36
Completed Timestamp:      
Submitted By:              dfmscheduler
Destination Node Id:       2
Destination Node Name:     Backup
Source Node Id:            1
Source Node Name:          Primary data

Job progress messages:

Event Id:      16482760
Event Status:  normal
Event Type:    job-start
Job Id:        103885
Timestamp:     14 Jul 2010 05:00:06
Message:      
Error Message:

Event Id:      16574211
Event Status:  warning
Event Type:    job-progress
Job Id:        103885
Timestamp:     21 Jul 2010 09:49:37
Message:      
Error Message: Received request to abort job.


Here are the first and last messages from the next job that was the first failure for this dataset:

[pozezac@trulvop0001 ~]$ dfpm job details 103969

Job Id:                    103969
Job State:                 completed
Job Description:           Create local backup on node 'Primary data' of dataset 'tr02_vol19' (26919) with daily retention
Job Type:                  local_backup
Job Status:                failure
Bytes Transferred:         0
Dataset Name:              tr02_vol19
Dataset Id:                26919
Object Name:               tr02_vol19
Object Id:                 26919
Policy Name:               Back up_05:00
Policy Id:                 33831
Started Timestamp:         15 Jul 2010 00:01:08
Abort Requested Timestamp:
Completed Timestamp:       15 Jul 2010 01:01:09
Submitted By:              dfmscheduler
Source Node Id:            1
Source Node Name:          Primary data

Job progress messages:

Event Id:      16489784
Event Status:  normal
Event Type:    job-start
Job Id:        103969
Timestamp:     15 Jul 2010 00:01:08
Message:
Error Message:

Event Id:      16489785
Event Status:  normal
Event Type:    job-progress
Job Id:        103969
Timestamp:     15 Jul 2010 00:01:08
Message:       Waiting for job 103885 to finish
Error Message:

Event Id:      16490974
Event Status:  error
Event Type:    job-progress
Job Id:        103969
Timestamp:     15 Jul 2010 01:01:09
Message:
Error Message: tr02_vol19: Timed out while waiting for protection job  "Back up data from node Primary data to node Backup of dataset tr02_vol19 (26919) with daily retention" (103885) to finish.

Event Id:      16490975
Event Status:  error
Event Type:    job-end
Job Id:        103969
Timestamp:     15 Jul 2010 01:01:09
Message:
Error Message:

Re: Protection Manager SnapVault Job Hung

Hi Chris

It looks like the DFM server did not receive the job-completion notification from the filer. Was there any disruption in the network or on the filer while this job was running?

Can you run ndmpd status on the filer and see whether any sessions are active?

Regards

adai

Re: Protection Manager SnapVault Job Hung

I don't know why your job is waiting for a week. Jobs were intended to give up after an hour. This sounds like a bug, so if you could save a copy of your database and file a bug report, we can diagnose what happened.

In the meantime, there should be a "dfpm" process that's running the job. If you kill that process, we should figure out the job has failed and unlock its resources in about 30 seconds. At this point, you might as well try it.
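
For anyone unsure how to locate that process, here is a minimal sketch, assuming a Linux DFM server where the job's process appears in ps output with an executable named dfpm. The helper name find_dfpm_pids is made up for illustration, and nothing here has been verified against a particular DFM release. If more than one dfpm process is listed, you still have to work out by hand which one belongs to the stuck job.

```shell
#!/bin/sh
# Hypothetical helper (not part of DFM): print the PIDs of processes
# whose executable is named "dfpm".
# Input: lines of "PID COMMAND [ARGS...]" on stdin, e.g. from `ps -eo pid,args`.
find_dfpm_pids() {
    awk '{
        cmd = $2
        sub(/.*\//, "", cmd)      # drop any leading directory path
        if (cmd == "dfpm") print $1
    }'
}

# Usage, commented out so nothing is killed by accident:
#   ps -eo pid,args | find_dfpm_pids     # list candidate PIDs
#   kill <pid>                           # <pid> is the one you picked by hand
#   dfpm job list 103885                 # check that the job has left "aborting"
```

Matching on the executable name rather than grepping the whole line avoids picking up unrelated commands that merely mention dfpm in their arguments.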

Re: Protection Manager SnapVault Job Hung

Adai:

There are no ndmp sessions running on either the host or the destination system.

Pete:

I was able to find the process running for job #103885 and kill it.  I was then able to successfully run an on-demand job to give us a backup.

I will submit a BURT and provide you with the database and BURT # offline.

Thanks for your help!

Chris

Protection Manager SnapVault Job Hung

Hi,

I have the same problem. How can I find the process running my job, and how do I kill it?

PS: Did you find a bug, and has it been fixed? We are running DFM 4.0.2D2.

Thanks in advance

Jerome Barrelet

Re: Protection Manager SnapVault Job Hung

This is logged under BURT # 385906.  A workaround can be found here:

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=385906

The issue is fixed in the following releases:  4.0.2D6, 4.0.2D9, 5.0D2

Note:  5.0 does not include the fix, so if you are upgrading from 4.0.x, be sure to upgrade to 5.0D2.