Active IQ Unified Manager Discussions

dataset error after upgrade

marc_ho_ltx
7,579 Views

Hello, we have recently upgraded our Protection Manager from DFM 4.0.2 to OnCommand 5.2.  We installed a fresh OS, took a backup of the old DFM, and restored it to the new OnCommand server.  From what I can tell, everything seems to be working fine.  However, there is one dataset that takes snapshot backups to another NetApp that is failing.  A similar dataset with the same relationship but with different volumes is successful.  I do not see much detail in the logs about the failure except that there was an error.

Here is a sample email of the error:

=======================================================================

An Error event at 15 Jun 17:39 PDT on Qtree nofs3_nofs3_users on Volume nofs3_backup on Storage System nofs0.ltx-credence.com:

SnapMirror Update: Failed.

Click below to see the details of this event.

http://dfm.milpitas.credence.com:8080/start.html#st=1&data=(eventID=140750)

*** Event details follow.***

General Information

-------------------

DataFabric Manager server Serial Number: 1-50-124130

Alarm Identifier: 2

Event Fields

-------------

Event Identifier: 140750

Event Name: SnapMirror Update: Failed

Event Description: SnapMirror Update

Event Severity: Error

Event Timestamp: 15 Jun 17:39

Source of Event

---------------

Source Identifier: 2930

Source Name: nofs0:/nofs3_backup/nofs3_nofs3_users

Source Type: Qtree

Name of the host: nofs0.ltx-credence.com

Type of the host: Storage System

Host identifier: 2858

Event Arguments

---------------

datasetId: 2915

backupJobId: 24050

jobId: 24050

--NetApp DataFabric Manager

=======================================================================

Here is a list of the snapshots in one of the destination volumes for the dataset on the backup NetApp:

Name                                                                 Date          Used      Total     Status
2012-07-30 00:13:40 monthly_nofs0_nofs3_backup.-.nofs3_nofs3_users   Jul 29 20:13  123.7 GB  263.6 GB  normal
2012-10-01 00:28:16 monthly_nofs0_nofs3_backup.-.nofs3_nofs3_users   Sep 30 20:28  26.99 GB  139.9 GB  normal
2012-12-31 02:22:39 monthly_nofs0_nofs3_backup.-.nofs3_nofs3_users   Dec 30 21:22  24.16 GB  113 GB    normal
2013-02-25 02:36:19 monthly_nofs0_nofs3_backup.-.nofs3_nofs3_users   Feb 24 21:36  22.01 GB  88.8 GB   normal
2013-04-01 02:37:02 monthly_nofs0_nofs3_backup.-.nofs3_nofs3_users   Mar 31 22:37  29.04 GB  66.79 GB  normal
2013-05-13 02:40:20 weekly_nofs0_nofs3_backup.-.nofs3_nofs3_users    May 12 22:40  15.22 GB  37.75 GB  normal
2013-05-20 03:00:57 weekly_nofs0_nofs3_backup.-.nofs3_nofs3_users    May 19 23:00  5.924 GB  22.52 GB  normal
2013-05-27 03:13:22 weekly_nofs0_nofs3_backup.-.nofs3_nofs3_users    May 26 23:13  7.341 GB  16.6 GB   normal
2013-06-03 03:06:24 weekly_nofs0_nofs3_backup.-.nofs3_nofs3_users    Jun 02 23:06  4.364 GB  9.259 GB  normal
2013-06-10 02:53:14 weekly_nofs0_nofs3_backup.-.nofs3_nofs3_users    Jun 09 22:53  2.597 GB  4.895 GB  normal
nofs0(0118060768)_nofs3_backup_nofs3_nofs3_users-dst.838             Jun 12 22:05  2.063 GB  2.298 GB  busy,snapmirror
nofs0(0118060768)_nofs3_backup_nofs3_nofs3_users-src.0               Jun 14 20:16  240.3 MB  240.3 MB  normal

Any suggestions would be greatly appreciated.

Thanks,

- Marc

1 ACCEPTED SOLUTION

adaikkap
7,579 Views

Hi Marc,

    I just got confirmation from our folks internally that bug 624459 affects qtree SnapMirror as well. Please open a case with NetApp and reference this bug to them.

Also, to find the problematic file, follow the public report for bug 624459 at the link I gave in my previous reply.

Regards

adai

View solution in original post

14 REPLIES

adaikkap
7,544 Views

Hi Marc,

     I don't see any correlation between this failure and the upgrade. Can you also paste the output of the job detail CLI for this job ID?

dfpm job detail 24050

The error is basically coming from the storage system, which Protection Manager is relaying back.

Regards

adai

marc_ho_ltx
7,544 Views

Hello Adai,

The output of dfpm job detail 24050 is over 2000 lines.

If I grep for 'error' there are many error messages.  These are the only lines that had something after them.
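(For reference, the filter was roughly the following, run from a shell on the DFM server; on Windows, findstr /i error would be the equivalent.)

dfpm job detail 24050 | grep -i error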

Error Message:    
Event Status:      error
Error Message:     nofs0.ltx-credence.com: replication destination hard link create failed
Error Message:     LRD DIROPS
Event Status:      error
Error Message:     SnapMirror transfer failed.
Error Message:    

I'm not sure if that will be helpful to you.

Thanks!

- Marc

adaikkap
7,544 Views

Hi Marc,

          Please redirect the output of the job detail command to a file and upload it as an attachment.
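Something like this from a command prompt on the DFM server should do it (the file name is just an example):

dfpm job detail 24050 > job_24050_detail.txt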

Regards

adai

paul_wolf
7,544 Views

Also, can you look through the SnapMirror logs on the target controller?
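If it is a 7-Mode system, the log should be /etc/log/snapmirror on the root volume, so from the controller console something like this should show it (or read the same file over an NFS/CIFS mount of the root volume):

rdfile /etc/log/snapmirror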

marc_ho_ltx
7,545 Views

Hi Paul,

Here is the error message from the snapmirror log at the time of the dataset error.

slk Tue Jun 18 20:35:23 EDT state.qtree_softlock.nofs3_backup.0003ca71.028.nofs3_nofs3_users.src-qt.nofs0:/vol/nofs3_backup/nofs3_nofs3_users.00000000.-01.0 Softlock_delete (Transfer)
dst Tue Jun 18 20:35:24 EDT nofs3.ltx-credence.com:/vol/users/- nofs0:/vol/nofs3_backup/nofs3_nofs3_users Rollback_failed (replication destination hard link create failed)
dst Tue Jun 18 20:35:24 EDT nofs3.ltx-credence.com:/vol/users/- nofs0:/vol/nofs3_backup/nofs3_nofs3_users Abort (replication destination hard link create failed)

dst Tue Jun 18 20:55:25 EDT nofs3.ltx-credence.com:/vol/users2/- nofs0:/vol/nofs3_backup_2/nofs3_nofs3_users2 End (905892 KB)
dst Tue Jun 18 21:21:11 EDT nofs3.ltx-credence.com:/vol/users1/- nofs0:/vol/nofs3_backup_1/nofs3_nofs3_users1 End (2487300 KB)

Thanks,

- Marc

marc_ho_ltx
7,545 Views

Hello Adai,

Attached is the job detail.


Thanks for your time.

- Marc

adaikkap
7,545 Views

Hi Marc,

          Sorry for the delay. I did some internal searching and found a similar bug.  Here is the link:

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=624459

Though it mentions SnapVault, it may well be related, because Qtree SnapMirror and SnapVault use the same replication engine as far as I know.

What version of ONTAP are you running?  I also suggest you open a support case with NetApp for this. This is a pure ONTAP error message and has nothing to do with Protection Manager.

Regards

adai

marc_ho_ltx
7,545 Views

Hi Adai!

Thanks for the feedback and for pointing me to the bug.  We are running ONTAP 7.3.6, so this is probably it.

I ran this command to locate all the hard links in the volume: 'find . -type f -links +1 | xargs ls -i'.  I redirected the output to a file and sorted it, and found over 400,000 hard links referencing a few files in someone's home directory.
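Roughly what I ran, from an NFS mount of the source volume (the file names are just examples):

# list regular files with more than one link, printing the inode number for each
find . -type f -links +1 | xargs ls -i > hardlinks.txt
# sort numerically by inode so all names pointing at the same file group together
sort -n hardlinks.txt > hardlinks_sorted.txt

(A find ... -print0 | xargs -0 variant would be safer if any file names contain spaces.)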

I'll report back tomorrow after the job runs to see if it's successful.

Thanks!

- Marc

adaikkap
6,065 Views

Hi Marc,

      Good to know it helped. Let me know how the next update job goes.

Regards

adai

marc_ho_ltx
6,065 Views

Hello Adai,

After removing the hard links I get the same error message: "replication destination hard link create failed".  Do I need to re-initialize the SnapMirror relationship, or do something else, to get it going again?

Thanks,

- Marc

adaikkap
6,065 Views

Hi Marc,

     This is more of an ONTAP issue. I suggest you open a case against this bug, and support should be able to help you. Sorry that I couldn't help you on this.

Regards

adai

adaikkap
7,580 Views

Hi Marc,

    I just got confirmation from our folks internally that bug 624459 affects qtree SnapMirror as well. Please open a case with NetApp and reference this bug to them.

Also, to find the problematic file, follow the public report for bug 624459 at the link I gave in my previous reply.

Regards

adai

marc_ho_ltx
6,065 Views

Hello Adai,

Sorry for the late update.  After removing the hard links I ended up having to remove the volume from the dataset, create a new dataset, and place the volume in the newly created dataset.  It has been working again since.
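For reference, the rough CLI equivalent of what I did is below (I actually used the Management Console; the dataset and member names here are only examples, and the exact dfpm syntax should be double-checked against the CLI help):

# drop the primary volume from the old dataset
dfpm dataset remove users_backup nofs3:/users
# create a replacement dataset and add the primary volume back as a member
dfpm dataset create users_backup_new
dfpm dataset add users_backup_new nofs3:/users

The protection policy and the secondary provisioning then had to be re-applied to the new dataset, and the backups started over from scratch.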

Thanks,

- Marc

marc_ho_ltx
6,065 Views

Hello Adai,

After correcting the hard-link limitation error numerous times by deleting the hard links in the volumes, I was hoping there is a better way to recover from this error.  Currently I remove the hard links and then have to delete the dataset and create a new dataset job to start the backups over from scratch.  Is there another method I can try to get the job working again without deleting the dataset, while still using the same backup volume?  Some kind of refresh?  The hard links are removed, but unless I remove the dataset it still comes up with the hard-link error.

Thank you for your time.

- Marc
