Active IQ Unified Manager Discussions
Hello, we recently upgraded our Protection Manager from DFM 4.0.2 to OnCommand 5.2. We installed a fresh OS, made a backup of the old DFM database, and restored it to the new OnCommand server. From what I can tell, everything seems to be working fine. However, there is one dataset that takes snapshot backups to another NetApp that is failing. A similar dataset with the same relationship but different volumes succeeds. I do not see many details in the logs about the failure, except that there was an error.
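(For context, the backup and restore were done with the standard dfm backup CLI, something like the following; the backup name is just an example:)
old-server> dfm backup create dfm402_migration
new-server> dfm backup restore dfm402_migration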
Here is a sample email of the error:
=======================================================================
An Error event at 15 Jun 17:39 PDT on Qtree nofs3_nofs3_users on Volume nofs3_backup on Storage System nofs0.ltx-credence.com:
SnapMirror Update: Failed.
Click below to see the details of this event.
http://dfm.milpitas.credence.com:8080/start.html#st=1&data=(eventID=140750)
*** Event details follow.***
General Information
-------------------
DataFabric Manager server Serial Number: 1-50-124130
Alarm Identifier: 2
Event Fields
-------------
Event Identifier: 140750
Event Name: SnapMirror Update: Failed
Event Description: SnapMirror Update
Event Severity: Error
Event Timestamp: 15 Jun 17:39
Source of Event
---------------
Source Identifier: 2930
Source Name: nofs0:/nofs3_backup/nofs3_nofs3_users
Source Type: Qtree
Name of the host: nofs0.ltx-credence.com
Type of the host: Storage System
Host identifier: 2858
Event Arguments
---------------
datasetId: 2915
backupJobId: 24050
jobId: 24050
--NetApp DataFabric Manager
=======================================================================
Here is a list of the snapshots in one of the destination volumes for the dataset on the backup NetApp.
Any suggestions would be greatly appreciated.
Thanks,
- Marc
Hi Marc,
I don't see any correlation between this failure and the upgrade. Can you also paste the output of the job detail CLI for this job ID?
dfpm job detail 24050
The error is basically coming from the storage systems; Protection Manager is only relaying it back.
Regards
adai
Hello Adai,
The output of dfpm job detail 24050 is over 2000 lines.
If I grep for 'error', there are many error messages; these are the only lines that had anything after them.
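(The command was something like this, run on the DFM server:)
dfpm job detail 24050 | grep -i error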
Error Message:
Event Status: error
Error Message: nofs0.ltx-credence.com: replication destination hard link create failed
Error Message: LRD DIROPS
Event Status: error
Error Message: SnapMirror transfer failed.
Error Message:
I'm not sure if that will be helpful to you.
Thanks!
- Marc
Hi Marc,
Please redirect the output of the job to a file and upload it as an attachment.
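(Something along these lines should capture the whole thing; the file name is just an example:)
dfpm job detail 24050 > dfpm_job_24050.txt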
Regards
adai
Also, can you look through the SnapMirror logs on the target controller?
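(On a 7-Mode controller the SnapMirror log can usually be read from the console with something like:)
nofs0> rdfile /etc/log/snapmirror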
Hi Paul,
Here is the error message from the snapmirror log at the time of the dataset error.
slk Tue Jun 18 20:35:23 EDT state.qtree_softlock.nofs3_backup.0003ca71.028.nofs3_nofs3_users.src-qt.nofs0:/vol/nofs3_backup/nofs3_nofs3_users.00000000.-01.0 Softlock_delete (Transfer)
dst Tue Jun 18 20:35:24 EDT nofs3.ltx-credence.com:/vol/users/- nofs0:/vol/nofs3_backup/nofs3_nofs3_users Rollback_failed (replication destination hard link create failed)
dst Tue Jun 18 20:35:24 EDT nofs3.ltx-credence.com:/vol/users/- nofs0:/vol/nofs3_backup/nofs3_nofs3_users Abort (replication destination hard link create failed)
dst Tue Jun 18 20:55:25 EDT nofs3.ltx-credence.com:/vol/users2/- nofs0:/vol/nofs3_backup_2/nofs3_nofs3_users2 End (905892 KB)
dst Tue Jun 18 21:21:11 EDT nofs3.ltx-credence.com:/vol/users1/- nofs0:/vol/nofs3_backup_1/nofs3_nofs3_users1 End (2487300 KB)
Thanks,
- Marc
Hi Marc,
Sorry for the delay. I did some internal searching and found a similar bug. Here is the link:
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=624459
Though the bug report mentions SnapVault, I suspect it is related, because Qtree SnapMirror and SnapVault use the same replication engine as far as I know.
What version of ONTAP are you running? I also suggest you open a support case with NetApp for the same. This is purely an ONTAP error message and has nothing to do with Protection Manager.
Regards
adai
Hi Adai!
Thanks for the feedback and for pointing me to the bug. We are running ONTAP 7.3.6, so this is probably it.
I ran this command to locate all the hard links in the volume: find . -type f -links +1 | xargs ls -i. I redirected the output to a file and sorted it, and found over 400,000 hard links referencing a few files in someone's home directory.
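(For anyone hitting the same thing, the full pipeline was roughly this, run from an NFS client with the source volume mounted; the mount point is just an example. Using -exec instead of xargs also keeps file names with spaces intact:)
cd /mnt/users
find . -type f -links +1 -exec ls -i {} + | sort -n > hardlinks.txt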
I'll report back tomorrow after the job runs to see if it's successful.
Thanks!
- Marc
Hi Marc,
Good to know it helped. Let me know how the next update job goes.
Regards
adai
Hello Adai,
After removing the hard links I get the same error message: "replication destination hard link create failed". Do I need to re-initialize the SnapMirror relationship, or do something else to get it going again?
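For example (I'm guessing at the syntax here), would a manual resync of the failing relationship from the log above be the right move, something like:
nofs0> snapmirror resync -S nofs3:/vol/users/- nofs0:/vol/nofs3_backup/nofs3_nofs3_users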
Thanks,
- Marc
Hi Marc,
This is more of an ONTAP issue. I suggest you open a case against this bug, and support should be able to help you. Sorry that I couldn't help you on this.
Regards
adai
Hi Marc,
I just got confirmation from our folks internally that bug 624459 affects Qtree SnapMirror as well. Please open a case with NetApp and reference this bug to them.
Also, to find the problematic file, follow the public report for bug 624459 at the link I gave in my previous reply.
Regards
adai
Hello Adai,
Sorry for the late update. After removing the hard links, I ended up having to remove the volume from the dataset, create a new dataset, and place the volume in the newly created dataset. It has been working again since.
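For anyone searching later, the equivalent dfpm CLI steps would be roughly the following (the dataset and volume names here are made up):
dfpm dataset remove users_backup nofs3:/users
dfpm dataset create users_backup_new
dfpm dataset add users_backup_new nofs3:/users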
Thanks,
- Marc
Hello Adai,
After correcting the hard-link limitation error numerous times by deleting the hard links in the volumes, I was hoping there was a better way to recover from this error. Currently I remove the hard links and then have to delete the dataset and create a new dataset job, which starts the backups over from scratch. Is there another method I can try to get the job working again without deleting the dataset, while still using the same backup volume? Some kind of refresh? The hard links are removed, but unless I remove the dataset it still comes up with the hard-link error.
Thank you for your time.
- Marc