NDMP Backup Issue

Khan2022 · ‎2022-01-27

Dear Members,

We recently deployed NDMP backup in our environment. Initially there was a port blockage issue, which has been resolved, but there is still an issue with the NDMP backup. We are using Veritas NetBackup 8.2 and ONTAP 9.3 on the filer.

The NDMP backup works, but most of the time we get the following errors, as can be seen in the attached screenshot:

Error: ndmpagent ***** *****: DUMP: Message from Write Dirnet: Interrupted system call
Error: ndmpagent ***** *****: DUMP: DUMP IS ABORTED

Warning: DUMP: Total Dir to FH time spent is greater than 15 percent of phase 3 total time. Please verify

Could anyone who has seen this error while running an NDMP backup job share their experience? The job hangs: it neither completes nor fails.

Thanks

Ontapforrum · ‎2022-01-27

This is a very common scenario in NDMP (dump) backups of large volumes with dense folder structures or very many files. There have been some improvements, depending on which ONTAP version you are running, but I suggest you take the time to understand the issue properly.

There is a very good NetApp KB article, referenced below, that explains this exact problem.

Cause: If the total Dir to FH entry time is 15% or more of the total phase 3 time, this is considered file history backpressure in "phase 3". In other words, dump cannot continue writing data to the backup stream until the associated file history is completely ingested and acknowledged by the backup application (DMA).
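
To make the 15% rule above concrete, here is a minimal Python sketch of the arithmetic. The function names and the timing values are hypothetical and only illustrate the ratio; the real "Dir to FH" and phase 3 timings have to be taken from the filer's own backup log.

```python
# Threshold quoted in the KB excerpt: Dir to FH time >= 15% of phase 3 total time.
FH_BACKPRESSURE_THRESHOLD = 0.15


def fh_backpressure_ratio(dir_to_fh_seconds: float, phase3_seconds: float) -> float:
    """Return the share of phase 3 time spent handing file history to the DMA."""
    if phase3_seconds <= 0:
        raise ValueError("phase 3 total time must be positive")
    return dir_to_fh_seconds / phase3_seconds


def is_fh_backpressure(dir_to_fh_seconds: float, phase3_seconds: float) -> bool:
    """True when the ratio reaches the 15% threshold described above."""
    ratio = fh_backpressure_ratio(dir_to_fh_seconds, phase3_seconds)
    return ratio >= FH_BACKPRESSURE_THRESHOLD


if __name__ == "__main__":
    # Hypothetical timings in seconds, not taken from this thread.
    print(is_fh_backpressure(120.0, 600.0))  # ratio 0.20 -> True
    print(is_fh_backpressure(30.0, 600.0))   # ratio 0.05 -> False
```

A ratio at or above the threshold only says that file history ingestion is holding the dump back; it does not by itself say whether the DMA, the network, or the filer is the slow party.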

What is file-history?
File history enables a backup application or Data Management Application (DMA) to build an index database of all the files in a backup.

What is dump Phase III/3?
Dump writes the entire directory structure for the backup dataset to tape.

How is file history backpressure identified?
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Protection_and_Security/NDMP/FAQ%3A_NDMP_File_history

Backup logs for NDMP:
Backup log file - /mroot/etc/log/backup
NDMP log file - /mroot/etc/log/mlog/ndmpd.log
Data Management Application (DMA) logs [Depends on the backup application]

What are the NDMP backup phases?
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Protection_and_Security/NDMP/Network_Data_Management_Protocol_(NDMP)_%2F%2F_dump_phases_descript...

Khan2022 · ‎2022-01-27

Many thanks for the quick response.

Could you please advise whether the issue lies on the filer side, or whether some action needs to be taken on the DMA (NetBackup) side?

Please advise.

Khan2022 · ‎2022-02-02

Sorry for the late reply; I was on vacation.

Thanks for your wonderful advice, but as you can see we are still getting the errors shown below.

Any idea why we are getting this error: DUMP: Write to socket failed?

Thanks once again for your guidance and support.

Feb 2, 2022 10:08:40 PM - Info ndmpagent (pid=30230) hogarage: DUMP: Wed Feb 2 22:08:40 2022 : We have written 1722964 KB.
Feb 2, 2022 10:13:40 PM - Info ndmpagent (pid=30230) hogarage: DUMP: Wed Feb 2 22:13:40 2022 : We have written 3603134 KB.
Feb 2, 2022 10:20:20 PM - Error ndmpagent (pid=30230) hogarage: DUMP: Write to socket failed
Feb 2, 2022 10:20:20 PM - Error ndmpagent (pid=30230) hogarage: DUMP: DUMP IS ABORTED
Feb 2, 2022 10:20:20 PM - Warning ndmpagent (pid=30230) hogarage: DUMP: Total Dir to FH time spent is greater than 15 percent of phase 3 total time. Please verify the settings of backup application and the network connectivity.
Feb 2, 2022 10:20:24 PM - Info ndmpagent (pid=30230) hogarage: DUMP: Deleting "/HOSVM/HOSVM_OBIEE02/../snapshot_for_backup.729" snapshot.
Feb 2, 2022 10:20:24 PM - Error ndmpagent (pid=30230) hogarage: DUMP: deletion of "/vol/HOSVM_OBIEE02/../snapshot_for_backup.729" snapshot failed: Device busy
Feb 2, 2022 10:20:24 PM - Error ndmpagent (pid=30230) hogarage: DATA: Operation terminated: EVENT: INTERNAL ERROR (for /HOSVM/HOSVM_OBIEE02/)
Feb 3, 2022 10:46:26 AM - Error bptm (pid=30232) media manager terminated by parent process
Feb 3, 2022 10:46:27 AM - Error ndmpagent (pid=30230) NDMP backup failed, path = /HOSVM/HOSVM_OBIEE02/
Feb 3, 2022 10:46:30 AM - Info ro-bak-02.gosi.ins (pid=30232) StorageServer=PureDisk:ro-bak-02.gosi.ins; Report=PDDO Stats for (ro-bak-02.gosi.ins): scanned: 4646168 KB, CR sent: 54244 KB, CR sent over FC: 0 KB, dedup: 98.8%, cache disabled, where dedup space saving:95.4%, compression space saving:3.4%
Feb 3, 2022 10:46:30 AM - Error bpbrm (pid=30132) could not send server status message to client
Feb 3, 2022 10:46:30 AM - Info ndmpagent (pid=0) done. status: 150: termination requested by administrator
Feb 3, 2022 10:46:30 AM - end writing; write time: 12:46:05
termination requested by administrator (150)
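
For anyone reading a job log like the one above, the "We have written N KB" progress lines can be turned into a rough throughput figure. The following is a minimal Python sketch, assuming the line layout shown in this post; the function name is hypothetical, and the day and month names are parsed with the default (English) locale.

```python
import re
from datetime import datetime

# Matches the filer-side timestamp and KB counter in lines such as:
#   ... DUMP: Wed Feb 2 22:08:40 2022 : We have written 1722964 KB.
PROGRESS = re.compile(
    r"DUMP: (\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) : We have written (\d+) KB\."
)


def dump_rates_kb_per_s(lines):
    """Return the KB/s rate between each pair of consecutive progress lines."""
    samples = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            samples.append((stamp, int(match.group(2))))
    rates = []
    for (t0, kb0), (t1, kb1) in zip(samples, samples[1:]):
        elapsed = (t1 - t0).total_seconds()
        if elapsed > 0:
            rates.append((kb1 - kb0) / elapsed)
    return rates


if __name__ == "__main__":
    log = [
        "Feb 2, 2022 10:08:40 PM - Info ndmpagent (pid=30230) hogarage: "
        "DUMP: Wed Feb 2 22:08:40 2022 : We have written 1722964 KB.",
        "Feb 2, 2022 10:13:40 PM - Info ndmpagent (pid=30230) hogarage: "
        "DUMP: Wed Feb 2 22:13:40 2022 : We have written 3603134 KB.",
    ]
    for rate in dump_rates_kb_per_s(log):
        print(f"{rate:.1f} KB/s")  # 1880170 KB over 300 s -> 6267.2 KB/s
```

For the two progress lines in this log that works out to roughly 6,267 KB/s over the five-minute interval. The figure describes only how fast the dump stream was moving before the abort; it does not identify the cause.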

Ontapforrum · ‎2022-02-03

The root causes and solutions are already given in the KB article I mentioned; you might have missed them. Please do read that KB, as it has some very important information, especially on performance-related issues. To be fair, only you and your team are in a position to assess the underlying infrastructure: filer, network and DMA. If you want support to do the full investigation, I would advise logging a case with both NetApp and Veritas; this will enable both teams to resolve the issue efficiently.

In short: latency in file history delivery or ingestion can slow down overall backup performance. In other words, dump cannot continue writing data to the backup stream until the associated file history is completely ingested and acknowledged by the backup application (DMA). In general, what we see is that the DMA (backup application) sends 'ABORT' to the filer. However, to establish where the problem lies, we need the complete logs.

Request: please do not copy and paste log excerpts into the thread; instead, attach the following logs in full.

FILER - Backup log file - /mroot/etc/log/backup
FILER - NDMP log file - /mroot/etc/log/mlog/ndmpd.log
MEDIA Server - NetBackup job log.

Khan2022 · ‎2022-02-03

Many thanks for the quick reply.

I will go through the KB article again, in case I missed something. I work in a government organization and, for security reasons, I cannot access the logs on the filer. All I can do is ask the storage admin to provide me with the logs, but he did not agree.

I believe he has already raised a support case with NetApp and will work on this issue in the coming week.

On the DMA side, I have already provided the logs to Veritas, and they told me clearly that these DUMP-related errors are coming from the filer and that there is nothing more they can do about them.

We have arranged a session with NetApp support, and I will try my best to share the logs with you.

My sincere thanks to you for considering this issue and replying to me.

To restate exactly what happens: the NDMP backup jobs hang, neither completing nor failing. After I kill the job manually and run it again at a different time, it runs successfully.

Ontapforrum · ‎2022-02-04

You're welcome. If you could arrange the filer logs and the logs from Veritas, that would allow us to see the full picture. Also, could you let me know the size of the volume being backed up and what it contains (is it a very dense folder/file structure)? As I understand it, you do not have direct access to the storage, so it makes sense to let Support take a look inside your filer. We need to check how the filer is doing (peak loads), how the jobs are staggered, and so on. When the data mover is idle for long enough, depending on the backup application, it can time out as well, and then the DMA will send the ABORT to the filer. So there are a lot of aspects to check before we can come to any conclusion. But, as you said, some jobs do go through, which suggests resource contention either on the media server (indexing) or on the filer side. Keep us updated.

Ontapforrum · ‎2022-01-27

No worries. Performance-related issues rarely have quick fixes. While you are going through the content I shared, you could forward me the logs I suggested, and I will review them.

Logs to review:

Backup log file - /mroot/etc/log/backup
NDMP log file - /mroot/etc/log/mlog/ndmpd.log
Data Management Application (DMA) logs: NetBackup logs