ndmpcopy timeout for local 8TB directory transfer

parityerror · ‎2014-09-04

Hi,

I am attempting to perform a local ndmpcopy (i.e. within the same filer) of a subdirectory within a qtree from one volume to another, but receiving a timeout error. Any help or assistance would be hugely appreciated!

First, some background info:

NetApp FAS3240-AE

ONTAP 8.1.4P1 7-mode

64-bit aggregates & volumes

3TB SATA disks. Volumes are 46TB, aggregates 62TB. Data is served via CIFS, a low amount of network IO (under 1Gbps)

TOASTER2> ndmpcopy -sa root:netapp01 -da root:netapp01 172.28.12.157:/vol/vol_02_02/qt1/1099_resources 172.28.12.157:/vol/vol_02_10/qt1/1099_resources

Ndmpcopy: Starting copy [ 6 ] ...

Ndmpcopy: 172.28.12.157: Notify: Connection established

Ndmpcopy: 172.28.12.157: Connect: Authentication successful

Ndmpcopy: 172.28.12.157: Log: DUMP: creating "/vol/vol_02_02/../snapshot_for_backup.6" snapshot.

Ndmpcopy: 172.28.12.157: Log: DUMP: Using subtree dump

Ndmpcopy: 172.28.12.157: Log: DUMP: Date of this level 0 dump: Thu Sep 4 10:23:27 2014.

Ndmpcopy: 172.28.12.157: Log: DUMP: Date of last level 0 dump: the epoch.

Ndmpcopy: 172.28.12.157: Log: DUMP: Dumping /vol/vol_02_02/qt1/1099_resources to NDMP connection

Ndmpcopy: 172.28.12.157: Log: DUMP: mapping (Pass I)[regular files]

The directory being moved is about 8TB in size and has roughly 10 million files.

We have performed this sort of transfer successfully twice in this environment already on similar directory migrations.

After about 14 hours (without progressing any further) it returns the following error:

Not heard from other end. Timing out

Ndmpcopy: Notify: Transfer failed

Ndmpcopy: Done

TOASTER2>

The first time I ran the transfer (and on previous occasions), the system has defaulted to e0M (which I believe shouldn't matter as it is a local transfer)

The network interface I specified for the most recent attempt is a VLAN on a 2 x 10g LACP interface group, just to try and eliminate e0M as a bottleneck.

Below are the log files (note, the Error in data_listen_reply is very early on in the transfer)

TOASTER2>rdfile /etc/log/ndmpcopy.20140904

Thu Sep 4 10:20:36 EST 2014 [ndmpc:5]: Starting ndmpcopy operation

Thu Sep 4 10:20:38 EST 2014 [ndmpc:5]: Source version: 4

Thu Sep 4 10:20:38 EST 2014 [ndmpc:5]: Destination version: 4

Thu Sep 4 10:20:38 EST 2014 [ndmpc:5]: Ndmpcopy: Auto negotiated to version 4

Thu Sep 4 10:20:38 EST 2014 [ndmpc:5]: Error in data_listen_reply: Error: unknown error

Thu Sep 4 10:23:24 EST 2014 [ndmpc:6]: Starting ndmpcopy operation

Thu Sep 4 10:23:26 EST 2014 [ndmpc:6]: Source version: 4

Thu Sep 4 10:23:26 EST 2014 [ndmpc:6]: Destination version: 4

Thu Sep 4 10:23:26 EST 2014 [ndmpc:6]: Ndmpcopy: Auto negotiated to version 4

TOASTER2> rdfile /etc/log/ndmpcopy.20140903

Wed Sep 3 13:30:13 EST 2014 [ndmpc:4]: Starting ndmpcopy operation

Wed Sep 3 13:30:15 EST 2014 [ndmpc:4]: Source version: 4

Wed Sep 3 13:30:15 EST 2014 [ndmpc:4]: Destination version: 4

Wed Sep 3 13:30:15 EST 2014 [ndmpc:4]: Ndmpcopy: Auto negotiated to version 4

Wed Sep 3 13:30:15 EST 2014 [ndmpc:4]: Error in data_listen_reply: Error: unknown error

TOASTER2> rdfile /etc/log/ndmpcopy.20140905

Fri Sep 5 07:32:40 EST 2014 [ndmpc:7]: Starting ndmpcopy operation

Fri Sep 5 07:32:42 EST 2014 [ndmpc:7]: Source version: 4

Fri Sep 5 07:32:42 EST 2014 [ndmpc:7]: Destination version: 4

Fri Sep 5 07:32:42 EST 2014 [ndmpc:7]: Ndmpcopy: Auto negotiated to version 4

The next command was run about 10 minutes after the ndmpcopy was commenced. The data counter didn't seem to increase when I ran the same command a few hours later.

TOASTER2> ndmpd probe

ndmpd ON.

Session: 96166

isActive: TRUE

protocol version: 4

effHost: Local

authorized: TRUE

client addr: 172.28.12.157:35360

spt.device_id: none

spt.ha: -1

spt.scsi_id: -1

spt.scsi_lun: -1

tape.device: not open

tape.mode: Read only

mover state: Idle

mover.mode: Data

mover.pauseReason N/A

mover.haltReason N/A

mover.recordSize: 0

mover.recordNum: 0

mover.bytesMoved: 0

mover.seekPosition: 0

mover.bytesLeftToRead: 0

mover.windowOffset: 0

mover.windowLength: 0

mover.position: 0

mover.setRecordSizeFlag: false

mover.setWindowFlag: false

mover.connect.addr_type: LOCAL

data.operation: Backup

data.state: Active

data.haltReason: N/A

data.connect.addr_type: TCP

data.connect.addr: 172.28.12.157

data.connect.port: 26198

data.bytesProcessed: 1024

Session: 96167

isActive: TRUE

protocol version: 4

effHost: Local

authorized: TRUE

client addr: 172.28.12.157:35361

spt.device_id: none

spt.ha: -1

spt.scsi_id: -1

spt.scsi_lun: -1

tape.device: not open

tape.mode: Read only

mover state: Idle

mover.mode: Data

mover.pauseReason N/A

mover.haltReason N/A

mover.recordSize: 0

mover.recordNum: 0

mover.bytesMoved: 0

mover.seekPosition: 0

mover.bytesLeftToRead: 0

mover.windowOffset: 0

mover.windowLength: 0

mover.position: 0

mover.setRecordSizeFlag: false

mover.setWindowFlag: false

mover.connect.addr_type: LOCAL

data.operation: Restore

data.state: Active

data.haltReason: N/A

data.connect.addr_type: TCP

data.connect.addr: 172.28.12.157

data.connect.port: 26199

data.bytesProcessed: 0

TOASTER2> ndmpd status

ndmpd ON.

Session: 96166

Active

version: 4

Operating on behalf of primary host.

tape.device: Not open

mover state: Idle

data state: Active

data operation: Backup

Session: 96167

Active

version: 4

Operating on behalf of primary host.

tape.device: Not open

mover state: Idle

data state: Active

data operation: Restore

TOASTER2> snap list -V vol_02_02

Volume vol_02_02

working...

%/used %/total date name

---------- ---------- ------------ --------

0% ( 0%) 0% ( 0%) Sep 04 10:23 snapshot_for_backup.6 (busy,backup[0],dump)

TOASTER2> options ndmp

ndmpd.access all

ndmpd.authtype challenge

ndmpd.connectlog.enabled off

ndmpd.data_port_range all

ndmpd.enable on

ndmpd.fh_node_retry_interval 250

ndmpd.ignore_ctime.enabled off

ndmpd.maxversion 4

ndmpd.offset_map.enable on

ndmpd.password_length 16

ndmpd.preferred_interface disable (value might be overwritten in takeover)

ndmpd.tcpnodelay.enable off

ndmpd.tcpwinsize 32768

TOASTER2> df -Vg vol_02_02

Filesystem total used avail capacity Mounted on

/vol/vol_02_02/ 46079GB 33751GB 12328GB 73% /vol/vol_02_02/

snap reserve 0GB 34GB 0GB ---% /vol/vol_02_02/..

TOASTER2> df -Vg vol_02_10

Filesystem total used avail capacity Mounted on

/vol/vol_02_10/ 12500GB 0GB 12499GB 0% /vol/vol_02_10/

/vol/vol_02_10/.snapshot 0GB 0GB 0GB ---% /vol/vol_02_10/.snapshot

ekashpureff · ‎2014-09-05

Parity Error -

Try using the localhost, 127.0.0.1, instead ?

I hope this response has been helpful to you.

At your service,

Eugene E. Kashpureff, Sr.

Independent NetApp Consultant, K&H Research http://www.linkedin.com/in/eugenekashpureff

Senior NetApp Instructor, IT Learning Solutions http://sg.itls.asia/netapp

(P.S. I appreciate points for helpful or correct answers.)

parityerror · ‎2014-09-07

Hi Eugene,

I retried the transfer (using the same command as the previous attempts) when I posted this question and it looks like it has been successful (data is still being transferred).

I have a few more of these to go so I'll try using localhost next time and see if that makes any difference to the outcome.

Interestingly, when this was successful, the first pass (mapping regular files) only took 7 hours (well below the "14 hours before timeout previously observed). So perhaps there was some sort of early termination of one side fo the NDMP copy thus resulting in the timeout.

The unsuccessful ndmpcopy attempts had logged the following messages "Error in data_listen_reply: Error: unknown error"

in the ndmpcopy log files, very early on in the transfer. Perhaps this is a good indication that the transfer is broken and will likely time out after 14 hours. The successful copy did not include that error.

Anyway, thats heaps for the suggestion, I'll try it next time I run an ndmpcopy

ndmpcopy timeout for local 8TB directory transfer

NetApp Support Site Wins Silver Stevie® Award

Sign-up for Software Release Notifications!