ONTAP Hardware

ndmpcopy timeout for local 8TB directory transfer

parityerror
4,891 Views

Hi,

I am attempting to perform a local ndmpcopy (i.e. within the same filer) of a subdirectory within a qtree from one volume to another, but receiving a timeout error. Any help or assistance would be hugely appreciated!

First, some background info:

NetApp FAS3240-AE

ONTAP 8.1.4P1 7-mode

64-bit aggregates & volumes

3TB SATA disks. Volumes are 46TB, aggregates 62TB. Data is served via CIFS, a low amount of network IO (under 1Gbps)

TOASTER2> ndmpcopy -sa root:netapp01 -da root:netapp01 172.28.12.157:/vol/vol_02_02/qt1/1099_resources 172.28.12.157:/vol/vol_02_10/qt1/1099_resources

Ndmpcopy: Starting copy [ 6 ] ...

Ndmpcopy: 172.28.12.157: Notify: Connection established

Ndmpcopy: 172.28.12.157: Notify: Connection established

Ndmpcopy: 172.28.12.157: Connect: Authentication successful

Ndmpcopy: 172.28.12.157: Connect: Authentication successful

Ndmpcopy: 172.28.12.157: Log: DUMP: creating "/vol/vol_02_02/../snapshot_for_backup.6" snapshot.

Ndmpcopy: 172.28.12.157: Log: DUMP: Using subtree dump

Ndmpcopy: 172.28.12.157: Log: DUMP: Date of this level 0 dump: Thu Sep  4 10:23:27 2014.

Ndmpcopy: 172.28.12.157: Log: DUMP: Date of last level 0 dump: the epoch.

Ndmpcopy: 172.28.12.157: Log: DUMP: Dumping /vol/vol_02_02/qt1/1099_resources to NDMP connection

Ndmpcopy: 172.28.12.157: Log: DUMP: mapping (Pass I)[regular files]

The directory being moved is about 8TB in size and has roughly 10 million files.

We have performed this sort of transfer successfully twice in this environment already on similar directory migrations.

After about 14 hours (without progressing any further) it returns the following error:

Not heard from other end. Timing out

Ndmpcopy: Notify: Transfer failed

Ndmpcopy: Done

TOASTER2>

The first time I ran the transfer (and on previous occasions), the system has defaulted to e0M (which I believe shouldn't matter as it is a local transfer)

The network interface I specified for the most recent attempt is a VLAN on a 2 x 10g LACP interface group, just to try and eliminate e0M as a bottleneck.

Below are the log files (note, the Error in data_listen_reply is very early on in the transfer)

TOASTER2>rdfile /etc/log/ndmpcopy.20140904

Thu Sep  4 10:20:36 EST 2014 [ndmpc:5]: Starting ndmpcopy operation

Thu Sep  4 10:20:38 EST 2014 [ndmpc:5]: Source version: 4

Thu Sep  4 10:20:38 EST 2014 [ndmpc:5]: Destination version: 4

Thu Sep  4 10:20:38 EST 2014 [ndmpc:5]: Ndmpcopy: Auto negotiated to version 4

Thu Sep  4 10:20:38 EST 2014 [ndmpc:5]: Error in data_listen_reply: Error: unknown error

Thu Sep  4 10:23:24 EST 2014 [ndmpc:6]: Starting ndmpcopy operation

Thu Sep  4 10:23:26 EST 2014 [ndmpc:6]: Source version: 4

Thu Sep  4 10:23:26 EST 2014 [ndmpc:6]: Destination version: 4

Thu Sep  4 10:23:26 EST 2014 [ndmpc:6]: Ndmpcopy: Auto negotiated to version 4

TOASTER2> rdfile /etc/log/ndmpcopy.20140903

Wed Sep  3 13:30:13 EST 2014 [ndmpc:4]: Starting ndmpcopy operation

Wed Sep  3 13:30:15 EST 2014 [ndmpc:4]: Source version: 4

Wed Sep  3 13:30:15 EST 2014 [ndmpc:4]: Destination version: 4

Wed Sep  3 13:30:15 EST 2014 [ndmpc:4]: Ndmpcopy: Auto negotiated to version 4

Wed Sep  3 13:30:15 EST 2014 [ndmpc:4]: Error in data_listen_reply: Error: unknown error

TOASTER2> rdfile /etc/log/ndmpcopy.20140905

Fri Sep  5 07:32:40 EST 2014 [ndmpc:7]: Starting ndmpcopy operation

Fri Sep  5 07:32:42 EST 2014 [ndmpc:7]: Source version: 4

Fri Sep  5 07:32:42 EST 2014 [ndmpc:7]: Destination version: 4

Fri Sep  5 07:32:42 EST 2014 [ndmpc:7]: Ndmpcopy: Auto negotiated to version 4

The next command was run about 10 minutes after the ndmpcopy was commenced. The data counter didn't seem to increase when I ran the same command a few hours later.

TOASTER2> ndmpd probe

ndmpd ON.

Session: 96166

  isActive:                  TRUE

  protocol version:          4

  effHost:                   Local

  authorized:                TRUE

  client addr:               172.28.12.157:35360

  spt.device_id:             none

  spt.ha:                    -1

  spt.scsi_id:               -1

  spt.scsi_lun:              -1

  tape.device:               not open

  tape.mode:                 Read only

  mover state:               Idle

  mover.mode:                Data

  mover.pauseReason          N/A

  mover.haltReason           N/A

  mover.recordSize:          0

  mover.recordNum:           0

  mover.bytesMoved:          0

  mover.seekPosition:        0

  mover.bytesLeftToRead:     0

  mover.windowOffset:        0

  mover.windowLength:        0

  mover.position:            0

  mover.setRecordSizeFlag:   false

  mover.setWindowFlag:       false

  mover.connect.addr_type:   LOCAL

  data.operation:            Backup

  data.state:                Active

  data.haltReason:           N/A

  data.connect.addr_type:    TCP

  data.connect.addr:         172.28.12.157

  data.connect.port:         26198

  data.bytesProcessed:       1024

Session: 96167

  isActive:                  TRUE

  protocol version:          4

  effHost:                   Local

  authorized:                TRUE

  client addr:               172.28.12.157:35361

  spt.device_id:             none

  spt.ha:                    -1

  spt.scsi_id:               -1

  spt.scsi_lun:              -1

  tape.device:               not open

  tape.mode:                 Read only

  mover state:               Idle

  mover.mode:                Data

  mover.pauseReason          N/A

  mover.haltReason           N/A

  mover.recordSize:          0

  mover.recordNum:           0

  mover.bytesMoved:          0

  mover.seekPosition:        0

  mover.bytesLeftToRead:     0

  mover.windowOffset:        0

  mover.windowLength:        0

  mover.position:            0

  mover.setRecordSizeFlag:   false

  mover.setWindowFlag:       false

  mover.connect.addr_type:   LOCAL

  data.operation:            Restore

  data.state:                Active

  data.haltReason:           N/A

  data.connect.addr_type:    TCP

  data.connect.addr:         172.28.12.157

  data.connect.port:         26199

  data.bytesProcessed:       0

TOASTER2> ndmpd status

ndmpd ON.

Session: 96166

  Active

  version:        4

  Operating on behalf of primary host.

  tape.device:    Not open

  mover state:    Idle

  data state:     Active

  data operation: Backup

Session: 96167

  Active

  version:        4

  Operating on behalf of primary host.

  tape.device:    Not open

  mover state:    Idle

  data state:     Active

  data operation: Restore

TOASTER2> snap list -V vol_02_02

Volume vol_02_02

working...

  %/used       %/total  date          name

----------  ----------  ------------  --------

  0% ( 0%)    0% ( 0%)  Sep 04 10:23  snapshot_for_backup.6 (busy,backup[0],dump)

TOASTER2> options ndmp

ndmpd.access                 all

ndmpd.authtype               challenge

ndmpd.connectlog.enabled     off

ndmpd.data_port_range        all

ndmpd.enable                 on

ndmpd.fh_node_retry_interval 250

ndmpd.ignore_ctime.enabled   off

ndmpd.maxversion             4

ndmpd.offset_map.enable      on

ndmpd.password_length        16

ndmpd.preferred_interface    disable    (value might be overwritten in takeover)

ndmpd.tcpnodelay.enable      off

ndmpd.tcpwinsize             32768

TOASTER2> df -Vg vol_02_02

Filesystem               total       used      avail capacity  Mounted on

/vol/vol_02_02/        46079GB    33751GB    12328GB      73%  /vol/vol_02_02/

snap reserve               0GB       34GB        0GB     ---%  /vol/vol_02_02/..

TOASTER2> df -Vg vol_02_10

Filesystem               total       used      avail capacity  Mounted on

/vol/vol_02_10/        12500GB        0GB    12499GB       0%  /vol/vol_02_10/

/vol/vol_02_10/.snapshot        0GB        0GB        0GB     ---%  /vol/vol_02_10/.snapshot

2 REPLIES 2

ekashpureff
4,891 Views

Parity Error -

Try using the localhost, 127.0.0.1, instead ?

I hope this response has been helpful to you.

At your service,

Eugene E. Kashpureff, Sr.

Independent NetApp Consultant, K&H Research http://www.linkedin.com/in/eugenekashpureff

Senior NetApp Instructor, IT Learning Solutions http://sg.itls.asia/netapp

(P.S. I appreciate points for helpful or correct answers.)

parityerror
4,891 Views

Hi Eugene,

I retried the transfer (using the same command as the previous attempts) when I posted this question and it looks like it has been successful (data is still being transferred).

I have a few more of these to go so I'll try using localhost next time and see if that makes any difference to the outcome.

Interestingly, when this was successful, the first pass (mapping regular files) only took 7 hours (well below the "14 hours before timeout previously observed). So perhaps there was some sort of early termination of one side fo the NDMP copy thus resulting in the timeout.

The unsuccessful ndmpcopy attempts had logged the following messages "Error in data_listen_reply: Error: unknown error"

in the ndmpcopy log files, very early on in the transfer. Perhaps this is a good indication that the transfer is broken and will likely time out after 14 hours. The successful copy did not include that error.

Anyway, thats heaps for the suggestion, I'll try it next time I run an ndmpcopy

Public