
Slow local NDMP throughput

numbernine

Hi All,

hope you can advise.

We have a FAS2040 (Data ONTAP 8.0.2P2) at a DR site to which we snapmirror from our live site.

We have a volume for archiving general files from live, and another vol for archiving old virtual machines...

The DR filer is fibre-attached to a Quantum Scalar i80 with LTO-5 drives.

Also fibre-attached is a NetBackup 7.5 server.

We're seeing a consistent ceiling of approx 65MB/s using direct NDMP. I see people on here bemoaning 120MB/s limitations; if we could get anywhere near that I would be very happy.

So today I am doing an NDMP backup of old virtual machines - it's a very low file count, approx 900GB of data, and it's still only getting 65MB/s.

The underlying aggregate is only 13 x 7,200 RPM SATA disks. This is a capacity system, not a performance one. However, there's not a lot going on with this filer apart from receiving snapmirrors from live, so I'd be surprised if the disks can't push harder than that.

Are there any methods to diagnose where the bottleneck is?

Am I expecting too much from these disks?

Are there any tuning parameters I can change?

Lastly, in case I have a fit of bookishness, is there a good tech doc covering NDMP performance that I might reasonably understand? (I have a lot of other systems to look after too and am not a NetApp expert!)

Many thanks in advance

Andy

9 REPLIES

brendanheading

Andy,

Can you run sysstat -x 1 on the filer for a few minutes while a backup is in progress and post the output here? We can then see whether you are hitting any bottlenecks.
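
Something along these lines (the prompt name is just an example):

dr-filer> sysstat -x 1     # one-second samples; Ctrl-C to stop after a few minutes

The interesting columns will be Disk read/write, Tape write, CP ty and Disk util.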

There are a couple of different ways that NDMP can be configured: the filer can stream the backup data over the network to the backup server, which then forwards it to tape (three-way NDMP), or the backup server can direct the filer to send the backup data straight to its own tape drives (local NDMP). Can you confirm which of the two modes you have set up? This should be a configuration option in NetBackup somewhere.
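
On the filer side you can at least confirm NDMP is up and watch the sessions while a backup runs - something like this (prompt name is illustrative):

dr-filer> ndmpd status     # is ndmpd running, and what sessions are active
dr-filer> options ndmpd    # current ndmpd.* option settings

In NetBackup the mode usually follows from the storage unit: an NDMP storage unit pointing at the filer's own drives gives local NDMP, while sending to drives on the media server is three-way.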

Another experiment worth trying would be the dump command (see the command reference manual for usage and examples). The backup speed you get with dump should help isolate the problem to either the filer itself or the network.
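
Roughly this shape, for example - the device name and volume path here are made up, and sysconfig -t will list your real tape devices:

dr-filer> sysconfig -t                    # list the attached tape drives and device names
dr-filer> dump 0f rst0a /vol/vol_archive  # level-0 dump of the volume straight to a drive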

numbernine

Thanks for the quick response Brendan, here's the sysstat output:

CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s
                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out
85%     13      0      0      13   63531    959   73896  84852       0  65012    39s    89%  100%  :f   99%       0      0      0       0      0       0      0
73%     26      0      0      26   59918    786   62102  65710       0  48448     4s    90%  100%  :f  100%       0      0      0       0      0       0      0
72%     50      0      0      50   37350    787   82908  74524       0  70255     4s    86%  100%  :f  100%       0      0      0       0      0       0      0
48%     85      0      0      90   27986    695   59512   1740       0  53215    40s    87%  100%  :v   99%       5      0      0       0      0       0      0
76%     55      0      0      55   38801    698   59936  70328       0  53106     4s    90%   99%  Hf  100%       0      0      0       0      0       0      0
70%     20      0      0      20   41415    655   69409  86809       0  50043     4s    89%  100%  :f  100%       0      0      0       0      0       0      0
88%     69      0      0      69   56853    862   86225  77344       0  80499     4s    85%  100%  :f  100%       0      0      0       0      0       0      0

It should be configured to write directly to tape (local NDMP), not via the media server. You can see that the disk read and tape write pretty much correlate. I think that means it's working as intended - could you confirm?

Will have a read up on the dump command...

Thanks

Andy

aborzenkov

Is the system also a snapmirror destination? It appears to be extremely busy writing to disk. Tape is unlikely to be the bottleneck here.

You could try a dump to null just to test how fast it can read from the disks. There was a KB article on how to do it - search kb.netapp.com.
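
If you don't want to hunt for the article, the command is roughly this shape (volume name is an example):

dr-filer> dump 0f null /vol/vol_vmarchive   # level-0 dump that discards the data: a pure disk-read test

If that also tops out around 65MB/s, the limit is the disks (or CPU), not tape or NDMP.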

numbernine

Hi there, I think this forum scrunches the data.

The disk read rate is actually very low - I hope this screenshot clears it up.

brendanheading

Your two readouts seem to be very different, as if the filer was doing different things.

The first paste of the text shows lots of data coming in over the network and being written to disk, and high CPU utilization, which, as aborzenkov says, looks like an incoming snapmirror update. What are your snapmirror schedules? You'd normally want to be very sure that the snapmirror updates are complete before you start your tape dump.
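
You can check whether an update is in flight, and what the schedule is, with something like this (prompt name is illustrative):

dr-filer> snapmirror status             # lag times, and whether a transfer is running right now
dr-filer> rdfile /etc/snapmirror.conf   # the update schedule lives here in 7-mode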

The second readout, the screenshot, shows minimal data coming in over the network and being written to disk, but, as you have said, correlated disk read and tape write numbers.

It's notable in both cases that the disk utilization is typically in the 90s, often 100%. Note that this is the utilization of the busiest disk (not all the disks). This may be a hint that the disks are being maxed out, which in turn may be a hint that your data is heavily fragmented and there's a lot of seeking going on.

A couple of further questions: what's the output of sysconfig -r? (that shows your RAID config). Did you gradually add disks to the aggregate to grow it, or did you add them all at once? How full are the aggregate and the volumes within it? (df -A and df -r will show this - see the sketch below). Are you using volume or qtree snapmirror? If volume snapmirror, how many snapshots are being retained on the source volume? And are you doing full NDMP backups or differentials?
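
In other words, roughly this set (the volume name is made up):

dr-filer> sysconfig -r             # raid groups, disk layout, any reconstructing disks
dr-filer> df -A                    # how full the aggregate is
dr-filer> df -r                    # how full the volumes are, including reserve
dr-filer> snap list vol_archive    # how many snapshots the volume is carrying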

If you're doing very regular snapmirror updates to an aggregate/volume which is nearly full, there might be a lot of fragmentation going on, and if the aggregate was grown gradually over time that would exacerbate the problem.
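
If you want to put a number on the fragmentation, I believe reallocate can measure it. Note that a volume snapmirror destination is read-only, so you'd run this against the source volume on the live filer (hostname and volume name are made up):

live-filer> reallocate measure /vol/vol_archive   # logs an optimization rating when the scan completes

With volume snapmirror the destination inherits the source's physical layout, so a fragmented source means a fragmented destination.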

scottgelb

Dump to null is a good test of the maximum throughput the filer can dump while bypassing tape (although, as mentioned, tape isn't the bottleneck): "dump 0f null /vol/volname".

With a snapmirror target there is WAFL deswizzling... that can really affect backups through contention for I/O. If you run "priv set advanced ; wafl scan status", do you see deswizzling operations? (See the sketch below.) It would be good to schedule backups offset from snapmirror updates to work around deswizzling. A perfstat during the backup and a performance case will help identify this too, and troubleshoot any other bottlenecks.
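
i.e. something like:

dr-filer> priv set advanced
dr-filer*> wafl scan status    # look for "volume deswizzling" entries on the mirrored volumes
dr-filer*> priv set            # drop back to admin when done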

For a FAS2040 with SATA that is also a snapmirror target, this doesn't sound like unexpected performance, but some of the tuning and workarounds listed above may help speed it up.

brendanheading

Andy, were we able to move this along at all for you?

numbernine

Hi All,

Thanks for all the replies. I've not been able to do a dump to null as a bunch of other stuff has come up (in the shape of a failed NetBackup migration - hooray).

Right now, I'd settle for 3MB/s and a working backup system

Fridays used to be nice

Will update when I get some happier news.

Andy

brendanheading

Sorry to hear that Andy, best of luck sorting that out.

(Ever thought about getting rid of the tape side of things altogether? You could hook up a rake of cheap SATA drives to your 2040 and use SnapVault to retain the archives. Deduplication is the big thing that makes this possible.)
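
A rough sketch of what that could look like in 7-mode - hostnames, volume and qtree names here are all made up:

dr-filer> snapvault start -S livefiler:/vol/vol_archive/qt1 /vol/sv_archive/qt1   # initial baseline transfer
dr-filer> snapvault snap sched -x sv_archive sv_weekly 52@sun                     # keep 52 weekly archive snapshots
dr-filer> sis on /vol/sv_archive                                                  # enable dedupe on the vault volume
dr-filer> sis start -s /vol/sv_archive                                            # dedupe the data already there

You'd size the retention to match whatever your tape rotation gives you today.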
