Slow local NDMP throughput

numbernine · ‎2012-04-11

Hi All,

hope you can advise.

We have a FAS2040 (Data ONTAP 8.0.2P2) at a DR site, to which we SnapMirror from our live site.

We have a volume for archiving general files from live and another vol for archiving old Virtual machines...

The DR filer is fibre attached to a Quantum Scalar i80 with LTO5 drives

Also fibre attached is a NetBackup 7.5 server

We're seeing a consistent ceiling of approx 65MB/s using direct NDMP. I see people on here bemoaning 120MB/s limitations; if we could get anywhere near that I would be very happy.

So today I am doing an ndmp backup of old virtual machines - it's a very low file count, approx 900GB of data and it's still only getting 65MB/s
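
For scale, here is a rough back-of-the-envelope sketch of what those rates mean for a 900GB job. It assumes 1 GB = 1000 MB and a perfectly steady stream, which a real backup will not be; the function name backup_hours is just illustrative.

```python
def backup_hours(data_gb: float, rate_mb_s: float) -> float:
    """Hours needed to stream data_gb gigabytes at rate_mb_s megabytes per second.

    Assumes 1 GB = 1000 MB and a constant rate for the whole job.
    """
    return data_gb * 1000 / rate_mb_s / 3600


# Compare the observed ceiling with the 120MB/s figure mentioned on the forum.
for rate in (65, 120):
    print(f"{rate} MB/s: {backup_hours(900, rate):.1f} h")
```

So the gap between 65MB/s and 120MB/s is a bit under two hours on a job of this size.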

The underlying aggregate is only 13 x 7200 rpm SATA disks. This is a capacity system, not a performance one. However, there's not a lot going on with this filer apart from receiving snapmirrors from live, so I'd be surprised if the disks can't push harder than that.

Are there any methods to diagnose where the bottleneck is?

Am I expecting too much from these disks?

Are there any tuning parameters I can change?

Lastly, in case I have a fit of bookishness, is there a good tech doc which covers NDMP performance that I might reasonably understand? (I have a lot of other systems to look after too and am not a netapp expert)!

Many thanks in advance

Andy

brendanheading · ‎2012-04-11

Andy,

Can you run sysstat -x 1 on the filer for a few minutes while a backup is in progress and post the output here? We can then see if you are hitting any bottlenecks.

There are a couple of different ways that NDMP can be configured: the filer can stream the backup data to the backup server, which then forwards it to tape, or the backup server can direct the filer to send the backup data directly to the tape. Can you confirm which of the two modes you have set it up in? This should be a configuration option in NetBackup somewhere.

Another experiment that might be worth trying would be to use the dump command (see the command reference manual for usage and examples). The backup speed you get with dump might isolate the problem to the filer itself or to the network.

numbernine · ‎2012-04-11

Thanks for the quick response Brendan, here's the sysstat output:

CPU  NFS CIFS HTTP Total    Net   kB/s   Disk   kB/s   Tape   kB/s Cache Cache    CP  CP  Disk OTHER  FCP iSCSI    FCP   kB/s  iSCSI   kB/s
                             in    out   read  write   read  write   age   hit  time  ty  util                      in    out     in    out
85%   13    0    0    13  63531    959  73896  84852      0  65012   39s   89%  100%  :f   99%     0    0     0      0      0      0      0
73%   26    0    0    26  59918    786  62102  65710      0  48448    4s   90%  100%  :f  100%     0    0     0      0      0      0      0
72%   50    0    0    50  37350    787  82908  74524      0  70255    4s   86%  100%  :f  100%     0    0     0      0      0      0      0
48%   85    0    0    90  27986    695  59512   1740      0  53215   40s   87%  100%  :v   99%     5    0     0      0      0      0      0
76%   55    0    0    55  38801    698  59936  70328      0  53106    4s   90%   99%  Hf  100%     0    0     0      0      0      0      0
70%   20    0    0    20  41415    655  69409  86809      0  50043    4s   89%  100%  :f  100%     0    0     0      0      0      0      0
88%   69    0    0    69  56853    862  86225  77344      0  80499    4s   85%  100%  :f  100%     0    0     0      0      0      0      0

It should be configured to write directly to tape (local) and not via the media server. You can see that the disk read and tape write figures pretty much correlate. I think that's correct; could you confirm?

Will have a read up on the dump command...

Thanks

Andy

aborzenkov · ‎2012-04-11

Is the system also a snapmirror destination? It appears to be extremely busy writing to disks. Tape is unlikely to be the bottleneck here.

You could try a dump to null just to test how fast it can read from disks. There was a KB article on how to do it; search kb.netapp.com.

numbernine · ‎2012-04-11

Hi there, I think this forum scrunches the data.

The Disk read rate is actually very low. I hope this screenshot clears it up

brendanheading · ‎2012-04-11

Your two readouts seem to be very different, as if the filer was doing different things.

The first paste of the text shows lots of data coming in over the network and being written to disk, and high CPU utilization, which as aborzenkov says looks like an incoming snapmirror update. What are your snapmirror schedules? You'd normally want to be very sure that the snapmirror updates are complete before you start your tape dump.

The second, the screenshot, shows minimal data coming in over the network and being written to disk but, as you have said, correlated disk read and tape write numbers.

It's notable in both cases that the disk utilization is typically in the 90s, often 100%. Note that this is the utilization of the busiest disk (not all the disks). This may be a hint that the disks are being maxed out. Which in turn may be a hint that your data is heavily fragmented and there's a lot of seeking going on.
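
To put rough numbers on the first paste, here is a minimal Python sketch that averages a few columns of the seven sysstat -x rows posted above. The field positions assume the 23-column layout shown in that paste; the names SAMPLES, FIELDS, parse and averages are illustrative only, and seven one-second samples are far too few to draw firm conclusions from.

```python
# The seven sysstat -x samples from the first paste, whitespace-separated.
SAMPLES = """\
85% 13 0 0 13 63531 959 73896 84852 0 65012 39s 89% 100% :f 99% 0 0 0 0 0 0 0
73% 26 0 0 26 59918 786 62102 65710 0 48448 4s 90% 100% :f 100% 0 0 0 0 0 0 0
72% 50 0 0 50 37350 787 82908 74524 0 70255 4s 86% 100% :f 100% 0 0 0 0 0 0 0
48% 85 0 0 90 27986 695 59512 1740 0 53215 40s 87% 100% :v 99% 5 0 0 0 0 0 0
76% 55 0 0 55 38801 698 59936 70328 0 53106 4s 90% 99% Hf 100% 0 0 0 0 0 0 0
70% 20 0 0 20 41415 655 69409 86809 0 50043 4s 89% 100% :f 100% 0 0 0 0 0 0 0
88% 69 0 0 69 56853 862 86225 77344 0 80499 4s 85% 100% :f 100% 0 0 0 0 0 0 0
"""

# 0-based positions of the columns of interest in a 23-column sysstat -x row.
FIELDS = {
    "cpu": 0,
    "net_in": 5,
    "disk_read": 7,
    "disk_write": 8,
    "tape_write": 10,
    "disk_util": 15,
}


def parse(text):
    """Turn raw sysstat -x rows into dicts of floats, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) != 23:
            continue  # header, blank, or truncated line
        rows.append({name: float(cols[i].rstrip("%")) for name, i in FIELDS.items()})
    return rows


def averages(rows):
    """Mean of each tracked column across all samples."""
    return {name: sum(r[name] for r in rows) / len(rows) for name in FIELDS}


rows = parse(SAMPLES)
avg = averages(rows)
for name in ("net_in", "disk_read", "disk_write", "tape_write"):
    print(f"{name:>10}: {avg[name]:8.0f} kB/s")
for name in ("cpu", "disk_util"):
    print(f"{name:>10}: {avg[name]:8.1f} %")
```

Over those samples the filer is taking in data from the network and writing to disk at a rate comparable to what it is sending to tape, which is what you would expect if a snapmirror update and the backup are competing for the same spindles.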

A couple of further questions: what's the output of sysconfig -r? (It shows your RAID config.) Did you gradually add disks to the aggregate to grow it, or did you add them all at once? How full are the aggregate and the volumes within it? (df -A and df -r will show this.) Are you using volume or qtree snapmirror? If you are using volume snapmirror, how many snapshots are being retained on the source volume? Are you doing full NDMP backups or differential backups?

If you're doing very regular snapmirror updates to an aggregate/volume which is nearly full, there might be a lot of fragmentation going on, and if the aggregate was grown gradually over time it would exacerbate the problem.

scottgelb · ‎2012-04-11

A dump to null is a good test of the maximum throughput the filer can dump at with tape out of the picture (although, as mentioned, tape isn't the bottleneck): "dump 0f null /vol/volname".

With a snapmirror target, there is WAFL deswizzling, which can really affect backups through contention for I/O. If you run "priv set advanced ; wafl scan status", do you see deswizzling operations? It would be good to schedule backups offset from snapmirror updates to work around deswizzling. A perfstat during backup and a performance case will help identify this too and troubleshoot any other bottlenecks.

For a FAS2040 with SATA that is also a snapmirror target, this doesn't sound like unexpected performance, but some tuning and workarounds like those listed may help speed it up.

brendanheading · ‎2012-04-13

Andy, were we able to move this along at all for you?

numbernine · ‎2012-04-13

Hi All,

Thanks for all the replies. I've not been able to do a dump to null as a bunch of other stuff has come up (in the shape of a failed NetBackup migration - hooray).

Right now, I'd settle for 3MB/s and a working backup system

Fridays used to be nice

Will update when I get some happier news.

Andy

brendanheading · ‎2012-04-13

Sorry to hear that Andy, best of luck sorting that out.

(Ever thought about getting rid of the tape side of things altogether? You could hook up a rake of cheap SATA drives to your 2040 and use SnapVault to retain the archives. Deduplication is the big thing that makes this possible.)