Why would NDMP/SMTAPE of a volume of LUNs be so slow while a volume of CIFS data is so fast?

pclayton99 · ‎2012-03-19

This past weekend was spent examining why the throughput of NDMP/SMTAPE operations varied so much.

I do not have an answer yet, just more mystery.

The configuration is three LTO-5 tape drives connected over a 4Gb SAN fabric to a FAS3240 and a Dell R815 (quad processor, 12 core, 256GB memory), with the NetApp able to perform NDMP/SMTAPE operations directly to tape.

What has been found is:

Using NetBackup V7 with NDMP/SMTAPE, the backup can go directly from the filer to tape without talking to the R815 server.
If the source volume (with dozens of snapshots within it) is a CIFS share with over 8 million files, it can be written to an LTO-5 tape drive at upwards of 113MB/sec.
If the source volume (with dozens of snapshots within it) contains LUNs used by our Exchange 2010 servers, the write to an LTO-5 tape drive does not get above 10MB/sec. I have seen it as low as 3MB/sec.
Both volumes can be on the same or different controllers. Does not make a difference.
Both volumes can be in the same aggregate. Does not make a difference.
The volumes can reside on 7.2K SATA or 15K SAS. Does not make a difference.
The volume sizes have ranged from hundreds of GB through 7TB. Does not make a difference.
This same ratio happens with multiple volumes of CIFS and Exchange data.
The NDMP/SMTAPE commanding can originate from NetBackup or from the filer command line using 'smtape backup ...'. Does not make a difference.
Under the covers it looks like NDMP is using snapmirror functionality to perform the data transport to tape.
There is no 'throttle' option for NDMP/SMTAPE operations; an error message is displayed saying so. I had thought the limit might be due to the 'options replication.*' values I had set.
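To put the rates above in perspective, here is a rough sketch of how long one terabyte would take to stream to tape at each of the three observed speeds. The function name and the 1 TB = 1024 × 1024 MB convention are assumptions for illustration, and it ignores tape changes and any start-up overhead.

```python
def hours_to_tape(volume_gb: float, rate_mb_per_sec: float) -> float:
    """Hours needed to stream volume_gb at a sustained rate.

    Assumes 1 GB = 1024 MB and a constant transfer rate.
    """
    seconds = volume_gb * 1024 / rate_mb_per_sec
    return seconds / 3600


# Rates reported in this thread: CIFS volume, Exchange LUN volume (best and worst).
for rate in (113, 10, 3):
    print(f"{rate:>4} MB/sec -> {hours_to_tape(1024, rate):6.1f} hours per TB")
```

At 113MB/sec a terabyte goes to tape in under three hours; at 10MB/sec it takes more than a day, and at 3MB/sec about four days.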

From what I can tell, the NDMP/SMTAPE operation takes a new snapshot so that a static version of the data can be analyzed and sent to tape.

The controllers are not 'beat', nothing glaring (that I could think to examine) on the filer.

I have opened a support case and sent them this information and perfstats output to try and solve this puzzle.

The question is why would there be such a drastic difference in the throughput due to having CIFS versus LUNs within the volume?

I have no current answer and am wondering if others have found the same thing and maybe the answer/solution to getting great throughput all the time?!

Thanks.

pdc

aborzenkov · ‎2012-03-19

Try increasing the buffer size.

http://www.symantec.com/business/support/index?page=content&id=TECH51967

http://www.symantec.com/business/support/index?page=content&id=HOWTO56152#v19527723

pclayton99 · ‎2012-03-19

The buffer size for these was 245760 for the NetBackup invoked operations and 240KB (based on documentation) for filer invoked operations.
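A quick check that the two figures quoted above describe the same buffer size. This assumes the NetBackup value is in bytes and that KB here means 1024 bytes; neither unit is stated explicitly in the post.

```python
NETBACKUP_BUFFER_BYTES = 245760  # value quoted for NetBackup-invoked operations (assumed bytes)
FILER_BUFFER_KB = 240            # value quoted for filer-invoked operations


def kb_to_bytes(kb: int) -> int:
    """Convert KB to bytes, assuming 1 KB = 1024 bytes."""
    return kb * 1024


same_size = kb_to_bytes(FILER_BUFFER_KB) == NETBACKUP_BUFFER_BYTES
print(same_size)  # True: 240 * 1024 == 245760
```

So, under those assumptions, both invocation paths were already using the same buffer size.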

pdc

stephan_troxler · ‎2012-03-19

I once had a similar problem with Exchange LUNs backed up through NDMP, since they can fragment heavily. Try a "reallocate measure" on the volume with the LUNs and check the layout. I think you will never reach the same performance backing up Exchange LUNs via NDMP as with other data.

pclayton99 · ‎2012-03-19

I have started the 'reallocate measure' jobs against the volumes to see what the fragmentation levels are and they are still running.

In light of the magic of what WAFL does with data writes, and with 512GB PAMII cards in the controllers, I cannot come up with a reason why the bandwidths should be as disparate as what we have encountered.

Actually, a good next question is if the PAMII cards are even in the data flow of an NDMP/SMTAPE operation when writing to tape?

A difference of 20 to 40MB/sec, maybe. Saw this just doing different CIFS volumes.

A difference between 7MB/sec and 120MB/sec? I cannot fathom a reason at this time.

Will post the fragmentation findings when they show up.

pdc

FRED_MANGELSDORF · ‎2012-03-29

The Flash Cache or PAM cards are *NOT* used when writing to the NetApp disks (that is, when reading from the tape), and they're also *NOT* used for large sequential reads (which is what will happen when you try to write a single large file like a LUN to a tape).

The reason the Flash Cache disregards writes to disk is that these writes are already 100% cached by the NVRAM and then flushed to the disks in the subsequent consistency point. The Flash Cache may get populated with the written data so that it can then serve as a read cache for future reads of that data.

The large sequential reads (which is, I think, what you were asking about) are considered to 'pollute' the cache, since on a backup you typically read the data just once, therefore storing this information in a Flash Cache would simply evict other (potentially valuable from a performance point) data and occupy that space for the backed-up data, which is not going to be read again in the near future (most probably).

This may also explain why you found such 'good' performance on the volume with the many small files.

From my experience this doesn't explain why you get such 'bad' performance (like the 10 or even 3 MB/s), since Data ONTAP will still read ahead in its RAM, even if it's not going to store that data in the Flash Cache. Is the system performing a lot of other work while you're running these tests? That could 'flush' the read-ahead data from RAM. I believe that a 3240 has 8 GB of RAM per head, so that's about the amount of caching available for read-aheads (for all operations).

pclayton99 · ‎2012-03-29

Frederik..

Yep, I know about the PAM cards not being in the data flow for writes to disk.

I can understand arguments for the PAM card being in or out of the data flow when reading data from disk.

I don't understand why the read of many small files would be good in this case. In one case it was over 8 million files spread out over a thin-provisioned volume within a two-shelf set of 1TB SATA disks, and this went to tape at 120+MB/sec! I would have fully expected the real disk fragmentation within WAFL from storing all those files to severely impact data throughput to tape.

For comparison, the Exchange data is on both SAS and SATA (different volumes) and thin provisioned at both the LUN and volume levels.

And yep, the FAS-3240 has 8GB of memory per controller.

pclayton99 · ‎2012-03-29

The 'reallocate measure' command against a 343GB Exchange volume completed within a day (back on 3/19) and reported values in the EMS log of:

wafl_reallocate_check_highAdvise_1

path="/vol/FC_Exchange_Data2"

optim="7"

hot_spot="0"

threshold="4"

The same command against a 5TB CIFS volume is still running the first pass!

Reallocation scans are on

/vol/homes1:

State: Checking: public inode 1523449 of 25894197, block 0 of 1159

Flags: whole_vol,measure_only,repeat

Threshold: 4

Schedule: n/a

Interval: 1 day

Optimization: n/a

Measure Log: n/a
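For anyone wanting to track a long-running scan like this one, here is a minimal sketch that parses status text in the shape pasted above and estimates how far the inode walk has got. The field layout is taken only from the paste in this thread (real 'reallocate status' output may be indented or formatted differently), and the function names are made up for illustration.

```python
import re

# Sample text copied from the status output above (blank lines removed).
SAMPLE = """\
Reallocation scans are on
/vol/homes1:
State: Checking: public inode 1523449 of 25894197, block 0 of 1159
Flags: whole_vol,measure_only,repeat
Threshold: 4
Schedule: n/a
Interval: 1 day
Optimization: n/a
Measure Log: n/a
"""


def parse_reallocate_status(text):
    """Return {path: {field: value}} from status text shaped like SAMPLE."""
    jobs = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/") and line.endswith(":"):
            # A path line such as "/vol/homes1:" starts a new job entry.
            current = jobs.setdefault(line[:-1], {})
        elif current is not None and ": " in line:
            # Split on the first ": " only; the State value contains colons.
            key, value = line.split(": ", 1)
            current[key] = value
    return jobs


def inode_progress(state):
    """Percentage of inodes checked, or None if the state has no inode counter."""
    match = re.search(r"inode (\d+) of (\d+)", state)
    if not match:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    return 100.0 * done / total


job = parse_reallocate_status(SAMPLE)["/vol/homes1"]
pct = inode_progress(job["State"])
print(f"{pct:.1f}% of inodes checked")
print("repeating job:", "repeat" in job["Flags"].split(","), "interval:", job["Interval"])
```

On the paste above this works out to roughly 6% of the inodes checked, which fits with the first pass still running.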

FRED_MANGELSDORF · ‎2012-03-29

Take care with the reallocate commands, since these tend to run a reallocate start every 24 hours, even if you initially requested just a measure. The good thing about this is that a frequent reallocate won't typically take long (except for the initial run).

Check for this with the command reallocate status. If you see an interval reported (typically 1 day), then you can do a reallocate stop <path> to get rid of the repeating task.

The reallocate status command will also report the fragmentation level. Usually, everything up to about 6 is acceptable. I've seen fragmented WAFL systems run up to 28 or more!

pclayton99 · ‎2012-03-30

The 'reallocate measure' command against the 8.1 million file CIFS, 5TB volume completed with the results being:

Reallocation scans are on

/vol/homes:

State: Idle

Flags: whole_vol,measure_only,repeat

Threshold: 4

Schedule: n/a

Interval: 1 day

Optimization: 2

Measure Log: n/a

Which to me says this volume is not in trouble fragmentation-wise.
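As a rough way to compare the two measurements so far, the sketch below rates a value against the threshold of 4 shown in the status output and the "up to about 6 is acceptable" rule of thumb given earlier in the thread. It assumes the optim="7" value from the EMS log and the Optimization field in the status output are the same metric; the function name and the wording of the labels are made up for illustration.

```python
def assess_fragmentation(optimization: float,
                         threshold: float = 4.0,
                         rule_of_thumb: float = 6.0) -> str:
    """Rate a 'reallocate measure' value.

    threshold     : the Threshold value shown in the status output (4 here)
    rule_of_thumb : "up to about 6 is acceptable", per the earlier reply
    """
    if optimization <= threshold:
        return "below threshold"
    if optimization <= rule_of_thumb:
        return "above threshold, within rule of thumb"
    return "above rule of thumb"


# Values reported in this thread.
for name, value in (("Exchange volume", 7), ("CIFS volume", 2)):
    print(f"{name}: {value} -> {assess_fragmentation(value)}")
```

By this reading the Exchange volume at 7 is only slightly past the acceptable range, far from the 28 or more mentioned earlier, which makes it hard to blame fragmentation alone for a more than tenfold drop in throughput.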

Curiously, this reallocate command took multiple days to complete, and for this run there was NO output in the EMS log like there was for the other one, which completed within a day and was reported earlier in this thread.

And just to muddy the waters even more, I am doing a Snapmirror of the volume that has the Exchange LUNs within it to another controller over a local 10Gb LAN to see how it behaves and how much data is being transferred. What I have found is that the best transfer rate has been 6.6MB/sec.

I actually started this last test as a Snapmirror between volumes within the same controller, and that transfer rate was under 10MB/sec, which rules out NDMP/SMTAPE as the cause of the poor throughput I am finding.

The mystery continues.

brendanheading · ‎2012-03-30

I think the CIFS vs LUN part must be a red herring. If you're using SMTAPE, you're effectively streaming a Volume Snapmirror to tape, which is a block-by-block operation, so it's not a question of file count (as it would be if you were using Qtree Snapmirror). Can you confirm that your local Snapmirror test is also Volume Snapmirror?

It's hard to believe that even heavy fragmentation would cause this sort of slowness. I'd have thought that with the WAFL magic (readsets etc.), plus the RAM and the Flash Cache at least caching your metadata (if not the heavy sequential reads themselves), you'd easily beat 3MB/sec.

Can you run sysstat -x 1 while the backup is running and post the output here? That said, I'm sure by the time you do, Support will have spotted something in your perfstats. If it's not a background task of some kind, it's got to be something weird, like misbehaving hardware or disk interconnects.

slattimo · ‎2012-03-31

I'd have to go with brendanheading on this one. Support will probably be your best bet. Make sure you provide a good baseline perfstat from a volume that is moving as desired and specify that one is "the daisy". That way, they can start from there. Also, you will want to include anything in between source and destination.