File-based backup from CIFS Vault shares causes high latency and utilization

thomasb82 · ‎2015-07-01

Hi guys,

We use SnapVault to replicate various volumes (NFS and CIFS) from a remote 8.3 cluster to our internal 8.3 cluster.

So far so good.

Now we would like to back up the replicated CIFS shares to tape with TSM.

Every time a backup starts on a read-only Vault share, we get warning and critical alerts regarding CIFS latency.

For comparison:

~ 140us on "normal" read-write CIFS shares

~ 502992us on the replicated VAULT CIFS shares
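To put the two figures side by side, here is a minimal Python sketch that converts them to milliseconds and computes the ratio. The function names are made up for illustration, and the only inputs are the two latency values quoted above.

```python
def us_to_ms(microseconds: float) -> float:
    """Convert microseconds to milliseconds."""
    return microseconds / 1000.0


def latency_ratio(slow_us: float, fast_us: float) -> float:
    """How many times higher the slow latency is than the fast one."""
    return slow_us / fast_us


# The two values quoted in this post (both in microseconds).
normal_share_us = 140
vault_share_us = 502_992

print(f"normal share: {us_to_ms(normal_share_us):.2f} ms")
print(f"vault share:  {us_to_ms(vault_share_us):.1f} ms")
print(f"ratio:        {latency_ratio(vault_share_us, normal_share_us):.0f}x")
```

In other words, the vault share sits at roughly half a second per operation, while the read-write share stays well under a millisecond.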

This also happens if tools like xcopy, robocopy etc. are used to copy the files.

Statistics (just 5 seconds) while doing backups:

cluster.cluster: 7/1/2015 21:44:00
cpu    cpu  total                     fcache             total   total  data    data    data  cluster  cluster  cluster    disk    disk   pkts    pkts
avg   busy    ops  nfs-ops  cifs-ops     ops  spin-ops    recv    sent  busy    recv    sent     busy     recv     sent    read   write   recv    sent
----  ----  -----  -------  --------  ------  --------  ------  ------  ----  ------  ------  -------  -------  -------  ------  ------  -----  ------
Minimums:
23%    41%   5239      611      4273       0      4241  5.56MB  4.86MB    0%  5.14MB  4.69MB       0%    142KB    120KB   528KB      0B   6302    8296
Averages for 15 samples:
28%    67%   6120     1488      4631       0      5111  7.26MB  49.2MB    3%  6.99MB  49.0MB       0%    282KB    280KB  13.1MB  6.94MB   9862   41362
Maximums:
40%    83%   7554     2927      4774       0      6514  12.2MB   137MB   11%  11.9MB   136MB       0%    664KB    659KB  38.7MB  26.3MB  15809  106834

Has anyone experienced similar issues?

Is there anything I could try to resolve this? (Replacing the backup software is not an option currently.)

Many thanks!

JGPSHNTAP · ‎2015-07-01

So I'm on the same page as you: are you using NDMP?

And what is your disk setup on the backup side?

thomasb82 · ‎2015-07-01

Unfortunately it's just basic file copying. No NDMP for now.

It does not matter whether we use TSM, robocopy, xcopy etc. It also does not matter whether we copy to a VM on the same cluster, a QNAP NAS, or a physical Windows server.

NetApp support told us that "low-end systems" like ours (2552) can be affected by bug (BURT) 880471, and that our system is affected.

They said it was caused by a LIF that was not on its home port. But we have this issue even if all LIFs are on their home ports.

So far we have not received a proper solution.

JGPSHNTAP · ‎2015-07-01

Well, that's an ugly way to back up.

OK, what's your disk setup? How many disks are in the aggregate?

thomasb82 · ‎2015-07-02

I know it's not ideal, and it's going to be changed, but not at this time.

We have 2 shelves, each with 20x 900GB SAS and 4x 200GB SSD.

Out of those we have built 2 aggregates with the same assignment.

I hope 8.3.1 or a future release will fix it. If this is a hardware issue, I hope we get replacements.

JGPSHNTAP · ‎2015-07-06

OK, I apologize for continuing to ask, but I want to make sure I follow.

Your vaulted shares seem to be going to an aggregate with enough IOPS, based on the disk layout above. I assume you have hybrid aggregates with SAS and SSD.
Give me the exact RAID group layout of the aggregate underlying the destination shares. I just want to triple-check.

Also, did someone accidentally put QoS on?

thomasb82 · ‎2015-07-06

raid type = dp for SAS, 4 for SSD

raid gr. size = 19

raid alloc = 18

1 flashpool with 6 disks

The latency during file copies/backups on the vault CIFS shares is 300-400x higher than on the normal CIFS shares.

And the normal shares are about 3TB (no backups / latency issues), compared to the vault shares with only 50GB.

Both reside on the same aggregate, so I think this is a software-related issue.

QoS is active for all CIFS shares (100MB/s). Nothing improves when QoS is disabled.