Deduplication and 7zip files

TINGWEI_LIM · ‎2014-04-15

Hi guys,

I have a volume filled with 7zip files containing MSSQL dumps. The volume holds more than 2,000 7zip files, each a daily MSSQL dump of 500 MB to 1.5 GB.

Initially I thought deduplication would save a lot of space, but it turned out I was wrong: the savings were minimal. Does this happen with most zip files? Has anyone experienced this before?

sis status -l output

State: Enabled

Compression: Disabled

Inline Compression: Disabled

Status: Idle

Progress: Idle for 08:19:57

Type: Regular

Schedule: sun-sat@7

Minimum Blocks Shared: 1

Blocks Skipped Sharing: 0

Last Operation State: Success

Last Successful Operation Begin: Tue Apr 15 07:00:00 MYT 2014

Last Successful Operation End: Tue Apr 15 07:02:48 MYT 2014

Last Operation Begin: Tue Apr 15 07:00:00 MYT 2014

Last Operation End: Tue Apr 15 07:02:48 MYT 2014

Last Operation Size: 6978 MB

Last Operation Error: -

Change Log Usage: 0%

Logical Data: 1345 GB/49 TB (3%)

Queued Job: -

Stale Fingerprints: 0%

df -sh output

Filesystem used saved %saved

/vol/myvolume/ 1341GB 3772MB 0%

ekashpureff · ‎2014-04-15

TingWei -

DeDupe works on 4K WAFL blocks. There probably aren't many duplicate blocks in your DB dumps, and duplicate blocks are even less likely once the dumps are zipped.

If there were lots of duplicate files that had been zipped up then dedupe would do great things.
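To make the block-level idea concrete, here is a rough Python sketch of fixed-size block matching. It is only an illustration, not how the filer implements dedupe: the SHA-256 fingerprint and the names `block_fingerprints` and `dedupe_savings` are assumptions made up for this example. The only detail taken from the thread is the 4K block size.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K block size, as mentioned above


def block_fingerprints(data: bytes) -> list[str]:
    """Split data into fixed 4 KiB blocks and hash each block."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def dedupe_savings(files: list[bytes]) -> float:
    """Return the fraction of blocks that duplicate a block seen earlier."""
    seen = set()
    total = 0
    duplicates = 0
    for data in files:
        for fingerprint in block_fingerprints(data):
            total += 1
            if fingerprint in seen:
                duplicates += 1  # this block could be shared, not stored again
            else:
                seen.add(fingerprint)
    return duplicates / total if total else 0.0
```

Feeding it two byte-identical files reports about half the blocks as shareable, which is the "duplicate files" case. Files whose blocks never match byte for byte report close to zero, which is roughly what the df -s output above shows.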

I hope this response has been helpful to you.

At your service,

Eugene E. Kashpureff

Senior Consultant, K&H Research http://www.khresear.ch/

Senior Instructor, Unitek Education http://www.unitek.com/training/netapp/

TINGWEI_LIM · ‎2014-04-16

Hi Eugene,

Thanks for the reply. My DB dumps are actually all from the same DB, so I assumed there would be a lot of duplicates. My guess is that the 7zip compression algorithm makes each file unique. I don't know for sure, but it is good to be aware of this.

TingWei

MMUELLER_HC · ‎2014-05-13

Hi,

I just saw this and maybe it's already too late. But maybe it will help someone else.

Most compression algorithms work in such a way that even a very small change in the uncompressed file causes a "cascade" of differences throughout the compressed file. So even though the uncompressed source files might be 99% identical, the compressed files are totally different.
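A small Python experiment can show this cascade. Treat it as a sketch under assumptions: the synthetic "dump" rows, the 64-byte in-place change, and the helper names are invented for illustration, and Python's `lzma` module stands in for 7zip (which, as far as I know, uses LZMA-family compression by default). It compares how many 4K blocks of a second file already exist in the first, before and after compression.

```python
import hashlib
import lzma
import random

BLOCK_SIZE = 4096  # 4K blocks, matching the dedupe granularity discussed here


def fingerprints(data: bytes) -> list[str]:
    """Hash every fixed-size 4 KiB block of the data."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of blocks in `new` that already exist somewhere in `old`."""
    known = set(fingerprints(old))
    blocks = fingerprints(new)
    return sum(fp in known for fp in blocks) / len(blocks) if blocks else 0.0


def make_dumps() -> tuple[bytes, bytes]:
    """Build two hypothetical dumps that differ in 64 bytes near the start."""
    rng = random.Random(0)  # fixed seed so the run is repeatable
    rows = [
        f"{i},{rng.randint(0, 999999)},customer{rng.randint(0, 999)}\n"
        for i in range(20000)
    ]
    day1 = "".join(rows).encode()
    day2 = bytearray(day1)
    day2[100:164] = b"X" * 64  # same-length overwrite, so nothing shifts
    return day1, bytes(day2)


def compare() -> tuple[float, float]:
    """Return (shared fraction uncompressed, shared fraction compressed)."""
    day1, day2 = make_dumps()
    plain = shared_fraction(day1, day2)
    packed = shared_fraction(lzma.compress(day1), lzma.compress(day2))
    return plain, packed


if __name__ == "__main__":
    plain, packed = compare()
    print(f"uncompressed blocks shared: {plain:.1%}")
    print(f"compressed blocks shared:   {packed:.1%}")
```

The uncompressed copies share every block except the one that was touched, while the compressed copies should share far fewer, because everything after the change is encoded differently.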

Newer versions of gzip have an option called --rsyncable (http://superuser.com/questions/636881/what-are-good-compression-algorithms-for-delta-synchronization) which slightly increases the size of the compressed archive, but also resynchronizes the compressed output with the uncompressed input at frequent intervals. With it, the compressed output stays more similar even when the uncompressed input has small changes.

Give it a try and let us know if it helps.

Michael

TINGWEI_LIM · ‎2014-05-30

Hey Michael,

Thanks for the info! I will check with my client to see if they are willing to include the --rsyncable option.