A-SIS de-duplication and deleted data inside VMDKs

Hi all,

We're currently using de-duplication on our FAS2050, which is storing VMDKs for our ESX cluster to access via NFS. It works fantastically well - we're saving approximately 80% space on our "OS disk" volume, and about 50% on average across all volumes on our filer.

That being said, I have a question I'm hoping "the experts" can share their insight into.

I was recently trying to figure out why deleting a bunch of data inside one of our VMs didn't translate into the expected reduction in space usage of the VMDK file inside its FlexVol.

The "answer" I came up with follows - I'm hoping someone can "check my logic", and if it's right, perhaps suggest any work-arounds?

Imagine this scenario:

1) NetApp filer:

- storing VMDKs for ESX

- using NFS (I'd think the scenario would be the same with iSCSI / FC LUNs, but NFS is easy and what I'm familiar with)

- de-dupe turned on

- no snapshots - simplifies the example

2) 100GB disk created in VMWare, attached to a Windows VM, and formatted NTFS

- ignore thin provisioning - simplifies the example

- results in 100GB file created on NetApp FlexVol containing almost entirely zeros.

3) A-SIS goes to work

- VMDK de-duplicated

- VMDK's real disk consumption is now (essentially) 0GB - close enough to 100% space saving

4) 50GB of data copied onto disk inside VM

- VMDK's real disk consumption grows to 50GB

5) A-SIS goes to work

- Runs de-dupe pass on VMDK

- let's say data copied onto disk is 50% duplicate

- so real disk consumption is now 25GB (50% of 50GB)

6) All data deleted from the disk inside VM

- Windows deletes data

- But this doesn't actually zero the blocks, just destroys inodes / marks blocks as free / etc

- Inside-VM disk consumption now reported as 0GB

7) A-SIS goes to work

- Runs de-dupe pass on VMDK

- But this time the blocks inside the VMDK aren't all zeros - they still contain the old data, they're just flagged as "free" in the NTFS file system

- Therefore de-dupe can't do much (or any) better than the previous pass

- And real disk consumption for the VMDK remains around 25GB

So therefore the real space consumption of a de-duplicated "empty" VMDK with deleted data is far higher than a de-duplicated "empty" VMDK that is a "fresh" disk?

Is that all correct? Or did I miss some big (or small) step in my logic?
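The steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not how A-SIS is implemented: the "disk" is a list of 100 blocks standing in for 100GB, de-dupe is modelled as counting unique 4KB blocks by hash, and the names `physical_blocks` and `data_block` are made up for the example.

```python
import hashlib

BLOCK = 4096  # block size used for the toy model (assumption for illustration)

def physical_blocks(disk):
    """Count the distinct blocks a block-level de-dupe pass would have to keep."""
    return len({hashlib.sha256(block).digest() for block in disk})

def data_block(i):
    """Hypothetical helper: a non-zero block whose content depends on i."""
    return i.to_bytes(4, "big") * (BLOCK // 4)

# Steps 2-3: a fresh 100-block disk is all zeros, so it de-dupes to one block.
disk = [bytes(BLOCK)] * 100
fresh = physical_blocks(disk)

# Steps 4-5: copy in 50 blocks of data, 50% of which is duplicate.
unique_data = [data_block(i) for i in range(1, 26)]
disk[:50] = unique_data + unique_data
after_copy = physical_blocks(disk)    # 25 data blocks + 1 shared zero block

# Steps 6-7: the guest "deletes" everything, but only file-system metadata
# changes; the data blocks inside the VMDK are left as they were.
after_delete = physical_blocks(disk)

print(fresh, after_copy, after_delete)
```

In this model the "empty" disk after the delete costs as much as the full one did, which is the effect described in step 7.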

Obviously the above is a synthetic - and extreme - example, but assuming I'm correct does this not mean the net savings (inside-VM reported disk usage to real disk usage ratio) from A-SIS will gradually decrease over time as data gets deleted inside a VMDK, and more and more blocks in a VMDK are free-but-not-zero? Assuming, of course, Windows doesn't choose to overwrite the deleted blocks with newer data?
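To put rough numbers on that ratio, here is a back-of-the-envelope sketch. It assumes free-but-not-zero ("stale") blocks de-dupe at the same rate as live data and that zeroed blocks cost nothing; both are simplifications, and the function names are made up for the example.

```python
def physical_gb(live_gb, stale_gb, dup_fraction):
    """Space left on the filer after de-dupe; stale blocks count like live ones."""
    return (live_gb + stale_gb) * (1.0 - dup_fraction)

def net_saving(live_gb, stale_gb, dup_fraction):
    """Saving relative to what the guest reports as used (live data only)."""
    return 1.0 - physical_gb(live_gb, stale_gb, dup_fraction) / live_gb

# 50GB of live data that is 50% duplicate, with a growing amount of stale data.
for stale in (0, 25, 50):
    print(stale, physical_gb(50, stale, 0.5), net_saving(50, stale, 0.5))
```

With these inputs the net saving falls from 50% with no stale data to 25% with 25GB stale, and to nothing once the stale data matches the live data in size.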

I understand that if I'm correct A-SIS is still just doing its job - it's the underlying data that gets less de-dupable due to the way NTFS works - but the net result from the user's point of view is you see less and less net space savings from A-SIS over time?

Finally, again assuming this is all correct - is there any way to counteract this effect? Like making Windows actually zero blocks when it deletes them, or something similar?

Thanks,

Matt

Message was edited by: Mathew Kilham

Re: A-SIS de-duplication and deleted data inside VMDKs

Hello Mathew,

Very good question!

You are right. Because de-duplication works at block level, deleting a file inside the VM does not free up space: the file's data does not really change, except for the flag that marks it as overwritable. Only a minimal number of blocks change. You no longer see the file, but it is still recoverable with undelete tools; in other words, it is not really deleted. For de-duplication to have any effect on those blocks, exactly the same blocks need to exist in other VMDKs.

Another question comes up: What if you zero the unused blocks in VMDKs on a regular schedule? Do you save more space because you are duplicating more blocks by zeroing them?

Tomas

Re: A-SIS de-duplication and deleted data inside VMDKs

> Very good question!

Thanks :-)

> Another question comes up: What if you zero the unused blocks in VMDKs on a regular schedule? Do you save more space because you are duplicating more blocks by zeroing them?

I had considered that, but two issues:

1) Don't think Windows can be set to do this automatically, and don't know of any tools to do this - although I didn't look too hard.

2) This would help A-SIS but would increase snapshot sizes - not sure to what extent though?
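For what it's worth, the usual way to zero free space from inside a guest is to fill it with a temporary file of zeros and then delete that file. Below is a minimal sketch of that idea, not a tested tool: `zero_fill_free_space`, `reserve_bytes` and `max_bytes` are names invented for the example, and it assumes the target volume is not using file-system compression or sparse files (otherwise the zeros may never reach the disk).

```python
import os
import shutil
import tempfile

def zero_fill_free_space(directory, reserve_bytes=1 << 30, max_bytes=None, chunk=1 << 20):
    """Write zeros to a temp file until only reserve_bytes of free space remain
    (or max_bytes have been written), then delete the file.
    Returns the number of bytes written."""
    zeros = bytes(chunk)
    written = 0
    fd, path = tempfile.mkstemp(dir=directory, prefix="zerofill_")
    try:
        with os.fdopen(fd, "wb") as f:
            while True:
                # Leave some headroom so the guest never actually runs out of space.
                budget = shutil.disk_usage(directory).free - reserve_bytes
                if max_bytes is not None:
                    budget = min(budget, max_bytes - written)
                if budget <= 0:
                    break
                n = min(chunk, budget)
                f.write(zeros[:n])
                f.flush()
                os.fsync(f.fileno())  # push the zeros down to the virtual disk
                written += n
    finally:
        os.remove(path)  # the blocks stay zeroed after the file is gone
    return written
```

A cautious first run might be `zero_fill_free_space("D:\\", max_bytes=10 << 30)` to limit the write to 10GB. On the snapshot concern in 2): every block this rewrites is a changed block, so any existing snapshot would keep holding the old contents until it is deleted.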

Re: A-SIS de-duplication and deleted data inside VMDKs

You basically need the equivalent of what SnapDrive for Windows Space Reclaimer does for NTFS LUNs within VMware.

I have no idea if this will work, but you may want to try the cipher command in Windows with the /w switch. According to the command's help text, it "Removes data from available unused disk space on the entire volume. If this option is chosen, all other options are ignored. The directory specified can be anywhere in a local volume. If it is a mount point or points to a directory in another volume, the data on that volume will be removed."
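One thing worth checking before relying on this: as far as I recall from Microsoft's documentation, cipher /w makes three passes over the free space (zeros, then 0xFF, then random data), so the free blocks would end up random rather than zero. The toy comparison below shows why that matters for de-dupe; it is a sketch with made-up block counts, and it models de-dupe simply as counting unique 4KB blocks.

```python
import hashlib
import os

BLOCK = 4096  # block size for the toy model (assumption for illustration)

def unique_blocks(blocks):
    """Count the distinct blocks a block-level de-dupe pass would have to keep."""
    return len({hashlib.sha256(b).digest() for b in blocks})

# 10 blocks of live data (all different) plus 40 free blocks.
live = [i.to_bytes(4, "big") * (BLOCK // 4) for i in range(1, 11)]

# Free space wiped with zeros: all 40 free blocks collapse into one.
zero_wiped = live + [bytes(BLOCK)] * 40

# Free space wiped with random data: every free block is unique.
random_wiped = live + [os.urandom(BLOCK) for _ in range(40)]

print(unique_blocks(zero_wiped), unique_blocks(random_wiped))
```

If the final pass really is random, a wipe like this would leave the volume less de-dupable than before, so a zero-only wipe is what you would want here.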

Thanks,

Mitchell

Re: A-SIS de-duplication and deleted data inside VMDKs

Has anyone tried using the cipher command, and can anyone verify whether it works inside VMs? I have a VM with an RDM that I am doing VCB backups on, and the RDM drive does not cut out its previously used slack space when performing VCB backups.

In other words, the drive has 100GB of live data in it. When the system was being built, there was 200GB of data in it, and 100GB of that was later deleted from the volume. VCB backs up 200GB, even though there is only 100GB of live data in this volume.

This scenario only occurs for RDMs. For typical VMDKs, VCB properly removes the slack space.