Data Deduplication Impact on Misalignment

I'm reposting this as a question. The same post went up a few days ago as a discussion and no one from NetApp responded. On a side note, how about adding a feature to the forum that lets us change a discussion to a question, for when we forget to check the "Mark this thread as a Question" checkbox?

In the NetApp University class on VMware, the subject of misalignment is discussed. The information covered about performance makes perfect sense. However, new information I hadn't heard before was also shared: partners are being told that misalignment also affects data deduplication. If you review the attached lab manual and go to page 5-7 (PDF page 123), you will see the following section.

  • Data deduplication impact of misalignment
    • Not completely clear
    • A customer experienced 6% savings, then 64% space savings after properly aligning virtual disks

The two statements above lead to a lot of questions. The first point says it isn't completely clear what the impact is, and then the second goes on to show a customer who had a huge space savings due to it. Well, if that really was related to misalignment, then it is pretty clear. Unless the point is that NetApp isn't really sure the misalignment was the issue.

While the statement alone would lead the reader to believe "it isn't clear," the instructor teaching the class made it seem much more like a fact that proper alignment helps deduplication.

To be clear, I'm all for proper alignment for performance reasons. However, I have 5 or 6 customers who have not aligned any of their VMs, and each is seeing a huge space savings with data deduplication.

Keep in mind that migrating from misaligned LUNs to aligned LUNs requires moving all the data. A number of things could have changed during that process to affect deduplication.

In the end I have the following questions:

  1. Has there been any testing internal to NetApp to substantiate what this customer believes they saw?
  2. How exactly does alignment, by itself, affect data deduplication?
  3. If all the VMs are misaligned by the same offset, then wouldn't they fall on the same boundaries in the 4KB blocks and thus still appear as duplicates?

Re: Data Deduplication Impact on Misalignment

This question was answered in the Virtualization forum.

Re: Data Deduplication Impact on Misalignment

Hi Kyle,

Great subject and question, let me try to explain. Yes, misaligned VMs can have an effect on deduplication savings, but whether they will is another matter. NetApp deduplication depends on the ability to discover duplicate WAFL blocks. If the data in two otherwise identical VMs becomes shifted, we lose the ability to dedupe them. This is a result of our conscious decision to make deduplication as resource "light" as possible. If we have to spend time trying to re-align blocks, then we are competing for the same CPU, memory, and I/O resources that your primary applications also need, and your system performance may very well degrade. Our approach was not to try and manipulate the data but rather to dedupe quickly and then get out of the way.

BTW, you are correct when you say that if all VMs become misaligned by the same byte offset, this should not have any ill effect on dedupe savings, since the blocks are still identical. Based on my experience, the majority of our VMware users do not require any sort of re-alignment for good dedupe savings (as per your examples below), but occasionally we get people saying "why is my savings only XX%, I thought it would be higher?" Because of these comments, we decided to start promoting the use of aligned VMs as a best practice, not just for dedupe but because it helps with overall VM performance regardless of deduplication.
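To make the offset point concrete, here is a minimal Python sketch of fixed-block duplicate detection. It is only an illustration, not how WAFL is implemented: the SHA-256 fingerprint, the helper names, and the 63-sector (32,256-byte) offset are all assumptions chosen for the example. It compares two copies of the same data shifted by the same offset, then one shifted copy against an unshifted one.

```python
import hashlib
import random

BLOCK = 4096  # 4KB block size, as discussed in the thread

def block_fingerprints(data):
    """Fingerprint each full fixed-size block (SHA-256 is an illustrative choice)."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data) - BLOCK + 1, BLOCK)]

def shared_blocks(a, b):
    """Count the blocks of b whose fingerprint also occurs somewhere in a."""
    seen = set(block_fingerprints(a))
    return sum(1 for fp in block_fingerprints(b) if fp in seen)

# Deterministic pseudo-random payload standing in for identical guest data.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(64 * BLOCK))

def disk(offset):
    """Hypothetical virtual disk: `offset` zero bytes, then the payload."""
    return bytes(offset) + payload

OFFSET = 63 * 512  # assumed example offset; not a multiple of 4096

same_offset = shared_blocks(disk(OFFSET), disk(OFFSET))
mixed_offset = shared_blocks(disk(OFFSET), disk(0))
total = len(block_fingerprints(disk(OFFSET)))

print("blocks per shifted disk:", total)
print("shared, same offset:    ", same_offset)
print("shared, mixed offsets:  ", mixed_offset)
```

With the same offset on both disks, every block still matches. With one disk shifted and the other not, the payload straddles the block boundaries differently and no blocks match in this sketch.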

Hope that helps,

Dr Dedupe

Re: Data Deduplication Impact on Misalignment

I have the same scenario on my PC image volumes (ISO images and Dell PC images in approximately 21 languages, across different operating system versions: Windows XP, Windows 2000, etc.). I got only around 8% savings, where I expected much greater savings because these images are most likely nearly the same.

filer02> df -g -s

/vol/vol04/ 3562GB 328GB 8%
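As a quick sanity check on that output, the small Python sketch below recomputes the percentage. It assumes the three figures are used space, saved space, and percent saved, and that the percentage is saved / (used + saved); that reading is an assumption, not something stated in the output itself. The function name is hypothetical.

```python
def percent_saved(used_gb, saved_gb):
    """Percent saved, assuming it is saved / (used + saved)."""
    logical_gb = used_gb + saved_gb  # size the data would occupy without dedupe
    return 100.0 * saved_gb / logical_gb

# Figures taken from the df output above.
pct = percent_saved(3562, 328)
print(round(pct))
```

Under that assumption the result rounds to the 8% shown.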

File sizes for the images range from 500MB, 700MB, 1GB, 1.2GB, 1.4GB, 2GB, and 2.5GB up to 4GB for the Vista images.

Hope you can enlighten me on how to save much more on this one.



Re: Data Deduplication Impact on Misalignment

What was the answer?