Back to Basics: Deduplication

Tech_OnTap · ‎2011-03-03

By Carlos Alvarez, Sr. Technical Marketing Engineer

This month, Tech OnTap is pleased to bring you the second installment of Back to Basics, a series of articles that discuss the fundamentals of popular NetApp technologies to help you understand and get started using them.

In 2007, NetApp introduced deduplication technology that significantly decreases storage capacity requirements. NetApp deduplication improves efficiency by locating identical blocks of data and replacing them with references to a single shared block after performing a byte-level verification check. This technique reduces storage capacity requirements by eliminating redundant blocks of data that reside in the same volume or LUN.

NetApp deduplication is an integral part of the NetApp Data ONTAP® operating environment and the WAFL® file system, which manages all data on NetApp storage systems. Deduplication works "behind the scenes," regardless of what applications you run or how you access data, and its overhead is low.

A common question is, "How much space can you save?" We'll come back to this question in more detail later, but, in general, it depends on the dataset and the amount of duplication it contains. As an example, the value of NetApp deduplication in a mixed environment with both business and engineering data is illustrated by Atlanta-based Polysius Corporation, which designs and enhances new and existing cement plants.

Polysius was experiencing up to 30% annual increases in its production storage requirements. By applying deduplication to its mix of AutoCAD files, Microsoft® Office documents, and other unstructured data, Polysius was able to recover 47% of its storage space. Some volumes showed reductions as high as 70%. As a result, the company was able to defer new storage purchases and has been able to double the time period it retains backup data on disk. Read the Polysius success story for more details.

There are some significant advantages to NetApp deduplication:

Operates on NetApp or third-party primary, secondary, and archive storage
Application independent
Protocol independent
Minimal overhead
Works on both NetApp FAS and V-Series storage systems
Byte-by-byte validation
Can be applied to new data or data previously stored in volumes and LUNs
Can run during off-peak times
Integrated with other NetApp storage efficiency technologies
Savings due to deduplication can be inherited when using SnapMirror® or Flash Cache
Simple evaluation tools
Free of charge

This chapter of Back to Basics explores how NetApp deduplication is implemented, the most common use cases, practices for implementing deduplication, and more.

How Deduplication Is Implemented in Data ONTAP

At its heart, NetApp deduplication relies on the time-honored computer science technique of reference counting. Previously, Data ONTAP kept track only of whether a block was free or in use. With deduplication, it also keeps track of how many uses there are. Using deduplication, a single block can be referenced up to 255 times for both NAS and SAN configurations. Files don't "know" that they are using shared blocks—bookkeeping within WAFL takes care of the details invisibly.

Figure 1) How NetApp deduplication works.

How does Data ONTAP decide that two blocks can be shared? The answer is that, for each block, it computes a "fingerprint," which is a hash of the block's data. Two blocks that have the same fingerprint are candidates for sharing.

When NetApp deduplication is enabled, it computes a database of fingerprints for all of the in-use blocks in the volume (a process known as "gathering"). Once this initial setup is finished, the data is ready for deduplication.

To avoid slowing down ordinary operations, the search for duplicates is done as a separate batch process. As data is written during normal use, WAFL creates a catalog of fingerprints for this data. This catalog accumulates until dedupe is triggered by one of the following events, as determined by the storage system administrator:

A deduplication "start" command is issued manually
A scheduled deduplication process occurs
20% new data has been written to the volume
When a SnapVault® transfer is completed

Once the deduplication process is started, a sorting operation begins, using the fingerprints of the changed blocks as a key. This sorted list merges with the fingerprint database file. Whenever the same fingerprint appears in both lists, there are possibly identical blocks that can be collapsed into one. In this case, Data ONTAP can discard one of the blocks and replace it with a reference to the other block. Since the file system is changing all the time, we of course can take this step only if both blocks are really still in use and contain the same data. To make sure that two blocks are really identical, we do a byte-by-byte comparison after the candidate blocks are identified.

The implementation of NetApp deduplication takes advantage of some special features of WAFL to minimize the cost of deduplication. For instance, every block of data on disk is protected with a checksum.

NetApp uses this checksum as the basis for the fingerprint. Since we were going to compute it anyway, we get it "for free"—there is no additional load on the system. And since WAFL never overwrites a block of data that is in use, fingerprints remain valid until a block gets freed. The tight integration of NetApp deduplication with WAFL also means that change logging is an efficient operation. The upshot is that deduplication can be used with a wide range of workloads, not just for backups, as has been the case with other deduplication implementations.

Use Cases

NetApp has been measuring the benefits of deduplication in real-world environments since deduplication was introduced. The most popular use cases are VMware® and VDI, home directory data, and file services. Microsoft SharePoint® and Exchange 2010 are also rapidly gaining traction.

Many Tech OnTap articles have focused on the particular benefits of deduplication in VMware and VDI environments, which have an inherently high level of file duplication owing to the nearly identical operating system environments used by each virtual machine. The following table summarizes typical results for a variety of environments.

Table 1) Typical deduplication space savings.

In a typical VMware or VDI environment you may have numerous virtual machines (VMs) that are all installed with more or less the same operating system and applications, resulting in a lot of duplication.

If you have 100 VMs running the same OS and each virtual machine requires 10GB to 20GB of storage, that's 1TB to 2TB of storage dedicated to almost identical copies. Applying NetApp deduplication can eliminate much of the inherent redundancy.

In general terms, if you have X virtual machines assigned to a storage volume, after deduplication you will need approximately 1/X the amount of operating system storage you would require in a nondeduplicated environment. Obviously, the actual results you achieve will depend on how many VMs you have in a volume and how similar they are.

In practice, customers typically see space savings of 50% or more in ESX VI3 environments, with some obtaining storage savings as high as 90%. This is for deduplication of the entire VMware storage environment, including application data—not just operating systems. In VDI environments, customers typically see space savings of up to 90%.

NetApp has also been investigating the benefits of deduplication on the repositories of unstructured file data created by some popular engineering and scientific applications, including Siemens Teamcenter PLM software, IBM Rational ClearCase SCM software, and Schlumberger Petrel software for seismic data analysis.

Teamcenter utilizes a relatively small metadata database combined with a large "vault," where engineering design files are stored. Every time an engineer saves a design within Teamcenter, a complete copy of that design file is saved in the vault, even if the change made to the design is minor.

NetApp worked closely with Siemens PLM to assess the value of deduplication in a Teamcenter environment using Siemens's performance and scalability benchmarking tool, which simulates the creation of multiple revisions of many design files, as would occur during normal use. Deduplication of the resulting vault yielded a 57% space savings. Results in the real world might be even higher than this since in many cases the number of file revisions is likely to be higher than that which we simulated. (Of course, in general you have to be careful when using simulators to measure the potential savings of deduplication. Simulated data in many cases will create artificially high amounts of duplicate data, since the focus is typically performance, not data patterns.)

Similar to Teamcenter, IBM Rational ClearCase—a leading software configuration management solution—consists of a metadata database in combination with a large "versioned object base," or VOB, in which files are stored. Deduplication is most useful with ClearCase in situations in which a copy of a VOB needs to be made. In addition, preliminary results in a laboratory environment suggest space savings of 40% or more using deduplication in a ClearCase environment when whole files are stored.

Schlumberger Petrel is used for seismic data interpretation, reservoir visualization, and simulation workflows. It creates project directories that contain huge numbers of files. As users create, distribute, and archive data, duplicate data objects are stored across multiple storage devices. NetApp observed space savings of approximately 48% by applying deduplication to such project directories.

Using NetApp Deduplication

The basic requirements for running NetApp deduplication are summarized in Table 2.

Table 2) Basic requirements for NetApp deduplication.

In addition to these requirements, attention to a few best practices can help enable success with deduplication. Some of the important best practices are summarized in this section along with information about using deduplication in conjunction with other popular NetApp technologies. For full details, refer to TR-3505: NetApp Deduplication Deployment and Implementation Guide.

You should measure the performance impact of deduplication as well as the sizing requirements in a test environment before deployment, especially for applications that have not been previously tested by NetApp, such as those summarized in Table 1.
Deduplication consumes system resources and can alter the data layout on disk. Due to the application's I/O pattern and the effect of deduplication on the data layout, the read and write I/O performance can change. The space savings and the performance impact depend on the application and the data contents.
If your application creates only a small amount of new data, run deduplication infrequently, because there is very little benefit in running it frequently in such a case. The frequency for running deduplication depends on the rate of change of the data in the flexible volume.
The more concurrent deduplication scanner processes you run, the more system resources are consumed. The best option may be to do one of the following:
- Stagger the deduplication schedule for the flexible volumes so that deduplication processes run on alternate days, reducing the possibility of running too many concurrent processes.
- Use Auto mode so that deduplication runs only when significant additional data has been written to each flexible volume. (This tends to naturally stagger out when deduplication runs in smaller environments.)
- Run deduplication manually.
- Run deduplication nightly to minimize the amount of new data to be deduplicated, so it requires less time to complete.
If a Snapshot® copy is created before the deduplication process has completed, it is likely to result in lower space savings. Run the deduplication process before creating Snapshot copies if possible and make sure that the deduplication process has completed before creating a Snapshot copy.
For deduplication to run properly, you need to leave some free space for the deduplication metadata. For Data ONTAP versions prior to 7.3, each flexible volume should have 6% of the total data's worth of free space. For Data ONTAP versions 7.3 or later, the aggregate should have 4% (fingerprint + change logs) of the total data's worth of free space for all deduplicated flexible volumes, and each flexible volume should have 2% of the total data's worth of free space. This is described in more detail in TR-3505, section 5.3.3.

Deduplication and Other NetApp Technologies

Deduplication is designed to work with other NetApp technologies. In many cases, those technologies get an additional benefit:

Flash Cache. Flash Cache provides intelligent caching that accelerates I/O operations. NetApp deduplication increases the likelihood of a cache hit. When a deduplicated block is in Flash Cache, the probability that it will be requested again is much higher. This effect is referred to as cache amplification, and is particularly beneficial with server and desktop virtualization.
Volume SnapMirror. When you replicate a deduplicated volume with volume SnapMirror, the target volume automatically inherits the deduplicated state from the source volume. The impact of this effect for VMware was described in a previous article. The interaction of deduplication with all forms of SnapMirror and SnapVault were also described in a recent article.
FlexClone. FlexClone® technology instantly replicates data volumes and datasets as transparent, virtual copies. When a FlexClone volume is created:
- If the parent FlexClone volume has duplication enabled, the new volume inherits the savings.
- The cloned volume inherits the deduplication configuration of the parent volume, such as the deduplication schedule.
- Starting with Data ONTAP 7.3, the deduplication metadata files (the fingerprint database and the change log files) do not get cloned, because they are located outside the volume in the aggregate. Deduplication must be started on the cloned volume for full deduplication to continue.
Compression. NetApp data compression, introduced in Data ONTAP 8.0.1, is highly complementary to deduplication. Potential space savings and use cases were described in a previous Tech OnTap article.
Thin provisioning. Deduplication also works in conjunction with NetApp thin provisioning to maximize space savings. For NAS volumes, configuration is straightforward. For LUNs, you can achieve maximum space savings using the following settings (see TR-3505, section 6.4.18, for full details):
- LUN space reservation value = off
- Volume fractional reserve value = any value from 0–100
- Volume guarantee = none
- Snap reserve = 0%
- Autodelete = on
- Autosize = on
- Try_first = volume_grow

Conclusion

Deduplication is an important storage efficiency tool that can be used alone or in conjunction with other storage efficiency solutions such as NetApp thin provisioning, FlexClone, and others. To learn more about deduplication, be sure to refer to TR-3505: NetApp Deduplication for FAS and V-Series Deployment and Implementation Guide. This frequently updated guide covers a broad range of topics including:

Configuration and operation
Sizing for performance and space efficiency
Use with other NetApp technologies
Best practices including best practices for use with specific applications including VMware; Microsoft Exchange, SQL Server, and SharePoint; Lotus Domino; Oracle; and others
Troubleshooting

Got opinions about deduplication?

Ask questions, exchange ideas, and share your thoughts online in NetApp Communities.

Carlos Alvarez
Sr. Technical Marketing Engineer
NetApp

Carlos has been working with NetApp since 2008, specializing in storage efficiency with deep-dive expertise in deduplication, data compression, and thin provisioning. He regularly provides guidance for integrating the most effective and appropriate NetApp® storage efficiency technologies into customer configurations. With over 20 years of industry experience, Carlos has been called on to create numerous implementation guides, technical white papers, reference architectures, best practices, and solutions guides.

Deduplication in Tech OnTap

Want to know more about deduplication? You may be interested in reading these previous articles from Tech OnTap:

More Back to Basics

The first Back to Basics article focused on NetApp® thin provisioning. Check out the article to learn how it's implemented, best practices, and more.

Learn more