EMC vs NetApp Deduplication - Fact Only Please

I'm writing this in hopes of becoming better educated on the facts, not opinions, about the pros and cons, as well as the similarities and differences, between NetApp's "current" deduplication technology and EMC's "current" deduplication technology.

I have deployed NetApp solutions in my previous environment, but my current (new) workplace utilizes EMC. On a personal note, I prefer NetApp hands down. However, my responsibility is to define the capabilities and make decisions based on the facts.

I have read a bit about EMC and NetApp fighting over Data Domain around 2009. When talking to EMC, their deduplication recommendation is Data Domain.

EMC claims that their Data Domain product provides real-time deduplication at a 4 KB block level, which they claim is much more efficient. EMC mentioned that NetApp does deduplication at around 128 KB. The bottom line is a claim of better performance capabilities. Thoughts?

Does the current version of ONTAP utilize SIS or deduplication?

EMC claims SIS is a "limited form of deduplication". http://www.datadomain.com/resources/faq.html#q5

Please clarify the facts regarding the type of deduplication utilized by NetApp, as well as any thoughts on the comments above. Facts only, please.

As a side note, I'm currently comparing the NetApp V-Series with EMC product lines. The goal is to place a single device in front of all SANs to provide snapshot and deduplication capabilities for all SAN data. Over time we'll be bringing multiple storage vendors into our environment. The V-Series is a one-stop shop: we can virtualize all data regardless of vendor, provide snapshot capability, dedupe the data, and simplify management.

EMC's solution requires me to purchase two new devices, an EMC VG2 and an EMC Data Domain. Even with EMC's recommendations, there is no capability to snapshot other vendors' data; the Data Domain appliance will only provide dedupe for all vendors. The EMC VG2 is being recommended to consolidate multiple file storage servers into CIFS as well as to provide NFS for virtualization. So, in the end, EMC is saying: buy one appliance from us to provide dedupe, and buy a second appliance to provide NAS capabilities. Wait a minute, all of these features are built into a single NetApp... Thoughts?

Re: EMC vs NetApp Deduplication - Fact Only Please

EMC claims that their Data Domain product provides real-time deduplication at a 4 KB block level, which they claim is much more efficient. EMC mentioned that NetApp does deduplication at around 128 KB. The bottom line is a claim of better performance capabilities. Thoughts?


NetApp deduplicates at the 4K block level - this is fact. AFAIK Data Domain uses variable block lengths.
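
To make the fixed-block idea concrete, here is a minimal Python sketch of dedupe on fixed 4K blocks. This is purely illustrative, not NetApp's actual on-disk logic; the SHA-256 fingerprint and the in-memory dict index are my own assumptions:

import hashlib

BLOCK_SIZE = 4096  # fixed 4K blocks, the granularity discussed above

def dedupe_fixed_blocks(data: bytes):
    """Split data into fixed-size blocks and keep one copy per unique block."""
    stored = {}       # fingerprint -> stored block (the physical copies)
    block_refs = []   # logical layout: one fingerprint per logical block
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        fp = hashlib.sha256(block).digest()
        if fp not in stored:      # first time we see this content: store it
            stored[fp] = block
        block_refs.append(fp)     # a duplicate just adds another reference
    return stored, block_refs

payload = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # 4 logical, 2 unique blocks
stored, refs = dedupe_fixed_blocks(payload)
print(f"logical blocks: {len(refs)}, physical blocks stored: {len(stored)}")

Variable-length chunking (which Data Domain reportedly uses) derives chunk boundaries from the content itself instead of from fixed offsets, so an insert near the start of a stream does not shift every subsequent block boundary and destroy the matches.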

Does the current version of ONTAP utilize SIS or deduplication?


The feature is officially called A-SIS (Advanced Single Instance Storage), but everyone just says 'dedupe' when speaking about it. Both terms are used interchangeably as far as I can tell.

EMC claims SIS is a "limited form of deduplication". http://www.datadomain.com/resources/faq.html#q5

NetApp A-SIS works at the block level, not at the file level, so the above simply has nothing to do with NetApp.
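
For contrast, here is file-level single-instancing in toy form (my own sketch, not any vendor's implementation): it only collapses whole files that are byte-for-byte identical, which is why it is fairly described as a limited form of deduplication.

import hashlib

def single_instance_files(files: dict) -> tuple:
    """File-level SIS: only byte-identical whole files are collapsed."""
    store = {}    # file hash -> one stored copy of the content
    catalog = {}  # filename -> file hash
    for name, content in files.items():
        fp = hashlib.sha256(content).digest()
        store.setdefault(fp, content)   # an identical second file is not stored again
        catalog[name] = fp
    return store, catalog

# Two exact copies of a document collapse to one; a copy with a single changed
# byte does not, whereas block-level dedupe would still share every unchanged block.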

Re: EMC vs NetApp Deduplication - Fact Only Please

Well, to me the key differentiator is that NetApp de-dupe can be (& often is) used for primary data (e.g. VMware datastores), whilst Data Domain is aimed purely at secondary data (i.e. backup images) - unless they changed their positioning recently.

Regards,

Radek

Re: EMC vs NetApp Deduplication - Fact Only Please

NetApp called their deduplication Advanced Single Instance Storage, which means that A-SIS IS deduplication.

A-SIS can be configured on either primary or secondary storage volumes and is agnostic to the type of data in those volumes. There is only one other vendor (unless things have changed recently) that can provide dedupe on primary data; everyone else does it on secondary data (backups)... so not that useful, really.

Deduplicated volumes are mirrored (using SnapMirror) in a deduplicated state, so replication is more lightweight.

A-SIS runs at the block level, so Word documents and Excel documents don't matter; it doesn't give a monkey's what the blocks are part of.

It isn't perfect, but reducing the primary storage footprint is, to my mind, what we want to talk about. Who cares if the backups are smaller? We want more usable space out of our terabytes...

Re: EMC vs NetApp Deduplication - Fact Only Please

Hi all,

Before you validate a solution or technique, please consider in your decision: what are you looking for?

Do you need real-time deduplication or post-process (asynchronous) deduplication? Will the target be primary data or backup data?

A-SIS is only job-driven, and it consumes a lot of filer resources during a deduplication run.

During a deduplication run, the filer's performance is degraded.

Using A-SIS for backup data will only show the space reduction after post-processing (so initially it consumes all the space, before reduction).

Data Domain from EMC is real-time inline deduplication; it is targeted at secondary (backup) data and optimized for this flavor of data.

Data Domain in this case only consumes the reduced capacity on the back end and saves large amounts of capacity in the storage system.

Using Data Domain as online storage will show weak performance for transactional I/O, but good performance for streaming I/O.
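
A toy illustration of the space-over-time difference between inline and post-process dedupe (my own sketch; the single-letter block contents are stand-ins for real 4K blocks):

def space_timeline(blocks, inline):
    """Track physical blocks consumed as each logical block lands."""
    seen, physical, timeline = set(), 0, []
    for b in blocks:
        if inline:
            if b not in seen:   # inline: duplicates never hit disk
                seen.add(b)
                physical += 1
        else:
            physical += 1       # post-process: everything lands at full size first
        timeline.append(physical)
    if not inline:
        timeline.append(len(set(blocks)))  # later dedupe job reclaims duplicates
    return timeline

backup = ["a", "b", "a", "a", "c", "b"]   # 6 logical blocks, 3 unique
print("inline:      ", space_timeline(backup, True))    # [1, 2, 2, 2, 3, 3]
print("post-process:", space_timeline(backup, False))   # [1, 2, 3, 4, 5, 6, 3]

Both end at the same reduced footprint; the difference is that the post-process path needs the full capacity as a peak before the dedupe job runs.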

EMC also delivers online storage deduplication in their Celerra product code, for file services only, and it also uses a post-process deduplication (an automated process).

EMC also combines compression with this to gain better reduction, as they only use single instancing.

The benefit of this concept seems to be lower consumption of CPU resources when running the deduplication.

So in fact it's not easy choosing the right vendor and concept, as you first need to concentrate on the benefits of each solution and method, and the trade-offs of each concept.

You will never get all-in-one from any vendor -> you can never be in two places at once (unless you get cloned) ;-)

Re: EMC vs NetApp Deduplication - Fact Only Please

A key differentiator between NetApp and most other dedupe options is how they determine whether data is duplicate or not. Most vendors use "safe enough" hash-based matching; while false positives are rare, as the amount of data being deduped increases, so does the risk of hash collisions. NetApp, being post-process, can afford a more complex matching algorithm. NetApp only uses hashes to detect /possible/ duplicate data; it then compares those possible matches byte by byte... which is far safer.
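
In sketch form (my own toy Python, not NetApp's implementation; the SHA-256 fingerprint is an assumption), the verify step looks like this:

import hashlib

def dedupe_with_verify(index, block):
    """Share `block` with an existing copy only after byte-level verification.

    The hash merely nominates candidates; the byte compare decides.
    """
    fp = hashlib.sha256(block).digest()
    for candidate in index.get(fp, []):
        if candidate == block:          # full byte-by-byte comparison
            return True                 # verified duplicate: safe to share
        # same fingerprint, different bytes: a hash collision, not a duplicate
    index.setdefault(fp, []).append(block)  # store as a new unique block
    return False

A hash-only scheme would have silently merged the colliding blocks; the byte compare is what makes the post-process approach safer.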

Re: EMC vs NetApp Deduplication - Fact Only Please

So... there are lots of answers here given the varying array of products... no "one ONTAP to rule them all" model (and I don't claim to know them all... just the general topics).

For primary storage....

  • Symmetrix and/or V-Max -- nope, no dedup.
  • Clariion -- nope again.
  • Celerra -- yep, and compression too, but there are different variations that I frankly don't understand. Googling "celerra deduplication" seems like the best path... the 3rd hit is an EMC white paper... albeit 7 months old.
    • If you do have a Symmetrix or Clariion and have it front-ended with a NAS gateway (i.e. what a Celerra really is... a Celerra = Clariion + NAS gateway bundled together), there may be some dedup possibilities on the NAS protocol side.

For backups....

  • Data Domain -- yep... good tech IMHO, and if not for that little bidding war it would have been a NetApp product after all.
  • Avamar -- yep again... source-side dedup, which has some interesting implications.

I'm sure there's more out there, but I can't claim to be an EMC expert... I just try to make sure I know the general lay of the land, though.

Re: EMC vs NetApp Deduplication - Fact Only Please

One dedupe drawback I've run across while reading and learning about NetApp (I'm a former EMC guy...) is that deduplication on a volume with snapshots turned on negates the dedupe savings. I've read that the following happens:

Scenario

A volume is created and populated with data (a LUN or an NFS/CIFS share, etc.). As the data changes, snapshots preserve the changed blocks. Let's say that deduplication is then turned on after a certain period of time. At the point where deduplication is turned on, all of the blocks which are consolidated via the dedupe process are still held within the snapshots taken prior to the deduplication process being kicked off. This essentially negates the deduplication savings, since the deduplicated blocks still exist on your primary storage within the prior snapshots. In order to fix this conundrum you must delete all snapshots taken before the deduplication process was kicked off. This of course has the drawback of destroying your recovery capabilities for that volume.
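
In toy form, the mechanics described above would look something like this (my own model, not ONTAP internals): physical space comes back only when neither the active file system nor any snapshot references a block.

physical = {"p1": "X", "p2": "X", "p3": "Y"}   # p1 and p2 hold duplicate data
active = {"p1", "p2", "p3"}                    # active file system references
snapshots = [set(active)]                      # snapshot taken BEFORE dedupe

def used_blocks():
    refs = set(active)
    for snap in snapshots:
        refs |= snap                           # snapshots pin their blocks
    return refs

active = {"p1", "p3"}                          # dedupe collapses p2 onto p1
print(len(used_blocks()), "of", len(physical), "blocks allocated")   # still 3

snapshots.clear()                              # pre-dedupe snapshot expires
print(len(used_blocks()), "of", len(physical), "blocks allocated")   # now 2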

Can anyone refute this? I see this as a huge flaw in the NetApp primary storage dedupe solution, since data integrity/recovery is the primary goal of most production storage solutions, with deduplication and optimization a close second.

Re: EMC vs NetApp Deduplication - Fact Only Please

That is correct, as currently snapshots are not 'de-dupe aware'. Whether this will change in the near future is beyond my knowledge.

This is arguably a drawback, but not a show-stopper IMHO:

- if you don't de-dupe a volume with existing snapshots, you are not gaining anything

- if you de-dupe a volume with existing snapshots, you are not gaining anything until your 'pre-de-dupe' snapshots expire & get deleted.

Ideally de-dupe should be run when you load a lot of data into a freshly created volume with no snapshots & after that you start taking snapshots (e.g. as a SnapMirror baseline).

Yes, it means a bit of jiggery-pokery here & there, but at the end of the day NetApp A-SIS is the only widely deployed primary-data deduplication operating at the block level.

Regards,
Radek

Re: EMC vs NetApp Deduplication - Fact Only Please

Thanks for confirming. I thought of the second solution you mentioned after I finished that post. It's not a huge deal-breaker, since, as you mentioned, you just need to wait for the pre-dedupe snapshots to roll off.

One additional argument against dedupe on primary storage that I've seen in my environment is that dedupe processes take up production resources. As our environment has grown, we've had to redistribute dedupe jobs so that CPU utilization doesn't spike into the 90-100% range. It's yet one more thing to monitor to ensure the "optimization" built into Data ONTAP doesn't undermine our core capability of serving I/O to hosts in a timely manner.
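
For what it's worth, here is the kind of back-of-the-envelope staggering we do. The volume names and the 22:00-05:00 off-peak window are made-up examples, and the real schedules are of course configured on the filer, not in Python:

volumes = ["vol_vmware", "vol_cifs", "vol_nfs", "vol_exchange", "vol_sql"]
window_start, window_hours = 22, 7   # run jobs between 22:00 and 04:59

for i, vol in enumerate(volumes):
    hour = (window_start + i % window_hours) % 24
    print(f"{vol}: start dedupe at {hour:02d}:00")   # one job per hour slot

Spreading the start times keeps any single hour from running several CPU-heavy dedupe jobs at once.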