ONTAP Discussions

EMC vs NetApp Deduplication - Fact Only Please

wade_pawless
35,479 Views

I'm writing this in hopes of becoming better educated on the facts, not opinions, about the pros and cons, as well as the similarities and differences, between NetApp's "current" deduplication technology and EMC's "current" deduplication technology.

I have deployed NetApp solutions in my previous environment, but my current workplace utilizes EMC. On a personal note, I prefer NetApp hands down. However, my responsibility is to define the capabilities and make decisions based on the facts.

I have read a little about EMC and NetApp fighting over Data Domain around 2009. When talking to EMC, their deduplication recommendation is Data Domain.

EMC claims that their Data Domain product provides real-time deduplication at a 4 KB block level, which they claim is much more efficient. EMC mentioned that NetApp does deduplication at around 128 KB. The bottom line is a claim of better performance. Thoughts?

Does the current version of ONTAP utilize SIS or deduplication?

EMC claims SIS is a "limited form of deduplication". http://www.datadomain.com/resources/faq.html#q5

Please clarify the facts regarding the type of deduplication utilized by NetApp as well as any thoughts to the comments above. Facts only please.

As a side note, I'm currently comparing the NetApp V-Series with EMC's product lines. The goal is to place a single device in front of all SANs to provide snapshot and deduplication capability for all SAN data. Over time we'll be bringing multiple storage vendors into our environment. The V-Series is a one stop shop. We can virtualize all data regardless of vendor, provide snapshot capability, dedupe the data, and simplify management. EMC's solution requires me to purchase two new devices, an EMC VG2 and an EMC Data Domain. Even with EMC's recommendations, there is no capability to snapshot other vendors' data; the Data Domain appliance will only provide dedupe for all vendors. The EMC VG2 is being recommended to consolidate multiple file storage servers into CIFS as well as to provide NFS for virtualization. So, in the end, EMC is saying buy one appliance from us to provide dedupe and buy a second appliance to provide NAS capabilities. Wait a minute, all of these features are built into a single NetApp... Thoughts?

30 REPLIES

aborzenkov
32,974 Views

EMC claims that their Data Domain product provides real-time deduplication at a 4 KB block level, which they claim is much more efficient. EMC mentioned that NetApp does deduplication at around 128 KB. The bottom line is a claim of better performance. Thoughts?


NetApp deduplicates at the 4 KB block level - this is a fact. AFAIK Data Domain uses variable block lengths.

Does the current version of ONTAP utilize SIS or deduplication?


The feature is officially called A-SIS, but everyone just says "dedupe" when speaking about it. Both terms are used interchangeably as far as I can tell.
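
For what it's worth, enabling it is only a couple of commands on a 7-Mode filer - roughly like this (the volume name is just an example, and on older releases you may also need the a_sis license installed first):

sis on /vol/vmware_ds          (enable dedupe on the volume)
sis start -s /vol/vmware_ds    (dedupe the data that is already in the volume)
sis status /vol/vmware_ds      (check progress)
df -s /vol/vmware_ds           (show how much space was saved)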

EMC claims SIS is a "limited form of deduplication". http://www.datadomain.com/resources/faq.html#q5

NetApp A-SIS works at the block level, not at the file level. So the above simply has nothing to do with NetApp.

radek_kubka
32,974 Views

Well, to me the key differentiator is that NetApp de-dupe can be (& often is) used for primary data (e.g. VMware datastores), whilst Data Domain is aimed purely at secondary data (i.e. backup images) - unless they changed their positioning recently.

Regards,

Radek

kevingolding
32,974 Views

NetApp called their deduplication Advanced Single Instance Storage, which means that A-SIS IS deduplication.

A-SIS can be configured on either primary or secondary storage volumes and is agnostic to the type of data in those volumes. There is only one other vendor (unless things have changed recently) that can provide dedupe on primary data; everyone else does it on the secondary data (backups)... so not that useful really.

Deduplicated volumes are mirrored (using SnapMirror) in their deduplicated state, so replication is more lightweight.

A-SIS runs at the block level, so Word documents and Excel documents don't matter; it doesn't give a monkey's what the blocks are part of.

It isn't perfect, but reducing the primary storage footprint is, to my mind, what we want to talk about. Who cares if the backups are smaller? We want more usable space out of our terabytes...

martinmeinl
32,973 Views

Hi all,

before you validate a solution or technique, please consider in your decision: what are you looking for?

Do you need real-time deduplication or post-process (asynchronous) deduplication? Will primary data or backup data be the target?

A-SIS works only as a job-driven (scheduled) process, and it consumes a lot of filer resources during a deduplication run.

During a deduplication run the filer's performance is degraded.

Using A-SIS for backup data will only show the space reduction after post-processing (so initially it consumes all of the space, before reduction).

Data Domain from EMC is real-time, inline deduplication; it is targeted at secondary (backup) data and optimized for this flavor of data.

Data Domain in this case only consumes the reduced capacity on the back end and saves large amounts of capacity in the storage system.

Using Data Domain as online storage will show weak performance for transactional I/O, but good performance for streaming I/O.

EMC also delivers online storage deduplication in their Celerra product code, for file services only, and it likewise uses a post-process deduplication (an automated process).

EMC also combines compression with this to gain better reduction, since they only use single instancing.

The benefit of this concept seems to be lower consumption of CPU resources when running the deduplication.

So in fact it's not easy choosing the right vendor and concept, as you first need to concentrate on the benefits of each solution and method and the trade-offs of each concept.

You will never get all-in-one from any vendor --> you can never be in two places at once (unless you get cloned) 😉

inx_russel
32,973 Views

A key differentiator between NetApp and most other dedupe options is how they determine whether data is duplicate or not. Most vendors use "safe enough" hash-based matching; while false positives are rare, as the amount of data being deduped increases, so does the risk of hash collisions. NetApp, being post-process, can afford a more complex matching algorithm: it only uses hashes to detect /possible/ duplicate data, and it then compares those possible matches byte by byte... which is far safer.

amiller_1
32,973 Views

So... there are lots of answers here given the varying array of products... no "ONTap to rule them all" model (and I don't claim to know them all... just the general topics).

For primary storage....

  • Symmetrix and/or V-Max -- nope, no dedup.
  • Clariion -- nope again.
  • Celerra -- yep, and compression too, but there are different variations that I frankly don't understand. Googling "celerra deduplication" seems like the best path... 3rd hit is an EMC white paper... albeit 7 months old.
    • If you do have a Symmetrix or Clariion and have it front-ended with a NAS gateway (i.e. what a Celerra really is....a Celerra = Clariion + NAS gateway bundled together), there may be some dedup possibilities on the NAS protocol side.

For backups....

  • Data Domain -- yep...good tech IMHO and if not for that little bidding war would have been a NetApp product after all.
  • Avamar -- yep again....source side dedup which has some interesting implications.

I'm sure there's more out there but I can't claim to be an EMC expert....I just try to make sure I know the general lay of the land though.

chriszurich
33,321 Views

One Dedupe drawback I've run across while reading and learning about NetApp (I'm a former EMC guy...) is that deduplication on a volume which has snapshots turned on negates the dedupe savings.  I've read that the following happens:

Scenario

A volume is created and populated with data (a LUN or an NFS/CIFS share, etc.). As the data changes, snapshots are generated for the changed blocks. Let's say that deduplication is then turned on after a certain period of time. At the point where deduplication is turned on, all of the blocks which are consolidated via the dedupe process are still held within the snapshots taken prior to the deduplication process being kicked off. This essentially negates the deduplication savings, since the deduplicated blocks still exist on your primary storage within the prior snapshots. In order to fix this conundrum you must delete all snapshots taken before the deduplication process is kicked off. This of course has the drawback of destroying your recovery capabilities for that volume.
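
For illustration, I believe the clean-up described above would look roughly like this on the filer (volume and snapshot names are made up):

snap list dedupe_vol               (see which snapshots pre-date the dedupe run)
snap delete dedupe_vol nightly.6   (delete a snapshot taken before dedupe was enabled)
df -s /vol/dedupe_vol              (the savings only show up once those old snapshots are gone)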

Can anyone refute this? I see this as a huge flaw in the NetApp primary storage dedupe solution since data integrity/recovery is the primary goal of most production storage solutions with deduplication and optimization as a close second.

radek_kubka
32,974 Views

That is correct, as currently snapshots are not 'de-dupe aware'. Whether that will change in the near future is beyond my knowledge.

This is arguably a drawback, but not a show-stopper IMHO:

- if you don't de-dupe a volume with existing snapshots, you are not gaining anything

- if you de-dupe a volume with existing snapshots, you are not gaining anything until your 'pre-de-dupe' snapshots expire & get deleted.

Ideally de-dupe should be run when you load a lot of data into a freshly created volume with no snapshots & after that you start taking snapshots (e.g. as a SnapMirror baseline).
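As a rough sketch of that order of operations on a 7-Mode filer (aggregate, volume and filer names are all made up):

vol create new_vol aggr1 500g        (fresh volume, no snapshots yet)
snap sched new_vol 0 0 0             (no scheduled snapshots while loading data)
  ...migrate / load the data...
sis on /vol/new_vol
sis start -s /vol/new_vol            (dedupe everything that was just loaded)
snap sched new_vol 0 7 0             (only now start keeping nightly snapshots)
snapmirror initialize -S filer1:new_vol filer2:new_vol_mirror   (SnapMirror baseline after dedupe; run on the destination against a restricted destination volume)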

Yes, it means a bit of jiggery-pokery here & there, but at the end of the day NetApp A-SIS is the only widely deployed primary data deduplication operating at the block level.

Regards,
Radek

chriszurich
32,973 Views

Thanks for confirming.  I thought of the second solution you mentioned after I finished that post.  It's not a huge deal breaker since as you mentioned you just need to wait for the pre-dedupe snapshots to roll off.

One additional argument against dedupe on primary storage which I've seen within my environment is that dedupe processes take up production resources.  As our environment has grown we've had to redistribute dedupe jobs so that CPU utilization doesn't spike into the 90-100% range.  It's yet one more thing to monitor to ensure the "optimization" built into Data ONTAP doesn't undermine our core capability of serving I/O in a timely manner to hosts.

amiller_1
11,099 Views

'Twould be wonderful if it were a non-issue... but practically speaking it is usually feasible to schedule the dedupe runs before the snapshots (at least when doing daily snapshots). I just did that earlier today, actually, while finishing up a project (enabled dedupe on some new volumes and then set a nightly snapshot schedule 2 hours after the dedupe run).
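
In case it's useful to anyone, the scheduling part is just a couple of commands - something like this for dedupe at 22:00 with the nightly snapshot at midnight (volume name is made up):

sis config -s sun-sat@22 /vol/vmdata01    (run dedupe every night at 22:00)
snap sched vmdata01 0 7 0                 (keep 7 nightly snapshots, taken at midnight by default)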

triantos
10,108 Views

OK, so let's talk about this. While it is true that deduplication does not work on snapshotted data, it is not accurate that if you have snapshots you will not see dedupe savings. You will; it's just that you will not see some of the savings immediately.

As with any other backup, snapshots too have retention periods, which means they are not kept indefinitely on a system. What that means is that over time, as snapshots taken *prior* to deduplication are released/removed/deleted, the blocks they were holding are freed and those savings are realized.
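
You can actually watch this happen as the old snapshots age out - something like (volume name made up):

snap list prod_vol        (shows which pre-dedupe snapshots are still being held)
df -s /vol/prod_vol       (the saved/%saved figures grow as those snapshots expire)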

Hope that makes sense

mheimberg
11,445 Views

This is something I will never understand:

....

The V-Series is a one stop shop. We can virtualize all data regardless of vendor, provide snapshot capability, dedupe the data, and simplify management.

Buy a NetApp because of its (obviously) undisputed simplicity of management, superior snapshot capabilities, dedupe for primary *and* backup data, not to forget the true unified architecture... but then the data is not trusted to be hosted on NetApp storage; instead you give away a lot of potential for further simplifying management, backup and restore, update management, etc.

Why do people go only halfway? This will always be a mystery to me.

Mark

thomas_glodde
11,445 Views

Because you might have spent a few grand on your old storage and kinda want to "reuse" the hard disks. That's when you might use a V-Series.

Besides, the V-Series is sometimes used NAS-only on existing high-end FC SANs to properly cover file services.

mheimberg
11,446 Views
Because you might have spent a few grand on your old storage and kinda want to "reuse" the hard disks. That's when you might use a V-Series.

Yes, of course. But sooner or later your disks are end of warranty, EOL and out of support. Then comes the time when I am glad that I can also attach NetApp disk shelves to a V-Series and smoothly migrate to a NetApp-only environment.

The point is, Wade wrote: "Over time we'll be bringing multiple storage vendors into our environment" - it is this multi-vendor, multi-anything approach that I object to.

But of course, there might be companies that really need the "best of breed" in every domain - just because I have none in my portfolio doesn't mean they don't exist. But you may also agree that sometimes customers tend toward over-engineered solutions, and then there must be someone questioning the approach...

Besides, the V-Series is sometimes used NAS-only on existing high-end FC SANs to properly cover file services.

To me this is a very funny point: you need NAS access, but unfortunately you only have this super powerful FC SAN in your datacenter, because the people selling that FC SAN always told you that you would never need NAS and that this whole "unified architecture" babble is only marketing crap from a competitor? Hehe, good thing we have the V-Series! And then... see above...

Mark

thomas_glodde
11,448 Views

Markus,

you are preaching to the evangelist 😉

It's sort of a nice entry point though, as you can tell the potential customer they can have the features NOW and you can sell them the disks later - kind of a smooth migration.

IHAC (I have a customer) who sold his soul to HP for clients, servers, switches, printers and storage. He is not willing to add NetApp storage, as he doesn't want to bother with compatibility matrices and two companies finger-pointing over support cases when adding SAN applications, regardless of 0h recovery functions etc. But we could convince him to at least buy a V-Series for proper NAS functionality, as well as to give the unique NetApp features at least a try in his SAN environment.

brendanheading
10,680 Views

This seems like as good a place as any to pose an interesting question.

What's the hardware difference between a V-Series and the corresponding FAS? E.g., what's the difference between a V3100 and a FAS3100? Surely the FC interconnect between the controllers and the shelves is the same, especially given that the V-Series can use native disk shelves.

I can understand why NetApp would want to carefully control the connection of third-party shelves, but surely this could be accomplished with a license key?

thomas_glodde
10,454 Views

brendan,

the V-Series is colored black and comes without disk shelves.

kind regards

thomas

brendanheading
10,454 Views

It sounds a bit like I suspected: the regular filers are crippled in software so as not to support connection of the third-party shelves that the V-Series supports.

wade_pawless
10,109 Views

What's the hardware difference between a V-Series and the corresponding FAS? E.g., what's the difference between a V3100 and a FAS3100?

The hardware is the same. I had this discussion with our NetApp engineer and sales team. The only difference is the license that enables the storage virtualization capability within the V-Series.

I asked the question: can I purchase the V-Series but not purchase the virtualization license up front? That way I could put FAS capabilities at my DR site at a cheaper price, and then when more funding is available I could purchase the virtualization license to enable storage consolidation at my DR site. The answer was no.
