ONTAP Discussions

The fastest deduplication on the planet is performed by .....

cebulrdcis

Falconstor/Sun wins speediest dedupe race

So who's suffering from ingestion?

Comment: The fastest deduplication on the planet is performed by an 8-node Sun cluster using Falconstor deduplication software, according to a vendor-neutral comparison.

Backup expert W Curtis Preston has compared the deduplication performance of different vendors' products. He uses suppliers' own performance numbers and disregards multi-node deduplication performance if each node has its own individual index.

Preston says that a file stored on one system, with no previous history of that file, would not be deduplicated against the same file stored on other deduplication systems in the same set, because each one is blind to what the others store.

A Data Domain array is an example of a set of deduplication systems that do not share a global index. Preston says: "NetApp, Quantum, EMC & Dell, (also) have only local dedupe... Diligent, Falconstor, and Sepaton all have multi-node/global deduplication."

Nodes in a 5-node Sepaton deduplication array, for example, share a global index and the nodes co-operate to increase the deduplication ratio. In this situation a multi-node deduplication setup acts as a single, global deduplication system.

Preston compares the rated speeds for an 8-hour backup window, looking at the data ingest rate and the deduplication rate. As some vendors deduplicate inline, at data ingest time, and others deduplicate after data ingestion, known as post-process, these two numbers may well differ.

He compared deduplication speeds from EMC (Disk Library), Data Domain, FalconStor/Sun, IBM/Diligent, NetApp, Quantum/Dell and Sepaton/HP. (HP OEMs the Sepaton product.)

The Falconstor/Sun combo topped the ingest scores at 11,000MB/sec using an 8-node cluster and Fibre Channel drives. It was followed by Sepaton/HP with 3,000MB/sec and then EMC with 1,100MB/sec. Quantum/Dell ingested at 800MB/sec with deduplication deferred to post-process and not run inline.

NetApp was the slowest, ingesting data at 600MB/sec. The configuration was a 2-node one, but each node deduplicated data on its own. Quantum/Dell would ingest at 500MB/sec if deduplication were run inline.

The fastest deduplication engine was the Falconstor/Sun one, rated at 3,200MB/sec. It was followed by Sepaton/HP at 1,500MB/sec, then by IBM/Diligent at 900MB/sec, Data Domain at 750MB/sec with EMC trailing at 400MB/sec. Preston couldn't find any NetApp deduplication speed numbers.

Preston also looked at the numbers for a 12-hour backup window. If vendors have an ingest rate that is more than twice their deduplication rate, they would need more than 24 hours to ingest and then deduplicate 12 hours' worth of ingested data. This means their effective ingest rate for a 12-hour backup run can be no more than twice their deduplication rate.

He also has a discussion of restore speeds for deduplicated data, known as inflation or rehydration. The sources for his numbers and the products used are listed on his blog.

This is the first comprehensive and vendor-neutral deduplication speed comparison, and is well worth a look. ®

3 REPLIES

cebulrdcis

http://www.backupcentral.com/index.php?option=com_content&task=view&id=229&Itemid=47

Performance Comparison of Deduped Disk Vendors

Written by W. Curtis Preston 
Thursday, 05 March 2009

This blog entry is, to my knowledge, the first article or blog entry to compare all these numbers side by side.  I decided to do this comparison while writing my response to Scott Waterhouse's post about how wonderful the 3DL 4000 is, but then I realized that this part was big enough that it deserved a separate post.  Below is a table that compares backup and dedupe performance of the various dedupe products.


First, let's talk about the whole "global dedupe" thing, because it's really germane to the topic at hand.  Global dedupe only comes into play with multi-node systems.  A quick definition of global dedupe: it is when a dedupe system dedupes everything against everything, regardless of which head/node it arrives at.  So if you have a four-node appliance, and the same file gets backed up to node A and node B, the file will only get stored once.  Without global dedupe (otherwise known as local dedupe), the file would get stored twice.
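To make that distinction concrete, here is a minimal, hypothetical sketch in Python (not any vendor's actual implementation, just a fingerprint index) of why the same block lands on disk twice with local dedupe but only once with global dedupe:

    # Hypothetical sketch: a dedupe index is conceptually a map from a block
    # fingerprint to the stored block. Local dedupe keeps one index per node;
    # global dedupe shares a single index across every node in the cluster.
    import hashlib

    def fingerprint(block: bytes) -> str:
        return hashlib.sha256(block).hexdigest()

    class LocalDedupeNode:
        """Each node only knows about data sent to that node."""
        def __init__(self):
            self.index = {}
        def ingest(self, block: bytes) -> bool:
            fp = fingerprint(block)
            if fp in self.index:
                return False          # duplicate, but only within this node
            self.index[fp] = block
            return True               # block physically stored on this node

    class GlobalDedupeCluster:
        """All nodes consult one shared index, so a block is stored at most once."""
        def __init__(self):
            self.shared_index = {}
        def ingest(self, node_id: int, block: bytes) -> bool:
            fp = fingerprint(block)
            if fp in self.shared_index:
                return False          # already stored, whichever node saw it first
            self.shared_index[fp] = block
            return True

    data = b"the same file, backed up twice"
    node_a, node_b = LocalDedupeNode(), LocalDedupeNode()
    print(node_a.ingest(data), node_b.ingest(data))          # True True  -> stored twice
    cluster = GlobalDedupeCluster()
    print(cluster.ingest(0, data), cluster.ingest(1, data))  # True False -> stored once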

Let's talk about Data Domain, as they currently own the dedupe market hands down.  But they have local dedupe, not global dedupe.  (This is not to be confused with Data Domain's term global compression, which is what they called dedupe before there was a term for it.)  When you hit the performance limit of a single Data Domain box, their answer is to buy another box, and two DD boxes sitting next to each other have no knowledge of each other; they do not dedupe data together; they don’t share storage; you cannot load balance backups across them, or you will store each backup twice.  You send Exchange backups to the first box and Oracle backups to the second box.  If your Oracle backups outgrow the second box, you’ll need to move some of them to a third box.  It is the Data Domain way.  They are telling me they'll have global dedupe in 2010, but they don't have it yet.

What Data Domain is doing, however, is shipping the DDX “array,” which is nothing but markitecture.  It is 16 DDX controllers in the same rack.  They refer to this as an “array” or “an appliance” which can do 42 TB/hr, but it is neither an array nor an appliance.  It is 16 separate appliances stacked on top of each other.  It’s only an array in the general sense, as in “look at this array of pretty flowers.”  I have harped on this “array” since the day it came out and will continue to do so until Data Domain comes out with a version of their OS that supports global deduplication.  Therefore, I do not include this "array's" performance in the table at the end of this blog article.

When talking about the DDX "array," a friend of mine likes to say, "Why stop at 16?  If you're going to stack a bunch of boxes together and call them an appliance, why not stack 100 of them?  Then you could say you have an appliance that does 50,000 MB/s!  It would be just as much of an appliance as the DDX is."  I have to agree.

In contrast, Diligent, Falconstor, and SEPATON all have multi-node/global deduplication.  Diligent supports two nodes, Falconstor eight, and SEPATON five.  So when Diligent says they have “a deduplication appliance” that dedupes 900 MB/s with two nodes, or SEPATON says their VTL can dedupe 1500 MB/s with five nodes, or Falconstor says they can dedupe 3200 MB/s with eight nodes, I agree with those statements – because all data is compared to all data regardless of which node/head it was sent to.  (I'm not saying I've verified their numbers; I'm just saying that I agree that they can add the performance of their boxes together like that if they have global dedupe.)

By the way, despite what you may have heard, I’m not pushing global dedupe because I want everything compared to everything, such as getting Oracle compared with Exchange.  I just want Exchange always compared to Exchange, and Oracle to Oracle – regardless of which head/node it went to.  I want you to be able to treat deduped storage the same way you treat non-deduped storage or tape; just send everything over there and let it figure it out.

NetApp, Quantum, EMC & Dell have only local dedupe.  That is, each engine will only know about data sent to that engine; if you back up the same database or filesystem to two different engines, it will store the data twice.  (Systems with global dedupe would store the data only once.)  I therefore do not refer to two dedupe engines from any of these companies as "an appliance."  I don't care if they're in the same rack or managed via a single interface; they're two different boxes as far as dedupe is concerned.

Backup and Dedupe Speed

No attempt was made to verify any of these numbers.  If a vendor is flat out lying or if their product simply doesn't work, this post is not going to talk about that.  (If I believed the FUD I heard, I'd think that none of them worked.)  I just wanted to put into one place all the numbers from all the vendors of what they say they can do.

For the most part, I used numbers that were published on the company's website.  In the case of EMC, I used an employee (although unofficial) blog.  Then I applied some math to standardize the numbers.  In a few cases, I have also used numbers supplied to me via an RFI that I sent to vendors.  If the vendor had global/multi-node/clustered dedupe, then I gave the throughput number for their maximum supported configuration.  But if they don’t have global dedupe, then I give the number for one head only, regardless of how many heads they may put in a box and call it “an appliance.”

For EMC, I used the comparison numbers found on this web page.  EMC declined to answer the performance questions of my RFI, and they haven't officially published dedupe speeds, so I had to use the performance numbers published in this blog entry on Scott Waterhouse's blog for dedupe speed.  He says that each dedupe engine can dedupe at 1.5 TB/hr.  The 4106 is one Falconstor-based engine on the front and one Quantum-based dedupe engine on the back.  The 4206 and the 4406 have two of each, but each Falconstor-based VTL engine and each Quantum-based dedupe engine is its own entity and they do not share dedupe knowledge.  I therefore divided the numbers for the 4206 and the 4406 in half.  The 4406's 2200 MB/s divided by two is the same as the 4106 at 1100 MB/s.  (The 4206, by that math, is slower.)  And 1.5 TB/hr of dedupe speed translates into 400 MB/s.

Data Domain publishes their performance numbers in this table.  Being an inline appliance, their ingest rate is the same as their dedupe rate.  They publish 2.7 TB/hr, or 750 MB/s, for their DD690, but say that this requires OST.  It's still the fastest number they publish, so that's what I put here.  I would have preferred to use a non-OST number, but this is what I have.

Exagrid's numbers were taken from this data sheet, where they specify that their fastest box can ingest 680 GB/hr (roughly 188 MB/s).  I do not have their dedupe rate numbers on me right now.  I'll update it when I have them.  They use local dedupe, so I am only including numbers from one box.

For Falconstor, I used this data sheet where they state that each node can back up data at 1500 MB/s, or 5 TB per hour, and that they support 8 nodes in a deduped cluster.  They have not published dedupe speed numbers, but they did respond to my RFI.  They said that each node could do 250 MB/s if you were using SATA drives, and 400 MB/s if you were using Fibre Channel drives.  I used the fastest number and noted in the table that it required FC drives.  (That will certainly affect cost.)

IBM/Diligent says here that they can do 450 MB/s per node, and they support a two-node cluster.  They are also an inline box, so their ingest and dedupe rates will be the same.  One important thing to note is that IBM/Diligent requires FC disks to get these numbers.  They do not publish SATA-based numbers.  That makes me wonder about all these XIV-based configs that people are looking at and what performance they're likely to get.

NetApp has this data sheet that says they do 4.3 TB/hr with their 1400.  However, this is like the EMC 4400 where it's two nodes that don't talk to each other from a dedupe perspective, so I divide that number in half to make 2150 GB/hr, or just under 600 MB/s.  They do not publish their dedupe speeds, but I have asked for a meeting where we can talk about them.

Quantum publishes this data sheet that says they can do 3.2 TB/hr in fully deferred mode and 1.8 TB/hr in adaptive mode.  (Deferred mode is where you delay dedupe until all backups are done, and adaptive dedupe runs while backups are coming in.)  I used the 3.2 TB/hr for the ingest speed and the 1.8 TB/hr for the dedupe speed, which translates into 880 and 500 MB/s, respectively.

Finally, with SEPATON, I used this data sheet where they say that each node has a minimum speed of 600 MB/s, and this data sheet where they say that each dedupe node can do 25 TB/day, or 1.1 TB/hr, or 300 MB/s.  Since they support up to 5 nodes in the same dedupe domain, I multiplied that times 5 to get 3000 MB/s of ingest and 1500 MB/s of dedupe speed.
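For reference, the standardization math in the paragraphs above can be reproduced in a few lines. This is just a sketch of the arithmetic using decimal units (1 TB = 1,000,000 MB), which is what matches the rounded figures quoted here; the inputs are the vendors' own claims.

    # Sketch of the unit conversions used above (decimal units; the text rounds
    # its quoted values, so expect small differences).

    def tb_per_hr_to_mb_s(tb_per_hr: float) -> float:
        """1 TB/hr = 1,000,000 MB spread over 3600 seconds."""
        return tb_per_hr * 1_000_000 / 3600

    # Local dedupe, multi-node box: take the combined claim and halve it.
    emc_ingest    = 2200 / 2                      # 1100 MB/s (4406 -> one engine)
    emc_dedupe    = tb_per_hr_to_mb_s(1.5)        # ~417 MB/s, quoted as ~400
    netapp_ingest = tb_per_hr_to_mb_s(4.3) / 2    # ~597 MB/s, quoted as ~600

    # Global dedupe, clustered box: take the per-node claim and multiply.
    falconstor_ingest = tb_per_hr_to_mb_s(5) * 8  # ~11111 MB/s, quoted as 11000
    falconstor_dedupe = 400 * 8                   # 3200 MB/s (FC disk)
    sepaton_ingest    = 600 * 5                   # 3000 MB/s
    sepaton_dedupe    = tb_per_hr_to_mb_s(25 / 24) * 5   # ~1447 MB/s, quoted as 1500

    # Inline boxes: ingest and dedupe are the same number.
    data_domain = tb_per_hr_to_mb_s(2.7)          # 750 MB/s exactly

    # Quantum/Dell, post-process: deferred ingest vs. adaptive dedupe.
    quantum_ingest = tb_per_hr_to_mb_s(3.2)       # ~889 MB/s, quoted as 880
    quantum_dedupe = tb_per_hr_to_mb_s(1.8)       # 500 MB/s

    print(round(emc_dedupe), round(netapp_ingest), round(falconstor_ingest))  # 417 597 11111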

Backup & dedupe rates for an 8-hour backup window

Vendor         | Ingest Rate (MB/s) | Dedupe Rate (MB/s) | Caveats
EMC            | 1100               | 400                | 2-node data cut in half (no global dedupe)
Data Domain    | 750                | 750                | Max performance with OST only; NFS/CIFS/VTL performance approx. 25% less
Exagrid        | 188                | Not avail.         | 1 node only (no global dedupe)
Falconstor/Sun | 11000              | 3200               | 8-node cluster, requires FC disk
IBM/Diligent   | 900                | 900                | 2-node cluster, requires FC disk
NetApp         | 600                | Not avail.         | 2-node data cut in half (no global dedupe)
Quantum/Dell   | 880                | 500                | Ingest rate assumes fully deferred mode (would be 500 otherwise)
SEPATON/HP     | 3000               | 1500               | 5 nodes with global dedupe


However, many customers that I've worked with are backing up more than 8 hours a day; they are often backing up 12 hours a day.  If you're backing up 12 hours a day, and you plan to dedupe everything, then the numbers above change.  (This is because some vendors have a dedupe rate that is less than half their ingest rate, and they would need more than 24 hours to dedupe 12 hours of data.)  If that's the case, what's the maximum throughput each box could take for 12 hours and still finish its dedupe within 24 hours?  (I'm ignoring maintenance windows for now.)  This means that the ingest rate can't be any faster than twice the dedupe rate, if the dedupe is allowed to run while backups are coming in.
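As a worked check (a sketch only, using the 8-hour-table numbers above as inputs), the cap is simply min(ingest, 2 x dedupe) when dedupe is allowed to run alongside a 12-hour backup inside a 24-hour day:

    # Effective ingest for a 12-hour backup window: the box has 24 hours of
    # dedupe time to absorb 12 hours of ingest, so sustained ingest can't
    # exceed twice the dedupe rate (assuming dedupe runs during backups).

    def effective_ingest(ingest_mb_s, dedupe_mb_s, backup_hours=12, day_hours=24):
        cap = dedupe_mb_s * (day_hours / backup_hours)   # 2x for 12-in-24
        return min(ingest_mb_s, cap)

    print(effective_ingest(11000, 3200))   # Falconstor/Sun -> 6400 (dedupe-bound)
    print(effective_ingest(1100, 400))     # EMC            -> 800  (dedupe-bound)
    print(effective_ingest(750, 750))      # Data Domain    -> 750  (inline, unchanged)
    print(effective_ingest(3000, 1500))    # SEPATON/HP     -> 3000 (unchanged)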

This meant I had to change the Quantum number, because the original number assumed that I was deferring dedupe until after the backup was done.  If I did that, I would only have 12 hours to dedupe my 12-hour backup.  Therefore, I switched to its adaptive mode, where the dedupe happens while the backup is coming in.

Backup & dedupe rates for a 12-hour backup window

Vendor         | Ingest Rate (MB/s) | Dedupe Rate (MB/s) | Caveats
EMC            | 800                | 400                | 2-node data cut in half (no global dedupe)
Data Domain    | 750                | 750                | Max performance with OST only; NFS/CIFS/VTL performance approx. 25% less
Exagrid        |                    |                    |
Falconstor/Sun | 6400               | 3200               | 8-node cluster, requires FC disk
IBM/Diligent   | 900                | 900                | 2-node cluster, requires FC disk
NetApp         | 600                | Not avail.         | 2-node data cut in half (no global dedupe)
Quantum/Dell   | 500                | 500                | Had to switch to adaptive mode
SEPATON/HP     | 3000               | 1500               | 5 nodes with global dedupe


Dedupe everything?

Some vendors will probably want to point out that my numbers for the 12-hour window only apply if you are deduping everything, and not everybody wants to do that.  Not everything dedupes well enough to bother deduping it.  I agree, and so I like dedupe systems that support policy-based dedupe.  (So far, only post-process vendors allow this, BTW.)  Most of these systems support doing this only at the tape level.  For example, you can say to dedupe only the backups that go to these tapes, but not the backups that go to those tapes.  The best that I've seen in this regard is SEPATON, where they automatically detect the data type.  You can tell a SEPATON box to dedupe Exchange, but not Oracle.  But I don't want to do tables that say "what if you were only deduping 75%, or 50%," etc.  For comparison's sake, we'll just say we're deduping everything.  If you're deduping less than that, do your own table.


Restore Speed

When data is restored, it must be re-hydrated, re-duped, or whatever you want to call it.  Most vendors claim that restore performance is roughly equivalent to backup performance, or maybe 10-20% less.

One vendor that's different, if you press them on it, is Quantum, and by association, EMC and Dell.  They store data in its deduped format in what they call the block store.  They also store the data in its original un-deduped, or native, format in what they call the cache.  If restores are coming from the cache, their speed is roughly equivalent to that of the backup.  However, if you are restoring from the block pool, things can change significantly.  I'm being told by multiple sources that performance can drop by as much as 75%.  They made this better in the 1.1 release of their code (improving it to 75%), and will make it better again in a month, and supposedly much better in the summer.  We shall see what we shall see.  Right now, I see this as a major limitation of this product.

Their response is simply to keep things in the cache if you care about restore speed, and that you tend to restore more recent data anyway.  Yes, but just because I'm restoring the filesystem or application to the way it looked yesterday doesn't mean I'm only restoring from backups I made yesterday.  I'm restoring from the full backup from a week ago, and the incrementals since then.  If I only have room for one day of cache, only the incremental would be in there.  Therefore, if you don't want to experience this problem, I would say that you need at least a week of cache if you're using weekly full backups.  But having a week of cache costs a lot of money, so I'm back to it being a major limitation.
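To put rough numbers on that argument, here is a hypothetical sizing sketch; the backup sizes are made up purely for illustration, and only the weekly-full/daily-incremental logic comes from the text above.

    # Hypothetical sizes, chosen only to illustrate the cache argument.
    full_tb = 10.0           # weekly full backup
    incr_tb = 1.0            # each daily incremental
    days_since_full = 6      # restoring just before the next full runs

    restore_set_tb = full_tb + incr_tb * days_since_full   # data a restore must read
    one_day_cache_tb = incr_tb                              # only yesterday's backup fits

    print(restore_set_tb)                      # 16.0 TB needed for the restore
    print(restore_set_tb - one_day_cache_tb)   # 15.0 TB re-hydrated from the block pool
    # A week of cache (full_tb + 6 * incr_tb = 16 TB here) avoids the slow path,
    # which is exactly the extra disk the text says costs a lot of money.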

Summary

Well, there you go!  The first table that I've seen that summarizes the performance of all of these products side-by-side.  I know I left off a few, and I'll add them as I get numbers, but I wanted to get this out as soon as I could.

jeredfloyd

As I said over at Curtis' blog, this is a really great resource, although there are two concerns that I have.

First, all the numbers are vendor-published.  Deduplication is particularly prone to the "best case scenario" problem -- many of these vendors can dedupe endless streams of zeroes (or any other repeating pattern) much more quickly than random data!

Second, dedupe is for more than just backup.  Dedupe is a great way of cutting costs in VTL and D2D backup, but integrating deduplication much earlier in the data lifecycle can also cut costs significantly.  Performance and functionality requirements are very different in dedupe for backup vs. dedupe for archive, and so none of the products listed really serve the archive market well.

Right now data is written to tier 1 primary storage at a cost of up to $30 to $50/GB, and then backed up at an aggregate total cost of several dollars more.  Much of this data can be moved much earlier to an archive primary storage tier at $3/GB or less, and an effective cost even lower with deduplication.  Replication can reduce or eliminate the need for backup of this tier entirely.  When you're talking about petabytes of data, you can't always afford to be down for the restore period.   With economic pressures, any business would be remiss not to look at deploying an effective archive tier.

At Permabit we developed our Enterprise Archive product specifically to serve these needs, and believe we have developed the only truly scalable deduplication solution for archive data, while also providing levels of data protection far beyond what is available with RAID.  I talk a little more about the underlying economics over at my blog in the article at http://blog.permabit.com/?p=77.

Regards,
  Jered Floyd
  CTO, Permabit

rickymartin

Jered,

I'm not entirely sure that hawking your products on another vendor's community forums is what I would consider to be "good form", especially after having called NetApp "Slimy" in comments on a news article.

In any case, most of the economics of archiving don't really apply to NetApp platforms. I wrote a blog post about that around a month ago, which you can find here:

http://communities.netapp.com/people/martinj/blog

Regards

John Martin
