A Passion for Open Source and Solving Big Problems: A Conversation with Doug Cutting, Hadoop Creator

As big data continues to push and stretch the limits of conventional database and data processing technologies, Hadoop is emerging as an innovative, transformative and cost-effective way to tackle big data challenges. Hadoop, the open-source software framework that supports data-intensive distributed applications, was created by Doug Cutting in 2006 while at Yahoo. In this interview, Doug shares his insights into the genesis and future of Hadoop.

 

  • According to some estimates, the total addressable market for big data is about $100B.
  • Enterprise Strategy Group research shows that 50% of IT organizations are either doing something with Hadoop or planning to do something with Hadoop in the next 12 to 18 months.
  • One of the most searched terms on the Gartner website is Hadoop. In the last 12 months, those searches spiked by more than 600%.

 

Did you ever imagine Hadoop was going to get that big?

No, no, not at all. I was very lucky to happen across something that was becoming a big trend in computing. At the time, I thought, “There’s this wonderful technology at Google. I would love to be able to use it but I can’t because I don’t work at Google.” Then I thought, “There are probably a lot of other people who feel that same way, and open source is a great way to get technology to everyone.” I understood from the beginning that by using Apache-style open source, we could build something that would become the standard implementation of that technology. That was the goal from the outset, to build a commodity implementation of GFS and MapReduce.

 

Let’s talk about your job as Chief Architect at Cloudera. What do you do there?

I have three things that I do, but I don’t really think of any of them as being an architect.
I spend about a third of my time working on software as an engineer. I like to keep my hands dirty. Ever since I was in my 20s, I’ve thought that pretty soon I’d be too old to program, and every decade I’ve kept programming. I’m about to turn 50, and we’ll see if I keep on programming through my 50s. I don't know. So that’s about a third of my time.

I spend roughly another third doing what I call politics, which is largely working at Apache, helping to keep things running smoothly. I’m chair of the board of directors, so it’s my responsibility to put together the agenda for the meetings, to run the meetings, and to try to resolve the issues that come up. 

My third role is as part of the marketing team for Cloudera and for big data and Hadoop generally.

 

When you created Hadoop in 2006, the term “big data” hadn’t even been coined. What was the problem you were trying to solve back then?

Most of the technology that we named Hadoop in 2006 was actually stuff that we’d been building since about 2003 in a project called Nutch. The problem I was trying to solve was the very specific one of crawling the web: collecting all of these web pages, building indexes for them and maintaining them.

For the Nutch project we needed distributed computing technology, we needed to store datasets that were much bigger than we could store on one computer, and we needed to have processes that would run and be coordinated across multiple computers. We saw the papers from Google, the GFS paper and the MapReduce paper, and we thought, “That’s the right platform.” So we set about rebuilding that platform. 

The problem we were trying to solve was the very specific one of building search engines. Then people said, “Hey, this would be useful for a lot of other problems.” Initially, though, when we turned it into Hadoop, that was at the behest of Yahoo!, who weren’t interested in the search-specific part of Nutch. They were just interested in the general-purpose distributed platform, so we decided to take that part and call it Hadoop. Yahoo! did want the platform for web search, though, the same problem that we were already using it for. That was what drove the decision by Yahoo! to adopt this technology.

 

Do you think business and IT leaders understand the value and potential of Hadoop? Do they get it?

I think they’re beginning to get it, yes. There’s been a lot of good writing about this trend, and I think people recognize the trends that are leading us here. With hardware becoming more and more affordable, and keeping in mind Moore’s Law, you can afford a huge amount of computation, a huge amount of storage, yet the conventional approaches don’t really let you exploit that. On the other hand, more and more of our business is becoming online business. Businesses are generating vast quantities of data. If you want to have a picture of your business, you need to save that data; and you need to save it affordably and be able to analyze it in a wide variety of ways.

 

In reference to Geoffrey Moore’s “crossing the chasm” theory, do you think that Hadoop has crossed the chasm from being an early adopter project to an “early majority”?

I haven't seen a real chasm that we need to cross or a trough that we need to get through. There’s a lot of attention, a lot of hype, but I believe that the level of adoption is steadily increasing. People’s expectations are reasonably well matched. They understand that it’s a new technology, and they’re cautious about moving to it because it generally involves a big investment in hardware, and you have to train people. So they start with maybe a 15-node cluster, maybe not in production, to explore the technology. But the next year they double the size of their cluster, or they start another cluster. It appears to be a relatively steady rate of adoption, rather than the classic hype curve, where people assume that a technology is going to be much bigger than it is and then they’re disappointed, and only after that does it become a stable part of the established platforms. With Hadoop, I don’t see that overhang, where the adoption is overanticipated. Maybe I’m blind to it because I’m right in the middle of it, but it seems like the expectation is that it will be a big part of computing.

 

What can CIOs do to explore what’s possible with Hadoop?

IT should facilitate experiments. It should deploy a test cluster with representative datasets from around the organization. These might just be subsets of full datasets if the full datasets are too big, or they might be slightly out-of-date. The point is to permit folks to try analyses that weren't possible before, either because the datasets were in different silos or because the systems that hosted them didn't support certain analyses, e.g., machine learning algorithms. If you provide such an experimental playground to the smart folks in a business, they can try out ideas to find things that help the business.

 

Providing an experimental playground is important, but you also need Hadoop expertise, which is in very high demand these days. How does Cloudera help bridge that gap?

That’s right; we do a lot of training. Training has become a huge part of our business, because people need to be trained in order for the technology to grow. We have courses in everything from cluster administration to programming in the various components, and we’ve just added a course in data science. There are a couple of generations of people out there in the industry who learned a certain way of doing things, and it can be hard to transition. But people can and do learn the new skills they need. They’re not so different. System administrators who have Linux and UNIX skills tend to be the most successful, because it’s basically built on UNIX. So people who are familiar with that world tend to pick up the new skills pretty quickly. I believe that many companies already have people who can become data scientists, people who know their business problems, who understand the data to analyze in the company, who know a little statistics and can put those skills together and come up with solutions for the company.

 

In the context of big data, when is Hadoop an appropriate platform and when is it not?

It’s appropriate in more and more cases. Originally it was very much a batch computing engine, so it was appropriate for off-line analyses of massive datasets, analyses that businesses previously couldn't afford to do. Many of our customers had been able to keep only the last month of their data online to visualize and analyze, but with Hadoop, they’re able to take five years of data as the basis of their analysis. That allows them to see a lot more trends with better precision and to do a better job of marketing or selling or manufacturing, whatever they’re trying to improve, but as a batch off-line process. We added Apache HBase to the stack, so that companies are able to keep online key-value stores on which they can do batch analyses, and can also update them and look at them in real time. Now with Impala we can do complex queries over multiple tables interactively, and that opens up yet more applications.

 

There are lingering questions about Hadoop’s security and reliability. Why do you think Hadoop is ready for the enterprise?

We’ve been attacking those very problems. We started with authentication and authorization, and now that’s pretty much right across the platform. Now we’re deploying encryption across the platform. We’re not quite all the way there yet, but we’re getting closer to our goal of being able to encrypt data end to end, as well as at rest. This is very much driven by enterprise needs.

In terms of reliability, we’ve spent a lot of time over the past year working on high availability for HDFS, and that’s been out in customers’ hands for six months or so. Now we’re working on snapshots to provide better disaster recovery so that businesses can do replication in multiple data centers more affordably. Those are some of the ways in which enterprise demands are driving the development of software today.

 

What is the role of vendors like Cloudera in helping customers realize the value of Hadoop?

Fundamentally, Cloudera's role is filling the gap between what a bunch of open source projects deliver and what an enterprise wants. The first part of that is a well-tested, integrated software distribution. Then there's support for that distribution: answering questions, helping to resolve difficulties and promptly supplying bug fixes. Finally, there's our Cloudera Manager software, which helps to configure and monitor clusters. Combined, these let enterprises focus on using Hadoop to solve their problems, rather than wasting their time figuring out how to install, configure and debug it.

 

What would you like to share with CIOs about Hadoop’s future and why they should be investing in Hadoop now?

Businesses should be investing in Hadoop because it can help them solve the problems that they have today. In the long term, I think all the trends point to Hadoop becoming a mainstay of enterprise data computing, if not the mainstay. It’s a general-purpose platform that will be able to handle most of the workloads that businesses are now doing with other systems, as well as new kinds of workloads that weren't possible before and different kinds of analyses that weren’t practical on earlier systems. There’s definitely value in getting started with Hadoop and finding that first application, but the first application should also be something that’s useful.

 

Consider Hadoop and mobile technology. There’s the potential to create a massively distributed architecture using mobile technology: compute, storage, and network in a smartphone. Can you imagine using Hadoop in a way that’s akin to SETI@home?

Well, the place where Hadoop shines the most is where there are huge amounts of data that need to be moved around. So the SETI approach of having lots of computers all over the world connected wouldn’t work so well with Hadoop, because it’s hard to move the data to all these different places. What tends to work well is to have all the data in one data center, with very fast networks between the nodes, and do the computing there. That said, cell phones are the new commodity hardware for consumers.

 

Seven billion of them! The potential is there.

Yes, and the processors in cell phones are much more cost-effective. You need maybe 10 of them to take the place of one traditional CPU, but they still use much less power. We’re already starting to see clusters built with ARM processors, for example, and that will make things much less expensive. But I don't know that we’ll see the data-intensive computing that Hadoop is known for spread out to 7 billion cell phones. The front end will be cell phones, but the data movement is going to stay in the data center.

 

On a personal level, I’m curious to know what inspires you.  What drives you to solve big challenges? 

I like to think about technologies that will make a difference. I’ve always loved open source because it’s such a tremendous lever. What I look for is the smallest thing I can do, with the least amount of work, that will have the most impact. Where is the leverage point? Hadoop came out of that. We needed to do some vast computing, but I also saw a lot of other workloads that could benefit from this.

 

What interesting books have you read lately?

The last novel I read was Anathem, by Neal Stephenson. It’s a great science fiction book. I gave it to my 12-year-old son and he loved it too.

 

What is the book about?

At the beginning, it’s about a convent of philosophers, a bunch of people who live separate from the rest of society. They think about knowledge—all the big, fundamental questions. They’re philosophers, and they study all the different theories about the world. They don’t settle on any particular theory, but they talk about all of these different philosophies and their merits and the consequences of understanding them. It’s a fun thing to think about all these different belief systems without being dogmatic and saying, “Oh, no, you have to believe this.” And then it turns into a great swashbuckling adventure with aliens and rockets and other things.