The Emergence of Hadoop-Frastructure

It’s been a year since I keynoted at the 2011 Hadoop Summit, and on the eve of the 2012 edition I am in a reflective mood about how far Hadoop has matured inside the Enterprise. Last year there were lots of open questions about the technical and business viability of Enterprise-Class Infrastructure (Storage, Networking, Servers, Hypervisors) for Hadoop cluster deployments. Since then, we’ve seen some great research published by Cisco, NetApp and VMware that addresses these questions at a detailed level, with lots of Hadoop and HDFS profiling information.


Shared Nothing?

If data is the lifeblood of Hadoop, then the network is most certainly its circulatory system. It also holds the odd distinction of being a critical shared resource in what is considered a “Shared Nothing” parallel processing and distributed storage architecture. A few months ago Cisco decided to perform some deep profiling of network traffic in a Hadoop cluster running the popular TeraSort workload. The resulting detailed report is an excellent reference for Enterprise Infrastructure Architects planning to deploy Hadoop. Note the overwhelming amount of network utilization (bar on left) devoted to “data rebalancing and replication factor to compensate for the failed nodes” in Figure 11 below.


What to expect when you’re expecting

Hadoop was designed to expect hardware failures as a routine matter of course. As a result, it is extremely tolerant of data node and disk outages. However, as Cisco’s study referenced above proves, quality of service levels, job / query completion times and overall predictability of the cluster suffer under these conditions. Following up on Cisco’s research, we published our own results on how to better anticipate and control the outcome of expected Hadoop cluster failure events when running on the NetApp Open Solution for Hadoop (NOSH). The extra performance and capacity is an added bonus.
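
To get a feel for why a single failed data node can weigh so heavily on the network, here is a back-of-the-envelope sketch in Python. It is not taken from the Cisco or NetApp reports. The function name, the even spread of recovery work across surviving nodes and the per-node bandwidth budget are all assumptions made for illustration. The only HDFS behavior it relies on is that every block replica held by the lost node has to be copied again from a surviving replica.

```python
def rereplication_estimate(node_capacity_tb, fill_fraction,
                           surviving_nodes, per_node_gbps):
    """Rough, best-case estimate of recovery traffic after losing one data node.

    All inputs are hypothetical planning numbers, not measured values:
      node_capacity_tb -- raw HDFS capacity of the failed node, in terabytes
      fill_fraction    -- share of that capacity holding block replicas (0..1)
      surviving_nodes  -- nodes left to share the re-replication work
      per_node_gbps    -- network bandwidth each survivor can spare for recovery
    Returns (terabytes to copy, seconds to finish).
    """
    # Every replica on the failed node must be recreated somewhere else.
    lost_tb = node_capacity_tb * fill_fraction
    lost_bits = lost_tb * 1e12 * 8

    # Assumption: recovery copies are spread evenly over the survivors,
    # and each survivor is limited only by its spare network bandwidth.
    aggregate_bits_per_second = surviving_nodes * per_node_gbps * 1e9

    return lost_tb, lost_bits / aggregate_bits_per_second


if __name__ == "__main__":
    lost_tb, seconds = rereplication_estimate(
        node_capacity_tb=10, fill_fraction=0.5,
        surviving_nodes=10, per_node_gbps=0.5)
    print(f"{lost_tb:.1f} TB to re-replicate, about {seconds / 3600:.1f} hours")
```

With these made-up inputs, 5 TB has to cross the network while jobs are still competing for the same links. A real cluster will behave differently, since disks, block placement and replication throttling all come into play, so treat the result as a lower bound on recovery time rather than a prediction.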

Probing Deeper

Finally, our colleagues at VMware rounded out this Hadoop Enterprise research trifecta by leveraging their cool vProbes technology to instrument the I/O of multiple Hadoop Data Nodes (inside guest VMs) in unison. They surfaced some extremely useful HDFS insights about the breakdown of disk bandwidth across typical Map-Reduce workloads, while raising excellent questions about the importance of disk locality as network infrastructure evolves.

These network, storage and performance-oriented reports represent tangible positive signs of Hadoop’s maturity in the Enterprise.  The related Track I’m chairing at this year’s Hadoop Summit (starting tomorrow) will contain even more proof points we can dig into.  I’m really looking forward to seeing you there!


If for whatever reason you can’t make it to the show in person, you can still catch me live at SiliconAngle’s theCUBE @ 1:00pm Pacific Wednesday June 13th.


Great summary of data-driven, vendor-generated Hadoop collateral, Val! Since VMware posted their Serengeti blog after you wrote this, any early opinions on how NOSH will integrate? Or will you integrate via FAS?