As Hadoop enters the data center, a question I hear constantly is whether it is resilient enough to run demanding, time-critical analytics workloads. As Hadoop adoption and use cases continue to grow, so do reports of its gaps and shortcomings. Depending on your use case, workload, and business requirements, you may need to choose between open-source Hadoop on commodity hardware and a solution with enterprise-grade features and functionality. It’s important to understand these differences before deciding which Hadoop to implement in your data center. This blog focuses on enterprise Hadoop, and specifically on the NetApp solution.
Let’s look at the NetApp Open Solution for Hadoop (NOSH):
The NetApp Open Solution for Hadoop consists of two NetApp storage arrays: the E2660, which provides hardware RAID storage for the Hadoop data nodes, and the FAS2040, which provides resilience and metadata protection for the Hadoop name node (a single point of failure in Hadoop). The FAS2040, via shared NFS, stores the name node’s metadata, while the E2660 provides the storage behind the HDFS data nodes. The E2660 arrays are attached to the data node servers over 6Gbps direct-connect SAS, and the FAS2040 is connected to the name node server over a 1Gbps link. The servers in the cluster are interconnected by a high-performance 10Gbps Ethernet network. See figure 1 below for a detailed view of NOSH.
Fig 1. NetApp Open Solution for Hadoop architecture
Why two storage arrays? NetApp combined the tried and tested enterprise robustness of its traditional Fabric-Attached Storage (FAS) with the massive bandwidth and performance of the E-Series storage architecture and open Hadoop (Cloudera, Hortonworks) to create a best-of-breed Hadoop architecture. The E-Series storage (E2660) serves the HDFS data nodes; its built-in hardware RAID protects against disk failures while preserving Hadoop’s native shared-nothing architecture. In a traditional HDFS deployment on commodity storage, a disk failure forces HDFS to re-replicate the lost blocks and can require a restart of the running job, with adverse impact on the business. In a recent survey by the Enterprise Strategy Group (an IT advisory and research firm), respondents “...indicated that three hours or less of data analytics platform downtime would result in significant revenue loss”. See the ESG Lab Validation Report for additional information and key findings.
The E2660 houses 60 disks per enclosure, configured as four volumes of direct-attached storage (DAS). Each data node has its own non-shared set of fourteen disks within the E2660 array, backed by “array intelligence”: dual array controllers with hardware-assisted computation of RAID parity. Because storage is decoupled from the compute layer, this external-DAS architecture provides the flexibility to scale compute and storage independently of each other.
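As a quick sanity check on the disk arithmetic above, here is a minimal sketch. The parity overhead (one parity disk per 14-disk volume) and the treatment of leftover disks as spares are my assumptions for illustration; they are not stated in the solution documentation:

```python
# Back-of-the-envelope layout for one E2660 enclosure, following the
# numbers above: 60 disks per enclosure, 14 disks per data node.
# The parity overhead and the hot-spare assumption are illustrative only.

DISKS_PER_ENCLOSURE = 60
DISKS_PER_DATA_NODE = 14

data_nodes = DISKS_PER_ENCLOSURE // DISKS_PER_DATA_NODE   # 4 data nodes per enclosure
allocated = data_nodes * DISKS_PER_DATA_NODE              # 56 disks in use
spares = DISKS_PER_ENCLOSURE - allocated                  # 4 disks left over

# Hypothetical parity overhead: one parity disk per 14-disk volume.
usable_disks_per_node = DISKS_PER_DATA_NODE - 1           # 13 data-bearing disks

print(f"{data_nodes} data nodes, {spares} spares, "
      f"{usable_disks_per_node} data disks per node")
```

The takeaway is that one enclosure maps cleanly onto four data nodes, with the hardware controllers absorbing the parity computation that would otherwise burn CPU cycles on the compute servers.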
The FAS2040 gives the name node enterprise-grade data protection and redundancy, mitigating it as a single point of failure. To better protect the metadata, the NetApp solution keeps a copy of the critical name node data on the NetApp FAS system, so if the name node server fails, the metadata can be recovered quickly and downtime is kept to a minimum.
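As a rough sketch of how name node metadata ends up on the shared NFS export: standard Hadoop configuration lets you list multiple metadata directories, and the name node writes its fsimage and edit log to each of them. The directory paths below are hypothetical, and on Hadoop 1.x the property is `dfs.name.dir` rather than `dfs.namenode.name.dir`:

```xml
<!-- hdfs-site.xml (illustrative): keep name node metadata on both a
     local disk and an NFS mount exported by the FAS system.
     Both directory paths are hypothetical. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/fas-nfs/dfs/nn</value>
</property>
```

With two directories listed, losing the name node server still leaves an up-to-date copy of the metadata on the NFS share, which a replacement name node can be pointed at.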
On the software side, NetApp has partnered with Cloudera and Hortonworks to deliver ready-to-deploy, open (not proprietary) Hadoop solutions that give our customers the options and flexibility to choose the best fit for their business.
In summary, we believe there are legitimate use cases and workloads for which commodity hardware is the right choice; but where enterprise resiliency and robustness are required, the NetApp solution is a great fit.