NetApp Solutions for Apache Spark Outperform JBOD

By Mike McNamara, Sr. Manager, Product Marketing, NetApp 

 

Apache Spark is an open-source cluster computing framework that was developed in response to limitations in the MapReduce cluster computing paradigm. It is a relatively new programming framework for writing Hadoop applications, and it works directly with the Hadoop Distributed File System (HDFS). Spark is production ready, supports processing of streaming data, and is faster than MapReduce, largely because it can keep intermediate data in memory instead of writing it to disk between processing steps.

 

With Spark, you can create applications in Python, Scala, or Java. A Spark application consists of one or more jobs, and each job is made up of one or more tasks. Typical use cases for Spark are streaming data, machine learning, interactive analysis, and fog computing.
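To make the application, job, and task structure concrete, here is a minimal word count sketch in Python. It is an illustration only, not the benchmark code used in the NetApp tests. The function names and the input `path` are hypothetical, and the Spark portion assumes that the `pyspark` package is installed and that a Spark runtime is available.

```python
import re
from collections import Counter


def tokenize(line):
    """Split a line into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", line.lower())


def local_word_count(lines):
    """Plain-Python reference: the result the Spark job should reproduce."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return dict(counts)


def spark_word_count(path):
    """Count words in the text file(s) at `path` with Spark.

    `path` is a hypothetical input location (for example, an hdfs:// URI).
    pyspark is imported lazily so the helpers above work without it.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    try:
        counts = (
            spark.sparkContext.textFile(path)   # one record per input line
            .flatMap(tokenize)                  # line -> individual words
            .map(lambda word: (word, 1))        # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b)    # sum the counts per word
        )
        # collect() is an action: it triggers a job, which Spark runs as tasks.
        return dict(counts.collect())
    finally:
        spark.stop()
```

In this sketch, the single `collect()` call triggers one job, and Spark runs that job as a set of tasks, one per data partition in each stage. The `local_word_count` helper can be used on a small sample to check the logic before running against a cluster.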

 

In recent performance validation tests that were based on industry-standard benchmarking tools, NetApp® Spark solutions demonstrated superior performance relative to a typical just-a-bunch-of-disks (JBOD) system. The following two charts show the strong performance of the NetApp solutions, especially the all-flash NetApp EF-Series and NetApp All Flash FAS (AFF), compared with JBOD. For more detail on the customer use cases and the performance testing that can help you choose an appropriate Spark solution for your deployment, read this report.

 

[Chart: Spark Scala Wordcount]

 

NetApp solutions for Hadoop feature enterprise storage building blocks that are independent of the compute servers. The result is an enterprise-class deployment with lower cluster downtime, higher data availability, and linear scalability. For example, if a disk fails on a NetApp E-Series system running Dynamic Disk Pools (DDP) technology, performance is only negligibly affected, and recovery is 10 times faster than with typical RAID schemes on commodity servers with internal storage.

 

With these NetApp solutions, new data nodes can be added nondisruptively, no rebalancing or migration is needed, and external data protection reduces both the storage footprint and the data replication overhead. And with the NetApp FAS NFS Connector for Hadoop, you can swap out HDFS for NFS or run NFS alongside HDFS.