Tech ONTAP Blogs

Big data analytics on StorageGRID: Dremio performs 23 times faster than Apache Hive!

Ben Houser
NetApp

To better understand the performance of data analytic platforms, NetApp benchmarked Apache Hive and Dremio with StorageGRID object storage.

 

Apache Hive is a common metadata store and query engine used with Hadoop; Dremio is a SQL lakehouse query engine that reads data directly from object storage. Each platform was benchmarked with TPC-DS against a 1TB dataset. TPC-DS is an industry-standard benchmark for data analytics platforms and consists of 99 distinct SQL queries designed to model a typical data analytics workload. For the Dremio tests, the TPC-DS data was stored in both Iceberg and Parquet formats; for the Apache Hive test, the data was stored in Parquet format.

 

The following test setup was used:

  • Big data ecosystems
    • Cluster of 5 VMs, each with 128GB RAM, 24 vCPUs, and SSD storage for the system disk
    • Hadoop 3.3.5 with Hive 3.1.3 (1 name node + 4 data nodes)
    • Dremio v23 (1 master + 4 executors)
  • Object storage (see the S3 connectivity sketch after this list)
    • NetApp® StorageGRID® 11.6 with SG1000 + 3 x SG6060
    • ILM protection: 2 copies
  • Database size: 1000GB
  • Cache disabled on both ecosystems
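
Both query engines reach StorageGRID through its standard S3 API. As a point of reference, the sketch below (Python with boto3) shows one way to verify the endpoint and the dataset bucket before running a benchmark like this; the endpoint URL, credentials, bucket, and prefix are placeholders, not the values used in this test.

```python
# Minimal connectivity check against a StorageGRID S3 endpoint (illustrative only).
# The endpoint, credentials, bucket, and prefix below are placeholder assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid.example.com:10443",  # assumed StorageGRID S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Walk the TPC-DS bucket and total up the dataset objects under one prefix.
paginator = s3.get_paginator("list_objects_v2")
object_count = 0
byte_count = 0
for page in paginator.paginate(Bucket="tpcds-1t", Prefix="parquet/"):
    for obj in page.get("Contents", []):
        object_count += 1
        byte_count += obj["Size"]

print(f"{object_count} objects, {byte_count / 1024**3:.1f} GiB")
```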

 

 

Platform                  | TPC-DS Query Duration (minutes) | Total S3 Requests
Dremio with Iceberg       | 44.39                           | 4,235,142
Dremio with Parquet       | 46.95                           | 4,416,504
Apache Hive with Parquet  | 1,084                           | 2,567,390

Dremio completed the 99 TPC-DS queries more than 23 times faster than Apache Hive, showing a tremendous improvement in query performance!

 

When analyzing the S3 requests received by StorageGRID during the Dremio and Hive benchmarks, it became clear why Dremio achieves the highest performance. The following table breaks down the S3 request counts during the benchmarks:

 

S3 Request Type             | Apache Hive with Parquet | Dremio with Parquet
GET (range-read)            | 1,117,184                | 4,414,227
LIST                        | 312,053                  | 240
HEAD (non-existent object)  | 156,027                  | 192
HEAD (existent object)      | 982,126                  | 1,845
Total requests              | 2,567,390                | 4,416,504

 

Although Dremio sent more S3 requests to StorageGRID in total, it performed significantly faster than Hive. This is due to several factors. Hive sent more than 312,000 LIST requests during the benchmark. LIST is a slow operation for any object store because it requires enumerating the contents of a bucket or prefix, and with buckets containing millions of objects, the response times of LIST requests quickly add up. Conversely, Dremio sent only 240 LIST requests. Dremio was also more efficient in its use of range-read GETs: it used parallel processing, sending concurrent range-read GET requests to download multiple ranges of the same objects simultaneously. Taking advantage of StorageGRID's high-concurrency performance, Dremio achieved 2,000-2,300 range-read GETs per second, compared with Hive's 50-100.
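
To make the parallel range-read pattern concrete, here is a minimal Python sketch that fetches one object in concurrent byte ranges with boto3. It is not Dremio's implementation; the endpoint, bucket, key, credentials, range size, and worker count are placeholder assumptions, but the shape of the workload (many simultaneous range-read GETs against the same object) is what drives the high GET/s rates described above.

```python
# Sketch of concurrent range-read GETs against an S3-compatible endpoint (illustrative only).
# Endpoint, credentials, bucket, key, range size, and worker count are placeholder assumptions.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid.example.com:10443",  # assumed StorageGRID S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

BUCKET = "tpcds-1t"
KEY = "parquet/store_sales/part-00000.parquet"  # hypothetical TPC-DS data file
RANGE_SIZE = 8 * 1024 * 1024  # 8 MiB per request, roughly the scale of a columnar row-group read

def read_range(offset: int) -> bytes:
    """Issue one range-read GET covering [offset, offset + RANGE_SIZE)."""
    resp = s3.get_object(
        Bucket=BUCKET,
        Key=KEY,
        Range=f"bytes={offset}-{offset + RANGE_SIZE - 1}",
    )
    return resp["Body"].read()

# HEAD the object once to learn its size, then fan the range reads out across threads.
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
with ThreadPoolExecutor(max_workers=32) as pool:
    chunks = list(pool.map(read_range, range(0, size, RANGE_SIZE)))

data = b"".join(chunks)
print(f"Downloaded {len(data)} bytes in {len(chunks)} range-read GETs")
```

Scaling the number of concurrent readers, across threads and across executor nodes, is how an engine converts per-request latency into aggregate throughput on an object store that handles high request concurrency well.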

 

In addition to S3 optimizations, the Dremio Sonar engine uses query-acceleration technology to achieve interactive-speed response times. Dremio supports Columnar Cloud Caching (C3), which uses NVMe SSD technology built into cloud compute instances to achieve NVMe-level I/O performance. These tools, and more, make Dremio the world’s fastest lakehouse engine, and this speed opens the traditionally deep-and-cheap data lake to meaningful BI and data analysis.

 

To learn more about this performance testing, watch StorageGRID technical marketing engineer Angela Cheng’s presentation “Boost performance for your big data with NetApp StorageGRID” on NetAppTV (NetAppTV login required).

 

To learn more about Dremio and StorageGRID, download the solution brief and watch the webinar where NetApp Active IQ Technical Director Aaron Sims shares his experience building a data lake with StorageGRID and Dremio.
