Tech ONTAP Blogs

Big data analytics on StorageGRID: Dremio performs 23 times faster than Apache Hive!

Ben Houser
NetApp

To better understand the performance of data analytic platforms, NetApp benchmarked Apache Hive and Dremio with StorageGRID object storage.

 

Apache Hive is a common metadata store and query engine used with Hadoop; Dremio is a SQL lakehouse query engine that reads data directly from object storage. Each platform was benchmarked with TPC-DS against a 1TB dataset. TPC-DS is an industry-standard benchmark for data analytics platforms and consists of 99 distinct SQL queries designed to model a typical data analytics workload. For the Dremio tests, the TPC-DS data was stored in both Iceberg and Parquet formats; for the Apache Hive test, the data was stored in Parquet format.

 

The following test setup was used:

  • Big data ecosystems
    • Cluster of 5 VMs, each with 128GB RAM, 24 vCPUs, and SSD storage for the system disk
    • Hadoop 3.3.5 with Hive 3.1.3 (1 name node + 4 data nodes)
    • Dremio v23 (1 master + 4 executors)
  • Object storage (see the S3 connectivity sketch after this list)
    • NetApp® StorageGRID® 11.6 with SG1000 + 3 x SG6060
    • ILM protection: 2 copies
  • Database size: 1000GB
  • Cache disabled on both ecosystems
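
Both query engines reach StorageGRID through its standard S3 API. As a point of reference, the sketch below (Python with boto3) shows one way to verify the endpoint and the dataset bucket before running a benchmark like this; the endpoint URL, credentials, bucket, and prefix are placeholders, not the values used in this test.

```python
# Minimal connectivity check against a StorageGRID S3 endpoint (illustrative only).
# The endpoint, credentials, bucket, and prefix below are placeholder assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid.example.com:10443",  # assumed StorageGRID S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Walk the TPC-DS bucket and total up the dataset objects under one prefix.
paginator = s3.get_paginator("list_objects_v2")
object_count = 0
byte_count = 0
for page in paginator.paginate(Bucket="tpcds-1t", Prefix="parquet/"):
    for obj in page.get("Contents", []):
        object_count += 1
        byte_count += obj["Size"]

print(f"{object_count} objects, {byte_count / 1024**3:.1f} GiB")
```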

 

 

Platform                  | TPC-DS Query Duration (minutes) | Total S3 Requests
Dremio with Iceberg       | 44.39                           | 4,235,142
Dremio with Parquet       | 46.95                           | 4,416,504
Apache Hive with Parquet  | 1,084                           | 2,567,390

Dremio completed the 99 TPC-DS queries more than 23 times faster than Apache Hive, showing a tremendous improvement in query performance!

 

When analyzing the S3 requests received by StorageGRID during the Dremio and Hive benchmarks, it became clear why Dremio achieves the highest performance. The following table breaks down the S3 request counts during the benchmarks:

 

S3 Request Type             | Apache Hive with Parquet | Dremio with Parquet
GET (range-read)            | 1,117,184                | 4,414,227
LIST                        | 312,053                  | 240
HEAD (non-existent object)  | 156,027                  | 192
HEAD (existent object)      | 982,126                  | 1,845
Total requests              | 2,567,390                | 4,416,504

 

Although Dremio sent more S3 requests to StorageGRID in total, it performed significantly faster than Hive. This is due to several factors. Hive sent more than 312,000 LIST requests during the benchmark. LIST is a slow operation for any object store because it requires enumerating the contents of a bucket or prefix, and with buckets containing millions of objects, the response times of LIST requests quickly add up. Conversely, Dremio sent only 240 LIST requests. Dremio was also more efficient in its use of range-read GETs: it used parallel processing, sending concurrent range-read GET requests to download multiple ranges of the same objects simultaneously. Taking advantage of StorageGRID's high-concurrency performance, Dremio achieved 2,000-2,300 range-read GETs per second, compared with Hive's 50-100.
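
To make the parallel range-read pattern concrete, here is a minimal Python sketch that fetches one object in concurrent byte ranges with boto3. It is not Dremio's implementation; the endpoint, bucket, key, credentials, range size, and worker count are placeholder assumptions, but the shape of the workload (many simultaneous range-read GETs against the same object) is what drives the high GET/s rates described above.

```python
# Sketch of concurrent range-read GETs against an S3-compatible endpoint (illustrative only).
# Endpoint, credentials, bucket, key, range size, and worker count are placeholder assumptions.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid.example.com:10443",  # assumed StorageGRID S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

BUCKET = "tpcds-1t"
KEY = "parquet/store_sales/part-00000.parquet"  # hypothetical TPC-DS data file
RANGE_SIZE = 8 * 1024 * 1024  # 8 MiB per request, roughly the scale of a columnar row-group read

def read_range(offset: int) -> bytes:
    """Issue one range-read GET covering [offset, offset + RANGE_SIZE)."""
    resp = s3.get_object(
        Bucket=BUCKET,
        Key=KEY,
        Range=f"bytes={offset}-{offset + RANGE_SIZE - 1}",
    )
    return resp["Body"].read()

# HEAD the object once to learn its size, then fan the range reads out across threads.
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
with ThreadPoolExecutor(max_workers=32) as pool:
    chunks = list(pool.map(read_range, range(0, size, RANGE_SIZE)))

data = b"".join(chunks)
print(f"Downloaded {len(data)} bytes in {len(chunks)} range-read GETs")
```

Scaling the number of concurrent readers, across threads and across executor nodes, is how an engine converts per-request latency into aggregate throughput on an object store that handles high request concurrency well.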

 

In addition to S3 optimizations, the Dremio Sonar engine uses query-acceleration technology to achieve interactive-speed response times. Dremio supports Columnar Cloud Caching (C3), which uses NVMe SSD technology built into cloud compute instances to achieve NVMe-level I/O performance. These tools, and more, make Dremio the world’s fastest lakehouse engine, and this speed opens the traditionally deep-and-cheap data lake to meaningful BI and data analysis.

 

To learn more about this performance testing, watch StorageGRID technical marketing engineer Angela Cheng’s presentation “Boost performance for your big data with NetApp StorageGRID” on NetAppTV (NetAppTV login required).

 

To learn more about Dremio and StorageGRID, download the solution brief and watch the webinar where NetApp Active IQ Technical Director Aaron Sims shares his experience building a data lake with StorageGRID and Dremio.
