Tech ONTAP Blogs

Trino and StorageGRID: Empowering Big Data Analytics

angelacheng
NetApp
Trino Overview: The Distributed SQL Query Engine

Trino, formerly known as PrestoSQL, is an open-source, distributed SQL query engine designed specifically for big data analytics. Its versatility allows users to run fast, interactive queries on large datasets across various data sources. Here are the key points:

Key Features

  1. Distributed Query Execution: Trino efficiently executes queries across multiple nodes, ensuring high performance and scalability.
  2. ANSI SQL Compliance: Trino supports standard SQL, making it accessible to users familiar with SQL syntax.
  3. Connector Architecture: Connectors let Trino query a wide range of data sources, including Hadoop (through the Hive connector), Cassandra, MySQL, PostgreSQL, and more.
  4. Interactive and Batch Queries: Trino handles both ad-hoc, interactive queries and long-running batch queries effectively.
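As a small illustration of the connector architecture: Trino addresses every table as catalog.schema.table, so a single SQL statement can join data from two different sources. The sketch below only builds such a statement as a string. The catalog, schema, table, and column names are hypothetical, and the helper functions are not part of any Trino API.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a quoted, fully qualified name in catalog.schema.table form."""
    return ".".join(f'"{part}"' for part in (catalog, schema, table))


def federated_join_sql() -> str:
    """Build a query that joins a table in one catalog with a table in another."""
    # Hypothetical catalogs: "hive" (data on object storage) and "postgresql".
    orders = qualified_name("hive", "sales", "orders")
    customers = qualified_name("postgresql", "public", "customers")
    return (
        "SELECT c.name, sum(o.total) AS revenue\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.name"
    )


if __name__ == "__main__":
    print(federated_join_sql())
```

The resulting statement could be submitted through any Trino client. Which catalogs exist depends entirely on how the cluster is configured.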

Pros

  • High Performance: Trino processes queries in memory and in parallel across worker nodes, which keeps query execution fast on large datasets.
  • Scalability: Easily scale out by adding more nodes to the Trino cluster.
  • Flexibility: Trino supports various data sources and formats, making it adaptable to diverse use cases.
  • Community Support: Benefit from an active open-source community and extensive documentation.

Use Cases

  1. Data Warehousing: Trino is ideal for querying and analyzing large datasets stored in data warehouses.
  2. Data Lake Analytics: Efficiently query data stored in data lakes, such as those based on Hadoop or S3.
  3. Business Intelligence: Integrate Trino with BI tools like Tableau, Power BI, and Superset for interactive data analysis.
  4. ETL Processes: Leverage Trino in Extract, Transform, Load (ETL) workflows to process and move large volumes of data.

NetApp StorageGRID: Secure, Scalable Object-Based Storage

NetApp StorageGRID complements Trino by providing a robust, software-defined, object-based storage solution. Here’s what you need to know:

  • Architecture: StorageGRID supports industry-standard object APIs, including the Amazon S3 API and OpenStack Swift API.
  • Single Namespace: Create a unified namespace across up to 16 data centers globally, ensuring seamless data access.
  • Customizable Service Levels: Define metadata-driven object lifecycle policies to tailor storage services to your needs.
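Because StorageGRID exposes the S3 API, Trino can reach it through a catalog that points at an S3-compatible endpoint. The sketch below renders a Hive catalog properties file as text. It assumes a recent Trino release that uses the native S3 file system properties (older releases use `hive.s3.*` equivalents). The endpoint, region, and metastore URI are hypothetical placeholders, path-style access is an assumption about the deployment, and credentials are deliberately left out.

```python
def render_catalog_properties(endpoint: str, region: str, metastore_uri: str) -> str:
    """Render a Hive catalog properties file for an S3-compatible endpoint."""
    props = {
        "connector.name": "hive",
        "hive.metastore.uri": metastore_uri,
        # Native S3 file system settings; names depend on the Trino version.
        "fs.native-s3.enabled": "true",
        "s3.endpoint": endpoint,
        "s3.region": region,
        # Assumption: buckets are addressed path-style rather than by hostname.
        "s3.path-style-access": "true",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # All three values are placeholders, not real systems.
    print(render_catalog_properties(
        endpoint="https://storagegrid.example.com:10443",
        region="us-east-1",
        metastore_uri="thrift://metastore.example.com:9083",
    ))
```

The output would typically be saved as a file under the Trino catalog directory, with access keys supplied separately according to your security practices.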

StorageGRID and Trino: A Powerful Combination

  • Use Cases: StorageGRID is well suited to big data analytics, and many NetApp customers run Trino on StorageGRID in production.
TPC-DS Benchmark

  • Benchmark tool: TPC-DS (https://www.tpc.org/tpcds/)
  • Trino: cluster of 5 VMs, each with 128 GB RAM and 24 vCPUs, with SSD storage for the system disk; 1 coordinator (master) and 4 workers.
  • Storage: NetApp® StorageGRID® 11.8 with 3 x SG6060 and 1 x SG1000 load balancer. This is the minimum number of nodes with spinning disks in a StorageGRID system.
  • Database size: 1000 GB
  • Results: Trino running on StorageGRID completed 99 complex SQL queries, issuing 1.48 million S3 GET requests, in approximately 36 minutes. According to the benchmark comparison, this outperforms other popular data warehouse and lakehouse platforms.
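To put the reported figures in perspective, the arithmetic below derives average rates from the three numbers stated above (99 queries, 1.48 million GET requests, roughly 36 minutes). These are simple averages over the whole run, not measured peak values. The per-query time assumes the queries ran one after another, which the results do not state.

```python
QUERIES = 99
S3_GET_REQUESTS = 1_480_000
ELAPSED_MINUTES = 36  # approximate, as reported


def average_rates(queries: int, get_requests: int, minutes: float) -> dict:
    """Derive whole-run averages from the reported benchmark totals."""
    seconds = minutes * 60
    return {
        "get_requests_per_second": get_requests / seconds,
        "get_requests_per_query": get_requests / queries,
        # Only meaningful if queries ran sequentially (an assumption).
        "seconds_per_query": seconds / queries,
    }


if __name__ == "__main__":
    for name, value in average_rates(QUERIES, S3_GET_REQUESTS, ELAPSED_MINUTES).items():
        print(f"{name}: {value:.1f}")
```

This works out to roughly 685 GET requests per second and about 22 seconds per query on average.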

For more details, see our article comparing the StorageGRID TPC-DS benchmark results with those of other data warehouse and lakehouse platforms.
