Tech ONTAP Blogs

Trino/Starburst and StorageGRID: Empowering Big Data Analytics

angelacheng
NetApp

Trino Overview: The Distributed SQL Query Engine

Trino, formerly known as PrestoSQL, is an open-source, distributed SQL query engine designed specifically for big data analytics. Its versatility allows users to run fast, interactive queries on large datasets across various data sources.  Here are the key points:

Key Features

  1. Distributed Query Execution: Trino efficiently executes queries across multiple nodes, ensuring high performance and scalability.
  2. ANSI SQL Compliance: Trino supports standard SQL, making it accessible to users familiar with SQL syntax.
  3. Connector Architecture: It seamlessly integrates with a wide range of data sources, including Hadoop, Cassandra, MySQL, PostgreSQL, and more.
  4. Interactive and Batch Queries: Trino handles both ad-hoc, interactive queries and long-running batch queries effectively.
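As a rough illustration of the connector architecture: Trino addresses every table as catalog.schema.table, so a single query can join data that lives in different systems. The sketch below only builds the SQL text. The catalog, schema, table, and column names (hive.sales.orders, postgresql.public.customers, and so on) are hypothetical and would need to match catalogs configured on your own cluster.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Trino table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query() -> str:
    """Build a query that joins a data lake table with a relational table.

    All names below are hypothetical placeholders, not from the article.
    """
    orders = qualified_name("hive", "sales", "orders")
    customers = qualified_name("postgresql", "public", "customers")
    return (
        "SELECT c.name, sum(o.total) AS revenue\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.name"
    )


if __name__ == "__main__":
    print(build_federated_query())
```

The resulting statement could be submitted through any Trino client; each catalog prefix tells the engine which connector serves that table.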

Pros

  • High Performance: Trino is optimized for lightning-fast query execution on large datasets.
  • Scalability: Easily scale out by adding more nodes to the Trino cluster.
  • Flexibility: Trino supports various data sources and formats, making it adaptable to diverse use cases.
  • Community Support: Benefit from an active open-source community and extensive documentation.

Use Cases

  1. Data Warehousing: Trino is ideal for querying and analyzing large datasets stored in data warehouses.
  2. Data Lake Analytics: Efficiently query data stored in data lakes, such as those based on Hadoop or S3.
  3. Business Intelligence: Integrate Trino with BI tools like Tableau, Power BI, and Superset for interactive data analysis.
  4. ETL Processes: Leverage Trino in Extract, Transform, Load (ETL) workflows to process and move large volumes of data.

Trino vs Starburst:

Trino is an open-source project, developed and maintained by a community of contributors. Starburst builds on Trino and provides additional features and support for enterprise use, including:

  • Enhanced Performance: Offers performance improvements like accelerated Parquet, materialized views, and smart indexing.
  • Security and Governance: Includes advanced security features such as role-based access control (RBAC) and query auditing.

NetApp StorageGRID: Secure, Scalable Object-Based Storage

NetApp StorageGRID complements Trino/Starburst by providing a robust, software-defined, object-based storage solution. Here’s what you need to know:

  • Architecture: StorageGRID supports industry-standard object APIs, including the Amazon S3 API and OpenStack Swift API.
  • Single Namespace: Create a unified namespace across up to 16 data centers globally, ensuring seamless data access.
  • Customizable Service Levels: Define metadata-driven object lifecycle policies to tailor storage services to your needs.
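Because StorageGRID exposes the S3 API, a Trino catalog can in principle point at it like any other S3-compatible endpoint. The following is a minimal sketch that renders a catalog properties file, not the configuration used in the benchmark below. The property names follow Trino's native S3 file system support as we understand it (older releases use hive.s3.* names instead), so check them against the documentation for your Trino or Starburst version. The endpoint, metastore URI, region, and credentials are placeholders, and path-style access is an assumption about how the grid endpoint is addressed.

```python
def render_catalog_properties(
    endpoint: str,
    access_key: str,
    secret_key: str,
    metastore_uri: str,
    region: str = "us-east-1",  # assumption: placeholder region for an S3-compatible store
) -> str:
    """Render the text of a Trino catalog .properties file for an S3-compatible endpoint."""
    props = {
        "connector.name": "hive",
        "hive.metastore.uri": metastore_uri,
        "fs.native-s3.enabled": "true",
        "s3.endpoint": endpoint,
        "s3.region": region,
        # Assumption: buckets are addressed as https://host/bucket rather than by hostname.
        "s3.path-style-access": "true",
        "s3.aws-access-key": access_key,
        "s3.aws-secret-key": secret_key,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(
        render_catalog_properties(
            endpoint="https://storagegrid.example.com:10443",
            access_key="ACCESS_KEY_PLACEHOLDER",
            secret_key="SECRET_KEY_PLACEHOLDER",
            metastore_uri="thrift://metastore.example.com:9083",
        )
    )
```

The output would typically be saved as a file under the cluster's catalog directory, with the file name becoming the catalog name.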

StorageGRID and Trino/Starburst: A Powerful Combination

  • Use Cases: StorageGRID is particularly suitable for big data analytics. Many NetApp customers successfully use Trino or Starburst with StorageGRID.
  • TPC-DS Benchmark
  • Benchmark tool: TPC-DS - https://www.tpc.org/tpcds/
  • Trino: Cluster of 6 VMs, each with 128 GB RAM and 24 vCPUs, with SSD storage for the system disk; 1 coordinator and 5 workers.
  • Starburst: Cluster of 6 VMs, each with 128 GB RAM and 24 vCPUs, with SSD storage for the system disk; 1 coordinator and 5 workers.
  • NetApp® StorageGRID® 11.8 with 3 x SG6060 + 1 x SG1000 load balancer. This is the minimum number of nodes with spinning disks in a StorageGRID system.
  • Database size: 1000 GB
  • Trino TPC-DS result summary

    Data format                              Parquet          Iceberg
    Total number of S3 GET requests          1.5 million      938K
    Total time to complete 99 SQL queries    31 min 51 sec    28 min 18 sec

  • Starburst TPC-DS result summary

    Data format                              Parquet          Iceberg
    Total number of S3 GET requests          1.5 million      931K
    Total time to complete 99 SQL queries    28 min 18 sec    22 min 15 sec

    In our testing, Starburst outperformed other popular data warehouse and lakehouse platforms.
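To put the two result summaries side by side, the short sketch below computes how much Iceberg reduced S3 GET requests and total query time relative to Parquet for each engine. The input figures are taken from the summaries above; the "1.5 million" GET count is a rounded value, so the derived percentages are approximate. The function and variable names are our own.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M min S sec' duration to seconds."""
    return minutes * 60 + seconds


def pct_reduction(baseline: float, value: float) -> float:
    """Percentage by which value is lower than baseline, rounded to one decimal."""
    return round(100.0 * (baseline - value) / baseline, 1)


# Figures from the Trino and Starburst TPC-DS result summaries.
RESULTS = {
    "Trino": {
        "Parquet": {"gets": 1_500_000, "secs": to_seconds(31, 51)},
        "Iceberg": {"gets": 938_000, "secs": to_seconds(28, 18)},
    },
    "Starburst": {
        "Parquet": {"gets": 1_500_000, "secs": to_seconds(28, 18)},
        "Iceberg": {"gets": 931_000, "secs": to_seconds(22, 15)},
    },
}


def summarize(results: dict) -> dict:
    """Compare Iceberg against Parquet for each engine."""
    summary = {}
    for engine, formats in results.items():
        parquet, iceberg = formats["Parquet"], formats["Iceberg"]
        summary[engine] = {
            "get_reduction_pct": pct_reduction(parquet["gets"], iceberg["gets"]),
            "time_reduction_pct": pct_reduction(parquet["secs"], iceberg["secs"]),
        }
    return summary


SUMMARY = summarize(RESULTS)

if __name__ == "__main__":
    for engine, stats in SUMMARY.items():
        print(engine, stats)
```

With these inputs, Iceberg issues a little over a third fewer GET requests than Parquet on both engines, and total run time drops by roughly 11% on Trino and 21% on Starburst.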

For more details, see our article on StorageGRID TPC-DS benchmark results with other data warehouse and lakehouse platforms.
