Tech ONTAP Blogs

NetApp + Starburst: Designing a Data Lakehouse for AI at scale

tsathish
NetApp

In my previous blog, “AI-First Data Strategy”, I discussed how the partnership between NetApp and Starburst reimagines enterprise AI architecture to address the challenges of running hybrid analytics and AI workloads without the high cost of moving fragmented data. The AI application layer remains volatile, with models, performance standards, SLAs, and costs changing monthly. At the same time, many teams still struggle to locate datasets, merge structured and unstructured data, and operationalize an open data lakehouse. Traditional data centralization often leads to high costs, slow pipelines, and inconsistent governance. To address these challenges, Starburst and NetApp have validated a joint solution designed to stabilize both ends of the architecture. I am pleased to share that Starburst Enterprise now supports NetApp object storage, including both ONTAP-based systems and StorageGRID.

 

Why Starburst and NetApp

Starburst absorbs the volatility through an open federated query engine built on Trino and Apache Iceberg, applying consistent governance, business definitions, and access controls across every data source, LLM, and agentic workflow. That foundation powers AIDA, Starburst's AI Data Assistant. AIDA replaces static dashboards and delayed reporting cycles with a conversational interface where users, applications, and AI agents ask questions and get trusted answers grounded in governed enterprise data products, not hallucinations. Through Starburst's MCP server, enterprises can embed AIDA anywhere, bring their own agents, and power multi-agent ecosystems with enterprise-grade security and full auditability. The result: Starburst was recognized as the most innovative agent-ready data platform for enabling customers to securely access and govern enterprise data wherever it resides, without compromising compliance.

On the other end of the architecture, NetApp ONTAP and/or StorageGRID anchor data where gravity demands it stay: governed, sovereign, and economically held. NetApp brings the simplicity and flexibility of cloud to your data center and brings all the enterprise capabilities of your data center to the public cloud. NetApp ONTAP-based systems (AFX, AFF A-Series, AFF C-Series) and StorageGRID are fully compatible with Amazon S3 and can support any catalog using one of the following connectors:

  • Delta Lake connector
  • Hive connector
  • Iceberg connector

With ONTAP using the S3 protocol, you can also create NAS buckets that let applications access data on FlexCache. FlexCache is a caching technology that creates a read-optimized cache volume in the cloud, logically linked to your primary ONTAP volume on-premises. It delivers faster throughput and provides a seamless hybrid experience for AI at scale. Furthermore, NetApp offers a wide variety of MCP servers. These range from the Harvest MCP server, which provides ONTAP system logs to AI agents, to the ONTAP-MCP server, which gives MCP clients and large language models (LLMs) access to NetApp ONTAP storage systems for self-service provisioning and lifecycle management. The directory of currently available NetApp MCP servers can be found here: https://github.com/NetApp/mcp

 

Screenshot 2026-06-09 173606.png

 

Solution deployment & validation

In our lab setup we used Starburst Enterprise Platform (SEP) with an Iceberg catalog, Starburst Data Catalog (SDC) as the metadata plane, and NetApp ONTAP S3 as the object storage plane. For support, the most useful mental model is that metadata problems usually sit in SEP ↔ SDC, while read/write and data path problems usually sit in SEP ↔ ONTAP S3. The backend PostgreSQL services are a third, separate dependency, which backs SEP Insights and SDC persistence. The functional validation deployment included:

  • Starburst coordinator node, installed via tarball
  • Starburst worker node, installed via tarball
  • PostgreSQL installed manually for backend services
  • Starburst Data Catalog deployed on k3s/k8s/Helm
  • On NetApp side:
    • Step 1 – Locate cluster
      • ONTAP version 9.17.1 cluster, shown as cluster1, with S3 bucket <s3store> on storage VM <nas_svm>
      • Picture1.png
    • Step 2, 3 – Locate bucket and set permissions
      • The bucket is exposed at https://s3.demo.netapp.com/s3store, labeled with access type S3. The permissions for the bucket include GetObject, PutObject, DeleteObject, ListBucket, and bucket policy / lifecycle operations against s3store and s3store/*.
      • Picture2.png
  • On Starburst side:
    • The SEP catalog configuration – A key support artifact
      • connector.name=iceberg
      • iceberg.catalog.type=glue_v2
      • hive.metastore.glue.endpoint-url=http://192.168.0.205/api/v1/glue
      • hive.metastore.glue.catalogid=starcat_iceberg
      • fs.native-s3.enabled=true
      • s3.endpoint=https://s3.demo.netapp.com
      • s3.path-style-access=true
      • iceberg.security=system
      • iceberg.register-table-procedure.enabled=true
    • Starburst Data Catalog (SDC) configured with:
      • a file-based credentials provider
      • a local credentials file at /etc/starburst/catalog-credentials.json
      • PostgreSQL persistence at jdbc:postgresql://sdc-postgresql:5432/starburst_catalog
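
The object storage settings in the catalog configuration above can be sanity-checked before a catalog is loaded. The following is a minimal Python sketch, not a Starburst tool: it parses the properties exactly as listed and confirms the three values this lab setup relied on for ONTAP S3 (native S3 file system enabled, an explicit endpoint, and path-style access). The helper names `parse_properties`, `check_s3_settings`, and `path_style_url` are hypothetical, and the checks reflect this lab configuration rather than a complete list of product requirements.

```python
from urllib.parse import urlsplit

# Catalog properties copied from the lab configuration listed above.
CATALOG_PROPERTIES = """\
connector.name=iceberg
iceberg.catalog.type=glue_v2
hive.metastore.glue.endpoint-url=http://192.168.0.205/api/v1/glue
hive.metastore.glue.catalogid=starcat_iceberg
fs.native-s3.enabled=true
s3.endpoint=https://s3.demo.netapp.com
s3.path-style-access=true
iceberg.security=system
iceberg.register-table-procedure.enabled=true
"""


def parse_properties(text):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_s3_settings(props):
    """Return findings for the S3 settings used in this lab (hypothetical checks).

    An empty list means none of the three checks failed.
    """
    findings = []
    if props.get("fs.native-s3.enabled") != "true":
        findings.append("fs.native-s3.enabled is not true")
    parts = urlsplit(props.get("s3.endpoint", ""))
    if parts.scheme not in ("http", "https") or not parts.netloc:
        findings.append("s3.endpoint is missing or not an http(s) URL")
    if props.get("s3.path-style-access") != "true":
        findings.append("s3.path-style-access is not true")
    return findings


def path_style_url(props, bucket, key=""):
    """Build the path-style URL: <endpoint>/<bucket>/<key>."""
    base = props["s3.endpoint"].rstrip("/")
    return f"{base}/{bucket}/{key}".rstrip("/")


if __name__ == "__main__":
    props = parse_properties(CATALOG_PROPERTIES)
    print(check_s3_settings(props))
    print(path_style_url(props, "s3store"))
```

With path-style access, the bucket name is part of the URL path rather than the host name, which matches how the bucket is exposed in Step 2 (https://s3.demo.netapp.com/s3store).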

Most importantly, the SDC config defines emulated credentials, and the SEP Iceberg catalog explicitly states that the Glue credentials must match those emulated credentials. This means the design has two separate credential domains:

  1. SEP ↔ SDC (Glue emulation credentials)
  2. SEP ↔ NetApp ONTAP S3 (object store credentials)
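
The support model described in this section (metadata issues in SEP ↔ SDC, data path issues in SEP ↔ ONTAP S3, and PostgreSQL as a separate backend dependency) can be sketched as a small triage helper. This is an illustrative Python sketch only: the operation names in `FIRST_CHECK`, the `Credentials` type, and both functions are hypothetical and are not part of SEP, SDC, or ONTAP.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    """The three dependencies described in the deployment section."""
    METADATA = "SEP <-> SDC (Glue emulation credentials)"
    DATA_PATH = "SEP <-> NetApp ONTAP S3 (object store credentials)"
    BACKEND = "PostgreSQL (SEP Insights and SDC persistence)"


@dataclass(frozen=True)
class Credentials:
    """A generic access key pair; field names are assumptions for illustration."""
    access_key: str
    secret_key: str


# Hypothetical mapping from the kind of operation that failed to the layer
# to inspect first. Real incidents can span more than one layer.
FIRST_CHECK = {
    "resolve_table": Layer.METADATA,
    "list_schemas": Layer.METADATA,
    "read_data_file": Layer.DATA_PATH,
    "write_data_file": Layer.DATA_PATH,
    "load_catalog_state": Layer.BACKEND,
}


def first_layer_to_check(failed_operation: str) -> Optional[Layer]:
    """Return the layer to inspect first, or None for an unknown operation."""
    return FIRST_CHECK.get(failed_operation)


def glue_credentials_match(sep_glue: Credentials, sdc_emulated: Credentials) -> bool:
    """Domain 1 check: SEP's Glue credentials must equal SDC's emulated ones."""
    return sep_glue == sdc_emulated


if __name__ == "__main__":
    layer = first_layer_to_check("read_data_file")
    print(layer.value if layer else "unknown operation")
```

Keeping the two credential sets as separate values makes the separation explicit: a mismatch in the first domain points at SEP ↔ SDC, while a rejected object store key points at SEP ↔ ONTAP S3.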

 

Starburst + NetApp: Customer value

Traditional big data approaches often require frequent data movement between silos, leading to inefficiency and increased risk. The joint solution addresses these challenges. Its key value propositions include:

  • Cost Efficiency: Query data directly from existing storage to avoid duplicative costs and unnecessary data movement.
  • Seamless Analytics: Starburst’s SQL engine ensures high-speed, fault-tolerant analytics at any scale.
  • AI-Enablement: Real-time preparation of structured and unstructured datasets supports agentic workflows without additional infrastructure investment.
  • Sustainability: Running analytics in place reduces energy consumption and infrastructure costs.
  • Integrated Data Lifecycle Management: NetApp ONTAP S3 and StorageGRID create a unified data lake, reducing fragmentation and simplifying metadata governance.
  • Operational Simplicity: Centralized policies, automation, and proactive monitoring streamline operations across hybrid cloud environments.

Data Lakehouse for AI at scale - The combination of Starburst Enterprise Platform (SEP), Starburst Data Catalog (SDC), and NetApp’s intelligent data infrastructure provides:

  • Unified Metadata Management: SDC ensures that all metadata is stored in a single, centralized repository, enabling consistent governance and faster queries.
  • Dynamic Storage Management: NetApp ONTAP S3’s object storage is built to handle massive datasets dynamically, providing flexibility, robust access control, and real-time analytics.
  • Cross-Silo Data Access: SEP connects structured and unstructured data sources, erasing the boundaries between on-premises, cloud, and edge systems.
  • Scalability Without Bottlenecks: The system scales linearly to support petabytes of data while maintaining performance.
  • Simplified Troubleshooting: The unified infrastructure and metadata layer make it easier to diagnose and resolve issues before they impact production workloads.
  • AI/ML Integration: Flexible, high-speed access to datasets ensures that enterprises can execute AI/ML workflows without any additional infrastructure modifications.

Conclusion

In today's enterprise landscape, organizations face risk from two directions at once. Below, data gravity makes the existing estate, fragmented across on-premises systems, private clouds, and public clouds, expensive and risky to consolidate. Above, the AI and application layer is volatile: models, agents, costs, and performance benchmarks shift faster than procurement cycles can absorb. With NetApp and Starburst, enterprises use the storage infrastructure they already own to analyze both structured and unstructured data — without consolidating, migrating, or betting on a single AI stack. The Starburst validation with NetApp ONTAP S3 and StorageGRID confirms an architecture that stabilizes both ends, anchoring data where it lives while serving it into whatever the AI layer becomes next.
