Tech ONTAP Blogs

Modern Data Platform – Databricks with NetApp ONTAP (FSxN-AWS)

Subbuj
NetApp

Introduction

This document explains how to integrate Databricks with Amazon FSx for NetApp ONTAP (FSxN), the first-party (1P) NetApp ONTAP service on AWS. We assume you are already familiar with Databricks and NetApp. Our focus is to help data practitioners, whether Data Engineers, Data Intelligence Analysts, or Data Scientists, learn how to quickly and securely access data stored in NetApp backends without major data movement. The combination of Databricks and NetApp lets you perform daily data activities, build ETL/ELT pipelines, and develop AI/ML and NLP (RAG with LLMs) use cases on your trusted NetApp storage environment.

 

Objective

The key objectives are:

  • Enable and engage customers to leverage their data using Databricks with NetApp.
  • Allow quick and easy access to the data stored on NetApp ONTAP 1P FSxN (AWS).
  • Build data pipelines for ETL/ELT, AI/ML, and exploratory data analysis (EDA) so that your data can drive effective business decisions.
  • Avoid data movement by accessing data in situ (in place), which reduces costs, prevents data silos, and improves security because the data isn't copied over the network to another storage system.

Details

Integrating Databricks with NetApp products is very similar to connecting to AWS native S3: only the connection string varies, while the core code stays the same across ETL/ELT, AI/ML, and RAG LLM use cases.

 

Sample Connection Strings

Below are sample connection strings for the different NetApp and AWS environments:

 

ONTAP S3

s3a://<ontap-bucket-name>/


 

AWS Native S3

s3a://<aws-s3-bucket-name>/

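The two connection strings above share the same `s3a://` scheme, but for ONTAP S3 the S3A connector generally has to be told which endpoint to talk to. The sketch below shows one way this might look, using the standard Hadoop S3A per-bucket option names (`fs.s3a.bucket.<bucket>.endpoint` and related keys) with the `spark.hadoop.` prefix that Spark copies into the Hadoop configuration. The bucket name, endpoint URL, and helper function names are hypothetical placeholders, and whether path-style access is needed depends on how your ONTAP S3 server is set up.

```python
def ontap_s3a_spark_conf(bucket, endpoint, access_key, secret_key):
    """Build per-bucket S3A settings as Spark conf key/value pairs.

    "spark.hadoop." keys are copied by Spark into the Hadoop configuration;
    "fs.s3a.bucket.<bucket>." scopes an S3A option to a single bucket, so
    other buckets (for example AWS native S3) keep their default settings.
    """
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}."
    return {
        prefix + "endpoint": endpoint,
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
        # Assumption: the endpoint is addressed path-style (host/bucket/key).
        prefix + "path.style.access": "true",
    }


def bucket_uri(bucket, key=""):
    """Return the s3a:// connection string for a bucket and optional key."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


# Hypothetical values, for illustration only.
conf = ontap_s3a_spark_conf(
    bucket="ontap-raw",
    endpoint="https://ontap-s3.example.internal",
    access_key="<access-key>",
    secret_key="<secret-key>",
)

for name, value in sorted(conf.items()):
    shown = "****" if name.endswith("secret.key") else value
    print(f"{name} {shown}")

print(bucket_uri("ontap-raw", "landing/"))
```

These key/value pairs could be placed in a cluster's Spark config or passed to a SparkSession builder. In a real workspace you would likely read the keys from a secret store rather than hard-coding them.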

 

Before diving into code snippets, consider these cost and performance factors:

  1. No engineering effort is required to move data from NetApp to a cloud storage account—this saves time and reduces engineering resources.
  2. There is no need for new contracts or subscriptions for additional cloud storage or tools.
  3. You reduce data transfer costs related to I/O and network usage (TPS & bandwidth) while ensuring high-speed access.
  4. The main advantage is hassle-free, secure data access. Data remains safely on NetApp (a highly trusted, intelligent storage platform), thus preventing data leaks and avoiding the creation of data silos.

This approach is significantly better in terms of safety, security, robustness, and performance when compared to physically moving the data. With Databricks, you can transform raw data into a “gold” state, enabling you to derive business value with confidence.

 

Code Sample Snippets

Once the connection is established, the following Python code examples (using PySpark) illustrate how to:

  • Read multiple source files in diverse formats from a raw data bucket.
  • Convert these files into a single common format (such as Parquet).
  • Clean, transform, and enrich the data.
  • Support further Exploratory Data Analysis (EDA) and AI/ML use cases.

 

Reading and Converting Data to Parquet

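A minimal sketch of this step, assuming a Databricks notebook where `spark` is already defined. It picks a Spark reader from each file's extension and rewrites each source as a Parquet dataset under a target prefix. The function names, the extension map, and the `raw/` and `silver/` prefixes are assumptions made for illustration, not the article's original code.

```python
import posixpath

# File extension -> Spark data source name (an illustrative subset).
FORMAT_BY_EXTENSION = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".orc": "orc",
}


def detect_format(path):
    """Return the Spark reader format for a file, based on its extension."""
    extension = posixpath.splitext(path)[1].lower()
    if extension not in FORMAT_BY_EXTENSION:
        raise ValueError(f"unsupported file type: {path}")
    return FORMAT_BY_EXTENSION[extension]


def dataset_name(path):
    """Derive a dataset name from a path: '.../orders.csv' -> 'orders'."""
    return posixpath.splitext(posixpath.basename(path))[0]


def convert_to_parquet(spark, source_paths, target_root):
    """Read each source file and rewrite it as Parquet under target_root.

    Returns the list of Parquet locations that were written.
    """
    written = []
    for path in source_paths:
        fmt = detect_format(path)
        reader = spark.read.format(fmt)
        if fmt == "csv":
            # Treat the first row as column names and let Spark guess types.
            reader = reader.option("header", "true").option("inferSchema", "true")
        df = reader.load(path)
        target = f"{target_root.rstrip('/')}/{dataset_name(path)}"
        df.write.mode("overwrite").parquet(target)
        written.append(target)
    return written


# Example call in a notebook (paths are placeholders):
# convert_to_parquet(
#     spark,
#     ["s3a://<ontap-bucket-name>/raw/orders.csv",
#      "s3a://<ontap-bucket-name>/raw/customers.json"],
#     "s3a://<ontap-bucket-name>/silver",
# )
```

Because the session is passed in as a parameter, the same function works unchanged whether the paths point at ONTAP S3 or AWS native S3.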

 

Enriching and Joining Datasets

Once the data is in a uniform format, you can perform further operations like joining with other datasets.

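One way the cleaning and joining step might look, sketched with plain PySpark DataFrame methods. The function names, the choice of a left join, and the `customer_id` key in the usage comment are assumptions for illustration only.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in order."""
    present = set(available)
    return [name for name in required if name not in present]


def clean_and_join(facts_df, dims_df, key, how="left"):
    """Deduplicate both inputs, drop fact rows with a null key, then join.

    facts_df and dims_df are PySpark DataFrames that share the column `key`.
    """
    for label, df in (("facts", facts_df), ("dims", dims_df)):
        absent = missing_columns(df.columns, [key])
        if absent:
            raise ValueError(f"{label} dataset lacks join column(s): {absent}")

    # Remove exact duplicate rows and rows that cannot be matched.
    facts = facts_df.dropDuplicates().na.drop(subset=[key])
    # Keep one row per key on the lookup side so the join does not fan out.
    dims = dims_df.dropDuplicates([key])
    return facts.join(dims, on=key, how=how)


# Example call in a notebook (paths and key are placeholders):
# orders = spark.read.parquet("s3a://<ontap-bucket-name>/silver/orders")
# customers = spark.read.parquet("s3a://<ontap-bucket-name>/silver/customers")
# enriched = clean_and_join(orders, customers, key="customer_id")
# enriched.write.mode("overwrite").parquet("s3a://<ontap-bucket-name>/gold/orders_enriched")
```

Checking the join column up front gives a clearer error than letting the join itself fail.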

 

Performing Exploratory Data Analysis (EDA)

Once data is in the “gold” state (well-structured and enriched), you can explore and analyze it for various AI/ML use cases.

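A first pass at EDA could look like the sketch below: print the schema and row count, summarize the numeric columns, and list the most frequent values of one categorical column. The helper names and the `region` column in the usage comment are hypothetical. The type names are the strings that PySpark's `DataFrame.dtypes` reports.

```python
# Simple type names that DataFrame.dtypes reports for numeric columns.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Pick numeric column names from a list of (name, type) pairs."""
    return [
        name
        for name, dtype in dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


def quick_eda(df, category_col=None, top_n=10):
    """Print a basic profile of a PySpark DataFrame."""
    df.printSchema()
    print("rows:", df.count())

    columns = numeric_columns(df.dtypes)
    if columns:
        # count, mean, stddev, min, quartiles, and max for each numeric column
        df.select(*columns).summary().show()

    if category_col:
        # Most frequent values of the chosen categorical column.
        (df.groupBy(category_col)
           .count()
           .orderBy("count", ascending=False)
           .show(top_n))


# Example call in a notebook (path and column are placeholders):
# gold = spark.read.parquet("s3a://<ontap-bucket-name>/gold/orders_enriched")
# quick_eda(gold, category_col="region")
```

The numeric check uses exact type names rather than a prefix match, so that types such as `interval` are not mistaken for `int`.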

 

Demo

For additional typical workflows covering ETL/ELT, AI/ML, and NLP (RAG LLMs), please refer to our high-level demo:

 

Conclusion

This document has outlined an approach that lets you combine Databricks and NetApp ONTAP. You can build sophisticated, cost-effective data pipelines, conduct AI/ML and EDA without data movement, and maintain data security, all while using your current NetApp infrastructure. We encourage you to explore this approach because it minimizes risk, reduces cost, and helps you get the best performance from your data management capabilities.

 

