Tech ONTAP Blogs

Catalog Your Data in ONTAP

vahlkamp
NetApp

With over half the world’s data stored in ONTAP, it’s no wonder so many customers tell us, “I need a catalog for my data.” However, different types of catalogs serve different ends. Data engineers, scientists, and analysts need to identify schemas, tables, columns, and data types to accelerate exploratory data analysis.

 

There are two types of catalogs: a business catalog and a technical catalog. For this example we used Open Metadata, an open-source business catalog built with the Data Mesh in mind. Whichever cataloging system you use, the configuration follows the same basic steps:

  • First you must enable S3 in ONTAP and create your buckets.
  • Then create an S3 compatible connection in your catalog.
  • Begin exploring your data!

Some considerations:

  • An S3 server is configured per SVM. You cannot have multiple S3 servers per SVM, nor is that necessary.
  • There are two functions of ONTAP S3. One is for tiering data and the other is for S3 applications. We are configuring for S3 applications, which is a highly valuable ONTAP service!
  • There are two types of S3 buckets:
    • NAS buckets are multi-protocol buckets applied to a NAS (NFS and/or SMB) volume.
    • S3 buckets are S3 only, with no attachment to a NAS volume. S3 buckets automatically provision a FlexGroup volume and manage the capacity for you, which is really helpful.

The first step is to enable S3 in ONTAP:

  • Create S3 server
    • The easiest way to do this is from the command line interface (CLI):
      •  vserver object-store-server create -vserver <server_name> -object-store-server <name_can_be_different_from_server_name> -is-https-enabled true | false
      •  Alternatively, use ONTAP’s System Manager graphical user interface (GUI). I am using Cloud Volumes ONTAP (CVO) in Microsoft Azure.
        • In your CVO instance click “switch to advanced view” to enable System Manager
        • Storage > Storage VMs > select your SVM > settings

 

vahlkamp_0-1733420538710.png

 

  • Enable and configure your S3 server
  • Connect to AD/LDAP or create local S3 accounts
    • Support for external directories requires ONTAP 9.14.1. If you already have SMB or multi-protocol volumes, you are likely connecting to AD already, but you must configure LDAPS for ONTAP S3 with AD/LDAP to work.
    • The SVM must have S3 enabled, and a bucket created.
    • AD/LDAP requires DNS to be configured in ONTAP.
    • A root certification authority of the LDAP server must be installed on the SVM.
    • An LDAP client must be configured with TLS enabled on the SVM and it must be associated with the SVM.
    • Test access using the AWS CLI
      •  In your AWS config file:
        • Access Key: “NTAPFASTBIND” prepended to base64-encode(ldapuser:password)
        • Secret Key: 0123456789123456 (any random 16-digit key)
        • Location: make sure to include the region in AWS format. IT WILL FAIL WITHOUT THIS.
      •  aws s3 ls --endpoint-url https://<ip_address_of_data_lif> to list your test bucket.
    • Now configure your NAS Buckets.
      • You must be running at least ONTAP 9.12 to deploy NAS buckets.
      • If support for external directories is required, you must be at version 9.14.1.
      • It’s recommended to configure S3 NAS buckets AFTER you’ve deployed your NAS volume with NFS or SMB.
      • In System Manager > Storage > buckets > click Add Bucket.
      • Select “More Options” to expand configuration options
      • “Browse” to the “folder” which is the NFS and/or SMB volume you will map to the S3 bucket.
      • Assign permissions as needed and save.
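The access key format from the AWS CLI test step above can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the function names are hypothetical, the region value in the usage note is only a placeholder, and the profile keys (aws_access_key_id, aws_secret_access_key, region) are the standard AWS CLI setting names.

```python
import base64
import configparser
import io

# Prefix named in the access key step above.
FASTBIND_PREFIX = "NTAPFASTBIND"


def fastbind_access_key(ldap_user: str, password: str) -> str:
    """Return NTAPFASTBIND followed by base64(ldap_user:password)."""
    raw = f"{ldap_user}:{password}".encode("utf-8")
    return FASTBIND_PREFIX + base64.b64encode(raw).decode("ascii")


def render_aws_profile(access_key: str, secret_key: str, region: str) -> str:
    """Render a [default] profile section with the three values listed above."""
    if not region:
        # The post stresses that the test fails when the region is missing.
        raise ValueError("region is required")
    # interpolation=None so a '%' in a value is never treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region": region,
    }
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()
```

For example, render_aws_profile(fastbind_access_key("ldapuser", "password"), "0123456789123456", "us-east-1") returns text you could paste into your AWS config file. Treat the result as a secret, since the access key is only an encoding of the LDAP password, not an encryption of it.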

vahlkamp_2-1733420538717.png

 

Run your AWS CLI test to see your new S3 NAS bucket: aws s3 ls --endpoint-url https://<ip_address_of_data_lif>
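If you want to script this check, a small Python wrapper around the same command might look like the sketch below. The helper names and the example IP address are hypothetical, and it assumes the aws executable is installed and already configured with the access key, secret key, and region described earlier.

```python
import shlex
import subprocess
from typing import List, Optional


def build_ls_command(data_lif_ip: str, bucket: Optional[str] = None) -> List[str]:
    """Build the 'aws s3 ls' argument list for an ONTAP S3 data LIF."""
    cmd = ["aws", "s3", "ls"]
    if bucket:
        # Listing a single bucket's contents instead of all buckets.
        cmd.append(f"s3://{bucket}")
    cmd += ["--endpoint-url", f"https://{data_lif_ip}"]
    return cmd


def run_ls(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command and capture its output; does not raise on failure."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


def as_shell_string(cmd: List[str]) -> str:
    """Return the command as a single copy-and-paste shell line."""
    return shlex.join(cmd)
```

Calling as_shell_string(build_ls_command("192.0.2.10")) gives back the same one-line command shown above, with your data LIF address filled in. Passing the argument list to run_ls lets you check the return code and stderr from a script.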

 

The second step is to define the connection in your catalog. We are using Open Metadata for the catalog; however, this pattern is common to other catalogs.

  • Define connection:
    • In Open Metadata go to Settings > Services > Databases (NOT STORAGES – yes that’s counterintuitive!) > Add New Service

vahlkamp_3-1733420538719.png

 

  • Choose DataLake Service. Again, this is counterintuitive but trust me it works!

 

vahlkamp_4-1733420538726.png

 

  • In the drop down choose S3 Config. We are configuring an S3 compatible data lakehouse which is why this is defined as a database in this catalog.

 

vahlkamp_5-1733420538731.png

 

  • Configure your AWS access key, secret and region as well as your S3 endpoint which is https://<ip_address_of_S3_data_lif>. The region is necessary, or it will fail! This is the region of our CVO instance.

vahlkamp_6-1733420538737.png

 

  • "Test connection".
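For readers who prefer to keep the connection definition in code rather than clicking through the UI, the values entered above can be collected into a structure like the following. This is only a sketch: the function name is hypothetical, and the field names (awsAccessKeyId, awsSecretAccessKey, awsRegion, endPointURL under configSource.securityConfig) are written from my understanding of Open Metadata's Datalake connector configuration, so verify them against the connector documentation for your version.

```python
import json
from typing import Any, Dict


def datalake_s3_connection(
    access_key: str, secret_key: str, region: str, data_lif_ip: str
) -> Dict[str, Any]:
    """Collect the S3 settings from the steps above into one dictionary.

    Field names are assumptions based on the Datalake connector schema.
    """
    if not region:
        # The post notes that the connection fails without a region.
        raise ValueError("region is required")
    return {
        "type": "Datalake",
        "configSource": {
            "securityConfig": {
                "awsAccessKeyId": access_key,
                "awsSecretAccessKey": secret_key,
                "awsRegion": region,
                # ONTAP S3 endpoint: the S3 data LIF over HTTPS.
                "endPointURL": f"https://{data_lif_ip}",
            }
        },
    }


def to_json(config: Dict[str, Any]) -> str:
    """Serialize the connection for review or version control."""
    return json.dumps(config, indent=2, sort_keys=True)
```

Keeping the definition in a reviewed file makes it easier to repeat the same setup for additional SVMs. Avoid committing real secrets alongside it.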

Now start exploring your data, making sure to add relevant ownership and descriptions that your data customers can search on.

  • Go to Settings > Databases to see all the databases configured, or search for data assets in the search bar
  • You will see the supported files listed in the tables. Click a file to see the schema, sample data, data lineage, data quality and profile, set the domain ownership, and a host of other important information.

 

vahlkamp_7-1733420538747.png

Tables listed in bucket.

 

vahlkamp_8-1733420538759.png

Sample Data.

To get this without a catalog, a data engineer would have to connect to the database and run a series of SQL selects, jq statements for semi-structured data, or head statements for unstructured data. Having this data at one’s fingertips saves enormous amounts of exploratory time.

 

vahlkamp_9-1733420538767.png

Schema of the data.

Understanding the schema is required for all data engineering, science, and analytics.
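To give a sense of the manual work a catalog saves here, the sketch below infers column types from a sample of a CSV file. It is not how Open Metadata derives a schema; the function names and the int/float/string type labels are hypothetical simplifications for illustration.

```python
import csv
import io
from itertools import islice
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Guess a column type from string samples, ignoring empty cells."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text: str, sample_rows: int = 100) -> Dict[str, str]:
    """Infer a column-name to type mapping from the first sample_rows rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(islice(reader, sample_rows))
    columns = reader.fieldnames or []
    # Short rows yield None for missing cells; treat those as empty.
    return {c: infer_type([(r.get(c) or "") for r in rows]) for c in columns}
```

Even this toy version has to decide how to treat empty cells and how many rows to sample, which is the kind of detail a catalog handles once for every consumer.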

 

vahlkamp_10-1733420538779.png

Data Profile.

This will help a data engineer and scientist understand the volume of the data as well as how much and what kind of work will be necessary for wrangling the data.
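A basic profile of this kind can be approximated by hand, as in the following sketch, which counts rows, empty cells, and distinct values per column for a CSV file. The function name and the chosen metrics are illustrative assumptions and much simpler than what a catalog's profiler reports.

```python
import csv
import io
from typing import Any, Dict


def profile_csv(csv_text: str) -> Dict[str, Any]:
    """Return row count plus per-column null and distinct counts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    row_count = 0
    nulls = {c: 0 for c in columns}
    distinct = {c: set() for c in columns}
    for row in reader:
        row_count += 1
        for c in columns:
            value = row.get(c) or ""  # missing cells count as empty
            if value == "":
                nulls[c] += 1
            else:
                distinct[c].add(value)
    return {
        "row_count": row_count,
        "columns": {
            c: {"null_count": nulls[c], "distinct_count": len(distinct[c])}
            for c in columns
        },
    }
```

A high null count or an unexpectedly low distinct count in such a summary is often the first hint of how much wrangling a data set will need.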

 

By enabling S3 on NAS volumes in ONTAP, we have now unleashed our data in a useful format for all data consumers. ONTAP becomes a key data source in the data mesh, enabled for self-serve exploratory data analysis and offering a vast resource to train models and serve the needs of analytics consumers.
