Introducing Dataset Manager – NetApp DataOps Toolkit v3.0

moglesby · 2026-03-11

The NetApp DataOps Toolkit has always been about making data management simple for developers, data scientists, and data engineers. With v3.0, we're introducing a feature that takes that mission even further: Dataset Manager.

Dataset Manager is a new, higher-level abstraction built on top of the DataOps Toolkit's existing volume management capabilities. Instead of thinking about ONTAP volumes, junction paths, and NFS mounts, you now work with datasets: simple, named directories that appear instantly on your local filesystem, backed by all the enterprise-grade power of NetApp ONTAP.

In this post, I'll walk through what Dataset Manager is, how it works, and why it matters for AI/ML and data engineering workflows.

The problem Dataset Manager solves

Anyone who has used the NetApp DataOps Toolkit knows that the core volume management APIs are powerful, but they require you to think in storage terms: volume names, junction paths, mountpoints, export policies. For data scientists and data engineers, this is cognitive overhead that gets in the way of doing actual data work.

Here's what creating and using a dataset looked like before v3.0:

from netapp_dataops.traditional import create_volume, mount_volume

# Create a volume
create_volume(volume_name="my_training_data", volume_size="500GB")

# Mount it locally
mount_volume(volume_name="my_training_data", mountpoint="/mnt/my_training_data")

# Work with the data...
# And later: manually manage unmounts, snapshots, clones, etc.

With Dataset Manager, the same thing looks like this:

from netapp_dataops.traditional.datasets import Dataset

# Create a dataset - volume creation and mounting handled automatically
dataset = Dataset(name="my_training_data", max_size="500GB")

# Start working immediately
print(f"Data lives here: {dataset.local_file_path}")

No volume configuration. No manual mount commands. The dataset appears as a directory on your filesystem, ready to use.

How it works: the root volume architecture

Dataset Manager uses a clean hierarchical design. During initial setup, a single "root" ONTAP volume is created and mounted permanently on your host. Every dataset you create becomes its own ONTAP volume, automatically junctioned as a subdirectory of that root.

Root Volume (e.g., "dataset_mgr_root")
    └── Mounted at: /mnt/datasets
        ├── training_data_v1/          ← Dataset 1 (its own ONTAP volume)
        │   ├── images/
        │   └── labels.csv
        ├── training_data_v2/          ← Dataset 2 (ONTAP volume)
        │   ├── images/
        │   └── labels.csv
        └── inference_data/            ← Dataset 3 (ONTAP volume)
            └── input_data.parquet

The NFS client sees one continuous directory tree. New datasets appear instantly, with no remounting required. Under the hood, ONTAP handles this through junction paths, so the whole process is automatic and transparent to the client.
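Because every dataset is simply a subdirectory of the root mount, ordinary filesystem calls can enumerate them. The following is a minimal sketch, assuming the root is mounted at /mnt/datasets as in the diagram above; list_dataset_dirs is a hypothetical helper written for this post, not part of the toolkit.

```python
from pathlib import Path


def list_dataset_dirs(root):
    """Return the names of the immediate subdirectories of the root mount."""
    root_path = Path(root)
    # Each dataset volume is junctioned as a directory directly under the root,
    # so a plain directory listing is enough to see them.
    return sorted(p.name for p in root_path.iterdir() if p.is_dir())


# Example (the path is an assumption taken from the diagram):
# print(list_dataset_dirs("/mnt/datasets"))
```

For anything beyond a quick look, the toolkit's own get_datasets() function, covered later in this post, is the better choice because it also reports size and clone information.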

Setup is done once through the DataOps Toolkit configuration wizard:

netapp_dataops_cli.py config

During configuration, you can create a new root volume or point Dataset Manager at an existing one. Either way, you only have to do this once.

Key capabilities

Creating and accessing datasets

Creating a new dataset is a single call:

from netapp_dataops.traditional.datasets import Dataset

dataset = Dataset(name="training_images_v1", max_size="500GB")
print(f"Dataset ready at: {dataset.local_file_path}")

Accessing an existing dataset is equally simple. Just omit max_size:

# Bind to an existing dataset
dataset = Dataset(name="training_images_v1")

print(f"Name: {dataset.name}")
print(f"Size: {dataset.max_size}")
print(f"Is Clone: {dataset.is_clone}")

Once you have a dataset, you work with it exactly like any directory on your filesystem. You can use pandas, numpy, shutil, or whatever tools you already use. No special Dataset Manager APIs required for I/O.
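To make that concrete, here is a small sketch that writes and reads a CSV file using only the Python standard library. The helper names and the labels.csv layout are illustrative assumptions; in practice you would pass dataset.local_file_path as dataset_dir.

```python
import csv
from pathlib import Path


def write_labels(dataset_dir, rows):
    """Write (filename, label) rows to labels.csv inside the dataset directory."""
    target = Path(dataset_dir) / "labels.csv"
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])
        writer.writerows(rows)
    return target


def read_labels(dataset_dir):
    """Read labels.csv back as a list of dictionaries keyed by column name."""
    with (Path(dataset_dir) / "labels.csv").open(newline="") as handle:
        return list(csv.DictReader(handle))


# With the toolkit configured (hypothetical usage):
# write_labels(dataset.local_file_path, [("img_001.png", "cat")])
```

Nothing in these helpers knows about ONTAP or NFS; the dataset path behaves like any other directory.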

Listing all datasets

from netapp_dataops.traditional.datasets import get_datasets

datasets = get_datasets()

for ds in datasets:
    print(f"  - {ds.name} ({ds.max_size})")
    if ds.is_clone:
        print(f"    - Clone of: {ds.source_dataset_name}")

Example output:

  - training_data_v1 (500GB)
  - inference_data (50GB)
  - training_data_v2 (500GB)
    - Clone of: training_data_v1
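If you keep many clones around, it can help to group them by their source. The sketch below relies only on the name, is_clone, and source_dataset_name attributes shown above; clones_by_source is a hypothetical helper, not a toolkit function.

```python
from collections import defaultdict


def clones_by_source(datasets):
    """Map each source dataset name to a sorted list of its clone names."""
    grouped = defaultdict(list)
    for ds in datasets:
        # Non-clone datasets are skipped; they have no source to group under.
        if ds.is_clone:
            grouped[ds.source_dataset_name].append(ds.name)
    return {source: sorted(names) for source, names in grouped.items()}


# Usage with the toolkit (requires a configured environment):
# from netapp_dataops.traditional.datasets import get_datasets
# print(clones_by_source(get_datasets()))
```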

Snapshots: point-in-time dataset versioning

One of the most powerful aspects of Dataset Manager is how easy it makes dataset versioning. Under the hood, it uses NetApp Snapshot technology, so saving a new "version" is space-efficient and instant. From the user's perspective, it's just one method call.

dataset = Dataset(name="training_data")

# Snapshot with automatic timestamp-based name
snapshot_name = dataset.snapshot()
print(f"Created: {snapshot_name}")
# Created: training_data_20240212_143022

# Or with a descriptive name
dataset.snapshot(name="before_preprocessing")

You can also retrieve all snapshots for a dataset:

snapshots = dataset.get_snapshots()
for snap in snapshots:
    print(f"  - {snap['name']}  ({snap['create_time']})")
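Combining the two calls gives a simple way to make a pipeline step safe to rerun. This post does not say how the toolkit reacts when a snapshot name already exists, so the sketch below checks first. snapshot_once is a hypothetical helper that uses only get_snapshots() and snapshot(name=...) as shown above.

```python
def snapshot_once(dataset, name):
    """Create the named snapshot only if it does not already exist.

    Returns True if a snapshot was created, False if it was already present.
    """
    # Collect existing snapshot names from the list of dictionaries.
    existing = {snap["name"] for snap in dataset.get_snapshots()}
    if name in existing:
        return False
    dataset.snapshot(name=name)
    return True


# Hypothetical usage:
# snapshot_once(dataset, "before_preprocessing")
```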

Clones: instant full copies for experimentation

Thanks to NetApp FlexClone technology, cloning a dataset is near-instantaneous regardless of how large it is, and the clone initially consumes almost no additional storage space, sharing unchanged data blocks with the source.

source = Dataset(name="production_dataset")
experiment = source.clone(name="experiment_v2")

# Modify the clone freely (source is untouched)
# run_experimental_pipeline is a placeholder for your own code
run_experimental_pipeline(experiment.local_file_path)

# Clean up when done
experiment.delete()
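One caveat with the pattern above: if the pipeline raises an exception, the delete() call is never reached and the clone is left behind. A context manager is one way to guarantee cleanup. This is a sketch under the assumption that clone(name=...) and delete() behave as shown above; temporary_clone is a hypothetical helper, not part of the toolkit.

```python
from contextlib import contextmanager


@contextmanager
def temporary_clone(source, name):
    """Clone a dataset, yield the clone, and always delete it afterwards."""
    clone = source.clone(name=name)
    try:
        yield clone
    finally:
        # Runs on normal exit and when the body raises an exception.
        clone.delete()


# Hypothetical usage:
# with temporary_clone(source, "experiment_v2") as experiment:
#     run_experimental_pipeline(experiment.local_file_path)
```

Only use this for clones you truly want to discard; if you need to inspect a failed experiment's data afterwards, explicit cleanup is the safer choice.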

File inspection

Dataset Manager also includes a get_files() method to enumerate all files in a dataset along with size metadata. This is useful for auditing, reporting, or understanding what's inside a pre-existing dataset before processing it.

dataset = Dataset(name="training_data")
files = dataset.get_files()

for f in files:
    print(f"  {f['filename']}  ({f['size_human']})")
    print(f"    - {f['filepath']}")

Example output:

  training_data.csv  (245.3 MB)
    - /mnt/datasets/training_data/training_data.csv
  features.npy  (1.2 GB)
    - /mnt/datasets/training_data/arrays/features.npy
  model_config.json  (2.1 KB)
    - /mnt/datasets/training_data/model_config.json
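The output above shows human-readable sizes, which are awkward to add up. Whether get_files() also exposes a raw byte count is not covered in this post, so the sketch below reads sizes from the filesystem using only the filepath key. total_size_bytes is a hypothetical helper.

```python
import os


def total_size_bytes(files):
    """Sum on-disk sizes for entries shaped like those returned by get_files()."""
    # Only the 'filepath' key is used; sizes come from the filesystem itself.
    return sum(os.path.getsize(entry["filepath"]) for entry in files)


# Hypothetical usage:
# print(total_size_bytes(dataset.get_files()))
```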

Real-world ML workflow example

Here's how Dataset Manager comes together in a typical ML experiment workflow:

from netapp_dataops.traditional.datasets import Dataset, get_datasets
# Placeholder import: replace your_project with your own module that provides these functions
from your_project import download_raw_data, run_preprocessing, apply_augmentation, train_model

# 1. Create the source dataset and load raw data
raw = Dataset(name="raw_imagenet_2024", max_size="2TB")
download_raw_data(raw.local_file_path)

# 2. Snapshot before any processing
raw.snapshot(name="original_download")

# 3. Clone for preprocessing — don't touch the raw data
preprocessed = raw.clone(name="preprocessed_imagenet_2024")
run_preprocessing(preprocessed.local_file_path)
preprocessed.snapshot(name="preprocessing_complete")

# 4. For each experiment, clone the preprocessed version
for experiment_id in ["exp_lr_001", "exp_lr_002", "exp_lr_003"]:
    exp_data = preprocessed.clone(name=f"experiment_{experiment_id}")

    # Apply experiment-specific augmentation
    apply_augmentation(exp_data.local_file_path, config=experiment_id)

    # Snapshot before training for full reproducibility
    exp_data.snapshot(name="pre_training")

    # Train
    results = train_model(exp_data.local_file_path, experiment_id)
    print(f"{experiment_id}: {results}")

    # Clean up experiment clone when done
    exp_data.delete()

The whole workflow (versioning, cloning, experimentation, cleanup) becomes a natural part of your Python code, not a separate storage administration task.

Getting started

Dataset Manager is available in the NetApp DataOps Toolkit for Traditional Environments v3.0.

System requirements

Host operating system

Linux (RHEL, CentOS, Ubuntu, Debian, etc.)
macOS

Python version and utilities

Python 3.8–3.13
venv
pip
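If you script your environment setup, a quick interpreter check can catch an unsupported Python version before installation. This sketch simply encodes the 3.8 to 3.13 range listed above; is_supported_python is a hypothetical helper.

```python
import sys

MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 13)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is within the supported range."""
    # Default to the running interpreter when no version is given.
    major_minor = tuple(version or sys.version_info)[:2]
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


# Example:
# if not is_supported_python():
#     raise SystemExit("Python 3.8 through 3.13 is required")
```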

Required host utilities

On Linux: nfs-common (Debian/Ubuntu) or nfs-utils (RHEL/CentOS)

Storage system/service

NetApp AFX, AFF, or FAS appliance
Amazon FSx for NetApp ONTAP
NetApp Cloud Volumes ONTAP
NetApp ONTAP Select

Install

python3 -m venv ~/netapp-dataops-venv
source ~/netapp-dataops-venv/bin/activate
pip install netapp-dataops-traditional

Configure (run once)

netapp_dataops_cli.py config

During configuration you'll set up the Dataset Manager root volume. The wizard walks you through it. It takes about a minute.
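After configuration, you can sanity-check that the root volume is actually mounted before creating datasets. A minimal sketch using the standard library; the /mnt/datasets path is an assumption carried over from the earlier diagram, so substitute whatever mountpoint you chose in the wizard.

```python
import os


def root_is_mounted(root_path):
    """Return True if root_path exists and is a mount point on this host."""
    return os.path.ismount(root_path)


# Example (the path is an assumption):
# if not root_is_mounted("/mnt/datasets"):
#     print("Root volume is not mounted; rerun the configuration wizard.")
```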

Full documentation, including detailed installation instructions, API reference, and a troubleshooting guide, is available in the Dataset Manager README on GitHub.

Conclusion

Dataset Manager is the simplest way to use NetApp storage for Python-based data science and data engineering workflows. It bridges the gap between enterprise storage infrastructure and the Python-native, filesystem-based tools that practitioners actually use. With near-instant clones, space-efficient snapshots, and automatic NFS management, it lets you focus on your data, not your storage.

Give it a try and let us know what you think in the comments below. The full NetApp DataOps Toolkit is open-source and available at github.com/NetApp/netapp-dataops-toolkit.