Tech ONTAP Blogs
The NetApp DataOps Toolkit has always been about making data management simple for developers, data scientists, and data engineers. With v3.0, we're introducing a feature that takes that mission even further: Dataset Manager.
Dataset Manager is a new, higher-level abstraction built on top of the DataOps Toolkit's existing volume management capabilities. Instead of thinking about ONTAP volumes, junction paths, and NFS mounts, you now work with datasets - simple, named directories that appear instantly on your local filesystem, backed by all the enterprise-grade power of NetApp ONTAP.
In this post, I'll walk through what Dataset Manager is, how it works, and why it's a game-changer for AI/ML and data engineering workflows.
Anyone who has used the NetApp DataOps Toolkit knows that the core volume management APIs are powerful, but they require you to think in storage terms: volume names, junction paths, mountpoints, export policies. For data scientists and data engineers, this is cognitive overhead that gets in the way of doing actual data work.
Here's what creating and using a dataset looked like before v3.0:
from netapp_dataops.traditional import create_volume, mount_volume
# Create a volume
create_volume(volume_name="my_training_data", volume_size="500GB")
# Mount it locally
mount_volume(volume_name="my_training_data", mountpoint="/mnt/my_training_data")
# Work with the data...
# And later: manually manage unmounts, snapshots, clones, etc.
With Dataset Manager, the same thing looks like this:
from netapp_dataops.traditional.datasets import Dataset
# Create a dataset - volume creation and mounting handled automatically
dataset = Dataset(name="my_training_data", max_size="500GB")
# Start working immediately
print(f"Data lives here: {dataset.local_file_path}")
No volume configuration. No manual mount commands. The dataset appears as a directory on your filesystem, ready to use.
Dataset Manager uses a clean hierarchical design. During initial setup, a single "root" ONTAP volume is created and mounted permanently on your host. Every dataset you create becomes its own ONTAP volume, automatically junctioned as a subdirectory of that root.
Root Volume (e.g., "dataset_mgr_root")
└── Mounted at: /mnt/datasets
├── training_data_v1/ ← Dataset 1 (its own ONTAP volume)
│ ├── images/
│ └── labels.csv
├── training_data_v2/ ← Dataset 2 (ONTAP volume)
│ ├── images/
│ └── labels.csv
└── inference_data/ ← Dataset 3 (ONTAP volume)
└── input_data.parquet
The NFS client sees one continuous directory tree. New datasets appear instantly; no remounting required. Under the hood, ONTAP handles everything through junction paths. It's transparent, automatic, and it just works.
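To make the layout concrete, here is a minimal sketch of how a dataset name maps to a local directory under the root mount. The helper `dataset_local_path` and the name checks are hypothetical and not part of the toolkit; the mountpoint `/mnt/datasets` is simply the one used in the tree above.

```python
from pathlib import PurePosixPath

# Assumption: the root volume is mounted at /mnt/datasets, as in the tree above.
ROOT_MOUNTPOINT = PurePosixPath("/mnt/datasets")


def dataset_local_path(name: str, root: PurePosixPath = ROOT_MOUNTPOINT) -> PurePosixPath:
    """Hypothetical helper: where a dataset called `name` would appear locally."""
    # Each dataset is a single directory directly under the root, so reject
    # names that would escape the root or nest deeper.
    if not name or "/" in name or name in (".", ".."):
        raise ValueError(f"invalid dataset name: {name!r}")
    return root / name


print(dataset_local_path("training_data_v1"))
# /mnt/datasets/training_data_v1
```

In real code you would not compute this yourself; the `local_file_path` attribute on a `Dataset` object already gives you the directory.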
Setup is done once through the DataOps Toolkit configuration wizard:
netapp_dataops_cli.py config
During configuration, you can create a new root volume or point Dataset Manager at an existing one. Either way, you only have to do this once.
Creating a new dataset is a single call:
from netapp_dataops.traditional.datasets import Dataset
dataset = Dataset(name="training_images_v1", max_size="500GB")
print(f"Dataset ready at: {dataset.local_file_path}")
Accessing an existing dataset is equally simple: just omit max_size.
# Bind to an existing dataset
dataset = Dataset(name="training_images_v1")
print(f"Name: {dataset.name}")
print(f"Size: {dataset.max_size}")
print(f"Is Clone: {dataset.is_clone}")
Once you have a dataset, you work with it exactly like any directory on your filesystem. You can use pandas, numpy, shutil, or whatever tools you already use. No special Dataset Manager APIs required for I/O.
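As an illustration, the sketch below writes and reads a small CSV file using only the Python standard library. To keep it runnable anywhere, a temporary directory stands in for `dataset.local_file_path`; that substitution, along with the file name and its contents, is an assumption made for the example.

```python
import csv
import tempfile
from pathlib import Path

# Stand-in for dataset.local_file_path so the sketch runs without a storage system.
dataset_dir = Path(tempfile.mkdtemp())

# Write a small labels file with the standard csv module.
labels_path = dataset_dir / "labels.csv"
with labels_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label"])
    writer.writerow(["cat_001.jpg", "cat"])
    writer.writerow(["dog_001.jpg", "dog"])

# Read it back like any other file.
with labels_path.open(newline="") as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} labelled images in {labels_path.name}")
# 2 labelled images in labels.csv
```

With a real dataset, replacing the temporary directory with `Path(dataset.local_file_path)` should be the only change needed.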
To list the datasets that already exist, call get_datasets():
from netapp_dataops.traditional.datasets import get_datasets
datasets = get_datasets()
for ds in datasets:
    print(f" - {ds.name} ({ds.max_size})")
    if ds.is_clone:
        print(f" - Clone of: {ds.source_dataset_name}")
Example output:
- training_data_v1 (500GB)
- inference_data (50GB)
- training_data_v2 (500GB)
- Clone of: training_data_v1
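The clone metadata in that listing can also be used to trace lineage. The sketch below groups clones by their source; `clones_by_source` is a hypothetical helper, and `SimpleNamespace` objects stand in for the results of `get_datasets()`, modelling only the attributes used in the listing above.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-ins for the objects returned by get_datasets(); only the attributes
# shown in the listing above are modelled here.
datasets = [
    SimpleNamespace(name="training_data_v1", max_size="500GB",
                    is_clone=False, source_dataset_name=None),
    SimpleNamespace(name="inference_data", max_size="50GB",
                    is_clone=False, source_dataset_name=None),
    SimpleNamespace(name="training_data_v2", max_size="500GB",
                    is_clone=True, source_dataset_name="training_data_v1"),
]


def clones_by_source(datasets):
    """Hypothetical helper: map each source dataset name to its clones' names."""
    lineage = defaultdict(list)
    for ds in datasets:
        # Only read source_dataset_name when the dataset is actually a clone.
        if ds.is_clone:
            lineage[ds.source_dataset_name].append(ds.name)
    return dict(lineage)


print(clones_by_source(datasets))
# {'training_data_v1': ['training_data_v2']}
```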
One of the most powerful aspects of Dataset Manager is how easy it makes dataset versioning. Under the hood, it uses NetApp Snapshot technology. That means that saving a new "version" is space-efficient and instant. From the user's perspective, it's just one method call.
dataset = Dataset(name="training_data")
# Snapshot with automatic timestamp-based name
snapshot_name = dataset.snapshot()
print(f"Created: {snapshot_name}")
# Created: training_data_20240212_143022
# Or with a descriptive name
dataset.snapshot(name="before_preprocessing")
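If you want your own descriptive names to follow the same pattern, the sketch below builds one. The `<dataset>_<YYYYMMDD>_<HHMMSS>` format is inferred from the example name shown above rather than taken from the toolkit's source, and `default_snapshot_name` is a hypothetical helper.

```python
from datetime import datetime
from typing import Optional


def default_snapshot_name(dataset_name: str, now: Optional[datetime] = None) -> str:
    """Hypothetical helper: build '<dataset>_<YYYYMMDD>_<HHMMSS>'."""
    # Fall back to the current local time when no timestamp is supplied.
    now = now or datetime.now()
    return f"{dataset_name}_{now:%Y%m%d_%H%M%S}"


print(default_snapshot_name("training_data", datetime(2024, 2, 12, 14, 30, 22)))
# training_data_20240212_143022
```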
You can also retrieve all snapshots for a dataset:
snapshots = dataset.get_snapshots()
for snap in snapshots:
    print(f" - {snap['name']} ({snap['create_time']})")
Thanks to NetApp FlexClone technology, cloning a dataset is near-instantaneous regardless of how large it is, and the clone initially consumes almost no additional storage space, sharing unchanged data blocks with the source.
source = Dataset(name="production_dataset")
experiment = source.clone(name="experiment_v2")
# Modify the clone freely (source is untouched)
run_experimental_pipeline(experiment.local_file_path)
# Clean up when done
experiment.delete()
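To build intuition for why a clone starts out nearly free, here is a deliberately simplified toy model of block sharing. It is not how ONTAP or FlexClone is implemented; `ToyVolume` and its methods are invented purely to illustrate the idea that a clone initially references the same blocks as its source and diverges only where it is written.

```python
class ToyVolume:
    """Toy copy-on-write volume: a mapping from block number to block contents."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})

    def clone(self):
        # Copying the mapping duplicates only the references (metadata),
        # not the block contents themselves.
        return ToyVolume(self.blocks)

    def write(self, block_no, data):
        # A write rebinds one entry to new data; other volumes keep
        # referencing the old block.
        self.blocks[block_no] = data

    def shared_blocks(self, other):
        """Count blocks where both volumes reference the very same object."""
        return sum(
            1 for no, data in self.blocks.items()
            if other.blocks.get(no) is data
        )


source = ToyVolume({i: bytes([i]) * 4096 for i in range(100)})
experiment = source.clone()
print(experiment.shared_blocks(source))       # 100

experiment.write(7, b"\xff" * 4096)
print(experiment.shared_blocks(source))       # 99
print(source.blocks[7] == bytes([7]) * 4096)  # True: the source is untouched
```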
Dataset Manager also includes a get_files() method to enumerate all files in a dataset along with size metadata. This is useful for auditing, reporting, or understanding what's inside a pre-existing dataset before processing it.
dataset = Dataset(name="training_data")
files = dataset.get_files()
for f in files:
    print(f" {f['filename']} ({f['size_human']})")
    print(f" - {f['filepath']}")
Example output:
training_data.csv (245.3 MB)
- /mnt/datasets/training_data/training_data.csv
features.npy (1.2 GB)
- /mnt/datasets/training_data/arrays/features.npy
model_config.json (2.1 KB)
- /mnt/datasets/training_data/model_config.json
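Because a dataset is just a directory, a similar listing can be approximated with the standard library alone. The sketch below is not the toolkit's implementation of get_files(); `list_files` and `human_size` are hypothetical helpers, the extra `size` key and the 1024-based units are assumptions, and a temporary directory stands in for the dataset path.

```python
import os
import tempfile


def human_size(num_bytes: int) -> str:
    """Format a byte count with 1024-based units, e.g. '2.1 KB'."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        # Stop at the first unit that fits, or at TB as the largest unit.
        if size < 1024 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1024


def list_files(root: str) -> list:
    """Walk `root` and return one dict per file, sorted by path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            size = os.path.getsize(filepath)
            entries.append({
                "filename": filename,
                "filepath": filepath,
                "size": size,
                "size_human": human_size(size),
            })
    return sorted(entries, key=lambda e: e["filepath"])


# Demo on a temporary directory standing in for dataset.local_file_path.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "arrays"))
with open(os.path.join(demo_root, "model_config.json"), "wb") as f:
    f.write(b"x" * 2150)
with open(os.path.join(demo_root, "arrays", "features.npy"), "wb") as f:
    f.write(b"x" * 4096)

for entry in list_files(demo_root):
    print(f"{entry['filename']} ({entry['size_human']})")
# features.npy (4.0 KB)
# model_config.json (2.1 KB)
```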
Here's how Dataset Manager comes together in a typical ML experiment workflow:
from netapp_dataops.traditional.datasets import Dataset, get_datasets
# <your_project> is a placeholder: import these helpers from your own code
from <your_project> import download_raw_data, run_preprocessing, apply_augmentation, train_model
# 1. Create the source dataset and load raw data
raw = Dataset(name="raw_imagenet_2024", max_size="2TB")
download_raw_data(raw.local_file_path)
# 2. Snapshot before any processing
raw.snapshot(name="original_download")
# 3. Clone for preprocessing - don't touch the raw data
preprocessed = raw.clone(name="preprocessed_imagenet_2024")
run_preprocessing(preprocessed.local_file_path)
preprocessed.snapshot(name="preprocessing_complete")
# 4. For each experiment, clone the preprocessed version
for experiment_id in ["exp_lr_001", "exp_lr_002", "exp_lr_003"]:
    exp_data = preprocessed.clone(name=f"experiment_{experiment_id}")
    # Apply experiment-specific augmentation
    apply_augmentation(exp_data.local_file_path, config=experiment_id)
    # Snapshot before training for full reproducibility
    exp_data.snapshot(name="pre_training")
    # Train
    results = train_model(exp_data.local_file_path, experiment_id)
    print(f"{experiment_id}: {results}")
    # Clean up experiment clone when done
    exp_data.delete()
The whole workflow (versioning, cloning, experimentation, cleanup) becomes a natural part of your Python code, not a separate storage administration task.
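One caveat with the loop above: if training raises an exception, the experiment clone is never deleted. A context manager is one possible way to guarantee cleanup. The sketch below relies only on the clone() and delete() calls shown in the workflow; `temporary_clone` is a hypothetical helper, and `FakeDataset` is a stand-in that lets the example run without a storage system.

```python
from contextlib import contextmanager


@contextmanager
def temporary_clone(source, name):
    """Yield a clone of `source` and delete it on exit, even after an error."""
    clone = source.clone(name=name)
    try:
        yield clone
    finally:
        clone.delete()


# Minimal stand-in so the sketch runs anywhere. It mimics only the
# clone() and delete() calls used in the workflow above.
class FakeDataset:
    def __init__(self, name):
        self.name = name
        self.deleted = False

    def clone(self, name):
        return FakeDataset(name)

    def delete(self):
        self.deleted = True


preprocessed = FakeDataset("preprocessed_imagenet_2024")
try:
    with temporary_clone(preprocessed, "experiment_exp_lr_001") as exp_data:
        raise RuntimeError("training failed")
except RuntimeError:
    pass

print(exp_data.deleted)  # True: the clone was cleaned up despite the error
```

With a real `Dataset`, the body of the `with` block would hold the augmentation, snapshot, and training steps from the loop.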
Dataset Manager is available in the NetApp DataOps Toolkit for Traditional Environments v3.0.
python3 -m venv ~/netapp-dataops-venv
source ~/netapp-dataops-venv/bin/activate
pip install netapp-dataops-traditional
netapp_dataops_cli.py config
During configuration you'll set up the Dataset Manager root volume. The wizard walks you through it. It takes about a minute.
Full documentation, including detailed installation instructions, API reference, and a troubleshooting guide, is available in the Dataset Manager README on GitHub.
Dataset Manager is the simplest way to use NetApp storage for Python-based data science and data engineering workflows. It bridges the gap between enterprise storage infrastructure and the Python-native, filesystem-based tools that practitioners actually use. With near-instant clones, space-efficient snapshots, and automatic NFS management, it lets you focus on your data, not your storage.
Give it a try and let us know what you think in the comments below. The full NetApp DataOps Toolkit is open-source and available at github.com/NetApp/netapp-dataops-toolkit.