As Hadoop clusters expand, managing storage cost and performance becomes increasingly important. Data tiering helps by keeping frequently accessed (hot) data on high-performance storage while moving infrequently accessed (cold) data to more affordable object storage. This blog describes an enterprise customer who maintained petabytes of mixed hot and cold HDFS data on costly on-premises disks and wanted to cut storage expenses by 30–40%, improve query performance, and maintain governance standards. It covers how the customer used NetApp XCP to migrate HDFS data to NetApp NFS for hot storage and StorageGRID S3 for cold storage, integrated with Cloudera Hadoop, and automated the process with Oozie workflows.
Why Tier Hadoop Data?
Different datasets require different storage classes. Frequently accessed ("hot") data should reside on high‑performance storage, whereas archival or infrequently accessed ("cold") datasets are better suited to cost-effective, scalable object storage. HDFS provides tiered storage policies (HOT/WARM/COLD) to optimise on‑cluster data placement, and Cloudera supports the S3A filesystem for integration with object stores. In this solution, NetApp XCP performs the high‑throughput data migrations from the Hadoop cluster to NFS and S3, and Oozie workflows automate the process.
XCP offers scalable, high-throughput migration and confirms data integrity through its verify feature. Its S3 connector enables migration from HDFS to S3 and supports both profile and endpoint configuration.
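Because the cold tier is ultimately read through Cloudera's S3A filesystem, it is worth confirming S3A connectivity early. The following is a minimal sketch that assumes the StorageGRID endpoint and bucket used later in this article; in production, these properties would normally be set in core-site.xml through Cloudera Manager rather than passed on the command line.

# Quick S3A smoke test from a Cloudera gateway node (placeholder credentials).
hadoop fs \
  -D fs.s3a.endpoint=http://10.63.150.28 \
  -D fs.s3a.access.key=<access_key> \
  -D fs.s3a.secret.key=<secret_key> \
  -D fs.s3a.path.style.access=true \
  -ls s3a://analytics-cold/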
Architecture

Flow:
- Classify HDFS files by modification time (age) and place them into either /hot or /cold directories; HDFS storage policies and the Hadoop mover keep on-cluster block placement aligned with each tier (an example mover invocation follows this list).
- Employ XCP to transfer data: move files from /hot to NetApp NFS, and files from /cold to NetApp S3.
- Perform XCP verification to ensure data integrity at both destinations.
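As a reference for the first step, on-cluster placement can be enforced with the HDFS mover once the HOT/COLD policies are set. A minimal sketch, assuming DISK and ARCHIVE storage types are configured on the DataNodes:

# Relocate blocks under /hot and /cold so they match their storage policies.
# This moves blocks between DISK and ARCHIVE volumes inside the cluster only;
# the off-cluster copies to NFS and S3 are handled later by XCP.
sudo -u hdfs hdfs mover -p /hot /cold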
Prerequisites
- The Hadoop cluster should be configured to use HDFS mode by setting fs.defaultFS to hdfs://<namenode>:8020 and ensuring that the HDFS daemons (NameNode and DataNodes) are running. If hadoop fs -df -h shows file:/// instead of hdfs://, the cluster is running in local mode; switch the configuration to HDFS mode. A quick check is shown below.
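A quick way to confirm this, assuming the Hadoop CLI is configured on the node you run it from:

# Confirm the cluster is in HDFS mode rather than local mode.
hdfs getconf -confKey fs.defaultFS    # expect hdfs://<namenode>:8020, not file:///
hadoop fs -df -h /                    # the Filesystem column should report hdfs://...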
- Create the HDFS directories if needed and set HDFS storage policies for on‑cluster hygiene; the Hadoop mover enforces on‑cluster block placement, while XCP handles the off‑cluster migration.
[root@rhel9nkarthik1 ~]# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[root@rhel9nkarthik1 ~]# sudo -u hdfs hadoop fs -mkdir /data
[root@rhel9nkarthik1 ~]# sudo -u hdfs hadoop fs -mkdir /hot
[root@rhel9nkarthik1 ~]# sudo -u hdfs hadoop fs -mkdir /cold
[root@rhel9nkarthik1 ~]# sudo -u hdfs hadoop fs -chmod 777 /data /hot /cold
[root@rhel9nkarthik1 ~]# sudo -u hdfs hdfs storagepolicies -setStoragePolicy -path /hot -policy HOT
Set storage policy HOT on /hot
[root@rhel9nkarthik1 ~]# sudo -u hdfs hdfs storagepolicies -setStoragePolicy -path /cold -policy COLD
Set storage policy COLD on /cold
[root@rhel9nkarthik1 ~]# hdfs storagepolicies -getStoragePolicy -path /hot
The storage policy of /hot: BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
[root@rhel9nkarthik1 ~]# hdfs storagepolicies -getStoragePolicy -path /cold
The storage policy of /cold: BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], replicationFallbacks=[]}
[root@rhel9nkarthik1 ~]#
- Install XCP on a gateway host or a Cloudera cluster node and run it as root for optimal performance and permissions. Set up an XCP catalog to store XCP operation metadata, which is used during sync and cutover for updated files. Download the XCP software from: https://mysupport.netapp.com/site/products/all/details/netapp-xcp/downloads-tab. A minimal install sketch follows.
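The sketch below shows one possible layout; the archive name and extraction path are examples only, so adjust them to the version you download and to your host layout.

# Extract the downloaded XCP bundle and activate the license (example paths).
mkdir -p /usr/src/xcp
tar -xf NETAPP_XCP_<version>.tgz -C /usr/src/xcp
mkdir -p /opt/NetApp/xFiles/xcp
cp <downloaded-license-file> /opt/NetApp/xFiles/xcp/license
/usr/src/xcp/linux/xcp activate    # activates the license before first use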
- NetApp NFS Target: Configure a NetApp NFS volume in ONTAP and export it (for example, netapp-vserver-data-lif-ip:/nfs_hot) to ensure accessibility from the XCP host. Validate the NetApp NFS volume using the command “showmount -e <data_lif_ip>” on the XCP host to display available NFS volumes. XCP supports HDFS to NFS migrations for both copy and verify operations (excluding sync).
- For a NetApp S3 target, create a bucket (for example, s3://analytics-cold) in ONTAP or StorageGRID, and set up an AWS profile with the required access key and secret key. The XCP S3 connector requires both the AWS profile and the endpoint, and it supports copy and verify operations (but not sync). Test with the aws s3 command to confirm that the bucket is accessible from the XCP host, as shown below.
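A short sketch of the profile setup and bucket check; the endpoint and bucket are the lab values used later in this article, and the keys are placeholders.

# Store the ONTAP/StorageGRID S3 credentials in the profile XCP will reference.
aws configure set aws_access_key_id <access_key> --profile default
aws configure set aws_secret_access_key <secret_key> --profile default

# Confirm the bucket is reachable from the XCP host.
aws s3 ls s3://analytics-cold --endpoint-url http://10.63.150.28 --profile default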
- To enable XCP to function with HDFS as a source, it is necessary to set JAVA_HOME, CLASSPATH, NHDFS_LIBHDFS_PATH, and NHDFS_LIBJVM_PATH on the XCP host. Please refer to the sample configuration details below:
[root@rhel9nkarthik1 ~]# dirname $(dirname $(readlink $(readlink $(which javac))))
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.472.b08-1.el9.x86_64
[root@rhel9nkarthik1 ~]# export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))
[root@rhel9nkarthik1 ~]# export NHDFS_LIBJVM_PATH=`find $JAVA_HOME -name "libjvm.so"`
[root@rhel9nkarthik1 ~]# echo $NHDFS_LIBJVM_PATH
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.472.b08-1.el9.x86_64/jre/lib/amd64/server/libjvm.so
[root@rhel9nkarthik1 ~]# find / -name 'libhdfs.so'
/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/lib64/libhdfs.so
[root@rhel9nkarthik1 ~]# export NHDFS_LIBHDFS_PATH='/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/lib64/libhdfs.so'
[root@rhel9nkarthik1 ~]# echo $NHDFS_LIBHDFS_PATH
/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/lib64/libhdfs.so
[root@rhel9nkarthik1 ~]# export CLASSPATH="$(hadoop classpath --glob)"
Before the actual migration, validate the copy and verify operations manually to confirm the setup is ready.
Create Sample Data (to Validate End‑to‑End)
[root@rhel9nkarthik1 ~]# sudo -u hdfs hdfs dfs -mkdir -p /data/hot_src/2025/15/11
[root@rhel9nkarthik1 ~]# sudo -u hdfs hdfs dfs -mkdir -p /data/cold_src/2025/08/01
[root@rhel9nkarthik1 ~]# for i in $(seq 1 10); do
    echo "hot $i $(date)" | sudo -u hdfs hdfs dfs -put - /data/hot_src/2025/15/11/hotfile_$i.txt
done
[root@rhel9nkarthik1 ~]# for i in $(seq 1 10); do
    echo "cold $i $(date)" | sudo -u hdfs hdfs dfs -put - /data/cold_src/2025/08/01/coldfile_$i.txt
done
[root@rhel9nkarthik1 ~]#
Verify accessibility of the NetApp NFS share by running the showmount command, then mount the XCP catalog export.
[root@rhel9nkarthik1 ~]# showmount -e 10.63.150.161
Export list for 10.63.150.161:
/nfs_hot (everyone)
/openshifticeberg_data_perf_data_pg_metastore_0_d1747 (everyone)
/openshifticeberg_default_hive_mysql_pvc_f79c9 (everyone)
/openshifticeberg_trident_test_pvc_01543 (everyone)
/rps_test_karthik (everyone)
/test_ai_data_plateform_logs_kafka_controller_1_5bde3 (everyone)
/trident_pvc_29acb31d_08ea_42b5_9a05_af361c748d8b (everyone)
/trident_pvc_50a48ff8_e65c_4e34_ab95_5c5d6e1ba9c8 (everyone)
/trident_pvc_6b1116aa_331f_4d5e_bb2d_35a2031da8b6 (everyone)
/trident_pvc_768ea30b_5719_49e9_a279_ba6e5b87841f (everyone)
/trident_pvc_e15a339e_4d8e_4c42_a722_25c005250cb0 (everyone)
/trident_pvc_f9853c93_18aa_4f38_a269_992880d29cfb (everyone)
/ (everyone)
/xcp_catalog (everyone)
[root@rhel9nkarthik1 ~]# mkdir /xcp_catalog
[root@rhel9nkarthik1 ~]# mount 10.63.150.161:/xcp_catalog /xcp_catalog
[root@rhel9nkarthik1 ~]# touch /xcp_catalog/test
[root@rhel9nkarthik1 ~]# ls -ltrah /xcp_catalog/
total 8.0K
drwxrwxrwx.  2 nobody nobody 4.0K Dec 15 17:25 .
-rw-r--r--.  1 nobody nobody    0 Dec 15 17:25 test
dr-xr-xr-x. 26 root   root   4.0K Dec 15 17:27 ..
[root@rhel9nkarthik1 ~]#
Verify that the xcp catalog is set correctly in /opt/NetApp/xFiles/xcp/xcp.ini.
[root@rhel9nkarthik1 ~]# cat /opt/NetApp/xFiles/xcp/xcp.ini
# Sample xcp config
[xcp]
catalog = 10.63.150.161:/xcp_catalog
[root@rhel9nkarthik1 ~]#
Please perform the xcp copy operation and verify the data transfer from HDFS to NFS.
XCP copy between HDFS and NetApp NFS volume:
[root@rhel9nkarthik1 ~]# /usr/src/xcp/linux/xcp copy -newid hot_test hdfs:///hot 10.63.150.161:/nfs_hot
XCP 1.9.4P1; (c) 2025 NetApp, Inc.; Licensed to Karthikeyan Nagalingam [NetApp Inc] until Tue Dec 15 18:48:05 2026

Job ID: Job_hot_test_2025-12-15_19.16.40.750569_copy
25/12/15 19:16:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1 scanned, 0 in (0/s), 0 out (0/s), 5s
25/12/15 19:16:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/15 19:16:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10 scanned, 2 copied, 576 in (114/s), 428 out (84.4/s), 10s
25/12/15 19:16:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/15 19:16:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12 scanned, 4 copied, 1.12 KiB in (113/s), 844 out (81.6/s), 16s
25/12/15 19:16:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/15 19:16:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13 scanned, 5 copied, 1.41 KiB in (56.7/s), 1.03 KiB out (40.9/s), 21s
Sending statistics...
Xcp command  : xcp copy -newid hot_test hdfs:///hot 10.63.150.161:/nfs_hot
Stats        : 13 scanned, 12 copied, 13 indexed
Speed        : 134 KiB in (6.20 KiB/s), 226 KiB out (10.4 KiB/s)
Total Time   : 21s.
Migration ID : hot_test
Job ID       : Job_hot_test_2025-12-15_19.16.40.750569_copy
Log Path     : /opt/NetApp/xFiles/xcp/xcplogs/Job_hot_test_2025-12-15_19.16.40.750569_copy.log
STATUS       : PASSED
[root@rhel9nkarthik1 ~]#
XCP verification between HDFS and NetApp NFS volume:
[root@rhel9nkarthik1 ~]# xcp verify hdfs:///hot 10.63.150.161:/nfs_hot
XCP 1.9.4P1; (c) 2025 NetApp, Inc.; Licensed to Karthikeyan Nagalingam [NetApp Inc] until Tue Dec 15 18:48:05 2026
xcp: WARNING: No index name has been specified, creating one with name: XCP_verify_2025-12-15_19.19.39.391272

Job ID: Job_2025-12-15_19.19.39.391272_verify
25/12/15 19:19:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/15 19:19:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1 scanned, 1 found, 9.95 KiB in (1.97 KiB/s), 13.6 KiB out (2.68 KiB/s), 5s
25/12/15 19:19:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/15 19:19:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10 scanned, 4 found, 10.7 KiB in (152/s), 14.0 KiB out (95.1/s), 10s
25/12/15 19:19:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/15 19:19:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12 scanned, 6 found, 11.2 KiB in (102/s), 14.3 KiB out (61.9/s), 15s
25/12/15 19:19:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/15 19:19:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13 scanned, 13 found, 7 compared, 7 same data, 263 KiB in (49.3 KiB/s), 16.6 KiB out (447/s), 20s
Sending statistics...
Xcp command : xcp verify hdfs:///hot 10.63.150.161:/nfs_hot
Stats       : 13 scanned, 13 indexed, 100% found (7 have data), 7 compared, 100% verified (data, attrs, mods)
Speed       : 265 KiB in (12.5 KiB/s), 110 KiB out (5.20 KiB/s)
Total Time  : 21s.
Job ID      : Job_2025-12-15_19.19.39.391272_verify
Log Path    : /opt/NetApp/xFiles/xcp/xcplogs/Job_2025-12-15_19.19.39.391272_verify.log
STATUS      : PASSED
[root@rhel9nkarthik1 ~]#
To migrate data from HDFS to S3 with XCP, the S3 connector must first be enabled with the help of NetApp Professional Services (PSE), as the error in the following run shows.
[root@rhel9nkarthik1 ~]# xcp copy -newid cold_test \
    -s3.profile default -s3.endpoint http://10.63.150.28 \
    hdfs:///cold s3://analytics-cold
XCP 1.9.4P1; (c) 2025 NetApp, Inc.; Licensed to Karthikeyan Nagalingam [NetApp Inc] until Tue Dec 15 18:48:05 2026

Job ID: Job_cold_test_2025-12-15_19.22.49.291514_copy
25/12/15 19:22:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ERROR: S3Connect 's3://analytics-cold': To use S3 feature please contact Netapp PSE
Sending statistics...
Error sending statistics
Xcp command  : xcp copy -newid cold_test -s3.profile default -s3.endpoint http://10.63.150.28 hdfs:///cold s3://analytics-cold
1 error
Migration ID : cold_test
Job ID       : Job_cold_test_2025-12-15_19.22.49.291514_copy
Log Path     : /opt/NetApp/xFiles/xcp/xcplogs/Job_cold_test_2025-12-15_19.22.49.291514_copy.log
STATUS       : FAILED
[root@rhel9nkarthik1 ~]#
The Sample Script: Classify by Age + XCP Migrations
What it does
- The script recursively scans the /data directory and classifies files by modification time (mtime). Data age is classified according to the requirements: files modified within the past 30 days are hot, files between 30 and 90 days old are warm, and files older than 90 days are cold. The provided sample script handles only the hot and cold tiers.
- Recent files are transferred to /hot, while older files are moved to /cold.
- XCP copy is executed, transferring data from /hot to NetApp NFS and from /cold to NetApp S3.
- XCP verify is subsequently run on both destinations.
#!/usr/bin/env bash
# tiering_xcp.sh — Classify HDFS files by age, move to /hot or /cold,
# then migrate /hot -> NetApp NFS and /cold -> S3 using NetApp XCP.
# Docs:
#   - XCP HDFS migration: (copy/verify/resume)
#   - XCP S3 connector: (profile/endpoint; no sync)
#   - XCP NFS migrate: (copy/sync/verify)
#   - XCP best practices: (run as root; catalog; performance)

set -euo pipefail

SOURCE_ROOT="/data"
HOT_DAYS=30
COLD_DAYS=90
HOT_DIR="/hot"
COLD_DIR="/cold"
WARM_DIR=""                             # optional mid-tier
DRY_RUN="false"
NFS_TARGET="10.63.150.161:/nfs_hot"     # e.g., netapp-vserver:/nfs_hot
S3_BUCKET="s3://analytics-cold"         # e.g., s3://analytics-cold
S3_PROFILE="default"                    # e.g., prodprofile
S3_ENDPOINT="http://10.63.150.28"       # e.g., https://s3.amazonaws.com
CATALOG_PATH="file:///xcp_catalog"
HOT_COPY_ID="hot_$(date +%Y%m%d%H%M%S)"
COLD_COPY_ID="cold_$(date +%Y%m%d%H%M%S)"
LOG_FILE="${LOG_FILE:-/var/log/tiering_xcp-$(date +%Y%m%d%H%M%S).log}"

# --- parse args ---
while [[ $# -gt 0 ]]; do
  case "$1" in
    --source)       SOURCE_ROOT="$2"; shift 2 ;;
    --hot-days)     HOT_DAYS="$2"; shift 2 ;;
    --cold-days)    COLD_DAYS="$2"; shift 2 ;;
    --hot-dir)      HOT_DIR="$2"; shift 2 ;;
    --cold-dir)     COLD_DIR="$2"; shift 2 ;;
    --warm-dir)     WARM_DIR="$2"; shift 2 ;;
    --dry-run)      DRY_RUN="true"; shift 1 ;;
    --nfs-target)   NFS_TARGET="$2"; shift 2 ;;
    --s3-bucket)    S3_BUCKET="$2"; shift 2 ;;
    --s3-profile)   S3_PROFILE="$2"; shift 2 ;;
    --s3-endpoint)  S3_ENDPOINT="$2"; shift 2 ;;
    --catalog)      CATALOG_PATH="$2"; shift 2 ;;
    --hot-copy-id)  HOT_COPY_ID="$2"; shift 2 ;;
    --cold-copy-id) COLD_COPY_ID="$2"; shift 2 ;;
    --log-file)     LOG_FILE="$2"; shift 2 ;;
    *) echo "Unknown option: $1"; exit 1 ;;
  esac
done

log() { printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" | tee -a "$LOG_FILE" ; }

# --- validate ---
[[ -z "$NFS_TARGET" ]]  && { echo "ERROR: --nfs-target is required"; exit 2; }
[[ -z "$S3_BUCKET" ]]   && { echo "ERROR: --s3-bucket is required"; exit 2; }
[[ -z "$S3_PROFILE" ]]  && { echo "ERROR: --s3-profile is required"; exit 2; }
[[ -z "$S3_ENDPOINT" ]] && { echo "ERROR: --s3-endpoint is required"; exit 2; }

if ! sudo -u hdfs hdfs dfs -test -d "$SOURCE_ROOT"; then
  log "ERROR: HDFS source root does not exist: $SOURCE_ROOT"
  exit 3
fi

ensure_dir() {
  local dir="$1"
  if [[ -n "$dir" ]]; then
    if ! sudo -u hdfs hdfs dfs -test -d "$dir"; then
      log "Creating HDFS dir: $dir"
      [[ "$DRY_RUN" == "true" ]] || sudo -u hdfs hdfs dfs -mkdir -p "$dir"
    fi
  fi
}

ensure_dir "$HOT_DIR"
ensure_dir "$COLD_DIR"
[[ -n "$WARM_DIR" ]] && ensure_dir "$WARM_DIR"

# --- classify by mtime + move ---
log "Classifying by mtime: SOURCE_ROOT=$SOURCE_ROOT HOT_DAYS=$HOT_DAYS COLD_DAYS=$COLD_DAYS DRY_RUN=$DRY_RUN"
now_epoch=$(date +%s)
moved_hot=0; moved_cold=0; moved_warm=0

# Feed the listing through process substitution so the counters updated inside
# the loop survive (a pipeline would run the loop body in a subshell).
while read -r perm repl owner group size date time path; do
  mod_epoch=$(date -d "$date $time" +%s 2>/dev/null || echo 0)
  [[ "$mod_epoch" == "0" ]] && continue
  age_days=$(( (now_epoch - mod_epoch) / 86400 ))

  target=""
  if (( age_days <= HOT_DAYS )); then
    target="$HOT_DIR"
  elif (( age_days >= COLD_DAYS )); then
    target="$COLD_DIR"
  else
    [[ -n "$WARM_DIR" ]] && target="$WARM_DIR" || target=""
  fi

  [[ -z "$target" ]] && continue

  rel="${path#$SOURCE_ROOT/}"
  tgt_dir="$target/$(dirname "$rel")"

  if [[ "$DRY_RUN" == "true" ]]; then
    log "[DRY] age=${age_days}d $path -> $target/$rel"
  else
    sudo -u hdfs hdfs dfs -mkdir -p "$tgt_dir"
    sudo -u hdfs hdfs dfs -mv "$path" "$target/$rel"
    log "Moved age=${age_days}d $path -> $target/$rel"
  fi

  # Post-increment ((x++)) returns the old value and would abort under set -e
  # when the counter is still 0, so use plain arithmetic assignment instead.
  if [[ "$target" == "$HOT_DIR" ]];  then moved_hot=$((moved_hot + 1));   fi
  if [[ "$target" == "$COLD_DIR" ]]; then moved_cold=$((moved_cold + 1)); fi
  if [[ "$target" == "$WARM_DIR" ]]; then moved_warm=$((moved_warm + 1)); fi
done < <(sudo -u hdfs hdfs dfs -ls -R "$SOURCE_ROOT" 2>/dev/null | awk '$1 ~ /^-/')

log "Classification summary: moved_hot=$moved_hot moved_cold=$moved_cold moved_warm=$moved_warm"

# --- XCP migrations ---
export XCP_CATALOG="$CATALOG_PATH"
HOT_CMD=( xcp copy -newid "$HOT_COPY_ID" "hdfs://$HOT_DIR" "$NFS_TARGET" )
COLD_CMD=( xcp copy -newid "$COLD_COPY_ID" -s3.profile "$S3_PROFILE" -s3.endpoint "$S3_ENDPOINT" "hdfs://$COLD_DIR" "$S3_BUCKET" )

if [[ "$DRY_RUN" == "true" ]]; then
  log "[DRY] ${HOT_CMD[*]}"
  log "[DRY] ${COLD_CMD[*]}"
else
  log "Running HOT baseline: ${HOT_CMD[*]}"
  "${HOT_CMD[@]}"
  log "HOT copy completed. ID=$HOT_COPY_ID"
  log "Running COLD baseline: ${COLD_CMD[*]}"
  "${COLD_CMD[@]}"
  log "COLD copy completed. ID=$COLD_COPY_ID"
fi

# --- Verify both targets ---
if [[ "$DRY_RUN" == "true" ]]; then
  log "[DRY] xcp verify hdfs://$HOT_DIR $NFS_TARGET"
  log "[DRY] xcp verify -s3.profile $S3_PROFILE -s3.endpoint $S3_ENDPOINT hdfs://$COLD_DIR $S3_BUCKET"
else
  xcp verify "hdfs://$HOT_DIR" "$NFS_TARGET"
  xcp verify -s3.profile "$S3_PROFILE" -s3.endpoint "$S3_ENDPOINT" "hdfs://$COLD_DIR" "$S3_BUCKET"
  log "Verify completed for HOT & COLD"
fi

# --- Optional: incremental sync for NFS ---
if [[ "$DRY_RUN" == "true" ]]; then
  log "[DRY] xcp sync -id $HOT_COPY_ID"
fi

log "Tiering with XCP completed."
Run the Script (Baseline + Verify)
- Dry run: Preview the classification decisions and the XCP commands without moving or copying any data.
- Production run: Execute the script without --dry-run in the production environment so that files are moved in HDFS and the XCP baseline copies run (example invocations follow this list).
- Validation:
- For NFS: Review the content and perform verification using `xcp verify hdfs:///hot netapp-vserver_data_lif:/nfs_hot`.
- For S3: Validate with `xcp verify -s3.profile prodprofile -s3.endpoint https://<netapp_s3_endpoint> hdfs:///cold s3://analytics-cold`.
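For reference, a typical invocation pair might look like the following sketch; the paths and targets match the sample values used throughout this article.

# 1) Dry run: log what would be classified, moved, and copied, without changing anything.
/opt/tiering/bin/tiering_xcp.sh --dry-run \
  --source /data --hot-days 30 --cold-days 90 \
  --nfs-target 10.63.150.161:/nfs_hot \
  --s3-bucket s3://analytics-cold --s3-profile default --s3-endpoint http://10.63.150.28

# 2) Production run: the same arguments without --dry-run; review the log file afterwards.
/opt/tiering/bin/tiering_xcp.sh \
  --source /data --hot-days 30 --cold-days 90 \
  --nfs-target 10.63.150.161:/nfs_hot \
  --s3-bucket s3://analytics-cold --s3-profile default --s3-endpoint http://10.63.150.28 \
  --log-file /var/log/tiering_xcp_prod.log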
Automate with Oozie (Workflow + Coordinator)
This section automates the migration with a high-level sample workflow; adjustments may be required for your environment. The Oozie service is installed as an integrated component of the Cloudera platform. Make sure the shell action runs on a host where XCP is installed and available in the system PATH; many teams run this action on a gateway node.
workflow.xml
<workflow-app name="hdfs-age-tiering-with-xcp" xmlns="uri:oozie:workflow:0.5">
  <parameters>
    <parameter name="appPath"/>
    <parameter name="nameNode"/>
    <parameter name="resourceManager"/>
    <parameter name="sourceRoot" default="/data"/>
    <parameter name="hotDays" default="30"/>
    <parameter name="coldDays" default="90"/>
    <parameter name="hotDir" default="/hot"/>
    <parameter name="coldDir" default="/cold"/>
    <parameter name="warmDir" default=""/>
    <parameter name="xcpCatalog" default="file:///xcp_catalog"/>
    <parameter name="nfsTarget"/>
    <parameter name="s3Bucket"/>
    <parameter name="s3Profile"/>
    <parameter name="s3Endpoint"/>
    <parameter name="hotCopyId" default="hot_${nominalTime}"/>
    <parameter name="coldCopyId" default="cold_${nominalTime}"/>
  </parameters>

  <start to="classify-and-migrate"/>

  <!-- Single shell action: runs the script (classification + XCP copy + verify + NFS sync) -->
  <action name="classify-and-migrate">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <resource-manager>${resourceManager}</resource-manager>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.user.classpath.first</name>
          <value>true</value>
        </property>
      </configuration>
      <exec>tiering_xcp.sh</exec>
      <argument>--source</argument><argument>${sourceRoot}</argument>
      <argument>--hot-days</argument><argument>${hotDays}</argument>
      <argument>--cold-days</argument><argument>${coldDays}</argument>
      <argument>--hot-dir</argument><argument>${hotDir}</argument>
      <argument>--cold-dir</argument><argument>${coldDir}</argument>
      <argument>--warm-dir</argument><argument>${warmDir}</argument>
      <argument>--nfs-target</argument><argument>${nfsTarget}</argument>
      <argument>--s3-bucket</argument><argument>${s3Bucket}</argument>
      <argument>--s3-profile</argument><argument>${s3Profile}</argument>
      <argument>--s3-endpoint</argument><argument>${s3Endpoint}</argument>
      <argument>--catalog</argument><argument>${xcpCatalog}</argument>
      <argument>--hot-copy-id</argument><argument>${hotCopyId}</argument>
      <argument>--cold-copy-id</argument><argument>${coldCopyId}</argument>
      <file>${appPath}/bin/tiering_xcp.sh#tiering_xcp.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Tiering workflow failed. See Oozie logs and XCP output.</message>
  </kill>
  <end name="end"/>
</workflow-app>
job.properties
nameNode=hdfs://rhel9nkarthik1.sddc.netapp.com:8020
resourceManager=rhel9nkarthik1.sddc.netapp.com:8032
appPath=${nameNode}/apps/oozie/tiering-xcp
sourceRoot=/data
hotDays=30
coldDays=90
hotDir=/hot
coldDir=/cold
warmDir=
xcpCatalog=file:///xcp_catalog
nfsTarget=netapp-vserver:/nfs_hot
s3Bucket=s3://analytics-cold
s3Profile=prodprofile
s3Endpoint=https://s3.amazonaws.com
hotCopyId=hot_${nominalTime}
coldCopyId=cold_${nominalTime}
oozie.wf.application.path=${appPath}/workflow.xml
Deploy & run
hdfs dfs -mkdir -p /apps/oozie/tiering-xcp/bin
hdfs dfs -put /opt/tiering/bin/tiering_xcp.sh /apps/oozie/tiering-xcp/bin/tiering_xcp.sh
hdfs dfs -put workflow.xml /apps/oozie/tiering-xcp/workflow.xml
hdfs dfs -put job.properties /apps/oozie/tiering-xcp/job.properties
# Note: -config reads a local properties file, so run this from the directory
# that holds job.properties (the HDFS copy is kept for reference).
oozie job -oozie http://<oozie-host>:11000/oozie \
  -config job.properties -run
Optional: Daily schedule with coordinator
<!-- coordinator.xml -->
<coordinator-app name="hdfs-age-tiering-xcp-coord"
                 frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.5">
  <action>
    <workflow>
      <app-path>${appPath}/workflow.xml</app-path>
      <configuration>
        <property><name>nameNode</name><value>${nameNode}</value></property>
        <property><name>resourceManager</name><value>${resourceManager}</value></property>
        <property><name>appPath</name><value>${appPath}</value></property>
        <property><name>sourceRoot</name><value>${sourceRoot}</value></property>
        <property><name>hotDays</name><value>${hotDays}</value></property>
        <property><name>coldDays</name><value>${coldDays}</value></property>
        <property><name>hotDir</name><value>${hotDir}</value></property>
        <property><name>coldDir</name><value>${coldDir}</value></property>
        <property><name>warmDir</name><value>${warmDir}</value></property>
        <property><name>xcpCatalog</name><value>${xcpCatalog}</value></property>
        <property><name>nfsTarget</name><value>${nfsTarget}</value></property>
        <property><name>s3Bucket</name><value>${s3Bucket}</value></property>
        <property><name>s3Profile</name><value>${s3Profile}</value></property>
        <property><name>s3Endpoint</name><value>${s3Endpoint}</value></property>
        <property><name>hotCopyId</name><value>hot_${nominalTime}</value></property>
        <property><name>coldCopyId</name><value>cold_${nominalTime}</value></property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
# coordinator.properties
nameNode=hdfs://rhel9nkarthik1.sddc.netapp.com:8020
resourceManager=rhel9nkarthik1.sddc.netapp.com:8032
appPath=${nameNode}/apps/oozie/tiering-xcp
sourceRoot=/data
hotDays=30
coldDays=90
hotDir=/hot
coldDir=/cold
warmDir=
xcpCatalog=file:///xcp_catalog
nfsTarget=netapp-vserver:/nfs_hot
s3Bucket=s3://analytics-cold
s3Profile=prodprofile
s3Endpoint=https://s3.amazonaws.com
startTime=2025-12-12T02:00Z
endTime=2026-12-12T02:00Z
oozie.coord.application.path=${appPath}/coordinator.xml
Upload & start
hdfs dfs -put coordinator.xml /apps/oozie/tiering-xcp/coordinator.xml
# As above, -config reads a local file, so run this where coordinator.properties resides.
oozie job -oozie http://<oozie-host>:11000/oozie \
  -config coordinator.properties -run
Operational Tips
- For each host, run XCP as the root user, ensuring a single instance per host and assigning a unique -newid for every execution; maintain clean catalogs to facilitate resuming operations.
- When transferring incrementally from HDFS to NFS or S3, remember that XCP does not support sync with an HDFS source; either re-run the copy or rely on object lifecycle/versioning policies as appropriate.
- To optimise performance at scale, divide the workload into smaller segments, such as monthly partitions, and run multiple jobs; scale up the XCP host if you are handling more than 10 million files or a high rate of change. Refer to the XCP best-practice guidelines for sizing, tuning, and additional details: https://docs.netapp.com/us-en/netapp-solutions-dataops/xcp/xcp-bp-introduction.html
- For downstream use, reference s3a://bucket/path in Spark or Hive, and enable the S3A committers for safe writes to object storage (a sketch follows this list).
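As an illustration of that downstream pattern, a Spark job could read and write the cold tier over S3A as sketched below. The endpoint and bucket are the lab values used earlier; the committer settings assume the S3A committer support shipped in Spark's hadoop-cloud module, and cold_report.py is a hypothetical job.

spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=http://10.63.150.28 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.committer.name=directory \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
  cold_report.py    # hypothetical job reading s3a://analytics-cold/... and writing results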
Data Tiering Workflow Conclusion
Hot datasets are stored on NetApp NFS storage, providing low-latency access for analytics workloads that require rapid data processing and retrieval. This placement ensures that frequently accessed data can be analyzed efficiently without delays.
Cold datasets, which are accessed less frequently, are moved to NetApp S3. Storing these datasets in S3 offers significant cost savings, as S3 is optimized for efficient storage of large volumes of data that do not require immediate access.
After each data transfer operation, xcp verify is run to validate the integrity of the datasets. This verification step ensures that the data has been accurately and completely moved, maintaining trust in the workflow.
The entire workflow is designed to be auditable, repeatable, and schedulable using Oozie. This means that every step of the process can be tracked and reviewed, jobs can be re-executed as needed, and scheduling functionality is available for automation and operational consistency.
By default, HDFS runs on direct-attached storage (DAS) with a replication factor of three for high availability and data protection. After migrating data to NetApp NFS and NetApp object storage, only a single copy of the data is required, because NetApp enterprise storage provides high availability and data protection at the storage layer. NetApp's erasure coding and other storage management features do consume some extra space. For example, 1 PB of logical data occupies roughly 3 PB of raw HDFS capacity, while a single copy on NetApp storage with erasure-coding overhead typically lands around 1.2–1.5 PB. Overall, customers can achieve 40–60% capacity savings when storing their data.
Contact and Next Steps
If you have any questions regarding the data tiering workflow or the operational tips provided, please do not hesitate to reach out to our team. We are available to address your concerns and provide additional clarification as needed.
Additionally, if you would like to discuss or move forward with a proof of concept (POC), please contact us so we can help you advance to the next step.