Tech ONTAP Blogs

jharrod

Introduction

Kubernetes is built for resilience, but node failures can still disrupt stateful applications, especially when persistent storage is involved. Traditionally, moving workloads off a failed node required manual intervention and careful coordination to avoid data corruption.

NetApp Trident’s new Automated-Failover feature removes that manual step. When a node fails, Trident automatically and safely moves your workloads to a healthy node by:

• Automatically removing pods and volume attachments from the failed node
• Updating ONTAP volume export policies to block the failed node’s access
• Allowing the volume (and pod) to be safely mounted on a healthy node
• Gracefully reintroducing failed nodes after they recover

This process minimizes downtime and eliminates the risk of data corruption or multi-attach issues, delivering high availability for your stateful Kubernetes workloads.

What is Automated-Failover?

Automated-Failover is an enhancement to Trident’s Force-Detach capability, designed to orchestrate rapid, safe failover of workloads from failed nodes to healthy ones.

Key Benefits:

• Rapid failover: Workloads are quickly rescheduled to healthy nodes.
• Data safety: ONTAP export policies are updated to block failed nodes, preventing split-brain and data corruption.
• Automation: No manual intervention is required; failover is triggered and managed automatically.
• Granular control: Customize which pods are removed using annotations.

How Automated-Failover Works

Workflow Overview

Automated-Failover integrates with the open source Node-Healthcheck-Operator (NHC). When a node fails, NHC detects the failure and creates a TridentNodeRemediation (TNR) custom resource in Trident’s namespace. Trident then orchestrates the following sequence:

1. Mark the failed node as “dirty” (preventing new volume publications).
2. Remove eligible pods and their volume attachments from the failed node.
3. Update ONTAP export policies to block the failed node’s access to the affected volumes.
4. Kubernetes reschedules the pods to a healthy node, where Trident safely attaches the volumes.
5. Gracefully reintroduce the failed node when it comes back online. Trident verifies that all expected volumes are unmounted and detached from the node before allowing new attachments.

This ensures that failed nodes cannot write to volumes, eliminating the risk of data corruption and enabling safe, fast failover.

Selective Pod Removal and Volume Handling

Automated-Failover is selective: it only removes pods that can be safely failed over.

A pod is removed if:

• All of its volumes/PVCs are supported by force-detach (see the Force-Detach Documentation). Supported volumes are:
    • NAS and NAS-economy volumes using auto-export policies (excluding SMB)
    • SAN and SAN-economy volumes

A pod is NOT removed if:

• It has any volume not supported by force-detach
• It is stateless (no PVCs)

Volume Access Control

A critical step in the failover process is updating ONTAP export policies (NAS), igroups (SAN), and subsystems (NVMe). When Trident removes a pod and its volume attachment from a failed node, it also modifies ONTAP’s access control to ensure the failed node has no processes that can write to the volume. This ensures:

• The failed node cannot write to the volume, even if it comes back online unexpectedly.
• The volume can be safely attached and written to by a new pod on a healthy node.
• Risks of split-brain and data corruption are eliminated.

Customizing Failover Behavior

You can control which pods are removed during failover using the trident.netapp.io/podRemediationPolicy annotation:

• retain: Pod will not be removed from the failed node.
• delete: Pod will be removed from the failed node.
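The selection rules and the annotation values above can be read as a small decision function. The following is only an illustrative sketch, not Trident's implementation: the Pod class, its field names, and the should_remove helper are hypothetical, and it assumes that an explicit annotation takes precedence over the default rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ANNOTATION = "trident.netapp.io/podRemediationPolicy"


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod on the failed node."""
    name: str
    # One entry per PVC-backed volume: True if force-detach supports it.
    volumes_force_detach_ok: List[bool] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


def should_remove(pod: Pod) -> bool:
    """Decide whether the pod would be removed from the failed node."""
    policy = pod.annotations.get(ANNOTATION)
    # Assumption: an explicit annotation overrides the default behavior.
    if policy == "retain":
        return False
    if policy == "delete":
        return True
    # Default: stateless pods (no PVCs) are left alone.
    if not pod.volumes_force_detach_ok:
        return False
    # Default: stateful pods are removed only if every volume is supported.
    return all(pod.volumes_force_detach_ok)
```

For instance, a pod with two supported volumes and no annotation would be removed, while the same pod annotated with retain would stay on the failed node.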
Example (a Deployment whose pod template is annotated with retain):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      template:
        metadata:
          annotations:
            trident.netapp.io/podRemediationPolicy: "retain"

Note: While you can use these annotations on any pod, only volumes that support force-detach will have ONTAP export policies updated to block the failed node’s access. For pods with unsupported volumes, Trident cannot guarantee safe failover or prevent potential data corruption; see the Force-Detach Documentation for details.

This annotation is especially useful for stateless pods (those with no persistent storage), allowing you to control their removal during failover events.

Installation and Setup

1. Enable Force-Detach in Trident

• tridentctl installs: Use the --enable-force-detach flag during installation.
• Helm/Operator installs: Enable enableForceDetach in the TridentOrchestrator spec.

2. Install Node-Healthcheck-Operator (NHC)

Prerequisites:

• Install operator-sdk
• Install OLM:

    operator-sdk olm install

Install NHC:

    kubectl create -f https://operatorhub.io/install/node-healthcheck-operator.yaml

3. Configure NodeHealthCheck CR

Here’s a recommended template:

    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: <CR name, e.g. nhc-worker>
    spec:
      selector:
        matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: DoesNotExist
          - key: node-role.kubernetes.io/master
            operator: DoesNotExist
      remediationTemplate:
        apiVersion: trident.netapp.io/v1
        kind: TridentNodeRemediationTemplate
        namespace: <Trident installation namespace>
        name: trident-node-remediation-template
      minHealthy: 0 # Trigger force-detach upon one or more node failures
      unhealthyConditions:
        - type: Ready
          status: "False"
          duration: 0s
        - type: Ready
          status: Unknown
          duration: 0s

Note: This configuration triggers failover immediately when a worker node is marked Ready: False or Unknown.
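To make the effect of unhealthyConditions and minHealthy more concrete, here is a rough sketch of how such rules could be evaluated. It reflects one reading of the template above, not the operator's actual code: the class and function names are hypothetical, and it assumes a node counts as unhealthy once a matching condition has held for at least the configured duration, and that remediation proceeds only while at least minHealthy selected nodes remain healthy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class UnhealthyRule:
    """Mirrors one unhealthyConditions entry (hypothetical class)."""
    type: str
    status: str
    duration: timedelta


@dataclass
class NodeCondition:
    """Simplified node condition: type, status, and when it last changed."""
    type: str
    status: str
    last_transition: datetime


# The two rules from the template above, both with a zero duration.
RULES = [
    UnhealthyRule("Ready", "False", timedelta(0)),
    UnhealthyRule("Ready", "Unknown", timedelta(0)),
]


def is_unhealthy(conditions: List[NodeCondition],
                 rules: List[UnhealthyRule],
                 now: datetime) -> bool:
    """True if any condition matches a rule for at least its duration."""
    for cond in conditions:
        for rule in rules:
            matches = cond.type == rule.type and cond.status == rule.status
            if matches and now - cond.last_transition >= rule.duration:
                return True
    return False


def remediation_allowed(healthy_count: int, min_healthy: int) -> bool:
    """Assumed minHealthy semantics: remediate only if enough nodes are healthy."""
    return healthy_count >= min_healthy
```

Under these assumptions, a duration of 0s means a node qualifies the moment its Ready condition flips, and minHealthy: 0 means remediation is never held back by the number of healthy nodes.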
Operational Considerations

Maintenance and Upgrades

Pause automated-failover during planned maintenance to avoid unnecessary remediation:

    kubectl patch NodeHealthCheck <cr-name> --patch '{"spec":{"pauseRequests":["maintenance"]}}' --type=merge

Remove pauseRequests from the spec after maintenance to resume automated-failover.

Limitations

• I/O is only blocked on failed nodes for force-detach supported volumes.
• Pods with unsupported volumes are not automatically removed.
• Failover is delayed if the node hosting the trident-controller fails. Trident plans to address this in an upcoming release.

Integrating Custom Node Health Check Solutions

Node-Healthcheck-Operator can be replaced with alternative node failure detection tools if desired. To ensure compatibility with the automated failover mechanism, your custom solution should:

• Create a TNR when a node failure is detected, using the failed node’s name as the TNR CR name.
• Delete the TNR when the node has recovered and the TNR is in the Succeeded state.

Conclusion

Automated-Failover in NetApp Trident delivers fast, safe, and automated migration of stateful workloads from failed nodes to healthy ones. By combining intelligent pod and volume management with ONTAP export policy updates, Trident ensures your data remains safe and your applications highly available, even in the face of node failures.

Ready to experience seamless failover? Enable Automated-Failover in your Trident deployment today and keep your Kubernetes workloads running smoothly.

Learn more: Automated-Failover Documentation
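For readers building a custom health check, the two TNR requirements listed under Integrating Custom Node Health Check Solutions can be sketched as a single decision step. This is a minimal illustration under stated assumptions: tnr_action is a hypothetical helper, the way a TNR's state is read from the cluster is not shown, and only the state name Succeeded comes from the post.

```python
from typing import Optional, Tuple


def tnr_action(node_name: str,
               node_ready: bool,
               tnr_state: Optional[str]) -> Tuple[str, str]:
    """Return (action, tnr_name) for one node.

    tnr_state is None when no TNR named after the node exists;
    otherwise it is the TNR's current state as a string.
    """
    # The TNR CR name is the failed node's name.
    tnr_name = node_name
    if not node_ready and tnr_state is None:
        # Failure detected and no remediation requested yet.
        return ("create", tnr_name)
    if node_ready and tnr_state == "Succeeded":
        # Node recovered and remediation finished: clean up.
        return ("delete", tnr_name)
    # In every other case, leave things as they are.
    return ("none", tnr_name)
```

A custom watcher could call this on each node status change and then create or delete the TNR in Trident's namespace with whichever Kubernetes client it already uses.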

DavidvonThenen

GraphRAG has taken off fast, mostly because teams want AI systems that can explain themselves. But building and operating a full knowledge graph means managing schemas, ontologies, and graph infrastructure before you even know if the use case will pay off. This post walks through what sits in the middle: a BM25-based retrieval mechanism. It looks at Hybrid RAG in practice and explains why treating retrieval as a first-class, controllable step matters more than throwing more embeddings at the problem. The focus is on how retrieval choices shape answers, trust, and long-term reliability without turning your stack into a research project.

Chahat

If you’re looking to migrate or to analyze your data and free up storage, then you’re in the right place. In today’s data-driven world, it’s critical for you to effectively manage and protect your data. Manual data classification is not only impractical but also prone to human error. Enter the automated NetApp® Data Classification service, which is an absolute game-changer in managing your data. Let's delve a little deeper into the great benefits that you get with Data Classification!

dblackwe

Looking for a New Year’s resolution you can actually keep? Look no further! NetApp has been your go-to for certified Ansible modules, and now, with the release of the latest StorageGRID Ansible collection (version 21.16.0), automating your StorageGRID environment has never been easier. 🚀 Whether you're new to automation or a seasoned pro, this is the perfect time to dive into using these powerful modules. Our detailed guide walks you through onboarding a new tenant with a single Ansible playbook, making complex tasks a breeze. From creating tenants and buckets to generating access keys, we've got you covered. Check out the full blog post for a step-by-step breakdown and start your automation journey today!

nkarthik

This blog provides a comprehensive guide to implementing data tiering in Hadoop environments using NetApp XCP, NFS, and S3 storage solutions. It covers setup, migration, verification, and automation strategies to optimize storage costs and performance.
• Benefits of Hadoop data tiering: Data tiering moves frequently accessed “hot” data to high-performance storage and infrequently accessed “cold” data to cost-effective object storage, optimizing storage costs and query performance while maintaining governance.
• Role of NetApp XCP: XCP facilitates high-throughput, scalable migrations from HDFS to NetApp NFS (hot storage) and S3 (cold storage), ensuring data integrity through verification features and supporting integration with Hadoop clusters.
• Architecture and process flow: The workflow involves classifying HDFS files by modification time into /hot and /cold directories, migrating these to NetApp NFS and S3 respectively using XCP, followed by verification of data integrity.
• Prerequisites and environment setup: The Hadoop cluster must be configured in HDFS mode with appropriate directories and storage policies (/hot as HOT, /cold as COLD). NetApp NFS and S3 targets must be configured and accessible from the XCP host, which requires specific environment variables for Java and Hadoop libraries.
• Data migration and verification examples: Sample commands demonstrate copying data from HDFS /hot to NetApp NFS and verifying the transfer using XCP. Migration to S3 requires professional support and proper configuration of AWS profiles and endpoints.
• Automated tiering script: A provided bash script classifies files by age, moves them to /hot or /cold, and runs XCP copy and verify commands for NFS and S3 targets. It supports dry-run mode and configurable parameters for flexible operation.
• Oozie workflow integration: The guide includes sample Oozie workflow and coordinator XML configurations to automate the tiering process on a scheduled basis, enabling repeatable and auditable execution within Cloudera Hadoop environments.
• Operational recommendations and outcomes: Running XCP as root with unique migration IDs and clean catalogs is advised. The process yields 40–60% storage capacity savings by reducing replicated data copies on enterprise storage, while maintaining high availability and data protection through NetApp features.
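The classification step summarized above (sorting files by modification time into /hot and /cold) can be illustrated with a short sketch. This is not the bash script from the guide: the classify function is hypothetical, the age threshold is an assumed parameter that the summary does not specify, and the example works on an in-memory mapping rather than on HDFS.

```python
from datetime import datetime, timedelta
from typing import Dict


def classify(files: Dict[str, datetime],
             now: datetime,
             hot_window: timedelta) -> Dict[str, str]:
    """Map each path to '/hot' or '/cold' by modification time.

    files maps a path to its last modification time. A file modified
    within hot_window of now is treated as hot; anything older is cold.
    """
    tiers = {}
    for path, mtime in files.items():
        age = now - mtime
        tiers[path] = "/hot" if age <= hot_window else "/cold"
    return tiers
```

With an assumed 30-day window, a file modified yesterday would be assigned to /hot and one last modified a year ago to /cold; the resulting mapping could then drive the move and XCP copy steps the guide describes.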

Blog Activity

Automated-Failover in NetApp Trident: Fast, Safe Workload Migration After Node Failure

Hybrid RAG in the Real World: Graphs, BM25, and the End of Black-Box Retrieval

Elevate your data management with NetApp Data Classification service through NetApp Console

StorageGRID Automation: A Resolution You Can Keep

Hadoop Tiering to NetApp NFS (HOT) and S3 (COLD) with NetApp XCP — End‑to‑End Guide & Automation