Discover how NetApp’s AI Data Guardrails turn governance into a living system—enabling secure, compliant, and scalable AI platforms. From risk managem…
By Mohammad Hossein Hajkazemi, Bhushan Jain, and Arpan Chowdhry
Introduction
Google Cloud NetApp Volumes is a fully managed, cloud-native storage s…
NetApp Console delivers HIPAA (Health Insurance Portability and Accountability Act)-compliant data intelligence without storing ePHI
NetApp Console n…
NetApp Console delivers simplicity with Console agent
NetApp® Console agent is the secure and trusted software from NetApp that enables the workflows…
Introduction
Kubernetes is built for resilience, but node failures can still disrupt stateful applications—especially when persistent storage is involved. Traditionally, moving workloads off a failed node required manual intervention and careful coordination to avoid data corruption.
NetApp Trident’s new Automated-Failover feature changes the game. Now, when a node fails, Trident automatically and safely moves your workloads to a healthy node. This is achieved by:
Automatically removing pods and volume attachments from the failed node
Updating ONTAP volume export policies to block the failed node’s access
Allowing the volume (and pod) to be safely mounted to a healthy node
Gracefully reintroducing failed nodes after they recover
This process minimizes downtime and eliminates the risk of data corruption or multi-attach issues—delivering true high availability for your stateful Kubernetes workloads.
What is Automated-Failover?
Automated-Failover is an enhancement to Trident’s Force-Detach capability, designed to orchestrate rapid, safe failover of workloads from failed nodes to healthy ones.
Key Benefits:
Rapid failover: Workloads are quickly rescheduled to healthy nodes.
Data safety: ONTAP export policies are updated to block failed nodes, preventing split-brain and data corruption.
Automation: No manual intervention required—failover is triggered and managed automatically.
Granular control: Customize which pods are removed using annotations.
How Automated-Failover Works
Workflow Overview
Automated-Failover integrates with the open source operator Node-Healthcheck-Operator (NHC). When a node fails, NHC detects the failure and creates a TridentNodeRemediation (TNR) custom resource in Trident’s namespace. Trident then orchestrates the following sequence:
Mark the failed node as “dirty” (preventing new volume publications).
Remove eligible pods and their volume attachments from the failed node.
Update ONTAP export policies to block the failed node’s access to the affected volumes.
Kubernetes reschedules the pods to a healthy node, where Trident safely attaches the volumes.
Graceful node reintroduction when the failed node comes back online: Trident verifies that all expected volumes are unmounted and detached from the node before allowing new attachments.
This ensures that failed nodes cannot write to volumes, eliminating the risk of data corruption and enabling safe, fast failover.
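To watch this sequence play out during a failure, you can inspect the remediation custom resources directly. A minimal sketch, assuming the TridentNodeRemediation CRD's plural name is tridentnoderemediations and that Trident is installed in the trident namespace:
# Watch the remediation CRs that NHC creates for failed nodes
kubectl get tridentnoderemediations -n trident -w
# Inspect a specific remediation (its name typically matches the failed node's name)
kubectl describe tridentnoderemediation <failed-node-name> -n trident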
Selective Pod Removal and Volume Handling
Automated-Failover is selective—it only removes pods that can be safely failed over:
A pod is removed if: all of its volumes/PVCs are supported by force-detach (see the Force-Detach Documentation).
Supported volumes:
NAS and NAS-economy volumes using auto-export policies (excluding SMB)
SAN and SAN-economy volumes
A pod is NOT removed if:
It has any volume that is not supported by force-detach
It is stateless (has no PVCs)
Volume Access Control
A critical step in the failover process is updating ONTAP export policies (NAS), igroups (SAN), and subsystems (NVMe). When Trident removes a pod and its volume attachment from a failed node, it also modifies ONTAP access control so that no process on the failed node can write to the volume. This ensures:
The failed node cannot write to the volume, even if it comes back online unexpectedly.
The volume can be safely attached and written to by a new pod on a healthy node.
Risks of split-brain and data corruption are eliminated.
Customizing Failover Behavior
You can control which pods are removed during failover using the trident.netapp.io/podRemediationPolicy annotation:
retain: The pod will not be removed from the failed node.
delete: The pod will be removed from the failed node.
Example (the selector, labels, and container below are generic placeholders added to make the manifest complete; the annotation is the relevant part):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        trident.netapp.io/podRemediationPolicy: "retain"
    spec:
      containers:
        - name: my-app
          image: my-app:latest # placeholder image
Note: While you can use these annotations on any pod, only volumes that support force-detach will have ONTAP export policies updated to block the failed node’s access. For pods with unsupported volumes, Trident cannot guarantee safe failover or prevent potential data corruption—see the Force-Detach Documentation for details.
This annotation is especially useful for stateless pods (those with no persistent storage), allowing you to control their removal during failover events.
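For example, assuming Trident honors the annotation on the pod object itself (as the Deployment template above implies), you could mark a standalone pod for retention directly with kubectl:
kubectl annotate pod <pod-name> trident.netapp.io/podRemediationPolicy=retain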
Installation and Setup
1. Enable Force-Detach in Trident
tridentctl installs: Use the --enable-force-detach flag during installation.
Helm/Operator installs: Set enableForceDetach to true in the TridentOrchestrator spec (see the sketch below).
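As a minimal sketch of both options (the namespace and resource names here are illustrative, not prescriptive):
# tridentctl install
tridentctl install -n trident --enable-force-detach
# TridentOrchestrator spec for Helm/operator installs
apiVersion: trident.netapp.io/v1
kind: TridentOrchestrator
metadata:
  name: trident
spec:
  namespace: trident
  enableForceDetach: true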
2. Install Node-Healthcheck-Operator (NHC)
Prerequisites:
Install operator-sdk
Install OLM: operator-sdk olm install
Install NHC: kubectl create -f https://operatorhub.io/install/node-healthcheck-operator.yaml
3. Configure NodeHealthCheck CR
Here’s a recommended template:
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: <CR name, e.g. nhc-worker>
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
  remediationTemplate:
    apiVersion: trident.netapp.io/v1
    kind: TridentNodeRemediationTemplate
    namespace: <Trident installation namespace>
    name: trident-node-remediation-template
  minHealthy: 0 # Trigger force-detach upon one or more node failures
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 0s
    - type: Ready
      status: Unknown
      duration: 0s
Note: This configuration triggers failover immediately when a worker node's Ready condition becomes False or Unknown.
Operational Considerations
Maintenance and Upgrades
Pause automated-failover during planned maintenance to avoid unnecessary remediation:
kubectl patch NodeHealthCheck <cr-name> --patch '{"spec":{"pauseRequests":["maintenance"]}}' --type=merge
Remove pauseRequests from the spec after maintenance to resume automated-failover.
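For example, one way to clear the pause request is a JSON patch that removes the field (a merge patch setting pauseRequests to null works as well):
kubectl patch NodeHealthCheck <cr-name> --type=json --patch '[{"op":"remove","path":"/spec/pauseRequests"}]'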
Limitations
I/O is only blocked on failed nodes for force-detach supported volumes.
Pods with unsupported volumes are not automatically removed.
Failover is delayed if the node hosting the trident-controller pod fails; Trident plans to address this in an upcoming release.
Integrating Custom Node Health Check Solutions
Node-Healthcheck-Operator can be replaced with an alternative node-failure detection tool if desired. To remain compatible with the automated failover mechanism, your custom solution should do the following (a rough TNR sketch appears after this list):
Create a TNR when a node failure is detected, using the failed node’s name as the TNR CR name.
Delete the TNR when the node has recovered and the TNR is in the Succeeded state.
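As an illustration only (the exact TridentNodeRemediation schema is not shown here, so the empty spec is an assumption; the group/version follows the remediation template above), the CR a custom detector creates would look roughly like this:
apiVersion: trident.netapp.io/v1
kind: TridentNodeRemediation
metadata:
  name: <failed-node-name> # must equal the failed node's name
  namespace: <Trident installation namespace>
spec: {}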
Conclusion
Automated-Failover in NetApp Trident delivers fast, safe, and automated migration of stateful workloads from failed nodes to healthy ones. By combining intelligent pod and volume management with ONTAP export policy updates, Trident ensures your data remains safe and your applications highly available—even in the face of node failures.
Ready to experience seamless failover? Enable Automated-Failover in your Trident deployment today and keep your Kubernetes workloads running smoothly.
Learn more: Automated-Failover Documentation
GraphRAG has taken off fast, mostly because teams want AI systems that can explain themselves... But building and operating a full knowledge graph means managing schemas, ontologies, and graph infrastructure before you even know if the use case will pay off. This post walks through what sits in the middle: a BM25-based retrieval mechanism. It looks at Hybrid RAG in practice and explains why treating retrieval as a first-class, controllable step matters more than throwing more embeddings at the problem. The focus is on how retrieval choices shape answers, trust, and long-term reliability without turning your stack into a research project.
If you’re looking to migrate or to analyze your data and free up storage, then you’re in the right place.
In today’s data-driven world, it’s critical for you to effectively manage and protect your data. Manual data classification is not only impractical but also prone to human error. Enter the automated NetApp® Data Classification service, which is an absolute game-changer in managing your data. Let's delve a little deeper into the great benefits that you get with Data Classification!
Looking for a New Year’s resolution you can actually keep? Look no further! NetApp has been your go-to for certified Ansible modules, and now, with the release of the latest StorageGRID Ansible collection (version 21.16.0), automating your StorageGRID environment has never been easier. 🚀
Whether you're new to automation or a seasoned pro, this is the perfect time to dive into using these powerful modules. Our detailed guide walks you through onboarding a new tenant with a single Ansible playbook, making complex tasks a breeze. From creating tenants and buckets to generating access keys, we've got you covered.
Check out the full blog post for a step-by-step breakdown and start your automation journey today!
This blog provides a comprehensive guide to implementing data tiering in Hadoop environments using NetApp XCP, NFS, and S3 storage solutions. It covers setup, migration, verification, and automation strategies to optimize storage costs and performance.
• Benefits of Hadoop data tiering: Data tiering moves frequently accessed “hot” data to high-performance storage and infrequently accessed “cold” data to cost-effective object storage, optimizing storage costs and query performance while maintaining governance.
• Role of NetApp XCP: XCP facilitates high-throughput, scalable migrations from HDFS to NetApp NFS (hot storage) and S3 (cold storage), ensuring data integrity through verification features and supporting integration with Hadoop clusters.
• Architecture and process flow: The workflow involves classifying HDFS files by modification time into /hot and /cold directories, migrating these to NetApp NFS and S3 respectively using XCP, followed by verification of data integrity.
• Prerequisites and environment setup: The Hadoop cluster must be configured in HDFS mode with appropriate directories and storage policies (/hot as HOT, /cold as COLD). NetApp NFS and S3 targets must be configured and accessible from the XCP host, which requires specific environment variables for Java and Hadoop libraries.
• Data migration and verification examples: Sample commands demonstrate copying data from HDFS /hot to NetApp NFS and verifying the transfer using XCP. Migration to S3 requires professional support and proper configuration of AWS profiles and endpoints.
• Automated tiering script: A provided bash script classifies files by age, moves them to /hot or /cold, and runs XCP copy and verify commands for NFS and S3 targets. It supports dry-run mode and configurable parameters for flexible operation.
• Oozie workflow integration: The guide includes sample Oozie workflow and coordinator XML configurations to automate the tiering process on a scheduled basis, enabling repeatable and auditable execution within Cloudera Hadoop environments.
• Operational recommendations and outcomes: Running XCP as root with unique migration IDs and clean catalogs is advised. The process yields 40–60% storage capacity savings by reducing replicated data copies on enterprise storage, while maintaining high availability and data protection through NetApp features.