Tech ONTAP Blogs
Kubernetes is built for resilience, but node failures can still disrupt stateful applications—especially when persistent storage is involved. Traditionally, moving workloads off a failed node required manual intervention and careful coordination to avoid data corruption.
NetApp Trident’s new Automated-Failover feature changes the game.
Now, when a node fails, Trident automatically and safely moves your workloads to a healthy node. This is achieved by:
This process minimizes downtime and eliminates the risk of data corruption or multi-attach issues—delivering true high availability for your stateful Kubernetes workloads.
Automated-Failover is an enhancement to Trident’s Force-Detach capability, designed to orchestrate rapid, safe failover of workloads from failed nodes to healthy ones.
Key Benefits:
Automated-Failover integrates with the open source operator Node-Healthcheck-Operator (NHC). When a node fails, NHC detects the failure and creates a TridentNodeRemediation (TNR) custom resource in Trident’s namespace. Trident then orchestrates the following sequence:
This ensures that failed nodes cannot write to volumes, eliminating the risk of data corruption and enabling safe, fast failover.
Automated-Failover is selective—it only removes pods that can be safely failed over:
A critical step in the failover process is updating ONTAP export policies (NAS), igroups (SAN), and subsystems (NVMe).
When Trident removes a pod and its volume attachment from a failed node, it also modifies ONTAP's access control so that no process on the failed node can write to the volume.
This ensures:
You can control which pods are removed during failover using the trident.netapp.io/podRemediationPolicy annotation:
retain: Pod will not be removed from the failed node.
delete: Pod will be removed from the failed node.
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        trident.netapp.io/podRemediationPolicy: "retain"
Note:
While you can use these annotations on any pod, only volumes that support force-detach will have ONTAP export policies updated to block the failed node’s access. For pods with unsupported volumes, Trident cannot guarantee safe failover or prevent potential data corruption; see the Force-Detach Documentation for details.
This annotation is especially useful for stateless pods (those with no persistent storage), allowing you to control their removal during failover events.
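As a counterpart to the retain example, a stateless Deployment could opt in to removal with the delete policy. This is a minimal sketch: the name stateless-web, the app label, and the nginx image are hypothetical placeholders, not values taken from the Trident documentation.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-web            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: stateless-web         # hypothetical label
  template:
    metadata:
      labels:
        app: stateless-web
      annotations:
        # Ask Trident to remove this pod from a failed node during failover
        trident.netapp.io/podRemediationPolicy: "delete"
    spec:
      containers:
        - name: web
          image: nginx:stable    # placeholder image; the pod mounts no persistent volumes
```

Because the annotation sits on the pod template, every replica created by the Deployment carries it.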
tridentctl installs:
Use the --enable-force-detach flag during installation.
Helm/Operator installs:
Enable enableForceDetach in the TridentOrchestrator spec.
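For operator-based installs, the setting might look like the sketch below. The resource name trident and the namespace trident are assumptions based on a default-style install; substitute the values used in your own cluster.

```yaml
apiVersion: trident.netapp.io/v1
kind: TridentOrchestrator
metadata:
  name: trident                # assumed resource name
spec:
  namespace: trident           # assumed Trident installation namespace
  enableForceDetach: true      # prerequisite for Automated-Failover
```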
operator-sdk olm install
kubectl create -f https://operatorhub.io/install/node-healthcheck-operator.yaml
Here’s a recommended template:
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: <CR name, e.g. nhc-worker>
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
  remediationTemplate:
    apiVersion: trident.netapp.io/v1
    kind: TridentNodeRemediationTemplate
    namespace: <Trident installation namespace>
    name: trident-node-remediation-template
  minHealthy: 0 # Trigger force-detach upon one or more node failures
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 0s
    - type: Ready
      status: Unknown
      duration: 0s
Note: This configuration triggers failover immediately when a worker node is marked Ready: False or Unknown.
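The duration field sets how long a node must stay in the unhealthy condition before NHC starts remediation. If immediate failover proves too sensitive to brief interruptions in your environment, a longer duration is one option. The fragment below is a sketch only: the 60s value is illustrative and not a recommendation from the Trident documentation, and it trades slower failover for fewer false positives.

```yaml
spec:
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 60s    # illustrative value; the recommended template uses 0s
    - type: Ready
      status: Unknown
      duration: 60s    # illustrative value; the recommended template uses 0s
```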
Pause automated-failover during planned maintenance to avoid unnecessary remediation:
kubectl patch NodeHealthCheck <cr-name> --patch '{"spec":{"pauseRequests":["maintenance"]}}' --type=merge
Remove pauseRequests from the spec after maintenance to resume automated-failover.
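If you manage the NodeHealthCheck resource declaratively rather than with kubectl patch, the paused state corresponds to a spec fragment like the following sketch. The string maintenance is simply the reason used in the patch command above; any descriptive text should work.

```yaml
spec:
  pauseRequests:
    - maintenance    # free-form reason; while this list is non-empty, remediation is paused
```

Deleting the pauseRequests field and re-applying the manifest resumes automated-failover.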
Node-Healthcheck-Operator can be replaced with alternative node failure detection tools if desired.
To ensure compatibility with the automated failover mechanism, your custom solution should:
Succeeded state.
Automated-Failover in NetApp Trident delivers fast, safe, and automated migration of stateful workloads from failed nodes to healthy ones. By combining intelligent pod and volume management with ONTAP export policy updates, Trident ensures your data remains safe and your applications highly available—even in the face of node failures.
Ready to experience seamless failover?
Enable Automated-Failover in your Trident deployment today and keep your Kubernetes workloads running smoothly.
Learn more: Automated-Failover Documentation