Automated-Failover in NetApp Trident: Fast, Safe Workload Migration After Node Failure

jharrod · 2026-01-09

Introduction

Kubernetes is built for resilience, but node failures can still disrupt stateful applications—especially when persistent storage is involved. Traditionally, moving workloads off a failed node required manual intervention and careful coordination to avoid data corruption.

NetApp Trident’s new Automated-Failover feature changes the game.
Now, when a node fails, Trident automatically and safely moves your workloads to a healthy node. This is achieved by:

Automatically removing pods and volume attachments from the failed node
Updating ONTAP volume export policies to block the failed node’s access
Allowing the volume (and pod) to be safely mounted to a healthy node
Gracefully reintroducing failed nodes after they recover

This process minimizes downtime and eliminates the risk of data corruption or multi-attach issues—delivering true high availability for your stateful Kubernetes workloads.

What is Automated-Failover?

Automated-Failover is an enhancement to Trident’s Force-Detach capability, designed to orchestrate rapid, safe failover of workloads from failed nodes to healthy ones.

Key Benefits:

Rapid failover: Workloads are quickly rescheduled to healthy nodes.
Data safety: ONTAP export policies are updated to block failed nodes, preventing split-brain and data corruption.
Automation: No manual intervention required—failover is triggered and managed automatically.
Granular control: Customize which pods are removed using annotations.

How Automated-Failover Works

Workflow Overview

Automated-Failover integrates with the open-source Node-Healthcheck-Operator (NHC). When a node fails, NHC detects the failure and creates a TridentNodeRemediation (TNR) custom resource in Trident’s namespace. Trident then orchestrates the following sequence:

Mark the failed node as “dirty” (preventing new volume publications).
Remove eligible pods and their volume attachments from the failed node.
Update ONTAP export policies to block the failed node’s access to the affected volumes.
Kubernetes reschedules the pods to a healthy node, where Trident safely attaches the volumes.
Gracefully reintroduce the failed node when it comes back online. Trident verifies that all expected volumes are unmounted and detached from the node before allowing new attachments.

This ensures that failed nodes cannot write to volumes, eliminating the risk of data corruption and enabling safe, fast failover.
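The "dirty" gate at the start and end of this sequence can be pictured with a toy model. The sketch below is illustrative only and is not Trident's implementation: the class and method names are hypothetical, and it assumes the dirty flag is cleared only once every volume has been cleaned up on the recovered node.

```python
class FailedNodeModel:
    """Toy model of the dirty-node gate described in the workflow above."""

    def __init__(self, attached_volumes):
        self.dirty = False
        self.attached = set(attached_volumes)

    def mark_failed(self):
        # First step: the node is marked dirty, blocking new publications.
        self.dirty = True

    def can_publish(self):
        # New volume publications are allowed only on a clean node.
        return not self.dirty

    def volume_cleaned_up(self, volume):
        # The recovered node has unmounted and detached this volume.
        self.attached.discard(volume)

    def try_reintroduce(self):
        # Last step: clear the dirty flag only when nothing is left attached.
        if self.dirty and not self.attached:
            self.dirty = False
        return not self.dirty
```

In this model a recovered node stays blocked until its last stale attachment is gone, which is the property that prevents a returning node from writing to a volume now in use elsewhere.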

Selective Pod Removal and Volume Handling

Automated-Failover is selective—it only removes pods that can be safely failed over:

A pod is removed if:
- All of its volumes/PVCs are supported by force-detach (see Force-Detach Documentation). Supported volumes:
  - NAS and NAS-economy volumes using auto-export policies (excluding SMB)
  - SAN and SAN-economy volumes
A pod is NOT removed if:
- It has any volume not supported by force-detach
- It is stateless (no PVCs)
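The default selection rules above can be expressed as a small predicate. This is a minimal sketch, not Trident code: the volume labels in SUPPORTED_KINDS are hypothetical names invented for illustration and are not Trident driver identifiers.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical labels for the supported volume categories listed above.
SUPPORTED_KINDS = {
    "nas-auto-export",
    "nas-economy-auto-export",
    "san",
    "san-economy",
}


@dataclass
class Pod:
    name: str
    # One entry per PVC-backed volume; empty for a stateless pod.
    volume_kinds: List[str] = field(default_factory=list)


def removed_by_default(pod: Pod) -> bool:
    """Return True if the pod would be removed under the default rules."""
    if not pod.volume_kinds:
        # Stateless pods (no PVCs) are not removed by default.
        return False
    # Every volume must be supported by force-detach.
    return all(kind in SUPPORTED_KINDS for kind in pod.volume_kinds)
```

For example, a pod with one SAN volume is removed, while a pod that mixes a SAN volume with an SMB volume is left in place because one of its volumes cannot be fenced.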

Volume Access Control

A critical step in the failover process is updating ONTAP export policies (NAS), igroups (SAN), and subsystems (NVMe).
When Trident removes a pod and its volume attachment from a failed node, it also modifies ONTAP’s access control so that no process on the failed node can write to the volume.
This ensures:

The failed node cannot write to the volume, even if it comes back online unexpectedly.
The volume can be safely attached and written to by a new pod on a healthy node.
Risks of split-brain and data corruption are eliminated.

Customizing Failover Behavior

You can control which pods are removed during failover using the trident.netapp.io/podRemediationPolicy annotation:

retain: Pod will not be removed from the failed node.
delete: Pod will be removed from the failed node.

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        trident.netapp.io/podRemediationPolicy: "retain"

Note:
While you can use this annotation on any pod, only volumes that support force-detach will have ONTAP export policies updated to block the failed node’s access. For pods with unsupported volumes, Trident cannot guarantee safe failover or prevent potential data corruption—see the Force-Detach Documentation for details.

This annotation is especially useful for stateless pods (those with no persistent storage), allowing you to control their removal during failover events.
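One way to read the annotation is as an override on top of the default selection. The sketch below assumes exactly that, and it also assumes that a missing or unrecognized value falls back to the default; neither assumption is confirmed here, and the function name is hypothetical.

```python
from typing import Mapping

ANNOTATION = "trident.netapp.io/podRemediationPolicy"


def is_removed(annotations: Mapping[str, str], removed_by_default: bool) -> bool:
    """Apply the pod remediation policy annotation to the default decision."""
    policy = annotations.get(ANNOTATION)
    if policy == "retain":
        # Explicit opt-out: the pod stays on the failed node.
        return False
    if policy == "delete":
        # Explicit opt-in: the pod is removed, even if it is stateless.
        return True
    # Assumption: anything else leaves the default decision unchanged.
    return removed_by_default
```

Under this reading, a stateless pod annotated with "delete" is removed during failover even though stateless pods are not removed by default.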

Installation and Setup

1. Enable Force-Detach in Trident

tridentctl installs:
Use the --enable-force-detach flag during installation.

Helm/Operator installs:
Set enableForceDetach to true in the TridentOrchestrator spec.

2. Install Node-Healthcheck-Operator (NHC)

Prerequisites:
- Install operator-sdk
- Install OLM:
  operator-sdk olm install

Install NHC:
kubectl create -f https://operatorhub.io/install/node-healthcheck-operator.yaml

3. Configure NodeHealthCheck CR

Here’s a recommended template:

apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: <CR name, e.g. nhc-worker>
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
  remediationTemplate:
    apiVersion: trident.netapp.io/v1
    kind: TridentNodeRemediationTemplate
    namespace: <Trident installation namespace>
    name: trident-node-remediation-template
  minHealthy: 0 # Trigger force-detach upon one or more node failures
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 0s
    - type: Ready
      status: Unknown
      duration: 0s

Note: With duration set to 0s, this configuration triggers failover as soon as a worker node’s Ready condition becomes False or Unknown.
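To see what the duration field controls, the following sketch models the unhealthyConditions list as a set of threshold rules. It is a simplified illustration, not NHC's implementation; the function name and the tuple layout are hypothetical.

```python
from datetime import timedelta

# Mirrors the unhealthyConditions list in the template above:
# (condition type, status, minimum time spent in that status).
RULES = [
    ("Ready", "False", timedelta(seconds=0)),
    ("Ready", "Unknown", timedelta(seconds=0)),
]


def is_unhealthy(cond_type, status, time_in_status, rules=RULES):
    """Return True if a node condition matches any rule for long enough."""
    return any(
        cond_type == rule_type
        and status == rule_status
        and time_in_status >= rule_duration
        for rule_type, rule_status, rule_duration in rules
    )
```

With a duration of 0s a node qualifies the moment its Ready condition changes. Raising the duration trades slower failover for fewer remediations triggered by brief blips.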

Operational Considerations

Maintenance and Upgrades

Pause automated-failover during planned maintenance to avoid unnecessary remediation:

kubectl patch NodeHealthCheck <cr-name> --patch '{"spec":{"pauseRequests":["maintenance"]}}' --type=merge

Remove pauseRequests from the spec after maintenance to resume automated-failover.
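If you script maintenance windows, the two patch bodies can be generated rather than typed by hand. The sketch below is a convenience under stated assumptions: the helper names are hypothetical, and the resume patch relies on JSON merge patch semantics, where a null value removes the field.

```python
import json


def pause_patch(reason: str) -> str:
    """Build the merge patch body that pauses remediation."""
    return json.dumps({"spec": {"pauseRequests": [reason]}})


def resume_patch() -> str:
    """Build the merge patch body that removes pauseRequests."""
    # In a JSON merge patch, setting a field to null deletes it.
    return json.dumps({"spec": {"pauseRequests": None}})


if __name__ == "__main__":
    print(pause_patch("maintenance"))
    print(resume_patch())
```

Either string can be passed as the --patch argument of the kubectl patch command shown above, keeping --type=merge.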

Limitations

I/O is only blocked on failed nodes for force-detach supported volumes.
Pods with unsupported volumes are not automatically removed.
Failover is delayed if the node hosting the trident-controller fails. Trident plans to address this in an upcoming release.

Integrating Custom Node Health Check Solutions

Node-Healthcheck-Operator can be replaced with alternative node failure detection tools if desired.
To ensure compatibility with the automated failover mechanism, your custom solution should:

Create a TNR when a node failure is detected, using the failed node’s name as the TNR CR name.
Delete the TNR when the node has recovered and the TNR is in the Succeeded state.
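A custom detector therefore has two responsibilities, sketched below in Python. Several details are assumptions: the apiVersion is assumed to match the trident.netapp.io/v1 group used by the remediation template, the TNR spec is omitted because it is not described here, and the function names are hypothetical. How the Succeeded state is read from the cluster is left to the caller.

```python
def build_tnr(node_name: str, trident_namespace: str) -> dict:
    """Build a TNR manifest named after the failed node."""
    return {
        # Assumption: same API group/version as the remediation template.
        "apiVersion": "trident.netapp.io/v1",
        "kind": "TridentNodeRemediation",
        "metadata": {
            # The CR name must be the failed node's name.
            "name": node_name,
            "namespace": trident_namespace,
        },
    }


def should_delete_tnr(node_recovered: bool, tnr_state: str) -> bool:
    """Delete the TNR only after recovery and successful remediation."""
    return node_recovered and tnr_state == "Succeeded"
```

The manifest dictionary can be serialized and submitted with whichever Kubernetes client your detector already uses.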

Conclusion

Automated-Failover in NetApp Trident delivers fast, safe, and automated migration of stateful workloads from failed nodes to healthy ones. By combining intelligent pod and volume management with ONTAP export policy updates, Trident ensures your data remains safe and your applications highly available—even in the face of node failures.

Ready to experience seamless failover?
Enable Automated-Failover in your Trident deployment today and keep your Kubernetes workloads running smoothly.

Learn more: Automated-Failover Documentation