Tech ONTAP Blogs

NetApp Trident protect metrics and monitoring

PatricU
NetApp

NetApp® Trident™ protect provides advanced application data management capabilities that enhance the functionality and availability of stateful Kubernetes applications supported by NetApp ONTAP storage systems and the NetApp Trident Container Storage Interface (CSI) storage provisioner. It is compatible with a wide range of fully managed and self-managed Kubernetes offerings (see the supported Kubernetes distributions and storage back ends), making it an optimal solution for protecting your Kubernetes services across various platforms and regions.
In this blog post, I will demonstrate how to scrape and visualize the metrics provided by Trident and Trident protect using the popular open-source monitoring and visualization frameworks Prometheus and Grafana.

Prerequisites

To follow along with this guide, ensure you have the following:

  • A Kubernetes cluster with the latest versions of Trident and Trident protect installed, and the associated kubeconfig files
  • A NetApp ONTAP storage back end and Trident with configured storage back ends, storage classes, and volume snapshot classes
  • A configured object storage bucket for storing backups and metadata, with bucket replication configured
  • A workstation with kubectl configured to use the kubeconfig
  • The tridentctl-protect CLI of Trident protect installed on your workstation
  • Admin user permissions on the Kubernetes clusters

Prepare test environment

First, let's quickly go through the setup of the test environment used throughout this blog.

Sample application

We will use a simple MinIO application with a persistent volume on Azure NetApp Files (ANF) as our sample application for the monitoring tests. The MinIO application is deployed on an Azure Kubernetes Service (AKS) cluster with NetApp Trident 25.06.0 installed and configured:

$ kubectl get all,pvc -n minio
NAME                                 READY   STATUS    RESTARTS   AGE
pod/minio-67dffb8bbd-5rfpm           1/1     Running   0          14m
pod/minio-console-677bd9ddcb-27497   1/1     Running   0          14m

NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/minio           ClusterIP   172.16.61.243   <none>        9000/TCP   14m
service/minio-console   ClusterIP   172.16.95.239   <none>        9090/TCP   14m

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/minio           1/1     1            1           14m
deployment.apps/minio-console   1/1     1            1           14m

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/minio-67dffb8bbd           1         1         1       14m
replicaset.apps/minio-console-677bd9ddcb   1         1         1       14m

NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/minio   Bound    pvc-ec50d895-4048-4a51-a651-5439b2a5ba2a   50Gi       RWO            azure-netapp-files-standard   <unset>                 14m

Create a Trident protect application

Create a Trident protect application minio based on the minio namespace with the Trident protect CLI:

$ tridentctl-protect create application minio --namespaces minio -n minio
Application "minio" created.
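
To verify, we can list the Trident protect applications in the namespace with the CLI (output omitted here; the exact columns may vary by Trident protect version):

$ tridentctl-protect get application -n minio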

Create a snapshot minio-snap and a backup minio-bkp:

$ tridentctl-protect create snapshot minio-snap --app minio --appvault demo -n minio
Snapshot "minio-snap" created.

$ tridentctl-protect create backup minio-bkp --app minio --appvault demo -n minio
Backup "minio-bkp" created.

Install kube-state-metrics

Trident protect leverages kube-state-metrics (KSM) to provide information about the health status of its resources. Kube-state-metrics is an open-source add-on for Kubernetes that listens to the Kubernetes API server and generates metrics about the state of various Kubernetes objects.

Install Prometheus ServiceMonitor CRD

First, we install the Custom Resource Definition (CRD) for the Prometheus ServiceMonitor using Helm. Add the prometheus-community Helm repository:

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
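
The ServiceMonitor CRD itself ships with the Prometheus operator project. One way to install just the CRDs is the prometheus-community/prometheus-operator-crds chart; the release name and the --create-namespace flag below are our choices (the prometheus namespace is used throughout this post):

$ helm install prometheus-operator-crds prometheus-community/prometheus-operator-crds \
    -n prometheus --create-namespace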

Install and configure kube-state-metrics

Now we install and configure kube-state-metrics to generate metrics from the Kubernetes API. Using it with Trident protect exposes useful information about the state of the Trident protect custom resources in our environment.

Let's create a configuration file for the KSM Helm chart to monitor these Trident protect CRs:

  • Snapshots
  • Backups
  • ExecHooksRuns
  • AppVaults (added in a later step)

Let’s take a closer look at the snapshot CR minio-snap that we created earlier.

$ kubectl -n minio get snapshot minio-snap -o yaml
apiVersion: protect.trident.netapp.io/v1
kind: Snapshot
metadata:
  annotations:
    protect.trident.netapp.io/correlationid: 42111244-fdb7-41f1-af39-7b61fdb0c7e1
  creationTimestamp: "2025-08-18T15:25:40Z"
  ...
  name: minio-snap
  namespace: minio
  ownerReferences:
  - apiVersion: protect.trident.netapp.io/v1
    kind: Application
    name: minio
    uid: efc8cdd4-8b20-48e0-8944-eeee8aba98f9
  resourceVersion: "14328"
  uid: c569472c-ae13-4d30-bffd-98acef304abc
spec:
  appVaultRef: demo
  applicationRef: minio
  cleanupSnapshot: false
  completionTimeout: 0s
  reclaimPolicy: Delete
  volumeSnapshotsCreatedTimeout: 0s
  volumeSnapshotsReadyToUseTimeout: 0s
status:
  appArchivePath: minio_efc8cdd4-8b20-48e0-8944-eeee8aba98f9/snapshots/20250818152540_minio-snap_c569472c-ae13-4d30-bffd-98acef304abc
  appVaultRef: demo
  completionTimestamp: "2025-08-18T15:25:58Z"
  ...
  postSnapshotExecHooksRunResults: []
  preSnapshotExecHooksRunResults: []
  state: Completed
  volumeSnapshots:
  - name: snapshot-c569472c-ae13-4d30-bffd-98acef304abc-pvc-ec50d895-4048-4a51-a651-5439b2a5ba2a
    namespace: minio

From its metadata section, we want to expose the name, UID, and creationTimestamp of the snapshot to Prometheus, and from the spec and status fields the metrics appVaultRef, applicationRef, and state. The corresponding KSM configuration entry looks like this.

      resources:
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Snapshot"
          version: "v1"
        labelsFromPath:
          snapshot_uid: [metadata, uid]
          snapshot_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: snapshot_info
          help: "Exposes details about the Snapshot state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                status: [status, state]

From the backup CR, which has the same structure as the snapshot CR, we can collect the same information using this KSM configuration entry.

      resources:
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Backup"
          version: "v1"
        labelsFromPath:
          backup_uid: [metadata, uid]
          backup_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: backup_info
          help: "Exposes details about the Backup state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                status: [status, state]

To access those CR fields, KSM needs the corresponding RBAC permissions to read the snapshot and backup CRs in all namespaces (since the Trident protect CRs are created in the application namespaces). So we add the following parameters to the KSM configuration file.

rbac:
  extraRules:
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["snapshots", "backups"]
    verbs: ["list", "watch"]

# collect metrics from ALL namespaces
namespaces: ""

Collecting the details for the ExecHooksRuns works the same way as for snapshots and backups, so we don't repeat the details here. Putting everything together, our first KSM configuration file looks like this.

$ cat metrics-config-backup-snapshot-hooks.yaml
extraArgs:
# collect only our metrics, not the default ones (deployments etc.)
- --custom-resource-state-only=true

customResourceState:
  enabled: true
  config:
    kind: CustomResourceStateMetrics
    spec:
      resources:
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Snapshot"
          version: "v1"
        labelsFromPath:
          snapshot_uid: [metadata, uid]
          snapshot_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: snapshot_info
          help: "Exposes details about the Snapshot state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                status: [status, state]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Backup"
          version: "v1"
        labelsFromPath:
          backup_uid: [metadata, uid]
          backup_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: backup_info
          help: "Exposes details about the Backup state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                status: [status, state]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Exechooksruns"
          version: "v1"
        labelsFromPath:
          ehr_uid: [metadata, uid]
          ehr_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: ehr_info
          help: "Exposes details about the Exec Hook state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                stage: ["spec", stage]
                action: ["spec", action]
                status: [status, state]
rbac:
  extraRules:
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["snapshots"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["backups"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["exechooksruns"]
    verbs: ["list", "watch"]

# collect metrics from ALL namespaces
namespaces: ""

# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
  monitor:
    enabled: true
    additionalLabels:
      release: prometheus

Now we can install KSM using Helm.

$ helm install trident-protect -f ./metrics-config-backup-snapshot-hooks.yaml prometheus-community/kube-state-metrics --version 5.21.0 -n prometheus
NAME: trident-protect
LAST DEPLOYED: Tue Aug 19 17:54:22 2025
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
The exposed metrics can be found here:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics

The metrics are exported on the HTTP endpoint /metrics on the listening port.
In your case, trident-protect-kube-state-metrics.prometheus.svc.cluster.local:8080/metrics

They are served either as plaintext or protobuf depending on the Accept header.
They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.

We check that the KSM ServiceMonitor was correctly deployed in the prometheus namespace.

$ kubectl -n prometheus get smon -l app.kubernetes.io/instance=trident-protect
NAME                                 AGE
trident-protect-kube-state-metrics   90s

$ kubectl get all -n prometheus
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/trident-protect-kube-state-metrics-94d55666c-69j6n   1/1     Running   0          105s

NAME                                         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/trident-protect-kube-state-metrics   ClusterIP   172.16.88.31   <none>        8080/TCP   105s

NAME                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/trident-protect-kube-state-metrics   1/1     1            1           105s

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/trident-protect-kube-state-metrics-94d55666c   1         1         1       105s
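
Even before Prometheus scrapes anything, we can sanity-check the raw KSM endpoint with a quick port-forward (an optional check):

$ kubectl -n prometheus port-forward svc/trident-protect-kube-state-metrics 8080:8080
$ curl -s localhost:8080/metrics | grep snapshot_info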

Prometheus installation

Let’s install Prometheus now on our cluster. Before doing that, we must make sure that the Prometheus server can access the Kubernetes API.

RBAC permissions

The Prometheus server needs access to the Kubernetes API to scrape targets, so a ServiceAccount must be created and bound to a ClusterRole with the necessary privileges. By applying the YAML file below, we create the ServiceAccount prometheus and a ClusterRole prometheus with the necessary privileges, which we then bind to the ServiceAccount.

$ cat ./rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: prometheus

$ kubectl apply -f ./rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created

Now we’re ready to install Prometheus.

Deploy Prometheus

After creating the Prometheus ServiceAccount and giving it access to the Kubernetes API, we can deploy the Prometheus instance.

We'll use the Prometheus operator for the installation. Following the instructions to install the operator in the prometheus namespace deploys it on our Kubernetes cluster within a few minutes.
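
For reference, the upstream quickstart essentially boils down to applying the operator's release bundle (a sketch; note that the published bundle targets the default namespace, so its namespace fields need to be adjusted to deploy the operator into the prometheus namespace as we do here):

$ LATEST=$(curl -s https://api.github.com/repos/prometheus-operator/prometheus-operator/releases/latest | jq -cr .tag_name)
$ curl -sL https://github.com/prometheus-operator/prometheus-operator/releases/download/${LATEST}/bundle.yaml | kubectl create -f -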

The following manifest defines the serviceMonitorNamespaceSelector, serviceMonitorSelector, and podMonitorSelector fields to specify which CRs to include. In this example, the {} value matches all existing CRs.

$ cat ./prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  resources:
    requests:
      memory: 400Mi

We apply the manifest and check that the Prometheus instance eventually reaches the Running state and that a prometheus-operated Service was created:

$ kubectl apply -f ./prometheus.yaml
prometheus.monitoring.coreos.com/prometheus created

$ kubectl get prometheus -n prometheus
NAME         VERSION   DESIRED   READY   RECONCILED   AVAILABLE   AGE
prometheus                       1       True         True        42s

$ kubectl get services -n prometheus
NAME                                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
prometheus-operated                  ClusterIP   None           <none>        9090/TCP   103s
prometheus-operator                  ClusterIP   None           <none>        8080/TCP   7m44s
trident-protect-kube-state-metrics   ClusterIP   172.16.88.31   <none>        8080/TCP   17h

$ kubectl get all -n prometheus
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/prometheus-operator-5d697c648f-22lrz                 1/1     Running   0          6m21s
pod/prometheus-prometheus-0                              2/2     Running   0          20s
pod/trident-protect-kube-state-metrics-94d55666c-69j6n   1/1     Running   0          17h

NAME                                         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/prometheus-operated                  ClusterIP   None           <none>        9090/TCP   20s
service/prometheus-operator                  ClusterIP   None           <none>        8080/TCP   6m21s
service/trident-protect-kube-state-metrics   ClusterIP   172.16.88.31   <none>        8080/TCP   17h

NAME                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-operator                  1/1     1            1           6m21s
deployment.apps/trident-protect-kube-state-metrics   1/1     1            1           17h

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-operator-5d697c648f                 1         1         1       6m21s
replicaset.apps/trident-protect-kube-state-metrics-94d55666c   1         1         1       17h

NAME                                     READY   AGE
statefulset.apps/prometheus-prometheus   1/1     20s

To quickly test the Prometheus installation, let’s use port-forwarding.

$ kubectl -n prometheus port-forward svc/prometheus-operated 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090

By pointing a web browser to http://localhost:9090 we can view the Prometheus console:

[Screenshot: Prometheus console at http://localhost:9090]
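
A quick way to confirm that the KSM target is scraped is an up query; with ServiceMonitor-discovered targets, the job label typically defaults to the service name, so the following should return 1:

up{job="trident-protect-kube-state-metrics"}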

Configure the monitoring tools to work together

Now that all the monitoring tools are installed, we need to configure them to work together. To integrate kube-state-metrics with Prometheus, we edit our Prometheus configuration file (prometheus.yaml), add the kube-state-metrics service information to it, and save it as prometheus-ksm.yaml:

$ cat ./prometheus-ksm.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  resources:
    requests:
      memory: 400Mi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: trident-protect
data:
  prometheus.yaml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kube-state-metrics'
        static_configs:
          - targets: ['kube-state-metrics.trident-protect.svc:8080']

$ diff ./prometheus.yaml ./prometheus-ksm.yaml
13a14,27
> ---
> apiVersion: v1
> kind: ConfigMap
> metadata:
>   name: prometheus-config
>   namespace: trident-protect
> data:
>   prometheus.yaml: |
>     global:
>       scrape_interval: 15s
>     scrape_configs:
>       - job_name: 'kube-state-metrics'
>         static_configs:
>           - targets: ['kube-state-metrics.trident-protect.svc:8080']

After applying the manifest, we confirm that the prometheus-config configuration map was created in the trident-protect namespace:

$ kubectl apply -f ./prometheus-ksm.yaml
prometheus.monitoring.coreos.com/prometheus unchanged
configmap/prometheus-config created

$ kubectl -n trident-protect get cm
NAME                         DATA   AGE
kube-root-ca.crt             1      46h
prometheus-config            1      59s
trident-protect-env-config   15     46h

Now we can query the backup, snapshot, and execution hook run information in Prometheus:

[Screenshot: querying Trident protect backup, snapshot, and execution hook run metrics in Prometheus]
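
The queries behind this view are plain metric selectors. Assuming KSM's default kube_customresource_ prefix, for example:

kube_customresource_snapshot_info
kube_customresource_backup_info{status="Completed"}
count(kube_customresource_ehr_info)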

This matches the two snapshots, one backup, and six execution hook runs we have in Trident protect:

$ tridentctl-protect get snapshot -A
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
| NAMESPACE |                    NAME                     |  APP  | RECLAIM POLICY |   STATE   | ERROR |  AGE  |
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
| minio     | backup-3473b771-caa5-48d2-a9b6-41f4448a049d | minio | Delete         | Completed |       | 1d22h |
| minio     | minio-snap                                  | minio | Delete         | Completed |       | 1d22h |
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+

$ tridentctl-protect get backup -A
+-----------+--------------+-------+----------------+-----------+-------+-------+
| NAMESPACE |     NAME     |  APP  | RECLAIM POLICY |   STATE   | ERROR |  AGE  |
+-----------+--------------+-------+----------------+-----------+-------+-------+
| minio     | minio-backup | minio | Retain         | Completed |       | 1d22h |
+-----------+--------------+-------+----------------+-----------+-------+-------+

$ kubectl get ehr -A
NAMESPACE   NAME                                                 STATE       STAGE   ACTION     ERROR   APP     AGE
minio       post-backup-3473b771-caa5-48d2-a9b6-41f4448a049d     Completed   Post    Backup             minio   46h
minio       post-snapshot-7e7934a4-b51a-4bc4-a981-28a8ba137ff6   Completed   Post    Snapshot           minio   46h
minio       post-snapshot-c569472c-ae13-4d30-bffd-98acef304abc   Completed   Post    Snapshot           minio   46h
minio       pre-backup-3473b771-caa5-48d2-a9b6-41f4448a049d      Completed   Pre     Backup             minio   46h
minio       pre-snapshot-7e7934a4-b51a-4bc4-a981-28a8ba137ff6    Completed   Pre     Snapshot           minio   46h
minio       pre-snapshot-c569472c-ae13-4d30-bffd-98acef304abc    Completed   Pre     Snapshot           minio   46h

Let's create a second backup:

$ tridentctl-protect create backup minio-bkp-2 --app minio --appvault demo --reclaim-policy Delete -n minio
Backup "minio-bkp-2" created.

Prometheus quickly catches the backup in the Running state, and then the Completed state once the backup finishes.

[Screenshots: Prometheus showing the backup in the Running and then the Completed state]

Add additional metrics and information

Now we want to add metrics for additional custom resources to Prometheus and see any error states of the monitored custom resources reflected there.

AppVault metrics and error details

To include metrics about the AppVault CR and its error details, we add the following entries to the KSM configuration file:

      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "AppVault"
          version: "v1"
        labelsFromPath:
          appvault_uid: [metadata, uid]
          appvault_name: [metadata, name]
        metricsFromPath:
          state: [status, state]
          error: [status, error]
          message: [status, message]
        metrics:
        - name: appvault_info
          help: "Exposes details about the AppVault state"
          each:
            type: Info
            info:
              labelsFromPath:
                state: [status, state]
                error: [status, error]
                message: [status, message]

The complete configuration file to collect metrics and error details from the Snapshot, Backup, ExecHooksRun, and AppVault CRs then looks like this:

$ cat ./metrics-config-backup-snapshot-hooks-appvault.yaml
extraArgs:
# collect only our metrics, not the default ones (deployments etc.)
- --custom-resource-state-only=true

customResourceState:
  enabled: true
  config:
    kind: CustomResourceStateMetrics
    spec:
      resources:
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Snapshot"
          version: "v1"
        labelsFromPath:
          snapshot_uid: [metadata, uid]
          snapshot_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: snapshot_info
          help: "Exposes details about the Snapshot state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                status: [status, state]
                error: [status, error]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Backup"
          version: "v1"
        labelsFromPath:
          backup_uid: [metadata, uid]
          backup_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: backup_info
          help: "Exposes details about the Backup state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                status: [status, state]
                error: [status, error]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Exechooksruns"
          version: "v1"
        labelsFromPath:
          ehr_uid: [metadata, uid]
          ehr_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: ehr_info
          help: "Exposes details about the Exec Hook state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                stage: ["spec", stage]
                action: ["spec", action]
                status: [status, state]
                error: [status, error]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "AppVault"
          version: "v1"
        labelsFromPath:
          appvault_uid: [metadata, uid]
          appvault_name: [metadata, name]
        metricsFromPath:
          state: [status, state]
          error: [status, error]
          message: [status, message]
        metrics:
        - name: appvault_info
          help: "Exposes details about the AppVault state"
          each:
            type: Info
            info:
              labelsFromPath:
                state: [status, state]
                error: [status, error]
                message: [status, message]
rbac:
  extraRules:
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["snapshots"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["backups"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["exechooksruns"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["appvaults"]
    verbs: ["list", "watch"]

# collect metrics from ALL namespaces
namespaces: ""

# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
  monitor:
    enabled: true
    additionalLabels:
      release: prometheus

We update the KSM configuration:

$ helm upgrade trident-protect prometheus-community/kube-state-metrics -f ./metrics-config-backup-snapshot-hooks-appvault.yaml -n prometheus
Release "trident-protect" has been upgraded. Happy Helming!
NAME: trident-protect
LAST DEPLOYED: Wed Aug 20 16:47:06 2025
NAMESPACE: prometheus
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
The exposed metrics can be found here:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics

The metrics are exported on the HTTP endpoint /metrics on the listening port.
In your case, trident-protect-kube-state-metrics.prometheus.svc.cluster.local:8080/metrics

They are served either as plaintext or protobuf depending on the Accept header.
They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
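
Because the upgrade changes the custom-resource-state configuration, the KSM pod is rolled; we can wait for the rollout to finish before checking Prometheus:

$ kubectl -n prometheus rollout status deploy/trident-protect-kube-state-metrics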

Now the information about the appVault CR is available in Prometheus.

[Screenshot: AppVault information in Prometheus]

Test AppVault failure

To test Prometheus's monitoring and error recognition, we provoke a failure of our AppVault CR: to simulate losing access to the object storage bucket behind it, we delete the secret with the access credentials from the trident-protect namespace.

$ kubectl -n trident-protect delete secret puneptunetest
secret "puneptunetest" deleted

After a few seconds, the AppVault CR goes into the Error state.

$ tridentctl-protect get appvault
+------+----------+-------+--------------------------------+---------+-----+
| NAME | PROVIDER | STATE |             ERROR              | MESSAGE | AGE |
+------+----------+-------+--------------------------------+---------+-----+
| demo | Azure    | Error | failed to resolve value for    |         | 2d  |
|      |          |       | accountKey: unable to ...      |         |     |
+------+----------+-------+--------------------------------+---------+-----+

And the error of the appVault CR is also reflected in Prometheus:

[Screenshot: AppVault error state reflected in Prometheus]
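
Instead of watching the UI for such errors, we could also alert on them. Here's a minimal PrometheusRule sketch, assuming the kube_customresource_ prefix from our KSM configuration, that Available is the healthy AppVault state, and that your Prometheus CR selects PrometheusRule objects (for example via ruleSelector: {}):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trident-protect-appvault-alerts
  namespace: prometheus
spec:
  groups:
  - name: trident-protect
    rules:
    - alert: AppVaultNotAvailable
      # fires if an AppVault reports any state other than Available for 5 minutes
      expr: kube_customresource_appvault_info{state!="Available"} == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "AppVault {{ $labels.appvault_name }} is in state {{ $labels.state }}"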

AppMirrorRelationship metrics

With Trident protect, you can use the asynchronous replication capabilities of NetApp SnapMirror technology to replicate data and application changes from one storage back end to another, on the same cluster or between different clusters. AppMirrorRelationship (AMR) CRs control the replication relationship of an application protected by NetApp SnapMirror with Trident protect, so monitoring their state with Prometheus is essential.

This example config includes Snapshot, Backup, ExecHooksRun, AppVault, and AMR metrics:

$ cat ./metrics-config-backup-snapshot-hooks-appvault-amr.yaml
extraArgs:
# collect only our metrics, not the default ones (deployments etc.)
- --custom-resource-state-only=true

customResourceState:
  enabled: true
  config:
    kind: CustomResourceStateMetrics
    spec:
      resources:
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Snapshot"
          version: "v1"
        labelsFromPath:
          snapshot_uid: [metadata, uid]
          snapshot_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: snapshot_info
          help: "Exposes details about the Snapshot state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                status: [status, state]
                error: [status, error]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Backup"
          version: "v1"
        labelsFromPath:
          backup_uid: [metadata, uid]
          backup_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: backup_info
          help: "Exposes details about the Backup state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                status: [status, state]
                error: [status, error]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "Exechooksruns"
          version: "v1"
        labelsFromPath:
          ehr_uid: [metadata, uid]
          ehr_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: ehr_info
          help: "Exposes details about the Exec Hook state"
          each:
            type: Info
            info:
              labelsFromPath:
                appVaultReference: ["spec", "appVaultRef"]
                appReference: ["spec", "applicationRef"]
                stage: ["spec", stage]
                action: ["spec", action]
                status: [status, state]
                error: [status, error]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "AppVault"
          version: "v1"
        labelsFromPath:
          appvault_uid: [metadata, uid]
          appvault_name: [metadata, name]
        metricsFromPath:
          state: [status, state]
          error: [status, error]
          message: [status, message]
        metrics:
        - name: appvault_info
          help: "Exposes details about the AppVault state"
          each:
            type: Info
            info:
              labelsFromPath:
                state: [status, state]
                error: [status, error]
                message: [status, message]
      - groupVersionKind:
          group: protect.trident.netapp.io
          kind: "AppMirrorRelationship"
          version: "v1"
        labelsFromPath:
          amr_uid: [metadata, uid]
          amr_name: [metadata, name]
          creation_time: [metadata, creationTimestamp]
        metrics:
        - name: app_mirror_relationship_info
          help: "Exposes details about the AppMirrorRelationship state"
          each:
            type: Info
            info:
              labelsFromPath:
                desiredState: ["spec", "desiredState"]
                destinationAppVaultRef: ["spec", "destinationAppVaultRef"]
                sourceAppVaultRef: ["spec", "sourceAppVaultRef"]
                sourceApplicationName: ["spec", "sourceApplicationName"]
                sourceApplicationUID: ["spec", "sourceApplicationUID"]
                state: ["status", "state"]
                error: ["status", "error"]
                lastTransferStartTimestamp: ["status", "lastTransfer", "startTimestamp"]
                lastTransferCompletionTimestamp: ["status", "lastTransfer", "completionTimestamp"]
                lastTransferredSnapshotName: ["status", "lastTransferredSnapshot", "name"]
                lastTransferredSnapshotCompletionTimestamp: ["status", "lastTransferredSnapshot", "completionTimestamp"]
                destinationApplicationRef: ["status", "destinationApplicationRef"]
                destinationNamespaces: ["status", "destinationNamespaces"]
                promotedSnapshot: ["spec", "promotedSnapshot"]
                recurrenceRule: ["spec", "recurrenceRule"]
                storageClassName: ["spec", "storageClassName"]
                namespaceMapping: ["spec", "namespaceMapping"]
                conditions: ["status", "conditions"]
rbac:
  extraRules:
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["snapshots"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["backups"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["exechooksruns"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["appvaults"]
    verbs: ["list", "watch"]
  - apiGroups: ["protect.trident.netapp.io"]
    resources: ["appmirrorrelationships"]
    verbs: ["list", "watch"]

# collect metrics from ALL namespaces
namespaces: ""

# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
  monitor:
    enabled: true
    additionalLabels:
      release: prometheus

Trident metrics

The metrics provided by Trident enable you to do the following:

  • Keep tabs on Trident's health and configuration. You can examine how successful operations are and whether Trident can communicate with its back ends as expected.
  • Examine back-end usage information: how many volumes are provisioned on a back end, the amount of space consumed, and so on.
  • Maintain a mapping of the number of volumes provisioned on the available back ends.
  • Track performance. You can look at how long it takes Trident to communicate with back ends and perform operations.

Trident's metrics are exposed on target port 8001 at the /metrics endpoint and are enabled by default when Trident is installed.
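
To eyeball the raw metrics before wiring them into Prometheus, we can port-forward the Trident controller service (assuming the default trident namespace and the trident-csi service exposing the metrics port):

$ kubectl -n trident port-forward svc/trident-csi 8001:8001
$ curl -s localhost:8001/metrics | grep ^trident_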

Create a Prometheus ServiceMonitor for Trident metrics

Prometheus was already set up in the previous sections, so to consume the Trident metrics we create another Prometheus ServiceMonitor that watches the trident-csi service and listens on the metrics port. A sample ServiceMonitor configuration looks like this:

$ cat ./prometheus-trident-sm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: trident-sm
  namespace: prometheus
  labels:
    release: prom-operator
spec:
  jobLabel: trident
  selector:
    matchLabels:
      app: controller.csi.trident.netapp.io
  namespaceSelector:
    matchNames:
      - trident
  endpoints:
    - port: metrics
      interval: 15s

Let’s deploy the new ServiceMonitor in the prometheus namespace.

$ kubectl apply -f ./prometheus-trident-sm.yaml
servicemonitor.monitoring.coreos.com/trident-sm created

We can see that the new ServiceMonitor trident-sm now exists in the prometheus namespace:

$ kubectl -n prometheus get all,ServiceMonitor,cm
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/prometheus-operator-5d697c648f-22lrz                 1/1     Running   0          6h1m
pod/prometheus-prometheus-0                              2/2     Running   0          5h55m
pod/trident-protect-kube-state-metrics-99476b548-cv9ff   1/1     Running   0          28m

NAME                                         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/prometheus-operated                  ClusterIP   None           <none>        9090/TCP   5h55m
service/prometheus-operator                  ClusterIP   None           <none>        8080/TCP   6h1m
service/trident-protect-kube-state-metrics   ClusterIP   172.16.88.31   <none>        8080/TCP   23h

NAME                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-operator                  1/1     1            1           6h1m
deployment.apps/trident-protect-kube-state-metrics   1/1     1            1           23h

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-operator-5d697c648f                 1         1         1       6h1m
replicaset.apps/trident-protect-kube-state-metrics-94d55666c   0         0         0       23h
replicaset.apps/trident-protect-kube-state-metrics-99476b548   1         1         1       28m

NAME                                     READY   AGE
statefulset.apps/prometheus-prometheus   1/1     5h55m

NAME                                                                      AGE
servicemonitor.monitoring.coreos.com/trident-protect-kube-state-metrics   23h
servicemonitor.monitoring.coreos.com/trident-sm                           32s

NAME                                                                      DATA   AGE
configmap/kube-root-ca.crt                                                1      24h
configmap/prometheus-prometheus-rulefiles-0                               0      5h55m
configmap/trident-protect-kube-state-metrics-customresourcestate-config   1      23h

By checking the available targets in the Prometheus UI (http://localhost:9090/targets), we confirm that the Trident metrics are now available in Prometheus.

[Screenshot: Trident metrics target in the Prometheus UI]

Query Trident metrics

We can now query the available Trident metrics in Prometheus.

[Screenshot: available Trident metrics in Prometheus]

For example, we can query the number of Trident snapshots and volumes and the bytes allocated by Trident volumes in the Prometheus UI.

[Screenshot: Trident snapshot, volume, and allocated-bytes queries in the Prometheus UI]
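
The underlying queries are simple selectors over the Trident metric names, for example (verify the exact names exposed by your Trident version on the /metrics endpoint):

trident_snapshot_count
trident_volume_count
sum(trident_volume_allocated_bytes)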

Grafana dashboards

Now that our monitoring system is functional, it's time to give you an idea of how to visualize the monitoring results. Let's investigate Grafana dashboards!

Install Grafana

We install Grafana using the Grafana Helm charts, first adding the Grafana Helm repository:

$ helm repo add grafana https://grafana.github.io/helm-charts 

Then we can install Grafana into the namespace grafana, which we create first.

$ kubectl create ns grafana
namespace/grafana created

$ helm install my-grafana grafana/grafana --namespace grafana
NAME: my-grafana
LAST DEPLOYED: Thu Aug 21 14:28:14 2025
NAMESPACE: grafana
STATUS: deployed
REVISION: 1
NOTES:
1. Get your 'admin' user password by running:

   kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo


2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:

   my-grafana.grafana.svc.cluster.local

   Get the Grafana URL to visit by running these commands in the same shell:
     export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
     kubectl --namespace grafana port-forward $POD_NAME 3000

3. Login with the password from step 1 and the username: admin
#################################################################################
######   WARNING: Persistence is disabled!!! You will lose your data when   #####
######            the Grafana pod is terminated.                            #####
#################################################################################

$ helm list -n grafana
NAME      	NAMESPACE	REVISION	UPDATED                              	STATUS  	CHART        	APP VERSION
my-grafana	grafana  	1       	2025-08-21 14:28:14.772879 +0200 CEST	deployed	grafana-9.3.2	12.1.0

Following the instructions above, we retrieve the Grafana admin password and set up port forwarding.

$ kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
<REDACTED>

$ kubectl -n grafana port-forward svc/my-grafana 3000:80
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000

Now we can test access by logging in to the Grafana UI at http://localhost:3000, which works fine.

[Screenshot: Grafana login page]

Enable persistent storage for Grafana

By default, Grafana uses only ephemeral storage, storing all data in the container's file system, so the data is lost when the container stops. We follow the steps in the Grafana documentation to enable persistent storage for Grafana.

We download the values file and edit the values under the persistence section, changing the enabled flag from false to true.

$ diff Grafana/values.yaml Grafana/values-persistence.yaml
418c418
<   enabled: false
---
>   enabled: true

Then we run helm upgrade to make the changes take effect.

$ helm upgrade my-grafana grafana/grafana -f Grafana/values-persistence.yaml -n grafana
Release "my-grafana" has been upgraded. Happy Helming!
NAME: my-grafana
LAST DEPLOYED: Thu Aug 21 14:37:24 2025
NAMESPACE: grafana
STATUS: deployed
REVISION: 2
NOTES:
1. Get your 'admin' user password by running:

   kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo


2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:

   my-grafana.grafana.svc.cluster.local

   Get the Grafana URL to visit by running these commands in the same shell:
     export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
     kubectl --namespace grafana port-forward $POD_NAME 3000

We confirm that a PVC backed by Azure NetApp Files was created in the grafana namespace.

$ kubectl get all,pvc -n grafana
NAME                              READY   STATUS    RESTARTS   AGE
pod/my-grafana-6d5b96b7d7-fqq7d   1/1     Running   0          5m18s

NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/my-grafana   ClusterIP   172.16.9.115   <none>        80/TCP    14m

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/my-grafana   1/1     1            1           14m

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/my-grafana-6ccff48567   0         0         0       14m
replicaset.apps/my-grafana-6d5b96b7d7   1         1         1       5m18s

NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/my-grafana   Bound    pvc-5a1844c6-3a9f-4f1d-9d94-caa1666ded3e   50Gi       RWO            azure-netapp-files-standard   <unset>                 5m19s

After restarting the port forwarding, we can log in to Grafana again and continue working with persistent storage enabled.

Add a data source

Next, we need to add our Prometheus instance as a data source in Grafana. To do this, we need the service name and port of Prometheus. Typically, when using the Prometheus operator, the service name is something like prometheus-operated, so we check on our cluster:

$ kubectl -n prometheus get svc | grep operated
prometheus-operated                  ClusterIP   None           <none>        9090/TCP   27h

Now we can add the Prometheus instance as a data source in Grafana. Use the Kubernetes DNS name to reference the Prometheus service; it should look something like this: http://prometheus-operated.prometheus.svc.cluster.local:9090
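
Alternatively, the data source can be provisioned declaratively instead of through the UI; the Grafana Helm chart accepts a standard provisioning block in its values file, roughly like this sketch (merge it into your values and apply with helm upgrade):

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus-operated.prometheus.svc.cluster.local:9090
      isDefault: true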

[Screenshots: adding Prometheus as a data source in Grafana]

In Grafana, we navigate to Menu -> Drilldown, which lets us easily explore the Trident and Trident protect (KSM) metrics.

[Screenshots: browsing Trident and Trident protect metrics in Grafana Drilldown]

Add a dashboard for the Trident protect metrics

Covering the creation of Grafana dashboards is beyond the scope of this blog post. As an example and inspiration, we use the dashboard for visualizing snapshot and backup metrics from Yves Weisser's highly recommended collection of Trident lab scenarios on GitHub.

After downloading the dashboard JSON file from GitHub, we change the "Failed" option values to "Error" so that failed snapshots and backups are displayed in red in the dashboard.

$ diff Grafana/dashboard.json Grafana/dashboard_v2.json
365c365
<                         "Failed": {
---
>                         "Error": {
562c562
<                         "Failed": {
---
>                         "Error": {
709c709,710
<       "25.02"
---
>       "25.02",
>       "25.06"
724c725
< }
\ No newline at end of file
---
> }

Now we can import the dashboard JSON file into Grafana.

[Screenshot: importing the dashboard JSON file into Grafana]

After importing the dashboard JSON file, the "Trident protect Global View" dashboard is available in Grafana. Here's an example of how it visualizes running and failed Trident protect backups.

[Screenshots: running and failed Trident protect backups in the Trident protect Global View dashboard]

Conclusion and call to action

By following this blog, you have successfully set up monitoring and visualization for NetApp Trident and Trident protect using Prometheus and Grafana. This setup enables you to keep tabs on the health and performance of your Trident and Trident protect resources, ensuring your Kubernetes applications are well-protected and efficiently managed.

 

Happy monitoring!
