Tech ONTAP Blogs
NetApp® Trident™ protect provides advanced application data management capabilities that enhance the functionality and availability of stateful Kubernetes applications supported by NetApp ONTAP storage systems and the NetApp Trident Container Storage Interface (CSI) storage provisioner. It is compatible with a wide range of fully managed and self-managed Kubernetes offerings (see the supported Kubernetes distributions and storage back ends), making it an optimal solution for protecting your Kubernetes services across various platforms and regions.
In this blog post, I will demonstrate how to scrape and visualize the metrics provided by Trident and Trident protect using the popular open-source monitoring and visualization frameworks Prometheus and Grafana.
To follow along with this guide, you'll need a Kubernetes cluster with NetApp Trident and Trident protect installed, kubectl access to that cluster, the tridentctl-protect CLI, and Helm.
First, we quickly go through the setup of the test environment that we used throughout the blog.
We will use a simple MinIO application with a persistent volume on Azure NetApp Files (ANF) as our sample application for the monitoring tests. The MinIO application is deployed on an Azure Kubernetes Service (AKS) cluster with NetApp Trident 25.06.0 installed and configured:
$ kubectl get all,pvc -n minio
NAME READY STATUS RESTARTS AGE
pod/minio-67dffb8bbd-5rfpm 1/1 Running 0 14m
pod/minio-console-677bd9ddcb-27497 1/1 Running 0 14m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/minio ClusterIP 172.16.61.243 <none> 9000/TCP 14m
service/minio-console ClusterIP 172.16.95.239 <none> 9090/TCP 14m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/minio 1/1 1 1 14m
deployment.apps/minio-console 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/minio-67dffb8bbd 1 1 1 14m
replicaset.apps/minio-console-677bd9ddcb 1 1 1 14m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
persistentvolumeclaim/minio Bound pvc-ec50d895-4048-4a51-a651-5439b2a5ba2a 50Gi RWO azure-netapp-files-standard <unset> 14m
Create a Trident protect application minio based on the minio namespace with the Trident protect CLI:
$ tridentctl-protect create application minio --namespaces minio -n minio
Application "minio" created.
Create a snapshot minio-snap and a backup minio-bkp:
$ tridentctl-protect create snapshot minio-snap --app minio --appvault demo -n minio
Snapshot "minio-snap" created.
$ tridentctl-protect create backup minio-bkp --app minio --appvault demo -n minio
Backup "minio-bkp" created.
Trident protect leverages kube-state-metrics (KSM) to provide information about the health status of its resources. Kube-state-metrics is an open-source add-on for Kubernetes that listens to the Kubernetes API server and generates metrics about the state of various Kubernetes objects.
First, we install the Custom Resource Definition (CRD) for the Prometheus ServiceMonitor using Helm; the kube-state-metrics chart we deploy below relies on it. Add the prometheus-community Helm repository:
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
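If the Prometheus operator CRDs are not yet present on your cluster, the prometheus-community repository also offers a chart that installs just those CRDs (including ServiceMonitor). A minimal sketch:
$ helm install prometheus-operator-crds prometheus-community/prometheus-operator-crds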
Now we install and configure kube-state-metrics to generate metrics from the Kubernetes API. Used with Trident protect, it exposes useful information about the state of the Trident protect custom resources (CRs) in our environment.
Let's create a configuration file for the KSM Helm chart to monitor the Trident protect CRs:
Let’s take a closer look at the snapshot CR minio-snap that we created earlier.
$ kubectl -n minio get snapshot minio-snap -o yaml
apiVersion: protect.trident.netapp.io/v1
kind: Snapshot
metadata:
annotations:
protect.trident.netapp.io/correlationid: 42111244-fdb7-41f1-af39-7b61fdb0c7e1
creationTimestamp: "2025-08-18T15:25:40Z"
...
name: minio-snap
namespace: minio
ownerReferences:
- apiVersion: protect.trident.netapp.io/v1
kind: Application
name: minio
uid: efc8cdd4-8b20-48e0-8944-eeee8aba98f9
resourceVersion: "14328"
uid: c569472c-ae13-4d30-bffd-98acef304abc
spec:
appVaultRef: demo
applicationRef: minio
cleanupSnapshot: false
completionTimeout: 0s
reclaimPolicy: Delete
volumeSnapshotsCreatedTimeout: 0s
volumeSnapshotsReadyToUseTimeout: 0s
status:
appArchivePath: minio_efc8cdd4-8b20-48e0-8944-eeee8aba98f9/snapshots/20250818152540_minio-snap_c569472c-ae13-4d30-bffd-98acef304abc
appVaultRef: demo
completionTimestamp: "2025-08-18T15:25:58Z"
...
postSnapshotExecHooksRunResults: []
preSnapshotExecHooksRunResults: []
state: Completed
volumeSnapshots:
- name: snapshot-c569472c-ae13-4d30-bffd-98acef304abc-pvc-ec50d895-4048-4a51-a651-5439b2a5ba2a
namespace: minio
From its metadata section, we want to expose the name, UID, and creationTimestamp of the snapshot to Prometheus, and from the spec and status fields the metrics appVaultRef, applicationRef, and state. The corresponding KSM configuration entry will look like this.
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
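With this entry in place, KSM exposes one info metric per Snapshot CR. Assuming KSM's default kube_customresource_ metric name prefix, the sample for our minio-snap would look roughly like this:
kube_customresource_snapshot_info{snapshot_name="minio-snap",snapshot_uid="c569472c-ae13-4d30-bffd-98acef304abc",creation_time="2025-08-18T15:25:40Z",appVaultReference="demo",appReference="minio",status="Completed"} 1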
From the backup CR, which has the same structure as the snapshot CR, we can collect the same information using this KSM configuration entry.
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
To access those CR fields, KSM needs to have the corresponding RBAC permissions to allow access to the snapshot and backup CRs in all namespaces (since the Trident protect CRs are created in the application namespace). So we add the following parameters to the KSM configuration file.
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots", "backups"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
Collecting the details for the ExecHooksRun CRs works in the same way as for snapshots and backups, so we don't show the details here. Putting everything together, our first KSM configuration file looks like this.
$ cat metrics-config-backup-snapshot-hooks.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
Now we can install KSM using Helm.
$ helm install trident-protect -f ./metrics-config-backup-snapshot-hooks.yaml prometheus-community/kube-state-metrics --version 5.21.0 -n prometheus
NAME: trident-protect
LAST DEPLOYED: Tue Aug 19 17:54:22 2025
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
The exposed metrics can be found here:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
The metrics are exported on the HTTP endpoint /metrics on the listening port.
In your case, trident-protect-kube-state-metrics.prometheus.svc.cluster.local:8080/metrics
They are served either as plaintext or protobuf depending on the Accept header.
They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
We check that the KSM ServiceMonitor was correctly deployed in the prometheus namespace.
$ kubectl -n prometheus get smon -l app.kubernetes.io/instance=trident-protect
NAME AGE
trident-protect-kube-state-metrics 90s
$ kubectl get all -n prometheus
NAME READY STATUS RESTARTS AGE
pod/trident-protect-kube-state-metrics-94d55666c-69j6n 1/1 Running 0 105s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 105s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 105s
NAME DESIRED CURRENT READY AGE
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 1 1 1 105s
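Before wiring Prometheus to the endpoint, we can spot-check the generated metrics directly (a quick test, assuming KSM's default kube_customresource_ metric name prefix):
$ kubectl -n prometheus port-forward svc/trident-protect-kube-state-metrics 8080:8080 &
$ curl -s http://localhost:8080/metrics | grep kube_customresource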
Let's now install Prometheus on our cluster. Before doing that, we must make sure that the Prometheus server can access the Kubernetes API.
The Prometheus server needs access to the Kubernetes API to scrape its targets, so a ServiceAccount bound to a ClusterRole with the corresponding permissions is required. By applying the YAML file below, we create the ServiceAccount prometheus and a ClusterRole prometheus with the necessary privileges, and bind the ClusterRole to the ServiceAccount.
$ cat ./rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources:
- configmaps
verbs: ["get"]
- apiGroups:
- networking.k8s.io
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: prometheus
$ kubectl apply -f ./rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
Now we’re ready to install Prometheus.
After creating the Prometheus ServiceAccount and giving it access to the Kubernetes API, we can deploy the Prometheus instance.
We'll use the Prometheus operator for the installation. Following the instructions to install the operator in the prometheus namespace deploys it on our K8s cluster within a few minutes.
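As a sketch of what those instructions boil down to, the operator can be created from the project's bundle manifest; note that the upstream bundle targets the default namespace, so adjust the namespace fields in it if you want the operator to live in prometheus:
$ kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml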
This manifest sets the serviceMonitorNamespaceSelector, serviceMonitorSelector, and podMonitorSelector fields to specify which ServiceMonitors and PodMonitors to include. In this example, the {} value is used to match all of them across all namespaces.
$ cat ./prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: prometheus
spec:
serviceAccountName: prometheus
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
podMonitorSelector: {}
resources:
requests:
memory: 400Mi
We apply the manifest and check that the Prometheus instance eventually reaches the Running state and that a prometheus-operated Service was created:
$ kubectl apply -f ./prometheus.yaml
prometheus.monitoring.coreos.com/prometheus created
$ kubectl get prometheus -n prometheus
NAME VERSION DESIRED READY RECONCILED AVAILABLE AGE
prometheus 1 True True 42s
$ kubectl get services -n prometheus
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-operated ClusterIP None <none> 9090/TCP 103s
prometheus-operator ClusterIP None <none> 8080/TCP 7m44s
trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 17h
$ kubectl get all -n prometheus
NAME READY STATUS RESTARTS AGE
pod/prometheus-operator-5d697c648f-22lrz 1/1 Running 0 6m21s
pod/prometheus-prometheus-0 2/2 Running 0 20s
pod/trident-protect-kube-state-metrics-94d55666c-69j6n 1/1 Running 0 17h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/prometheus-operated ClusterIP None <none> 9090/TCP 20s
service/prometheus-operator ClusterIP None <none> 8080/TCP 6m21s
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 17h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-operator 1/1 1 1 6m21s
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 17h
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-operator-5d697c648f 1 1 1 6m21s
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 1 1 1 17h
NAME READY AGE
statefulset.apps/prometheus-prometheus 1/1 20s
To quickly test the Prometheus installation, let’s use port-forwarding.
$ kubectl -n prometheus port-forward svc/prometheus-operated 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
By pointing a web browser to http://localhost:9090 we can view the Prometheus console:
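For a scripted check instead of the browser, Prometheus also exposes readiness and health endpoints:
$ curl -s http://localhost:9090/-/ready
$ curl -s http://localhost:9090/-/healthy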
Now that all the monitoring tools are installed, we need to configure them to work together. To integrate kube-state-metrics with Prometheus, we edit our Prometheus configuration file (prometheus.yaml), add the kube-state-metrics service information to it, and save it as prometheus-ksm.yaml.
$ cat ./prometheus-ksm.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: prometheus
spec:
serviceAccountName: prometheus
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
podMonitorSelector: {}
resources:
requests:
memory: 400Mi
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: trident-protect
data:
prometheus.yaml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'kube-state-metrics'
static_configs:
- targets: ['trident-protect-kube-state-metrics.prometheus.svc:8080']
$ diff ./prometheus.yaml ./prometheus-ksm.yaml
13a14,27
> ---
> apiVersion: v1
> kind: ConfigMap
> metadata:
> name: prometheus-config
> namespace: trident-protect
> data:
> prometheus.yaml: |
> global:
> scrape_interval: 15s
> scrape_configs:
> - job_name: 'kube-state-metrics'
> static_configs:
> - targets: ['trident-protect-kube-state-metrics.prometheus.svc:8080']
After applying the manifest, we confirm that the prometheus-config configuration map was created in the trident-protect namespace:
$ kubectl apply -f ./prometheus-ksm.yaml
prometheus.monitoring.coreos.com/prometheus unchanged
configmap/prometheus-config created
$ kubectl -n trident-protect get cm
NAME DATA AGE
kube-root-ca.crt 1 46h
prometheus-config 1 59s
trident-protect-env-config 15 46h
Now we can query the backup, snapshot, and execution hook run information in Prometheus:
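For example, the following queries return the info metrics we defined (assuming KSM's default kube_customresource_ metric name prefix):
kube_customresource_snapshot_info
kube_customresource_backup_info
kube_customresource_ehr_info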
This matches the two snapshots, one backup, and six execution hook runs we have in Trident protect:
$ tridentctl-protect get snapshot -A
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
| NAMESPACE | NAME | APP | RECLAIM POLICY | STATE | ERROR | AGE |
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
| minio | backup-3473b771-caa5-48d2-a9b6-41f4448a049d | minio | Delete | Completed | | 1d22h |
| minio | minio-snap | minio | Delete | Completed | | 1d22h |
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
$ tridentctl-protect get backup -A
+-----------+--------------+-------+----------------+-----------+-------+-------+
| NAMESPACE | NAME | APP | RECLAIM POLICY | STATE | ERROR | AGE |
+-----------+--------------+-------+----------------+-----------+-------+-------+
| minio | minio-backup | minio | Retain | Completed | | 1d22h |
+-----------+--------------+-------+----------------+-----------+-------+-------+
$ kubectl get ehr -A
NAMESPACE NAME STATE STAGE ACTION ERROR APP AGE
minio post-backup-3473b771-caa5-48d2-a9b6-41f4448a049d Completed Post Backup minio 46h
minio post-snapshot-7e7934a4-b51a-4bc4-a981-28a8ba137ff6 Completed Post Snapshot minio 46h
minio post-snapshot-c569472c-ae13-4d30-bffd-98acef304abc Completed Post Snapshot minio 46h
minio pre-backup-3473b771-caa5-48d2-a9b6-41f4448a049d Completed Pre Backup minio 46h
minio pre-snapshot-7e7934a4-b51a-4bc4-a981-28a8ba137ff6 Completed Pre Snapshot minio 46h
minio pre-snapshot-c569472c-ae13-4d30-bffd-98acef304abc Completed Pre Snapshot minio 46h
Let's create a second backup:
$ tridentctl-protect create backup minio-bkp-2 --app minio --appvault demo --reclaim-policy Delete -n minio
Backup "minio-bkp-2" created.
Prometheus quickly catches the backup in the Running state, and then in the Completed state once the backup finishes.
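A handy variation is to aggregate by the status label, for instance to count backups per state (same metric-name assumption as above):
count by (status) (kube_customresource_backup_info)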
Now, we want to add metrics about additional custom resources to Prometheus and see error states (if any) of the monitored custom resources reflected in Prometheus.
To include metrics about the AppVault CR and its error details, we add the entries below to the KSM configuration file:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
The complete configuration file to collect metrics and error details from the Snapshot, Backup, ExecHooksRun, and AppVault CRs then looks like this:
$ cat ./metrics-config-backup-snapshot-hooks-appvault.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appvaults"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
We update the KSM configuration:
$ helm upgrade trident-protect prometheus-community/kube-state-metrics -f ./metrics-config-backup-snapshot-hooks-appvault.yaml -n prometheus
Release "trident-protect" has been upgraded. Happy Helming!
NAME: trident-protect
LAST DEPLOYED: Wed Aug 20 16:47:06 2025
NAMESPACE: prometheus
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
The exposed metrics can be found here:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
The metrics are exported on the HTTP endpoint /metrics on the listening port.
In your case, trident-protect-kube-state-metrics.prometheus.svc.cluster.local:8080/metrics
They are served either as plaintext or protobuf depending on the Accept header.
They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
Now the information about the AppVault CR is available in Prometheus.
To test Prometheus' monitoring and error recognition, we provoke a failure of our AppVault CR: to simulate losing access to the object storage bucket behind it, we delete the secret with the access credentials from the trident-protect namespace.
$ kubectl -n trident-protect delete secret puneptunetest
secret "puneptunetest" deleted
After a few seconds, the AppVault CR goes into the Error state.
$ tridentctl-protect get appvault
+------+----------+-------+--------------------------------+---------+-----+
| NAME | PROVIDER | STATE | ERROR | MESSAGE | AGE |
+------+----------+-------+--------------------------------+---------+-----+
| demo | Azure | Error | failed to resolve value for | | 2d |
| | | | accountKey: unable to ... | | |
+------+----------+-------+--------------------------------+---------+-----+
And the error of the appVault CR is also reflected in Prometheus:
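This also opens the door to alerting. Below is a minimal PrometheusRule sketch, assuming the default kube_customresource_ metric prefix and the state label from our KSM config; note that our Prometheus CR would additionally need a matching ruleSelector for the operator to pick the rule up.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trident-protect-appvault-alerts
  namespace: prometheus
spec:
  groups:
  - name: trident-protect
    rules:
    - alert: AppVaultError
      # fires if any AppVault reports state="Error" for more than a minute
      expr: kube_customresource_appvault_info{state="Error"} == 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "AppVault {{ $labels.appvault_name }} is in Error state"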
With Trident protect, you can use the asynchronous replication capabilities of NetApp SnapMirror technology to replicate data and application changes from one storage backend to another, on the same cluster or between different clusters. AppMirrorRelationship (AMR) CRs control the replication relationship of an application protected by SnapMirror with Trident protect, so monitoring their state with Prometheus is essential.
This example configuration includes Snapshot, Backup, ExecHooksRun, AppVault, and AMR metrics:
$ cat ./metrics-config-backup-snapshot-hooks-appvault-amr.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppMirrorRelationship"
version: "v1"
labelsFromPath:
amr_uid: [metadata, uid]
amr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: app_mirror_relationship_info
help: "Exposes details about the AppMirrorRelationship state"
each:
type: Info
info:
labelsFromPath:
desiredState: ["spec", "desiredState"]
destinationAppVaultRef: ["spec", "destinationAppVaultRef"]
sourceAppVaultRef: ["spec", "sourceAppVaultRef"]
sourceApplicationName: ["spec", "sourceApplicationName"]
sourceApplicationUID: ["spec", "sourceApplicationUID"]
state: ["status", "state"]
error: ["status", "error"]
lastTransferStartTimestamp: ["status", "lastTransfer", "startTimestamp"]
lastTransferCompletionTimestamp: ["status", "lastTransfer", "completionTimestamp"]
lastTransferredSnapshotName: ["status", "lastTransferredSnapshot", "name"]
lastTransferredSnapshotCompletionTimestamp: ["status", "lastTransferredSnapshot", "completionTimestamp"]
destinationApplicationRef: ["status", "destinationApplicationRef"]
destinationNamespaces: ["status", "destinationNamespaces"]
promotedSnapshot: ["spec", "promotedSnapshot"]
recurrenceRule: ["spec", "recurrenceRule"]
storageClassName: ["spec", "storageClassName"]
namespaceMapping: ["spec", "namespaceMapping"]
conditions: ["status", "conditions"]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appvaults"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appmirrorrelationships"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
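With this in place, an unhealthy replication relationship can be spotted with a query such as this one (same metric-name assumption as before; adjust the state filter to the desiredState of your relationship):
kube_customresource_app_mirror_relationship_info{state!="Established"}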
The metrics provided by Trident enable you to monitor Trident's health and configuration, how well it provisions storage, and the capacity of the volumes and backends it manages.
By default, Trident's metrics are exposed on the target port 8001 at the /metrics endpoint. These metrics are enabled by default when Trident is installed.
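A quick way to verify this is to port-forward to the Trident controller and fetch the endpoint directly (a sketch; the controller deployment is named trident-controller in recent Trident releases):
$ kubectl -n trident port-forward deploy/trident-controller 8001:8001 &
$ curl -s http://localhost:8001/metrics | grep ^trident_ | head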
Prometheus was already set up in the previous sections, so to consume the Trident metrics, we create another Prometheus ServiceMonitor that watches the trident-csi service and listens on its metrics port. A sample ServiceMonitor configuration looks like this:
$ cat ./prometheus-trident-sm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: trident-sm
namespace: prometheus
labels:
release: prom-operator
spec:
jobLabel: trident
selector:
matchLabels:
app: controller.csi.trident.netapp.io
namespaceSelector:
matchNames:
- trident
endpoints:
- port: metrics
interval: 15s
Let’s deploy the new ServiceMonitor in the prometheus namespace.
$ kubectl apply -f ./prometheus-trident-sm.yaml
servicemonitor.monitoring.coreos.com/trident-sm created
We can see that the new ServiceMonitor trident-sm now exists in the prometheus namespace:
$ kubectl -n prometheus get all,ServiceMonitor,cm
NAME READY STATUS RESTARTS AGE
pod/prometheus-operator-5d697c648f-22lrz 1/1 Running 0 6h1m
pod/prometheus-prometheus-0 2/2 Running 0 5h55m
pod/trident-protect-kube-state-metrics-99476b548-cv9ff 1/1 Running 0 28m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/prometheus-operated ClusterIP None <none> 9090/TCP 5h55m
service/prometheus-operator ClusterIP None <none> 8080/TCP 6h1m
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 23h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-operator 1/1 1 1 6h1m
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 23h
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-operator-5d697c648f 1 1 1 6h1m
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 0 0 0 23h
replicaset.apps/trident-protect-kube-state-metrics-99476b548 1 1 1 28m
NAME READY AGE
statefulset.apps/prometheus-prometheus 1/1 5h55m
NAME AGE
servicemonitor.monitoring.coreos.com/trident-protect-kube-state-metrics 23h
servicemonitor.monitoring.coreos.com/trident-sm 32s
NAME DATA AGE
configmap/kube-root-ca.crt 1 24h
configmap/prometheus-prometheus-rulefiles-0 0 5h55m
configmap/trident-protect-kube-state-metrics-customresourcestate-config 1 23h
By checking for available targets in the Prometheus UI (http://localhost:9090/targets) we confirm that the Trident metrics are now available in Prometheus.
We can now query the available Trident metrics in Prometheus.
For example, we can query the number of Trident snapshots and volumes, and the number of bytes allocated by Trident volumes, in the Prometheus UI.
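Typical examples (metric names as exposed by recent Trident releases; check your /metrics output if they differ):
trident_volume_count
trident_snapshot_count
sum(trident_volume_allocated_bytes)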
Now that our monitoring system is functional, it's time to look at how to visualize the monitoring results. Let's investigate Grafana dashboards!
We install Grafana using the Grafana Helm chart, first adding the Grafana Helm repository:
$ helm repo add grafana https://grafana.github.io/helm-charts
Then we can install Grafana into the namespace grafana, which we create first.
$ kubectl create ns grafana
namespace/grafana created
$ helm install my-grafana grafana/grafana --namespace grafana
NAME: my-grafana
LAST DEPLOYED: Thu Aug 21 14:28:14 2025
NAMESPACE: grafana
STATUS: deployed
REVISION: 1
NOTES:
1. Get your 'admin' user password by running:
kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
my-grafana.grafana.svc.cluster.local
Get the Grafana URL to visit by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace grafana port-forward $POD_NAME 3000
3. Login with the password from step 1 and the username: admin
#################################################################################
###### WARNING: Persistence is disabled!!! You will lose your data when #####
###### the Grafana pod is terminated. #####
#################################################################################
$ helm list -n grafana
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
my-grafana grafana 1 2025-08-21 14:28:14.772879 +0200 CEST deployed grafana-9.3.2 12.1.0
Following the instructions above, we retrieve the Grafana admin password and set up port forwarding.
$ kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
<REDACTED>
$ kubectl -n grafana port-forward svc/my-grafana 3000:80
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000
Now we can test the access and login to the Grafana UI on http://localhost:3000, which works fine.
By default, Grafana uses only ephemeral storage, keeping all data in the container's file system, so the data is lost when the container stops. We follow the steps in the Grafana documentation to enable persistent storage for Grafana.
We download the values file and edit the persistence section, changing the enabled flag from false to true.
$ diff Grafana/values.yaml Grafana/values-persistence.yaml
418c418
< enabled: false
---
> enabled: true
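For reference, the relevant block of the edited values file then looks like this (a sketch; the remaining persistence settings keep the chart defaults, so the PVC is created with the cluster's default StorageClass):
persistence:
  type: pvc
  enabled: true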
Then we run helm upgrade to make the changes take effect.
$ helm upgrade my-grafana grafana/grafana -f Grafana/values-persistence.yaml -n grafana
Release "my-grafana" has been upgraded. Happy Helming!
NAME: my-grafana
LAST DEPLOYED: Thu Aug 21 14:37:24 2025
NAMESPACE: grafana
STATUS: deployed
REVISION: 2
NOTES:
1. Get your 'admin' user password by running:
kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
my-grafana.grafana.svc.cluster.local
Get the Grafana URL to visit by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace grafana port-forward $POD_NAME 3000
We confirm that a PVC backed by Azure NetApp Files was created in the grafana namespace.
$ kubectl get all,pvc -n grafana
NAME READY STATUS RESTARTS AGE
pod/my-grafana-6d5b96b7d7-fqq7d 1/1 Running 0 5m18s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/my-grafana ClusterIP 172.16.9.115 <none> 80/TCP 14m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/my-grafana 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/my-grafana-6ccff48567 0 0 0 14m
replicaset.apps/my-grafana-6d5b96b7d7 1 1 1 5m18s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
persistentvolumeclaim/my-grafana Bound pvc-5a1844c6-3a9f-4f1d-9d94-caa1666ded3e 50Gi RWO azure-netapp-files-standard <unset> 5m19s
After restarting the port forwarding, we can log in to Grafana again and continue working with persistent storage enabled.
Next, we need to add our Prometheus instance as a data source in Grafana. To do this, we need the service name and port of Prometheus. Typically, when using the Prometheus operator, the service name is prometheus-operated, so we check on our cluster.
$ kubectl -n prometheus get svc | grep operated
prometheus-operated ClusterIP None <none> 9090/TCP 27h
Now we can add the Prometheus instance as a data source in Grafana, referencing the Prometheus service by its Kubernetes DNS name: http://prometheus-operated.prometheus.svc.cluster.local:9090
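Alternatively, the data source can be provisioned declaratively through the Grafana Helm chart's datasources value instead of the UI, applied with another helm upgrade (a sketch):
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-operated.prometheus.svc.cluster.local:9090
      access: proxy
      isDefault: true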
In the Grafana dashboard, we navigate to Menu -> Drilldown, which allows us to easily see the Trident and KSM Trident protect metrics.
Covering the creation of Grafana dashboards goes beyond the scope of this blog post. As an example and inspiration, we use the dashboard example for visualization of snapshot and backup metrics from Yves Weisser’s highly recommended collection of Trident lab scenarios on GitHub.
After downloading the dashboard JSON file from GitHub, we change the "Failed" option values to "Error" so that failed snapshots and backups are displayed in red in the dashboard.
$ diff Grafana/dashboard.json Grafana/dashboard_v2.json
365c365
< "Failed": {
---
> "Error": {
562c562
< "Failed": {
---
> "Error": {
709c709,710
< "25.02"
---
> "25.02",
> "25.06"
724c725
< }
\ No newline at end of file
---
> }
Now we can import the dashboard JSON file into Grafana.
After importing it, the "Trident protect Global View" dashboard is available in Grafana. Here's an example of how it visualizes running and failed Trident protect backups.
By following this blog, you have successfully set up monitoring and visualization for NetApp Trident and Trident protect using Prometheus and Grafana. This setup enables you to keep tabs on the health and performance of your Trident and Trident protect resources, ensuring your Kubernetes applications are well-protected and efficiently managed.
Happy monitoring!