NetApp® Trident™ protect provides advanced application data management capabilities that enhance the functionality and availability of stateful Kubernetes applications backed by NetApp ONTAP storage systems and the NetApp Trident Container Storage Interface (CSI) storage provisioner. It is compatible with a wide range of fully managed and self-managed Kubernetes offerings (see the supported Kubernetes distributions and storage back ends), making it an optimal solution for protecting your Kubernetes services across various platforms and regions. In this blog post, I will demonstrate how to scrape and visualize the metrics provided by Trident and Trident protect using the popular open-source monitoring and visualization frameworks Prometheus and Grafana.
Prerequisites
To follow along with this guide, ensure you have the following:
A Kubernetes cluster with the latest versions of Trident and Trident protect installed, and their associated kubeconfig files
A NetApp ONTAP storage back end and Trident with configured storage back ends, storage classes, and volume snapshot classes
Configured object storage buckets for storing backup and metadata information, with bucket replication set up
A workstation with kubectl configured to use kubeconfig
The Trident protect CLI, tridentctl-protect, installed on your workstation
Admin user permission on the Kubernetes clusters
Prepare test environment
First, let's quickly go through the setup of the test environment used throughout this blog.
Sample application
We will use a simple MinIO application with a persistent volume on Azure NetApp Files (ANF) as our sample application for the monitoring tests. The MinIO application is deployed on an Azure Kubernetes Service (AKS) cluster with NetApp Trident 25.06.0 installed and configured:
$ kubectl get all,pvc -n minio
NAME READY STATUS RESTARTS AGE
pod/minio-67dffb8bbd-5rfpm 1/1 Running 0 14m
pod/minio-console-677bd9ddcb-27497 1/1 Running 0 14m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/minio ClusterIP 172.16.61.243 <none> 9000/TCP 14m
service/minio-console ClusterIP 172.16.95.239 <none> 9090/TCP 14m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/minio 1/1 1 1 14m
deployment.apps/minio-console 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/minio-67dffb8bbd 1 1 1 14m
replicaset.apps/minio-console-677bd9ddcb 1 1 1 14m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
persistentvolumeclaim/minio Bound pvc-ec50d895-4048-4a51-a651-5439b2a5ba2a 50Gi RWO azure-netapp-files-standard <unset> 14m
Create a Trident Protect Application
Create a Trident protect application minio for the minio namespace using the Trident protect CLI:
$ tridentctl-protect create application minio --namespaces minio -n minio
Application "minio" created.
Create a snapshot minio-snap and a backup minio-bkp:
$ tridentctl-protect create snapshot minio-snap --app minio --appvault demo -n minio
Snapshot "minio-snap" created.
$ tridentctl-protect create backup minio-bkp --app minio --appvault demo -n minio
Backup "minio-bkp" created.
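You can verify that both operations reach the Completed state with the same CLI (output omitted here; both resources should show Completed once they finish):

```shell
$ tridentctl-protect get snapshot -n minio
$ tridentctl-protect get backup -n minio
```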
Install kube-state-metrics
Trident protect leverages kube-state-metrics (KSM) to provide information about the health status of its resources. Kube-state-metrics is an open-source add-on for Kubernetes that listens to the Kubernetes API server and generates metrics about the state of various Kubernetes objects.
Install Prometheus ServiceMonitor CRD
First, we install the Custom Resource Definition (CRD) for the Prometheus ServiceMonitor using Helm. Add the prometheus-community Helm repository and update it:
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
Install and configure kube-state-metrics
Now, we install and configure kube-state-metrics to generate metrics from the Kubernetes API. Using it with Trident protect exposes useful information about the state of the Trident protect custom resources in our environment.
Let's create a configuration file for the KSM Helm chart to monitor these Trident protect CRs:
Snapshots
Backups
ExecutionHooksRuns
AppVaults (added in a later step)
Let’s take a closer look at the snapshot CR minio-snap that we created earlier.
$ kubectl -n minio get snapshot minio-snap -o yaml
apiVersion: protect.trident.netapp.io/v1
kind: Snapshot
metadata:
annotations:
protect.trident.netapp.io/correlationid: 42111244-fdb7-41f1-af39-7b61fdb0c7e1
creationTimestamp: "2025-08-18T15:25:40Z"
...
name: minio-snap
namespace: minio
ownerReferences:
- apiVersion: protect.trident.netapp.io/v1
kind: Application
name: minio
uid: efc8cdd4-8b20-48e0-8944-eeee8aba98f9
resourceVersion: "14328"
uid: c569472c-ae13-4d30-bffd-98acef304abc
spec:
appVaultRef: demo
applicationRef: minio
cleanupSnapshot: false
completionTimeout: 0s
reclaimPolicy: Delete
volumeSnapshotsCreatedTimeout: 0s
volumeSnapshotsReadyToUseTimeout: 0s
status:
appArchivePath: minio_efc8cdd4-8b20-48e0-8944-eeee8aba98f9/snapshots/20250818152540_minio-snap_c569472c-ae13-4d30-bffd-98acef304abc
appVaultRef: demo
completionTimestamp: "2025-08-18T15:25:58Z"
...
postSnapshotExecHooksRunResults: []
preSnapshotExecHooksRunResults: []
state: Completed
volumeSnapshots:
- name: snapshot-c569472c-ae13-4d30-bffd-98acef304abc-pvc-ec50d895-4048-4a51-a651-5439b2a5ba2a
namespace: minio
From its metadata section, we want to expose the name, UID, and creationTimestamp of the snapshot to Prometheus, and from the spec and status fields the values of appVaultRef, applicationRef, and state. The corresponding KSM configuration entry looks like this.
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
From the backup CR, which has the same structure as the snapshot CR, we can collect the same information using this KSM configuration entry.
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
To access those CR fields, KSM needs to have the corresponding RBAC permissions to allow access to the snapshot and backup CRs in all namespaces (since the Trident protect CRs are created in the application namespace). So we add the following parameters to the KSM configuration file.
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots", "backups"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
Collecting the details for the execution hook runs works in the same way as for snapshots and backups, so we don't repeat them here. Putting everything together, our first KSM configuration file looks like this.
$ cat metrics-config-backup-snapshot-hooks.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
Now we can install the KSM using Helm.
$ helm install trident-protect -f ./metrics-config-backup-snapshot-hooks.yaml prometheus-community/kube-state-metrics --version 5.21.0 -n prometheus
NAME: trident-protect
LAST DEPLOYED: Tue Aug 19 17:54:22 2025
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
The exposed metrics can be found here:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
The metrics are exported on the HTTP endpoint /metrics on the listening port.
In your case, trident-protect-kube-state-metrics.prometheus.svc.cluster.local:8080/metrics
They are served either as plaintext or protobuf depending on the Accept header.
They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
We check that the KSM ServiceMonitor was correctly deployed in the prometheus namespace.
$ kubectl -n prometheus get smon -l app.kubernetes.io/instance=trident-protect
NAME AGE
trident-protect-kube-state-metrics 90s
$ kubectl get all -n prometheus
NAME READY STATUS RESTARTS AGE
pod/trident-protect-kube-state-metrics-94d55666c-69j6n 1/1 Running 0 105s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 105s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 105s
NAME DESIRED CURRENT READY AGE
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 1 1 1 105s
Prometheus installation
Let's now install Prometheus on our cluster. Before doing that, we must make sure that the Prometheus server can access the Kubernetes API.
RBAC permissions
The Prometheus server needs access to the Kubernetes API to scrape targets, so it requires a ServiceAccount bound to a ClusterRole with the appropriate permissions. By applying the YAML file below, we create the ServiceAccount prometheus and a ClusterRole prometheus with the necessary privileges, and bind the two together.
$ cat ./rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
namespace: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources:
- configmaps
verbs: ["get"]
- apiGroups:
- networking.k8s.io
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
namespace: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: prometheus
$ kubectl apply -f ./rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
Now we’re ready to install Prometheus.
Deploy Prometheus
After creating the Prometheus ServiceAccount and giving it access to the Kubernetes API, we can deploy the Prometheus instance.
We'll use the Prometheus operator for the installation. Following the instructions to install the operator in the prometheus namespace deploys the operator on our K8s cluster within a few minutes.
This manifest defines the serviceMonitorNamespaceSelector, serviceMonitorSelector, and podMonitorSelector fields to specify which monitoring CRs to include. In this example, the {} value is used to match all existing CRs.
$ cat ./prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: prometheus
spec:
serviceAccountName: prometheus
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
podMonitorSelector: {}
resources:
requests:
memory: 400Mi
We apply the manifest and check that the Prometheus instance reaches the Running state eventually and a prometheus-operated Service was created:
$ kubectl apply -f ./prometheus.yaml
prometheus.monitoring.coreos.com/prometheus created
$ kubectl get prometheus -n prometheus
NAME VERSION DESIRED READY RECONCILED AVAILABLE AGE
prometheus 1 True True 42s
$ kubectl get services -n prometheus
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-operated ClusterIP None <none> 9090/TCP 103s
prometheus-operator ClusterIP None <none> 8080/TCP 7m44s
trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 17h
$ kubectl get all -n prometheus
NAME READY STATUS RESTARTS AGE
pod/prometheus-operator-5d697c648f-22lrz 1/1 Running 0 6m21s
pod/prometheus-prometheus-0 2/2 Running 0 20s
pod/trident-protect-kube-state-metrics-94d55666c-69j6n 1/1 Running 0 17h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/prometheus-operated ClusterIP None <none> 9090/TCP 20s
service/prometheus-operator ClusterIP None <none> 8080/TCP 6m21s
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 17h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-operator 1/1 1 1 6m21s
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 17h
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-operator-5d697c648f 1 1 1 6m21s
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 1 1 1 17h
NAME READY AGE
statefulset.apps/prometheus-prometheus 1/1 20s
To quickly test the Prometheus installation, let’s use port-forwarding.
$ kubectl -n prometheus port-forward svc/prometheus-operated 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
By pointing a web browser to http://localhost:9090 we can view the Prometheus console:
Configure the monitoring tools to work together
Now that all the monitoring tools are installed, we need to configure them to work together. To integrate kube-state-metrics with Prometheus, we edit our Prometheus configuration file (prometheus.yaml), add the kube-state-metrics service information to it, and save the result as prometheus-ksm.yaml.
$ cat ./prometheus-ksm.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: prometheus
spec:
serviceAccountName: prometheus
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
podMonitorSelector: {}
resources:
requests:
memory: 400Mi
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: trident-protect
data:
prometheus.yaml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'kube-state-metrics'
static_configs:
- targets: ['kube-state-metrics.trident-protect.svc:8080']
$ diff ./prometheus.yaml ./prometheus-ksm.yaml
13a14,27
> ---
> apiVersion: v1
> kind: ConfigMap
> metadata:
> name: prometheus-config
> namespace: trident-protect
> data:
> prometheus.yaml: |
> global:
> scrape_interval: 15s
> scrape_configs:
> - job_name: 'kube-state-metrics'
> static_configs:
> - targets: ['kube-state-metrics.trident-protect.svc:8080']
After applying the manifest, we confirm that the prometheus-config configuration map was created in the trident-protect namespace:
$ kubectl apply -f ./prometheus-ksm.yaml
prometheus.monitoring.coreos.com/prometheus unchanged
configmap/prometheus-config created
$ kubectl -n trident-protect get cm
NAME DATA AGE
kube-root-ca.crt 1 46h
prometheus-config 1 59s
trident-protect-env-config 15 46h
Now we can query the backup, snapshot, and execution hook run information in Prometheus:
This matches the two snapshots, one backup, and six execution hook runs we have in Trident protect:
$ tridentctl-protect get snapshot -A
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
| NAMESPACE | NAME | APP | RECLAIM POLICY | STATE | ERROR | AGE |
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
| minio | backup-3473b771-caa5-48d2-a9b6-41f4448a049d | minio | Delete | Completed | | 1d22h |
| minio | minio-snap | minio | Delete | Completed | | 1d22h |
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
$ tridentctl-protect get backup -A
+-----------+--------------+-------+----------------+-----------+-------+-------+
| NAMESPACE | NAME | APP | RECLAIM POLICY | STATE | ERROR | AGE |
+-----------+--------------+-------+----------------+-----------+-------+-------+
| minio | minio-backup | minio | Retain | Completed | | 1d22h |
+-----------+--------------+-------+----------------+-----------+-------+-------+
$ kubectl get ehr -A
NAMESPACE NAME STATE STAGE ACTION ERROR APP AGE
minio post-backup-3473b771-caa5-48d2-a9b6-41f4448a049d Completed Post Backup minio 46h
minio post-snapshot-7e7934a4-b51a-4bc4-a981-28a8ba137ff6 Completed Post Snapshot minio 46h
minio post-snapshot-c569472c-ae13-4d30-bffd-98acef304abc Completed Post Snapshot minio 46h
minio pre-backup-3473b771-caa5-48d2-a9b6-41f4448a049d Completed Pre Backup minio 46h
minio pre-snapshot-7e7934a4-b51a-4bc4-a981-28a8ba137ff6 Completed Pre Snapshot minio 46h
minio pre-snapshot-c569472c-ae13-4d30-bffd-98acef304abc Completed Pre Snapshot minio 46h
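These info metrics can be queried directly in the Prometheus UI. Assuming KSM's default kube_customresource metric name prefix (configurable via metricNamePrefix), example queries might look like this; adjust the names to what your instance actually exposes:

```promql
# All snapshot info metrics with their labels
kube_customresource_snapshot_info

# Number of completed backups
count(kube_customresource_backup_info{status="Completed"})

# Execution hook runs that did not complete
kube_customresource_ehr_info{status!="Completed"}
```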
Let’s create a second backup:
$ tridentctl-protect create backup minio-bkp-2 --app minio --appvault demo --reclaim-policy Delete -n minio
Backup "minio-bkp-2" created.
Prometheus quickly catches the backup in the Running state, and the Completed state once the backup finishes.
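Beyond ad-hoc queries, the same metrics can drive alerting. Below is a minimal sketch of a PrometheusRule, assuming KSM's default kube_customresource metric name prefix and a Failed state value (state names may differ by Trident protect version); note that the Prometheus CR from the earlier section would also need a matching ruleSelector for the rule to be picked up:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trident-protect-alerts   # hypothetical name
  namespace: prometheus
spec:
  groups:
    - name: trident-protect
      rules:
        - alert: TridentProtectBackupFailed
          # assumes the default kube_customresource metric name prefix
          expr: kube_customresource_backup_info{status="Failed"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'Trident protect backup {{ $labels.backup_name }} failed'
```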
Add additional metrics and information
Now, we want to add metrics about additional custom resources to Prometheus and see error states (if any) of the monitored custom resources reflected in Prometheus.
AppVault metrics and error details
To include metrics about the appVault CR and its error details, we add the following entries to the KSM configuration file:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
The complete configuration file to capture metrics and error details from the snapshot, backup, execHooksRun, and appVault CRs is then:
$ cat ./metrics-config-backup-snapshot-hooks-appvault.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appvaults"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
We update the KSM configuration:
$ helm upgrade trident-protect prometheus-community/kube-state-metrics -f ./metrics-config-backup-snapshot-hooks-appvault.yaml -n prometheus
Release "trident-protect" has been upgraded. Happy Helming!
NAME: trident-protect
LAST DEPLOYED: Wed Aug 20 16:47:06 2025
NAMESPACE: prometheus
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
The exposed metrics can be found here:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
The metrics are exported on the HTTP endpoint /metrics on the listening port.
In your case, trident-protect-kube-state-metrics.prometheus.svc.cluster.local:8080/metrics
They are served either as plaintext or protobuf depending on the Accept header.
They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
Now the information about the appVault CR is available in Prometheus.
Test AppVault failure
To test Prometheus's monitoring and error recognition, we provoke a failure of our appVault CR. To simulate losing access to the object storage bucket behind the appVault CR, we delete the secret with the access credentials from the trident-protect namespace.
$ kubectl -n trident-protect delete secret puneptunetest
secret "puneptunetest" deleted
After a few seconds, the appVault CR goes into the Error state.
$ tridentctl-protect get appvault
+------+----------+-------+--------------------------------+---------+-----+
| NAME | PROVIDER | STATE | ERROR | MESSAGE | AGE |
+------+----------+-------+--------------------------------+---------+-----+
| demo | Azure | Error | failed to resolve value for | | 2d |
| | | | accountKey: unable to ... | | |
+------+----------+-------+--------------------------------+---------+-----+
And the error of the appVault CR is also reflected in Prometheus:
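In PromQL, such failures can be surfaced with a query like this (again assuming KSM's default kube_customresource metric name prefix):

```promql
# AppVaults reporting a non-empty error
kube_customresource_appvault_info{error!=""}
```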
AppMirrorRelationship metrics
With Trident protect, you can use the asynchronous replication capabilities of NetApp SnapMirror technology to replicate data and application changes from one storage backend to another, on the same cluster or between different clusters. AppMirrorRelationship (AMR) CRs control the replication relationship of an application protected by Trident protect using NetApp SnapMirror, so monitoring their state with Prometheus is essential.
This example config includes the snapshot, backup, execHooksRun, appVault, and AMR metrics:
$ cat ./metrics-config-backup-snapshot-hooks-appvault-amr.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppMirrorRelationship"
version: "v1"
labelsFromPath:
amr_uid: [metadata, uid]
amr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: app_mirror_relationship_info
help: "Exposes details about the AppMirrorRelationship state"
each:
type: Info
info:
labelsFromPath:
desiredState: ["spec", "desiredState"]
destinationAppVaultRef: ["spec", "destinationAppVaultRef"]
sourceAppVaultRef: ["spec", "sourceAppVaultRef"]
sourceApplicationName: ["spec", "sourceApplicationName"]
sourceApplicationUID: ["spec", "sourceApplicationUID"]
state: ["status", "state"]
error: ["status", "error"]
lastTransferStartTimestamp: ["status", "lastTransfer", "startTimestamp"]
lastTransferCompletionTimestamp: ["status", "lastTransfer", "completionTimestamp"]
lastTransferredSnapshotName: ["status", "lastTransferredSnapshot", "name"]
lastTransferredSnapshotCompletionTimestamp: ["status", "lastTransferredSnapshot", "completionTimestamp"]
destinationApplicationRef: ["status", "destinationApplicationRef"]
destinationNamespaces: ["status", "destinationNamespaces"]
promotedSnapshot: ["spec", "promotedSnapshot"]
recurrenceRule: ["spec", "recurrenceRule"]
storageClassName: ["spec", "storageClassName"]
namespaceMapping: ["spec", "namespaceMapping"]
conditions: ["status", "conditions"]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appvaults"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appmirrorrelationships"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
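As in the previous section, we'd roll out the extended configuration with a Helm upgrade:

```shell
$ helm upgrade trident-protect prometheus-community/kube-state-metrics \
    -f ./metrics-config-backup-snapshot-hooks-appvault-amr.yaml -n prometheus
```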
Trident metrics
The metrics provided by Trident enable you to do the following:
Keep tabs on Trident's health and configuration. You can examine how successful operations are and whether Trident can communicate with the backends as expected.
Examine backend usage information, such as how many volumes are provisioned on a backend and the amount of space consumed.
Maintain a mapping of the number of volumes provisioned on available backends.
Track performance. You can look at how long it takes for Trident to communicate to backends and perform operations.
By default, Trident's metrics are exposed on the target port 8001 at the /metrics endpoint. These metrics are enabled by default when Trident is installed.
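To inspect the raw metrics before wiring them into Prometheus, you can port-forward the Trident service and curl the endpoint. This is a quick sketch that assumes Trident runs in the trident namespace and exposes the metrics port on its trident-csi service; adjust to your installation:

```shell
$ kubectl -n trident port-forward svc/trident-csi 8001:8001
$ curl -s http://localhost:8001/metrics | head
```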
Create a Prometheus ServiceMonitor for Trident metrics
Prometheus was set up in the previous sections, so to consume the Trident metrics we create another Prometheus ServiceMonitor that watches the trident-csi service and listens on its metrics port. A sample ServiceMonitor configuration looks like this:
$ cat ./prometheus-trident-sm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: trident-sm
namespace: prometheus
labels:
release: prom-operator
spec:
jobLabel: trident
selector:
matchLabels:
app: controller.csi.trident.netapp.io
namespaceSelector:
matchNames:
- trident
endpoints:
- port: metrics
interval: 15s
Let’s deploy the new ServiceMonitor in the prometheus namespace.
$ kubectl apply -f ./prometheus-trident-sm.yaml
servicemonitor.monitoring.coreos.com/trident-sm created
We can see that the new ServiceMonitor trident-sm is now present in the prometheus namespace:
$ kubectl -n prometheus get all,ServiceMonitor,cm
NAME READY STATUS RESTARTS AGE
pod/prometheus-operator-5d697c648f-22lrz 1/1 Running 0 6h1m
pod/prometheus-prometheus-0 2/2 Running 0 5h55m
pod/trident-protect-kube-state-metrics-99476b548-cv9ff 1/1 Running 0 28m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/prometheus-operated ClusterIP None <none> 9090/TCP 5h55m
service/prometheus-operator ClusterIP None <none> 8080/TCP 6h1m
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 23h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-operator 1/1 1 1 6h1m
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 23h
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-operator-5d697c648f 1 1 1 6h1m
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 0 0 0 23h
replicaset.apps/trident-protect-kube-state-metrics-99476b548 1 1 1 28m
NAME READY AGE
statefulset.apps/prometheus-prometheus 1/1 5h55m
NAME AGE
servicemonitor.monitoring.coreos.com/trident-protect-kube-state-metrics 23h
servicemonitor.monitoring.coreos.com/trident-sm 32s
NAME DATA AGE
configmap/kube-root-ca.crt 1 24h
configmap/prometheus-prometheus-rulefiles-0 0 5h55m
configmap/trident-protect-kube-state-metrics-customresourcestate-config 1 23h
By checking for available targets in the Prometheus UI (http://localhost:9090/targets) we confirm that the Trident metrics are now available in Prometheus.
Query Trident metrics
We can now query the available Trident metrics in Prometheus.
For example, we can query the number of Trident snapshots and volumes, and the bytes allocated by Trident volumes, in the Prometheus UI.
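Queries along these lines can be entered in the Prometheus expression browser. Treat the metric and label names below as assumptions modeled on Trident's metric set and verify them against your own /metrics output:

```promql
# Number of volumes managed by Trident
sum(trident_volume_count)

# Bytes allocated by Trident volumes, per backend
sum(trident_volume_allocated_bytes) by (backend_uuid)
```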
Grafana dashboards
Now that our monitoring system is functional, it’s time to visualize the results. Let’s investigate Grafana dashboards!
Install Grafana
We install Grafana using the Grafana helm charts, first adding the Grafana helm repository:
$ helm repo add grafana https://grafana.github.io/helm-charts
Then we can install Grafana into the namespace grafana, which we create first.
$ kubectl create ns grafana
namespace/grafana created
$ helm install my-grafana grafana/grafana --namespace grafana
NAME: my-grafana
LAST DEPLOYED: Thu Aug 21 14:28:14 2025
NAMESPACE: grafana
STATUS: deployed
REVISION: 1
NOTES:
1. Get your 'admin' user password by running:
kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
my-grafana.grafana.svc.cluster.local
Get the Grafana URL to visit by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace grafana port-forward $POD_NAME 3000
3. Login with the password from step 1 and the username: admin
#################################################################################
###### WARNING: Persistence is disabled!!! You will lose your data when #####
###### the Grafana pod is terminated. #####
#################################################################################
$ helm list -n grafana
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
my-grafana grafana 1 2025-08-21 14:28:14.772879 +0200 CEST deployed grafana-9.3.2 12.1.0
Following the instructions above, we retrieve the Grafana admin password and set up port forwarding.
$ kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
<REDACTED>
$ kubectl -n grafana port-forward svc/my-grafana 3000:80
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000
Now we can test access and log in to the Grafana UI at http://localhost:3000.
Enable persistent storage for Grafana
By default, Grafana uses only ephemeral storage, keeping all data in the container’s file system, so the data is lost when the container stops. We follow the steps in the Grafana documentation to enable persistent storage for Grafana.
We download the values file and edit the persistence section, changing the enabled flag from false to true.
$ diff Grafana/values.yaml Grafana/values-persistence.yaml
418c418
< enabled: false
---
> enabled: true
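For reference, the relevant excerpt of the edited values file looks roughly like this. Only the enabled flag was changed; everything else stays at the chart defaults (the size shown is an assumption matching the 50Gi PVC created on this cluster):

```yaml
persistence:
  enabled: true   # was false; the only change we made
  size: 50Gi      # chart default; matches the PVC we see later
  # storageClassName: azure-netapp-files-standard  # optionally pin a class
```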
Then we run helm upgrade to make the changes take effect.
$ helm upgrade my-grafana grafana/grafana -f Grafana/values-persistence.yaml -n grafana
Release "my-grafana" has been upgraded. Happy Helming!
NAME: my-grafana
LAST DEPLOYED: Thu Aug 21 14:37:24 2025
NAMESPACE: grafana
STATUS: deployed
REVISION: 2
NOTES:
1. Get your 'admin' user password by running:
kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
my-grafana.grafana.svc.cluster.local
Get the Grafana URL to visit by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace grafana port-forward $POD_NAME 3000
We confirm that a PVC backed by Azure NetApp Files was created in the grafana namespace.
$ kubectl get all,pvc -n grafana
NAME READY STATUS RESTARTS AGE
pod/my-grafana-6d5b96b7d7-fqq7d 1/1 Running 0 5m18s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/my-grafana ClusterIP 172.16.9.115 <none> 80/TCP 14m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/my-grafana 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/my-grafana-6ccff48567 0 0 0 14m
replicaset.apps/my-grafana-6d5b96b7d7 1 1 1 5m18s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
persistentvolumeclaim/my-grafana Bound pvc-5a1844c6-3a9f-4f1d-9d94-caa1666ded3e 50Gi RWO azure-netapp-files-standard <unset> 5m19s
After restarting the port forwarding, we can log in to Grafana again and continue working with persistent storage enabled.
Add a data source
Next, we need to add our Prometheus instance as a data source in Grafana. To do this, we need the service name and port of Prometheus. When using the Prometheus Operator, the service name is typically prometheus-operated, so we check on our cluster.
$ kubectl -n prometheus get svc | grep operated
prometheus-operated ClusterIP None <none> 9090/TCP 27h
Now we can add the Prometheus instance as a data source in Grafana. Use the Kubernetes DNS to reference the Prometheus service. It should look something like this: http://prometheus-operated.prometheus.svc.cluster.local:9090
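As an alternative to clicking through the UI, the same data source can be defined declaratively with a Grafana provisioning file; a sketch (which could, for example, be supplied through the chart's datasources value) looks like this:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated.prometheus.svc.cluster.local:9090
    isDefault: true
```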
In the Grafana dashboard, we navigate to Menu -> Drilldown, which allows us to easily see the Trident and KSM Trident protect metrics.
Add a dashboard for the Trident protect metrics
Covering the creation of Grafana dashboards is beyond the scope of this blog post. As an example and for inspiration, we use the dashboard example for visualizing snapshot and backup metrics from Yves Weisser’s highly recommended collection of Trident lab scenarios on GitHub.
After downloading the dashboard json file from GitHub, we change the “Failed” option values to “Error” so that failed snapshots and backups are displayed in red in the dashboard.
$ diff Grafana/dashboard.json Grafana/dashboard_v2.json
365c365
< "Failed": {
---
> "Error": {
562c562
< "Failed": {
---
> "Error": {
709c709,710
< "25.02"
---
> "25.02",
> "25.06"
724c725
< }
\ No newline at end of file
---
> }
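The same edit can be scripted rather than done by hand. A small sketch, under the assumption that every dict key named "Failed" in this dashboard file is one of the value mappings we want to rename (the helper below is hypothetical, and the real file would be read with json.load and written back with json.dump):

```python
import json

def rename_keys(node, old="Failed", new="Error"):
    """Recursively rename dict keys equal to `old` to `new`."""
    if isinstance(node, dict):
        return {(new if k == old else k): rename_keys(v, old, new)
                for k, v in node.items()}
    if isinstance(node, list):
        return [rename_keys(v, old, new) for v in node]
    return node

# Tiny stand-in for the real dashboard json file:
dashboard = {"options": {"mappings": [{"Failed": {"color": "red"}}]}}
print(json.dumps(rename_keys(dashboard)))
# → {"options": {"mappings": [{"Error": {"color": "red"}}]}}
```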
Now we can import the dashboard json file into Grafana.
After importing it, the “Trident protect Global View” dashboard is available in Grafana. Here’s an example of how it visualizes running and failed Trident protect backups.
Conclusion and call to action
By following this blog, you have successfully set up monitoring and visualization for NetApp Trident and Trident protect using Prometheus and Grafana. This setup enables you to keep tabs on the health and performance of your Trident and Trident protect resources, ensuring your Kubernetes applications are well-protected and efficiently managed.
Happy monitoring!
NetApp® Console is built on a restructured and enhanced foundation, encompassing platform investments, core services, enterprise readiness, security, administration, and functionality. It is now the intuitive, intelligent, highly secure, and compliant single point of control for seamless management of your NetApp intelligent data infrastructure. With its reimagined user interface and experience, managing your NetApp Data Services and NetApp storage has never been more intuitive, smart, and insightful.
Microsoft has announced several new features for Azure NetApp Files (ANF). These updates bring meaningful improvements in performance, flexibility, security, and data mobility—making ANF an even more capable solution for organizations running demanding workloads in the cloud.
Whether you're managing infrastructure, supporting hybrid environments, or navigating compliance requirements, these enhancements are designed to help your organization operate more efficiently and securely.
Improved Data Mobility and Access
Two powerful data mobility features are now available:
Azure NetApp Files Cache Volumes
Azure NetApp Files Migration Assistant
Cache Volumes, built on NetApp’s ONTAP® FlexCache® technology, introduce a persistent, high-performance cache in Azure for origin volumes located outside ANF. This means active data can be accessed faster and more efficiently—even across WAN connections. For distributed teams or hybrid architectures, this capability enables low-latency access to critical files without duplicating entire datasets.
The Migration Assistant streamlines the process of moving data from on-premises ONTAP environments to Azure. It preserves metadata and minimizes downtime, helping your organization reduce migration complexity and network costs.
Flexible Pricing and Performance Optimization
Three new features are now generally available (GA) that give organizations more control over cost and performance:
Flexible Service Level
Flexible Service Level with Cool Access
Short-Term Clones
With flexible service levels, you can dynamically adjust performance tiers based on workload needs—scaling up for high-performance tasks or scaling down to save costs during quieter periods. The addition of cool access tiers allows you to store infrequently accessed data at a lower cost, while maintaining availability when needed.
Short-term clones are ideal for development, testing, and analytics. These space-efficient, temporary copies allow teams to work with production-like data without consuming large amounts of storage, accelerating innovation while keeping costs in check.
Simplified VMware Integration
Azure NetApp Files now supports datastore integration with Azure VMware Solution (AVS) Generation 2, and notably, this no longer requires ExpressRoute.
This update simplifies deployment for organizations using AVS, making it easier to migrate VMware workloads to Azure. With ANF providing high-performance storage, teams can expect improved responsiveness and reliability for their virtualized environments—without the complexity of additional networking infrastructure.
Enhanced Security and Visibility
Security and compliance are top priorities for many organizations, and ANF’s latest updates deliver greater control and transparency:
Cross-Tenant Customer-Managed Keys for Volume Encryption
File Access Logs
With cross-tenant encryption, your organization can manage its own encryption keys—even in multi-tenant scenarios—ensuring data protection policies remain under your control. This is especially important for regulated industries or environments with strict governance requirements.
File Access Logs provide detailed visibility into file-level operations. These logs support audit trails, help detect unusual access patterns, and enable forensic analysis—making it easier to meet compliance standards and maintain operational integrity.
Who Benefits Most from These Updates?
These new features are particularly valuable for:
Organizations modernizing infrastructure or migrating to AVS: The VMware integration and migration tools simplify transitions and reduce friction.
Teams with strict security and compliance requirements: Enhanced encryption and logging capabilities support governance and regulatory needs.
IT departments looking to optimize storage costs and performance: Flexible service levels, cool access tiers, and cache volumes deliver measurable efficiency gains.
Whether you're supporting enterprise applications, managing hybrid cloud environments, or enabling global collaboration, these enhancements to Azure NetApp Files offer practical tools to improve performance, reduce costs, and strengthen security.
Next Steps
If your organization is already using ANF or considering it as part of your cloud strategy, these new features are worth exploring. They offer tangible benefits across infrastructure, operations, and compliance—helping you get more value from your cloud investments.
Would you like help evaluating how these features could fit into your current environment or roadmap?
Let’s talk: https://www.netapp.com/azure/contact/