NetApp® Trident™ protect provides advanced application data management capabilities that enhance the functionality and availability of stateful Kubernetes applications backed by NetApp ONTAP storage systems and the NetApp Trident Container Storage Interface (CSI) storage provisioner. It is compatible with a wide range of fully managed and self-managed Kubernetes offerings (see the supported Kubernetes distributions and storage back ends), making it an optimal solution for protecting your Kubernetes services across various platforms and regions. In this blog post, I will demonstrate how to scrape and visualize the metrics provided by Trident and Trident protect using the popular open-source monitoring and visualization frameworks Prometheus and Grafana.
Prerequisites
To follow along with this guide, ensure you have the following:
A Kubernetes cluster with the latest versions of Trident and Trident protect installed, and their associated kubeconfig files
A NetApp ONTAP storage back end and Trident with configured storage back ends, storage classes, and volume snapshot classes
Configured object storage buckets for storing backup and metadata information, with bucket replication set up
A workstation with kubectl configured to use kubeconfig
The Trident protect CLI, tridentctl-protect, installed on your workstation
Admin user permission on the Kubernetes clusters
Prepare test environment
First, let's quickly go through the setup of the test environment used throughout this blog.
Sample application
We will use a simple MinIO application with a persistent volume on Azure NetApp Files (ANF) as our sample application for the monitoring tests. The MinIO application is deployed on an Azure Kubernetes Service (AKS) cluster with NetApp Trident 25.06.0 installed and configured:
$ kubectl get all,pvc -n minio
NAME READY STATUS RESTARTS AGE
pod/minio-67dffb8bbd-5rfpm 1/1 Running 0 14m
pod/minio-console-677bd9ddcb-27497 1/1 Running 0 14m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/minio ClusterIP 172.16.61.243 <none> 9000/TCP 14m
service/minio-console ClusterIP 172.16.95.239 <none> 9090/TCP 14m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/minio 1/1 1 1 14m
deployment.apps/minio-console 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/minio-67dffb8bbd 1 1 1 14m
replicaset.apps/minio-console-677bd9ddcb 1 1 1 14m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
persistentvolumeclaim/minio Bound pvc-ec50d895-4048-4a51-a651-5439b2a5ba2a 50Gi RWO azure-netapp-files-standard <unset> 14m
Create a Trident Protect Application
Create a Trident protect application minio for the minio namespace using the Trident protect CLI:
$ tridentctl-protect create application minio --namespaces minio -n minio
Application "minio" created.
Create a snapshot minio-snap and a backup minio-bkp:
$ tridentctl-protect create snapshot minio-snap --app minio --appvault demo -n minio
Snapshot "minio-snap" created.
$ tridentctl-protect create backup minio-bkp --app minio --appvault demo -n minio
Backup "minio-bkp" created.
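You can verify that both operations reach the Completed state with the same CLI (output omitted here; both resources should show Completed once they finish):

```shell
$ tridentctl-protect get snapshot -n minio
$ tridentctl-protect get backup -n minio
```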
Install kube-state-metrics
Trident protect leverages kube-state-metrics (KSM) to provide information about the health status of its resources. Kube-state-metrics is an open-source add-on for Kubernetes that listens to the Kubernetes API server and generates metrics about the state of various Kubernetes objects.
Install Prometheus ServiceMonitor CRD
First, we install the Custom Resource Definition (CRD) for the Prometheus ServiceMonitor using Helm. Add the prometheus-community Helm repository and update it:
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
Install and configure kube-state-metrics
Now, we install and configure kube-state-metrics to generate metrics from the Kubernetes API. Using it with Trident protect exposes useful information about the state of the Trident protect custom resources in our environment.
Let's create a configuration file for the KSM Helm chart to monitor these Trident protect CRs:
Snapshots
Backups
ExecutionHooksRuns
AppVaults (added in a later step)
Let’s take a closer look at the snapshot CR minio-snap that we created earlier.
$ kubectl -n minio get snapshot minio-snap -o yaml
apiVersion: protect.trident.netapp.io/v1
kind: Snapshot
metadata:
annotations:
protect.trident.netapp.io/correlationid: 42111244-fdb7-41f1-af39-7b61fdb0c7e1
creationTimestamp: "2025-08-18T15:25:40Z"
...
name: minio-snap
namespace: minio
ownerReferences:
- apiVersion: protect.trident.netapp.io/v1
kind: Application
name: minio
uid: efc8cdd4-8b20-48e0-8944-eeee8aba98f9
resourceVersion: "14328"
uid: c569472c-ae13-4d30-bffd-98acef304abc
spec:
appVaultRef: demo
applicationRef: minio
cleanupSnapshot: false
completionTimeout: 0s
reclaimPolicy: Delete
volumeSnapshotsCreatedTimeout: 0s
volumeSnapshotsReadyToUseTimeout: 0s
status:
appArchivePath: minio_efc8cdd4-8b20-48e0-8944-eeee8aba98f9/snapshots/20250818152540_minio-snap_c569472c-ae13-4d30-bffd-98acef304abc
appVaultRef: demo
completionTimestamp: "2025-08-18T15:25:58Z"
...
postSnapshotExecHooksRunResults: []
preSnapshotExecHooksRunResults: []
state: Completed
volumeSnapshots:
- name: snapshot-c569472c-ae13-4d30-bffd-98acef304abc-pvc-ec50d895-4048-4a51-a651-5439b2a5ba2a
namespace: minio
From its metadata section, we want to expose the name, UID, and creationTimestamp of the snapshot to Prometheus, and from the spec and status fields the values of appVaultRef, applicationRef, and state. The corresponding KSM configuration entry looks like this.
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
From the backup CR, which has the same structure as the snapshot CR, we can collect the same information using this KSM configuration entry.
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
To access those CR fields, KSM needs to have the corresponding RBAC permissions to allow access to the snapshot and backup CRs in all namespaces (since the Trident protect CRs are created in the application namespace). So we add the following parameters to the KSM configuration file.
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots", "backups"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
Collecting the details for the execution hook runs works in the same way as for snapshots and backups, so we don't repeat them here. Putting everything together, our first KSM configuration file looks like this.
$ cat metrics-config-backup-snapshot-hooks.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
Now we can install the KSM using Helm.
$ helm install trident-protect -f ./metrics-config-backup-snapshot-hooks.yaml prometheus-community/kube-state-metrics --version 5.21.0 -n prometheus
NAME: trident-protect
LAST DEPLOYED: Tue Aug 19 17:54:22 2025
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
The exposed metrics can be found here:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
The metrics are exported on the HTTP endpoint /metrics on the listening port.
In your case, trident-protect-kube-state-metrics.prometheus.svc.cluster.local:8080/metrics
They are served either as plaintext or protobuf depending on the Accept header.
They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
We check that the KSM ServiceMonitor was correctly deployed in the prometheus namespace.
$ kubectl -n prometheus get smon -l app.kubernetes.io/instance=trident-protect
NAME AGE
trident-protect-kube-state-metrics 90s
$ kubectl get all -n prometheus
NAME READY STATUS RESTARTS AGE
pod/trident-protect-kube-state-metrics-94d55666c-69j6n 1/1 Running 0 105s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 105s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 105s
NAME DESIRED CURRENT READY AGE
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 1 1 1 105s
Prometheus installation
Let's now install Prometheus on our cluster. Before doing that, we must make sure that the Prometheus server can access the Kubernetes API.
RBAC permissions
The Prometheus server needs access to the Kubernetes API to scrape targets, so it requires a ServiceAccount bound to a ClusterRole with the appropriate permissions. By applying the YAML file below, we create the ServiceAccount prometheus and a ClusterRole prometheus with the necessary privileges, and bind the two together.
$ cat ./rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
namespace: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources:
- configmaps
verbs: ["get"]
- apiGroups:
- networking.k8s.io
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
namespace: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: prometheus
$ kubectl apply -f ./rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
Now we’re ready to install Prometheus.
Deploy Prometheus
After creating the Prometheus ServiceAccount and giving it access to the Kubernetes API, we can deploy the Prometheus instance.
We'll use the Prometheus operator for the installation. Following the instructions to install the operator in the prometheus namespace deploys the operator on our K8s cluster within a few minutes.
This manifest defines the serviceMonitorNamespaceSelector, serviceMonitorSelector, and podMonitorSelector fields to specify which monitoring CRs to include. In this example, the {} value is used to match all existing CRs.
$ cat ./prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: prometheus
spec:
serviceAccountName: prometheus
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
podMonitorSelector: {}
resources:
requests:
memory: 400Mi
We apply the manifest and check that the Prometheus instance reaches the Running state eventually and a prometheus-operated Service was created:
$ kubectl apply -f ./prometheus.yaml
prometheus.monitoring.coreos.com/prometheus created
$ kubectl get prometheus -n prometheus
NAME VERSION DESIRED READY RECONCILED AVAILABLE AGE
prometheus 1 True True 42s
$ kubectl get services -n prometheus
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-operated ClusterIP None <none> 9090/TCP 103s
prometheus-operator ClusterIP None <none> 8080/TCP 7m44s
trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 17h
$ kubectl get all -n prometheus
NAME READY STATUS RESTARTS AGE
pod/prometheus-operator-5d697c648f-22lrz 1/1 Running 0 6m21s
pod/prometheus-prometheus-0 2/2 Running 0 20s
pod/trident-protect-kube-state-metrics-94d55666c-69j6n 1/1 Running 0 17h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/prometheus-operated ClusterIP None <none> 9090/TCP 20s
service/prometheus-operator ClusterIP None <none> 8080/TCP 6m21s
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 17h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-operator 1/1 1 1 6m21s
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 17h
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-operator-5d697c648f 1 1 1 6m21s
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 1 1 1 17h
NAME READY AGE
statefulset.apps/prometheus-prometheus 1/1 20s
To quickly test the Prometheus installation, let’s use port-forwarding.
$ kubectl -n prometheus port-forward svc/prometheus-operated 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
By pointing a web browser to http://localhost:9090 we can view the Prometheus console:
Configure the monitoring tools to work together
Now that all the monitoring tools are installed, we need to configure them to work together. To integrate kube-state-metrics with Prometheus, we edit our Prometheus configuration file (prometheus.yaml), add the kube-state-metrics service information to it, and save the result as prometheus-ksm.yaml.
$ cat ./prometheus-ksm.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: prometheus
spec:
serviceAccountName: prometheus
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
podMonitorSelector: {}
resources:
requests:
memory: 400Mi
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: trident-protect
data:
prometheus.yaml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'kube-state-metrics'
static_configs:
- targets: ['kube-state-metrics.trident-protect.svc:8080']
$ diff ./prometheus.yaml ./prometheus-ksm.yaml
13a14,27
> ---
> apiVersion: v1
> kind: ConfigMap
> metadata:
> name: prometheus-config
> namespace: trident-protect
> data:
> prometheus.yaml: |
> global:
> scrape_interval: 15s
> scrape_configs:
> - job_name: 'kube-state-metrics'
> static_configs:
> - targets: ['kube-state-metrics.trident-protect.svc:8080']
After applying the manifest, we confirm that the prometheus-config configuration map was created in the trident-protect namespace:
$ kubectl apply -f ./prometheus-ksm.yaml
prometheus.monitoring.coreos.com/prometheus unchanged
configmap/prometheus-config created
$ kubectl -n trident-protect get cm
NAME DATA AGE
kube-root-ca.crt 1 46h
prometheus-config 1 59s
trident-protect-env-config 15 46h
Now we can query the backup, snapshot, and execution hook run information in Prometheus:
This matches the two snapshots, one backup, and six execution hook runs we have in Trident protect:
$ tridentctl-protect get snapshot -A
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
| NAMESPACE | NAME | APP | RECLAIM POLICY | STATE | ERROR | AGE |
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
| minio | backup-3473b771-caa5-48d2-a9b6-41f4448a049d | minio | Delete | Completed | | 1d22h |
| minio | minio-snap | minio | Delete | Completed | | 1d22h |
+-----------+---------------------------------------------+-------+----------------+-----------+-------+-------+
$ tridentctl-protect get backup -A
+-----------+--------------+-------+----------------+-----------+-------+-------+
| NAMESPACE | NAME | APP | RECLAIM POLICY | STATE | ERROR | AGE |
+-----------+--------------+-------+----------------+-----------+-------+-------+
| minio | minio-backup | minio | Retain | Completed | | 1d22h |
+-----------+--------------+-------+----------------+-----------+-------+-------+
$ kubectl get ehr -A
NAMESPACE NAME STATE STAGE ACTION ERROR APP AGE
minio post-backup-3473b771-caa5-48d2-a9b6-41f4448a049d Completed Post Backup minio 46h
minio post-snapshot-7e7934a4-b51a-4bc4-a981-28a8ba137ff6 Completed Post Snapshot minio 46h
minio post-snapshot-c569472c-ae13-4d30-bffd-98acef304abc Completed Post Snapshot minio 46h
minio pre-backup-3473b771-caa5-48d2-a9b6-41f4448a049d Completed Pre Backup minio 46h
minio pre-snapshot-7e7934a4-b51a-4bc4-a981-28a8ba137ff6 Completed Pre Snapshot minio 46h
minio pre-snapshot-c569472c-ae13-4d30-bffd-98acef304abc Completed Pre Snapshot minio 46h
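These info metrics can be queried directly in the Prometheus UI. Assuming KSM's default kube_customresource metric name prefix (configurable via metricNamePrefix), example queries might look like this; adjust the names to what your instance actually exposes:

```promql
# All snapshot info metrics with their labels
kube_customresource_snapshot_info

# Number of completed backups
count(kube_customresource_backup_info{status="Completed"})

# Execution hook runs that did not complete
kube_customresource_ehr_info{status!="Completed"}
```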
Let’s create a second backup:
$ tridentctl-protect create backup minio-bkp-2 --app minio --appvault demo --reclaim-policy Delete -n minio
Backup "minio-bkp-2" created.
Prometheus quickly catches the backup in the Running state, and the Completed state once the backup finishes.
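Beyond ad-hoc queries, the same metrics can drive alerting. Below is a minimal sketch of a PrometheusRule, assuming KSM's default kube_customresource metric name prefix and a Failed state value (state names may differ by Trident protect version); note that the Prometheus CR from the earlier section would also need a matching ruleSelector for the rule to be picked up:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trident-protect-alerts   # hypothetical name
  namespace: prometheus
spec:
  groups:
    - name: trident-protect
      rules:
        - alert: TridentProtectBackupFailed
          # assumes the default kube_customresource metric name prefix
          expr: kube_customresource_backup_info{status="Failed"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'Trident protect backup {{ $labels.backup_name }} failed'
```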
Add additional metrics and information
Now, we want to add metrics about additional custom resources to Prometheus and see error states (if any) of the monitored custom resources reflected in Prometheus.
AppVault metrics and error details
To include metrics about the appVault CR and its error details, we add the following entries to the KSM configuration file:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
The complete configuration file to capture metrics and error details from the snapshot, backup, execHooksRun, and appVault CRs is then:
$ cat ./metrics-config-backup-snapshot-hooks-appvault.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appvaults"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
We update the KSM configuration:
$ helm upgrade trident-protect prometheus-community/kube-state-metrics -f ./metrics-config-backup-snapshot-hooks-appvault.yaml -n prometheus
Release "trident-protect" has been upgraded. Happy Helming!
NAME: trident-protect
LAST DEPLOYED: Wed Aug 20 16:47:06 2025
NAMESPACE: prometheus
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
The exposed metrics can be found here:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
The metrics are exported on the HTTP endpoint /metrics on the listening port.
In your case, trident-protect-kube-state-metrics.prometheus.svc.cluster.local:8080/metrics
They are served either as plaintext or protobuf depending on the Accept header.
They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
Now the information about the appVault CR is available in Prometheus.
Test AppVault failure
To test Prometheus's monitoring and error recognition, we provoke a failure of our appVault CR. To simulate losing access to the object storage bucket behind the appVault CR, we delete the secret with the access credentials from the trident-protect namespace.
$ kubectl -n trident-protect delete secret puneptunetest
secret "puneptunetest" deleted
After a few seconds, the appVault CR goes into the Error state.
$ tridentctl-protect get appvault
+------+----------+-------+--------------------------------+---------+-----+
| NAME | PROVIDER | STATE | ERROR | MESSAGE | AGE |
+------+----------+-------+--------------------------------+---------+-----+
| demo | Azure | Error | failed to resolve value for | | 2d |
| | | | accountKey: unable to ... | | |
+------+----------+-------+--------------------------------+---------+-----+
And the error of the appVault CR is also reflected in Prometheus:
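In PromQL, such failures can be surfaced with a query like this (again assuming KSM's default kube_customresource metric name prefix):

```promql
# AppVaults reporting a non-empty error
kube_customresource_appvault_info{error!=""}
```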
AppMirrorRelationship metrics
With Trident protect, you can use the asynchronous replication capabilities of NetApp SnapMirror technology to replicate data and application changes from one storage backend to another, on the same cluster or between different clusters. AppMirrorRelationship (AMR) CRs control the replication relationship of an application protected by Trident protect using NetApp SnapMirror, so monitoring their state with Prometheus is essential.
This example config includes the snapshot, backup, execHooksRun, appVault, and AMR metrics:
$ cat ./metrics-config-backup-snapshot-hooks-appvault-amr.yaml
extraArgs:
# collect only our metrics, not the defaults ones (deployments etc.)
- --custom-resource-state-only=true
customResourceState:
enabled: true
config:
kind: CustomResourceStateMetrics
spec:
resources:
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Snapshot"
version: "v1"
labelsFromPath:
snapshot_uid: [metadata, uid]
snapshot_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: snapshot_info
help: "Exposes details about the Snapshot state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Backup"
version: "v1"
labelsFromPath:
backup_uid: [metadata, uid]
backup_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: backup_info
help: "Exposes details about the Backup state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "Exechooksruns"
version: "v1"
labelsFromPath:
ehr_uid: [metadata, uid]
ehr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: ehr_info
help: "Exposes details about the Exec Hook state"
each:
type: Info
info:
labelsFromPath:
appVaultReference: ["spec", "appVaultRef"]
appReference: ["spec", "applicationRef"]
stage: ["spec", stage]
action: ["spec", action]
status: [status, state]
error: [status, error]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppVault"
version: "v1"
labelsFromPath:
appvault_uid: [metadata, uid]
appvault_name: [metadata, name]
metricsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
metrics:
- name: appvault_info
help: "Exposes details about the AppVault state"
each:
type: Info
info:
labelsFromPath:
state: [status, state]
error: [status, error]
message: [status, message]
- groupVersionKind:
group: protect.trident.netapp.io
kind: "AppMirrorRelationship"
version: "v1"
labelsFromPath:
amr_uid: [metadata, uid]
amr_name: [metadata, name]
creation_time: [metadata, creationTimestamp]
metrics:
- name: app_mirror_relationship_info
help: "Exposes details about the AppMirrorRelationship state"
each:
type: Info
info:
labelsFromPath:
desiredState: ["spec", "desiredState"]
destinationAppVaultRef: ["spec", "destinationAppVaultRef"]
sourceAppVaultRef: ["spec", "sourceAppVaultRef"]
sourceApplicationName: ["spec", "sourceApplicationName"]
sourceApplicationUID: ["spec", "sourceApplicationUID"]
state: ["status", "state"]
error: ["status", "error"]
lastTransferStartTimestamp: ["status", "lastTransfer", "startTimestamp"]
lastTransferCompletionTimestamp: ["status", "lastTransfer", "completionTimestamp"]
lastTransferredSnapshotName: ["status", "lastTransferredSnapshot", "name"]
lastTransferredSnapshotCompletionTimestamp: ["status", "lastTransferredSnapshot", "completionTimestamp"]
destinationApplicationRef: ["status", "destinationApplicationRef"]
destinationNamespaces: ["status", "destinationNamespaces"]
promotedSnapshot: ["spec", "promotedSnapshot"]
recurrenceRule: ["spec", "recurrenceRule"]
storageClassName: ["spec", "storageClassName"]
namespaceMapping: ["spec", "namespaceMapping"]
conditions: ["status", "conditions"]
rbac:
extraRules:
- apiGroups: ["protect.trident.netapp.io"]
resources: ["snapshots"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["backups"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["exechooksruns"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appvaults"]
verbs: ["list", "watch"]
- apiGroups: ["protect.trident.netapp.io"]
resources: ["appmirrorrelationships"]
verbs: ["list", "watch"]
# collect metrics from ALL namespaces
namespaces: ""
# deploy a ServiceMonitor so the metrics are collected by Prometheus
prometheus:
monitor:
enabled: true
additionalLabels:
release: prometheus
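As in the previous section, we'd roll out the extended configuration with a Helm upgrade:

```shell
$ helm upgrade trident-protect prometheus-community/kube-state-metrics \
    -f ./metrics-config-backup-snapshot-hooks-appvault-amr.yaml -n prometheus
```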
Trident metrics
The metrics provided by Trident enable you to do the following:
Keep tabs on Trident's health and configuration. You can examine how successful operations are and whether Trident can communicate with the backends as expected.
Examine backend usage information, such as how many volumes are provisioned on a backend and the amount of space consumed.
Maintain a mapping of the number of volumes provisioned on available backends.
Track performance. You can look at how long it takes for Trident to communicate to backends and perform operations.
By default, Trident's metrics are exposed on the target port 8001 at the /metrics endpoint. These metrics are enabled by default when Trident is installed.
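To inspect the raw metrics before wiring them into Prometheus, you can port-forward the Trident service and curl the endpoint. This is a quick sketch that assumes Trident runs in the trident namespace and exposes the metrics port on its trident-csi service; adjust to your installation:

```shell
$ kubectl -n trident port-forward svc/trident-csi 8001:8001
$ curl -s http://localhost:8001/metrics | head
```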
Create a Prometheus ServiceMonitor for Trident metrics
Prometheus was set up in the previous sections, so to consume the Trident metrics we create another Prometheus ServiceMonitor that watches the trident-csi service and listens on its metrics port. A sample ServiceMonitor configuration looks like this:
$ cat ./prometheus-trident-sm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: trident-sm
namespace: prometheus
labels:
release: prom-operator
spec:
jobLabel: trident
selector:
matchLabels:
app: controller.csi.trident.netapp.io
namespaceSelector:
matchNames:
- trident
endpoints:
- port: metrics
interval: 15s
Let’s deploy the new ServiceMonitor in the prometheus namespace.
$ kubectl apply -f ./prometheus-trident-sm.yaml
servicemonitor.monitoring.coreos.com/trident-sm created
We can see that the new ServiceMonitor trident-sm is now present in the prometheus namespace:
$ kubectl -n prometheus get all,ServiceMonitor,cm
NAME READY STATUS RESTARTS AGE
pod/prometheus-operator-5d697c648f-22lrz 1/1 Running 0 6h1m
pod/prometheus-prometheus-0 2/2 Running 0 5h55m
pod/trident-protect-kube-state-metrics-99476b548-cv9ff 1/1 Running 0 28m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/prometheus-operated ClusterIP None <none> 9090/TCP 5h55m
service/prometheus-operator ClusterIP None <none> 8080/TCP 6h1m
service/trident-protect-kube-state-metrics ClusterIP 172.16.88.31 <none> 8080/TCP 23h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-operator 1/1 1 1 6h1m
deployment.apps/trident-protect-kube-state-metrics 1/1 1 1 23h
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-operator-5d697c648f 1 1 1 6h1m
replicaset.apps/trident-protect-kube-state-metrics-94d55666c 0 0 0 23h
replicaset.apps/trident-protect-kube-state-metrics-99476b548 1 1 1 28m
NAME READY AGE
statefulset.apps/prometheus-prometheus 1/1 5h55m
NAME AGE
servicemonitor.monitoring.coreos.com/trident-protect-kube-state-metrics 23h
servicemonitor.monitoring.coreos.com/trident-sm 32s
NAME DATA AGE
configmap/kube-root-ca.crt 1 24h
configmap/prometheus-prometheus-rulefiles-0 0 5h55m
configmap/trident-protect-kube-state-metrics-customresourcestate-config 1 23h
By checking for available targets in the Prometheus UI (http://localhost:9090/targets) we confirm that the Trident metrics are now available in Prometheus.
Query Trident metrics
We can now query the available Trident metrics in Prometheus.
For example, we can query the number of Trident snapshots and volumes, and the bytes allocated by Trident volumes, in the Prometheus UI.
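Queries along these lines can be entered in the Prometheus expression browser. Treat the metric and label names below as assumptions modeled on Trident's metric set and verify them against your own /metrics output:

```promql
# Number of volumes managed by Trident
sum(trident_volume_count)

# Bytes allocated by Trident volumes, per backend
sum(trident_volume_allocated_bytes) by (backend_uuid)
```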
Grafana dashboards
Now that our monitoring system is functional, it’s time to visualize the results. Let’s investigate Grafana dashboards!
Install Grafana
We install Grafana using the Grafana helm charts, first adding the Grafana helm repository:
$ helm repo add grafana https://grafana.github.io/helm-charts
Then we can install Grafana into the namespace grafana, which we create first.
$ kubectl create ns grafana
namespace/grafana created
$ helm install my-grafana grafana/grafana --namespace grafana
NAME: my-grafana
LAST DEPLOYED: Thu Aug 21 14:28:14 2025
NAMESPACE: grafana
STATUS: deployed
REVISION: 1
NOTES:
1. Get your 'admin' user password by running:
kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
my-grafana.grafana.svc.cluster.local
Get the Grafana URL to visit by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace grafana port-forward $POD_NAME 3000
3. Login with the password from step 1 and the username: admin
#################################################################################
###### WARNING: Persistence is disabled!!! You will lose your data when #####
###### the Grafana pod is terminated. #####
#################################################################################
$ helm list -n grafana
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
my-grafana grafana 1 2025-08-21 14:28:14.772879 +0200 CEST deployed grafana-9.3.2 12.1.0
Following the instructions above, we retrieve the Grafana admin password and set up port forwarding.
$ kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
<REDACTED>
$ kubectl -n grafana port-forward svc/my-grafana 3000:80
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000
Now we can test access and log in to the Grafana UI at http://localhost:3000.
Enable persistent storage for Grafana
By default, Grafana uses only ephemeral storage, keeping all data in the container’s file system, so the data is lost when the container stops. We follow the steps in the Grafana documentation to enable persistent storage for Grafana.
We download the values file and edit the persistence section, changing the enabled flag from false to true.
$ diff Grafana/values.yaml Grafana/values-persistence.yaml
418c418
< enabled: false
---
> enabled: true
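For reference, the relevant excerpt of the edited values file looks roughly like this. Only the enabled flag was changed; everything else stays at the chart defaults (the size shown is an assumption matching the 50Gi PVC created on this cluster):

```yaml
persistence:
  enabled: true   # was false; the only change we made
  size: 50Gi      # chart default; matches the PVC we see later
  # storageClassName: azure-netapp-files-standard  # optionally pin a class
```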
Then we run helm upgrade to make the changes take effect.
$ helm upgrade my-grafana grafana/grafana -f Grafana/values-persistence.yaml -n grafana
Release "my-grafana" has been upgraded. Happy Helming!
NAME: my-grafana
LAST DEPLOYED: Thu Aug 21 14:37:24 2025
NAMESPACE: grafana
STATUS: deployed
REVISION: 2
NOTES:
1. Get your 'admin' user password by running:
kubectl get secret --namespace grafana my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
my-grafana.grafana.svc.cluster.local
Get the Grafana URL to visit by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace grafana port-forward $POD_NAME 3000
We confirm that a PVC backed by Azure NetApp Files was created in the grafana namespace.
$ kubectl get all,pvc -n grafana
NAME READY STATUS RESTARTS AGE
pod/my-grafana-6d5b96b7d7-fqq7d 1/1 Running 0 5m18s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/my-grafana ClusterIP 172.16.9.115 <none> 80/TCP 14m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/my-grafana 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/my-grafana-6ccff48567 0 0 0 14m
replicaset.apps/my-grafana-6d5b96b7d7 1 1 1 5m18s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
persistentvolumeclaim/my-grafana Bound pvc-5a1844c6-3a9f-4f1d-9d94-caa1666ded3e 50Gi RWO azure-netapp-files-standard <unset> 5m19s
After restarting the port forwarding, we can log in to Grafana again and continue working with persistent storage enabled.
Add a data source
Next, we need to add our Prometheus instance as a data source in Grafana. To do this, we need the service name and port of Prometheus. When using the Prometheus Operator, the service name is typically prometheus-operated, so we check on our cluster.
$ kubectl -n prometheus get svc | grep operated
prometheus-operated ClusterIP None <none> 9090/TCP 27h
Now we can add the Prometheus instance as a data source in Grafana. Use the Kubernetes DNS to reference the Prometheus service. It should look something like this: http://prometheus-operated.prometheus.svc.cluster.local:9090
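As an alternative to clicking through the UI, the same data source can be defined declaratively with a Grafana provisioning file; a sketch (which could, for example, be supplied through the chart's datasources value) looks like this:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated.prometheus.svc.cluster.local:9090
    isDefault: true
```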
In the Grafana dashboard, we navigate to Menu -> Drilldown, which allows us to easily see the Trident and KSM Trident protect metrics.
Add a dashboard for the Trident protect metrics
Covering the creation of Grafana dashboards is beyond the scope of this blog post. As an example and for inspiration, we use the dashboard example for visualizing snapshot and backup metrics from Yves Weisser’s highly recommended collection of Trident lab scenarios on GitHub.
After downloading the dashboard json file from GitHub, we change the “Failed” option values to “Error” so that failed snapshots and backups are displayed in red in the dashboard.
$ diff Grafana/dashboard.json Grafana/dashboard_v2.json
365c365
< "Failed": {
---
> "Error": {
562c562
< "Failed": {
---
> "Error": {
709c709,710
< "25.02"
---
> "25.02",
> "25.06"
724c725
< }
\ No newline at end of file
---
> }
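The same edit can be scripted rather than done by hand. A small sketch, under the assumption that every dict key named "Failed" in this dashboard file is one of the value mappings we want to rename (the helper below is hypothetical, and the real file would be read with json.load and written back with json.dump):

```python
import json

def rename_keys(node, old="Failed", new="Error"):
    """Recursively rename dict keys equal to `old` to `new`."""
    if isinstance(node, dict):
        return {(new if k == old else k): rename_keys(v, old, new)
                for k, v in node.items()}
    if isinstance(node, list):
        return [rename_keys(v, old, new) for v in node]
    return node

# Tiny stand-in for the real dashboard json file:
dashboard = {"options": {"mappings": [{"Failed": {"color": "red"}}]}}
print(json.dumps(rename_keys(dashboard)))
# → {"options": {"mappings": [{"Error": {"color": "red"}}]}}
```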
Now we can import the dashboard json file into Grafana.
After importing it, the “Trident protect Global View” dashboard is available in Grafana. Here’s an example of how it visualizes running and failed Trident protect backups.
Conclusion and call to action
By following this blog, you have successfully set up monitoring and visualization for NetApp Trident and Trident protect using Prometheus and Grafana. This setup enables you to keep tabs on the health and performance of your Trident and Trident protect resources, ensuring your Kubernetes applications are well-protected and efficiently managed.
Happy monitoring!
NetApp® Console is built on a restructured and enhanced foundation, encompassing platform investments, core services, enterprise readiness, security, administration, and functionality. It is now the intuitive, intelligent, highly secure, and compliant single point of control for seamless management of your NetApp intelligent data infrastructure. With its reimagined user interface and experience, managing your NetApp Data Services and NetApp storage has never been more intuitive, smart, and insightful.
Microsoft has announced several new features for Azure NetApp Files (ANF). These updates bring meaningful improvements in performance, flexibility, security, and data mobility—making ANF an even more capable solution for organizations running demanding workloads in the cloud.
Whether you're managing infrastructure, supporting hybrid environments, or navigating compliance requirements, these enhancements are designed to help your organization operate more efficiently and securely.
Improved Data Mobility and Access
Two powerful data mobility features are now available:
Azure NetApp Files Cache Volumes
Azure NetApp Files Migration Assistant
Cache Volumes, built on NetApp’s ONTAP® FlexCache® technology, introduce a persistent, high-performance cache in Azure for origin volumes located outside ANF. This means active data can be accessed faster and more efficiently—even across WAN connections. For distributed teams or hybrid architectures, this capability enables low-latency access to critical files without duplicating entire datasets.
The Migration Assistant streamlines the process of moving data from on-premises ONTAP environments to Azure. It preserves metadata and minimizes downtime, helping your organization reduce migration complexity and network costs.
Flexible Pricing and Performance Optimization
Three new features are now generally available (GA) that give organizations more control over cost and performance:
Flexible Service Level
Flexible Service Level with Cool Access
Short-Term Clones
With flexible service levels, you can dynamically adjust performance tiers based on workload needs—scaling up for high-performance tasks or scaling down to save costs during quieter periods. The addition of cool access tiers allows you to store infrequently accessed data at a lower cost, while maintaining availability when needed.
Short-term clones are ideal for development, testing, and analytics. These space-efficient, temporary copies allow teams to work with production-like data without consuming large amounts of storage, accelerating innovation while keeping costs in check.
Simplified VMware Integration
Azure NetApp Files now supports datastore integration with Azure VMware Solution (AVS) Generation 2, and notably, this no longer requires ExpressRoute.
This update simplifies deployment for organizations using AVS, making it easier to migrate VMware workloads to Azure. With ANF providing high-performance storage, teams can expect improved responsiveness and reliability for their virtualized environments—without the complexity of additional networking infrastructure.
Enhanced Security and Visibility
Security and compliance are top priorities for many organizations, and ANF’s latest updates deliver greater control and transparency:
Cross-Tenant Customer-Managed Keys for Volume Encryption
File Access Logs
With cross-tenant encryption, your organization can manage its own encryption keys—even in multi-tenant scenarios—ensuring data protection policies remain under your control. This is especially important for regulated industries or environments with strict governance requirements.
File Access Logs provide detailed visibility into file-level operations. These logs support audit trails, help detect unusual access patterns, and enable forensic analysis—making it easier to meet compliance standards and maintain operational integrity.
Who Benefits Most from These Updates?
These new features are particularly valuable for:
Organizations modernizing infrastructure or migrating to AVS: The VMware integration and migration tools simplify transitions and reduce friction.
Teams with strict security and compliance requirements: Enhanced encryption and logging capabilities support governance and regulatory needs.
IT departments looking to optimize storage costs and performance: Flexible service levels, cool access tiers, and cache volumes deliver measurable efficiency gains.
Whether you're supporting enterprise applications, managing hybrid cloud environments, or enabling global collaboration, these enhancements to Azure NetApp Files offer practical tools to improve performance, reduce costs, and strengthen security.
Next Steps
If your organization is already using ANF or considering it as part of your cloud strategy, these new features are worth exploring. They offer tangible benefits across infrastructure, operations, and compliance—helping you get more value from your cloud investments.
Would you like help evaluating how these features could fit into your current environment or roadmap?
Let’s talk: https://www.netapp.com/azure/contact/