Self-Managing Storage: Part 2 – Understanding Storage Resource Performance Lifecycle Management

jacoba · 2020-12-11

Storage Resource Performance Management

Storage Resource Performance Management in Active IQ Unified Manager provides consistent performance throughout the lifetime of an ONTAP cluster so that the workloads served by these resources perform as intended. It focuses strictly on the performance of the hardware resources of an ONTAP cluster, including nodes and local tiers.

Active IQ Unified Manager monitors and manages the storage resource performance of highly loaded and overloaded resources to provide balance throughout the cluster. It also provides adequate resource performance headroom during workload provisioning. Active IQ Unified Manager takes preventive, proactive, and reactive actions so that storage resources perform optimally.

An exciting new feature in Active IQ Unified Manager 9.8 enables customers to remediate issues by clicking “Fix It” or “Dismiss” buttons. When a user chooses the “Fix It” option, Active IQ Unified Manager takes the appropriate remedial actions for the associated events.

How does Storage Resource Performance Management operate?

Storage Resource Performance Management works in three separate modes throughout the operational lifetime of an ONTAP cluster. The first mode is active when user demand is lower than the estimated optimal performance capacity (preventive management). The second mode kicks in when user demand is at the optimal performance capacity (proactive management), and the third mode is active when user demand is higher than the estimated optimal performance capacity (reactive management).
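To make the three modes concrete, here is a minimal Python sketch of how demand could be mapped to a mode. It only illustrates the comparison described above and is not how Active IQ Unified Manager implements it. The names `Mode` and `management_mode`, and the `tolerance` band that decides when demand counts as being "at" the optimal level, are assumptions made for this example.

```python
from enum import Enum


class Mode(Enum):
    PREVENTIVE = "preventive"
    PROACTIVE = "proactive"
    REACTIVE = "reactive"


def management_mode(demand: float, optimal_capacity: float,
                    tolerance: float = 0.05) -> Mode:
    """Pick a mode by comparing demand with the optimal performance capacity.

    `tolerance` is a hypothetical band, expressed as a fraction of the
    optimal capacity, within which demand is treated as "at" the optimum.
    """
    if optimal_capacity <= 0:
        raise ValueError("optimal_capacity must be positive")
    ratio = demand / optimal_capacity
    if ratio < 1.0 - tolerance:
        return Mode.PREVENTIVE   # demand below the optimal level
    if ratio <= 1.0 + tolerance:
        return Mode.PROACTIVE    # demand at the optimal level
    return Mode.REACTIVE         # demand above the optimal level
```

For example, `management_mode(50, 100)` falls in the preventive mode, while `management_mode(130, 100)` falls in the reactive mode.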

Figure 1: Resource Performance Lifecycle Management

Preventive Actions

Active IQ Unified Manager monitors and analyzes the storage resource performance and the workload performance served by the cluster for a period of time to assess performance capacity use of the storage resources.

Based on this assessment, if there is a distribution imbalance of as much as 30% in the performance capacity used across cluster resources throughout a 24-hour period, then a cluster imbalance event called “Cluster Load Balance Threshold Breached” is raised.
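The imbalance check can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the imbalance is taken as the gap between the highest and lowest per-node average of performance capacity used, and the input is assumed to hold one list of percentage samples per node covering the 24-hour period. The function names and the data layout are hypothetical; only the 30% threshold and the event name come from the text above.

```python
from statistics import mean
from typing import Optional

IMBALANCE_THRESHOLD_PCT = 30.0  # threshold quoted in this post
EVENT_NAME = "Cluster Load Balance Threshold Breached"


def cluster_imbalance(samples_by_node: dict[str, list[float]]) -> float:
    """Gap, in percentage points, between the most and least loaded node.

    Assumption: each node's load is summarized by the average of its
    performance-capacity-used samples over the observation period.
    """
    if not samples_by_node:
        raise ValueError("at least one node is required")
    averages = [mean(samples) for samples in samples_by_node.values()]
    return max(averages) - min(averages)


def imbalance_event(samples_by_node: dict[str, list[float]]) -> Optional[str]:
    """Return the event name when the gap reaches the threshold, else None."""
    if cluster_imbalance(samples_by_node) >= IMBALANCE_THRESHOLD_PCT:
        return EVENT_NAME
    return None
```

With per-node averages of 85% and 45%, the gap is 40 percentage points and the event is returned. With averages of 60% and 50%, nothing is raised.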

If a volume can be found that reduces the load-imbalance gap between the most highly loaded node and the least-used node without affecting the services configured in the volume, then a volume move operation to remediate the cluster imbalance is suggested in the Management Action card. The recommended volume move operation does not add demand or affect user performance for the resource to which the volume is moved.

The recommended volume move remediation verifies that no ONTAP cluster configurations are violated. Before suggesting the remediation, Active IQ Unified Manager checks if the volume is a constituent of a NetApp FlexGroup volume, a root volume, or a volume tiered to the cloud. The analysis also makes sure that the suggested volume move operation reduces the overall cluster demand imbalance rather than shifting the condition from one node to another.
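The selection logic described in the last two paragraphs can be sketched as follows. This is an illustration, not the product's algorithm. It assumes that FlexGroup constituents, root volumes, and cloud-tiered volumes are excluded from consideration, that a volume's load simply moves from the source node to the destination node (nodes of equal capacity), and that the best candidate is the one leaving the smallest cluster-wide gap. The `Volume` class and every function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Volume:
    name: str
    node: str
    load_pct: float  # share of node performance capacity this volume uses
    is_flexgroup_constituent: bool = False
    is_root: bool = False
    is_cloud_tiered: bool = False


def is_movable(volume: Volume) -> bool:
    """Assumed exclusion rule: skip the volume types named in the text."""
    return not (volume.is_flexgroup_constituent
                or volume.is_root
                or volume.is_cloud_tiered)


def suggest_volume_move(node_load: dict[str, float],
                        volumes: list[Volume]) -> Optional[tuple[str, str, str]]:
    """Return (volume, source node, destination node) or None.

    A move qualifies only if it shrinks the gap across the whole cluster,
    which rules out moves that merely shift the overload to another node.
    """
    if len(node_load) < 2:
        return None
    busiest = max(node_load, key=node_load.get)
    idlest = min(node_load, key=node_load.get)
    best, best_gap = None, node_load[busiest] - node_load[idlest]
    for volume in volumes:
        if volume.node != busiest or not is_movable(volume):
            continue
        projected = dict(node_load)
        projected[busiest] -= volume.load_pct
        projected[idlest] += volume.load_pct
        gap = max(projected.values()) - min(projected.values())
        if gap < best_gap:
            best, best_gap = volume, gap
    return (best.name, busiest, idlest) if best else None
```

Evaluating the gap over all nodes after the projected move, rather than only checking the source node, is what keeps the sketch from recommending a volume so large that the destination becomes the new hot spot. When no candidate qualifies, the function returns `None`, which corresponds to an event without a remediation suggestion.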

If no such volume is found, then the cluster-load imbalance event is not accompanied by a “Fix It” recommendation in the Management Action card on the dashboard. In that case, the event might have actionable verbal recommendations. The user can also troubleshoot the event by reviewing and comparing the effect of workloads on the cluster resources in the corresponding Performance Explorer pages.

Figure 2: Preventive Actions

Proactive Actions

If resource performance for the cluster node is found to be consistently at the optimal performance capacity level for a period of 24 hours, then the node over-utilization event “Performance Capacity Used Threshold Breached” is generated.
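A sustained-threshold check of this kind can be sketched in Python. The sketch assumes timestamped samples of performance capacity used for one node, and it interprets "consistently at the optimal level" as every sample in the last 24 hours being at or above that level. Both the interpretation and the name `sustained_at_or_above` are assumptions for illustration.

```python
from datetime import datetime, timedelta

SUSTAIN_WINDOW = timedelta(hours=24)  # period quoted in this post


def sustained_at_or_above(samples: list[tuple[datetime, float]],
                          optimal_pct: float,
                          now: datetime) -> bool:
    """True when the node stayed at or above `optimal_pct` for the window.

    Returns False when the history does not reach back far enough to
    cover the whole window, so a short burst cannot trigger the event.
    """
    window_start = now - SUSTAIN_WINDOW
    if not samples or min(ts for ts, _ in samples) > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return bool(recent) and all(value >= optimal_pct for value in recent)
```

A single sample below the optimal level inside the window resets the condition in this sketch; a real implementation might tolerate brief dips.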

This event notification has been part of Active IQ Unified Manager for several releases. The 9.8 release adds a new action to the resource lifecycle management framework that limits the growth of demand in the node by introducing QoS limits on a subset of workloads. Workloads with the most performance impact and the suggested QoS limits based on historical analysis are shown on the Management Actions widget on the dashboard.

If the QoS policies for the selected workloads are already set and the current demand is higher than the existing limits, then Active IQ Unified Manager analysis suggests new QoS limits to apply to the workloads. The new QoS limits ensure that the performance consumption of the workloads is limited without risking over-utilization due to additional growth. If the demand on the node resource is reduced, then the QoS limits are removed or the QoS levels revert to the previous level.
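The post does not say how the suggested limits are derived from history, so the following Python sketch swaps in a simple stand-in: a nearest-rank percentile of the workload's historical IOPS. The function name, the choice of the 95th percentile, and the use of IOPS as the unit are all assumptions. The only behavior taken from the text is that a new limit is suggested when no limit exists or when current demand is higher than the existing limit.

```python
import math
from typing import Optional


def suggest_qos_limit(history_iops: list[float],
                      current_demand: float,
                      existing_limit: Optional[float] = None,
                      percentile: int = 95) -> Optional[float]:
    """Suggest a throughput ceiling from historical demand, or None.

    Assumption: the ceiling is the nearest-rank `percentile` of the
    history. No suggestion is made when an existing limit already sits
    at or above the current demand.
    """
    if existing_limit is not None and current_demand <= existing_limit:
        return None
    if not history_iops:
        raise ValueError("history_iops must not be empty")
    ordered = sorted(history_iops)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]
```

Anchoring the limit to past behavior rather than to current demand is one way to cap further growth without throttling the workload's usual activity.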

Figure 3: Proactive Actions

Reactive Actions

In some situations, demand can rise beyond the optimal resource performance capacity due to one workload growing abnormally. In this situation, Active IQ Unified Manager generates an event that suggests remediation for a workload producing abnormally high demand that affects the performance of other workloads in the node.

Active IQ Unified Manager recommends setting a QoS limit on the abnormal workload so that the nodes can serve workloads with the required performance levels and maintain optimal resource performance. After the new QoS limits are set, the performance capacity allocated to the workloads stays strictly within the QoS limit so that the other workloads continue to behave and operate normally. When the demand on the node is reduced due to changes in usage patterns, the QoS limits are removed or reverted to the previous limits.
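The apply-then-revert behavior described above can be sketched as a small bookkeeping class in Python. It is an illustration only: `QosLimitTracker` and its methods are hypothetical names, limits are plain numbers rather than ONTAP QoS policy groups, and the decision about when demand has dropped far enough to revert is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QosLimitTracker:
    """Remembers the limit in force before a remediation was applied."""
    current: dict[str, float] = field(default_factory=dict)
    previous: dict[str, Optional[float]] = field(default_factory=dict)

    def apply(self, workload: str, new_limit: float) -> None:
        # Record the earlier limit only once, so repeated remediations
        # still revert to the value that was set before the first one.
        if workload not in self.previous:
            self.previous[workload] = self.current.get(workload)
        self.current[workload] = new_limit

    def revert(self, workload: str) -> None:
        if workload not in self.previous:
            return  # nothing was applied, nothing to undo
        prior = self.previous.pop(workload)
        if prior is None:
            self.current.pop(workload, None)  # no earlier limit: remove it
        else:
            self.current[workload] = prior    # restore the earlier limit

    def served(self, workload: str, demand: float) -> float:
        """Throughput actually served: demand capped by the active limit."""
        limit = self.current.get(workload)
        return demand if limit is None else min(demand, limit)
```

A workload with no earlier limit ends up unlimited again after `revert`, while a workload that already had a limit gets that limit back, which matches the "removed or reverted" wording above.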

Figure 4: Reactive Actions

There is More!

We hope that you now have an overall understanding of the Resource Performance Lifecycle Management feature that we have introduced in Active IQ Unified Manager.

Self-Managing Storage: Part 1 – Understanding Active IQ Unified Manager LifeCycle Management
Self-Managing Storage: Part 2 – Understanding Storage Resource Performance LifeCycle Management
Self-Managing Storage: Part 3 – Understanding Workload Performance LifeCycle Management
Self-Managing Storage: Part 4 – Understanding Capacity LifeCycle Management
Self-Managing Storage: Part 5 – Understanding Security Manager LifeCycle Management

Keep an eye out for this blog series to get an in-depth understanding of performance, capacity, and security lifecycle management. We know that you might have questions because we can’t cover the entire topic in these short blog posts, so please contact us for more information.