Flying Through Clouds 9: OnCommand Performance Manager and Performance Troubleshooting Part 2

By Dan Chilton and Bhavik Desai, NetApp Solutions Performance Engineers

In case you didn't notice, NetApp moved to a new blog platform last week.  This caused a small delay in getting part 2 of this blog series on OPM out to you, but the new platform is cool.  Thanks for your patience, and we hope you like the new style.

In part 1, we talked about the foundations of OnCommand Performance Manager (OPM) and began examining an incident.  In part 2, we will look further through an incident report to help the victim by identifying the bully workload.

Let’s start with some definitions:

V is for Victim – This is a user workload (on a volume) that has suffered painful above-trend latency due to some bully component on the cluster.  To be identified as a victim, the response time must have increased greatly over its normal response time trend.

B is for Bully – This is a system or user workload whose increased use of a cluster resource above its baseline utilization has caused the performance of other workloads (victims) to decrease.

S is for Shark – A heavy user workload that under normal circumstances consumes a lot of cluster component resources compared with other user workloads.  Using the fishbowl analogy, having many sharks on a single shared resource (such as an aggregate) can impact the cluster’s ability to smooth out peaks in demand.

C is for Contention – When some workloads consume so much of a cluster component that other workloads suffer, the component is in “contention”.
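To make these definitions concrete, the labels can be sketched as a simple rule set.  The following Python sketch is purely illustrative: the field names, the shark threshold, and the classification rules are our own simplification, not OPM’s actual trend-based analysis.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    latency_ms: float                    # measured response time (ms/op)
    latency_range: tuple[float, float]   # expected (trend-based) range
    util_pct: float                      # share of the contended component
    util_range: tuple[float, float]      # expected utilization range

def classify(w: Workload, shark_util_pct: float = 20.0) -> set[str]:
    """Label a workload V (victim), B (bully), and/or S (shark).

    Hypothetical rules for illustration only -- OPM's real analysis
    is trend-based and more sophisticated than these comparisons.
    """
    labels = set()
    if w.latency_ms > w.latency_range[1]:
        labels.add("V")    # latency broke above its expected range
    if w.util_pct > w.util_range[1]:
        labels.add("B")    # utilization broke above its expected range
    if w.util_pct >= shark_util_pct:
        labels.add("S")    # consistently heavy consumer of the component
    return labels

# A workload can carry more than one label at once, as we'll see below.
w = Workload("vol_example", latency_ms=14.32, latency_range=(0.0, 1.34),
             util_pct=6.06, util_range=(1.0, 5.0))
print(sorted(classify(w)))   # both above-trend latency and utilization
```

The key point the sketch captures: the labels are independent, so a single volume can be a bully on utilization and a victim on latency at the same time.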


Workload Details – Investigating the Incident 

With these definitions understood, let’s look at another incident.  The incident description states, “18 victim volumes are slow due to 25 bully volumes causing contention on data_aggr_n03” as seen in Figure 1.


Figure 1 Incident Summary


The first step in investigating the incident is to move down to the workload details section, as seen in Figure 2 (OPM version 1.1).  As we move from left to right, we can identify several key components.  The first is the workload, which in this example is named vol_oltp004.  The second column heading has a drop-down arrow to filter by different categories; the default filter is “Victims – Peak Deviation in Response Time”.  We see that the workload vol_oltp004 is identified as a “Victim” during the time of the incident because it is marked with the letter “V”.  The third column has three dots that provide a quick view of the severity of the response time impact.  The default color is gray, but as the number of red dots increases, the response time impact from the “Bully” workload is greater.  The next column is a small graph of the response time plotted in blue over time.  The red shaded area shows the incident period.  Above the graph, the text describes why this is a “Victim” volume: the “Actual” response time peaked at 14.53 ms/op during the incident, which is outside the Expected Range of 0 – 12.05 ms.



Figure 2 Workload Details - “Victim” Volume


It’s important to note that the incident is based on trending latency or utilization metrics measured over time.  This is all automatic: if latency breaks above its trend, OPM logs an incident.  Nevertheless, if the latency is still within the acceptable range agreed to in the service level objectives for this workload, then you can stop here.  Don’t go looking for problems if you don’t have to.  If the latency is of concern to your workload (application), then proceed.
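That stop-or-proceed decision is really a two-condition gate: an incident is only worth chasing when latency is both above its trend and above the SLO.  A minimal sketch, assuming hypothetical SLO values (the function name and thresholds are ours, not OPM’s):

```python
def needs_investigation(actual_latency_ms: float,
                        expected_max_ms: float,
                        slo_max_ms: float) -> bool:
    """Return True only when latency is both above trend AND above the SLO.

    Illustrative logic only: OPM logs the incident automatically when the
    trend is broken, but whether to act on it is an operator decision.
    """
    above_trend = actual_latency_ms > expected_max_ms   # OPM logged an incident
    above_slo = actual_latency_ms > slo_max_ms          # application users hurt
    return above_trend and above_slo

# The victim peaked at 14.53 ms/op; its expected range topped out at 12.05 ms.
print(needs_investigation(14.53, 12.05, slo_max_ms=20.0))  # within SLO: stop here
print(needs_investigation(14.53, 12.05, slo_max_ms=10.0))  # SLO violated: dig in
```

With a 20 ms SLO the gate returns False and you can stop; with a 10 ms SLO it returns True and the bully hunt begins.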

Finding the “Bully”

The second step is to identify the bully workloads by changing the drop-down arrow to “Bullies – Peak Deviation in Utilization”.  The “Bully” is a workload that has increased its activity, and this increase is causing painful latency for other workloads.  As you can see, the volume vol_oltp025 is listed with a “B”, indicating “Bully”.  The utilization is depicted on the right graph: the Expected Range for this volume was <1 – 5%, but the Actual reported utilization was 6.06%.  In this case, because the component in contention is the aggregate, the utilization reported is the percentage of the total aggregate utilization consumed by this volume.  Stay with me here, because this is counterintuitive: this volume is listed with a “V” as well, indicating it is also a “Victim”.  The reason it is designated a “Victim” is that its response time ballooned to 14.32 ms/op when the Response Time Expected Range was 0 – 1.34 ms/op.  These dramatic changes in both utilization and response time indicate that the start of the vol_oltp025 workload strongly impacted existing workloads like vol_oltp004, causing performance degradation.
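The “Peak Deviation in Utilization” sort ordering can be thought of as ranking each volume by how far its share of the aggregate climbed above its expected range.  A hypothetical sketch (only vol_oltp025’s numbers come from this incident; the peer volumes and their figures are made up for illustration):

```python
def peak_deviation(actual_pct: float, expected_max_pct: float) -> float:
    """How far above its expected range a workload's utilization climbed."""
    return max(0.0, actual_pct - expected_max_pct)

workloads = {
    # name: (actual utilization %, top of expected range %)
    "vol_oltp025": (6.06, 5.0),   # from the incident in Figure 3
    "vol_oltp099": (3.10, 4.0),   # made-up peer, within range
    "vol_oltp042": (7.50, 7.0),   # made-up peer, slightly over
}

# Rank by peak deviation, largest first -- the top entries are the bullies.
bullies = sorted(
    ((name, peak_deviation(u, e)) for name, (u, e) in workloads.items()),
    key=lambda item: item[1], reverse=True,
)
for name, dev in bullies:
    if dev > 0:
        print(f"{name}: +{dev:.2f} points above expected utilization")
```

Note the deviation is measured against each volume’s own expected range, which is why a 6.06% consumer can out-rank an absolutely larger one that stayed within its baseline.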

Figure 3 “Bully” Volume


With these two steps, we have identified one of the “Bully” workloads whose significant increase in utilization caused latency increases on the other volumes in the same aggregate.  Keep in mind that the cause of a latency increase can be a “Bully” workload or a bottleneck in some other storage component.  Even features like deduplication and replication, which require processing resources, can impact user workloads.  OPM can help identify these bottlenecks and make recommendations for more efficient resource scheduling.


Stay tuned for part 3, when we’ll go deep into OPM volume stats and talk about integration with OCI.