Flying Through Clouds 9: OnCommand Performance Manager and Performance Troubleshooting Part 3

By Dan Chilton and Bhavik Desai, NetApp Solutions Performance Engineers

In part 2, we examined the incident report to help the “Victim” and identify the “Bully” workload. In this final part, we dig deeper into the volume stats of one of the “Bully” volumes, vol_oltp025, at the time of the incident. In Figure 1 below, the volume vol_oltp025 is located on aggregate data_aggr_n03. The “Events List” on the right-hand side of the page shows the incidents, each marked by a red dot with an X in it. The brief incident description states, “Contention on Aggregate storm/storm03/DISK_HDD_data_aggr_n03…”.

Analyzing the Bully Workload


Figure 1 Volume Detail View

Figure 2 shows a magnified view of the first graph, which is the volume performance data captured over time. It has a slider bar that allows you to focus on a large or small time window. It even lets you move back up to 90 days to examine historical events.


Figure 2 Volume Performance Data Over Time


Figure 3 shows response time (latency) on the top graph and ops/sec on the bottom graph over time. The red dot on the response time graph marks the start of the incident, and the red line shows the remainder of the incident.



Figure 3 Volume Response Time (ms/op) and Operations (ops/sec)

We can see that during this time, the volume workload increased quickly and dramatically from zero to an average of 410 ops/sec. The response time went from 0 to 14 ms/op. Now, pay attention here: by clicking on the dot, we get a pop-up chart showing the component in contention, in this case the aggregate (disks).



Figure 4 Pop-up Chart Showing Component in Contention
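To put the 410 ops/sec and 14 ms/op figures in perspective, Little's Law (average operations in flight = throughput × response time) gives a rough sense of how much work this volume kept queued against the aggregate. The short Python sketch below is only a back-of-the-envelope illustration; the function name is made up for this example and is not part of OPM.

```python
def outstanding_ops(ops_per_sec: float, latency_ms: float) -> float:
    """Little's Law: average ops in flight = throughput * response time."""
    return ops_per_sec * (latency_ms / 1000.0)  # convert ms to seconds


# Figures quoted for vol_oltp025 during the incident.
in_flight = outstanding_ops(ops_per_sec=410, latency_ms=14)
print(f"~{in_flight:.1f} operations in flight on average")
```

That works out to roughly 5.7 operations outstanding at any moment from this one volume, on top of whatever the other workloads on the aggregate were doing.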

Moving further down the volume summary page, Figure 5 shows the response time (latency) of all cluster components in the stack. This is really deep stuff, but you can see precisely where the latency is coming from in the ONTAP stack. The default view has all boxes checked, so you can uncheck boxes one by one until you see the source. In this case, the aggregate (disk) is the largest contributor to response time, as it has the largest area under the graph.



Figure 5 Response Time of all Cluster components
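The "largest area under the graph" comparison can also be done numerically if you have the per-component latency samples in hand. The Python sketch below is a minimal illustration using the trapezoidal rule; the component labels and sample values are invented for the example and do not come from OPM or from this incident.

```python
from typing import Dict, List


def area_under_curve(times_s: List[float], latency_ms: List[float]) -> float:
    """Trapezoidal-rule area under a latency-vs-time curve (units: ms * s)."""
    area = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        area += 0.5 * (latency_ms[i] + latency_ms[i - 1]) * dt
    return area


def dominant_component(times_s: List[float], series: Dict[str, List[float]]) -> str:
    """Return the component whose latency curve encloses the largest area."""
    areas = {name: area_under_curve(times_s, vals) for name, vals in series.items()}
    return max(areas, key=areas.get)


# Hypothetical samples taken every 5 minutes (values are made up).
times = [0, 300, 600, 900]
samples = {
    "network": [0.2, 0.2, 0.3, 0.2],
    "data_processing": [0.5, 0.6, 0.8, 0.6],
    "aggregate": [1.0, 6.0, 12.0, 9.0],
}
print(dominant_component(times, samples))
```

With these made-up numbers the aggregate dominates, which mirrors what the stacked graph shows visually.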

Finally, as shown in Figure 6, you can add additional performance stats by selecting the “Break Down data by” drop-down arrow. This adds further graphs of stats, such as disk utilization, cache hit ratio, etc.



Figure 6 Adding Performance Stats for the Workload (Volume)

In looking at this incident, we saw bursts of 99% disk utilization. Remember when we talked about keeping system performance happy by not overfilling the buckets in the storage performance philosophy blog? If <50% disk utilization keeps the storage controller happy, then this system is deeply troubled. It might be time to apply a QoS policy to the bully (calm the storm), vol move the workload to an All Flash FAS, or add some fast SSDs to this storage controller.
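If you export disk utilization samples, the 50% rule of thumb is easy to check in a few lines. The Python sketch below is a minimal example; the sample values are invented for illustration (only the 99% peak and the 50% guideline come from the discussion above), and the function name is hypothetical.

```python
def busy_samples(utilization_pct, threshold_pct=50.0):
    """Return (index, value) pairs for samples above the utilization threshold."""
    return [(i, u) for i, u in enumerate(utilization_pct) if u > threshold_pct]


# Hypothetical disk utilization samples (percent) around an incident.
disk_util = [22.0, 35.0, 99.0, 48.0, 97.0, 99.0, 40.0]

over = busy_samples(disk_util)
print(f"{len(over)} of {len(disk_util)} samples above 50% "
      f"(peak {max(disk_util):.0f}%)")
```

A handful of samples pinned near 99% is the kind of pattern that suggests the aggregate has no headroom left for the other workloads sharing it.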

So, I hope you enjoyed this quick overview of OPM. As you can see, for the first time we are providing you with deep performance analytics and troubleshooting capability. With OPM 1.1 coming out soon, we will add the ability for APIs to connect to this data so that it can be pushed or pulled into other monitoring applications like OCI, Graphite, and Splunk. That way you can consume this data from one central location in whatever your favorite monitoring application is. Until next time, stay tuned.