Flying Through Clouds 9: OnCommand Performance Manager and New Approaches to Performance Troubleshooting Part 1

By Dan Chilton and Bhavik Desai, NetApp Solutions Performance Engineers

 

We hope that you enjoyed the storage stories and that they inspired you with new ways to solve your business problems using technologies like Clustered Data ONTAP, vol move, AFF, and QoS. We certainly had fun writing them and are having even more fun turning them into video. More to come soon...

 

For today’s discussion, I thought it might be good to talk about new approaches to performance troubleshooting for Clustered ONTAP environments.  Within the past year, OnCommand Performance Manager (OPM) 1.0 was released, and OPM 1.1 arrives this fall.  If you are interested in a beta release of OPM 1.1, please talk to your NetApp SE, as it is available now.  So what is OPM?

 

OnCommand Performance Manager, or OPM, is a tool that provides deep storage performance troubleshooting capabilities for Clustered ONTAP.  It helps isolate potential problems and offers concrete solutions to performance issues based on its analysis of the system.  Whereas OnCommand Insight (OCI) goes wide, monitoring performance across virtualization providers, switches, and most enterprise storage arrays, OPM goes deep into the veins of Clustered ONTAP storage system performance.  For the first time in ONTAP performance monitoring history, OPM takes you, the customer, where only NetApp Support has gone before.

So let’s look at how to get started.  The steps are simple: plug it in, turn it on, and let it start monitoring your environment.  OPM installs in minutes and begins collecting detailed performance statistics for every volume in your cluster, assembling a trend view of what your performance is and should be.  All analysis within OPM is derived from the movement of latency, the assumption being that if latency stays within range, your users will be happy that things are “fast enough”.  If latency moves out of the accepted range by more than the trended value, then OPM registers an incident.  Incidents let you know that performance may not be where it should be.
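
To make the idea concrete, here is a minimal sketch of trend-based incident detection. This is not OPM’s actual algorithm; the function names, the three-sigma band, and the sample numbers are illustrative assumptions only.

    from statistics import mean, stdev

    def expected_range(history_ms, band=3.0):
        # Derive an expected latency range from recent samples (illustrative only).
        mu, sigma = mean(history_ms), stdev(history_ms)
        return mu - band * sigma, mu + band * sigma

    def register_incident(volume, latency_ms, history_ms):
        # Flag the volume when its current latency exceeds the trended range.
        low, high = expected_range(history_ms)
        if latency_ms > high:
            print(f"Incident: {volume} latency {latency_ms:.1f} ms exceeds expected max {high:.1f} ms")
            return True
        return False

    # Example: a volume trending around 2 ms suddenly spikes to 9 ms.
    history = [1.8, 2.1, 2.0, 1.9, 2.2, 2.0, 1.9, 2.1]
    register_incident("vol_exchange037", 9.0, history)

The point is simply that the baseline comes from the volume’s own history, so “slow” is always judged relative to what that workload normally does.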

 

Incidents

 

When an incident occurs, look at the incident details page. Each incident is based on the latency of a volume. For example, let’s say “volume vol_exchange037 is slow” for some reason.  While the problem description is pretty simple, the analytics run deep. Historical performance data is trended, analyzed, and compared to determine why.

 

This is a three-phase analysis:

  1. Where has an incident occurred, and what is impacted? In this example, the volume (vol_exchange037) is slow because vol_oltp037 is causing contention on data_aggr_n03. If multiple volumes are affected, they are reported.
  2. What component(s) are in contention?  The latency for the volume is the sum of the latencies of the different components. Each component is examined to determine whether a latency increase from its historical trend has caused this incident (see the sketch after this list). These components make up the data storage stack for Clustered ONTAP.  I told you we were going deep.
    1. Network – latency at the iSCSI or FCP network layer between the server and the storage controller
    2. Network Processing – latency at the software layer in ONTAP that handles the networking protocols
    3. Policy Group – latency at the user-applied QoS policy layer (if applicable)
    4. Cluster Interconnect – latency at the cluster interconnect layer between nodes (if applicable)
    5. Data Processing – latency at the software layer between the cluster and the storage aggregate (WAFL)
    6. Aggregate – latency at the disk layer for the disks serving this workload
  3. Examine the Workload Details for the component in contention to determine why that contention is occurring.
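
As a rough illustration of step 2, here is a hypothetical sketch of how a component in contention might be identified once per-component latencies are available. The component names follow the list above, but the numbers, variable names, and the “largest increase over baseline” heuristic are assumptions for illustration, not OPM’s internal logic.

    # Hypothetical per-component latency breakdown for one volume, in milliseconds.
    baseline_ms = {"Network": 0.3, "Network Processing": 0.2, "Policy Group": 0.0,
                   "Cluster Interconnect": 0.1, "Data Processing": 0.4, "Aggregate": 1.0}
    current_ms  = {"Network": 0.3, "Network Processing": 0.2, "Policy Group": 0.0,
                   "Cluster Interconnect": 0.1, "Data Processing": 0.5, "Aggregate": 7.9}

    # The volume's latency is the sum of its component latencies.
    total = sum(current_ms.values())

    # Flag the component with the largest increase over its historical baseline.
    suspect = max(current_ms, key=lambda c: current_ms[c] - baseline_ms[c])
    print(f"Total latency: {total:.1f} ms; component in contention: {suspect} "
          f"(+{current_ms[suspect] - baseline_ms[suspect]:.1f} ms over baseline)")

In this made-up example the Aggregate component accounts for almost all of the added latency, which is exactly the kind of conclusion the Workload Details view helps you confirm.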

 

With this analysis complete, you have a great starting point for troubleshooting where the problem is occurring on the SAN or the storage controller.  Stay tuned for Part 2. It’s going to get scary when we talk about Sharks, Bullies, and Victims.