Are performance bottlenecks getting you down?

Bottlenecks are difficult to find and even harder to predict in a virtualized environment. If storage is always getting blamed for being the bottleneck, how do you know whether it really is? Do you have the visibility and deep storage analysis to validate the source of the bottleneck and report back to the team?

For example, do you really care that a VM is doing 1,500 I/Os per second, or do you care whether that rate is normal and healthy?  OnCommand Balance color-codes the health of your application workload across the CPU, memory, and storage I/O paths, telling you whether there is a bottleneck and who’s affected.

This is done by simplifying complicated math into three key performance indicators:

· Infrastructure Response Time
· Performance Index
· Disk Utilization

In this post, I’ll show you how Infrastructure Response Time is used to determine the source of a bottleneck. 

Understanding Infrastructure Response Time

Infrastructure Response Time is the total time it takes for an I/O to be completed – and this is really what business and application users care about. They want to know how long it takes the I/O to make its round trip – from application to virtual server to physical server to storage array to disk group and back. 
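
To make that concrete, here’s a minimal sketch of the idea, with entirely hypothetical hop names and latency numbers (this is not Balance’s internals): the round-trip time is just the sum of the time an I/O spends at each hop, and the per-hop breakdown tells you where to look for the bottleneck.

```python
# Minimal sketch (hypothetical, not Balance's internals): model
# Infrastructure Response Time as the sum of per-hop latencies
# along the I/O round trip described above.

def infrastructure_response_time(hop_latencies_ms):
    """Total round-trip time for one I/O, in milliseconds."""
    return sum(hop_latencies_ms.values())

io_path_ms = {
    "application":     0.2,  # time queued in the app / guest OS
    "virtual_server":  0.3,  # hypervisor virtual SCSI layer
    "physical_server": 0.5,  # host HBA and I/O stack
    "storage_array":   1.1,  # array controller and cache
    "disk_group":      4.9,  # physical disk service time
}

print(f"IRT: {infrastructure_response_time(io_path_ms):.1f} ms")
# IRT: 7.0 ms, and the per-hop breakdown shows the disk group
# dominating, which is where you'd start investigating.
```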

It shows the performance delivered to an application by all of the resources assigned to it.  You can use it to baseline current service, predict future service levels, and even alert on deviations.  Point solutions that look at individual elements of the infrastructure could never deliver this level of analytic performance data.

Where do I start?

Balance makes it easy to figure out where to start.  The screenshot below is the Balance Servers home page; under the “VMs in Trouble or At Risk” heading are the VMs you need to pay attention to today.

In this example, one of the critical development servers for the Blackbox7 software product has shown up with a high Performance Index.  It’s time to investigate whether this is a one-time issue or an ongoing one.

So, I look at Infrastructure Response Time.  It’s so powerful because it gives you the infrastructure’s view of how long it takes to complete its units of work, both CPU and I/O.

In the example below, you can see a huge spike in Infrastructure Response Time caused by added workload: more CPU demand and more I/O transactions being pushed onto the system.

Could this have been predicted?

Balance’s Abnormality Index predicts when applications are behaving inconsistently and alerts you for proactive resolution.  It automatically calculates and updates a “normal” Infrastructure Response Time profile over time, which can be used as a predictive dynamic threshold for service-level management.  The forecast requires 10 days of data, and the data is retained for one year in a MySQL database.  Full analysis includes past and forecasted service levels for Read and Write IOPS, Total IOPS, and Total Response Time.
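
As a rough illustration of a dynamic threshold (this is not Balance’s actual algorithm, and the function names are invented), the sketch below builds an hour-of-day “normal” response-time profile from roughly 10 days of samples, then derives upper and lower bounds from the per-hour mean and standard deviation:

```python
# Illustrative dynamic-threshold sketch; not Balance's algorithm.
from collections import defaultdict
from statistics import mean, stdev

def build_profile(samples):
    """samples: iterable of (hour_of_day, response_time_ms) pairs,
    e.g. ~10 days of history. Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, rt in samples:
        by_hour[hour].append(rt)
    # stdev needs at least two samples per hour
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) > 1}

def thresholds(profile, hour, n_sigma=1.0):
    """Dynamic (lower, upper) thresholds for a given hour of day."""
    mu, sigma = profile[hour]
    return mu - n_sigma * sigma, mu + n_sigma * sigma

# Hypothetical usage:
# profile = build_profile(load_ten_days_of_samples())  # invented loader
# lo, hi = thresholds(profile, hour=9)
```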

On the Total Response Time Forecast chart below, we see a combination of analytics: in blue, the 48-hour forecast data; in green and yellow, the standard deviations above and below the forecast, representing the upper and lower thresholds respectively; and in red, the actual data for the period.  By looking back at what is “normal,” Balance predicts and charts service levels for the next 48 hours based on past performance.

And when actual performance falls outside the upper or lower threshold, Balance generates an alert flagging the abnormality.
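
The check itself is then trivial; a hypothetical version, reusing the forecast value and its standard deviation as the bands from the chart above, might look like this:

```python
# Hypothetical abnormality check: alert when the actual value falls
# outside the forecast's standard-deviation bands.
def check_abnormality(actual_ms, forecast_ms, sigma_ms, n_sigma=1.0):
    upper = forecast_ms + n_sigma * sigma_ms
    lower = forecast_ms - n_sigma * sigma_ms
    if actual_ms > upper:
        return f"ALERT: {actual_ms:.1f} ms is above the upper threshold ({upper:.1f} ms)"
    if actual_ms < lower:
        return f"ALERT: {actual_ms:.1f} ms is below the lower threshold ({lower:.1f} ms)"
    return None  # within the normal band

print(check_abnormality(actual_ms=9.4, forecast_ms=6.0, sigma_ms=1.5))
# ALERT: 9.4 ms is above the upper threshold (7.5 ms)
```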

What was the impact on CPU?

As you probably know, virtual environments are like a balloon: you blow them full of desktops and other servers, and then the end users push in on the infrastructure, which alters the characteristics of utilization and sharing.  In the screenshot below, we see the impact of some of this pushing and pulling as a result of this bottleneck: CPU contention spiked for a moment, which left the VDI guest developer system waiting for CPU 100% of the time.  That means it was locked up and unable to draw on the shared “pool,” even though there was plenty of spare capacity to be had.

Further, it started a little pin-wheeling of the VDI environment, driven by the Distributed Resource Scheduler (DRS). As a result, this VM was vMotioned not once but twice within a few hours, showing that the cluster was really struggling at that point.  We can also see that things only got worse over the next few days.

Bottlenecks are easy to find and prevent when you have OnCommand Balance

Running Balance in your infrastructure is the easiest way to CYA.  Balance provides the predictability and confidence that let you virtualize the business-critical applications you’re worried about and make time for other projects.  And it’s so affordable for NetApp customers, you’d be crazy not to check it out.  If you’re considering purchasing a FAS2220, Balance is half off with it now through December 31, 2012.