VMware Solutions Discussions

How to interpret Average Latency values in Operations Manager (DFM)



I am facing a performance issue with some of the LUNs assigned to our ESX estate.

When I look at the LUN Average Latency values in Operations Manager, some of the LUNs show 64 milliseconds.  If I average over 3 months, the value goes up to 117 milliseconds for those LUNs.

My question is how to read or interpret these latency values for LUNs, volumes, aggregates, CIFS, and NFS in Operations Manager.




Keep in mind that reports in Operations Manager are based on sample collections which are then rolled up into averages.  For example, the Operations Manager server daily history for LUN latency is an average over every 15 minutes.  Monthly history is an average that covers every 8 hours.  So you're looking at numbers which may include spikes that throw off the average.  You'd be better off using Performance Advisor to view your historical LUN latency, as it does not roll up counters into averages.  It shows each data point it collects, usually at 1-minute intervals.  Then you can define thresholds and alarms to notify you when latency exceeds a certain value.
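To make the roll-up effect concrete, here is a minimal Python sketch. The sample values and the `roll_up` helper are made up for illustration; this is not how Operations Manager is implemented, only the arithmetic of averaging 1-minute samples into 15-minute buckets.

```python
from statistics import mean

def roll_up(samples, window):
    """Average consecutive, non-overlapping windows of samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]

# Hypothetical 1-minute LUN latency samples in ms:
# mostly 5 ms, with one 3-minute spike to 200 ms.
one_minute = [5.0] * 12 + [200.0] * 3 + [5.0] * 15

fifteen_minute = roll_up(one_minute, 15)

print(max(one_minute))   # 200.0 -> the peak is visible at 1-minute resolution
print(fifteen_minute)    # [44.0, 5.0] -> the same data after a 15-minute roll-up
```

The 15-minute average of 44 ms neither shows the 200 ms peak nor reflects the 5 ms the LUN delivered most of the time, which is why the rolled-up number is hard to interpret on its own.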

So, to answer your question about how to interpret them.  As you probably know, latency is a measure of how long it takes the object (LUN, volume, aggregate, or disk) to respond to an I/O request.  There is always some amount of mechanical latency on a physical disk drive, but latency can also be caused by other factors.  The generally accepted industry standard for disk latency is 20 ms or less, but it truly depends on the application and its requirements.  Some applications are more tolerant of disk latency than others.

For performance monitoring, I would focus on the LUNs that make up your ESX datastore.  There are canned views in Performance Advisor that show latency for LUNs.  If the LUN latency is greater than 15 or 20 ms for an extended period of time, I would suggest you start working your way down from the logical objects to the physical.  From the LUN, I would next look at the volume the LUNs are in.  If its latency is high as well, I would then look at the aggregate the volume is in.  Perhaps the aggregate doesn't have enough disks to meet the I/O requests.  Perhaps there are other volumes in the same aggregate that are so busy, they're impacting the performance of your ESX LUNs.  If the aggregate has enough disks that it should be performing well, the controller's CPU could be at its limit.
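If you export the 1-minute latency samples, the "greater than 20 ms for an extended period" check can be sketched in a few lines of Python. This is only an illustration: the `sustained_breaches` helper and the 10-minute definition of "extended" are my assumptions, not Performance Advisor settings.

```python
def sustained_breaches(samples, threshold_ms=20.0, min_minutes=10):
    """Return (start_index, length) for every run of consecutive 1-minute
    samples above threshold_ms that lasts at least min_minutes.

    threshold_ms and min_minutes are assumed values; tune them to what
    your application can tolerate.
    """
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value > threshold_ms:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            if i - start >= min_minutes:
                runs.append((start, i - start))
            start = None           # the run ended
    # Handle a run that is still open at the end of the data.
    if start is not None and len(samples) - start >= min_minutes:
        runs.append((start, len(samples) - start))
    return runs

# Hypothetical data: a 2-minute blip (ignored) and a 12-minute plateau (reported).
latency_ms = [5] * 5 + [30] * 2 + [5] * 3 + [25] * 12 + [5] * 4
print(sustained_breaches(latency_ms))
```

Short blips are ignored and only the sustained plateau is reported, which is the point at which it makes sense to start walking down from the LUN to the volume and the aggregate.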

This is a very broad topic.  Let me know what other questions you have and we will go from there.



Thank you very much for your response.

The explanation was really useful.  But I have some questions.

Customers generally ask for a report covering the past week or the past month.  So, the DFM reports are not quite accurate.

1. What is the difference between volume IOPS and aggregate IOPS?  If I add up the IOPS of all volumes, does it match the aggregate IOPS?

2. Performance Advisor provides real-time data for that particular moment.  Can we view back-dated (historical) data in Performance Advisor?



I wouldn't say that DFM reports aren't accurate.  They're based on accurate information that has been rolled up into averages.  I'd just say they're not going to show you the short-term peaks that may be impacting your performance.

Performance Advisor does collect and retain performance data for long periods of time, so you can view historical performance data.  Most counters are kept for 1 year.  The nice thing about Performance Advisor is that it keeps *every* data point it collects (usually at 1-minute intervals) and does not roll up data into averages.  If you view LUN latency for the past month, it will show you every single value it collected for the past month.  That makes it very easy to see spikes in performance and exactly what time of day they happened, which in turn makes it much easier to troubleshoot the cause of the latency.

Asking to see average LUN latency for the past month is like asking to see the average MPG of your car for the past month.  How useful is that information if you're trying to troubleshoot a problem?  If the value is higher than you expected (or lower in the case of MPG), it doesn't tell you what is causing it or exactly when it's occurring.  You're better off using a LUN latency view in Performance Advisor and setting the range to 1 month.  It will show you exactly when latency is high, exactly how high it gets, and allows you to see patterns in the data.  You can zoom in on specific ranges of time.  There is also a "metrics" feature in Performance Advisor that allows you to apply metrics such as min, max, mean, and percentiles to the performance data.
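As a rough illustration of why min, max, and percentiles say more than the mean alone, here is a small Python sketch. The sample values are invented, and `percentile` and `summarize` are hypothetical helpers using a simple nearest-rank method; I'm not claiming this is how Performance Advisor computes its metrics.

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples):
    """Summary metrics for a list of latency samples (ms)."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
    }

# Hypothetical latency samples in ms with a single large spike.
latency_ms = [4, 5, 6, 5, 7, 5, 6, 150, 5, 6]
print(summarize(latency_ms))
```

With these numbers the mean comes out at 19.9 ms, which looks borderline, while the median is 5 ms and the max is 150 ms.  The LUN is healthy most of the time with one spike worth investigating, and that is exactly what a single monthly average can't tell you.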

In my opinion, the most valuable feature of Performance Advisor is the ability to set thresholds and alerts on performance data.  It allows you to be aware of performance issues - such as LUN latency - when they occur so you can be proactive.  You'll know about the problem before your customer does.

I hope this helps.  Let me know if you have other questions.


Difference between volume IOPS and aggregate IOPS:

The aggregate is the larger pool of capacity and performance from which your volumes are provisioned.  Therefore, if you're examining aggregate IOPS, you're looking at the total IOPS that aggregate is delivering to all of the volumes contained within it.  If you're looking at volume IOPS, it's just for that one volume, and it represents a portion of the parent aggregate's IOPS.

If a particular volume is not delivering the IOPS you expected, you could look at its parent aggregate.  If the parent aggregate is delivering the IOPS you expect, and the total IOPS are high, it could be that some other volume in the same aggregate is consuming a lot of I/O.  Performance and capacity in an aggregate are shared between the volumes.  If one volume is consuming a lot of IOPS, it could be denying performance to other volumes.  A single busy volume could also drive up the latency of other volumes in the aggregate.  At this point, you can decide either to add disks to the aggregate to increase its total performance, or to move a busy volume to a different aggregate to balance out your performance.

If you added up the IOPS for all volumes in an aggregate, you'd be close to the total aggregate IOPS.  However, there are other system-level activities that also count towards aggregate IOPS, such as consistency points and weekly RAID scrubs.  So I'd always expect aggregate IOPS to be slightly higher than the sum of all volume IOPS.
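Here is a quick Python sketch of that comparison.  The volume names and IOPS figures are made up, and `iops_breakdown` is a hypothetical helper; it just does the arithmetic on numbers you would read off the aggregate and volume views yourself.

```python
def iops_breakdown(aggregate_iops, volume_iops):
    """Compare aggregate IOPS with the sum of its volumes' IOPS.

    volume_iops maps volume name -> IOPS for every volume in the aggregate.
    """
    volume_total = sum(volume_iops.values())
    busiest = max(volume_iops, key=volume_iops.get)
    return {
        "volume_total": volume_total,
        # Whatever the volumes don't account for: system-level activity
        # such as consistency points or RAID scrubs.
        "system_overhead": aggregate_iops - volume_total,
        "busiest_volume": busiest,
        "busiest_share": volume_iops[busiest] / aggregate_iops,
    }

# Hypothetical readings taken over the same time window.
volumes = {"vol_esx01": 3200, "vol_esx02": 900, "vol_cifs": 400}
print(iops_breakdown(4700, volumes))
```

In this made-up case the volumes add up to 4500 IOPS against 4700 for the aggregate, and one volume accounts for roughly two thirds of the aggregate's work, which would make it the first candidate to move to another aggregate.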

I hope this helps.