Active IQ Unified Manager Discussions

Best way to drill into Balance content

b_lawson

So this is more of a general Balance question. I had our NetApp reps out for a Quarterly Business Technical Review, and we noticed a significant spike in latency and I/O that started on a particular day and lasted a few weeks before the trend dropped back down (and when it dropped, it was still 10% higher than the averages from before the spike). They all agreed that since I have Balance in the environment, it should be really easy to dig in and see what caused the spike. However, when I go into Balance and click into the filer that saw the spike, I am finding it very difficult to drill into the workload and pinpoint the cause.

A little about our environment: we are running a FAS2240 on Data ONTAP 8.1 in 7-Mode with 84 10k SAS drives split between the 2 heads. The filer is set up with a single aggr on each head, and the volume in question is a 3TB NFS volume that holds all our production VMs.

Could someone walk me through their process for drilling into a performance issue?

plauterb

If you did not get any Balance alerts, then even though there was a spike, Balance may not have had enough information to determine the cause.

Do you have Balance set up to send you email alerts? If not, you can view past alerts and analysis at the top of the storage page, or on the Admin -> Events page.

If there is an event from the time you are interested in, please post the details.

If not, you can still see which workloads were driving the storage at that time. Go to the Aggregate Summary page, Performance Summary tab. This shows the details of the storage and, below it, which workloads are active for the time window you select at the top. It is important to monitor all the hosts and guests; this level of detail is necessary to uncover which workload is causing the problem.
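If you end up copying the per-workload numbers out of that page for a window before the spike and a window during it, the comparison can be sketched roughly as below. This is only an illustration of the idea, not something Balance provides: the function name, the (timestamp, workload, IOPS) sample layout, the VM names, and the dates are all made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def rank_workloads_by_increase(samples, spike_start, spike_end):
    """Rank workloads by how much their average IOPS rose during the spike.

    samples: iterable of (timestamp, workload_name, iops) tuples.
    This layout is an assumption for the sketch, not a Balance export format.
    Returns a list of (workload, baseline_avg, spike_avg, increase),
    largest increase first.
    """
    before = defaultdict(list)
    during = defaultdict(list)
    for ts, workload, iops in samples:
        if spike_start <= ts <= spike_end:
            during[workload].append(iops)
        elif ts < spike_start:
            before[workload].append(iops)

    ranking = []
    for workload, values in during.items():
        # A workload with no samples before the spike gets a baseline of 0.
        baseline = mean(before[workload]) if workload in before else 0.0
        spike_avg = mean(values)
        ranking.append((workload, baseline, spike_avg, spike_avg - baseline))

    ranking.sort(key=lambda row: row[3], reverse=True)
    return ranking


# Hypothetical samples: two VMs, two readings before and two during the spike.
samples = [
    (datetime(2013, 1, 1), "vm-a", 100.0),
    (datetime(2013, 1, 2), "vm-a", 120.0),
    (datetime(2013, 1, 1), "vm-b", 200.0),
    (datetime(2013, 1, 2), "vm-b", 200.0),
    (datetime(2013, 1, 10), "vm-a", 500.0),
    (datetime(2013, 1, 11), "vm-a", 520.0),
    (datetime(2013, 1, 10), "vm-b", 210.0),
    (datetime(2013, 1, 11), "vm-b", 190.0),
]

ranking = rank_workloads_by_increase(
    samples, datetime(2013, 1, 10), datetime(2013, 1, 11)
)
for workload, baseline, spike_avg, increase in ranking:
    print(f"{workload}: baseline {baseline:.0f} IOPS, "
          f"spike {spike_avg:.0f} IOPS, increase {increase:.0f}")
```

The workload at the top of the list is the first one to look at. If no workload shows a clear increase, that points away from the hosts and guests.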

If this is not enough detail, the spike may not have been due to a host workload but to something internal to the storage system, such as dedupe. Since it went on for several weeks, though, I wonder whether dedupe can run that long...
