2011-01-11 02:35 AM
I am monitoring my two FAS3140 controllers, configured as a cluster, using DFM. Until a few weeks ago the values looked quite good, or perhaps I was just not familiar with its different kinds of reports. The problems started when I upgraded my controllers from 7.3.2 to 8.0.1R3.
Problems observed are:
1. With no change in load on either controller, one controller started showing NFS write latencies ranging anywhere from 500 to 5000 for different volumes. I was also surprised to see the root volume among the top offenders, whereas its partner shows no such latencies at all.
2. I opened a case with NetApp and they said there is a VM misalignment issue. My counter-question was: if all these latencies are caused by misaligned VMs, why are they so severe on the controller that serves only 50 VMs and absent on the one that hosts 150 VMs? I have now been waiting 4-5 days for their reply.
3. DFM shows different latencies for the same volume in different reports. The "Vol NAS performance summary" shows a different latency than a custom report in which I selected CIFS and NFS (read and write). What is the right way to check latencies from DFM?
4. Is there any command or PowerShell script with which I can see the latency of all volumes from the CLI?
5. I also observed that some volumes show latencies even when there are no IOPS against them. How is that possible?
6. What is the relationship between the different performance metrics: latency, IOPS and disk utilization?
2011-01-19 12:04 AM
Basically, I can understand your frustration. I also struggled a lot with NetApp support, and their first move was to blame misalignment (even though I had only 5 misaligned VMs). I would not count on support for help with that.
Your question is pretty hard to answer (imo), but let's wait for other, more experienced people.
2011-01-19 01:32 AM
First of all, I would advise you to check Performance Advisor; imho it gives a much better view of performance information (you will find it alongside the DFM web interface). Performance Advisor will let you create easy overviews across multiple volumes, and the same data can also be pulled from the CLI (look at the `dfm perf data retrieve` command, if memory serves).
Rogue latency on vols that have no apparent I/O against them is, 9 times out of 10, a scan of some kind running against the volume or aggregate. A good way to filter these out is to create an alert on a condition where high IOPS AND high latency are occurring at the same time; at my customer sites, the alerts would then all but disappear.
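The "IOPS AND latency" alert condition can be sketched as follows (hypothetical function and threshold names, not DFM alert syntax; a minimal illustration of the idea):

```python
def should_alert(iops, latency_ms, iops_threshold=100, latency_threshold_ms=20):
    """Fire only when the volume is actually doing work AND is slow.

    Latency readings on an effectively idle volume (IOPS below the
    threshold) are ignored, since those usually come from background
    scans rather than real client I/O.
    """
    return iops >= iops_threshold and latency_ms >= latency_threshold_ms

# A busy, slow volume alerts; an idle volume with "rogue" latency does not.
print(should_alert(iops=500, latency_ms=35))  # True
print(should_alert(iops=2, latency_ms=800))   # False
```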
Disk utilization: normally you can count on a certain number of IOPS per disk depending on the disk type (SATA; FC/SAS 10k, 15k; etc.), and the utilization percentage gives an indication of how much strain is being put on the disks. This is a little simplistic, but I find it covers most use cases. When you reach the maximum number of IOPS a disk can deliver at a reasonable latency, it can still give you more IOPS, but latency will go up exponentially (2-3 ms is normal; volume latency above 20 ms is not good news).
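The "latency goes up exponentially near the disk's IOPS ceiling" behaviour can be illustrated with a simple M/M/1 queueing model. This is a deliberate simplification (real disks and arrays behave more complexly), and the 175-IOPS figure below is just an assumed rule-of-thumb number for illustration:

```python
def mm1_latency_ms(iops, max_iops):
    """Average response time of an M/M/1 queue, in milliseconds.

    service time = 1/max_iops, utilization = iops/max_iops,
    latency = service_time / (1 - utilization).
    """
    if iops >= max_iops:
        raise ValueError("offered load at or above the disk's ceiling")
    service_time_s = 1.0 / max_iops
    utilization = iops / max_iops
    return 1000.0 * service_time_s / (1.0 - utilization)

# Assume a disk good for ~175 IOPS: latency stays flat until the load
# approaches the ceiling, then blows up (~8, 40, 200 and 1000 ms here).
for load in (50, 150, 170, 174):
    print(load, round(mm1_latency_ms(load, 175), 1))
```

The same shape is why a "busy > 65%" style rule of thumb works in practice: well before 100% utilization, each extra IOPS costs disproportionately more latency.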
2011-01-19 01:38 AM
Sorry, short of time, as I only logged on to collect an email message.
A quick and dirty list of tools to get you started:
DoT command manual
Have a look at the commands for
sysstat - what is maxed out (CPU, network, disk, ...)
lun stats - which LUN is creating the load (queue depth > 2 is bad)
statit - is there a hot disk, or are they all even? (busy > 65% is bad)
nfsstat - which VM NFS datastore is causing the problem
Download the latest and greatest perfstat and run the report.
- You can read the results yourself, since it creates a text file, but you will be better off sending the perfstat results to NetApp support and getting them to tell you what is wrong.
Will come back later with help on how to interpret the results.
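To illustrate the "what is maxed out" step that the commands above feed into, here is a minimal triage sketch. The threshold values are assumptions (rules of thumb, not NetApp-published limits), and the sample numbers are made up:

```python
def bottleneck(sample, thresholds=None):
    """Return the resource(s) that look maxed out in one sample.

    `sample` maps resource name -> utilization in percent, e.g. numbers
    read off a `sysstat` interval. Default thresholds are rough rules of
    thumb only; tune them for your own environment.
    """
    if thresholds is None:
        thresholds = {"cpu": 90, "disk": 65, "network": 80}
    return [r for r, pct in sample.items() if pct >= thresholds.get(r, 100)]

# One hypothetical interval: CPU and network fine, disks over the line.
print(bottleneck({"cpu": 45, "disk": 72, "network": 30}))  # ['disk']
```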
Hope it helps
2011-01-20 03:18 AM
Thanks for the above info. I will read the above commands.
Is there any tool which can read the perfstat data?
I don't want to depend on support guys and wait for their answers.
2011-01-20 04:37 AM
It is an 'internal tool' that reads perfstat output, so we 'customers' do not get access to it. However, the perfstat result is a text file, so you can look at the results and the commands it runs on the host and filer. If you are only just learning NetApp, you will be much better off waiting for NetApp to analyze the results.
I have been looking at perfstat results for years and still do not understand it all. Each new DoT release requires an 'improved' perfstat, but they are backwards compatible. Also, many of the items in the perfstat output are not described in detail and are internal to NetApp employees. I learnt all about the inner workings of filers by fixing performance problems based on perfstat data.
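Since the perfstat result is just a text file, a small script can at least carve it into per-command sections so you can jump straight to the sysstat or statit output. The section-marker string below is an assumption; check a real perfstat file from your DoT version and adjust the marker to match:

```python
def split_sections(text, marker="=-=-="):
    """Split perfstat-style output into a {header_line: body} dict.

    Assumes each section starts with a header line containing `marker`.
    Verify the marker against the perfstat version you actually run.
    """
    sections, header, body = {}, None, []
    for line in text.splitlines():
        if marker in line:
            if header is not None:
                sections[header] = "\n".join(body)
            header, body = line.strip(), []
        else:
            body.append(line)
    if header is not None:
        sections[header] = "\n".join(body)
    return sections

sample = "=-=-= sysstat =-=-=\ncpu 12%\n=-=-= statit =-=-=\ndisk busy 70%"
for name in split_sections(sample):
    print(name)
```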
The main thing is to understand WHAT resource your system is running out of: memory, CPU, disk I/O, network I/O.
Then understand WHY it has run out: asking too much of a small system, a bug, poor scheduling of the workload, etc.
Once this is understood, take the appropriate corrective action.
This will move the bottleneck somewhere else in the system, so repeat the process.
The important thing is to understand what normal operation on your filer looks like and what does not, i.e. a baseline. If you use CLI tools, creating the baseline is hard, but Performance Manager shows you the history for the area you are looking at. The newer versions also highlight system alerts on the performance graphs, which helps link the events together.
It is very interesting to see the IO pattern of a system change with the time of day and date, so be sure to include this with your baseline.
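The baseline idea above can be sketched in a few lines of Python (a hypothetical check, not a DFM or Performance Advisor feature; threshold values are assumptions). The point is that "abnormal" only means anything relative to history for the same volume and the same time-of-day bucket:

```python
from statistics import mean, stdev

def deviates(history, current, nsigmas=3):
    """Flag a latency reading that deviates from its baseline.

    `history` is a list of past readings (ms) for the same volume and
    the same time-of-day bucket, so regular hourly/daily patterns are
    baked into the baseline rather than flagged as anomalies.
    """
    if len(history) < 2:
        return False  # not enough data to call anything abnormal
    mu, sigma = mean(history), stdev(history)
    # Floor sigma at 0.5 ms so a very flat baseline does not over-alert.
    return abs(current - mu) > nsigmas * max(sigma, 0.5)

baseline = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3]  # ms, from a quiet week
print(deviates(baseline, 2.5))   # False: within normal variation
print(deviates(baseline, 40.0))  # True: worth investigating
```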
NetApp user groups are always good for this type of thing. http://communities.netapp.com/message/17369
Is there one in your area? http://communities.netapp.com/community/usergroups
2011-02-19 07:16 PM
Did you ever find an answer for your issue? I seem to be having the same issue on my FAS2020 after a recent upgrade to 7.3.5. It worked fine before the upgrade, but now I'm seeing huge spikes in latency every hour on my second filer. I see a small spike on my first filer, but it doesn't seem to affect performance. The disks on my first filer are SAS and the ones on the second are SATA. I contacted support and they said my IOPS are too high for my disks. Looking at Performance Manager, I do see IOPS spike at the same time, but I don't have anything running on my VMs that would cause a spike at 5 minutes past every hour. I thought it was caused by my hourly snapshots, but I disabled them and I'm still seeing the issue. I have dedupe set to run once a week, on Sundays. Are there any tasks that run hourly?