2010-02-16 05:29 PM
We are experiencing performance problems in our environment, and the evidence points to partial writes. We are seeing back-to-back CPs that are causing latency to spike above 500ms across all volumes on a filer.
We have contacted NetApp support and they have confirmed it is partial writes, probably caused by ESX. The filer is almost dedicated to ESX, so it has to be ESX. We know all our VMs are unaligned, but short of aligning thousands of VMs we want to target the few that are causing the most havoc.
How can we accurately narrow this down to the VM level and find which ones are causing us the most pain?
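For readers new to the term, here is a minimal sketch of what "unaligned" means in this context. It assumes 512-byte sectors and the 4 KB WAFL block size. The VM names and starting sectors are made up for illustration and are not taken from our environment.

```python
WAFL_BLOCK_BYTES = 4096  # WAFL block size
SECTOR_BYTES = 512       # assumed logical sector size of the virtual disk


def is_aligned(start_sector, sector_bytes=SECTOR_BYTES, block_bytes=WAFL_BLOCK_BYTES):
    """Return True if a partition starting at start_sector begins on a block boundary."""
    return (start_sector * sector_bytes) % block_bytes == 0


# Hypothetical inventory: VM name -> starting sector of its first partition.
# Sector 63 is the classic MBR default that produces misalignment.
vm_start_sectors = {"vm-a": 63, "vm-b": 64, "vm-c": 2048}

unaligned = sorted(name for name, s in vm_start_sectors.items() if not is_aligned(s))
print(unaligned)
```

A check like this only says whether a guest is misaligned, not how much write load it generates, so it would still need to be combined with per-VM IO statistics to pick targets.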
2010-02-18 08:03 AM
Which FAS are you working with?
How many hosts do you have?
How many VMs do you have?
I had a quick look at your sysstat output and noticed the following:
- row 2 needs to be shifted 4 columns to the right
- very high CPU load (more than 95%)
- NFS only
- very high Net in and out (in: 701142 KB/s, out: 484975 KB/s)
- very high Disk read and write (read: 991328 KB/s, write: 886479 KB/s)
Are you sure that these numbers are correct?
Assuming they are correct, I think your NetApp is overloaded or undersized.
2010-02-18 10:06 AM
Are you hosting the VMs on LUNs? Over NFS?
If you're curious about which ones are causing the most pain, collecting a perfstat and working with NetApp Support will be your best bet.
2010-02-21 06:33 PM
We have 42 ESX Hosts
We have approx. 1500 VMs
I think the filer is overloaded, yes, but I believe that is because of the number of partial writes that are occurring. If all VMs (especially the highest IO ones) were aligned, I would expect the filer to handle the given workload quite comfortably.
We are utilising NFS datastores.
I have engaged NetApp support, but I am asking here to try to get some information from people who may have experienced VM alignment problems before.
I have written a script that polls the filer every 15 minutes and reads the pw.over_limit stat from wafl_susp -w. At times this counter grows by 3000 counts/s (see attached graph over_limt). These large spikes correspond to when we see massive latency jumps on our filers (4am every day). We are still trying to work out what happens at that time to cause the massive IO spike (and subsequent latency spike), but I still believe the root cause is unaligned VMs. Any comments appreciated.
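The rate calculation behind the graph can be sketched roughly as below. This is not the actual polling script. It assumes the counter values have already been collected as (unix_seconds, value) pairs, and the sample numbers are invented to illustrate a 3000 counts/s interval.

```python
def over_limit_rates(samples):
    """Turn cumulative counter samples into per-second rates.

    samples: list of (unix_seconds, counter_value) in time order.
    Returns a list of (interval_end_time, counts_per_second).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        delta = c1 - c0
        # Skip bad intervals: non-increasing time, or a counter that went
        # backwards (assumed to mean the stats were reset).
        if dt <= 0 or delta < 0:
            continue
        rates.append((t1, delta / dt))
    return rates


def spikes(rates, threshold):
    """Return the intervals whose rate is at or above threshold."""
    return [(t, r) for t, r in rates if r >= threshold]


# Hypothetical samples taken 900 s (15 minutes) apart.
samples = [(0, 1000), (900, 10000), (1800, 2710000), (2700, 2719000)]
rates = over_limit_rates(samples)
print(spikes(rates, threshold=1000.0))
```

With a 15-minute poll the result is an average over the interval, so a short burst inside it would show up as a lower rate than its real peak.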
2010-04-27 06:29 AM
We have the exact same problem on our IBM N series (rebranded NetApp).
All our ESX hosts are using FC and almost all of our 1000+ virtual servers are unaligned. We see approx. 60 million pw.over_limit counts every 24 hours, and our latency goes from a few ms to more than a second if someone is doing excessive writes.
Recently we started aligning the virtual servers using software from Vizioncore, but it's a very time-consuming process and we expect to spend the next 6-12 months aligning.
We identified the top writers (LUNs) using Operations Manager, and our ESX guru found the virtual servers using the busiest LUNs.
We are about 10% done but haven't seen any major improvements yet. We're still optimistic!
2010-05-03 04:13 PM
Hi, this thread shows up at the top of "Netapp Latency Spikes" searches.
We have a 3040 cluster hosting 11 vSphere hosts with 200 VMs on NFS datastores.
We see latency spikes 3-4 times a month as reported by Operations Manager.
We hoped our upgrade from 188.8.131.52 to 7.3.3 last week would help, but on Saturday another spike of up to 1 second took out an NFS mount and several of the VMs.
We previously determined the High & medium IO VMs and either aligned them or migrated them to local disk - has NOT helped - still getting the spikes.
I have another case open with NetApp.
Following the notes in this thread, I ran the wafl_susp -w to check the pw.over_limit
Turns out ours is ZERO (is it relevant to NFS?)
I suspect an internal NetApp process is responsible for these (dedup?). We had it disabled on 184.108.40.206. 7.3.3 was supposed to fix this, so we re-enabled dedup after the upgrade.
And the latency spike outages are back.
Will share any info from the case.
Thanks for any tips,
2010-05-10 03:22 AM
We have seen excessive response times when the system took aggregate snapshots. Try comparing the aggr snap schedule to your response time problems.
Our aggr snap problem might be related to the misalignment.
"snap sched -A"
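One way to do that comparison is sketched below. It assumes you have read the scheduled hours off the snap sched -A output by hand and exported the latency spike times from your monitoring. The hours, timestamps, and 15-minute window here are hypothetical.

```python
from datetime import datetime, timedelta


def spikes_near_schedule(spike_times, scheduled_hours, window_minutes=15):
    """Return the spikes that start within window_minutes after a scheduled hour."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for ts in spike_times:
        for hour in scheduled_hours:
            # Top of the scheduled hour on the same day as the spike.
            start = ts.replace(hour=hour, minute=0, second=0, microsecond=0)
            if start <= ts <= start + window:
                hits.append(ts)
                break
    return hits


# Hypothetical inputs: snapshot hours and observed latency spike times.
scheduled_hours = [9, 14, 19]
spike_times = [datetime(2010, 5, 8, 14, 3), datetime(2010, 5, 8, 16, 40)]

hits = spikes_near_schedule(spike_times, scheduled_hours)
print(hits)
```

If most spikes land inside the window, the aggregate snapshot schedule is worth a closer look. If none do, it can probably be ruled out.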