We are experiencing performance problems in our environment, and the evidence points to partial writes. We are seeing back-to-back CPs that are causing latency to spike above 500ms across all volumes on a filer.
We have contacted NetApp support and they have confirmed it is partial writes, probably caused by ESX. The filer is almost dedicated to ESX, so ESX has to be the source. We know all our VMs are unaligned, but short of aligning thousands of VMs we want to target the few that are causing the most havoc.
How can we accurately narrow it down to the VM level and identify which ones are causing us the most pain?
I think the filer is overloaded, yes, but I believe it is because of the number of partial writes that are occurring. If all VMs (especially the highest-IO ones) were aligned, I would expect the filer to handle the given workload quite comfortably.
We are utilising NFS datastores.
I have engaged NetApp support, but I am asking here to try to get information from people who may have experienced VM alignment problems before.
I have written a script to poll the filer every 15 minutes and collect the pw.over_limit counter from wafl_susp -w. At times this number grows by 3,000 counts/s (see the attached over_limit graph). These large spikes correspond to the massive latency jumps we see on our filers (4am every day). We are still trying to work out what happens at that time to cause this massive IO spike (and subsequent latency spike), but I still believe the root cause is unaligned VMs. Any comments appreciated.
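In case it is useful to anyone else, here is a rough sketch of the kind of poller I mean. It assumes passwordless SSH to the filer, that wafl_susp -w is reachable at the current privilege level, and that a line of its output contains pw.over_limit followed by the counter value; the hostname and log path are placeholders.

#!/usr/bin/env python3
# Rough sketch of a pw.over_limit poller (assumptions: passwordless SSH to the
# filer, "wafl_susp -w" reachable at the current privilege level, and a line in
# its output containing "pw.over_limit" followed by the counter value).
import re
import subprocess
import time

FILER = "filer01"            # placeholder hostname
INTERVAL = 15 * 60           # poll every 15 minutes
LOGFILE = "pw_over_limit.csv"

def read_counter():
    out = subprocess.check_output(["ssh", FILER, "wafl_susp -w"]).decode()
    m = re.search(r"pw\.over_limit\D+(\d+)", out)
    return int(m.group(1)) if m else None

prev = None
while True:
    now = time.time()
    value = read_counter()
    if value is not None and prev is not None:
        # counts per second since the previous sample
        rate = (value - prev) / INTERVAL
        with open(LOGFILE, "a") as f:
            f.write("%d,%d,%.1f\n" % (now, value, rate))
    prev = value
    time.sleep(INTERVAL)

Graphing the rate column is what makes the 4am spikes stand out against the normal baseline.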
We have the exact same problem on our IBM N series (rebranded NetApp).
All our ESX hosts are using FC, and almost all of our 1000+ virtual servers are unaligned. We see approximately 60 million pw.over_limit hits every 24 hours, and our latency goes from a few ms to more than a second if someone is doing excessive writes.
Recently we started aligning the virtual servers using software from VisionCore, but it's a very time-consuming process and we expect to spend the next 6-12 months aligning.
We qualified the top writers (LUNs) using Operations Manager, and our ESX guru identified the virtual servers using the busiest LUNs.
We are about 10% done, but haven't seen any major improvements yet - we're still optimistic, though!
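If it helps anyone doing the same triage: misalignment is visible straight from the MBR partition table, so the check can be scripted rather than opening each guest. Below is a minimal sketch that reads the first sector of a disk image (for example a -flat.vmdk on an NFS datastore, or a copy of a guest's boot disk) and flags any partition whose byte offset is not a multiple of 4 KiB. The path is a placeholder, and it only handles classic MBR-partitioned disks with 512-byte sectors.

#!/usr/bin/env python3
# Minimal sketch: check whether the partitions in an MBR-partitioned disk image
# start on 4 KiB (WAFL block) boundaries. Assumes classic MBR layout and
# 512-byte sectors; the path argument is a placeholder example.
import struct
import sys

SECTOR = 512
WAFL_BLOCK = 4096

def check(path):
    with open(path, "rb") as f:
        mbr = f.read(512)
    for i in range(4):                      # four primary partition entries
        entry = mbr[446 + i * 16 : 446 + (i + 1) * 16]
        if entry[4] == 0:                   # empty slot (partition type 0)
            continue
        start_lba = struct.unpack("<I", entry[8:12])[0]
        offset = start_lba * SECTOR
        aligned = (offset % WAFL_BLOCK) == 0
        print("partition %d: start LBA %d (offset %d bytes) -> %s"
              % (i + 1, start_lba, offset,
                 "aligned" if aligned else "MISALIGNED"))

if __name__ == "__main__":
    check(sys.argv[1])   # e.g. /vmfs/volumes/ds1/vm01/vm01-flat.vmdk

A classic example is an older Windows guest whose first partition starts at LBA 63 (offset 32,256 bytes), which is not a multiple of 4,096 and so generates partial writes.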
Hi, we are experiencing HUGE latency spikes of 1,000,000+ microseconds (1 second+) on ONTAP 7.3.3 NFS volumes, as reported by NetApp Management Console. These spikes are disabling VMware virtual machines (Windows SQL Server needs to be rebooted, Linux VMs go into read-only mode and need reboots, etc.).
We have a case open with NetApp (2001447643), and the latest analysis of perfstat and archive stats from the spikes has been presented to us:
"This data is definitely good. We are seeing the latency. Here is what I am seeing on the filer side:

Server rpc (TCP):
  calls    badcalls    nullrecv  badlen  xdrcall
  232298   4294959566  0         0       4294959566

The NetApp filer is getting a huge number of bad XDR calls, indicating that the filer is unable to read the NFS headers.
We cannot determine at this time what the source of these bad calls is. Some of the worst-offending volumes during this period, in terms of latency, appear to be:
vm64net, vm65net, vm65net2, vm65net3, ora64net02
All samples below were taken Tue Jun 15 17:45:46 UTC 2010 (time delta 0.00), all volumes on aggregate aggr1:

Volume      Total Op/s  Avg Lat (µs)  Read Op/s  Read Data (B/s)  Read Lat (µs)  Write Op/s  Write Data (B/s)  Write Lat (µs)  Other Op/s  Other Lat (µs)
vm64net     311.00      6,981,691.95  35.00      336,402.00       1,336,540.48   267.00      1,561,500.00     7,953,614.68    8.00        4.29
vm65net     115.00      6,283,673.41  0.00       2,453.00         38,863.33      107.00      1,441,475.00     6,714,803.11    6.00        12.41
vm65net3    292.00      3,481,462.35  14.00      110,824.00       1,390,729.32   263.00      1,582,710.00     3,780,725.12    14.00       6.82
ora64net02  17.00       3,280,731.47  5.00       92,421.00        4,776.50       7.00        24,536.00        7,710,536.08    4.00        2.77
vm65net2    315.00      2,838,381.82  11.00      56,383.00        22,902.19      287.00      1,805,548.00     3,105,157.06    15.00       21.81

Our best bet to track down the source of the bad calls would be to capture a packet trace from the filer when this issue is occurring."
The points I'd like clarified:
What is a bad XDR call, and why is it relevant to the latency spike?
"indicating that the filer is unable to read the NFS headers" - we need this clarified and expanded.
We saw another smaller spike around 2:30am today:
These are all volumes on aggregate aggr1 (10K RPM disks). Is this an overloaded (IOPS-wise) aggregate issue?
We can't currently predict when these latency spikes occur - they seem random - so getting a packet capture of a random event does not seem feasible...
Any insight is welcome - we have been in major pain with this for months now.
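One way around not being able to predict the spikes is to leave a rotating packet capture running and only stop it once a spike is detected, so the minutes leading up to the event are always on disk. Below is a rough sketch of that idea, run on an NFS client that mounts one of the affected datastores (support may also suggest tracing on the filer itself). The interface name, mount point, threshold, and file paths are all placeholders, and the spike detection here is just a timed fsync'd write to a probe file, not the filer's own latency counters.

#!/usr/bin/env python3
# Sketch of a watchdog for catching an unpredictable latency event in a packet
# trace: keep a rotating tcpdump ring buffer running, time a small synced write
# to the NFS mount once a second, and stop the capture (freezing the last ring
# files) as soon as the probe write latency crosses a threshold.
# Assumptions: runs as root on an NFS client with tcpdump installed;
# interface, mount point, and threshold below are placeholders.
import os
import signal
import subprocess
import time

IFACE = "eth0"                         # placeholder capture interface
PROBE = "/mnt/vm64net/.latency_probe"  # placeholder file on the NFS mount
THRESHOLD = 0.5                        # seconds; spike detection threshold

# Ring buffer: 10 files of ~100 MB each, NFS traffic only (port 2049).
capture = subprocess.Popen(
    ["tcpdump", "-i", IFACE, "-s", "0", "-C", "100", "-W", "10",
     "-w", "/var/tmp/nfs_ring.pcap", "port", "2049"])

try:
    while True:
        start = time.time()
        fd = os.open(PROBE, os.O_WRONLY | os.O_CREAT)
        os.write(fd, b"x")
        os.fsync(fd)                   # force the write out to the filer
        os.close(fd)
        elapsed = time.time() - start
        if elapsed > THRESHOLD:
            print("spike detected: probe write took %.2fs, stopping capture"
                  % elapsed)
            break
        time.sleep(1)
finally:
    capture.send_signal(signal.SIGINT)  # stop tcpdump, keep the ring files
    capture.wait()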