Hi, I have reproduced this several times while testing vfiler failover on ONTAP 7.3.3:

1) (on source cluster) vfiler stop testvfiler (vfiler status reports stopped)
2) (on DR cluster) vfiler dr activate testvfiler@sourcecluster (reports the vfiler is activated) - but shortly thereafter: "Duplicate IP address 10.6.64.100!! sent from Ethernet address: 02:a0:98:xx:xx:xx"

And I did check the KB: https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb28734 (it basically says to make sure the source vfiler reports stopped before activating the DR side).

What I found is that the source side, while reporting the vfiler stopped, hangs onto the IP address as an alias. Issuing an ifconfig VIP -alias <duplicate IP> resolves it - but this feels like a bug, and I did not have this issue in versions prior to 7.3.3.

Anyone else seen this? Going to open a case too.

thanks
http://vmadmin.info
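For anyone hitting the same thing, the workaround sequence from my repro looks roughly like this ("VIP" is the interface name from my config; substitute your own, and double-check the syntax against your ONTAP release):

```
source> vfiler status testvfiler         # reports "stopped", but...
source> ifconfig VIP                     # ...the duplicate IP is still bound as an alias
source> ifconfig VIP -alias 10.6.64.100  # drop the stale alias on the source
dr> vfiler dr activate testvfiler@sourcecluster
```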
Thanks for the tips - I tried your method #1 with the wizard, and for the last 5 minutes it's been loading the available counters... I'll try the other way with baselines too. Is there a TR doc for this? thanks again, Fletcher.
Hi, I recently had a performance problem with one of my aggregates - it was the second time in a few months, so I decided to configure Operations Manager thresholds and alarms to alert me before the IOPS reach a critical level in the future. I documented the analysis and procedure here: http://www.vmadmin.info/2010/09/netapp-iops-threshold-alerting.html Feedback welcome!
Hi, we've aligned all our VMware VMDKs according to the NetApp best practices while tracking the pw.over_limit counter - see: http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html

Counters that indicate improper alignment (ref: ftp://service.boulder.ibm.com/storage/isv/NS3593-0.pdf):

"There are various ways of determining if you do not have proper alignment. Using perfstat counters, under the wafl_susp section, "wp.partial_writes", "pw.over_limit", and "pw.async_read" are indicators of improper alignment. The "wp.partial_writes" counter is the block counter of unaligned I/O. If more than a small number of partial writes happen, then IBM System Storage N series with WAFL (write anywhere file layout) will launch a background read. These are counted in "pw.async_read"; "pw.over_limit" is the block counter of the writes waiting on disk reads."

So the pw.over_limit counter is still recording a 5-minute average of 14, with 7-10 peaks in the 50-100 range at certain times of the day. If I look at the clients talking to the NetApp at those times, it's mostly Oracle RAC servers with storage for data and voting disks on NFS.

This leads me to the question: what, if any, are the other possible sources of unaligned IO on NetApp? All the references I find are about VMware VMDKs - but are there others, like Oracle, which may be doing block IO over NFS?

Many thanks --
Fletcher Cocquyt
http://vmadmin.info
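Since these counters aren't exposed via SNMP as standard, I trend them by scraping the counter output on a schedule. A minimal sketch of the parsing step - the sample text and its "name = value" layout are my own assumption, so adjust the pattern to whatever your ONTAP release actually prints:

```python
import re

def parse_wafl_susp(text):
    """Pull the misalignment-related counters out of wafl_susp-style output.

    NOTE: the expected "name = value" layout is an assumption; real
    output varies by ONTAP release, so adapt the regex as needed.
    """
    counters = {}
    for name in ("wp.partial_writes", "pw.over_limit", "pw.async_read"):
        m = re.search(re.escape(name) + r"\s*=\s*(\d+)", text)
        if m:
            counters[name] = int(m.group(1))
    return counters

# Hypothetical excerpt, not captured from a real filer:
sample = """
wp.partial_writes = 1402
pw.over_limit = 14
pw.async_read = 356
"""

print(parse_wafl_susp(sample))
```

Feed the resulting numbers into whatever trending tool you already graph with (we push them into MRTG-style graphs as outlined in the blog post above).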
I opened a case and was told NetApp does not support mbralign on Solaris (and to open a case with Sun). This despite the fact that the NetApp Virtual Storage Console 2.0 for VMware vSphere Installation and Administration Guide index lists GRUB as a step in the realignment process for Solaris: "Solaris, reinstalling GRUB after running mbralign ... 58".

The case engineer copied and pasted the steps to me from the guide, and when I asked if he had tried those steps in the lab (because I have, many times, and all the Solaris GRUB fixes fail - or say they succeed and result in a non-bootable VM), he countered with "NetApp does not support the Solaris grub fix - call Sun".

I have yet to hear of _anyone_ successfully realigning and grub-fixing a Solaris VM into a bootable, aligned VM. Since we have very few Solaris VMs left this is not a huge deal for us - we're migrating all VMs to Linux.

thanks
http://vmadmin.info
Turns out the impact of misaligned VMs can be more dramatic if the level of unaligned IO tips ONTAP into synchronous mode. One indicator is the pw.over_limit stat - ref: TR-3593, section 1.4, "Counters that indicate Improper Alignment":

"There are various ways of determining if you do not have proper alignment. Using perfstat counters, under the wafl_susp section, "wp.partial_writes", "pw.over_limit", and "pw.async_read" are indicators of improper alignment. The "wp.partial_writes" counter is the block counter of unaligned I/O. If more than a small number of partial writes happen, then WAFL will launch a background read. These are counted in "pw.async_read"; "pw.over_limit" is the block counter of the writes waiting on disk reads."

This counter is not exposed via SNMP as standard, but it can be trended as outlined here: http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html

thanks
Hi, a quick follow-up outlining how we currently quantify the misalignment issue: http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html Cheers
After I took apart the 6800+ IOPS on the problem aggregate, the issue turned out to be that we were hitting the physical limitations of the 10K RPM disks. Further analysis (surprisingly) revealed about 50% of these IOPS were SnapMirror related. We rescheduled the SnapMirror transfers to reduce this and have said goodbye to the latency spikes. If interested in the details, please see: http://www.vmadmin.info/2010/07/vmware-and-netapp-deconstructing.html I want to thank NetApp support - especially Errol Fouquet - for his expertise pulling apart this problem and helping isolate the cause. Fletcher.
Hi, we are aligning all our VMs with NetApp's mbralign tool. We fix the Linux GRUB with a rescue "Super Grub" CD - but this does not work on the Solaris VMs.

The error is: "Error 6: Mismatched or corrupt version of stage1/stage2"

I tried the manual steps (specifying the "a" slice, as someone mentioned in a Solaris forum):

    grub> root (hd0,0,a)
    grub> setup (hd0)

but got the same error. Has anyone successfully aligned a Solaris VM?

thanks
Hi, we are experiencing HUGE 1,000,000+ microsecond (1 second+) latency spikes on ONTAP 7.3.3 NFS volumes, as reported by NetApp Management Console, which is disabling VMware virtual machines (a Windows SQL Server VM needs to be rebooted, Linux VMs go into read-only mode and need reboots, etc).

We have a case open with NetApp (2001447643), and the latest analysis of perfstat and archive stats from the spikes being presented to us is:

"This data is definitely good. We are seeing the latency. Here is what I am seeing on the filer side:

    Server rpc:
    TCP:
    calls    badcalls    nullrecv  badlen  xdrcall
    232298   4294959566  0         0       4294959566

The NetApp filer is getting a huge number of bad XDR calls, indicating that the filer is unable to read the NFS headers. We cannot determine at this time what the source of these bad calls is. Some of the worst offending volumes during this period, regarding latency, appear to be: vm64net, vm65net, vm65net2, vm65net3, ora64net02.

All rows below are from Tue Jun 15 17:45:46 UTC 2010 (time delta 0.00), parent aggregate aggr1:

    Volume      Total Op/s  Avg Lat (µs)  Read Op/s  Read Data (B/s)  Read Lat (µs)  Write Op/s  Write Data (B/s)  Write Lat (µs)  Other Op/s  Other Lat (µs)
    vm64net     311.00      6,981,691.95  35.00      336,402.00       1,336,540.48   267.00      1,561,500.00      7,953,614.68    8.00        4.29
    vm65net     115.00      6,283,673.41  0.00       2,453.00         38,863.33      107.00      1,441,475.00      6,714,803.11    6.00        12.41
    vm65net3    292.00      3,481,462.35  14.00      110,824.00       1,390,729.32   263.00      1,582,710.00      3,780,725.12    14.00       6.82
    ora64net02  17.00       3,280,731.47  5.00       92,421.00        4,776.50       7.00        24,536.00         7,710,536.08    4.00        2.77
    vm65net2    315.00      2,838,381.82  11.00      56,383.00        22,902.19      287.00      1,805,548.00      3,105,157.06    15.00       21.81

Our best bet to track down the source of the bad calls would be to capture a packet trace from the filer when this issue is occurring."

My points I'd like to clarify:
- What is a bad XDR call, and why is it relevant to the latency spike? "indicating that the filer is unable to read the NFS headers" - I need you to clarify and expand on this.
- We saw another smaller spike around 2:30am today. These are all volumes on aggregate aggr1 (10K RPM disks) - is this an overloaded (IOPS-wise) aggregate issue?
- We can't currently predict when these latency spikes occur - they are random - so getting a packet capture of a random event does not seem feasible...

Any insight is welcome - we have been in major pain with this for months now.

thanks
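To put those numbers in perspective, every one of the listed volumes is averaging multiple seconds per op. A quick sanity check against the 1-second level that takes out our VMs (the rows are copied from the table above; the threshold is my own):

```python
# (volume, total op/s, avg latency in microseconds) -- values copied
# from the support data above
rows = [
    ("vm64net",    311.00, 6_981_691.95),
    ("vm65net",    115.00, 6_283_673.41),
    ("vm65net3",   292.00, 3_481_462.35),
    ("ora64net02",  17.00, 3_280_731.47),
    ("vm65net2",   315.00, 2_838_381.82),
]

THRESHOLD_US = 1_000_000  # 1 second -- the level at which our VMs fall over

# Flag every volume whose average latency breaches the threshold
offenders = [vol for vol, ops, lat_us in rows if lat_us > THRESHOLD_US]
print(offenders)
```

All five volumes breach the threshold, which is why I suspect a shared cause (the aggregate, or something upstream of it) rather than any one VM.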
Hi, this thread shows up at the top of "NetApp latency spikes" searches. We have a 3040 cluster hosting 11 vSphere hosts with 200 VMs on NFS datastores. We see latency spikes 3-4 times a month, as reported by Operations Manager. We hoped last week's upgrade from 7.3.1.1 to 7.3.3 would help, but on Saturday we had another spike up to 1 second take out an NFS mount and several of the VMs.

We previously identified the high & medium IO VMs and either aligned them or migrated them to local disk - it has NOT helped - we are still getting the spikes. I have another case opened with NetApp.

Following the notes in this thread, I ran wafl_susp -w to check pw.over_limit. Turns out ours is ZERO (is it relevant to NFS?). I suspect an internal NetApp process is responsible for these (dedup?) - we had it disabled on 7.3.1.1, 7.3.3 was supposed to fix this, and we re-enabled dedup after the upgrade. And the latency spike outages are back.

Will share any info from the case.

thanks for any tips, Fletcher.
Hi, the ESX NFS tuning recommendations are working - but only 3 of my 15 ESX hosts are populated in the VSC -> Overview pane; the rest show "?Unknown". Why would that be? thanks
Hi, we are in the process of aligning the VMs on our NFS datastores. I read with interest the NetApp doc http://media.netapp.com/documents/tr-3747.pdf outlining the performance impact from the NetApp point of view - but I did not see the impact quantified. Since the alignment process currently requires the VM to be down (integrate with Storage vMotion please?!), I decided to design a test of the impact of aligned vs. not-aligned from the VM's point of view.

The setup (on an otherwise quiesced lab system: a 2 x Dell 1950 cluster running vSphere + a NetApp 2020 with NFS datastores):

1) Take a misaligned Linux VM (as checked by mbrscan)
2) Clone the VM
3) Align the clone with mbralign

Now we have two Linux VMs: M(isaligned) and A(ligned). I wanted a way to generate IO of varying sizes, so I used this script:

[fcocquyt@lab-vm-01 ~]$ more generateIO.csh
#!/bin/csh
# step through block sizes 1024..8192 in 1024-byte increments,
# writing and checksumming 19 test files at each size
set x=1
set bs=1024
while ( $bs < 9000 )
  echo $bs
  while ( $x < 20 )
    dd if=/dev/zero of=tstfile$x bs=$bs count=10240
    sum tstfile$x
    @ x++
  end
  rm tstfile*
  @ bs+=1024
  set x = 1
end

What I found from repeated runs of this script on both VMs was that the misaligned VM took an average of 18% longer to run the same IO. I also captured /usr/lib/vmware/bin/vscsiStats - but interestingly those numbers (latency and outstandingIOs, for example) did not show the same result (they showed about the same average latency for the M & A VMs).

I welcome any and all comments on this analysis. One open area is block size: I suspect the block size has a big effect on the latency - while the script was stepping through the block sizes I observed the throughput varying quite a bit. But the finding of an 18% impact is in line with my expectation for NFS datastores.

thanks
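The 18% figure comes from comparing the mean wall-clock time of repeated runs on each clone; the arithmetic is just this (the run times below are illustrative placeholders, not my raw lab data):

```python
from statistics import mean

# Wall-clock seconds per run of generateIO.csh on each VM.
# Illustrative values only -- not the actual lab measurements.
misaligned_runs = [236, 241, 239]
aligned_runs = [201, 199, 204]

# Percentage overhead of the misaligned VM relative to the aligned one
overhead_pct = (mean(misaligned_runs) / mean(aligned_runs) - 1) * 100
print(round(overhead_pct, 1))
```

With enough runs per block size, the noise from the shared lab hardware averages out; I'd suggest at least 5 runs per VM before trusting the percentage.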
Eric, thanks for the reply - yes, the volume is sis (dedup) enabled:

df -sh
Filesystem   used    saved   %saved
/vol/vm2/    1333GB  1852GB  58%

Is it the case that we cannot enjoy both dedup savings and rapid file-level cloning simultaneously? We are looking at implementing VDI in the future, so it would be very nice to have both.

A few follow-up questions:

1) If RCU is supported, why have the two cases I opened been closed and stalled with "RCU is not supported"? On the current case (2000881634) the engineer cannot find documentation on how to perform the file-level clone from the command line. I found it in Scott Lowe's blog post from Dec 2008: http://blog.scottlowe.org/2008/12/03/2031-enhancements-to-netapp-cloning-technology/ Is there a technical doc # that fully describes RCU and file-level cloning?

2) If RCU is based on the shared-block model of dedup, why are blocks being duplicated at all? Shouldn't the new clone file just consist of pointers to the blocks of the old file?

3) In my test I was able to run the file-level clone command from vfiler0 on a volume that is actually NFS exported from a different vfiler. Could RCU be made to support vfilers in the future?

thanks for clarifying
NetApp support is researching how to use file-level flexcloning on the command line - I found this great explanation on Scott Lowe's blog from Dec '08: http://blog.scottlowe.org/2008/12/03/2031-enhancements-to-netapp-cloning-technology/ It sounds like it's not supported in a vfiler context.

At least I know how to run it on the command line now - though in my testing it does not seem rapid at all, copying individual blocks 1% at a time. That seems to defeat the purpose of using this over traditional VM cloning via the VMware VIC.

thanks
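For anyone else searching, the basic command-line form is something like the following sketch - the paths are from my earlier example, and you should double-check the exact clone command syntax against your ONTAP release before relying on it:

```
filer> clone start /vol/vms/vm1/vm1.vmdk /vol/vms/vm2/vm1.vmdk
filer> clone status
```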
They still say it's not supported - see the new engineer's reply and my response below.

My response:

Ok, I don't want to use RCU if I cannot get support. Regardless, my goal remains the same: I want to use file-level flexclone in ONTAP 7.3.1.1 to clone a file /vol/vms/vm1/vm1.vmdk to /vol/vms/vm2/vm1.vmdk (I can rename the file to vm2.vmdk later). How do I accomplish this via the ONTAP command line? thanks

On 7/1/09 6:40 PM, "neweng@netapp.com" wrote:

Hi Fletcher, I checked with our engineers on your questions, who insisted that the no-support status of this utility remains in effect. The only publicly accessible support documentation is available on the RCU 2.0.1 Description Page, where you may review Release Notes, Best Practices, and an Installation and Administration Guide: http://now.netapp.com/NOW/download/software/rapid_cloning/2.0.1/ You may consult with your NetApp Sales Engineer to learn whether there are alternatives available. I apologize for your inconvenience.
Hi Peter, thanks for the feedback - if the SVMotion plugin author were to look at incorporating NetApp calls to attempt this, what NetApp API/SDK would be most appropriate? If the offset is a value read in at boot time by the OS, VMware may have the necessary layer of abstraction to make this change transparent to the SVMotion destination before it takes over for the source. thanks
Can VMware and NetApp please collaborate to create a SVMotion plugin smart enough to fix misaligned VMs during the storage vMotion process? thanks