Well, it's been more than 24 hours since I freed up the space on the volumes, and the report still shows 156 GB for "Daily Growth Rate". I'd expect a negative growth rate, since several hundred GB were freed and that is reflected in the negative slope of the orange extrapolation line on the 1d chart. I think the 156 GB figure may actually be coming from the 1w chart, which still shows a positive slope (growth) that is diminishing gradually as the new low-usage points replace the older 1w high-usage points. So it appears misleading: the daily growth number does not match the last 24-hour window. There also appears to be no way to force a re-run of the report - it seems to regenerate automatically on some unknown schedule? Can anyone shed some light on this? thanks
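For what it's worth, here's a minimal sketch of why I think this is a windowing effect - it assumes the report derives its daily rate from a least-squares slope over a fixed sample window, which is my guess about the implementation, not documented behavior. A 7-day window can still report positive growth right after a large purge while a 1-day window goes negative:

# Hypothetical illustration only: fit a least-squares slope over different
# sample windows of volume usage (GB, one sample per hour). The assumption
# that DFM uses a linear fit over a fixed window is mine.

def slope_gb_per_day(samples_gb, hours):
    """Least-squares slope over the last `hours` samples, in GB/day."""
    window = samples_gb[-hours:]
    n = len(window)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(window) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, window))
    den = sum((x - mean_x) ** 2 for x in xs)
    return (num / den) * 24  # per-hour slope -> per-day

# Six days of steady growth (~156 GB/day), then a purge over the last 24 hours.
usage = [5000 + 6.5 * h for h in range(144)]
usage += [usage[-1] - 15 * h for h in range(1, 25)]

print("1d window slope: %+.0f GB/day" % slope_gb_per_day(usage, 24))   # negative
print("1w window slope: %+.0f GB/day" % slope_gb_per_day(usage, 168))  # still positive

If the report really does use the longer window, that would explain the stale-looking 156 GB number.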
... View more
Follow-up: how can I refresh the report? E.g., I just freed up a lot of space on several volumes and the graphs reflect this, but the report does not. Can I force it to run again now? thanks
... View more
Hi, with NetApp support's help, we determined there was a stale /etc/sm/notify entry preventing the offlining of the old volume. Running cp /dev/null /etc/sm/notify resolved this. thanks
... View more
Hi, Provisioning Manager is being used to migrate several vFilers, but this one is repeatedly failing - can someone decipher why? Another issue is that when this failure occurs, it disables snapmirror on the head, killing other migrations and snapmirrors! (I end up having to issue an options snapmirror.enable on.) Here is the excerpt from the vfiler_trans_migrate_log - can you tell me why the cutover of app-vf-01 repeatedly fails, and why it sets options snapmirror.enable off, causing all other snapmirrors to abort? thanks

--------------------START OF NON-DISRUPTIVE MIGRATION------------------------------
SOURCE
THE VFILER UNDER MIGRATION: app-vf-01
Snapshot create started
Snapshot create completed took:2482 msecs
Disable visibility started
Disable visibility snapshot took:1252 msecs
Snapshot disable started
Snapshot disable completed took:0 msecs
CUTOVER started : Thu Nov 11 20:49:24 PST 2010
Step:'Shutdown iscsi sessions running in vfiler:app-vf-01 context' took:0 seconds
WAFL:Ref counts for volume 'appdata' Volref count:1 snapmirror source
WAFL:vol offline on 'appdata' failed error: CR_BEING_USED
Multi-volume offline stats: Total volumes = 2, Volumes with errors = 1, Offline async msg complete = 312, Phase1 = 0, CP1 = 1166, Phase2 = 0, Phase3 = 0, CP2 = 468, Phase4 = 0, Name cache = 91, Phase5 = 3, Inode cache = 151, Phase6 = 21, CP switch = 1530, Phase7 = 121, Total Time = 3863
WAFL:vol online on 'appdata' failed error: CR_ALREADY_ONLINE
Multi-volume online stats: Total volumes = 1, Volumes with Errors = 1, Time to online = 0
Multi-volume online stats: Total volumes = 1, Volumes with Errors = 0, Time to online = 1346
Snapshot deletion took :6073 msec
===MONITOR DUMP===
Total Heartbeats - 1 Frequency - 2500 millisecond
[MON] WakeUp
[MON] WatchDog = 120sec
[MON] Abort
[MON] Fallback Triggered on:app-vf-01
[FALL] Requested
[FALL] Fallback Success
[MON] Sleep to MEngine
[MON] Sleep
[MON] SleepToFallback
[FALL] Sleep
===MONITOR DUMP===
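The "Volref count:1 snapmirror source" / CR_BEING_USED lines make me think the source volume is still registered as a snapmirror source at cutover time. Before retrying, I've been dumping the relationships that reference the volume - a minimal sketch, assuming passwordless ssh to the head; the filer name and volume below are placeholders:

# Minimal sketch: list snapmirror state for the volume that fails to offline.
# Assumes passwordless ssh to the filer; hostname and volume are placeholders.
import subprocess

FILER = "na01.school.edu"   # source head (placeholder)
VOLUME = "appdata"          # volume reported with CR_BEING_USED

def run(cmd):
    out = subprocess.run(["ssh", FILER] + cmd, capture_output=True, text=True)
    return out.stdout

# Overall snapmirror state and any transfers touching the volume.
status = run(["snapmirror", "status"])
print("\n".join(line for line in status.splitlines()
                if VOLUME in line or "is on" in line))

# Destinations registered against this source volume (these hold a reference
# on the source and could prevent 'vol offline' during cutover).
print(run(["snapmirror", "destinations", VOLUME]))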
... View more
Hi, I've opened a case on this, but wanted to get the community's feedback.

Summary: last night, just after initiating 2 vfiler migrations, we experienced an outage on our new 3170 running 7.3.3. It dropped off the network around 10:50pm - I could not ssh to it. I logged in via the RLM and found that cf status said up, but all the network and NFS services were down. I ended up initiating a takeover from its partner, then a giveback after the problem head came up clean. It was logging messages like these on the RLM console:

ping: wrote 17.6.6.1 64 chars, error=No buffer space available
na01> Wed Nov 10 23:00:07 PST [irt-na01: Java_Thread:info]: Lookup of time.school.edu failed with DNS server 17.6.7.77: No buffer space available.
syslogd: Could not forward message to host 17.2.65.2: No buffer space available
Nov 10 23:00:07 [na01: Java_Thread:info]: Lookup of time.school.edu failed with DNS server 17.6.7.77: No buffer space available.
syslogd: Could not forward message to host 17.2.65.2: No buffer space available

CPU was high before and after the outage; NFS ops were low at the outage time.

Questions:
1) Is this network-related buffer space?
2) How can I track the buffer space usage? (Is there an SNMP-exposed metric?)
3) If it's network buffers, is it related to a 10GigE bug? (I thought NetApp had worked those out for the most part.)

many thanks for any info / experience,
http://vmadmin.info
Fletcher
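On question 2: I don't know of a specific buffer-space counter offhand, but here's a minimal sketch of how I'd check whether one is exposed - it assumes snmpwalk is installed, SNMP is enabled on the filer, and a 'public' read community; the hostname and community are placeholders, and whether a buffer counter actually exists in your netapp.mib is something to verify:

# Minimal sketch: walk the NetApp enterprise MIB subtree (1.3.6.1.4.1.789)
# and look for anything buffer-related. Whether a buffer-space counter is
# exposed depends on your netapp.mib / ONTAP version - verify before relying
# on this. Community string and hostname are placeholders.
import subprocess

FILER = "na01.school.edu"      # placeholder
COMMUNITY = "public"           # placeholder read community
NETAPP_ENTERPRISE_OID = "1.3.6.1.4.1.789"

walk = subprocess.run(
    ["snmpwalk", "-v2c", "-c", COMMUNITY, FILER, NETAPP_ENTERPRISE_OID],
    capture_output=True, text=True,
)

for line in walk.stdout.splitlines():
    if "buf" in line.lower():
        print(line)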
... View more
One of the useful reports, "Volume Growth Rates", uses the daily growth rate to extrapolate "Days To Full": http://dfmserver/dfm/report/view/volumes-growth-rates?lines=20 It'd be more accurate for us if we could customize it to use the weekly growth rate (to smooth out daily outliers). Is this possible? thanks, Fletcher. http://vmadmin.info
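In case it helps explain what I'm after, here's a minimal sketch of the calculation I'd like the report to do - extrapolating days-to-full from the average rate over the last week rather than the last day. The sample numbers are made up; I don't know how DFM computes its rate internally:

# Minimal sketch: days-to-full from a weekly average growth rate instead of
# the last daily delta. Sample numbers are made up.

def days_to_full(capacity_gb, used_samples_gb):
    """used_samples_gb: one sample per day, oldest first (>= 8 days)."""
    weekly_rate = (used_samples_gb[-1] - used_samples_gb[-8]) / 7.0  # GB/day
    if weekly_rate <= 0:
        return None  # shrinking or flat - never fills at the current trend
    return (capacity_gb - used_samples_gb[-1]) / weekly_rate

used = [400, 405, 410, 415, 420, 425, 430, 650]   # one outlier day (+220 GB)
daily_rate = used[-1] - used[-2]                   # 220 GB/day from the outlier
print("days to full (weekly rate): %.1f" % days_to_full(1000, used))
print("days to full (daily rate):  %.1f" % ((1000 - used[-1]) / daily_rate))

The daily rate makes the volume look like it fills in under two days; the weekly rate gives a far less alarmist estimate.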
... View more
Hi, does there exist a tool that makes volume/vFiler migration suggestions based on optimizing the overall performance and space goals of a set of filers? Analogous to how VMware DRS makes prioritized vMotion suggestions based on certain criteria (CPU usage being the primary one) - in the case of a NetApp tool, the primary resources would be IOPS (observed average and peak) and storage space (observed current usage + growth rate).

Inputs (all known by DFM):
1) aggregates: sizes, IO characteristics (how many IOPS can this aggr do at 5, 10, 20 ms latency?), based on disk type and number of disks
2) volumes: sizes, IO characteristics (average and average peak IOPS)
3) administrative goals (similar to VMware DRS rules), where the admin can configure rules to force a vFiler/volume to stick to a certain aggr, or require that vFilers/volumes not be located on the same aggr or cluster (e.g. for fault tolerance)

Output: the system would use a constraint-based combinatorial optimization over the inputs to recommend migrations that maximize the aggregate set's calculated ability to deliver IOPS at 5, 10, 20 ms, etc. (a toy sketch of the kind of placement logic I mean follows at the end of this post).

Assumptions:
- temp space: to rearrange volumes/vFilers on a fully (or nearly fully) allocated set of aggrs, a certain amount of temporary space will be needed
- all IOs are equal: the VMware concept of prioritizing via shares is not addressed directly; instead the admin can dictate that a given volume/vFiler sticks to a given aggr

This tool could work in different use cases:
1) fully allocated systems (suggest migrations to increase IO efficiency)
2) upgrades (how best to re-arrange volumes on a new filer's aggrs)
3) presales sizing (inputs would be estimates of dataset IO and sizes)

Is there a tool like this in the works? thanks! Fletcher. http://vmadmin.info
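To make the idea concrete, here's a toy sketch of the kind of greedy placement suggestion I mean - purely illustrative, with made-up aggregate names and headroom numbers; a real tool would do proper constraint-based optimization over DFM's observed data rather than a single greedy move:

# Toy sketch of DRS-style migration suggestions: greedily move the hottest
# unpinned volume off the most IOPS-constrained aggregate onto the aggregate
# with the most headroom that satisfies space and pinning rules. All numbers
# and names are made up; a real tool would optimize globally, not greedily.

aggrs = {
    "aggr_sas_1":  {"iops_capacity": 12000, "free_gb": 800},
    "aggr_sas_2":  {"iops_capacity": 12000, "free_gb": 3000},
    "aggr_sata_1": {"iops_capacity": 4000,  "free_gb": 6000},
}
volumes = [
    {"name": "ora_data",   "aggr": "aggr_sas_1",  "iops": 8000, "size_gb": 1200, "pin": None},
    {"name": "vmware_ds1", "aggr": "aggr_sas_1",  "iops": 5500, "size_gb": 900,  "pin": None},
    {"name": "weblogs",    "aggr": "aggr_sata_1", "iops": 300,  "size_gb": 2000, "pin": "aggr_sata_1"},
]

def aggr_load(aggr_name):
    used = sum(v["iops"] for v in volumes if v["aggr"] == aggr_name)
    return used / aggrs[aggr_name]["iops_capacity"]

def suggest_migration():
    hot = max(aggrs, key=aggr_load)
    movable = [v for v in volumes if v["aggr"] == hot and v["pin"] is None]
    if not movable or aggr_load(hot) <= 1.0:
        return None  # nothing over capacity, or everything is pinned
    vol = max(movable, key=lambda v: v["iops"])
    targets = [
        a for a in aggrs
        if a != hot
        and aggrs[a]["free_gb"] >= vol["size_gb"]
        and aggr_load(a) + vol["iops"] / aggrs[a]["iops_capacity"] < 1.0
    ]
    if not targets:
        return None
    best = min(targets, key=aggr_load)
    return "move %s from %s to %s" % (vol["name"], hot, best)

print(suggest_migration())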
... View more
[root@db-03 ~]# mount | grep -i vote
ora64-vf-01:/vol/ora64net/vote01 on /oracrs/vote01 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=17.5.64.166)
ora64-vf-01:/vol/ora64net/vote02 on /oracrs/vote02 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=17.5.64.166)
ora64-vf-01:/vol/ora64net/vote03 on /oracrs/vote03 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=17.5.64.166)
[root@db-03 ~]# crsctl query css votedisk
0. 0 /oracrs/vote01/crs_dss01.ora
1. 0 /oracrs/vote02/crs_dss02.ora
2. 0 /oracrs/vote03/crs_dss03.ora
Located 3 voting disk(s).

I agree - there would need to be another layer for unaligned IO to arise. I was thinking maybe these files are handled specially somehow, via block-level IO instead of direct NFS. thanks
... View more
Hi Jakub, please see: http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html We are tracking unaligned IO by the only/best method I know: the "partial writes over limit" counter. We'd like this to be zero for overall NetApp health. There is no indication that Oracle performance is suffering unduly, but we have been told any partial writes will affect the performance of the whole system. The point of starting this thread was to identify other sources of unaligned IO besides the most common VMware cases. Oracle seems to be confirmed as one of these sources - although I'd prefer a better method than a global counter, one that specifically identifies the clients. Ideally there'd be a way to align this IO once identified. In the VMware case, you shut down your VM and run the NetApp tool mbralign. Not sure what the Oracle fix would be? thanks
... View more
We had an Oracle DB outage today and noticed the partial writes were zero during the outage. We now have strong evidence that Oracle on NFS is doing some unaligned IO. Q: What can we do about it? Open an Oracle case? thanks
... View more
I'm trying the tool - it's listing -6% progress for my vFiler migration volume. However, it shows other regularly scheduled snapmirrors happening with positive numbers and progress bars. The vFiler volume is > 2 TB. And I keep getting a popup at refresh saying: "In order to use this version of Snapmirror Progress Monitor you need data ONTAP version 7.3 or higher" - but I have 7.3.3. Any ideas? This would be useful! thanks
... View more
Hi, we've got about 20 TB of data in 15 vfilers to migrate. I'd like to know how I can optimize throughput (e.g. will paring down the snapshots on the source before migrating help?) and how I can best monitor the progress. I used "snapmirror status" on the last one; it was confusing because the rate did not seem steady, and it progressed up to 150 GB on a volume that was 85/500 GB full before completing. The main goal is to be able to estimate the snapmirror initialization completion ETA so I can plan when I'll be doing the manual cutovers - I need to be watching all the clients during cutover, so I can't let it happen automatically in the middle of the night. Is there a better progress monitor than snapmirror status? thanks
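In the meantime I've been experimenting with a rough ETA estimate - a minimal sketch, assuming passwordless ssh to the destination head and that "snapmirror status -l" reports a transfer progress figure in KB. The destination names and the "Progress:" field name are assumptions that may differ by ONTAP release, so treat the parsing as something to adjust:

# Rough ETA sketch: sample the KB transferred for one snapmirror destination
# twice, then extrapolate against the source volume's used size. Assumes
# passwordless ssh and that 'snapmirror status -l <dest>' prints a line like
# 'Progress: <n> KB' - adjust the regex to whatever your release prints.
import re
import subprocess
import time

DEST_FILER = "na02.school.edu"          # placeholder destination head
DEST_PATH = "na02:appdata_mirror"       # placeholder destination volume
SOURCE_USED_KB = 85 * 1024 * 1024       # ~85 GB used on the source (example)
SAMPLE_SECONDS = 300

def progress_kb():
    out = subprocess.run(
        ["ssh", DEST_FILER, "snapmirror", "status", "-l", DEST_PATH],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"Progress:\s*([\d,]+)\s*KB", out)
    return int(match.group(1).replace(",", "")) if match else None

first, second = progress_kb(), None
if first is not None:
    time.sleep(SAMPLE_SECONDS)
    second = progress_kb()

if first is not None and second is not None and second > first:
    rate_kb_s = (second - first) / SAMPLE_SECONDS
    remaining_kb = max(SOURCE_USED_KB - second, 0)
    print("rate: %.1f MB/s, ETA: %.1f hours"
          % (rate_kb_s / 1024, remaining_kb / rate_kb_s / 3600))
else:
    print("could not read transfer progress - check the parsing/field names")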
... View more
Hi, I'm deleting the existing volume on the snapmirror destination to make way for vFiler migrate to re-init the snapmirror. But Provisioning Manager is erroring out in the vFiler migrate (see below), since it thinks the deleted volume still exists on the destination. I followed the suggestion to run dfm host discover <filer with deleted volume> and tried it several times via the Management Console refresh, but it will not clear. Any ideas besides rebooting DFM? thanks

Conformance Results
=== SEVERITY ===
Error: Attention: Failed to select a resource.
=== ACTION ===
Select a destination resource for migrating
=== REASON ===
Storage system : 'na01.school.edu'(151):
- Volumes by same name as 'video' already exist on the storage system 'irt-na01.stanford.edu'(151).
=== SUGGESTION ===
Suggestions related to storage system 'na01.school.edu'(151):
- Destroy the volumes on the storage system and refresh host information either by executing 'dfm host discover' CLI or navigate to 'Hosts > Storage Systems' page in Management Console and press 'Refresh' button.
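For now I've been retrying the refresh in a loop and checking whether the stale entry clears - a minimal sketch run on the DFM server. 'dfm host discover' is the documented refresh; whether your DFM build has a 'dfm volume list' command is my assumption - substitute whatever lists volumes for a host if it doesn't. Names below are placeholders:

# Sketch: re-run host discovery and check whether DFM still lists the deleted
# volume. Run on the DFM server. 'dfm volume list <host>' is an assumption
# about the DFM CLI - swap in your own listing command if needed.
import subprocess
import time

HOST = "na01.school.edu"      # filer with the deleted volume (placeholder)
STALE_VOLUME = "video"        # volume DFM still believes exists

for attempt in range(5):
    subprocess.run(["dfm", "host", "discover", HOST], check=False)
    time.sleep(120)  # give discovery/monitoring a couple of minutes
    listing = subprocess.run(["dfm", "volume", "list", HOST],
                             capture_output=True, text=True).stdout
    if STALE_VOLUME not in listing:
        print("stale volume entry is gone after attempt", attempt + 1)
        break
else:
    print("DFM still lists %s on %s - time to escalate" % (STALE_VOLUME, HOST))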
... View more
I don't see "Recent D releases of OM 4.0" for download on NOW - this is the latest one I see (from Feb 18, 2010):

First Customer Shipment Release: DataFabric Manager 4.0 (includes Operations Manager, Protection Manager and Provisioning Manager)
Release Date: 18-FEB-2010

thanks
... View more
Hi, vFiler migrate chooses the destination aggregate without asking - I'm guessing it uses a simple algorithm that may just pick the first aggr with enough free space? What if we want the migrated vFiler to end up on a specific aggr when multiple aggrs have enough space? (I guess I could fill the ones I don't want it to choose with dummy volumes - but is there a better way?) thanks
... View more
With the goal of avoiding the long snapmirror initialization, I was playing around with importing an existing vfiler -> DR vfiler relationship into Protection Manager, and managed to get one imported. What's not clear, however, is whether the failover offered under Protection Manager (especially given the "if a final update from primary to secondary storage is necessary..." language) will do the same semi-synchronous snapmirror and IP failover as seamlessly as vFiler migration does. Anyone know?
... View more
Would NetApp be able to share the "script" behind PM vFiler migration so we can modify it for our needs? (e.g. take out the initialization step) thanks
... View more
Hi, this one requires a little setup - I hope it's clear. We are upgrading 2 campus-area NetApp 3040 clusters to 3170s running 7.3.3. The snapmirror destination has already been upgraded to the 3170. We run all NFS served by vFilers. Clients are VMware, Oracle, NFS logs, apache web content, etc. Our snapmirrors are all set up with the vfiler dr configure command-line syntax.

Our plan to upgrade the remaining production 3040 cluster to the 3170 was to initiate a vFiler failover, as we have practiced and documented several times in the past (on 7.2.x):
1) Suspend VMware VMs; shut down Oracle, tomcats, apache, etc.
2) vfiler stop (on the 3040)
3) snapmirror update for all volumes
4) vfiler dr activate for all vfilers to "promote" the 3170 from snapmirror destination to production
5) Re-animate VMware VMs; restart Oracle, tomcats, apache, etc.
6) Upgrade the 3040 to a 3170
7) Re-establish DR vfilers and snapmirrors

What we found when we did the steps we had documented from 7.2.x days (a small-scale test with a small volume encapsulated in a test vfiler) was that the IP failover did not work as expected - we received duplicate-IP messages, and soon after, the VMs in the test vfiler crashed when the ESX host got confused about the NFS datastore at the IP level due to the duplicate IPs. We have had an open case with NetApp on this for a few weeks now without much progress.

Last week we learned about the vFiler migration functions automated by Provisioning Manager (PM). I tested them today and they worked flawlessly, offline and even online with a running VM - uninterrupted while the vfiler (NFS datastore) failover happened. I was particularly impressed to see the "converting to semi-synchronous snapmirror" messages. The IP failover seemed to be handled without the VMs or ESX boxes even logging a single timeout warning.

However, while the PM vFiler migration solution seems free of duplicate-IP issues, it does not perfectly suit our vFiler failover goals, since:
1) Our vfiler and snapmirror relationships are already established and initialized. In fact, to re-initialize our terabytes of snapmirrors from scratch would take days - we want to use snapmirror update, not initialize.
2) We do not want to end up with the old vfiler left in "needs cleanup" mode, as PM vFiler migration does; we want to re-establish the snapmirrors in reverse from where we left them off (avoiding lengthy initialization times).

So my question is: does PM provide a method to discover existing vfiler DR relationships and provide administratively initiated failover automation? Failing that, could we get the "script" PM is using to automate the vFiler migration and modify it for our needs? (PM already provides an input for a user-customized script for vFiler migration; a rough sketch of what I'd want such a script to do is below.)

thanks for any feedback - will summarize,
Fletcher
http://vmadmin.info
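For reference, this is roughly what I'd want the automation to cover - a minimal sketch of steps 2-4 of our documented procedure, assuming passwordless ssh to both heads; the vfiler, volume and filer names are placeholders, and the client quiesce/restart steps (1 and 5) are left out:

# Minimal sketch of the manual failover we'd like automated: vfiler stop on
# the source, snapmirror update on the destination, then vfiler dr activate
# on the destination. Assumes passwordless ssh; all names are placeholders.
import subprocess

SRC_FILER = "na3040-a"                      # placeholder source head
DST_FILER = "na3170-a"                      # placeholder destination head
VFILERS = {
    # vfiler name -> destination snapmirror paths for its volumes (placeholders)
    "app-vf-01": ["na3170-a:appdata_mirror", "na3170-a:applogs_mirror"],
}

def run(host, *cmd):
    print("%s> %s" % (host, " ".join(cmd)))
    subprocess.run(["ssh", host] + list(cmd), check=True)

for vfiler, dest_paths in VFILERS.items():
    run(SRC_FILER, "vfiler", "stop", vfiler)            # step 2
    for dest in dest_paths:
        run(DST_FILER, "snapmirror", "update", dest)    # step 3
    run(DST_FILER, "vfiler", "dr", "activate",
        "%s@%s" % (vfiler, SRC_FILER))                  # step 4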
... View more
This is the case update I just received - they marked it as closed! I guess it's up to us to demonstrate that the behavior changed from 7.3.2 -> 7.3.3?:

"Engineering stated this is working by design and will not be fixed. The vfiler DR feature assumes that dr backup is activated only when the source vfiler undergoes disaster. In case of disaster, no resources of the vfiler, like ip address, are accessible. Hence, dr activate does not care about the state of the vfiler unit at the source. It assumes that the source vfiler unit is not present and continues. The specified error message will only appear if both the ip's are in the same subnet. This happens because both the network cards publish their mac address to ip address association. As they are in the same subnet, there is an ip conflict. This is a network configuration problem and the workaround is to unbind the ip address at the source vfiler. I will go ahead and archive this case per the BURT has been addressed."
... View more
Hi, we are still burning in our new 3170 clusters running 7.3.3. Shortly after midnight last night, one of the heads logged: "Emergency shutdown: Number of Failed chassis fans are more than tolerable limit. Shutting down now". When I went to check it this morning, there were no fans failed (no amber lights on) and the head was sitting at the LOADER prompt. I typed 'bye' and it came up fine, waiting for giveback. This is less than confidence-inspiring - we want to cut production over to this cluster, and having it flake out on a seemingly non-existent fan issue is troubling. Anyone else seen this or anything like it? All HW diags appear green - do I have a faulty piece of hardware or not? thanks, Fletcher http://vmadmin.info
... View more
If I had the time, I'd run our previous version in the Simulator to verify that vfiler stop & vfiler dr activate was indeed working properly in 7.2.4 and not creating duplicate IPs. I'm waiting on NetApp support case # 2001721904 right now - will summarize how it turns out. thanks http://vmadmin.info
... View more