2015-05-21 11:51 AM
Hi, does anyone have experience moving large (e.g. 2TB) volumes to another node? These volumes are NFSv3 datastores for VMware 5.5 and there are lots of VMs running on them. I'm hoping I can just migrate the volumes without any noticeable impact on the VMs.
If anyone has experience, good or bad, please share. I know this is possible but have no real-life experience yet, as we have only just transitioned to cDOT. We run 8.3 on a 4-node cluster of FAS3250s.
2015-05-22 06:05 AM
Hi my friend,
This is neto from Brazil
How are you?
When you say "to another node", do you mean that SVM A has volumes on node 1 and you want to move them to node 2?
How about vol move?
How do the CPU load on the controllers and the disk utilization look?
2015-05-22 06:32 AM
Hi there neto, good thanks, and you?
I mean I want to move a FlexVol from node2, aggr1 to node4, aggr1. I know I can use "volume move" but just want to ask if anyone has done this with very large, busy volumes serving datastores to VMware.
Just wondering if anyone managed to use vol move on a big NFS volume like this, and how it went.
2015-05-22 06:38 AM
Hi my friend,
This is neto from Brazil
How are you?
Glad to hear from you.
Do you have numbers for CPU and % disk utilization on the source aggregate?
2015-05-22 08:14 AM
I've found volume moves work very well and are non-disruptive for the most part. I have both NAS and SAN protocols in play and regularly move sizable volumes: 8TB NFS ESX datastores, 25TB general file volumes accessed via CIFS and NFS, and 8TB+ volumes with LUNs for database servers.
The volume move engine puts additional load on the system and disks, so it can affect performance during the move. When possible, it is best to run moves when there is minimal other load on the volume and the systems involved. But with large volumes you sometimes have to let a move run for an extended time. I've watched a 40TB volume move take 12 days given other loads in the total system.
Volume moves are limited per node in two ways. First, moves run at a lower priority; otherwise a single move running at full speed could starve other workloads on the same aggregates or nodes. Second, each node only gets so many volume move endpoint slots. Both the source and the target count as a slot, so a move within a single node uses two slots against that node. Thus you can queue up a number of moves if you need to and they will process in a measured fashion.
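For a sense of scale, the 40TB-in-12-days figure can be turned into an average rate. The following is only a back-of-envelope sketch in Python, assuming decimal units (1 TB = 1,000,000 MB); the function names are made up for illustration, and a rate observed on one busy cluster is not a prediction for any other system.

```python
def average_throughput_mb_s(size_tb: float, days: float) -> float:
    """Average transfer rate in MB/s, using decimal units (1 TB = 1,000,000 MB)."""
    seconds = days * 24 * 60 * 60
    return size_tb * 1_000_000 / seconds


def estimated_days(size_tb: float, rate_mb_s: float) -> float:
    """Days needed to move size_tb at a constant rate_mb_s."""
    return size_tb * 1_000_000 / rate_mb_s / 86_400


# The 40TB move that took 12 days on a heavily loaded system.
rate = average_throughput_mb_s(40, 12)
print(f"{rate:.1f} MB/s")                      # 38.6 MB/s

# A 2TB datastore at that same (slow) average rate.
print(f"{estimated_days(2, rate):.2f} days")   # 0.60 days
```

So even at the slow rate seen during that 12-day move, a 2TB volume would be done in well under a day.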
As a standard data management mechanism, I have in our "NAS" cluster two nodes that basically are holding points for archive-style data. They are loaded up with capacity MSATA but it's a small node pair. Other nodes in the cluster store the active data on larger controllers and NL-SAS capacity disks. Part of our standard operations is to move data back and forth between the two logical storage tiers within the cluster based on level of activity. A typical week will see around 20TB of data shifted this way (across multiple volumes). It's also common to rebalance volumes onto different aggregates within a tier as sizes or I/O patterns change. All the moves take place against a background of around 1500 volumes total on this four-node cluster, with space efficiency, full SnapMirror replication, and lots of user access.
The one issue I've encountered is due to volume "container" size. In our NAS cluster we have node types that have different max volume sizes. What is not apparent is that WAFL has an underlying "container" mechanism that impacts the size. As a volume grows, WAFL increases the size of the logical container. The logical container size never shrinks (thin provisioning notwithstanding) even if the actual data does. The logical container is more a function of the metadata and internal structures needed, and can reflect things like a max files setting that was manually increased beyond the standard.
The kicker is that while we only had 26TB of real data (which should fit anywhere), the source volume had grown to a 100TB-capable container (the max on the node in question). Likely this was due to the actual user data being larger at some previous point and then shrinking back down. Attempting to move that volume to a node with a 70TB volume limit didn't fail exactly, it just didn't go. The volume move stalled without doing anything. It would show an error if you displayed all data, but it sat in the queue doing nothing. It took a query through the diagnostic mode APIs to pull the container size of the volume and confirm the issue. The only way to move that volume was the old-fashioned manual way: Robocopy. Thankfully ODX-enabled access allowed the data to move in a few days without killing the network.
Given your homogeneous cluster, and unless your datastores are under significant steady load, volume moves in cDOT will be just fine.
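The reasoning behind that stalled move boils down to one comparison, sketched here in Python. The function name and its inputs are hypothetical; obtaining the real container size needed a diagnostic mode query that is not reproduced here, so the numbers are simply the ones quoted above.

```python
def can_move(data_tb, container_tb, dest_max_volume_tb):
    """Return (ok, reason) for a planned move.

    The point of the story above: the container size, not the amount
    of data in the volume, is what has to fit the destination limit.
    """
    if container_tb > dest_max_volume_tb:
        return False, (
            f"container is {container_tb}TB but destination allows "
            f"{dest_max_volume_tb}TB (data is only {data_tb}TB)"
        )
    return True, "container fits destination limit"


# 26TB of data in a 100TB-capable container, destination limit 70TB.
ok, reason = can_move(data_tb=26, container_tb=100, dest_max_volume_tb=70)
print(ok)      # False
print(reason)  # container is 100TB but destination allows 70TB (data is only 26TB)
```

On a cluster where every node has the same max volume size, like a set of identical FAS3250s, this check cannot trip.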
Hope this helps.
2015-05-22 09:43 AM
What a great reply - really appreciated, thanks very much indeed. I will go ahead with the moves confidently, now seeing them as "small" rather than "large"!
It really helps to hear from someone who has tried and tested this. I've only done VMware Storage vMotions previously for non-disruptive VM migrations, so I needed to hear that vol move works for people.
2016-02-24 03:50 PM
I have just performed a 10TB vol move to an aggregate on a different controller stack within the cluster and it's pretty nerve-racking, especially when you start seeing All Paths Down alerts in your hypervisor (in our case VMware). The move took a good 20 hours, mostly because we have a 5% changelog policy dedup applied to this volume. It's alarming to see those APD alerts, so next time I am going to put the datastore into maintenance mode from the hypervisor first (which will migrate all VMs and their associated files off that datastore), then execute the volume move within NetApp.
2016-02-24 04:08 PM
I've moved NFS datastores around many times without incident. Perhaps you're still running ESX 5.5U1, or the host settings have not been applied with the VSC. Either way it's worth opening a case.