Hi, does anyone have experience moving large (e.g. 2 TB) volumes to another node? These volumes are NFSv3 datastores for VMware 5.5, and there are lots of VMs running on them. I'm hoping I can just migrate the volumes without any noticeable impact on the VMs.
If anyone has experience, good or bad, please share. I know this is possible but have no real-life experience yet, as we only just transitioned to cDOT. We run 8.3 on a 4-node cluster of FAS3250s.
I mean I want to move a FlexVol from node2/aggr1 to node4/aggr1. I know I can use "volume move", but I just want to ask whether anyone has done this with very large, busy volumes serving datastores to VMware.
Just wondering if anyone managed to use vol move on a big NFS volume like this, and how it went.
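For anyone finding this later, the move itself is a single command in cDOT. A minimal sketch, assuming illustrative SVM/volume/aggregate names (substitute your own):

```
cluster1::> volume move start -vserver vs1 -volume ds_nfs01 -destination-aggregate node4_aggr1

cluster1::> volume move show -vserver vs1 -volume ds_nfs01
```

The second command reports the move's state and percent complete while the transfer runs in the background.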
I've found volume moves work very well and are non-disruptive for the most part. I have both NAS and SAN protocols in play and regularly move sizable volumes: 8 TB NFS ESX datastores, 25 TB general file volumes accessed via CIFS and NFS, and 8 TB+ volumes with LUNs for database servers.
The volume move engine puts additional load on the system and disks, so it can affect performance during the move. When possible, it's best to run moves when there is minimal other load on the target volume/system. But at certain sizes you sometimes have to let a move run for extended periods - I've watched a 40 TB volume move take 12 days given the other load on the system.
Volume moves are limited per node in two ways. First, moves run at lower priority; otherwise a single move running at full possible speed could starve other workloads on the same aggregates or nodes. Second, each node only gets so many volume move endpoint slots. Both the source and the target count as a slot, so moves within a single node count two slots against that node. Thus you can queue up a number of moves if you need to, and they will process in a measured fashion.
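To illustrate the queueing behavior (volume and aggregate names here are illustrative): you can kick off several moves back to back, and any that exceed a node's available endpoint slots simply wait in the queue until a slot frees up:

```
cluster1::> volume move start -vserver vs1 -volume vol_a -destination-aggregate node4_aggr1
cluster1::> volume move start -vserver vs1 -volume vol_b -destination-aggregate node4_aggr1
cluster1::> volume move start -vserver vs1 -volume vol_c -destination-aggregate node4_aggr1

cluster1::> volume move show
```

The last command lists every move with its state, so you can watch queued moves pick up as earlier ones finish.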
As a standard data management mechanism, our "NAS" cluster has two nodes that basically serve as a holding area for archive-style data - loaded up with capacity MSATA disks, but a small node pair. The other nodes in the cluster store the active data on larger controllers with NL-SAS capacity disks. Part of our standard operations is to move data back and forth between these two logical storage tiers within the cluster based on level of activity. A typical week sees around 20 TB of data shifted using this system (across multiple volumes). It's also common to rebalance volumes onto different aggregates within a tier as sizes or I/O patterns change. All the moves take place against a background of around 1,500 volumes total on this four-node cluster, with space efficiency, full SnapMirror replication, and lots of user access.
The one issue I've encountered is due to volume "container" size. In our NAS cluster we have node types with different maximum volume sizes. What is not apparent is that WAFL has an underlying "container" mechanism that impacts the size. As a volume grows, WAFL increases the size of the logical container. The logical container never shrinks (thin provisioning notwithstanding) even if the actual data does. The container size is more a function of metadata and internal structures needed, and can reflect things like a max-files setting that was manually increased beyond the standard.

The kicker is that while we only had 26 TB of real data (which should fit anywhere), the source volume had grown to a 100 TB-capable container (the max on the node in question) - likely because the actual user data had been larger at some previous point and then shrunk back down. Attempting to move that volume to a node with a 70 TB volume limit didn't fail exactly; it just didn't go. The volume move stalled without doing anything. It would show an error if you displayed all the data, but it sat in the queue doing nothing. It took a query via diagnostic-mode APIs to pull the container size of the volume and confirm the issue. The only way to move that volume was the old-fashioned manual way - Robocopy. Thankfully, ODX-enabled access allowed the data to move in a few days without killing the network.
Given your homogeneous node cluster, and unless your datastores are under significant steady load, volume moves in cDOT will be just fine.
I have just performed a 10 TB vol move to an aggregate on a different controller stack within the cluster, and it's pretty nerve-wracking - especially when you start seeing All Paths Down (APD) alerts in your hypervisor (in our case VMware). It took a good 20 hours for the move, mostly because we have a 5% change-log deduplication policy applied to this volume. It's unnerving to see those APD alerts, so next time I am going to put the datastore into maintenance mode first from the hypervisor (which will migrate all VMs and their associated files out of that datastore), then execute the volume move on the NetApp side.
The host settings are applied correctly on all of our hosts. We are still on ESXi 5.5, so that could be it. The APD lasted only micro/seconds, so it was "almost" transparent. It occurred during the cutover phase of the vol move.
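If the cutover timing is the worry, note that `volume move start` lets you control when cutover happens. A hedged sketch with illustrative names: `-cutover-action wait` tells the move to copy the data but hold before cutover until you trigger it manually, so you can pick a quiet window instead of letting it fire mid-day:

```
cluster1::> volume move start -vserver vs1 -volume ds_nfs01 -destination-aggregate node4_aggr1 -cutover-action wait

(later, during a quiet period)
cluster1::> volume move trigger-cutover -vserver vs1 -volume ds_nfs01
```

There is also a `-cutover-window` option to bound how long a cutover attempt may take; check the `volume move start` man page for your ONTAP release for the exact defaults.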
When using ESX/ESXi NFS clients with NetApp storage controllers, you might experience the following issues:
Issue 1: Intermittent NFS APDs on VMware ESXi 5.5 U1
When running ESXi 5.5 Update 1, the ESXi host intermittently loses connectivity to NFS storage and an All Paths Down (APD) condition to NFS volumes is observed.
This issue is resolved in ESXi 5.5 Express Patch 04. Refer to VMware KB articles http://kb.vmware.com/kb/2076392 or http://kb.vmware.com/kb/2077414 for instructions to install the patch.
Issue 2: Random disconnection of NFS exports under workloads with an excessive number of requests
On some NetApp storage controllers with NFS enabled, you might experience the following symptoms:
- The NFS datastores appear to be unavailable (greyed out) in vCenter Server, or when accessed through the vSphere Client.
- The NFS shares disappear and reappear again after a few minutes.
- Virtual machines located on the NFS datastore are in a hung/paused state while the NFS datastore is unavailable.
This issue is most often seen after a host upgrade to ESXi 5.x or the addition of an ESXi 5.x host to the environment.
Chasing down the bug report trail indicates that the second issue is corrected in all up-to-date cDOT releases.
If you will remain on ESXi 5.5 with NFS, you definitely want the indicated patch. For me, a typical datastore is about 2-3 TB used (the higher end hits 8 TB), all NFS, and I move them at will as needed without any concern for system load. VMware never notices anything out of the ordinary.
Hope this helps.
Lead Storage Engineer | Consilio LLC
NCIE SAN Clustered, Data Protection