Data ONTAP Discussions

It appears "volume move" will cause massive data loss on large volume

We have been using "volume move" to shuffle volumes around our cluster, having been told that it is a great tool and will cut over cleanly.

 

Well, I've found out that it isn't so clean. When "volume move start" is executed, it appears to create a snapshot for the copy. It then proceeds to copy data from that snapshot, including the snapshots which existed prior to the "volume move start" snapshot. Once it is ready to trigger the cutover, it updates, I believe, from the origination snapshot, cuts over the volume, but does not update files created or modified after the origination snapshot.
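
For reference, the sequence we run looks roughly like this; the vserver, volume, and aggregate names are placeholders, not our real ones:

    cluster::> volume move start -vserver vs1 -volume vol_big -destination-aggregate aggr_new
    cluster::> volume move show -vserver vs1 -volume vol_big -instance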

This has been wreaking havoc with our moves: users are telling us their data has reverted to a prior point in time, and I now believe the users are correct.

Unfortunately, I could not find any documentation or TRs which address this, so I must assume it is an issue with the volume move command.

One caveat: we did not have SnapMirror licensed on the entire cluster. Perhaps that would prevent "volume move" from updating; even so, there should have been a note in the documentation.

 

If anyone at NetApp can address this, that would be great. I'd like to know if "volume move" can be used in the future, or if I need to go back to good old volume SnapMirror.

Re: It appears "volume move" will cause massive data loss on large volume

If you're seeing this behavior I would open a ticket; that's not at all how vol move is supposed to work. I have moved hundreds of volumes, including a lot of NFS and LUN-based VMware datastores, with zero issues.

 

Give these a read over: 

https://library.netapp.com/ecmdocs/ECMP1196995/html/GUID-98BCA1F4-9366-4D89-85BA-AD732375EA81.html 

https://library.netapp.com/ecmdocs/ECMP1196995/html/GUID-26FE8933-0EB0-450C-BCB4-10DAE3552878.html 

 

Are you seeing any errors? What happens after the move? Are you having to restore?

Re: It appears "volume move" will cause massive data loss on large volume

Sorry, your links come up with an "unauthorized access" error.

 

No errors at all; the automatic trigger starts and completes.

 

However, viewing snapshots, the volume move snapshot is created at the time "volume move start" is executed, and it does not change. With SnapMirror, I find that all snapshots past the baseline are locked, which tells me "vol move" is not using volume SnapMirror.
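
For what it's worth, I'm watching it with a plain snapshot listing, something like this (placeholder names):

    cluster::> volume snapshot show -vserver vs1 -volume vol_big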

Re: It appears "volume move" will cause massive data loss on large volume

That's weird, now I'm getting that too... sorry about that.

 

Give this a shot; it's the parent topic link: https://library.netapp.com/ecmdocs/ECMP1368845/html/GUID-A03C1B1E-3DE2-455D-B5AD-89C1389EF0C8.html

 

An asynchronous SnapMirror will just copy over what's in the snapshot it creates and call it a day.

 

A vol move is a little more complex. The cutover part is similar to how a VM cuts over: it copies over what it can, then stuns for a short moment and copies over the remaining blocks.
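
If you want more control over that stun, I believe you can bound it at start time; something along these lines should work, though check the man page for your release (the values here are just examples):

    cluster::> volume move start -vserver vs1 -volume vol1 -destination-aggregate aggr2 -cutover-window 45 -cutover-action defer_on_failure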

 

Re: It appears "volume move" will cause massive data loss on large volume

Well, DataMotion sounds like VMware. But beyond that, it still does not talk about snapshot and SnapMirror requirements.

 

There should be a HUGE banner in front of this command because, based on the link you sent, it apparently will not work with NFS/CIFS shares. And yet:

  • Actually, it runs with CIFS/NFS active
  • It does not fail
  • It does not copy anything written after the initial "volume move" snapshot is created

 

Re: It appears "volume move" will cause massive data loss on large volume

SnapMirror has nothing to do with vol move. They might use some of the same mechanisms under the hood, but they are separate.

 

I found more modern versions of the docs, specifically for clustered ONTAP:

http://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-vsmg%2FGUID-3DAFB27A-74F8-4FF0-9E9A-9136D408A9C5.html&cp=15_3_2

 

But I assure you that moving volumes around on clustered ONTAP is non-disruptive.

Re: It appears "volume move" will cause massive data loss on large volume

My P O I N T   exactly.

 

If I am moving a volume that is 100 TB with the cutover deferred, and it waits for a week, I would expect multiple TB of changes to accumulate. If the process is not using SnapMirror or snapshots, how can it update the state of the volume to the latest base image?
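
To be clear, the kind of deferred move I mean is this (placeholder names again):

    cluster::> volume move start -vserver vs1 -volume vol_big -destination-aggregate aggr_new -cutover-action wait
    ... a week goes by ...
    cluster::> volume move trigger-cutover -vserver vs1 -volume vol_big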

 

I am left to infer that it does not, hence the massive data loss our scientists are seeing when using "vol move".

Unless I see something from Engineering stating otherwise, I am reverting to SnapMirror.

Re: It appears "volume move" will cause massive data loss on large volume

I realize this hasn't been asked yet... what version of ONTAP are you on? Vol moves are (slightly) different between 7-Mode and clustered ONTAP.

 

Also, if you are having data loss, please open a P1 ticket with support. 

Re: It appears "volume move" will cause massive data loss on large volume

9.3.2P8

Re: It appears "volume move" will cause massive data loss on large volume

You should be seeing zero issues/loss with vol moves; cDOT/ONTAP vol moves are all non-disruptive. I would open a P1 to investigate further at this point.

Re: It appears "volume move" will cause massive data loss on large volume

"Once it is ready to trigger the cutover, it updates, I believe, from the origination snapshot, cuts-over the volume, but does not update files created or modified after the origination snapshot." --> this is not how it works.

 

Vol moves will sync up the source and destination for cutover. We don't use the original snapshot, because we'd have data loss, like you mentioned. 
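
You can watch the sync progress yourself while the move runs; something like this should show the state and how far along it is (exact output varies by release, and the names are placeholders):

    cluster::> volume move show -vserver vs1 -volume vol1 -instance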

 

What could be happening here is that the cutover is slow or that the vol moves aren't completely done. As mentioned, a support case is your best bet to get to the bottom of what's happening here.

 

Once you've resolved the issue, if you could post back here with what the fix/root cause was, that would be useful. 🙂

Re: It appears "volume move" will cause massive data loss on large volume

My question would be: is the automated cutover failing, and are you performing a manual cutover? I think others would be interested in the ultimate underlying cause and resolution of this case if you're able to share. Thanks.

Re: It appears "volume move" will cause massive data loss on large volume

Has there been a support case opened on this? If so, can you provide the case number?

Re: It appears "volume move" will cause massive data loss on large volume

Will do.

 

I have a case open and am looking at logs as far back as I can go.

 

I've also started my own test, monitoring the volume move ref_ss snapshot.  So far so good.

 

I'm touching files, taking a snapshot, and sleeping for 24 hours while the volume move is waiting for a cutover. It creates a new ref_ss snapshot every few minutes, at the bottom of the stack.
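
Roughly, each cycle of the test looks like this (placeholder names):

    cluster::> volume snapshot create -vserver vs1 -volume vol_test -snapshot check_1
    cluster::> volume snapshot show -vserver vs1 -volume vol_test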

 

So it appears to be working as intended.  I'll let you know what support has to say after looking at data.

TasP

Re: It appears "volume move" will cause massive data loss on large volume

2007845828

Re: It appears "volume move" works okay, I may have a problem with out of space conditions

So, several issues came to light through this exercise.

Our last vol move of a 100 TB volume, on a tight aggregate, appeared to work; the volume was copied to a new aggregate and transferred. No errors were reported.

However, immediately on Monday after the (auto-triggered) cutover, users started reporting that their files had reverted to an earlier version; the date was the date of the original volume move start operation.

There were several obstacles to troubleshooting this issue:

  1. This is a restricted site, so no AutoSupports go to NetApp
  2. My server, which relayed weekly AutoSupports to email, had been turned down (the new one is not up yet)
  3. The log limit on the array's mroot removed the original logs which applied to the trigger operation
  4. We do not have a syslog server to send "ALL" logs to; I'm not sure that can be done either
  5. Volume move does not add to the "volume recovery-queue", so I cannot undelete the original source volume

I ran a test on a small volume populated with 0-byte files in nested directories. I watched the snapshots and updates "volume move" made, and they worked fine. The only difference between my troublesome moves and the test was that the trouble was on volumes with dedicated aggregates with minimal free space. (Don't ask why...) The same was true of the other two large volumes which had problems.
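
If free space really is the culprit, a check like this on the destination aggregate before the move would presumably have flagged it; the aggregate name is a placeholder and the field names are from memory, so verify them on your release:

    cluster::> storage aggregate show -aggregate aggr_dedicated -fields availsize,percent-used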

All of the smaller volumes which I had moved appear to have worked okay.

 

So my conclusion is that this happened because of the lack of free space, for one reason or another, but of course I can't prove it.

 

I would like to request the following from NetApp:

  • Please add an option to volume move to keep the original volume around in case there is an issue (I forgot: in my case, my backup was to a smaller-model array, which has a smaller maximum volume size.)

If anyone knows how to set up a network syslog-type server which can keep all of the ONTAP logs, please let me know.
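
From what I can tell from the docs, EMS events can at least be forwarded to a remote syslog host with something like the following; the destination name and IP are placeholders, and I don't know whether this captures "ALL" logs:

    cluster::> event notification destination create -name remote-syslog -syslog 10.0.0.50
    cluster::> event filter create -filter-name all-events
    cluster::> event filter rule add -filter-name all-events -type include -message-name *
    cluster::> event notification create -filter-name all-events -destinations remote-syslog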

 

In the final analysis, I had to ask that the case be closed, because I could not provide logs proving or disproving the users' reports. I believe that volume move will work correctly, so long as there is enough space to do what it needs. Of course, this is all conjecture on my part, and I apologize if it is wrong. In the meantime, I've had to revert to SnapMirror, which appears to be a little slower than vol move.
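
For anyone curious, the SnapMirror-based move I've fallen back to looks roughly like this; the paths are placeholders, and on 9.3 I believe a DP relationship still works for this:

    cluster::> snapmirror create -source-path vs1:vol_big -destination-path vs1:vol_big_new -type DP
    cluster::> snapmirror initialize -destination-path vs1:vol_big_new
    ... incremental updates until the maintenance window ...
    cluster::> snapmirror update -destination-path vs1:vol_big_new
    cluster::> snapmirror quiesce -destination-path vs1:vol_big_new
    cluster::> snapmirror break -destination-path vs1:vol_big_new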

 

TasP
