ONTAP Discussions

It appears "volume move" will cause massive data loss on large volume

Tas
16,466 Views

We have been using "volume move" to shuffle volumes around our cluster, having been told that it is a great tool and will cut over cleanly.

 

Well, I've found out that it isn't so clean.  When "volume move start" is executed, it appears to create a snapshot for the copy.  It then proceeds to copy data from that snapshot, including any snapshots which existed prior to the "volume move start" snapshot.  Once it is ready to trigger the cutover, it updates, I believe, from the origination snapshot, cuts the volume over, but does not update files created or modified after the origination snapshot.

This has been wreaking havoc with our moves, with users telling us their data has reverted to a prior time;  I now believe the users are correct.

Unfortunately, I could not find any documentation or TRs which address the issue, so I must assume it is an issue with the volume move command.

One caveat: we did not have SnapMirror licensed on the entire cluster.  Perhaps that would prevent "volume move" from updating; if so, there should have been a note in the documentation.

 

If anyone at NetApp can address this, that would be great.  I'd like to know if "volume move" can be used in the future, or if I need to go back to good old Volume SnapMirror.


21 REPLIES

SpindleNinja
15,796 Views

If you're seeing this behavior I would open a ticket; that's not at all how vol move is supposed to work.  I have moved hundreds of volumes, including a lot of NFS and LUN-based VMware datastores, with zero issues.

 

Give these a read over: 

https://library.netapp.com/ecmdocs/ECMP1196995/html/GUID-98BCA1F4-9366-4D89-85BA-AD732375EA81.html 

https://library.netapp.com/ecmdocs/ECMP1196995/html/GUID-26FE8933-0EB0-450C-BCB4-10DAE3552878.html 

 

Are you seeing any errors?  What happens after the move?  Are you having to restore?

Tas
15,790 Views

Sorry, your links come up with unauthorized access.

 

No errors at all;  the automatic trigger starts and completes.

 

However, viewing snapshots, the volume move snapshot is created at the time "volume move start" is executed, and it does not change.  With SnapMirror, I find that all snapshots taken after the relationship is created are locked, which tells me "vol move" is not using volume SnapMirror.

SpindleNinja
15,766 Views

That's weird, I'm getting that now too...   sorry about that.

 

Give this a shot, it's the parent topic link.   https://library.netapp.com/ecmdocs/ECMP1368845/html/GUID-A03C1B1E-3DE2-455D-B5AD-89C1389EF0C8.html

 

A SnapMirror (asynchronous) will just copy over what's in the snapshot it created and call it a day.

 

A vol move is a little more complex.   The cut-over part is similar to how a VM will cut over: it will copy over what it can, then stun for a short moment and copy off the remainder of the blocks.
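Roughly, the flow looks like this if you hold the cutover and fire it yourself (the vserver/volume/aggregate names here are just placeholders):

volume move start -vserver vs1 -volume vol1 -destination-aggregate aggr2 -cutover-action wait
volume move show -vserver vs1 -volume vol1
volume move trigger-cutover -vserver vs1 -volume vol1

The move keeps replicating changes in the background until you trigger the cutover, and the final sync happens during that brief stun.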

 

Tas
15,758 Views

Well, DataMotion sounds like VMware.  But beyond that, it still does not talk about snapshot or SnapMirror requirements.

 

There should be a HUGE banner in front of this command stating that, apparently, based on the link you sent, it will not work with NFS/CIFS shares.

  • Actually, it works with CIFS/NFS active.
  • It does not fail.
  • It does not copy anything after the initial "volume move" snapshot is created.

 

SpindleNinja
15,748 Views

SnapMirror has nothing to do with vol move.    They might use some similar mechanisms under the hood, but they are separate.

 

Found more modern versions of the docs specifically on ONTAP (clustered):     

http://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-vsmg%2FGUID-3DAFB27A-74F8-4FF0-9E9A-9136D408A9C5.html&cp=15_3_2

 

but I assure you that moving volumes around on clustered ONTAP is non-disruptive.

Tas
15,610 Views

My P O I N T   exactly.

 

If I am moving a 100 TB volume and deferring cutover, and it is waiting for a week, I would expect to have multiple TB of changes.  If the process is not using SnapMirror or snapshots, how can it update the state of the volume to the latest base image?

 

I am left to infer that it does not, and ergo the massive data loss our scientists are seeing when using "vol move".

Unless I see something from Engineering stating otherwise, I am reverting back to SnapMirror.

SpindleNinja
15,599 Views

I realize this hasn't been asked yet:  what version of ONTAP are you on?     Vol moves in 7-Mode and clustered ONTAP are (slightly) different.

 

Also, if you are having data loss, please open a P1 ticket with support. 

Tas
15,547 Views

9.3.2P8

SpindleNinja
15,529 Views

You should be seeing zero issues/loss with vol moves; CDOT/ONTAP vol moves are all non-disruptive.      I would open a P1 to further investigate at this point.

parisi
15,569 Views

"Once it is ready to trigger the cutover, it updates, I believe, from the origination snapshot, cuts-over the volume, but does not update files created or modified after the origination snapshot." --> this is not how it works.

 

Vol moves will sync up the source and destination for cutover. We don't use the original snapshot, because we'd have data loss, like you mentioned. 

 

What could be happening here is that the cutover is slow or that the vol moves aren't completely done. As mentioned, a support case is your best bet to get to the bottom of what's happening here.
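In the meantime, you can keep an eye on where each move actually is with something like this (names are placeholders):

volume move show -vserver vs1 -volume vol1 -fields phase,percent-complete,cutover-action

If a move is still replicating, paused, or repeatedly deferring cutover, that should show up there.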

 

Once you've resolved the issue, if you could post back here with what the fix/root cause was, that would be useful. 🙂

scottharney
15,520 Views

My question would be: is the automated cutover failing, and are you performing a manual cutover?  I think others would be interested in the ultimate underlying cause and resolution of this case, if you're able to share.  Thanks

RyanUrice
14,845 Views

Has there been a support case opened on this? If so, can you provide the case number?

Tas
15,419 Views

Will do.

 

Have a case open and looking at logs as far back as I can go.

 

I've also started my own test, monitoring the volume move ref_ss snapshot.  So far so good.

 

I'm touching files, taking a snapshot, and sleeping for 24 hours while the volume move is waiting for a cutover.  It creates a new ref_ss snapshot every few minutes, at the bottom of the stack.
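To watch it, I'm just listing the snapshots on the source as I go, something like this (vserver/volume names are placeholders):

volume snapshot show -vserver vs1 -volume vol1 -fields create-time

which makes it easy to see the ref_ss snapshots rolling over while the move waits.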

 

So it appears to be working as intended.  I'll let you know what support has to say after looking at data.

TasP

Tas
15,395 Views

2007845828

Tas
15,272 Views

So several issues came to light through this exercise.

Our last vol move of a 100 TB volume, on a tight aggregate, appeared to work;  the volume was copied to a new aggregate and transferred.  No errors were reported.

However, immediately on the Monday after cutover (auto-triggered), users started reporting their files had reverted to an earlier version;  the date was the date of the original volume move start operation.

There are several issues with troubleshooting this issue:

  1. This is a restricted site, so no AutoSupports go to NetApp
  2. My server which triggered weekly AutoSupports to email had been turned down (new one not up yet)
  3. The log limit on the array's mroot removed the original logs which applied to the trigger operation
  4. We do not have a syslog server to send "ALL" logs to;  not sure that can be done either
  5. Volume move does not add to the "volume recovery-queue", so I cannot undelete the original source

I ran a test on a small volume, populated with 0-byte files in nested directories.  I watched the snapshots and updates "volume move" made, and they worked fine.  The only difference between my troublesome moves and the test was that the trouble was on volumes with dedicated aggregates with minimal free space.  (Don't ask why...)  The same was true of the other two large volumes which had problems.

All of the smaller volumes which I had moved appear to have worked okay.

 

So my conclusion is that this happened because of the lack of free space, for one reason or another, but of course I can't prove it.

 

I would like to request the following from NetApp:

  • Please add an option to volume move to keep the original volume around if there is an issue.  (I should add that, in my case, my backup was to a smaller model array, which has a smaller maximum volume size.)

If anyone knows how to set up a network syslog-type server which can keep all of the ONTAP logs, please let me know.
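One thing I intend to try is ONTAP's log forwarding; as I understand it, something like the following should push EMS messages to a remote syslog host (the address and port are placeholders, and I haven't verified exactly which logs it covers):

cluster log-forwarding create -destination 10.0.0.50 -port 514 -protocol udp-unencrypted
cluster log-forwarding show

If anyone has used it for this, I'd still like to hear how well it works.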

 

In the final analysis, I had to ask that the case be closed, because I could not provide logs proving or disproving the users' allegations.  I believe that volume move will work correctly, so long as there is enough space to do what it needs.  Of course, this is all conjecture on my part, and I apologize if it is wrong.  In the meantime, I've had to revert to SnapMirror, which appears to be a little slower than vol move.

 

TasP

Tas
13,503 Views

I believe I can use 'volume move' and verify data contents by using XCP.

 

My thought is to start a volume move operation with a manual trigger;  then take a daily snap and SnapVault it to my secondary;  I can create a new snap on my secondary to keep the data.

 

However, I thought that I could also run an XCP scan to capture the file state prior to triggering the cutover, and then after the cutover, for comparison purposes.  I am having a little trouble coming up with the xcp syntax;  perhaps someone here can help.  My thoughts are:

./xcp scan -newid XXX ontap:/export/path    # prior to the manual cutover

./xcp sync dry-run -id XXX                  # after the manual cutover

I ran a small test, and this seems to be the closest.  I also tried 'xcp scan stats -l', but I can't figure out how to do a quick compare.  When I sent the output to a text file (-l) and reran later, I had a whole lot of diffs.  Not sure if that would be helpful.

LORENZO_CONTI
13,404 Views

Hello @Tas ,

You are pointing at "lack of free space": can I ask if you are referring to the source or the destination aggregate?  AFAIK "vol move" will not start if there is not enough space on the destination aggregate.
Of course, if the vol move takes a long time, you can run out of disk space if something is writing too much data to your volume...

Can I ask how the option "space-guarantee" is set on all the volumes that share the involved aggregates?
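(Something like this should list it for everything on a given aggregate; the aggregate name is a placeholder:)

volume show -aggregate aggr1 -fields space-guarantee

That, together with how full the aggregates were, would tell us whether thin-provisioned volumes could have outgrown the space mid-move.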

Cheers

Lorenzo 

Tas
13,401 Views

My guarantee is none.  Available space was at 5% on the source.  "ergo the move".

 

Tas
13,366 Views

Wow, I found this out by accident.

 

It appears that if a 'volume move start' operation is initiated, the source volume will be irretrievably deleted at the time of the trigger.  I asked in one of my posts here for NetApp to consider keeping the volume in the volume recovery-queue; however, I've found a better way, and one which already exists.  That is, if a 'volume move start' is initiated on a volume with a FlexClone, ONTAP warns that the source volume will be kept around temporarily, until the clone is removed or split.
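So what I'm doing is roughly this (the vserver/volume/aggregate names are made up, and I still need to account for the clone's own space usage):

volume clone create -vserver vs1 -flexclone vol1_hold -parent-volume vol1
volume move start -vserver vs1 -volume vol1 -destination-aggregate aggr_new -cutover-action wait

With the clone in place, the old copy of vol1 should hang around after cutover until vol1_hold is split or deleted.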

I have one such operation running right now, and I will post the results.

 

TasP

Tas
13,547 Views

Just to close out this thread: I believe the issues I experienced were caused by a transient space condition.  Each volume move appears to check for available space before beginning the operation, but of course it has no control over other simultaneous moves to the same aggregate.  I believe the problem I had was caused by multiple volume moves to the same destination aggregate, though not by that alone.  I was moving volumes around to bring a new HA pair into the cluster (and retire an older one).

Many of my volumes have tremendous growth rates and are thin provisioned.  I've looked at some internal tickets and saw that I did run out of space on some destination aggregates while the volume moves were running.  I've experimented with this and found that the volume copy will pause at the out-of-space condition, which in my case happened a week or so after the move operations began.  In my test, I was able to recover by making aggregate space available (snapshot deletion and such).
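For what it's worth, during the later tests I kept an eye on the destination with something like this (the aggregate name is a placeholder):

storage aggregate show -aggregate aggr_dst -fields availsize,percent-used

so I could watch the free space drop while the copies were running.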

I also believe this happened multiple times during the volume move.

* But based on my testing, this alone would not have caused a problem, because the volume move suspends itself until more space becomes available.

 

* So what happened?

1. I don't have the foggiest;  however, by the time users complained, the moves had completed and the source volume had been irretrievably purged.

2. From my testing, I was not able to cause any data loss by making the destination volumes run out of space.

3. It may be that users overwrote their data from Previous Versions, or thought they had finished work which they hadn't.  No way to tell at this time.

 

My best practice going forward (not suggesting anyone else use it):  if I'm going to move an important volume, I will create a clone first;  this causes volume move to delay purging the source until the clone is split or deleted.   Or, if in the future I have enough space, I will take a point-in-time SnapMirror of the source, use a manual cutover, and update the SnapMirror before cutting over.  I think I will find that number 3 was the reason for the perceived data loss.  (And as always, the squeaky wheels get the grease; there were only a small handful of users who complained, but they had high-level management backing and complained vociferously to IT management.)
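The SnapMirror variant would look roughly like this (the paths, names, and size are placeholders; I'd check the exact policy names against the 9.3 docs):

volume create -vserver vs_dr -volume vol_big_safety -aggregate aggr_dr -size 100TB -type DP
snapmirror create -source-path vs1:vol_big -destination-path vs_dr:vol_big_safety -type XDP -policy MirrorAllSnapshots
snapmirror initialize -destination-path vs_dr:vol_big_safety
volume move start -vserver vs1 -volume vol_big -destination-aggregate aggr_new -cutover-action wait
snapmirror update -destination-path vs_dr:vol_big_safety
volume move trigger-cutover -vserver vs1 -volume vol_big

That way there is a current, independent copy to fall back on right up until the cutover fires.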

 

With that, I would like to close this out.

TasP
