ONTAP Discussions

It appears "volume move" will cause massive data loss on large volumes

Tas

We have been using "volume move" to shuffle volumes around our cluster, being told that it is a great tool, and will cut-over cleanly.

 

Well, I've found out that it isn't so clean.  When "volume move start" is executed, it appears to create a snapshot for the copy.  It then proceeds to copy data from that snapshot, including the snapshots which existed prior to the "volume move start" snapshot.  Once it is ready to trigger the cutover, it updates (I believe) from the origination snapshot and cuts the volume over, but it does not bring across files created or modified after the origination snapshot was taken.
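For reference, the moves were started with something along these lines (the vserver, volume, and aggregate names are made-up examples, and I'm quoting syntax from memory, so check the man pages on your release):

cluster::> volume move start -vserver vs1 -volume vol_data -destination-aggregate aggr_new
cluster::> volume move show -vserver vs1 -volume vol_data -fields state, percent-complete

If I'm reading the docs right, starting with "-cutover-action wait" and then issuing "volume move trigger-cutover" would let you hold the cutover and fire it manually, which I had assumed would include a final update of anything written after the initial snapshot.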

This has been wreaking havoc with our moves, with users telling us their data has reverted to an earlier point in time;  I now believe the users are correct.

-Unfortunately, I could not find any documentation or TRs which address this behaviour, so I must assume it is an issue with the volume move command.

-One caveat: we did not have SnapMirror licensed on the entire cluster.  Perhaps that would prevent "volume move" from doing the final update; if so, there should have been a note in the documentation.
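(For anyone wanting to check their own cluster first, "system license show" lists the installed license packages, SnapMirror included:

cluster::> system license show

Exact package naming may differ between releases.)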

 

If anyone at NetApp can address this, that would be great.  I'd like to know if "volume move" can be used in future, or if I need to go back to good old Volume SnapMirror.

1 ACCEPTED SOLUTION

Tas

Just to close out this thread: I believe the issues I experienced were caused by a transient space condition.  Each volume move appears to check for available space prior to beginning the operation, but it of course has no control over other simultaneous moves to the same aggregate.  I believe the problem I had was caused by multiple volume moves to the same destination aggregate, but not by this alone.  I was moving volumes around to bring a new HA pair into the cluster (and retire an older one).

Many of my volumes have tremendous growth rates and are thin provisioned.  I've looked at some internal tickets and saw that I did run out of space on some destination aggregates while the volume moves were running.  I've experimented with this and found that the volume copy pauses at the out-of-space condition, which in my case happened a week or so after the move operations began.  In my test, I was able to recover by making aggregate space available (snapshot deletion and such).
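For anyone who wants to watch for the same condition, this is roughly what I used while testing (names are examples again):

cluster::> storage aggregate show -aggregate aggr_new -fields availsize, size
cluster::> volume move show -vserver vs1 -volume vol_data -fields state, percent-complete

A paused move shows up in the state field, and once I freed aggregate space (snapshot deletion and such) the move resumed on its own.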

I also believe this happened multiple times during the volume move.

* - But based on my testing, this would not have caused a problem, because the volume move suspends itself until more space becomes available.

 

* - So what happened?

-1 I don't have the foggiest idea;  however, by the time the users complained, the moves had completed and the source volumes had been irretrievably purged.

-2 From my testing, I was not able to cause any data loss by making the destination volumes run out of space.

-3 It may be that users overwrote their data from Previous Versions, or thought they had finished work which they hadn't.  No way to tell at this time.

 

Best practice for me (not suggesting anyone else use it): if I'm going to move an important volume, I will create a clone; this will cause volume move to delay purging the source until the clone becomes available.  Or, if in future I have enough space, I will take a point-in-time SnapMirror of the source, use a manual cutover, and update the SnapMirror before cutting over.  I think I will find that number 3 above was the reason for the perceived data loss.  (And as always, the squeaky wheels get the grease; there was only a small handful of users who complained.  It is just that they had high-level management backing, and complained vociferously to IT management.)
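In command terms, the two approaches look roughly like this (all names are examples, and SnapMirror relationship types and options vary by ONTAP release, so treat it as a sketch, not a recipe):

Clone to pin the source before the move:

cluster::> volume clone create -vserver vs1 -flexclone vol_data_pin -parent-volume vol_data
cluster::> volume move start -vserver vs1 -volume vol_data -destination-aggregate aggr_new

Point-in-time SnapMirror with a manual cutover:

cluster::> volume create -vserver vs1 -volume vol_data_new -aggregate aggr_new -type DP -size 10t
cluster::> snapmirror create -source-path vs1:vol_data -destination-path vs1:vol_data_new
cluster::> snapmirror initialize -destination-path vs1:vol_data_new
cluster::> snapmirror update -destination-path vs1:vol_data_new   (final update just before cutover)
cluster::> snapmirror break -destination-path vs1:vol_data_new   (then re-point clients to the new volume)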

 

With that, I would like to close this out.

TasP


21 REPLIES

zbrenner_1

Sorry, it was a mistake to hit "Me too".

 

br

 

Zoltan
