ONTAP Discussions
I attempted a file snap restore this morning to restore a virtual server.
After running the command I went to view the folder contents using the vSphere GUI. It displayed "Searching datastore..." and never returned a listing.
I logged in to a vSphere host through PuTTY and was able to see the file there, so I assumed the restore was still in progress. Eventually other volumes (vSphere and Oracle) started to become inaccessible on the filer; some, but not all, displayed as inactive in vCenter. A co-worker who is the principal administrator failed the filer over and rebooted it.
This fixed the problem until they failed it back, at which point the restore apparently started back up. I deleted the file in the PuTTY session and the symptoms went away.
The volume that file resides on does have a SnapMirror relationship, and a scheduled replication was attempted during the restore. The transfer failed, which I guess is normal, and replication completed after I deleted the file.
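For anyone wondering later, a single-file SnapRestore on 7-Mode is normally run from the filer console along these lines; the volume name, snapshot name and file path here are only placeholders, not the exact ones from my environment:
filer> snap list vmware_vol
(pick the Snapshot copy to restore from)
filer> snap restore -t file -s nightly.0 /vol/vmware_vol/myvm/myvm-flat.vmdk
(add -r /vol/vmware_vol/restore/myvm-flat.vmdk to restore the file to a different path instead of in place)
filer> snapmirror status vmware_vol
(shows whether a scheduled SnapMirror transfer is running against the same volume)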
Should performing a file snap restore cause this? I'm not clear from reading the Data Protection documentation, which I've pasted below.
Prerequisites for using SnapRestore
You must meet certain prerequisites before using SnapRestore.
• SnapRestore must be licensed on your storage system.
• There must be at least one Snapshot copy on the system that you can select to revert.
• The volume to be reverted must be online.
• The volume to be reverted is not being used for data replication.
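As an aside, each of those prerequisites can be checked from the 7-Mode console before you start; "myvol" below is just an example volume name, and the exact output varies by release:
filer> license
(confirms the snaprestore license is installed)
filer> snap list myvol
(lists the Snapshot copies you could revert to)
filer> vol status myvol
(confirms the volume is online)
filer> snapmirror status myvol
(shows whether the volume is currently part of a replication transfer)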
General cautions for using SnapRestore
Before using SnapRestore, ensure that you understand the following facts.
• SnapRestore overwrites all data in the file or volume. After you use SnapRestore to revert to a
selected Snapshot copy, you cannot undo the reversion.
• If you revert to a Snapshot copy created before a SnapMirror Snapshot copy, Data ONTAP can no
longer perform an incremental update of the data using the snapmirror update command.
However, if there is any common Snapshot copy (SnapMirror Snapshot copy or other Snapshot
copy) between the SnapMirror source and SnapMirror destination, then you should use the
snapmirror resync command to resynchronize the SnapMirror relationship.
If there is no common Snapshot copy between the SnapMirror source and SnapMirror destination,
then you should reinitialize the SnapMirror relationship.
• Between the time you enter the snap restore command and the time when reversion is completed,
Data ONTAP stops deleting and creating Snapshot copies.
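Regarding the SnapMirror caution above: once you have reverted past the common SnapMirror Snapshot copy, the choice between resync and reinitialize looks roughly like this when run on the destination filer (the source and destination names are illustrative):
dstfiler> snapmirror resync -S srcfiler:src_vol dst_vol
(re-establishes the relationship from the newest common Snapshot copy, if one still exists)
dstfiler> snapmirror initialize -S srcfiler:src_vol dst_vol
(full baseline transfer, only needed when no common Snapshot copy remains; the destination volume must be restricted first)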
I was hoping someone would have replied to your posting. I am experiencing the same issue. I am on Data ONTAP 7.3.2 and use VSC 2.1.1 to do my backups. I am currently only testing this, as it has not worked properly for me yet. Basically I have three NFS volumes, one for config files and two others for VMDKs. I am able to do a backup perfectly fine, but when I try to restore that snapshot it saturates the whole interface that connects to the SAN and causes the volumes to become disconnected in vSphere. This was for a 25 GB VMDK and it wasn't even finished after 45 minutes. I had to reboot my SAN (FAS2020) in order to recover the volumes and get the restore to stop.
I had to reboot because I couldn't find an actual way to cancel the restore. Does anyone know how to do that? Also, why would this crash my whole connection? There are no requirements indicating it needs its own NIC for the transfers.
The restore shouldn't have any impact on an interface, as a snapshot restore is done entirely internally. Basically, the inode pointers are changed to point to the file blocks as they were at the time of the snapshot, rather than the current file blocks.
Any update on this strange behaviour? I had the same issue with ONTAP 8.1.4P1 (7-Mode) on a FAS6240 (not a little one, as you can see).
It started after an SMO restore operation on a cloned volume for which single file restore was needed/selected. The database has a lot of data files (a very big DB)! People started complaining, especially about CIFS access. When I looked at the stats later, there were latency issues for CIFS and iSCSI (FCP and NFS didn't suffer from it). Strange, because the Oracle environment runs over NFS. Killing the SMO process and later halting the host didn't solve anything. Only after offlining the volume did the controller behave normally again. Unfortunately, when I online it again, the single-file SnapRestore starts again :(.
Why is there such an impact? How can I stop this without destroying my volume?
SMO Log:
...
--[ INFO] SMO-07200: Beginning restore of database "NGDB"
...
-[ INFO] SD-00010: Beginning single file restore of file(s) [/ngdbhome/ngdb/DATA/CTX/Ctx07.dbf, /ngdbhome/ngdb/DATA/CciLob/22/CciLobData22.dbf5, /ngdbhome/ngdb/DATA/CciLob/22/CciLobData22.dbf6, /ngdbhome/ngdb/DATA/CciLob/22/CciLobData22.dbf3,
...
Messages log:
...
Thu Dec 4 16:38:07 CET [NETAPPXX:wafl.sfsr.done:notice]: Single-file snaprestore of inode 26253 (snapid 19, volume Test_CCIv36_NGDB_Data_clone) to inode 9681 has completed.
Thu Dec 4 16:38:07 CET [NETAPPXX:wafl.scan.start:info]: Starting redirect on volume Test_CCIv36_NGDB_Data_clone.
Thu Dec 4 16:42:05 CET [NETAPPXX:cifs.oplock.break.timeout:warning]: CIFS: An oplock break request to station 10.230.128.31() for filer NETAPPXX, share rmgmailarchive01indexes$, file \Indexes02\166741376BDAE5A4F9BFB6D82329C4E7B_5316\live\log.sqlt has timed out.
Thu Dec 4 16:53:08 CET [NETAPPXX:wafl.sfsr.done:notice]: Single-file snaprestore of inode 13360 (snapid 19, volume Test_CCIv36_NGDB_Data_clone) to inode 2037 has completed.
Thu Dec 4 16:53:16 CET [NETAPPXX:wafl.sfsr.done:notice]: Single-file snaprestore of inode 23958 (snapid 19, volume Test_CCIv36_NGDB_Data_clone) to inode 24675 has completed.
...
There was not much disk activity, but the filer was doing lots of WAFL_Ex(Kahu):
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
99% 722 5593 0 7793 6263 14818 175412 243976 0 0 0s 99% 68% : 32% 0 1438 40 3194 113031 299 133
99% 1102 5684 0 7760 8756 24136 16296 47 0 0 0s 99% 0% - 14% 0 650 324 3549 350 1936 101
99% 912 5241 0 6804 9035 13149 9500 0 0 0 0s 100% 0% - 9% 0 645 6 4084 237 33 0
99% 1030 3820 0 5412 6953 8243 6785 0 0 0 0s 100% 0% - 9% 0 536 26 5784 500 89 65
99% 3901 4294 0 8571 192951 12448 19609 63 0 0 0s 99% 0% - 19% 5 298 73 2183 181 1681 0
99% 732 4715 0 6133 5367 25579 17283 0 0 0 4 100% 0% - 10% 357 296 33 3735 200 87 178
99% 1184 5176 0 7527 6355 26430 86744 0 0 0 4 100% 0% - 18% 166 950 51 1825 62865 194 65
99% 1169 5427 0 7852 8028 19993 89951 47 0 0 4 100% 0% - 16% 1 1245 10 2806 78600 37 0
99% 2263 5952 0 9059 62261 19515 84106 0 0 0 4 100% 0% - 15% 0 832 12 1521 74958 39 0
98% 5364 4812 0 10627 205358 18857 41395 16 0 0 4 99% 0% - 17% 6 395 50 1609 20012 1524 0
99% 2919 5799 0 9701 17548 26488 192051 269 0 0 4 99% 44% Tn 18% 0 877 106 2021 60060 10643 0
ANY1+ ANY2+ ANY3+ ANY4+ ANY5+ ANY6+ ANY7+ ANY8+ AVG CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
100% 47% 24% 12% 7% 4% 3% 2% 26% 23% 25% 26% 24% 26% 31% 26% 29% 36% 0% 0% 9% 7% 4% 44% 79%( 55%) 0% 0% 9% 8% 14% 1% 7893 0%
100% 54% 34% 23% 16% 11% 7% 5% 32% 31% 30% 36% 34% 35% 33% 31% 28% 30% 0% 0% 12% 20% 6% 42% 100%( 58%) 7% 0% 12% 15% 14% 1% 7709 21%
100% 95% 78% 59% 42% 30% 20% 14% 56% 51% 50% 56% 61% 49% 53% 51% 78% 34% 0% 0% 22% 85% 6% 25% 161%( 74%) 25% 0% 14% 53% 23% 1% 10204 100%
100% 54% 30% 18% 11% 6% 4% 3% 29% 32% 26% 31% 27% 28% 30% 30% 30% 26% 0% 0% 9% 22% 5% 34% 91%( 65%) 4% 0% 11% 19% 12% 1% 8147 100%
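In case it helps anyone else chasing the same symptoms: the two outputs above look like sysstat -x and sysstat -M, and the single-file snaprestore work itself can be watched from the console while it runs. These are standard 7-Mode commands, but treat the exact output as version-dependent, and the volume name is just the one from my logs:
filer> priv set advanced
filer*> wafl scan status Test_CCIv36_NGDB_Data_clone
(lists the running WAFL scanners, including the redirect scan that starts after each single-file snaprestore completes)
filer*> priv set admin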
Is your datastore accessed via iSCSI, FCP or NFS?
You say you attempted a snap restore; is the snap you're restoring from a standard scheduled volume snapshot?
What command did you use to restore?
Should anyone ever stumble upon this article in a frantic panic trying to get their filer to respond to NFS/CIFS/iSCSI requests while a single-file snap-restore is in progress (this, by the way, is what VSC uses for VM restores by default on NFS, which blows my mind): you can cancel the single-file snap-restore process by deleting the destination file or directory you are restoring to. If you are doing an in-place restore, delete the file or folder you are trying to restore and manually copy the wanted version back out of the volume's ".snapshot" directory.
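To make that concrete: from an NFS client (or the ESXi shell) with the datastore mounted, the cancel-and-copy-back looks something like the following. The datastore, VM and snapshot names are only placeholders, and this assumes the volume's nosnapdir option is off so the .snapshot directory is visible:
# deleting the destination file stops the in-flight single-file snaprestore
rm /vmfs/volumes/nfs_datastore/myvm/myvm-flat.vmdk
# then pull the wanted version back out of the snapshot directory yourself
cp /vmfs/volumes/nfs_datastore/.snapshot/nightly.0/myvm/myvm-flat.vmdk /vmfs/volumes/nfs_datastore/myvm/myvm-flat.vmdk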
We had this same issue. We kicked off a VM restore using VSC, and the filer ground to a halt; VM datastores went offline, as did iSCSI etc., and as such most major applications went offline.
Unfortunately we didn't find this thread while it was happening, so we didn't know to delete the destination folder, and while faffing around trying to find a way to recover we essentially ended up waiting it out, a 4.5-hour outage.
It turns out this KB covers this behaviour: https://kb.netapp.com/support/index?page=content&id=2023372&locale=us
The upshot is, NetApp provide this full VM recovery option in their tool, but in short: don't use it. Do it the long way: mount a snapshot and drag the VM out manually. There's no 'fix' aside from going to clustered ONTAP or using a better backup and recovery product.
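For what it's worth, on an NFS datastore the 'long way' can be as simple as mounting the Snapshot copy read-only as its own datastore and copying the VM folder out through the datastore browser. On a reasonably recent ESXi host that is roughly the following; the filer, volume and snapshot names are placeholders:
esxcli storage nfs add -H filer01 -s /vol/vm_vol/.snapshot/nightly.0 -v vm_vol_snap
# copy the VM folder out of the read-only vm_vol_snap datastore, then clean up:
esxcli storage nfs remove -v vm_vol_snap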