Hi there
We are using SnapCenter v4.4 and testing restore scenarios for a SQL DB. The DB is large and has 4 data files (2 TB each), each on a separate LUN/volume, plus 1 log file on another separate LUN/volume. All are connected to physical hosts over iSCSI.
When restoring a SnapCenter backup to an alternate host we are seeing something very strange: the first LUN/volume restore is actioned in less than a minute, but the remainder take hours each, meaning we have a restore time of almost 12 hours.
Any ideas why we are seeing the difference between LUN/volume restore times when they are on the same cluster with what appears to be the same configuration?
Or any suggestions on how we can get the other volumes to restore in minutes like the first?
Thank you
Hi,
Did you check the log for any strange messages?
<installation_path>\Program Files\NetApp\SnapCenter\SMCore\log\SMCore_<id>.log
Also, did you check the operating system logs to see if there's any process consuming time in the restore process?
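If it helps, a quick way to sweep those SMCore logs for anything suspicious is something like this minimal sketch (the path, file pattern, and keywords are assumptions to adjust for your install):

from pathlib import Path

# Assumed default install path and file pattern; adjust for your installation.
LOG_DIR = Path(r"C:\Program Files\NetApp\SnapCenter\SMCore\log")
KEYWORDS = ("ERROR", "WARN", "Exception", "Timeout")

for log_file in sorted(LOG_DIR.glob("SMCore_*.log")):
    with log_file.open(errors="ignore") as fh:
        for line_no, line in enumerate(fh, start=1):
            if any(key in line for key in KEYWORDS):
                print(f"{log_file.name}:{line_no}: {line.rstrip()}")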
Regards,
Pedro
Thank you for the response. I checked those and there is nothing until it either completes or times out.
It seems to disappear into a call to ModifyDBfileswithalternatepaths, which has no further logging until it returns. This is in the SCSQL log file.
Looks like, according to NetApp TR-4717, this is expected. A 1.5TB database restore operation generally takes about nine hours.
Page 20:
Restoring to an alternate host is a "copy-based" restore mechanism. Large databases might take more time to perform a restore operation because it's a "streaming process".
For Dev/Test environments, to save restore (clone-split) time, NetApp best practice is to leverage the cloning methodology to create a copy of the database on the same instance or an alternate instance. At a later time, a user can perform a clone split during an off-peak period or maintenance window to isolate the clone copy from the Snapshot copy and avoid any dependency on either of them in future. To set the restoration timeout value so that large databases complete the clone split successfully, see the section titled "Restoring a database by restoring to an alternate host option."
Restoring a database by restoring to an alternate host option:
https://www.netapp.com/pdf.html?item=/media/12400-tr4714pdf.pdf
Why does flex volume clone split take a long time?
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/Why_does_flex_volume_clone_split_take_a_long_time
So this would make sense if all of the volumes being restored had the same issue, but one 2TB volume restores in minutes, the others in hours.
I now believe this is because the quick volume and its target are on the same cluster node and the others are split across nodes, so requiring a network copy of the data.
Forgot to mention that the 9 hours is way off: I am seeing over 8 TB in 10 hours, across 4 volumes.
SQL restore time takes longer for larger databases in general, but the duration (clone split) varies depending upon the size of the DB (LUN) and also other factors, such as how busy your cluster node is during the restore process. In terms of available resources on the cluster node, ONTAP will give preference to front-end reads/writes over restores.
1) Could you let us know the ONTAP version?
2) Is the LUN on spinning disk/hybrid/SSD?
3) Is the volume containing the LUN "compressed"?
4) During this restore, do you have any replication or any other restores running in the background?
5) Could you share the resource utilization data for the cluster node/volume (latency/MBps) during the restore process from OCUM/AIQM?
Hi there
Thank you for your help.
Answers are:
1) Could you let us know the ONTAP version? 9.5P3
2) Is the LUN on spinning disk/hybrid/SSD? All on SSD
3) Is the volume containing the LUN "compressed"? No
4) During this restore, do you have any replication or any other restores running in the background?
No, all other jobs on that volume/SVM are suspended during the restore.
The node/cluster will be running other replication jobs on different SVMs.
5) Could you share the resource utilization data for the cluster node/volume (latency/MBps) during the restore process from OCUM/AIQM?
Whilst the restore is in progress I see the latency on the node(s) involved in the current restore increase from nominally <1 ms to up to 100 ms. This is not read or write latency but is shown as "Other" in the performance tab for the node. Once the restore is complete the latency drops back to normal.
We suspected that "Other" refers to inter-node network traffic, so we have moved the volumes around to make each restore an on-node copy (4 nodes in the cluster). This is being tested at present but at the moment does not seem to have improved the overall restore duration.
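If you want to capture that "Other" latency over the course of a restore rather than watching the performance tab, something like the sketch below could log it periodically. The cluster address, credentials, and the exact CLI command/counters are assumptions; adjust them for your cluster and ONTAP version.

import time
import paramiko  # third-party SSH client

CLUSTER = "cluster-mgmt.example.com"          # assumed cluster management LIF
USER, PASSWORD = "admin", "********"          # assumed credentials
# Assumed command; swap in whichever statistics/qos command exposes the counters you care about.
COMMAND = "statistics show-periodic -iterations 1"

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(CLUSTER, username=USER, password=PASSWORD)
try:
    for _ in range(60):                        # sample for roughly 5 minutes
        _, stdout, _ = ssh.exec_command(COMMAND)
        print(time.strftime("%H:%M:%S"))
        print(stdout.read().decode())
        time.sleep(5)
finally:
    ssh.close()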
Hi there
After much testing we noticed that during the restore the node "Other" latency was ballooning up to 100 ms as SnapCenter copied the DB files back between nodes.
Therefore, we realigned the volume layout between the 2 servers so that the volumes (5 of them) were aligned on nodes, i.e. vol 1 on node 1, vol 2 on node 2, etc. When performing the restore the overall time did not improve (still circa 10 hours), but the "Other" latency figure for the target node did drop to circa 50 ms.
So with the copy now between volumes on the same node we believe that we have optimized this as far as we can, and the fact that the read/write latency on the nodes whilst the copy is running is low (<1-5 ms) suggests that this is an internal operation within the cluster that is being throttled or treated as a background task.
This test also confirmed that the first volume is not really copied in minutes: at the Windows OS level the file and folder appear immediately, but the underlying data is still being streamed back. We infer this from the fact that the file timestamp is initially the time of creation, but then reverts to the restore snapshot time once the restore for that file is complete.
Hence, when restoring a DB between servers, the time taken appears to be determined by the underlying cluster performance and a throttling or background-task priority set by NetApp. So for large databases, be prepared to wait if you have to roll back a DB that is in an AOAG and then copy it to other server(s).
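If you want to watch for that timestamp flip yourself, here is a minimal sketch that polls a restored file's modification time (the path is just an example, and the behaviour is based on our observation above rather than anything documented):

import time
from datetime import datetime
from pathlib import Path

DB_FILE = Path(r"E:\SQLData\MyDatabase_Data1.mdf")  # example path to a restored data file

previous = None
while True:
    mtime = datetime.fromtimestamp(DB_FILE.stat().st_mtime)
    if mtime != previous:
        print(f"{datetime.now():%H:%M:%S}  file mtime is now {mtime}")
        previous = mtime
    time.sleep(60)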
Hi,
I'm facing the same issue with SC 4.5.
We are trying to restore an 11 TB DB to an alternate host. 3.5 TB were restored in 5 hours.
I tested the read performance; 450 MB/s is possible. The storage system is running at 5% utilization.
SATA cluster, physical DB host with 2x 16G FC.
I don't see any traffic on the LUNs, so I don't know how the data is copied.
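(If anyone wants to reproduce a rough sequential read test like the 450 MB/s figure above, something along these lines works; the file path and block size are examples, and OS caching can inflate the number.)

import time
from pathlib import Path

TEST_FILE = Path(r"F:\SQLData\LargeTestFile.mdf")  # example: a large file on the LUN under test
BLOCK = 8 * 1024 * 1024                            # 8 MiB reads

read_bytes = 0
start = time.perf_counter()
with TEST_FILE.open("rb") as fh:
    while chunk := fh.read(BLOCK):
        read_bytes += len(chunk)
elapsed = time.perf_counter() - start
print(f"{read_bytes / elapsed / 1024**2:.0f} MB/s over {read_bytes / 1024**3:.1f} GiB")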
Marcus
Marcus
I believe the data is copied on the controller/cluster - check the node performance and look at the "Other" latency figure. My belief is it is run as a background task, so it will take as long as it needs but is throttled within ONTAP.
Just make sure you have altered the default 3 hr SnapCenter timeout (RESTTimeout) in the config file on the host, otherwise the task may be cancelled before completing.
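A minimal sketch of bumping that value, assuming it lives as an appSettings key in SMCoreServiceHost.exe.config and is specified in milliseconds (both assumptions - check the NetApp KB for your version, and restart the SMCore service afterwards):

import xml.etree.ElementTree as ET

# Assumed path and key layout; verify against the NetApp KB for your SnapCenter version.
CONFIG = r"C:\Program Files\NetApp\SnapCenter\SMCore\SMCoreServiceHost.exe.config"
NEW_TIMEOUT_MS = str(12 * 60 * 60 * 1000)  # e.g. 12 hours, assuming the value is in milliseconds

tree = ET.parse(CONFIG)
for add in tree.getroot().iter("add"):
    if add.get("key") == "RESTTimeout":
        print(f"RESTTimeout: {add.get('value')} -> {NEW_TIMEOUT_MS}")
        add.set("value", NEW_TIMEOUT_MS)
tree.write(CONFIG, encoding="utf-8", xml_declaration=True)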
hth
Hey,
I applied QoS policies to all involved LUNs. There is no traffic on them.
I have no idea how they copy the data. Each robocopy job (that's what it is in the end) is way faster than that.
I opened a case for investigation.
Marcus
Please let us know how you get on.
Will do that.
I think they just claim "it's according to documentation".
Marcus
