Snapcenter SQL Restore to alternate host - timing

bobalon · ‎2021-09-09

Hi there

We are using SnapCenter v4.4 and testing restore scenarios for a SQL DB. The DB is large: it has 4 data files (2TB each), each on a separate LUN/volume, and 1 log file on another separate LUN/volume. All are connected to physical hosts over iSCSI.

When restoring a SnapCenter backup to an alternate host we are seeing something very strange: the first LUN/volume restore completes in less than a minute, but the remainder take hours each, giving a total restore time of almost 12 hours.

Any ideas why we are seeing the time difference between lun/volume restore times when they are on the same cluster with what appears to be the same configuration?

Or any suggestions how we can get the other volumes to restore in minutes like the first?

Thank you

pedro_rocha · ‎2021-09-09

Hi,

Did you check the log for any strange message?

<installation_path>\Program Files\NetApp\SnapCenter\SMCore\log\SMCore_<id>.log

Also, did you check the operating system logs to see if there's any process consuming time in the restore process?

Regards,

Pedro

bobalon · ‎2021-09-10

Thank you for the response. I checked those and there is nothing logged until the restore either completes or times out.

It seems to disappear into a call to ModifyDBfileswithalternatepaths, which has no further logging until it returns. This is in the SCSQL log file.

Ontapforrum · ‎2021-09-09

It looks like, according to NetApp TR-4714, this is expected. A 1.5TB database restore operation generally takes about nine hours.

Page-20:
Restoring to an alternate host is a "copy-based" restore mechanism. Large databases might take more time to perform a restore operation because it’s a "streaming process".

For Dev/Test environments, to save restore (clone-split) time:
NetApp best practice is to leverage the cloning methodology to create a copy of the database on the same instance or an alternate instance. At a later time, a user can perform a clone split during an off-peak period or maintenance window to isolate the clone copy from the Snapshot copy and avoid any dependency on either of them in future. To set the restoration timeout value so that clone splits of large databases complete successfully, see the section titled, “Restoring a database by restoring to an alternate host option.”

Restoring a database by restoring to an alternate host option:
https://www.netapp.com/pdf.html?item=/media/12400-tr4714pdf.pdf

Why does flex volume clone split take a long time?
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/Why_does_flex_volume_clone_split_take_a_long_time

bobalon · ‎2021-09-10

So this would make sense if all of the volumes being restored had the same issue, but one 2TB volume restores in minutes, the others in hours.

I now believe this is because the quick volume and its target are on the same cluster node, while the others are split across nodes and so require a network copy of the data.

bobalon · ‎2021-09-10

Forgot to mention: the 9 hours is way off. I am seeing over 8TB restored in 10 hours, across 4 volumes.
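For a rough comparison, the two figures can be converted to average throughput. This is only a minimal sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant copy rate, and the function name is purely illustrative.

```python
def avg_throughput_mb_s(terabytes: float, hours: float) -> float:
    """Average throughput in MB/s, assuming 1 TB = 1e12 bytes and 1 MB = 1e6 bytes."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    megabytes = terabytes * 1e12 / 1e6
    return megabytes / (hours * 3600)


# Figure quoted from the TR: 1.5 TB in about nine hours.
tr_rate = avg_throughput_mb_s(1.5, 9)
# Figure observed in this thread: 8 TB across 4 volumes in 10 hours.
seen_rate = avg_throughput_mb_s(8, 10)

print(f"TR figure: {tr_rate:.0f} MB/s")
print(f"Observed:  {seen_rate:.0f} MB/s")
```

Under those assumptions the TR figure works out to roughly 46 MB/s, against about 222 MB/s for the restore described here.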

Ontapforrum · ‎2021-09-10

SQL restore time is longer for larger databases in general, but the duration (clone split) varies with the size of the DB (LUN) and with other factors, such as how busy the cluster node is during the restore. In terms of available resources on the cluster node, ONTAP gives preference to front-end reads/writes over restores.

1) Could you let us know the ONTAP version ?

2) Is the LUN on Spinning disk/Hybrid/SSD ?

3) Is the Volume containing the LUN "compressed" ?

4) During this restore - Do you have any replication or any other restores running in background ?

5) Could you share the resource utilization data for the Cluster Node/Volume (Latency/Mbs) during the restore process from OCUM/AIQM ?

bobalon · ‎2021-09-13

Hi there

thank you for your help

Answers are:-

1) Could you let us know the ONTAP version ? V9.5P3

2) Is the LUN on Spinning disk/Hybrid/SSD ? all on SSD

3) Is the Volume containing the LUN "compressed" ? No

4) During this restore - Do you have any replication or any other restores running in background ?

No, all other jobs on that volume/SVM were suspended during the restore.

The node/cluster will be running other replication jobs on different SVMs.

5) Could you share the resource utilization data for the Cluster Node/Volume (Latency/Mbs) during the restore process from OCUM/AIQM ?

Whilst the restore is in progress I see the latency increase on the node(s) involved in the current restore, from nominally <1ms up to 100ms. This is not read or write latency; it is shown as "Others" in the performance tab for the node. Once the restore is complete the latency drops back to normal.

We suspected that "Others" refers to inter-node network traffic, so we have moved the volumes around to make each restore an on-node memory copy (4 nodes in the cluster). This is being tested at present, but so far it does not seem to have improved the overall restore duration.

bobalon · ‎2021-09-17

Hi there

After much testing we noticed that during the restore the node "Others" latency was ballooning up to 100ms as SC copied the DB files back between nodes.

Therefore, we realigned the volume layout between the 2 servers so that the volumes (5 of them) were aligned on nodes, i.e. vol 1 on node 1, vol 2 on node 2, etc. When performing the restore the overall time did not improve (still circa 10 hours), but the "Other" latency figure for the target node did drop to circa 50ms.

So with the copy now between volumes on the same node, we believe that we have optimized this as far as we can. The fact that the read/write latency on the nodes is low (<1-5ms) whilst the copy is running suggests that this is an internal operation within the cluster that is being throttled or treated as a background task.

This test also confirmed that the first volume is not actually copied in minutes: at the Windows OS level the file and folder appear, but the underlying data is still being streamed back. We infer this from the fact that the file timestamp is initially the time of creation, but then reverts to the restore snapshot time once the restore for that file is complete.

Hence, when restoring a DB between servers, the time taken appears to be determined by the underlying cluster performance and by a throttling or background-task priority set by NetApp. So for large databases, be prepared to wait if you have to roll back a DB that is in an AOAG and then copy it to other server(s).

MaGr · ‎2021-09-22

Hi,

I'm facing the same issue with SC 4.5.

We are trying to restore an 11TB DB to an alternate host. 3.5TB were restored in 5 hours.

I tested the read performance, it's possible to do 450MB/s. The storage system is running at 5%.

SATA cluster, physical DB host with 2x16G FC.

I don't see any traffic on the LUNs, don't know how the data are copied.

Marcus

bobalon · ‎2021-09-22

Marcus

I believe the data is copied on the controller/cluster. Check the node performance and look at the "Other" latency figure. My belief is that it runs as a background task, so it will take as long as it needs, but it is throttled within ONTAP.

Just make sure you have raised the default 3-hour SnapCenter timeout (RESTTimeout in the config file on the host), otherwise the task may be cancelled before it completes.
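To get a feel for how much headroom the timeout needs, the progress figures reported above (3.5TB of 11TB in 5 hours) can be extrapolated. This is only a rough sketch: it assumes the copy rate stays constant, the function names are illustrative, and the 1.5x safety margin is an arbitrary assumption rather than a NetApp recommendation. It does not read or write any SnapCenter config.

```python
DEFAULT_TIMEOUT_HOURS = 3.0  # default SnapCenter timeout mentioned in this thread


def estimated_restore_hours(total_tb: float, restored_tb: float, elapsed_hours: float) -> float:
    """Extrapolate total restore time from progress so far, assuming a constant rate."""
    if restored_tb <= 0 or elapsed_hours <= 0:
        raise ValueError("restored_tb and elapsed_hours must be positive")
    rate_tb_per_hour = restored_tb / elapsed_hours
    return total_tb / rate_tb_per_hour


def needs_longer_timeout(est_hours: float,
                         timeout_hours: float = DEFAULT_TIMEOUT_HOURS,
                         margin: float = 1.5) -> bool:
    """True if the estimate, padded by a hypothetical safety margin, exceeds the timeout."""
    return est_hours * margin > timeout_hours


# Figures reported by Marcus: 11 TB database, 3.5 TB restored after 5 hours.
est = estimated_restore_hours(11, 3.5, 5)
print(f"Estimated total restore time: {est:.1f} hours")
print("Raise the timeout" if needs_longer_timeout(est) else "Default timeout looks sufficient")
```

With those numbers the estimate is far beyond 3 hours, so the timeout would need raising well before starting a restore of that size.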

hth

MaGr · ‎2021-09-23

Hey,

I applied QoS policies to all the LUNs involved. There is no traffic on them.

I have no idea how they copy the data. Any robocopy job (that's what it is in the end) is way faster than that.

I opened a case for investigation.

Marcus

bobalon · ‎2021-09-24

Please let us know how you get on.

MaGr · ‎2021-09-27

will do that.

I think they will just claim "it's according to documentation".

Marcus