Solved: Can FabricPool with a Snapshot policy ensure that I can restore data from any previous day?

netappmagic · ‎2021-11-03

Let's say I have an NFS volume with FabricPool and a snapshot policy enabled. The minimum cooling period is 2 days, which means any cold blocks that have not been modified will be tiered to an S3 bucket. The snapshot policy is set to one snapshot per day.

Let's also assume that the volume got corrupted or infected (encrypted) by ransomware on Friday. Can I then use the snapshot taken on Thursday, Wednesday, or any day before Friday to restore the data?

My answer is probably not, because the metadata or the active data may have been corrupted. Not all blocks belonging to the day I need would necessarily have been tiered to S3, so I could not restore all of them. FabricPool with a Snapshot policy is not a backup. Am I right?

Can the experts here please confirm or shed some light on this for me?

elementx · ‎2021-11-03

FabricPool doesn't play a role in client-side encryption of data (FabricPool data is already encrypted by ONTAP before it arrives at S3), so as far as recovery from encryption or ransomware is concerned, FabricPool can be ignored (the situation is the same with or without it).

> Let's also assume that the volume got corrupted or infected (encrypted) by ransomware on Friday

The volume itself can't get "corrupted" because of ransomware. ONTAP itself would have to get infected for that to happen.

There will be one or more NFS shares, and files on the NFS share(s) on that volume could get deleted or encrypted (if that isn't stopped by the anti-ransomware feature [1] built into ONTAP >= 9.10.1).

[1] https://www.netapp.com/blog/prevent-ransomware-spread-ONTAP/

If you have a FlexVol snapshot, only files on the NFS share(s) that live on that volume can get encrypted by ransomware. The filesystem (WAFL) itself cannot. If data on NFS shares gets encrypted or deleted, you can revert the volume to the last snapshot taken before the NFS clients got infected, or clone it and rescan the clone data with anti-ransomware software to make sure it's clean.
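To illustrate "the last snapshot before the clients got infected" with the daily schedule from the question, here is a minimal Python sketch. The snapshot names, timestamps, and the function name are made up for the example; nothing here calls ONTAP, it only shows the selection logic.

```python
from datetime import datetime
from typing import Dict, Optional


def last_clean_snapshot(snapshots: Dict[str, datetime],
                        infected_at: datetime) -> Optional[str]:
    """Return the name of the newest snapshot taken strictly before infected_at."""
    candidates = {name: ts for name, ts in snapshots.items() if ts < infected_at}
    if not candidates:
        return None  # every snapshot was taken after the infection started
    return max(candidates, key=candidates.get)


# Hypothetical daily snapshots (names and times are assumptions for illustration).
daily = {
    "daily.2021-11-03_0010": datetime(2021, 11, 3, 0, 10),
    "daily.2021-11-04_0010": datetime(2021, 11, 4, 0, 10),
    "daily.2021-11-05_0010": datetime(2021, 11, 5, 0, 10),
}

# Infection assumed to start Friday 2021-11-05 at 09:30.
print(last_clean_snapshot(daily, datetime(2021, 11, 5, 9, 30)))
```

In practice the hard part is knowing when the infection really started. If you aren't sure, pass an earlier `infected_at` and verify the candidate by cloning and scanning it, as described above.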

See TR-4572 for details on ONTAP releases before 9.10.1 (the release in which this built-in detection becomes available):

https://www.netapp.com/media/7334-tr4572.pdf

> FabricPool with Snapshots policy is not backup. Am I right?

Snapshots are not backups. Fabric Pool is not a backup tool. Fabric Pool with Snapshots is also not a backup.

But you don't need a backup to restore a snapshot.

netappmagic · ‎2021-11-04

Thanks for clarifying a lot of issues here for me. One more follow-up.

If I implement SnapVault, it could back up snapshots to a remote site, and I could restore uninfected data from earlier snapshots on the remote site. Would that be any better than relying on local snapshots?

I know this method could save space on the primary site and keep all data should DR happen. But purely from the perspective of protecting data from a ransomware attack, I would say there is no difference, because the infected data will be replicated to the remote site as well by SnapVault.

Am I right?

elementx · ‎2021-11-04

That (replication of Snapshots + SnapMirror + SnapVault-ing them on the remote site) is a valid design pattern for several purposes. For example, with it not only do you have a backup (replica copy of volumes from Origin), but you also can SnapVault those snapshots at SnapMirror Destination.

Now you may ask what the difference is between using CVO/FP (with the Snapshot tiering policy) and SnapMirror + SnapVault in terms of ransomware protection. There isn't much, in the sense that you can always restore from a snapshot, but the extra features are (a) with SnapVault you are also protected from various other kinds of destruction, and (b) with SnapVault at a remote location, you also get a backup and the ability to move service elsewhere if you have to (example: ransomware takes down all Active Directory / DNS servers at the Origin; your Snapshots can be restored, but no SMB client can connect as ADS is dead).

There are also other reasons; it really depends.

One example might be where you want SnapVault, not just Snapshots. Another is where you may have a big/fast/expensive Origin ONTAP and a lower cost Destination ONTAP, so you want to keep more snapshots, and longer, at the Destination, whereas at the Origin you may keep just 10-20 for the current week.
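As a rough illustration of "few snapshots at the Origin, more and longer at the Destination", here is a small Python sketch. The retention counts (14 and 90) and the function name are assumptions picked for the example, not recommendations, and the code doesn't talk to ONTAP.

```python
from datetime import date, timedelta
from typing import List, Tuple


def split_by_retention(snapshot_dates: List[date],
                       keep: int) -> Tuple[List[date], List[date]]:
    """Split snapshots into (kept, expired), keeping the `keep` newest ones."""
    ordered = sorted(snapshot_dates, reverse=True)  # newest first
    return ordered[:keep], ordered[keep:]


# 35 hypothetical daily snapshots starting 2021-10-01.
start = date(2021, 10, 1)
dailies = [start + timedelta(days=i) for i in range(35)]

origin_kept, origin_expired = split_by_retention(dailies, keep=14)
dest_kept, dest_expired = split_by_retention(dailies, keep=90)

print(len(origin_kept), len(origin_expired))  # Origin: short retention
print(len(dest_kept), len(dest_expired))      # Destination: long retention
```

With these numbers the Origin can only go back two weeks, while the Destination still holds every daily snapshot, which is the point of keeping the longer history on the cheaper system.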

And you can also use FP at the Destination - see "Shrink the secondary storage footprint (All)" on page 8 of TR-4598 Fabric Pool Best Practices - https://www.netapp.com/media/17239-tr4598.pdf

Some customers also use the pattern you mention (SM + SV) to back up data to tape. You can't do that in the cloud, but if you SnapMirror + SnapVault to on-prem, you could back up that on-prem data to tape. (This isn't to say "tape is better" or anything like that; it's simply a requirement that some customers have.)

heightsnj · ‎2021-12-14

@elementx wrote:
(b) with SnapVault at a remote location, you also get a backup and the ability to move service elsewhere if you have to (example: Ransomware takes down all Active Directory / DNS servers at the Origin - your Snapshots can be restored, but no SMB client can connect as ADS is dead).

@elementx If you can, please expand on this point; it sounds important.

If ransomware takes down AD/DNS at the origin, can't we restore them from previous snapshots, either from FP with the Snapshot policy or from SV? And if we can restore AD/DNS, why can no SMB client connect? I don't quite understand.

elementx · ‎2021-12-14

ADS (and NTP) are very sensitive to time, and if you bring an outdated (i.e. image backup or snapshot) authoritative AD back online, it may confuse its replicas (and all the clients that use it).

And yes, if ADS is down then you cannot connect with SMB because SMB authentication depends on Active Directory - ONTAP isn't where SMB accounts and passwords are stored; like all modern SMB servers it integrates with directory services to check (authenticate) users, find group memberships, etc. How would you handle the situation where you restore ADS snapshot from last week and can't log in because you changed a complex password earlier this week and deleted the old one from password manager? There's no solution for such problems and there's a lot of time-sensitive information in ADS, not just accounts and passwords.

ONTAP still provides the ability to use local accounts, it's just not a great idea.

But if you want to use Vserver with local accounts, here is some ONTAP and local SMB accounts information:

1) https://docs.netapp.com/us-en/ontap/smb-config/create-local-user-accounts-task.html

2) https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/Are_Local_Users_and_Groups_Supported_for_CIFS%2F%2FSMB_in_ONTAP%3F

Because of the confusion that occurs when an outdated ADS comes back online, you'll find recommendations to restore ADS with NICs disconnected.

Here's an illustration of such challenges:

https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/detect-and-recover-from-usn-rollback

How to restore individual ADS instance: this article is quite good, so perhaps check it out and then see related MS Windows docs:

http://woshub.com/restore-active-directory-dc-from-backup/

Sometimes you may need to restore an entire Forest:

https://docs.microsoft.com/en-us/windows-server/identity/ad-ds/manage/ad-forest-recovery-determine-how-to-recover

As there are differences in details of different AD versions and what's being restored (one server or entire forest) and where (Hyper-V has some features specific to AD snapshots and restores), I think you'd have to check all relevant MS information and decide for yourself.

If you have Hyper-V you could (as an example) take an ADS VM snapshot in Hyper-V and then take a snapshot of the Hyper-V disk on ONTAP. So just create two schedules, one in Hyper-V and another in ONTAP 10 minutes later. That way you can always restore Hyper-V from ONTAP even if the original Hyper-V was infected and corrupted, and then use the MS procedure to restore ADS from that Hyper-V snapshot. Similar plans can be made for VMware. If you're on bare metal, then you could still take snapshots, but you'd have to be careful when and how you restore them.

Or you could set up a small SMB backup share on ONTAP (possibly even on a dedicated vServer with local accounts) and schedule a script that dumps an AD backup (or backups from multiple servers) to that share. That gives you an "application backup", and the share could be snapshotted on a regular basis.

http://woshub.com/backup-active-directory-domain-controller/

Then you would need just a generic (sysprep'd) Windows VM that you can stand up in minutes and restore this ADS backup to. It goes without saying, you could and should check best practices and test that in an isolated (w/o external network access) environment.

Note: we don't control 3rd party domains or the content on them. For official advice on Windows, please check with MS and their partners.

heightsnj · ‎2021-12-14

Thanks! I will have to read through your message multiple times to fully understand.

But, in short, if ransomware takes down ADS/DNS, then restoring them from Snapshots won't help; we will have to switch ADS to the remote site (the SV destination), and SMB clients will then connect there. Is my understanding correct?

elementx · ‎2021-12-14

Restoring from a snapshot can help you, but ONTAP (or VMware) cannot know what's going on out there (if all ADS VMs are being restored from a snapshot, or just some) or even that the VM snapshot you're restoring is an Active Directory VM.

What it can guarantee is an exact image of the VM as it was when you took that snapshot.

From there, you need to assess the situation: am I restoring all or just one of the ADS nodes?

Am I on Hyper-V, in which case (assuming it got infected at 1pm) I can restore an ONTAP snapshot taken at noon, and then from that restored Hyper-V disk restore the ADS VM "software" snapshot taken with Hyper-V at 11:50am?
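To make that two-layer choice concrete, here is a minimal Python sketch. It assumes two schedules (Hyper-V first, ONTAP 10 minutes later) and uses made-up timestamps and function names. The key point is that the Hyper-V snapshot must be older than the ONTAP snapshot you restore, otherwise it isn't contained in it.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def newest_before(times: List[datetime], cutoff: datetime) -> Optional[datetime]:
    """Return the newest timestamp strictly earlier than cutoff, or None."""
    earlier = [t for t in times if t < cutoff]
    return max(earlier) if earlier else None


def plan_restore(ontap_snaps: List[datetime],
                 hyperv_snaps: List[datetime],
                 infected_at: datetime) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Pick the ONTAP snapshot to restore, then the Hyper-V snapshot inside it."""
    ontap = newest_before(ontap_snaps, infected_at)
    if ontap is None:
        return None, None
    # Only Hyper-V snapshots taken before the ONTAP snapshot exist inside it.
    return ontap, newest_before(hyperv_snaps, ontap)


day = (2021, 12, 14)  # hypothetical date
ontap_snaps = [datetime(*day, 11, 0), datetime(*day, 12, 0)]
hyperv_snaps = [datetime(*day, 10, 50), datetime(*day, 11, 50), datetime(*day, 12, 50)]

ontap, hyperv = plan_restore(ontap_snaps, hyperv_snaps, datetime(*day, 13, 0))
print(ontap.time(), hyperv.time())
```

Note that the 12:50 Hyper-V snapshot is skipped even though it predates the infection: it was taken after the noon ONTAP snapshot, so restoring the noon image can't bring it back.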

And not everyone prepares for all scenarios or has multiple sites.

If you have two ADS servers in the same location, you could take a snapshot of both at the same time, and then even if just one gets infected, you can test what's better for you:

- Restore both at once, or

- Restore only the infected one

Maybe the answer would depend on what else happened (for example, did other Windows machines get infected too, and can you even figure out whether the other ADS is infected? Maybe you can't, so you'd just restore both, as well as all other Windows VMs, especially if all have snapshots from the same time).

By the way, in the previous comment I mentioned how all password changes since the snapshot would be rolled back if all ADS instances are restored; this means that not only SMB users, but also many other accounts could be stuck - e.g. IIS Web server's password to SQL Server, etc.
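If you want a rough idea of the blast radius before you restore, a sketch like the following could help. It assumes you have already collected a "password last changed" timestamp per account by some means (for example from the AD `pwdLastSet` attribute); the account names, times, and function name are invented for the example.

```python
from datetime import datetime
from typing import Dict, List


def rolled_back_accounts(last_changed: Dict[str, datetime],
                         snapshot_time: datetime) -> List[str]:
    """Accounts whose password was changed after the snapshot was taken.

    After restoring all ADS instances to that snapshot, these accounts
    revert to their old passwords.
    """
    return sorted(name for name, changed in last_changed.items()
                  if changed > snapshot_time)


# Hypothetical data: account -> time of last password change.
last_changed = {
    "alice": datetime(2021, 12, 13, 9, 15),
    "svc_iis_sql": datetime(2021, 12, 10, 22, 0),
    "bob": datetime(2021, 11, 2, 8, 0),
}

# Snapshot assumed to be one week old.
snapshot_time = datetime(2021, 12, 7, 0, 10)
print(rolled_back_accounts(last_changed, snapshot_time))
```

Here both a user and a service account would be affected, which matches the IIS-to-SQL-Server example above: the web server would keep using the new password while the restored directory only knows the old one.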

Only you know how often you change account or service passwords or other settings that would get rolled back (DNS entries, for example), so there's no recipe that is foolproof and can apply to all.

Maybe if you take a snapshot of everything you have and restore all VM snapshots at once, that's one of the rare scenarios in which you'd only need to consider how to let ADS sync with the current time, and then power on the rest of the VMs and let them catch up.

Thankfully, it's not too hard to test.

Create three VMs (two ADS and one generic Windows), put them on an isolated LAN, create a new AD Forest (contoso.com or whatever), add some users and DNS entries, and take a snapshot of all 3 so that you have a "last known good snapshot". Then try different scenarios. For example, take another snapshot of ADS2, delete some entries, shut it down, wait 15 minutes, make some changes in ADS1 (create a new user, change a password, create a new alias in the DNS service), then restore ADS2 from that last snapshot and check how it catches up with regard to AD entries, system time, etc.

Check the AD and OS logs on ADS2, make sure you're happy with the result and that the Windows client can authenticate and otherwise get correct results from ADS2 (at this point you could shut down ADS1 so that only ADS2 is being used for these tests).

Then restore all 3 VMs to the baseline snapshot and try another test: power them on, take another snapshot, delete AD data and/or other system files from both ADS VMs, and try to restore both from that most recent snapshot.

I'd still have a PowerShell script that dumps ADS data to an ONTAP SMB share, so that you can quickly set up a completely new ADS VM anywhere, simply by restoring this backup.