I'm pretty new to NetApp and I'm creating a new Snapshot policy on my primary cluster (AFF) and SnapMirror/SnapVault relationships on my DR cluster (FAS). Right now we don't have an official RPO/RTO but I need to get something going. Assuming that we have plenty of space available, does this seem good or maybe a bit excessive?
To be honest, there is no perfect RPO/RTO; it differs from company to company. You have come up with a very sound protection strategy in terms of snapshot retention, and I would take it as is if I had plenty of space.
My only suggestion would be around the hourly snapshots, which seem excessive to me. Then again, I have no idea about your business: is it a 24/7 shop or a 9-5 shop, and what's the rate of change of data for the different volumes? So there could be a number of questions here. But in general (ours is a 9-5 shop), we keep 10 hourly snapshots, taken every hour from 8am to 6pm on production. The rest (daily/weekly/monthly) looks fine.
If I need to recover a file that's past the prod retention, is it simple to recover it from DR? If so then I think I'll lower my prod retention.
Yes, it's simple. For SAN, you can FlexClone the volume and mount it; for NAS, simply mount the volume, or use the previous-versions feature over CIFS to recover the file.
I think it should be looked at the other way around. In an identical two-copy-only system, the local side should hold as many recent snapshots as you can afford, and the remote side should have replication as frequent as possible plus any other snapshots required for retention. As an example:
Snap      Local (mainly RTO)   Remote (RPO and extra RTO)
Hourly    168 (7 days)         48 (2 days)
Daily     60 (2 months)        120 (4 months)
Monthly   24 (2 years)         84 (7 years)
Total     252                  252
Now for the "why?": remember that you only need the remote copy in case of physical or RAID/aggregate-level logical damage to the local storage, which is likely to be noticed immediately (and in any case the mirroring and the snapshot aging will likely stop working).
As for the local copy, it is the fastest (RTO) way to restore after application-level or volume-level logical damage, so you want as many snapshots there as you can keep, to find the one closest to the corruption. Something like 168 hourly, 60 daily, and 24 monthly.
So, going back to the remote: here you want the shortest possible period since the last replication (RPO). Once the RPO requirement is covered (i.e. by the replication itself), I'd put there all the remaining snapshots (for maximum RTO coverage) that I didn't have space for on the local system. For example, if I'm required to keep 7 years of backups (which many industries now default to, but which are very unlikely to ever be used), that's 84 monthly snapshots (preferably one snapshot per month rather than per year, as deleting a yearly snapshot can be hard on the system). Add some 48 hourly ones (just for the very unlikely case where low-level logical damage did end up causing file-level or volume-level logical damage as well), then fill the rest of the roughly 252 snapshots (always leaving some spare for manual and SnapMirror base snapshots) with 120 daily ones.
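To sanity-check a plan like this before configuring anything, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the counts are the ones from the example above, the 252 budget is the rough figure used in this post (not a value read from the cluster), and the names `PLAN`, `SNAPSHOT_BUDGET`, `total_snapshots`, `fits_budget` and `coverage` are made up for this sketch. The month length of 30 days is an approximation.

```python
# Retention counts copied from the example in this post.
PLAN = {
    "local":  {"hourly": 168, "daily": 60,  "monthly": 24},
    "remote": {"hourly": 48,  "daily": 120, "monthly": 84},
}

# Assumed per-volume budget: the rough 252 figure from the post, chosen to
# leave some spare for manual and SnapMirror base snapshots.
SNAPSHOT_BUDGET = 252


def total_snapshots(retention):
    """Number of snapshots kept once every schedule is fully populated."""
    return sum(retention.values())


def fits_budget(retention, budget=SNAPSHOT_BUDGET):
    """True if the plan stays within the assumed per-volume budget."""
    return total_snapshots(retention) <= budget


def coverage(retention):
    """Approximate time span covered by each schedule."""
    return {
        "hourly_days": retention["hourly"] / 24,     # 24 hourly snapshots per day
        "daily_months": retention["daily"] / 30,     # assumes 30-day months
        "monthly_years": retention["monthly"] / 12,  # 12 monthly snapshots per year
    }


for site, retention in PLAN.items():
    print(site, total_snapshots(retention), fits_budget(retention), coverage(retention))
```

Changing a count in `PLAN` and rerunning shows right away whether the plan still fits, and how far back each schedule reaches on each side.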