NetApp Snapshot Snowball Effect

Lumina · ‎2019-04-03

Good Morning Everyone,

I have a little predicament and I'm wondering if anyone has some insight or recommendations on a workable solution for us. We just finished migrating off of bare metal servers to VMware vSphere 6.5. Everything has been working great and life has been better overall.

What we have is a training/sandbox environment with around 100 VMs of various sizes. Some of the VMs are used for management software/workstations and the rest are our "production" VMs that the people getting trained use. These production VMs are restored twice a year to their original state to "reset" the data. As new versions of the training software are released, we retire old production VMs and replace them with new ones.

So we decided that the SnapCenter plug-in for VMware and the NetApp snapshot technology would be perfect for our "backup" needs. I set up two different resource groups for our environment. One that takes monthly, weekly, daily backups and another resource group that takes as-needed or "gold" backups. These are the ones that we regularly restore from for our training sessions and need to be available until the VM is retired.

Everything has worked perfectly but while doing a little checkup on the volume sizes, I've noticed that the snapshots are kind of growing out of control and want to be proactive in preventing data loss. What I'm trying to understand is how snapshot space is reclaimed or in other words, how to shrink the cumulative snapshot size after VMs are retired (deleted).

In my mind, the snapshot size would shrink after the VM is deleted AND the retention policy for the VM's snapshot resource group is expired. My concern is that the cumulative snapshot size has never gone down, only up. I understand that people are always using the systems and adding small amounts of data, so maybe there's just more data being added than VMs being retired?

We have more capacity, so I can always add another datastore or increase the volume size, but I don't want to shoot myself in the foot and have this be a problem indefinitely.

I'm also wondering if some really old "gold" snapshots of the volume are keeping a picture of the entire volume as it was when the snapshot was taken, meaning that deleting data or VMs after that point frees no space until we get rid of that old snapshot.

I know there's a lot of information here and I may not have explained everything clearly, so if that is the case, please ask away! I would appreciate any advice you have!

Thank you!

colsen · ‎2019-04-03

Hello,

Welcome to the SnapCenter world - we've been using NetApp snapshots with our VMware environment for a while (SnapCenter for the last ~8 months).

Anyway, I think one of the fundamental issues you're encountering is that NetApp snapshots are designed to capture both block changes and deletions. When a VM gets deleted, the entirety of that system stays captured in the snapshot - your datastore will decrease in size but the snapshot will grow by a corresponding amount. The only way those snapshots will "shrink" is when they eventually roll off due to your retention policy. For example, if you delete a VM on day 1 and have a 30-day snapshot retention policy, that deleted VM will be contained within the corresponding snapshot until such time as that snapshot is deleted. You might roll that snapshot into a daily or even weekly, but in the end, the blocks associated with that deleted system will get protected just in case you ever wanted to recover it.

So, in the end, you're not going to "reclaim" snapshot space until such time as you roll the snapshots off due to whatever retention policy you're enforcing via the SnapCenter schedule. With our environment, that snapshot space usually represents somewhere between 45-75% of the actual data on the datastore itself - we keep a day's worth of hourly snaps, 7 daily snaps and ~4 weekly ones. We've had situations where there is a high rate of deletion or change in a datastore where that snapshot space is the same or even larger than the base datastore itself.
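To put rough numbers on the "roll off" timing, here is a minimal Python sketch. It assumes each snapshot carries its own fixed retention period and that the deleted VM already existed when every earlier snapshot was taken. The function name, dates and tiers are made up for illustration and are not anything SnapCenter exposes.

```python
from datetime import datetime, timedelta

def release_time(snapshots, deleted_at):
    """Return when blocks deleted at `deleted_at` stop being held by any snapshot.

    `snapshots` is a list of (taken_at, retention) pairs (hypothetical model).
    Only snapshots taken before the deletion still reference the deleted blocks.
    """
    expiries = [taken + keep for taken, keep in snapshots if taken < deleted_at]
    # With no earlier snapshot, the space is free as soon as the data is deleted.
    return max([deleted_at, *expiries])

deleted = datetime(2019, 4, 1, 12, 0)  # hypothetical deletion time
snaps = [
    (datetime(2019, 3, 31, 0, 0), timedelta(weeks=4)),   # weekly
    (datetime(2019, 4, 1, 0, 0), timedelta(days=7)),     # daily
    (datetime(2019, 4, 1, 11, 0), timedelta(hours=24)),  # hourly
    (datetime(2019, 4, 1, 13, 0), timedelta(hours=24)),  # taken after the delete
]
print(release_time(snaps, deleted))  # 2019-04-28 00:00:00
```

The longest-lived snapshot taken before the delete decides when the space comes back, which is why a single "keep forever" snapshot holds everything that existed at that moment.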

Hope that helps. There are a lot of good best practices/etc documents out there on NetApp, snapshots and VMware. This might be a good place to start: https://blog.netapp.com/updated-technical-report-for-vsphere-with-ontap

Good luck!

Chris

Lumina · ‎2019-04-04

Chris,

Great reply. Thank you! It's good to hear that someone is in the same boat. I understand what you are saying about cleanup being based on the retention policy, but I still have a little confusion as to what the SnapCenter snapshot of a VM is actually capturing.

From the OnCommand System Manager level, if I take a snapshot, it captures the whole volume (delta) at that point in time. If I take a snapshot of a VM through SnapCenter, is it doing the same thing but just putting some hooks on the chunk of data relevant to that particular VM?

The reason I ask is that I'm trying to understand how all of this applies to our scenario. Say if I have 50 VMs at a point in time and I take a "gold" snapshot of one of the VMs with an unlimited retention policy. A few months pass with regular daily snapshots being taken. Now I delete the other 49 VMs so we only have the one left with the "gold" snapshot. Does that indefinite "gold" snapshot hold the entire volume size of all 50 VMs, or does the snapshot only capture and retain that VM's data?

I hope that example is clear enough! 🙂

colsen · ‎2019-04-04

Hello,

In the end, a SnapCenter snapshot is a volume snapshot, no different than what you'd take via OCSM. However, SnapCenter maintains some metadata/etc about the snapshot in its own MySQL database so that it can manage retention and everything else that SnapCenter does (i.e. you wouldn't want SnapCenter to just assume any snapshot older than XX days was no longer needed and remove it - this way it only manages its own snaps). Then the SnapCenter UI/plugin has the smarts to parse the .snapshot directory to display the dates of the snapshots and then when you do a restore, it does it in the context of the Guest OS inside VMware that you're wanting to restore (essentially restoring the individual files associated with the machine into the active file system). It's a little more complex than that and I'm sure a SnapCenter guru can elaborate/correct.

To your example of a volume with 50 VMs - when you take a snapshot through SnapCenter, you're grabbing a snapshot of the whole volume with all 50 VMs. If you delete 49 of the VMs, you'll have a snapshot that has dutifully captured the state of those deleted blocks along with tracking any changed blocks on that remaining VM (i.e. that snapshot is going to be pretty big). You'll see the space reclaimed on the original volume (data space) for those deleted machines, but that snapshot is going to balloon to the size of the data that was deleted from the active file system.
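The 50-VM scenario can be sketched with a toy block-pointer model in Python. This is only an illustration under simplifying assumptions: every VM owns a fixed set of blocks, nothing is overwritten, and dedupe and compression are ignored. The `Volume` class and its method names are hypothetical and are not an ONTAP or SnapCenter API.

```python
class Volume:
    """Toy model: a snapshot freezes the block pointers of the whole volume."""

    def __init__(self):
        self.active = {}     # VM name -> set of block IDs in the active file system
        self.snapshots = {}  # snapshot name -> frozenset of block IDs at that time
        self._next_block = 0

    def create_vm(self, name, blocks):
        ids = set(range(self._next_block, self._next_block + blocks))
        self._next_block += blocks
        self.active[name] = ids

    def delete_vm(self, name):
        del self.active[name]

    def take_snapshot(self, name):
        # Volume-wide: every VM's blocks are captured, not just one VM's.
        self.snapshots[name] = frozenset().union(*self.active.values())

    def delete_snapshot(self, name):
        del self.snapshots[name]

    def active_blocks(self):
        return set().union(*self.active.values())

    def snapshot_only_blocks(self):
        # Blocks no longer in the active file system but still held by a snapshot.
        held = set().union(*self.snapshots.values())
        return held - self.active_blocks()

vol = Volume()
for i in range(50):
    vol.create_vm(f"vm{i:02d}", blocks=10)
vol.take_snapshot("gold")
for i in range(1, 50):
    vol.delete_vm(f"vm{i:02d}")  # retire 49 of the 50 VMs

print(len(vol.active_blocks()), len(vol.snapshot_only_blocks()))  # 10 490
vol.delete_snapshot("gold")
print(len(vol.snapshot_only_blocks()))  # 0
```

The active file system shrinks to one VM, but the blocks of the other 49 stay allocated until the "gold" snapshot itself is removed.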

We have gone down the path of organizing our VMware datastores by SLA - including snapshot retention policy, SnapMirror frequency, SRM priority, etc. If you've got a category of "gold" VMs that you'd like to have backups for spanning a longer period of time, go ahead and organize those together and apply a custom retention/protection policy to them. If you have what I'm going to call "developer scratch space" where systems are spun up and then deleted often, drop those into their own datastore with a much shorter retention/protection policy. Not sure where you're at with AFF and ONTAP, but with aggregate level dedupe, multiple small volumes won't hurt you like they used to with VMware.
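As a rough illustration of why splitting datastores by retention helps, the Python sketch below compares how much retired-VM data a long-lived snapshot would keep holding in a mixed layout versus a split one. The datastore names, VM names and the 40 GiB size are invented, and the model assumes the long-retention snapshot was taken while every VM still existed.

```python
def pinned_gib(layout, long_retention, retired):
    """GiB of retired VMs still held by long-retention volume snapshots.

    layout: datastore name -> {VM name: size in GiB}
    long_retention: datastores carrying a long-lived snapshot (assumed taken
        while every VM in `layout` still existed)
    retired: names of VMs deleted afterwards
    """
    return sum(
        size
        for ds, vms in layout.items() if ds in long_retention
        for vm, size in vms.items() if vm in retired
    )

scratch = {f"train-{i:02d}": 40 for i in range(49)}  # hypothetical 40 GiB VMs
mixed = {"ds_all": {"gold-master": 40, **scratch}}
split = {"ds_gold": {"gold-master": 40}, "ds_scratch": scratch}

print(pinned_gib(mixed, {"ds_all"}, set(scratch)))   # 1960
print(pinned_gib(split, {"ds_gold"}, set(scratch)))  # 0
```

In the split layout the scratch datastore still holds deleted blocks in its own snapshots, but only for as long as its short retention policy lasts.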

Hope that helps!

Chris

bobshouseofcards · ‎2019-05-31

Ah the joys of snapshots. Understanding that a snapshot captures all deltas since it was taken on a volume is key to working your way out of this mess.

The trick here is to realize that while snapshots and SnapCenter can be used for all kinds of fun backup and restore - you want to organize your data so that you don't create pockets of long retention coupled with short retention, relatively speaking. That is - a VM that is going to be long retention should not be in the same datastores where your regular, actively created and destroyed VMs live. Then the snapshot size accumulation issue goes away...

Create your gold VM as the only VM in a datastore. For all your working copy, always restore from the gold into a different datastore so as not to build up the change level in the gold datastore.

But you say - that defeats the purpose of rapid restore from snapshot! And you've exposed a secret - snapshot backup designs are not best for gold master restores. Clones are where that comes into play more efficiently. Clone from your gold copy (using any of a number of NetApp utilities) and you can get the same rapid deploy, just not based in backup. Use backup to keep historical versions of your clone master if you like.

I've run into this exact design issue numerous times - there are just better mechanisms/tools than "backup" to accomplish the purpose in this respect, IMO.

Bob Greenwald

scottharney · ‎2019-04-04

The reason I ask is that I'm trying to understand how all of this applies to our scenario. Say if I have 50 VMs at a point in time and I take a "gold" snapshot of one of the VMs with an unlimited retention policy. A few months pass with regular daily snapshots being taken. Now I delete the other 49 VMs so we only have the one left with the "gold" snapshot. Does that indefinite "gold" snapshot hold the entire volume size of all 50 VMs, or does the snapshot only capture and retain that VM's data?

The volumes contain both current active data and any snapshot data that differs from the current active data. SnapCenter uses the NetApp snapshot engine to take snaps of volumes, aka datastores. So while you are thinking in terms of individual VM data, it's really all of the VMs on a datastore's underlying volume at that point in time. If you have a volume with 50 VMs and you take a snapshot at a point in time and 49 other snapshots subsequently, those snaps are deltas of the volume data in time. If you delete all the VMs living in the volume, save 1, the overall volume consumption will not change until those snapshots of the volume roll off of retention.

If you're looking for granularity at the individual vdisk level, that's not really what's happening. What you'd need to get there is vVols and, tbh, I haven't looked to see how well, or even if, SnapCenter supports vVols on ONTAP myself.

From the OnCommand System Manager level, if I take a snapshot, it captures the whole volume (delta) at that point in time. If I take a snapshot of a VM through SnapCenter, is it doing the same thing but just putting some hooks on the chunk of data relevant to that particular VM?

That's the crux of it. With traditional VMware datastores on volumes, the underlying snapshot you are taking is a point in time of the metadata pointers of the volume. There's no awareness of which pointers reference a particular VM.

For your use case, you would be better off keeping your gold image VMs in a separate datastore volume. Option B would be to FlexClone the volume that contains your gold copy VM into its own volume. Leave the rest of your datastore volumes with shorter-term retention. Volume-level snapshots kept forever will simply grow over time.