Solved: Re: 254 snapshot limit on volume

IGORSTOJNOV · 2016-01-11

Hiya!

One of the volumes on our 7-Mode 8.1.4 systems has hit the 254-snapshot limit. It holds the LUN behind one of our vSphere datastores (APP01), which hosts virtual disks for a lot of virtual machines. These machines also have disks on other datastores, so whenever VSC creates a snapshot for any of those other datastores, it triggers a snapshot of the APP01 datastore as well. With the snapshot retention we have in place, the volume reached its snapshot limit a few days ago.

node1> snap autodelete app01
snapshot autodelete settings for app01:
state                           : on
commitment                      : try
trigger                         : volume
target_free_space               : 20%
delete_order                    : oldest_first
defer_delete                    : user_created
prefix                          : (not specified)
destroy_list                    : none

As you can see above, snap autodelete is enabled, but as far as I can see there's no trigger for snapshot count, only for volume, snap_reserve or space_reserve utilization. And none of those applies here...

Is there a way to automate deletion of oldest snapshots once snapshot count reaches maximum value?

Regards,

Igor

mbeattie · 2016-01-11

Hi,

You could certainly automate that (either with a script or a WFA workflow): enumerate the snapshots (and the time each was created) for the specified volume and, if the snapshot count exceeds a threshold, delete the oldest snapshot. I suppose the logic also depends on your snapshot schedule, how often manual snapshots are taken (if any), whether baseline snapshots exist (to be excluded from deletion) and any snapshot naming standards you adhere to.

The PowerShell cmdlets are:

Get-NaSnapshot
Remove-NaSnapshot

Let me know if this is what you are trying to achieve, shouldn't be difficult to write the code

/matt

If this post resolved your issue, help others by selecting ACCEPT AS SOLUTION or adding a KUDO.

JGPSHNTAP · 2016-01-11

show me

vol options volname

and

df -Vg volname

and

snap reserve volname

I think your trigger is wrong for this situation. You have it set to volume.

IGORSTOJNOV · 2016-01-11

Hi Matt,

I suppose a script can achieve this, but I'm not very good at scripting. I was expecting an out-of-the-box option for this type of automatic snapshot management.

JGPSHNTAP,

The output is below.

node1> vol options app01
nosnap=off, nosnapdir=off, minra=off, no_atime_update=off, nvfail=off,
ignore_inconsistent=off, snapmirrored=off, create_ucode=on,
convert_ucode=on, maxdirsize=73400, schedsnapname=ordinal,
fs_size_fixed=off, guarantee=none, svo_enable=off, svo_checksum=off,
svo_allow_rman=off, svo_reject_errors=off, no_i2p=off,
fractional_reserve=0, extent=off, try_first=snap_delete,
read_realloc=off, snapshot_clone_dependency=off, dlog_hole_reserve=off,
nbu_archival_snap=off

node1> df -Vg app01
Filesystem               total       used      avail capacity Mounted on
/vol/app01/             1024GB      447GB      576GB      44% /vol/app01/
/vol/app01/.snapshot        0GB      204GB        0GB     ---% /vol/app01/.snapshot

node1> snap reserve app01
Volume app01: current snapshot reserve is 0% or 0 k-bytes.

node1> vol status app01
         Volume State           Status            Options
          app01 online          raid_dp, flex     create_ucode=on, convert_ucode=on,
                              mirrored            guarantee=none, fractional_reserve=0,
                              sis                 try_first=snap_delete
                                    64-bit

As you see - volume utilization is not high, snap reservation is OFF and space reservation is probably not a good candidate for a trigger either. It's just the snapshot count that's the issue.

Regards,

Igor

JGPSHNTAP · 2016-01-11

That's your issue... Your thresholds are set up wrong...

You have no snap reserve, and your util is 44%, so the system doesn't know to delete anything.

You need to enable snap reserve and change your thresholds...

IGORSTOJNOV · 2016-01-11

Hello JGPSHNTAP,

The volume has been sized with projected data growth in mind (data migration in progress). Reducing the volume size won't solve the snapshot count problem; it will only limit the amount of data that can reside there.

As for snap reserve, everything is thin provisioned and the snapshot reserve is 0%, as recommended for SAN, to avoid potentially unnecessary overhead. In any case, our snapshot retention is directly tied to the RPOs set by our applications department. Setting a limited snapshot reserve means allowing the system to delete snapshots even if there's free space on the volume, which would compromise the RPO unnecessarily.

I need the trigger to be the final limiting factor - snapshot count. Only then would it be justified to let the system delete the oldest snapshot(s).

JGPSHNTAP · 2016-01-11

Sorry, the system doesn't appear to be designed that way

See the link

https://library.netapp.com/ecmdocs/ECMP1368826/html/GUID-6653B102-E228-4D1E-82F1-AFF58FE144C5.html

Seems you need to customize the solution. It's very simple in PowerShell.

mbeattie · 2016-01-11

Hi,

Here is an "example" script for you. The code that deletes the snapshots is commented out, so the script simply states what it "would" do until you remove the multi-line comment markers "<# #>" around the Do loop.

As an example I have used hourly snapshots, e.g.:

TESTNS01> vfiler run testnv01 snap list volume_01

===== testnv01
Volume volume_01
working...

%/used       %/total date          name
---------- ---------- ------------ --------
29% (29%)    0% ( 0%) Jan 12 20:00 hourly.0
47% (32%)    0% ( 0%) Jan 12 16:00 hourly.1
57% (32%)    0% ( 0%) Jan 12 12:00 hourly.2
64% (32%)    0% ( 0%) Jan 12 08:00 hourly.3
70% (32%)    0% ( 0%) Jan 12 00:00 nightly.0
74% (36%)    0% ( 0%) Jan 11 20:00 hourly.4
77% (32%)    0% ( 0%) Jan 11 16:00 hourly.5
79% (32%)    0% ( 0%) Jan 11 00:00 nightly.1

If you wanted to automate the deletion of hourly snapshots beyond a threshold (I've used 4, to retain hourly.0-3), then the following script will do that. I'd imagine your configuration is somewhat similar, with a different snapshot naming standard for your 254 snapshots (you might want to set the snapshot threshold variable to a lower value, e.g. 250, to ensure manual snapshots can still be taken if required).

#'------------------------------------------------------------------------------
Import-Module DataONTAP
[String]$controllerName = "testns01"
[String]$vfilerName     = "testnv01"
[String]$snapshotPrefix = "hourly"
[String]$volumeName     = "volume_01"
[Int]$snapshotThreshold = 4
$credentials            = Get-Credential -Credential root
#'------------------------------------------------------------------------------
#'Connect to the controller and vfiler.
#'------------------------------------------------------------------------------
Try{
   Connect-NaController -Name $controllerName -Vfiler $vfilerName -HTTPS -Credential $credentials -ErrorAction Stop | Out-Null
   Write-Host "Connected to controller ""$controllerName"" vfiler ""$vfilerName"""
}Catch{
   Throw $("Failed connecting to controller ""$controllerName"" vfiler ""$vfilerName"". Error " + $_.Exception.Message)
}
#'------------------------------------------------------------------------------
#'Enumerate the snapshots for the vfilers volume matching the snapshot prefix.
#'Sort newest first so that indexes 0..(threshold - 1) are the snapshots to keep.
#'The @() wrapper keeps .Count valid when only one snapshot (or none) matches.
#'------------------------------------------------------------------------------
Try{
   $snapshots = @(Get-NaSnapshot $volumeName "$snapshotPrefix*" -Terse -ErrorAction Stop | Select-Object -Property Name, Created | Sort-Object -Property Created -Descending)
   Write-Host "Enumerated snapshots for volume ""$volumeName"" matching ""$snapshotPrefix`*"""
}Catch{
   Throw $("Failed enumerating snapshots for volume ""$volumeName"" matching snapshot prefix ""$snapshotPrefix"". Error " + $_.Exception.Message)
}
#'------------------------------------------------------------------------------
#'Exit if the snapshot count is less than or equal to the threshold.
#'Return is used because Break outside a loop can also stop a calling script.
#'------------------------------------------------------------------------------
If($snapshots.Count -le $snapshotThreshold){
   Write-Host $("There are " + $snapshots.Count + " snapshots on volume ""$volumeName"" on vfiler ""$vfilerName"". Exiting")
   Return
}
#'------------------------------------------------------------------------------
#'Delete the snapshots beyond the threshold (the oldest ones).
#'------------------------------------------------------------------------------
[Int]$errorCount = 0
For($i = $snapshotThreshold; $i -le ($snapshots.Count -1); $i++){
   [String]$snapshotName = $snapshots[$i].Name
   [String]$creationDate = $snapshots[$i].Created
   Write-Host "Deleting snapshot ""$snapshotName"" created on ""$creationDate"" for volume ""$volumeName"" on vfiler ""$vfilerName"""
   <#
   Do{
      Try{
         Remove-NaSnapshot $volumeName $snapshotName -ErrorAction Stop
         Write-Host "Deleted snapshot ""$snapshotName"" created on ""$creationDate"" for volume ""$volumeName"" on vfiler ""$vfilerName"""
      }Catch{
         Write-Warning -Message $("Failed deleting snapshot ""$snapshotName"" created on ""$creationDate"" for volume ""$volumeName"" on vfiler ""$vfilerName"". Error " + $_.Exception.Message)
         $errorCount++
      }
   }Until($True)
   #>
}
If($errorCount -ne 0){
   Throw "Failed deleting $errorCount snapshot(s)"
}
#'------------------------------------------------------------------------------

The output will look like:

Connected to controller "testns01" vfiler "testnv01"
Enumerated snapshots for volume "volume_01" matching "hourly*"
Deleting snapshot "hourly.4" created on "01/11/2016 20:00:04" for volume "volume_01" on vfiler "testnv01"
Deleting snapshot "hourly.5" created on "01/11/2016 16:00:34" for volume "volume_01" on vfiler "testnv01"

Have a look at the autodelete options for the volume, e.g.:

TESTNS01> vfiler run testnv01 snap autodelete volume_01

===== testnv01
snapshot autodelete settings for volume_01:
state                           : off
commitment                      : try
trigger                         : volume
target_free_space               : 20%
delete_order                    : oldest_first
defer_delete                    : user_created
prefix                          : (not specified)
destroy_list                    : none

You can use the "defer_delete" and "prefix" options to specify which snapshots to delete last (not first).

/matt


JGPSHNTAP · 2016-01-12

^^

Not bad, but overly complicated in my mind.

You can run a quick check on count, and keep the count less than 250 and keep with do while or do until
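
A minimal sketch of that count check, assuming PowerShell 3.0 or later. `Get-SnapshotsOverLimit` is a hypothetical helper name and 250 is the ceiling suggested above. It works on any objects with `Name` and `Created` properties, such as those selected from `Get-NaSnapshot` earlier in the thread. The commented lines at the end show how it might be wired to the toolkit; that part is an assumption and has not been run against a live controller.

```powershell
# Hypothetical helper: emits the oldest snapshots until no more than
# $MaxCount would remain. It only selects candidates; it deletes nothing.
function Get-SnapshotsOverLimit {
    param(
        [Object[]]$Snapshots = @(),
        [Int]$MaxCount = 250
    )
    # Sort oldest first so that index 0 is always the next candidate.
    $remaining = New-Object System.Collections.ArrayList
    ForEach($snapshot in ($Snapshots | Sort-Object -Property Created)){
        [void]$remaining.Add($snapshot)
    }
    While($remaining.Count -gt $MaxCount){
        $remaining[0]            # emit the oldest snapshot
        $remaining.RemoveAt(0)   # and drop it from the working list
    }
}

# Made-up sample data: five snapshots, one per day, "snap.0" being the newest.
$now    = Get-Date
$sample = 0..4 | ForEach-Object {
    [PSCustomObject]@{ Name = "snap.$_"; Created = $now.AddDays(-$_) }
}
$toDelete = @(Get-SnapshotsOverLimit -Snapshots $sample -MaxCount 3)
$toDelete | ForEach-Object { Write-Host "Would delete $($_.Name)" }

# Possible wiring to the toolkit (assumption, untested):
#   $all = @(Get-NaSnapshot $volumeName -Terse)
#   Get-SnapshotsOverLimit -Snapshots $all -MaxCount 250 |
#      ForEach-Object { Remove-NaSnapshot $volumeName $_.Name }
```

With the sample data and a limit of 3, the two oldest entries (snap.4, then snap.3) are reported. Keeping the selection separate from the deletion makes it easy to dry-run the logic before pointing it at a real volume.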

niels · 2016-01-20

The first question that I have for you is "how are those snapshots created?"

Meaning, is it the ONTAP internal scheduler? Then the scheduler itself has an option for how many snapshots of each type (hourly, nightly, weekly) to keep. The scheduler just needs to be configured correctly, e.g.:

toaster> snap sched <volume_name>

Volume <volume_name> 5 20 24@8,12,16,20

This would keep 5 weekly snapshots, 20 nightly snapshots and 24 hourly snapshots which would be created at 8am, 12pm, 4pm and 8pm for a total of 49 snapshots. You would never come close to 255 snapshots.
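
The total quoted above is simply the sum of the three counts; the list after `@` only sets the hours at which hourly snapshots are taken. A small sketch of that arithmetic, where `Get-SnapSchedRetention` is a hypothetical helper and not an ONTAP or toolkit command:

```powershell
# Hypothetical helper: adds up the weekly, nightly and hourly counts of a
# "snap sched" style schedule string such as "5 20 24@8,12,16,20".
function Get-SnapSchedRetention {
    param([Parameter(Mandatory = $true)][String]$Schedule)
    $total = 0
    ForEach($field in ($Schedule.Trim() -split '\s+')){
        # Drop the optional "@hour,hour,..." list; only the count matters here.
        $total += [Int](($field -split '@')[0])
    }
    $total
}

$retained = Get-SnapSchedRetention -Schedule "5 20 24@8,12,16,20"
Write-Host "Scheduled snapshots retained: $retained"   # 5 + 20 + 24 = 49
```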

Are the snapshots triggered externally? If so - how? A (homegrown) script or application integrated tools provided by NetApp or a 3rd party?

As you talk about the LUNs being ESX datastores, do you use the NetApp Virtual Storage Console (VSC) to create those snapshots? If yes, the VSC has its own mechanism for snapshot retention and will also take care of rolling snapshots by itself.

If those snapshots are created by a (homegrown) script, then of course the script needs to take care of rolling the snapshots. For that you already got good answers in this thread.

But I think the answers provided to you here solve a problem that you should not have in the first place.

Regards, Niels

niels · 2016-01-20

I just read your post again and you are indeed using the VSC.

In that case I would double-check your VSC backup jobs, as it looks like they are poorly configured.

The VSC has its own notion of snapshot retention and should roll those on its own.

I don't know which VSC version you run, but it should work similar to mine (VSC6.1). See screenshot from a backup job:

regards, Niels

IGORSTOJNOV · 2016-01-20

Hello Niels,

Thanks for chiming in. Yes, I'm using VSC and I have set the retention (in days) in accordance with our RPO. I could reduce this but, as I wrote before, most of the time when VSC creates a snapshot for my (OS) datastores it triggers a snapshot for the APP01 datastore as well. Therefore I would have to reduce the OS datastores' retention settings too, which would in turn affect their RPOs. Naturally, I'd rather avoid that. Having the snapshot count monitored and kept just below the limit seems the least costly way to go...

Regards,

Igor
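
Why APP01 can fill up faster than its own retention suggests can be sketched with some arithmetic. The model below assumes that every VSC job touching a VM with a disk on APP01 leaves one snapshot on the app01 volume per run and keeps it for that job's retention period. The job names and all numbers are invented for illustration and are not taken from this environment; `Get-ExpectedSnapshotCount` is a hypothetical helper.

```powershell
# Hypothetical VSC backup jobs that each snapshot the app01 volume.
# All names and numbers are invented for illustration.
$jobs = @(
    [PSCustomObject]@{ Name = 'OS01';  RunsPerDay = 1; RetentionDays = 60 },
    [PSCustomObject]@{ Name = 'OS02';  RunsPerDay = 1; RetentionDays = 60 },
    [PSCustomObject]@{ Name = 'OS03';  RunsPerDay = 1; RetentionDays = 60 },
    [PSCustomObject]@{ Name = 'APP01'; RunsPerDay = 2; RetentionDays = 45 }
)

# Steady-state snapshot count: runs per day times days retained, summed over jobs.
function Get-ExpectedSnapshotCount {
    param([Object[]]$BackupJobs = @())
    $total = 0
    ForEach($job in $BackupJobs){
        $total += $job.RunsPerDay * $job.RetentionDays
    }
    $total
}

$expected = Get-ExpectedSnapshotCount -BackupJobs $jobs
Write-Host "Expected snapshots on app01: $expected (limit 254)"
```

With these made-up figures the steady-state count is 270, which is over the 254 limit even though no single job retains anywhere near that many snapshots.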