Solved: Very Large snapshots after upgrade from 7-mode to Cluster Mode (C-Dot)

markey164 · ‎2016-12-22

Hi All,

We have just completed a "Copy Free" transition from 7-Mode to cDOT (ONTAP 9.0) using the 7-Mode Transition Tool (7MTT). We did the Copy Free transition by moving the shelves from a FAS3220 to a FAS8040.

The upgrade went well, but since completion our home drives volume (18TB/50 million files) has been seeing huge snapshot deltas, up to 10 times the size they were before, and they are rapidly filling up the volume. The filer generates these changes at certain points during the day: we can see a snapshot suddenly surge from a few hundred MB to over 200GB in just an hour. It's definitely a filer process, as it happens once or twice every day and no users have quotas of more than 10GB. In fact, most users aren't here due to the Xmas holidays.

There is no storage efficiency enabled on this volume, so we know this isn't the cause, and the volume belongs to an SVM, in case that is relevant.

We have a case open with support, via our support partner, but we've not had any response yet, and given this is our primary Home Drives volume, and these huge snapshots are happening every day, we are very quickly going to be in a sticky situation.

I'm just wondering if anyone has seen anything like this, or can give us some clues on where to look.

I ran a wafl scan status whilst the problem was happening, but there was nothing showing there.

The only other thing that was identified from our AutoSupports as a possible cause was something called an L2I scan, but neither we nor our support partner know what this is.

Many TIA

Mark


scottgelb · ‎2016-12-22

Did the workflow delete the aggregate snapshot created by CFT?

markey164 · ‎2016-12-22

@scottgelb wrote:

Did the workflow delete the aggregate snapshot created by CFT?

Hi Scott,

You mean the 7MTT workflow? The work was carried out by an engineer from our support partner on Saturday, so I'm not sure. How can I tell?

I'm also totally new to cDOT, which doesn't help, so I'm just trying to get my head around the commands to list the aggr snapshots. This can only be done from the CLI, is that right? If so, is this the command?

node run -node filername -command snap list -A

It returns the following for the aggr in question. Is this what you were looking to check?

Aggregate aggr1
working...

  %/used       %/total    date          name
----------  ----------  ------------  --------
 1% ( 1%)    1% ( 1%)   Dec 22 19:00  hourly.0
 2% ( 1%)    1% ( 1%)   Dec 22 14:00  hourly.1
 2% ( 0%)    1% ( 0%)   Dec 22 09:00  hourly.2
 2% ( 0%)    1% ( 0%)   Dec 22 00:00  nightly.0
 3% ( 1%)    2% ( 0%)   Dec 21 19:00  hourly.3

scottgelb · ‎2016-12-22

I don't see the CFT snapshot so the workflow was completed. I would turn off snapshots "node run nodename aggr options aggrname nosnap on" then delete the aggr snaps "node run nodename snap delete aggrname -A -a -f"

markey164 · ‎2016-12-23

Hi Scott,

I can do that, although I'd like to understand what you think might be happening that leads you to suggest this.

What would cause aggr snapshots to affect volume snapshots in this manner?

TIA

markey164 · ‎2017-05-17

Update:

In case anyone else comes across this thread: after a fair amount of back and forth, NetApp have acknowledged this looks like a bug in ONTAP.

Our testing demonstrated that files created and then deleted "in between" snapshots are not cleaned up properly, and the associated block changes still go into snapshots when ordinarily they shouldn't. They don't go into snapshots immediately either; they only appear in the snapshot some time later, when the filer performs housekeeping routines, which is why it was initially so difficult to track down what was going on. In the end we resorted to an empty volume and a couple of large files to test with.

In our case we originally spotted this issue on our home drives volume immediately after upgrading to Cluster Mode. There is a lot of activity on this volume, and applications such as Office are creating and deleting temporary files continuously, which is why our daily snapshots increased in size from 40-50GB to 300-400GB.

We don't have much detail yet, but it looks likely this would affect anyone running Cluster Mode, in that your snapshot usage might be much larger than it should be.

The Bug ID is here if you want to watch it:

http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1067749

Cheers

Mark

markey164 · ‎2018-07-04

Further update:

After an extended engagement with engineering, it was eventually confirmed that what we were seeing was essentially the result of "delayed deletes" being captured in snapshots. Blocks that relate to files that have been deleted, or created and deleted "between" snapshots, are not freed immediately, but are instead recorded in what Engineering refer to as the "B log". The B log is drained regularly via certain triggers, which releases those blocks and recovers the space as expected. However, if a snapshot runs before the B log has been purged ("drained" in NetApp speak), those blocks become trapped in the snapshot and remain trapped until that snapshot ages off.
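
To make the mechanism above concrete, here is a minimal toy model in Python. It is only a sketch of the behaviour as described in this post, not ONTAP's actual implementation: the `ToyVolume` class, its method names and the file sizes are all made up for illustration, and the "B log" is reduced to a single counter of deleted-but-not-yet-freed space.

```python
class ToyVolume:
    """Toy model of delayed block frees (hypothetical, not ONTAP internals)."""

    def __init__(self):
        self.live = {}               # file name -> size in GB
        self.pending_free_gb = 0     # stand-in for the "B log": deleted, not yet freed
        self.snapshot_trapped_gb = []  # per snapshot: GB of deleted data it pinned

    def write(self, name, size_gb):
        self.live[name] = size_gb

    def delete(self, name):
        # The blocks are not released here, only queued for a later drain.
        self.pending_free_gb += self.live.pop(name)

    def drain(self):
        # Releases the queued blocks so a later snapshot cannot capture them.
        self.pending_free_gb = 0

    def snapshot(self):
        # Anything still queued at snapshot time is pinned by the snapshot.
        trapped = self.pending_free_gb
        self.snapshot_trapped_gb.append(trapped)
        self.pending_free_gb = 0
        return trapped


def temp_file_churn(drain_before_snapshot):
    """Create and delete ten 5 GB temp files, then take one snapshot."""
    vol = ToyVolume()
    for i in range(10):
        vol.write(f"tmp{i}", 5)
        vol.delete(f"tmp{i}")
    if drain_before_snapshot:
        vol.drain()
    return vol.snapshot()


print(temp_file_churn(drain_before_snapshot=False))  # 50
print(temp_file_churn(drain_before_snapshot=True))   # 0
```

In this model the temp files never exist at snapshot time in either run, yet the snapshot still holds 50 GB when the drain has not happened first, which matches the pattern we saw with short-lived files on the home drives volume.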

In our case, we had lots of writes/overwrites on a 10TB volume and ran snapshots twice a day on it. After upgrading to Cluster Mode, we immediately saw our snapshots grow from ~40GB on 7-Mode to 300-400GB on Cluster Mode. This behaviour was apparently introduced in Cluster Mode, which is why we noticed it when moving from 7-Mode.

This does not appear to be documented, and as such it took some time for us to get to the root cause.

Engineering have commands that can help identify this issue if you believe you have a similar problem, and further commands are available that can be scripted to alleviate the issue. They hope to remedy or improve this behaviour in a future version of ONTAP.

Hopefully this helps anyone else having the same problem.

AlexDawson · ‎2018-07-04

You're the MVP @markey164 - my kudos to you for coming back to update this thread as you deal with this abnormal functionality.

Reading through the bug notes, I see that this can often be a case of the snapshot showing a much larger size than is actually being consumed, which is good, but at the same time, still not helpful.

markey164 · ‎2018-07-05

Hi @AlexDawson,

Those bug notes are out of date. They were from our first engagement with NetApp, when the root cause was not correctly identified. We had to be fairly persistent in convincing NetApp that our rate of change had not suddenly increased 8-fold on the same day we carried out the upgrade, which is what we were initially told must be the case, as there was no other possible explanation.

On our second engagement, we were escalated to engineering, who eventually figured out what was going on. The snapshot sizes were being reported correctly. The discrepancy between what *should* be in a snapshot and what was *actually* being captured in it was due to this (delayed deletes) behaviour of ONTAP Cluster Mode.

I looked and there is no publicly available documentation on this "B log" feature. Neither our support partner nor first line support were aware of it, I guess because it is baked into ONTAP at a very low level and is not visible to the end user via either the GUI or documented commands.

Our interim fix is to manually purge this "B log" (with commands supplied by engineering) at regular intervals and *before* the snapshot runs, which has helped recover several Terabytes of space.
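
For anyone scripting something similar, the timing side of it can be sketched as below. This is only an illustration: the snapshot times and the 30 minute lead are hypothetical values, and the purge commands themselves came from engineering and are deliberately not shown. The helper just works out when a purge job would need to start so that it runs ahead of each scheduled snapshot.

```python
from datetime import datetime, timedelta


def drain_times(snapshot_times, lead_minutes=30):
    """Return HH:MM start times that fall lead_minutes before each snapshot.

    snapshot_times: list of "HH:MM" strings (hypothetical schedule).
    Times wrap around midnight, so "00:00" with a 30 minute lead gives "23:30".
    """
    starts = []
    for hhmm in snapshot_times:
        hour, minute = (int(part) for part in hhmm.split(":"))
        # The date is arbitrary; only the time of day matters here.
        snap = datetime(2000, 1, 2, hour, minute)
        starts.append((snap - timedelta(minutes=lead_minutes)).strftime("%H:%M"))
    return starts


# Example with a made-up twice-daily snapshot schedule.
print(drain_times(["09:00", "19:00"]))       # ['08:30', '18:30']
print(drain_times(["00:00"], lead_minutes=45))  # ['23:15']
```

How much lead time is actually needed will depend on how long the purge takes on your volume, so treat the 30 minutes as a placeholder to tune.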

If you need to know more, feel free to DM me 🙂

AlexDawson · ‎2018-07-22

Thanks for the clarification!