
What caused this snapshot growth?

I narrowly averted an outage on one of our most important servers today - a Windows Oracle server. It has several iSCSI LUNs, each in its own volume. DFM started alerting on >90% utilisation on the volume that contains the C:\ drive LUN just after midday. The snapshots were huge. I deleted all the snapshots and gave the volume another 50GB to make sure it didn't stop. Now utilisation is down to ~25%.

But now I have to figure out what caused this.

A regular snapshot and snapmirror update happens at midday, and then the volume utilisation explodes. Here's the graph from DFM: http://imgur.com/2zA1VdP

The volume utilisation at about 2PM:

# df -g |grep por01_vol0
/vol/uraoly_por01_vol0/                60GB       55GB        4GB      93%  /vol/uraoly_por01_vol0/
/vol/uraoly_por01_vol0/.snapshot        0GB       19GB        0GB     ---%  /vol/uraoly_por01_vol0/.snapshot


And the snapshots:

# snap list uraoly_por01_vol0
  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Aug 16 14:37  uraoly-nas03(0118074784)_uraoly_por01_vol0.99312 (snapmirror)
 34% (34%)   31% (31%)  Aug 16 12:00  hourly.0
 34% ( 0%)   31% ( 0%)  Aug 16 08:00  hourly.1
 34% ( 1%)   31% ( 0%)  Aug 16 00:00  nightly.0
 34% ( 0%)   32% ( 0%)  Aug 15 20:00  hourly.2
 35% ( 1%)   32% ( 0%)  Aug 15 16:00  hourly.3
 35% ( 1%)   32% ( 0%)  Aug 15 12:00  hourly.4
 35% ( 0%)   32% ( 0%)  Aug 15 08:00  hourly.5
 35% ( 1%)   33% ( 0%)  Aug 15 00:00  nightly.1
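To see at a glance which snapshot is holding the space, the listing above can be parsed with a few lines of Python. This is only a sketch: the regular expression and the field names (`own_used_pct` and so on) are my own, and it assumes the usual reading of `snap list`, where the first number in each column is cumulative and the number in parentheses is for that snapshot alone.

```python
import re

# Rows copied from the snap list output above.
SNAP_LIST = """\
  0% ( 0%)    0% ( 0%)  Aug 16 14:37  uraoly-nas03(0118074784)_uraoly_por01_vol0.99312 (snapmirror)
 34% (34%)   31% (31%)  Aug 16 12:00  hourly.0
 34% ( 0%)   31% ( 0%)  Aug 16 08:00  hourly.1
 34% ( 1%)   31% ( 0%)  Aug 16 00:00  nightly.0
 34% ( 0%)   32% ( 0%)  Aug 15 20:00  hourly.2
 35% ( 1%)   32% ( 0%)  Aug 15 16:00  hourly.3
 35% ( 1%)   32% ( 0%)  Aug 15 12:00  hourly.4
 35% ( 0%)   32% ( 0%)  Aug 15 08:00  hourly.5
 35% ( 1%)   33% ( 0%)  Aug 15 00:00  nightly.1
"""

# cumulative %/used, own %/used, cumulative %/total, own %/total, date, name
ROW = re.compile(
    r"^\s*(\d+)%\s*\(\s*(\d+)%\)\s+(\d+)%\s*\(\s*(\d+)%\)\s+"
    r"(\w{3}\s+\d+\s+\d{2}:\d{2})\s+(\S+)"
)

def parse_snap_list(text):
    """Return one dict per snapshot row; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "cum_used_pct": int(m.group(1)),
                "own_used_pct": int(m.group(2)),
                "cum_total_pct": int(m.group(3)),
                "own_total_pct": int(m.group(4)),
                "date": m.group(5),
                "name": m.group(6),
            })
    return rows

rows = parse_snap_list(SNAP_LIST)
# The snapshot with the largest share of its own is the one pinning the space.
largest = max(rows, key=lambda r: r["own_used_pct"])
print(largest["name"], largest["date"], f'{largest["own_used_pct"]}%')
```

On this listing the sketch picks hourly.0, taken at 12:00. That fits the reading that the blocks it holds were replaced after the midday snapshot was taken, but treat it as a hint rather than proof.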


Then after I deleted all the snapshots:

# df -g uraoly_por01_vol0
Filesystem                           total       used      avail capacity  Mounted on
/vol/uraoly_por01_vol0/               60GB       36GB       23GB      60%  /vol/uraoly_por01_vol0/
/vol/uraoly_por01_vol0/.snapshot       0GB        0GB        0GB     ---%  /vol/uraoly_por01_vol0/.snapshot
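As a sanity check, the two `df -g` outputs are consistent with each other: the space freed by deleting the snapshots matches what the `.snapshot` line reported beforehand. A small Python sketch of that arithmetic, where the helper name `parse_df_line` is my own and the input lines are copied from the outputs in this post:

```python
def parse_df_line(line):
    """Split one 'df -g' row into filesystem name and whole-GB figures."""
    fields = line.split()

    def gb(value):
        # Values look like '55GB'; drop the unit suffix.
        if not value.endswith("GB"):
            raise ValueError(f"unexpected size field: {value!r}")
        return int(value[:-2])

    return {
        "filesystem": fields[0],
        "total_gb": gb(fields[1]),
        "used_gb": gb(fields[2]),
        "avail_gb": gb(fields[3]),
    }

before = parse_df_line(
    "/vol/uraoly_por01_vol0/ 60GB 55GB 4GB 93% /vol/uraoly_por01_vol0/"
)
snap_before = parse_df_line(
    "/vol/uraoly_por01_vol0/.snapshot 0GB 19GB 0GB ---% /vol/uraoly_por01_vol0/.snapshot"
)
after = parse_df_line(
    "/vol/uraoly_por01_vol0/ 60GB 36GB 23GB 60% /vol/uraoly_por01_vol0/"
)

# Used space before, minus what the snapshots held, should equal used space after.
expected_used_after = before["used_gb"] - snap_before["used_gb"]
print(expected_used_after, after["used_gb"])
```

The figures are rounded to whole GB in the `df -g` output, so only a rough match should be expected in general.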


So, I am trying to work out WHEN whatever activity caused this actually happened. The snapmirror happened at midday, then the volume utilisation exploded.

Am I right in thinking that whatever caused this snapshot growth happened AFTER the midday snapshot, rather than before midday?

There's nothing unusual in the event logs, so I'm going to have to ask the general population if anyone was making any changes on that server at the time, and I want to make sure I have my timeframes right.

Thanks,

Dean

Re: What caused this snapshot growth?

Snapshots "grow" when the data they contain is changed in the active file system. So after the snapshot was taken, some host activity resulted in a large amount of changed data. You have to check what your hosts are doing - there is nothing NetApp can do about it.

Re: What caused this snapshot growth?

Ask your Windows or Oracle guys. There is no way to find out what's going on inside the LUN from the storage side.

Maybe an Oracle admin did a database reorganization, or a Windows admin thought it would be a good idea to run a defrag...

Re: What caused this snapshot growth?

Also, you said you are serving LUNs. What is your fractional reserve setting?

You should look into volume autogrow so you don't have this issue and can sleep like a baby.

Re: What caused this snapshot growth?

Yes, I definitely want to sleep like a baby. I think the fractional reserve is set at the default (100%?). I can't check right now (I'm at home). But for what it's worth, this issue seems to have actually been caused by reallocate. We recently upgraded from 7.3 to 8.something, and the first reallocate on this volume seems to have exploded the snapshots. I'm reading up on how reallocate changes between 7.x and 8.x. It's pretty confusing. But thanks for the tips.

Re: What caused this snapshot growth?

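By default, reallocate does not move data in the snapshots (because they are basically read-only). The blocks in the active file system get rewritten to new locations while the snapshots keep holding the old ones, so snapshot usage grows. So yes, this could be the issue.

For the future you should use reallocate -p, which will also move snapshot data.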

Re: What caused this snapshot growth?

I've inherited this system, so I don't know what's been set up in the past. Reallocate seems to be happening on this volume automatically. How do I change that to use the -p flag?

Re: What caused this snapshot growth?

Give us the output from reallocate status.

Also, with the default fractional reserve of 100, the volume should be 2x the size of the LUN, so as to protect the LUN (unless I'm wrong).

Best practice now is fractional reserve 0, and autogrow.
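The sizing rule of thumb above can be written out as simple arithmetic. This is a rough sketch, not an official formula: the function name is made up, and the LUN size and snapshot change figures are hypothetical placeholders, since the thread does not give the real ones.

```python
def min_volume_size_gb(lun_gb, fractional_reserve_pct, snapshot_delta_gb):
    """Rule-of-thumb volume size: LUN + overwrite reserve + snapshot changes."""
    overwrite_reserve_gb = lun_gb * fractional_reserve_pct / 100
    return lun_gb + overwrite_reserve_gb + snapshot_delta_gb

LUN_GB = 30            # hypothetical LUN size, not taken from the thread
SNAPSHOT_DELTA_GB = 5  # hypothetical space for changed blocks held by snapshots

with_full_reserve = min_volume_size_gb(LUN_GB, 100, SNAPSHOT_DELTA_GB)
with_zero_reserve = min_volume_size_gb(LUN_GB, 0, SNAPSHOT_DELTA_GB)

print(with_full_reserve)  # 2x the LUN plus the snapshot delta
print(with_zero_reserve)  # the LUN plus the snapshot delta
```

With the reserve at 0 nothing is set aside for overwrites, which is presumably why it is paired with autogrow as the safety net.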

Re: What caused this snapshot growth?

"How do I change that to use the -p flag?"

For scheduled reallocation, you can't.