Network and Storage Protocols

Volume with no activity filling up


Hi there,

I have a strange issue...

We have three volumes, each one containing an FC LUN which is the SAN boot drive for an ESXi 4.1 host.  So, 3 ESXi hosts, 3 LUNs, 3 volumes.  Each volume was created at the same time as far as I know, and a vol status -v shows the same settings for all three.

One of the volumes is constantly filling up, and since the only thing in the volume is the LUN with ESXi on it, I thought it was probably something to do with that, maybe excessive logging or something.  After a bit of troubleshooting, which showed no particular issues, we switched off the ESXi host.

The volume is still filling up, however.  Every time a snapshot is taken, the volume pretty much fills up, but the snapshots themselves are tiny.  If I delete the snapshot, the space is reclaimed.

The other two volumes do not exhibit this behaviour but regardless of that, the volume is filling up even with the ESXi host switched off, i.e. no activity on the LUN at all.

Can anybody suggest a place to start looking for the cause of this?

vol settings...

filer> vol status volume -v
         Volume State           Status            Options
volume online          raid_dp, flex     nosnap=off, nosnapdir=off,
                                                  minra=off, no_atime_update=off,
                                                  guarantee=volume, svo_enable=off,
                         Volume UUID: 9506e00c-20dd-11e1-84f0-00a09816e2b8
                Containing aggregate: 'aggr_sas0'

                Plex /aggr_stor02_sas0/plex0: online, normal, active
                    RAID group /aggr_sas0/plex0/rg0: normal
                    RAID group /aggr_sas0/plex0/rg1: normal
                    RAID group /aggr_sas0/plex0/rg2: normal

        Snapshot autodelete settings for volume:
                                        prefix=(not specified)
        Volume autosize settings:

Many thanks!




When provisioning LUNs, the volume hosting the LUN should be sized at 2 * LUN size + snapshot space. By default the fractional reserve is 100% on volumes with a "volume" guarantee; that number can be adjusted if needed. What's the snap reserve on the volume holding the LUN? What's the output of df -g on that volume? What's the LUN size? And what does lun stats show on that volume after zeroing the counters with lun stats -z?
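To make the sizing rule concrete, here is a minimal sketch (my own illustration in Python, not anything the filer runs) of how the pieces add up; the function name and parameters are hypothetical:

```python
# Illustrative sketch (not NetApp code): the 7-Mode sizing rule for a
# fully space-reserved LUN with snapshots, as described above.

def required_volume_size_gb(lun_size_gb, snapshot_space_gb,
                            fractional_reserve_pct=100):
    """Space a volume needs to host one space-reserved LUN.

    fractional_reserve_pct=100 (the default for guarantee=volume)
    means the filer sets aside a full second copy of the LUN, so that
    every block can be overwritten while a snapshot pins the old data.
    """
    reserve_gb = lun_size_gb * fractional_reserve_pct / 100
    return lun_size_gb + reserve_gb + snapshot_space_gb

# With the defaults this is the familiar "2 * LUN size + snapshot space":
print(required_volume_size_gb(8, 2))   # 8 + 8 + 2 = 18.0
```

Dropping the fractional reserve to a lower percentage shrinks the middle term accordingly, which is why it is sometimes tuned down when LUN change rates are known to be low.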

You might want to verify that there are no CIFS, NFS shares on the volume hosting the LUN and that no other device outside the ESXi host is writing to your LUN.



Hi Bertraut, thanks for your reply.

Some more info:

Volume is 12GB

LUN is 8GB

Fractional reserve is 100% but we do not expect much change in the LUN as it's an install of ESXi only.

Snap reserve is 0%

There are no CIFS shares or NFS exports.  No other devices are mapped to the LUN so nothing else can be writing to it, only the igroup with that server in it.

df -g with some snapshots existing:

filer> df -g v_boot

Filesystem               total       used      avail capacity  Mounted on

/vol/v_boot/       12GB       11GB        0GB      97%  /vol/v_boot/

/vol/v_boot/.snapshot        0GB        0GB        0GB     ---%  /vol/v_boot/.snapshot

List of (very small) snapshots:

filer> snap list v_boot
Volume v_boot

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Feb 27 16:00  hourly.0
  0% ( 0%)    0% ( 0%)  Feb 27 15:00  hourly.1
  0% ( 0%)    0% ( 0%)  Feb 27 14:00  hourly.2
  0% ( 0%)    0% ( 0%)  Feb 27 13:00  hourly.3

df -g with no snapshots:

filer> df -g v_boot

Filesystem               total       used      avail capacity  Mounted on

/vol/v_boot/       12GB        8GB        3GB      67%  /vol/v_boot

/vol/v_boot/.snapshot        0GB        0GB        0GB     ---%  /vol/v_boot/.snapshot

All this is with the following LUN stats:

filer> lun stats /vol/v_boot/q_boot/l_boot

    /vol/v_boot/q_boot/l_boot  (0 hours, 18 minutes, 39 seconds)

        Read (kbytes)   Write (kbytes)  Read Ops  Write Ops

        0               0               0         0

If I then manually create a snapshot, it goes straight back up to 97% again.

filer> df -g v_boot

Filesystem               total       used      avail capacity  Mounted on

/vol/v_boot/       12GB       11GB        0GB      97%  /vol/v_boot/

/vol/v_boot/.snapshot        0GB        0GB        0GB     ---%  /vol/v_boot/.snapshot

All still with no LUN activity.

I guess this is something to do with fractional reserve since the volume is not big enough to support the LUN with 100% fractional reserve.  However, if the data is not changing, shouldn't it be OK?
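The df figures above do line up with that guess. A rough back-of-the-envelope model (my own sketch with made-up helper names, not ONTAP output, and ignoring df's rounding) of how the reserve pins the volume:

```python
# Rough model of the df -g figures above, assuming the 100% fractional
# reserve is claimed only while at least one snapshot exists.

VOLUME_GB = 12
LUN_GB = 8          # fully space-reserved LUN

def used_gb(snapshots_exist, fractional_reserve_pct=100):
    used = LUN_GB
    if snapshots_exist:
        # Reserve enough space to overwrite the whole LUN, capped at
        # whatever the volume can actually provide.
        reserve = LUN_GB * fractional_reserve_pct / 100
        used = min(VOLUME_GB, used + reserve)
    return used

print(used_gb(False))  # 8  -> matches the 67% df with no snapshots
print(used_gb(True))   # 12 -> the reserve wants 8 + 8 = 16 GB, more than
                       #       the volume holds, so the volume reads full
```

So even with zero change in the LUN, merely creating a snapshot makes the filer try to set aside another full LUN's worth of space, and a 12 GB volume cannot supply it.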


I have some more info - apparently there used to be a lot more data in that LUN, which has since been removed.  It was ISO files stored in the local ESXi datastore.  So I guess that at some point the space reserved by the fractional reserve would have been higher.  Is there some sort of "tidemark" set when using fractional reserve, so that even if the data in the LUN is reduced, the snapshots still reserve the same amount of space?


a) Fractional reserve comes into play exactly when you create a snapshot - not "at some point in the past". Every time you create a snapshot, it reserves exactly the amount of space that the LUN currently occupies on the NetApp.

b) When the host deletes data in its filesystem, the space is not freed from the NetApp's point of view. So if the LUN was once nearly full, it remains that full now.

Unfortunately, the only way to reduce the space consumption is to create another LUN, copy the data over, and destroy the existing one. While doing this, the new LUN could temporarily be set to no space reservation to reduce consumption.
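Point (b) can be illustrated with a toy model (my own sketch, not how WAFL is actually implemented): the filer only ever sees writes to blocks, never filesystem-level deletes, so its view of the LUN's usage can only grow:

```python
# Toy model: the filer tracks which LUN blocks have ever been written.
# A host-side "delete" only updates the guest filesystem's own metadata;
# no command reaches the filer to release the blocks, so from the
# filer's perspective usage never goes down.

class ToyLun:
    def __init__(self):
        self.written = set()   # block numbers the filer has seen writes to

    def host_write(self, block):
        self.written.add(block)

    def host_delete(self, block):
        # The guest filesystem marks the block free in its own tables,
        # but nothing tells the filer - so this is a no-op here.
        pass

    def filer_used_blocks(self):
        return len(self.written)

lun = ToyLun()
for b in range(1000):           # ISOs copied onto the datastore
    lun.host_write(b)
for b in range(1000):           # ISOs deleted again inside ESXi
    lun.host_delete(b)
print(lun.filer_used_blocks())  # still 1000: usage never shrinks
```

That is exactly the "tidemark" effect you guessed at: the filer's view of the LUN stays at its historical high-water mark, and the fractional reserve is calculated from that.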


Thanks for providing those extracts, Peter. Your issue comes down to the fact that there is not enough space in the volume to accommodate your LUN, the fractional reserve (when in use) and the snapshots. I've witnessed instances where the system didn't report the actual snapshot sizes in FilerView or in the snap delta output - with your storage at 97%, the LUN in the snapshot pulled storage from the fractional reserve to allow writing over parts of the LUN that were already written. Increase the size of your volume so it has enough storage for 2 * LUN size + snapshot space, and that will take care of your current issue. If you run df -r on that volume, you'll clearly see how much of the reserve has been used.



Thanks for both of your answers, I understand what is going on now.  I think the key is that space freed inside the LUN will not be returned to the volume, since the filer knows nothing about the file system on top of it.

I'd like to give you both points if possible, how can I do that?  Do I use the 'helpful answer' option for both?

Many thanks