Solved: EMERGENCY: Aggregate filling up when volumes are not full

JBERG_DIGITALREALTY · 2014-08-26

My issue is almost identical to this one, which went unresolved: http://www.gossamer-threads.com/lists/netapp/toasters/11645

I have an 8.65TB 64-bit dual-parity aggregate that contains three NFS volumes totaling 7TB, which should leave 1.65TB of free space in the aggregate. The volumes are thick provisioned, so they have a 100% space guarantee. Because of this, all 7TB has been committed, and each volume has on average about 500GB of free space. Fractional reserve is set to 100% (the default setting for the volumes) but doesn't apply here because we are not using LUNs.

My aggregate is losing space by the minute. It is currently at 97% utilization with only 289GB of free space, and it loses roughly 1GB every 5 minutes. There is nothing else in this aggregate at all. We do take snapshots, but that shouldn't matter because the volumes are not full and they are not set to auto-grow. Even so, we deleted a bunch of snaps and got some space back, only to watch it slip away again within a short time.
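For a sense of scale, here is a rough sketch of how long the remaining space would last, assuming the loss rate stays constant at the figures quoted above. The function name is made up for illustration and is not part of any NetApp tooling.

```python
def minutes_until_full(free_gb: float, loss_gb: float, per_minutes: float) -> float:
    """Minutes until free space hits zero, assuming a constant loss rate.

    Hypothetical helper for illustration only.
    """
    if loss_gb <= 0 or per_minutes <= 0:
        raise ValueError("loss rate must be positive")
    # free space divided by (loss_gb / per_minutes), rearranged to avoid
    # an intermediate fractional rate
    return free_gb * per_minutes / loss_gb


# Figures from the post: 289GB free, losing roughly 1GB every 5 minutes.
minutes = minutes_until_full(free_gb=289.0, loss_gb=1.0, per_minutes=5.0)
hours = minutes / 60

print(f"About {minutes:.0f} minutes (~{hours:.1f} hours) until the aggregate is full")
```

At that rate the aggregate has roughly a day left, which is why this can't wait.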

I am blown away and have no idea what the hell is going on here. I opened a ticket with NetApp and they aren't sure what is going on either. Nothing is shown in the message or system logs regarding any operations that could be consuming space. Has anyone seen this before? I can only shrink the volumes a few more times and then I will be out of options and will have to sit here and allow this aggregate to fill up. I cannot understand how the aggregate can be filling up when the total volume size is 1.65TB smaller than the aggregate, and the volumes are not full nor are they growing.

Any help here would be most appreciated.

JBERG_DIGITALREALTY · 2014-08-26

NetApp support figured out what the problem was. A bug affecting multiple versions of Data ONTAP (DOT) causes the deduplication fingerprint file to become stagnant (BURT 657692). They had me stop the currently running SIS processes and change my storage efficiency schedule to on-demand, which freed up all the 'missing' space and prevents SIS from running again unless it is invoked manually. The short-term fix is to run SIS manually and monitor the jobs and aggregate space. The long-term fix is to upgrade DOT to 8.1.4P4, which contains the fix for this bug.

Here are some links on the bug:

http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=657692

https://kb.netapp.com/support/index?page=content&id=7010056


kryan · 2014-08-26

Please post your support case # so that it can be followed up on.

Thanks,

Kevin

JBERG_DIGITALREALTY · 2014-08-26

Hey kryan, my case # is 2005208405.
