Anyone else hit by this deduplication bug?

cgeck0000 · ‎2014-07-16

We're dealing with this bug on our FAS3240 systems. Wanted to see who else has this issue and whether you got it resolved. What did you do to fix it? How long did it take you?

Stale metadata not automatically removed as part of the 'sis start' operation on the volume when running Data ONTAP® 8.1x

KB ID: 7010056 Version: 13.0 Published date: 06/23/2014 Views: 8775

https://kb.netapp.com/support/index?page=content&id=7010056&actp=search&viewlocale=en_US&searchid=1405437488407

Darkstar · ‎2014-07-25

We have a few customers who were hit by that problem. You can see it if your "sis status -l" prints out incredibly huge numbers for "stale metadata" (in one case we had something like 3500% stale metadata, but anything over 30% or so might indicate a problem).
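As a rough illustration of the rule of thumb above, here is a minimal Python sketch that flags volumes whose stale-metadata percentage is above roughly 30%. The volume names and percentages are made up, the numbers are assumed to have been read off "sis status -l" by hand, and the 30% cutoff is only the approximate figure mentioned above, not an official threshold.

```python
# Rule-of-thumb cutoff from the post above ("over 30% or so"); not an official value.
STALE_THRESHOLD_PCT = 30.0


def suspicious_volumes(stale_pct_by_volume, threshold=STALE_THRESHOLD_PCT):
    """Return the volumes whose stale percentage exceeds the threshold, worst first."""
    flagged = [
        (pct, vol) for vol, pct in stale_pct_by_volume.items() if pct > threshold
    ]
    flagged.sort(reverse=True)  # highest stale percentage first
    return [vol for _pct, vol in flagged]


# Hypothetical values, typed in by hand from "sis status -l" output.
example = {"/vol/vol_a": 3500.0, "/vol/vol_b": 12.0, "/vol/vol_c": 45.0}
print(suspicious_volumes(example))
```

With the made-up numbers above, only vol_a and vol_c are flagged, and vol_a is listed first because it is the worst.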

If you're hit by that bug, I found that the only solution to 100% fix it is the following:

1. Disable SIS on the volume(s) in question:
   - sis off /vol/volumename
2. Upgrade Data ONTAP to a newer version (8.1.2P4 is recommended as per the KB article, but newer is better; I'd suggest going directly to 8.1.4Px).
3. Delete the SIS database:
   - priv set diag; sis reset /vol/volumename
4. Restart SIS:
   - sis on /vol/volumename; sis start -s /vol/volumename

This has always fixed it for us. Note that we had a few cases with 8.1.2P4 where a simple "sis start -s" after the upgrade did not help; we had to do a "sis reset".
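For anyone doing this on many volumes, the command sequence above can be written out per volume with a small Python sketch. It only builds and prints the console commands quoted in this post; it does not connect to a filer or run anything. The function name sis_reset_plan is hypothetical, and the upgrade itself is left as a comment line because it is done once per controller, not per volume.

```python
def sis_reset_plan(volume):
    """Return the console commands from the steps above for one volume path."""
    if not volume.startswith("/vol/"):
        raise ValueError("expected a path like /vol/volumename")
    return [
        f"sis off {volume}",                      # disable SIS on the volume
        "# upgrade Data ONTAP before continuing",  # manual step, once per controller
        "priv set diag",                          # 'sis reset' is run at diag level
        f"sis reset {volume}",                    # delete the SIS database
        f"sis on {volume}",                       # re-enable SIS
        f"sis start -s {volume}",                 # start SIS again with -s
    ]


for line in sis_reset_plan("/vol/volumename"):
    print(line)
```

Printing the plan first and pasting the lines by hand keeps a human in the loop, which seems sensible for anything that involves diag mode.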

-Michael

cgeck0000 · ‎2014-07-25

Michael, thank you for the reply.

I believe we have it resolved now.

I upgraded our nodes to 8.2.1P1 early Wednesday morning and let dedup run on its schedule twice on the volumes afterwards (Wed. night and Thursday night) as stated in the KB.

After running the math again based on NetApp's formula we saw the stale fingerprints reduce. Some volumes were in the 500 - 900% range and are now down to 8 - 20% range.

VIRTUALLYMIKEB · ‎2014-07-28

Yes, indeed. A managed services client had a LUN go offline because of this. Looking at the used storage, I couldn't figure out where it all was. One support case later and I find out this is the problem. The volume had about 700GB of stale metadata which probably took 8 hours or so to reduce. We found the bug after we upgraded to 8.2.1 from a buggy 8.1.2 release. It just so happened the volume filled up the morning after the upgrade. Awesome timing.

Mike Brown

VMware, Cisco Data Center, and NetApp dude

Consulting Engineer

michael.b.brown3@gmail.com

Twitter: @VIRTUALLYMIKEB

Blog: http://VirtuallyMikeBrown.com

LinkedIn: http://LinkedIn.com/in/michaelbbrown

cgeck0000 · ‎2014-07-29

To add another tidbit to this.

We originally engaged NetApp performance support because committing VMware snapshots that included a memory snapshot would completely paralyze our controllers. Now that we have upgraded and run dedup twice to remove the stale fingerprints, the issue is gone. We're able to commit the snapshots now and the controllers keep rolling.

Not sure whether it was another bug or the stale fingerprints. Either way, not fun.

mancusomjm · ‎2015-06-25

Hi, can someone email me a copy of this article at mancusomjm@gmail.com? I can't see it.

VIRTUALGEEK2 · ‎2015-08-09

We too have this problem at my work, and the bug is affecting all versions of ONTAP.

http://mysupport.netapp.com/NOW/cgi-bin/bugrellist?bugno=931439

https://msdn.microsoft.com/en-us/library/windows/desktop/aa365503(v=vs.85).aspx

No ETA from NetApp as to when we can expect the fix.

aborzenkov · ‎2015-08-09

How exactly are reparse points related to deduplication? The bug is non-public, so it is not clear what you mean.

VIRTUALGEEK2 · ‎2015-08-09

Just repeating what NetApp support advised:

"

I have found our answer in bug 931439. It is new and very much still under investigation, but I'll outline what's going on. When you do a copy in this manner, Windows tries to create a SIS link instead of rehydrating the file immediately.

We can see that this is the case from the trace in packet 1842.

.... .... .... .... .... ..1. .... .... = Sparse: A SPARSE file

.... .... .... .... .... .1.. .... .... = Reparse Point: Has an associated REPARSE POINT

Here is more on reparse points.

https://msdn.microsoft.com/en-us/library/windows/desktop/aa365503(v=vs.85).aspx

"

I questioned this with the person handling the case as well, and was advised to follow the bug and read the article linked above.
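For reference, the two flag lines in the quoted trace correspond to the Windows file attribute bits 0x200 (sparse file) and 0x400 (reparse point). A minimal Python sketch of decoding them follows; the value 0x600 is an assumed example with only those two bits set, not the actual attribute value from packet 1842, and decode_attributes is a hypothetical helper name.

```python
# Windows file attribute bits, as defined in the Windows SDK headers.
FILE_ATTRIBUTE_SPARSE_FILE = 0x00000200
FILE_ATTRIBUTE_REPARSE_POINT = 0x00000400


def decode_attributes(attrs):
    """Return the names of the two attribute bits discussed above that are set."""
    names = []
    if attrs & FILE_ATTRIBUTE_SPARSE_FILE:
        names.append("SPARSE")
    if attrs & FILE_ATTRIBUTE_REPARSE_POINT:
        names.append("REPARSE_POINT")
    return names


# Assumed example value: only the sparse and reparse-point bits are set.
print(decode_attributes(0x600))
```

This only shows which bits the trace is reporting; it does not explain why a reparse point would show up on a copy from the filer, which is the open question in the bug.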