Efficiency Errors on DR Volumes

TMADOCTHOMAS · 2021-03-05

Hello,

Over the past six-plus months I have been noticing an ever-increasing number of volumes on our DR cluster generating volume efficiency errors. I don't understand this, as I don't run efficiency jobs in DR. The destination volumes of course include whatever savings were obtained on the source, and I do have compaction enabled on the destination, but there are no scheduled efficiency jobs running.

I am looking at an example now. It says an efficiency job started last night at 10:02:03 and ended at 10:02:21 with a Failure because "Operation was stopped". Changelog usage is 0%. Stale Fingerprint Percent is 1.

Has anyone else encountered this before, or does anyone have any ideas? It doesn't appear to be a significant problem, but the main thing I want to reduce is all the alert noise, so I can focus on legitimate issues to resolve.

jcolonfzenpr · 2021-03-05

Are there any errors in the destination sis or snapmirror_[audit,error] logs?

https://<cluster-mgmt-LIF>/spi/

https://docs.netapp.com/ontap-9/topic/com.netapp.doc.dot-cm-sag/GUID-E593FD00-D062-4649-853A-4409E282FA12.html

Hope this helps.

Jonathan Colón | Blog | Linkedin

TMADOCTHOMAS · 2021-03-05

Thanks @jcolonfzenpr, I hadn't thought of that!

Here is a clean log for one volume's sis job, no errors:

[sid: 0] Info (sis start vault)
[sid: 1614593546] Begin (sis start)
[sid: 1614593546] Processing transfer data logs (94314583 log entries)
[sid: 1614593546] Generating transfer change logs (82548031 log entries)
[sid: 1614593546] Sort (82548031 fp entries)
[sid: 1614593546] Dedup Pass1 (69042 dup entries)
[sid: 1614593546] Dedup Pass2 (1774 dup entries)
[sid: 1614593546] Sharing (0 return status)
[sid: 1614593546] Stats (blks gathered 0,finger prints sorted 1671058788,dups found 69042,new dups found 1774,blks deduped 0,finger prints checked 0,finger prints deleted 0)
[sid: 1614593546] End (330192124 KB)

Then, here's another from a different time with the error:

[sid: 0] Info (sis start vault)
[sid: 1614720003] Begin (sis start)
[sid: 1614720003] Processing transfer data logs (86548136 log entries)
[sid: 0] Info (sis stop vault)
[sid: 0] Info (Dedupe operation is pausing)
[sid: 1614720003] Stats (blks gathered 0,finger prints sorted 0,dups found 0,new dups found 0,blks deduped 0,finger prints checked 0,finger prints deleted 0)
[sid: 0] Error (Operation was stopped )

As you can see, there is no real explanation: the job just pauses and then stops. The email alert gets sent when it sees "Operation was stopped". Any ideas?
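Since the stated goal is to cut alert noise, one option is to tell a job that was explicitly stopped apart from a job that failed on its own before raising an alert. The following is only a minimal sketch. It assumes the log line shape seen in the two excerpts above, and the function name `classify_sis_log` and the rule "an 'Operation was stopped' error that follows a 'sis stop' entry counts as a stop, not a failure" are illustrative assumptions, not documented ONTAP behavior.

```python
import re

# Assumed line shape, taken from the two log excerpts in this thread:
#   [sid: <number>] <Kind> (<detail>)
LINE_RE = re.compile(r"^\[sid: (\d+)\] (.+?) \((.*)\)\s*$")


def classify_sis_log(lines):
    """Classify one job's log lines as 'success', 'stopped', 'error', or 'incomplete'.

    Hypothetical rule: an 'Operation was stopped' error that comes after an
    'Info (sis stop ...)' entry is treated as a requested stop.
    """
    stop_requested = False
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        _sid, kind, detail = match.groups()
        detail = detail.strip()
        if kind == "Info" and detail.startswith("sis stop"):
            stop_requested = True
        elif kind == "Error":
            if stop_requested and detail == "Operation was stopped":
                return "stopped"
            return "error"
        elif kind == "End":
            return "success"
    return "incomplete"


# Lines copied from the two excerpts above (the clean job is shortened).
CLEAN_JOB = [
    "[sid: 0] Info (sis start vault)",
    "[sid: 1614593546] Begin (sis start)",
    "[sid: 1614593546] Sharing (0 return status)",
    "[sid: 1614593546] End (330192124 KB)",
]
STOPPED_JOB = [
    "[sid: 0] Info (sis start vault)",
    "[sid: 1614720003] Begin (sis start)",
    "[sid: 1614720003] Processing transfer data logs (86548136 log entries)",
    "[sid: 0] Info (sis stop vault)",
    "[sid: 0] Info (Dedupe operation is pausing)",
    "[sid: 0] Error (Operation was stopped )",
]

print(classify_sis_log(CLEAN_JOB))
print(classify_sis_log(STOPPED_JOB))
```

A filter like this only hides the symptom. It does not explain what issues the "sis stop vault" request in the first place.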

jcolonfzenpr · 2021-03-06

If you have many volumes with the same behavior, I think it's better to open a case.

Can you provide the output of this command:

volume efficiency show -volume <volume with error> -vserver <vserver> -instance

TMADOCTHOMAS · 2021-03-08

Here's an example of one that alerted this weekend. Unfortunately I can't open a case, as this is our DR system, which we only have under third-party support.

Vserver Name: <vserver>
Volume Name: <volume>
Volume Path: <path>
State: Enabled
Status: Idle
Progress: Idle for 02:01:50
Type: Snapvault
Schedule: -
Efficiency Policy Name: -
Blocks Skipped Sharing: 0
Last Operation State: Success
Last Success Operation Begin: Mon Mar 08 05:47:13 2021
Last Success Operation End: Mon Mar 08 05:49:59 2021
Last Operation Begin: Mon Mar 08 05:47:13 2021
Last Operation End: Mon Mar 08 05:49:59 2021
Last Operation Size: 751.4MB
Last Operation Error: -
Changelog Usage: 0%
Logical Data Size: 5.24TB
Logical Data Limit: 640TB
Logical Data Percent: 1%
Queued Job: -
Stale Fingerprint Percentage: 1
Compression: false
Inline Compression: false
Constituent Volume: false
Inline Dedupe: false
Data Compaction: true
Cross Volume Inline Deduplication: false
Cross Volume Background Deduplication: false
Extended Compressed Data: true
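When many volumes need checking, output like the above can be turned into a dictionary and inspected programmatically. This is a rough sketch that assumes every line has the `Name: value` shape shown above. The helper names `parse_instance_output` and `has_scheduled_efficiency` are made up for illustration, and the sample text is a subset of the fields posted above.

```python
def parse_instance_output(text):
    """Parse 'Name: value' lines into a dict.

    Splits on the first ': ' only, so values that contain colons
    (timestamps, 'Idle for 02:01:50') stay intact.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def has_scheduled_efficiency(fields):
    """True if a schedule or an efficiency policy is set ('-' means unset here)."""
    schedule = fields.get("Schedule", "-")
    policy = fields.get("Efficiency Policy Name", "-")
    return schedule != "-" or policy != "-"


# Subset of the output posted above.
SAMPLE = """\
State: Enabled
Status: Idle
Progress: Idle for 02:01:50
Type: Snapvault
Schedule: -
Efficiency Policy Name: -
Last Operation State: Success
Last Operation Begin: Mon Mar 08 05:47:13 2021
Last Operation End: Mon Mar 08 05:49:59 2021
Last Operation Error: -
Data Compaction: true
Extended Compressed Data: true
"""

fields = parse_instance_output(SAMPLE)
print(fields["Last Operation State"])
print(has_scheduled_efficiency(fields))
```

For this sample the volume has neither a schedule nor a policy, which matches the statement that no scheduled efficiency jobs run in DR.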

jcolonfzenpr · 2021-03-08

Sorry for asking so many questions, but can you validate this statement?

"The destination volumes of course include whatever savings were obtained on the source"
- You can check it using the following KB:
  - How to tell if a SnapMirror transfer is maintaining storage efficiency savings

I ask because I saw this in your last comment:

Extended Compressed Data: true
- The KB "SnapMirror storage efficiency configurations and behavior" says:
  - Since extended compressed data includes adaptive compression, enabling on a SnapMirror destination volume will result in an LRE transfer.
  - Logical Replication (LRE) - Source-side storage efficiency savings are not maintained by SnapMirror but can be re-gained at the destination

I have doubts about this and the reported sis errors. Maybe some NetApp folks can take a look and contribute!

TMADOCTHOMAS · 2021-03-08

No apologies needed @jcolonfzenpr, this is very helpful! I searched through the log but didn't find a case where LRE was in use, including specifically with several of the volumes that have been alerting. Having said that, I think you may be on to something. We only recently upgraded to ONTAP 9.5 on both source and destination, and that may have had some kind of impact regarding the new "extended compression" setting.

Do you know how to determine if a volume is configured to set "extended compression"? I know the one I showed you revealed there is "extended compressed" data on the target volume, but does that mean it is configured to "run" on that volume or just that the type of data is there (i.e. as transferred from the source)?

I don't know if I mentioned that our source is an AFF system, which may help make sense of this.

The second KB says "If a source volume uses Extended compressed data, the destination must be running ONTAP 9.5 or later for SnapMirror to maintain storage efficiency savings (LRSE), regardless of the destination's storage efficiency settings." We upgraded both systems to 9.5 on the same day so LRSE should be in use, and I show that it is.

It also says, "Since extended compressed data includes adaptive compression, enabling on a SnapMirror destination volume will result in an LRE transfer." (emphasis mine)

The key question here is: what does it mean by "enabling"? I couldn't find a setting that deliberately enables or disables this feature. Any additional help will be greatly appreciated!

jcolonfzenpr · 2021-03-09

I think Extended Compressed Data is related to adaptive compression.

Can you provide this info from both source and destination?

volume efficiency show -volume <volume with problem> -fields compression,state,compression-type,policy

Also, make sure you are on the latest patch release of the version you are running.

TMADOCTHOMAS · 2021-03-09

Thanks @jcolonfzenpr. Results:

On the production system:

vserver         volume          state   policy        compression-type compression
--------------- --------------- ------- ------------- ---------------- -----------
<vserver>       <volume>        Enabled eff_1000pm_01 adaptive         true

On the DR system:

vserver         volume          state   policy        compression-type compression
--------------- --------------- ------- ------------- ---------------- -----------
<vserver>       <volume>        Enabled -             adaptive         false
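To run the same source-versus-destination comparison across many volumes, the two tables can be parsed and diffed. This is a sketch under stated assumptions: it treats the output as whitespace-separated columns with a header row and a dashed separator row, as shown above, and it would break if any value contained a space. The function names `parse_fields_table` and `diff_settings` are hypothetical.

```python
def parse_fields_table(text):
    """Parse whitespace-separated '-fields' style output into a list of row dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        if set(ln.strip()) <= {"-", " "}:
            continue  # dashed separator row under the header
        rows.append(dict(zip(header, ln.split())))
    return rows


def diff_settings(source_row, dest_row, ignore=("vserver", "volume")):
    """Return {field: (source_value, dest_value)} for fields that differ."""
    return {
        key: (value, dest_row.get(key))
        for key, value in source_row.items()
        if key not in ignore and value != dest_row.get(key)
    }


# The two outputs posted above, placeholders kept as-is.
PROD = """\
vserver volume state policy compression-type compression
--------------- --------------- ------- ------------- ---------------- -----------
<vserver> <volume> Enabled eff_1000pm_01 adaptive true
"""
DR = """\
vserver volume state policy compression-type compression
--------------- --------------- ------- ------------- ---------------- -----------
<vserver> <volume> Enabled - adaptive false
"""

differences = diff_settings(parse_fields_table(PROD)[0], parse_fields_table(DR)[0])
print(differences)
```

For the two outputs above, only `policy` and `compression` differ, while `state` and `compression-type` match on both sides.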

jcolonfzenpr · 2021-03-09

I think your configuration is OK.

Which ONTAP 9.5 patch release are you running?

TMADOCTHOMAS · 2021-03-10

Thanks @jcolonfzenpr. As of a couple of weeks ago we are on ONTAP 9.5P16.

TMADOCTHOMAS · 2021-04-02

Hi @jcolonfzenpr, I was curious whether you had any other suggestions. I would love to hear anyone else's comments as well! I'm at a loss as to why this keeps happening.

jcolonfzenpr · 2021-04-02

I sent you a PM.

Hope this helps

TMADOCTHOMAS · 2021-04-02

Thanks, I saw that! I don't have bandwidth at the moment to work through joining the tool you mentioned but will see if I can do that sometime soon.