Does anyone have an idea how I can troubleshoot deduplication?

WASZKIEWICZ · ‎2011-09-21

Hello!

I'm having a huge deduplication issue on a volume.

Let me explain. On a FAS6080c running ONTAP 7.3.2P6 (so with a dedupe size limit of 16TB), I am running a single deduplication process.

This is on an NFS datastore with the following capacity:

Filesystem       total    used   avail  capacity  Mounted on
/vol/ESX_FS/    2048GB  1101GB   946GB       54%  /vol/ESX_FS/

The whole system is really slow due to the deduplication process started two days ago:

Path: /vol/ESX_FS
State: Enabled
Status: Active
Progress: 68295784 KB (1%) Done
Type: Regular
Schedule: sun-sat@20
Minimum Blocks Shared: 1
Blocks Skipped Sharing: 0
Last Operation Begin: Sat Sep 17 20:00:00 CEST 2011
Last Operation End: Mon Sep 19 10:37:48 CEST 2011
Last Operation Size: 0 KB
Last Operation Error: -
Changelog Usage: 99%
Checkpoint Time: Sat Sep 17 22:41:59 CEST 2011
Checkpoint Op Type: Start
Checkpoint stage: Saving_pass2
Checkpoint Sub-stage: sort_pass2
Checkpoint Progress: -
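For anyone comparing this status across many filers, here is a minimal Python sketch that turns "Key: value" output like the above into a dictionary and flags a nearly full changelog. The output looks like it comes from `sis status -l`, but that is an assumption. The function names and the 90% threshold are hypothetical choices for illustration, not NetApp guidance.

```python
def parse_sis_status(text):
    """Parse 'Key: value' lines into a dict (splits on the first colon only)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def changelog_warning(fields, threshold_pct=90):
    """Return True if changelog usage is at or above the assumed threshold."""
    usage = fields.get("Changelog Usage", "0%").rstrip("%")
    return int(usage) >= threshold_pct


# A few lines copied from the status output above.
sample = """\
Path: /vol/ESX_FS
Status: Active
Progress: 68295784 KB (1%) Done
Last Operation Begin: Sat Sep 17 20:00:00 CEST 2011
Changelog Usage: 99%
Checkpoint stage: Saving_pass2
"""

status = parse_sis_status(sample)
print(status["Status"], status["Changelog Usage"], changelog_warning(status))
```

Splitting on the first colon only keeps timestamps such as "20:00:00" intact in the value.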

CPU utilization is constantly at 98-99%.

Moreover, this volume is already deduplicated:

Filesystem            used       saved  %saved
/vol/ESX_FS/    1154671504   285134996     20%
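As a quick sanity check on these figures, a short Python sketch, assuming the values are in KB and that %saved is computed as saved / (used + saved). The variable names are just for illustration.

```python
# Values copied from the output above (assumed to be in KB).
used_kb = 1154671504
saved_kb = 285134996

# Assumed formula: savings relative to the logical (pre-dedupe) size.
pct_saved = 100 * saved_kb / (used_kb + saved_kb)
print(f"{pct_saved:.1f}% saved")
```

Under that assumption the result rounds to the 20% reported above.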

If anyone is familiar with this "bug" or knows how to solve it, I would be really thankful!

Best Regards,

Ronald Waszkiewicz.

ERIC_TSYS · ‎2011-09-21

Hi Ronald,

A CPU running that high whilst doing deduplication is not a problem per se. If a controller is not busy when dedupe starts, dedupe will "hog" as much CPU as it can to finish as fast as possible. If IOPS on the controller increase, dedupe will back off. So the dedupe process dynamically takes CPU or backs off as needed; it's clever.

Are your clients seeing any issues? Have you received complaints about slowness? Also, what does your sysstat output show?

Eric

WASZKIEWICZ · ‎2011-09-21

Hi Eric,

Thanks for your answer.

The sysstat output:

CPU    NFS   CIFS   HTTP      Net kB/s     Disk kB/s     Tape kB/s  Cache
                            in    out   read  write   read  write    age
99%   5017      0      0  15644  61355  23540     16      0      0      3
99%   4381      0      0  15667  20558  23158     12      0      0      3
95%   5327      0      0  15113  26484  44091  36358      0      0      3
99%   4890      0      0  17757  30911  18556  43157      0      0      3
99%   5412      0      0  25259  37043  26346  45670      0      0      4
99%   5929      0      0  31861  44366  30113  82632      0      0      4
99%   4723      0      0  27365  37546  16155     72      0      0      4
99%   5140      0      0  41493  36226  17743     16      0      0      4
99%   7025      0      0  75088  24130  17609      0      0      0      4
99%   6707      0      0  52590  32427  12435     12      0      0      3

The problem isn't so much client-related right now as the fact that a simple dedupe of 800GB of data is taking so long on such a huge machine! When I stop dedupe, the CPU usage goes back to 1-15%.

The deduplication speed is approximately 1GB per 2 hours! That's not really normal. We have more than a hundred NetApp systems, and it's the only one with these symptoms. Maybe it's not dedupe-related at all, but something that's going on with the volume?
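To put that rate in perspective, a rough Python estimate, assuming the speed stays constant at 1GB per 2 hours. The helper name is hypothetical. The 800GB figure is the one quoted above and 1101GB is the used space from the df output.

```python
def hours_to_finish(data_gb, gb_per_hour):
    """Hours needed to process data_gb at a constant rate."""
    return data_gb / gb_per_hour


rate = 1 / 2  # GB per hour, i.e. 1GB every 2 hours

for size_gb in (800, 1101):
    hours = hours_to_finish(size_gb, rate)
    print(f"{size_gb} GB: {hours:.0f} h (about {hours / 24:.0f} days)")
```

At that pace a single pass would take roughly two to three months, which is why the rate looks so wrong.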

Thanks for your help.

Cheers,

Ronald.

columbus_admin · ‎2011-09-21

Make sure you check "sysstat -M" rather than "sysstat -x" to get true CPU utilization. The "-x" output sometimes gives you the highest amount of work across all CPUs (in a multi-CPU system), and is frequently misleading.

How much space is free in the aggregate? WAFL uses aggr free space for a lot of the swapping needed for various operations, and a full aggr is going to be a problem point. Depending on size and other dedupe levels, an aggr that is 97 to 98% utilized would be a problem.

Assuming from the volume name that you are using this for VMware VMDKs, you may have a lot of white space and the filer is constantly having to ignore space that is not valid.

Are you using LUNs or NFS to present to your ESX? LUNs will give you the white space issue, NFS will not. LUNs are simply large files, and because ESX hosts don't overwrite data when it is deleted, you end up with garbage. It throws off backups and volume space reports too.

- Scott

WASZKIEWICZ · ‎2011-09-21

Hi Scott,

You are right on everything...

The aggr is 86% full, and the only CPU working hard is CPU7 (99%). The others are normal. That's why nobody is complaining...
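A small Python illustration of why the headline CPU figure can mislead, following Scott's description that the single CPU column may reflect the busiest CPU. Only the 99% for CPU7 comes from this thread. The other per-CPU values and the count of eight CPUs (CPU0 to CPU7) are made up for the example.

```python
# Hypothetical per-CPU busy percentages for CPU0..CPU7.
# Only the last value (CPU7 at 99%) is taken from the thread.
per_cpu_busy = [5, 8, 4, 6, 7, 5, 9, 99]

busiest = max(per_cpu_busy)                       # what a "busiest CPU" column would show
average = sum(per_cpu_busy) / len(per_cpu_busy)   # load averaged across all CPUs

print(f"busiest CPU: {busiest}%, average across CPUs: {average:.1f}%")
```

With numbers like these, one saturated CPU reads as 99% even though the controller as a whole is mostly idle.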

In this case it's NFS. Would that explain the slow process?

Best regards, and thanks a lot for your valuable answer!

Ronald.

columbus_admin · ‎2011-09-21

Hmmm, the only time I have seen NFS impact filer performance was at a previous employer, where a group had mounted over the e0m interface. There were six or seven hosts connected via GbE to the 10/100 interface. That pushed CPU resources up, but overall the impact on the filer was limited to network connectivity, i.e. high-latency NFS responses.

- Scott

thomas_glodde · ‎2011-09-21

You are running on a defined schedule, as far as I can see from your output. Have you manually run a "sis start -s /vol/ESX_FS" before enabling the schedule?

WASZKIEWICZ · ‎2011-09-21

Hi Thomas. No, I stopped it to see whether it was what was driving the CPU usage in the first place.

Then I restarted it.

Regards,

Ronald.