
Does anyone have an idea how I can troubleshoot deduplication?

WASZKIEWICZ

Hello!

I'm having a huge deduplication issue on a volume.

Let me explain. On a FAS6080c running ONTAP 7.3.2P6 (so with a dedupe size limit of 16TB), I am running a single deduplication process.

This is on an NFS datastore with the following capacity:

Filesystem       total     used      avail    capacity  Mounted on
/vol/ESX_FS/     2048GB    1101GB    946GB    54%       /vol/ESX_FS/

The whole system is really slow because of the deduplication process that started two days ago:

Path:                    /vol/ESX_FS
State:                   Enabled
Status:                  Active
Progress:                68295784 KB (1%) Done
Type:                    Regular
Schedule:                sun-sat@20
Minimum Blocks Shared:   1
Blocks Skipped Sharing:  0
Last Operation Begin:    Sat Sep 17 20:00:00 CEST 2011
Last Operation End:      Mon Sep 19 10:37:48 CEST 2011
Last Operation Size:     0 KB
Last Operation Error:    -
Changelog Usage:         99%
Checkpoint Time:         Sat Sep 17 22:41:59 CEST 2011
Checkpoint Op Type:      Start
Checkpoint stage:        Saving_pass2
Checkpoint Sub-stage:    sort_pass2
Checkpoint Progress:     -
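For anyone who wants to track these fields over time, here is a minimal Python sketch that parses "Key: value" output like the above into a dictionary. The function name parse_status and the 90% warning threshold are illustrative assumptions, not anything NetApp documents; the sample text is a subset of the status output posted above.

```python
# Sketch: parse "Key: value" lines such as the status output shown above.
# parse_status and the 90% threshold are hypothetical, for illustration only.

def parse_status(text):
    """Return a dict mapping each field name to its value string."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, because timestamps contain colons too.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """\
Path:                    /vol/ESX_FS
Status:                  Active
Progress:                68295784 KB (1%) Done
Changelog Usage:         99%
Checkpoint Time:         Sat Sep 17 22:41:59 CEST 2011
Checkpoint stage:        Saving_pass2
"""

status = parse_status(SAMPLE)
changelog_pct = int(status["Changelog Usage"].rstrip("%"))
progress_kb = int(status["Progress"].split()[0])

print("changelog usage: %d%%" % changelog_pct)
print("progress: %.1f GiB" % (progress_kb / 1024 ** 2))
if changelog_pct >= 90:  # assumed threshold, not a documented limit
    print("changelog is nearly full")
```

Splitting on the first colon only keeps values such as the checkpoint timestamp intact.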

CPU utilization is constantly at 98-99%.

Moreover, this Volume is already deduplicated:

Filesystem       used          saved         %saved
/vol/ESX_FS/     1154671504    285134996     20%
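As a quick sanity check on those numbers, the sketch below recomputes the savings percentage, assuming %saved is calculated as saved / (used + saved) with both values in KB. That formula is an assumption, although it is consistent with the 20% shown above.

```python
# Sketch: recompute the savings percentage from the figures above.
# Assumes %saved = saved / (used + saved); values are in KB.

used_kb = 1154671504
saved_kb = 285134996

logical_kb = used_kb + saved_kb          # data size before deduplication
pct_saved = 100 * saved_kb / logical_kb  # share of logical data saved

print("logical data: %d KB" % logical_kb)
print("saved: %.1f%%" % pct_saved)
```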

If anyone is familiar with this "bug" or knows how to solve it, I would be really thankful!

Best Regards,

Ronald Waszkiewicz.


ERIC_TSYS

Hi Ronald,

A CPU running that high whilst doing deduplication is not a problem per se. If a controller is not busy when dedupe starts, dedupe will "hog" as much CPU as it can to finish as fast as possible. If IOPS on the controller increase, dedupe will back off. So the dedupe process dynamically takes CPU or backs off as needed; it's clever.

Are your clients seeing any issues? Have you received complaints about slowness? Also, what does your sysstat output show?

Eric

WASZKIEWICZ

Hi Eric,

Thanks for your answer.

The sysstat output:

CPU    NFS   CIFS   HTTP    Net kB/s       Disk kB/s     Tape kB/s    Cache
                              in    out     read  write   read write    age
99%   5017      0      0   15644  61355    23540     16      0     0      3
99%   4381      0      0   15667  20558    23158     12      0     0      3
95%   5327      0      0   15113  26484    44091  36358      0     0      3
99%   4890      0      0   17757  30911    18556  43157      0     0      3
99%   5412      0      0   25259  37043    26346  45670      0     0      4
99%   5929      0      0   31861  44366    30113  82632      0     0      4
99%   4723      0      0   27365  37546    16155     72      0     0      4
99%   5140      0      0   41493  36226    17743     16      0     0      4
99%   7025      0      0   75088  24130    17609      0      0     0      4
99%   6707      0      0   52590  32427    12435     12      0     0      3
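To summarise samples like these rather than eyeballing them, a small Python sketch can average the CPU and NFS columns. It assumes whitespace-separated rows in the column order shown above (CPU first, NFS ops second); the helper name parse_rows is hypothetical.

```python
# Sketch: average the CPU and NFS columns of the sysstat samples above.
# Assumes whitespace-separated rows with CPU% first and NFS ops/s second.

ROWS = """\
99%   5017      0      0   15644  61355    23540     16      0     0      3
99%   4381      0      0   15667  20558    23158     12      0     0      3
95%   5327      0      0   15113  26484    44091  36358      0     0      3
99%   4890      0      0   17757  30911    18556  43157      0     0      3
99%   5412      0      0   25259  37043    26346  45670      0     0      4
99%   5929      0      0   31861  44366    30113  82632      0     0      4
99%   4723      0      0   27365  37546    16155     72      0     0      4
99%   5140      0      0   41493  36226    17743     16      0     0      4
99%   7025      0      0   75088  24130    17609      0      0     0      4
99%   6707      0      0   52590  32427    12435     12      0     0      3
"""


def parse_rows(text):
    """Return a list of (cpu_percent, nfs_ops) tuples, skipping other lines."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].endswith("%"):
            continue  # header or blank line
        samples.append((int(parts[0].rstrip("%")), int(parts[1])))
    return samples


samples = parse_rows(ROWS)
avg_cpu = sum(cpu for cpu, _ in samples) / len(samples)
avg_nfs = sum(nfs for _, nfs in samples) / len(samples)

print("samples: %d" % len(samples))
print("average CPU: %.1f%%" % avg_cpu)
print("average NFS ops/s: %.1f" % avg_nfs)
```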

The problem isn't so much client-related right now; it's the fact that a simple dedupe of 800GB of data is taking so long on such a huge machine! When I stop dedupe, the CPU usage goes back to 1-15%...

The deduplication speed is approximately 1GB every 2 hours! That's not really normal. We have more than a hundred NetApp systems, and this is the only one with these symptoms... Maybe it's not dedupe-related at all, but something that's going on with the volume?
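To put that rate in perspective, here is a back-of-the-envelope Python sketch using only the two figures quoted above (800GB of data, roughly 1GB per 2 hours). It assumes the rate stays constant, which is a simplification.

```python
# Sketch: projected run time at the observed rate, assuming it stays constant.

data_gb = 800        # amount of data quoted above
hours_per_gb = 2     # observed rate: about 1GB every 2 hours

total_hours = data_gb * hours_per_gb
total_days = total_hours / 24

print("projected run time: %d hours (about %.1f days)" % (total_hours, total_days))
```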

Thanks for your help.

Cheers,

Ronald.

columbus_admin

Make sure you check "sysstat -M" rather than "sysstat -x" to get true CPU utilization. The "-x" output sometimes gives you the highest amount of work across all CPUs (in a multi-CPU system), and is frequently incorrect.

How much space is free in the aggregate? WAFL uses aggr free space for a lot of the swapping needed for various operations, and a full aggr is going to be a problem point. Depending on size and other dedupe levels, an aggr that is 97 to 98% utilized would be a problem.

Assuming from the volume name that you are using this for VMware VMDKs, you may have a lot of white space and the filer is constantly having to ignore space that is not valid. 

Are you using LUNs or NFS to present to your ESX? LUNs will give you the white space issue; NFS will not. LUNs are simply large files, and because ESX hosts don't overwrite data when it is deleted, you end up with garbage. It throws off backups and volume space reports too.

- Scott

WASZKIEWICZ

Hi Scott,

You are right on everything...

The aggr is 86% full, and the only CPU working hard is CPU7 (99%). The others are normal. That's why nobody is complaining...
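A small Python sketch of why a single busy CPU can make the headline figure look alarming: a "busiest CPU" number and an average across CPUs tell very different stories. Only the 99% on CPU7 comes from what I observed; the CPU count and the other values are made-up placeholders in a "normal" range.

```python
# Sketch: busiest-CPU figure versus average across CPUs.
# Only CPU7 = 99 is from the observation above; the CPU count (8) and the
# other percentages are hypothetical placeholders.

per_cpu = {
    "CPU0": 8, "CPU1": 5, "CPU2": 12, "CPU3": 3,
    "CPU4": 10, "CPU5": 6, "CPU6": 9, "CPU7": 99,
}

busiest = max(per_cpu, key=per_cpu.get)      # name of the most loaded CPU
peak = per_cpu[busiest]                      # what a "max" style figure shows
mean = sum(per_cpu.values()) / len(per_cpu)  # average load across all CPUs

print("busiest: %s at %d%%" % (busiest, peak))
print("average across CPUs: %.1f%%" % mean)
```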

In this case it's NFS. Would that explain the slow process?

Best Regards and thanks a lot for your precious answer!

Ronald.

columbus_admin

Hmmm, the only time I have seen NFS impact filer performance was at a previous employer, where a group had mounted the e0m interface. There were six or seven hosts connected via GbE to the 10/100 interface. That pushed CPU resources up, but overall the impact on the filer was limited to network connectivity, meaning high-latency NFS responses.

- Scott

thomas_glodde

You are running on a defined schedule, as far as I can see from your output. Have you manually run a "sis start -s /vol/ESX_FS" before enabling the schedule?

WASZKIEWICZ

Hi Thomas. No, I stopped it to see whether it was what was affecting CPU performance in the first place.

Then I restarted it.

Regards,

Ronald.
