Performance degradation after head upgrade

martijnvana

Hi All,

A customer of mine has a performance issue on their SnapVault/SnapMirror destination filer.

Last August they did a head upgrade of the HA cluster and the SnapVault/SnapMirror setup.

Old:

FAS3140 (HA-cluster)

FAS3140 (Snapvault)

New:

FAS3250 (HA-cluster)

FAS3250 (Snapvault)

I would expect a performance improvement from this upgrade, going from 4 cores to 8 cores. Unfortunately this is not the case. What I see is a performance degradation. The CPU utilization is almost continuously at 100% (I've attached some graphs).

I see that the Kahu and Kahuna domains together take up 100% of the CPU. The SnapVault backups are set up with DFM. I think this is to be expected, since SnapVault runs its backups on a file-based principle (checking the inodes and so on), which has to be Kahuna-intensive. What I don't understand is why the CPU utilization on the old FAS3140 was lower than on the new FAS3250. I would expect it to be the other way around.

In the graphs I attached you can see the following:

CPU Util before - This is the performance before the upgrade.

CPU Util upgrade - On the 11th of August the upgrade was performed.

CPU Util current - The current situation.

Setup - gives the current setup.

Any help would be appreciated.

2 Replies

colin_graham

Is there actually a performance degradation, or is it just the CPU utilization that DFM is showing that concerns you?

Data ONTAP uses a different kernel and CPU scheduling algorithm on systems with more than 4 cores so that the cores are used more evenly. Unfortunately, this also gives "inaccurate" CPU utilisation stats to DFM and similar tools.

In fact, in the latest version of DFM, CPU monitoring is disabled by default because of this.

On our 6210 (also 8 cores), DFM can show the CPU pinned at 100% while the real CPU utilisation, as reported by "sysstat -m" or PowerShell, can be as low as 30%. The controller performs normally regardless.
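
To illustrate how a gap like that can arise, here is a minimal sketch with made-up numbers. It assumes a monitor that counts a sampling tick as busy whenever at least one core is busy, and compares that with the busy percentage averaged over all cores. Whether DFM actually computes its figure this way is an assumption and is not confirmed anywhere in this thread; the tick data and function names are hypothetical.

```python
# Hypothetical data: each row is one sampling tick, each entry is 1 if that
# core was busy during the tick and 0 if it was idle (8 cores).
ticks = [
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 0],
]


def any_core_busy_pct(samples):
    """Percentage of ticks in which at least one core was busy."""
    busy_ticks = sum(1 for tick in samples if any(tick))
    return 100.0 * busy_ticks / len(samples)


def average_core_busy_pct(samples):
    """Busy percentage averaged over all cores and all ticks."""
    total_slots = sum(len(tick) for tick in samples)
    busy_slots = sum(sum(tick) for tick in samples)
    return 100.0 * busy_slots / total_slots


print(f"any core busy:    {any_core_busy_pct(ticks):.0f}%")
print(f"average per core: {average_core_busy_pct(ticks):.0f}%")
```

With this made-up data, every tick has at least one busy core, so the first figure is 100%, while only 9 of the 32 core slots are busy, so the average is about 28%.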

martijnvana

Well, looking at the output of commands that I run, they don't feel as snappy as on other systems. For example, an rdfile /etc/messages takes up to 1:34 minutes (clocked) to display 263 lines. On the other system, also a FAS3250, it is done in less than 5 seconds.

Here's the output of sysstat -M 1 (cut in half for readability).

ANY1+ ANY2+ ANY3+ ANY4+ ANY5+ ANY6+ ANY7+ ANY8+  AVG CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7

100%   94%   80%   59%   41%   27%   18%   11%  56%  60%  58%  56%  57%  56%  53%  54%  55%

100%   93%   78%   56%   39%   27%   18%   12%  55%  54%  56%  59%  58%  52%  50%  53%  54%

100%   91%   75%   56%   38%   26%   17%   11%  53%  55%  54%  53%  55%  52%  52%  52%  50%

100%   92%   77%   57%   40%   27%   18%   11%  54%  56%  55%  55%  55%  54%  54%  53%  51%

100%   92%   78%   59%   42%   29%   20%   12%  55%  56%  56%  55%  56%  53%  55%  55%  54%

100%   93%   80%   61%   43%   30%   20%   12%  57%  58%  58%  56%  60%  56%  55%  57%  56%

100%   93%   79%   60%   43%   31%   20%   14%  57%  58%  56%  58%  58%  56%  55%  56%  55%

100%   93%   79%   59%   41%   28%   18%   12%  55%  58%  55%  56%  57%  53%  52%  53%  54%

100%   93%   78%   57%   40%   27%   19%   13%  54%  56%  56%  56%  56%  53%  53%  54%  53%

100%   89%   74%   55%   38%   26%   17%   11%  52%  53%  55%  51%  58%  52%  51%  51%  51%

100%   93%   79%   60%   43%   31%   19%   13%  57%  58%  56%  58%  60%  56%  54%  56%  55%

100%   90%   74%   53%   36%   24%   16%   10%  52%  54%  52%  54%  53%  51%  51%  50%  50%

100%   79%   55%   36%   23%   14%    8%    4%  41%  42%  43%  41%  42%  41%  39%  39%  40%

100%   77%   54%   35%   22%   14%    9%    5%  41%  41%  43%  41%  46%  39%  39%  39%  38%

100%   84%   63%   44%   29%   20%   11%    6%  46%  46%  48%  47%  49%  45%  46%  45%  44%

100%   93%   78%   59%   42%   30%   21%   12%  56%  59%  59%  56%  58%  55%  54%  54%  53%

100%   91%   77%   57%   40%   26%   17%   11%  54%  55%  55%  56%  55%  53%  52%  52%  52%

100%   93%   80%   60%   43%   29%   20%   13%  56%  58%  57%  56%  55%  56%  54%  55%  55%

100%   92%   76%   55%   37%   25%   16%   10%  53%  57%  54%  54%  55%  51%  51%  51%  50%

Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host  Ops/s   CP

     2%       0%      0%     53%  39%     0%    50%    140%( 49%)          0%       32%   0%   106%   9%  16%      0   0%

     3%       0%      0%     49%  35%     0%    53%    138%( 46%)          0%       38%   0%   104%   8%   9%      0   0%

     2%       0%      0%     42%  28%     0%    52%    137%( 47%)          0%       41%   0%   104%   8%   7%      0   0%

     3%       0%      0%     47%  32%     0%    52%    140%( 48%)          0%       39%   0%   106%   6%  10%      0   0%

     2%       0%      0%     50%  36%     0%    52%    133%( 46%)          6%       46%   0%   102%   7%   6%      0  18%

     3%       0%      0%     54%  41%     0%    52%    140%( 48%)          1%       38%   0%   107%   7%  15%      0 100%

     2%       0%      0%     51%  36%     0%    48%    149%( 50%)          0%       46%   0%   104%   9%   7%      0  72%

     2%       0%      0%     49%  34%     0%    48%    147%( 52%)          0%       40%   0%   106%   7%   6%      0   0%

     2%       0%      0%     50%  34%     0%    50%    147%( 50%)          0%       39%   0%   102%   5%   7%      0   0%

     2%       0%      0%     46%  33%     0%    45%    142%( 54%)          0%       37%   0%    98%   9%   7%      0   0%

     3%       0%      0%     51%  34%     0%    50%    138%( 47%)          0%       60%   0%    99%   9%   8%      0   0%

     3%       0%      0%     41%  30%     0%    51%    142%( 48%)          0%       41%   0%    92%   6%   9%      0   0%

     2%       0%      0%      5%   5%     0%    63%    111%( 36%)          0%       92%   0%    39%   2%   6%      0   0%

     3%       0%      0%      9%   9%     0%    65%    101%( 31%)          0%       82%   0%    43%   7%   8%      0   0%

     2%       0%      0%     18%  11%     0%    58%    117%( 39%)          0%       86%   0%    63%   6%   7%      0   0%

     2%       0%      0%     49%  34%     0%    46%    149%( 50%)          0%       46%   0%   103%   9%  10%      0   0%

     2%       0%      0%     50%  37%     0%    46%    139%( 51%)          8%       27%   0%   104%   8%   8%      0  29%

     2%       0%      0%     55%  41%     0%    50%    142%( 50%)          0%       29%   0%   113%   5%   6%      0 100%

     2%       0%      0%     47%  32%     0%    51%    134%( 49%)          0%       39%   0%   101%   9%   7%      0  67%

As you can see, Kahu and Kahuna together are running at about 100%. Right now SnapVault is making a backup of 9 user data shares (small files galore) and is currently transferring inodes.
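
As a quick check of that reading, the sketch below adds the Kahuna column to the Kahu figure (the number in parentheses after WAFL_Ex) for the first four samples pasted above. The values are copied from the sysstat output; the variable and function names are only for illustration.

```python
# (Kahuna %, Kahu %) pairs copied from the first four sysstat -M samples above.
# Kahu is the value shown in parentheses in the WAFL_Ex(Kahu) column.
samples = [
    (50, 49),
    (53, 46),
    (52, 47),
    (52, 48),
]


def kahuna_plus_kahu(rows):
    """Return Kahuna + Kahu for each sample."""
    return [kahuna + kahu for kahuna, kahu in rows]


totals = kahuna_plus_kahu(samples)
for (kahuna, kahu), total in zip(samples, totals):
    print(f"Kahuna {kahuna}% + Kahu {kahu}% = {total}%")
```

Each of these four totals comes out at 99% or 100%.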

That's making the system slower. I wouldn't say it is unresponsive, but it is noticeably slower on the console than the other filer.
