ONTAP Discussions

Filer high CPU causes VCS service group to go down

khairulanuar

Hi All,

In my company, we are using a VCS cluster running Oracle, and the storage is on a NetApp FAS3020 filer running Data ONTAP 7.2.4P7.

At one point, some service groups on VCS went down, and at the same time the filer was showing high CPU usage.

I didn't capture the CPU usage at that time, but this is what I got afterwards:

ANY1+ ANY2+  AVG CPU0 CPU1 Network Storage Raid Target Kahuna Cifs Exempt Intr Host Ops/s   CP
  87%   39%  64%  85%  47%     37%      9%  17%     0%    50%   6%     5%   4%   0% 11757 100%
  74%   21%  48%  70%  26%     27%      8%  15%     0%    33%   4%     4%   4%   0%  7841 100%
  82%   23%  53%  76%  30%     30%      9%  12%     0%    46%   4%     2%   4%   0%  8752  43%
  83%   22%  53%  79%  28%     35%      7%   7%     0%    49%   4%     0%   4%   0% 10205   0%
  85%   18%  52%  80%  25%     34%      6%   6%     0%    50%   4%     0%   4%   0% 11074   0%
  79%   17%  49%  76%  22%     35%      6%   6%     0%    42%   3%     0%   4%   0% 10815   0%
  83%   20%  53%  80%  24%     39%      8%   8%     0%    42%   4%     0%   4%   0% 11123   0%
  64%   10%  38%  62%  15%     26%      6%   6%     0%    30%   3%     0%   4%   0%  7414   0%
  50%    5%  28%  48%   9%     20%      4%   4%     0%    21%   3%     0%   4%   0%  8062   0%
  76%   19%  49%  70%  26%     31%      6%   6%     0%    46%   4%     0%   4%   0%  8673   0%
  96%   52%  74%  82%  67%     45%      7%   7%     0%    79%   8%     0%   4%   0% 10407   0%
--
Summary Statistics (   11 samples  1.0 secs/sample)
 CPU    NFS   CIFS   HTTP   Net kB/s    Disk kB/s     Tape kB/s  Cache
                             in    out   read  write   read write   age
Min
28%   6357    880      0    5024 27894    23300      0       0     0       3
Avg
51%   8480   1167      0   10135 40876    30658   3335       0     0       3
Max
80%  10209   1780      0   13475 59917    39357  16955       0     0       3

What puzzles me is that there is no clear error in /etc/messages on the filer indicating a problem with it, only the snapshot deletes.
The filer sometimes has CPU spikes, but since there was no apparent issue we just ignored them. Please help me investigate this.


khairulanuar

Does anyone have any ideas? The VCS log shows nothing abnormal. My question, then, is: what can cause high CPU?

Some possibilities:

1. Overload on NFS --> how can I check and confirm there was an overload at that time?
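
For example, would something like this rough console sketch be the right way to check? (The 'filer>' prompt is only a placeholder, and I am not sure whether the per-client stats option applies on our release.)

filer> sysstat -x 1                              (watch the NFS ops, Net kB/s and CPU columns; Ctrl-C to stop)
filer> nfsstat -z                                (zero the NFS counters)
   ... wait a minute or two during the busy period ...
filer> nfsstat                                   (call counts accumulated since the reset)
filer> options nfs.per_client_stats.enable on
filer> nfsstat -h                                (per-client breakdown, to see which host is driving the load)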

ekashpureff

It doesn't look that loaded down to me - just the normal noises in here.

I'd leave the options open - it may not have been the filer that caused the crash.

Performance is an end to end thing with complex environs in this day and age.

How about some 'sysstat -x 1' output?
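
Something like this, run from an admin host so the samples land in a file (assuming rsh access to the filer is set up; the hostnames and the filename are just placeholders):

admin_host$ rsh filer sysstat -c 300 -x 1 > sysstat_peak.txt     (300 one-second samples, then it exits)

Grab it while the slowness is actually happening, otherwise it won't tell us much.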


I hope this response has been helpful to you.

At your service,


Eugene E. Kashpureff
ekashp@kashpureff.org
Fastlane NetApp Instructor and Independent Consultant
http://www.fastlaneus.com/ http://www.linkedin.com/in/eugenekashpureff

(P.S. I appreciate points for helpful or correct answers.)

khairulanuar

Currently I am receiving this error on the console.

======================

Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP

No changes have been made, and the filer is experiencing slowness right now. Can anyone tell me what this is? I tried to Google it but found nothing useful.

khairulanuar

This is the output of the "ps" command, sorted by CPU.

Process statistics over 232077.452 seconds...
   ID State Domain %CPU StackUsed %StackUsed Name

    2 RR    i       60%       368         8% idle_thread1

  601 RR    k       41%      9640        29% wafl_hipri

    1 RR    i       23%       368         8% idle_thread0

  251 RR    n       13%      3552        10% nfsadmin

  184 BR    r       12%      4608        28% raidio_thread

   76 RR    n        7%      3768        11% 10/100/1000-VI/e0a

   72 RR    n        6%      3544        10% 10/100/1000-V/e1a

   63 BR    s        5%      2672        32% ispfc_main

   73 BR    n        5%      3316        10% 10/100/1000-V/e1b

ANY1+ ANY2+  AVG CPU0 CPU1 Network Storage Raid Target Kahuna Cifs Exempt Intr Host Ops/s   CP
100%   96%  98% 100%  98%     52%     18%  32%     0%    74%   9%     8%   3%   0% 13375 100%
  99%   70%  85%  93%  76%     50%     16%  17%     0%    73%   9%     0%   4%   0% 11684  18%
  99%   66%  83%  94%  68%     55%     12%  12%     0%    73%   9%     0%   4%   0% 12354   0%
100%   86%  93%  98%  91%     60%     15%  15%     0%    84%   9%     0%   3%   0% 12815   0%
100%   82%  91%  96%  85%     58%     15%  15%     0%    81%   8%     0%   3%   0% 11404   0%
100%   92%  96%  83%  77%     56%     15%  21%     0%    87%   6%     4%   3%   0% 10451  27%
100%  100% 100% 100% 100%     45%     15%  36%     0%    85%   4%    11%   3%   0% 10235 100%
100%  100% 100% 100% 100%     50%     16%  31%     0%    87%   5%     9%   3%   0% 11924 100%
100%   98%  99% 100% 100%     49%     17%  34%     0%    78%   7%    10%   3%   0% 10813 100%
100%   94%  97% 100%  97%     53%     18%  29%     0%    77%   8%     7%   3%   0% 12485 100%
100%   85%  92%  97%  88%     58%     15%  15%     0%    85%   9%     0%   3%   0% 14064  15%
100%   81%  91%  95%  87%     54%     14%  14%     0%    86%   9%     0%   4%   0% 12504   0%
100%   74%  87%  97%  78%     54%     14%  14%     0%    77%  10%     0%   4%   0% 13205   0%
  96%   44%  70%  93%  48%     48%     14%  14%     0%    53%   7%     0%   5%   0% 14745   0%
  98%   60%  79%  95%  64%     56%     16%  16%     0%    60%   6%     0%   4%   0% 12839   0%
  98%   57%  78%  95%  61%     56%     15%  15%     0%    58%   7%     0%   5%   0% 13254   0%
100%   83%  92%  88%  93%     36%     15%  34%     0%    82%   4%    10%   3%   0%  7039  95%
--
Summary Statistics (   36 samples  1.0 secs/sample)
 CPU    NFS   CIFS   HTTP   Net kB/s    Disk kB/s     Tape kB/s  Cache
                             in    out   read  write   read write   age
Min
70%   6076    787      0    5666 45209    28548      0       0     0      30s
Avg
92%  10636   1702      0   11872 64211    52321  12303       0     0      34s
Max
99%  15020   2476      0   18905 81916    72986  41458       0     0      51s

shane_bradley

Next time it happens, drop into priv set advanced and run a statit -b, wait a little bit (30 seconds or so), then run a statit -e and post the results.

Is this NFS-connected? If so, an nfsstat would help as well.
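
Roughly this on the console (the prompts are placeholders; the statit report is long, so log your terminal session or run it via rsh from an admin host):

filer> priv set advanced
filer*> statit -b                 (begin collecting)
   ... wait 30 seconds or so while the CPU is high ...
filer*> statit -e                 (end and print the report: CPU, per-disk and per-interface stats)
filer*> nfsstat                   (NFS call mix, if the load is NFS)
filer*> priv set admin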

On a side note, 10k op/s would have to be getting close to the max a FAS3020 could do? (I don't know, I haven't really worked with them much.)

khairulanuar

shane

Right now the CPU is running normally. On the statit command, which particular area should I focus on?

You mentioned that 10k op/s would be getting close to the max for a FAS3020. How can I work out the maximum it can handle?

Sorry for the newbie question.

Thanks

shane_bradley

Hey

Never apologise for being a newbie; everyone was a newbie once. This is IT, face it, everyone becomes a newbie all over again every 3-5 years.

The statit gathers a pile of useful stats for troubleshooting performance issues. It's not quite as hardcore as a perfstat, but it has a lot of good info for working out where a performance bottleneck may be.

Run it when you hit the next high CPU event.

There are a few questions to ask too: how often do the spikes happen? Do they happen regularly? The last performance issue I saw was on a FAS2040; it turned out to be a DBA running a database dump into the snapinfo LUN at a certain time every day.

The op/s comment was more a guess based on what I've seen FAS3050s and 6080s do. How many "ops" a box can do depends on a lot of things, not the least of which is the size of those ops.

khairulanuar

Thanks for the advice.

From DFM, the CPU spiked a few times last month and this month, but not at the same times.

Yes, the filer has Oracle (running on VCS) on it; I might need to check on Oracle after this.

Just another question: is there any possibility that a volume that is too big may cause the slowness and latency?

Just throwing out a wild guess, because I found one rather large volume there.

#df -g

/vol/WDB/          1170GB     1084GB       85GB      93%  /vol/TIPITWDB/
/vol/WDB/.snapshot      130GB       49GB       80GB      38%  /vol/TIPITWDB/.snapshot

Thanks for the feedback.

thomas_glodde

You should really post a "sysstat -x 1" output during the peaks. Besides that, better multitasking performance is achieved with ONTAP 7.3.x.
