Re: Filer high CPU cause vcs service group goes down - NetApp Community


ONTAP Discussions

7,636 Views

Hi All,

In my company, we are using a VCS cluster running Oracle, and the storage is on a NetApp FAS3020 filer running version 7.2.4P7.

On one occasion, some service groups on VCS went down, and at the same time the filer had high CPU usage.

I didn't capture the CPU usage at that time, but this is what I got afterwards:

ANY1+ ANY2+  AVG CPU0 CPU1 Network Storage Raid Target Kahuna Cifs Exempt Intr Host Ops/s   CP
  87%   39%  64%  85%  47%     37%      9%  17%     0%    50%   6%     5%   4%   0% 11757 100%
  74%   21%  48%  70%  26%     27%      8%  15%     0%    33%   4%     4%   4%   0%  7841 100%
  82%   23%  53%  76%  30%     30%      9%  12%     0%    46%   4%     2%   4%   0%  8752  43%
  83%   22%  53%  79%  28%     35%      7%   7%     0%    49%   4%     0%   4%   0% 10205   0%
  85%   18%  52%  80%  25%     34%      6%   6%     0%    50%   4%     0%   4%   0% 11074   0%
  79%   17%  49%  76%  22%     35%      6%   6%     0%    42%   3%     0%   4%   0% 10815   0%
  83%   20%  53%  80%  24%     39%      8%   8%     0%    42%   4%     0%   4%   0% 11123   0%
  64%   10%  38%  62%  15%     26%      6%   6%     0%    30%   3%     0%   4%   0%  7414   0%
  50%    5%  28%  48%   9%     20%      4%   4%     0%    21%   3%     0%   4%   0%  8062   0%
  76%   19%  49%  70%  26%     31%      6%   6%     0%    46%   4%     0%   4%   0%  8673   0%
  96%   52%  74%  82%  67%     45%      7%   7%     0%    79%   8%     0%   4%   0% 10407   0%

--

Summary Statistics ( 11 samples 1.0 secs/sample)

ANY1+ ANY2+ AVG CPU0 CPU1 Network Storage Raid Target Kahuna Cifs Exempt Intr Host Ops/s CP

Min 28%  6357  880 0  5024 27894 23300     0 0 0 3
Avg 51%  8480 1167 0 10135 40876 30658  3335 0 0 3
Max 80% 10209 1780 0 13475 59917 39357 16955 0 0 3

What puzzles me is that there is no clear error in /etc/messages on the filer indicating a problem with the filer, only snapshot deletions.
The filer sometimes has CPU spikes; since they caused no issues, we just ignored them. Please help to investigate this.

9 REPLIES 9


Does anyone have any ideas? The VCS log shows nothing abnormal. My question is: what can cause the high CPU?

Some possibilities:

1. Overload on NFS. How can I check and confirm whether there was an overload at that time?


It doesn't look that loaded down to me - just the normal noises in here.

I'd leave the options open - it may not have been the filer that caused the crash.

Performance is an end to end thing with complex environs in this day and age.

How about some 'sysstat -x 1' output?

I hope this response has been helpful to you.

At your service,

Eugene E. Kashpureff
ekashp@kashpureff.org
Fastlane NetApp Instructor and Independent Consultant
http://www.fastlaneus.com/ http://www.linkedin.com/in/eugenekashpureff

(P.S. I appreciate points for helpful or correct answers.)


Currently I am receiving these messages on the console.

======================

Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP
Object action: admin.monitor.getvar
Object action: admin.util.ONTAP

No changes have been made, and the filer is slow right now. Can anyone tell me what this is? I tried to Google it but found nothing useful.


This is the output of the "ps" command, sorted by CPU.

Process statistics over 232077.452 seconds...

 ID State Domain %CPU StackUsed %StackUsed Name
  2 RR    i       60%       368         8% idle_thread1
601 RR    k       41%      9640        29% wafl_hipri
  1 RR    i       23%       368         8% idle_thread0
251 RR    n       13%      3552        10% nfsadmin
184 BR    r       12%      4608        28% raidio_thread
 76 RR    n        7%      3768        11% 10/100/1000-VI/e0a
 72 RR    n        6%      3544        10% 10/100/1000-V/e1a
 63 BR    s        5%      2672        32% ispfc_main
 73 BR    n        5%      3316        10% 10/100/1000-V/e1b

ANY1+ ANY2+  AVG CPU0 CPU1 Network Storage Raid Target Kahuna Cifs Exempt Intr Host Ops/s   CP
 100%   96%  98% 100%  98%     52%     18%  32%     0%    74%   9%     8%   3%   0% 13375 100%
  99%   70%  85%  93%  76%     50%     16%  17%     0%    73%   9%     0%   4%   0% 11684  18%
  99%   66%  83%  94%  68%     55%     12%  12%     0%    73%   9%     0%   4%   0% 12354   0%
 100%   86%  93%  98%  91%     60%     15%  15%     0%    84%   9%     0%   3%   0% 12815   0%
 100%   82%  91%  96%  85%     58%     15%  15%     0%    81%   8%     0%   3%   0% 11404   0%
 100%   92%  96%  83%  77%     56%     15%  21%     0%    87%   6%     4%   3%   0% 10451  27%
 100%  100% 100% 100% 100%     45%     15%  36%     0%    85%   4%    11%   3%   0% 10235 100%
 100%  100% 100% 100% 100%     50%     16%  31%     0%    87%   5%     9%   3%   0% 11924 100%
 100%   98%  99% 100% 100%     49%     17%  34%     0%    78%   7%    10%   3%   0% 10813 100%
 100%   94%  97% 100%  97%     53%     18%  29%     0%    77%   8%     7%   3%   0% 12485 100%
 100%   85%  92%  97%  88%     58%     15%  15%     0%    85%   9%     0%   3%   0% 14064  15%
 100%   81%  91%  95%  87%     54%     14%  14%     0%    86%   9%     0%   4%   0% 12504   0%
 100%   74%  87%  97%  78%     54%     14%  14%     0%    77%  10%     0%   4%   0% 13205   0%
  96%   44%  70%  93%  48%     48%     14%  14%     0%    53%   7%     0%   5%   0% 14745   0%
  98%   60%  79%  95%  64%     56%     16%  16%     0%    60%   6%     0%   4%   0% 12839   0%
  98%   57%  78%  95%  61%     56%     15%  15%     0%    58%   7%     0%   5%   0% 13254   0%
 100%   83%  92%  88%  93%     36%     15%  34%     0%    82%   4%    10%   3%   0%  7039  95%

--

Summary Statistics ( 36 samples 1.0 secs/sample)

Min 70%  6076  787 0  5666 45209 28548     0 0 0 30s
Avg 92% 10636 1702 0 11872 64211 52321 12303 0 0 34s
Max 99% 15020 2476 0 18905 81916 72986 41458 0 0 51s
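
For anyone who wants to summarise samples like the ones above offline, here is a minimal Python sketch. It assumes the 16-column layout pasted in this thread, with column names copied from that header line. The function names `parse_rows` and `summarize` are made up for illustration, and treating the Network through Host columns as the per-domain figures is an assumption. It averages each column and counts which of those columns is highest in each sample.

```python
from collections import Counter

# Column names copied from the header line pasted in this thread.
COLUMNS = ("ANY1+ ANY2+ AVG CPU0 CPU1 Network Storage Raid Target "
           "Kahuna Cifs Exempt Intr Host Ops/s CP").split()
# Assumption: Network through Host are the per-domain CPU figures.
DOMAINS = COLUMNS[5:14]


def parse_rows(text):
    """Turn pasted sample rows into dicts of integers, skipping other lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # blank lines, separators, summary rows
        try:
            values = [int(field.rstrip("%")) for field in fields]
        except ValueError:
            continue  # the header line itself
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def summarize(rows):
    """Average every column and count which domain is highest per sample."""
    if not rows:
        raise ValueError("no sample rows found")
    averages = {c: sum(r[c] for r in rows) / len(rows) for c in COLUMNS}
    busiest = Counter(max(DOMAINS, key=lambda d: r[d]) for r in rows)
    return averages, busiest


# Three rows copied from the capture above, plus its header line.
SAMPLE = """\
ANY1+ ANY2+ AVG CPU0 CPU1 Network Storage Raid Target Kahuna Cifs Exempt Intr Host Ops/s CP
100% 96% 98% 100% 98% 52% 18% 32% 0% 74% 9% 8% 3% 0% 13375 100%
99% 70% 85% 93% 76% 50% 16% 17% 0% 73% 9% 0% 4% 0% 11684 18%
96% 44% 70% 93% 48% 48% 14% 14% 0% 53% 7% 0% 5% 0% 14745 0%
"""

rows = parse_rows(SAMPLE)
averages, busiest = summarize(rows)
print(f"{len(rows)} samples, average Kahuna {averages['Kahuna']:.1f}%")
print("highest domain per sample:", busiest.most_common())
```

Pasting the full capture into `SAMPLE` works the same way; lines that do not have 16 numeric fields, such as the Min/Avg/Max summary rows, are skipped.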


Next time it happens, drop into priv set advanced and run statit -b, wait a little bit (30 seconds or so), then run statit -e and post the results.
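
The begin/wait/end sequence described here could be scripted roughly as below. This is only a sketch: `capture_statit` and the `run` callable are hypothetical, and `run` is assumed to send one command line to a single, persistent filer console session and return its output (the privilege level would not carry over between separate sessions). Switching back with `priv set admin` at the end is an addition not mentioned in the post.

```python
import time


def capture_statit(run, wait_seconds=30, sleep=time.sleep):
    """Collect one statit sample and return the text printed by `statit -e`.

    `run` is a hypothetical callable: it sends one command line to an
    already-open filer console session and returns that command's output.
    """
    run("priv set advanced")   # statit is run from advanced privilege
    run("statit -b")           # begin collecting statistics
    sleep(wait_seconds)        # let the sample window elapse
    report = run("statit -e")  # end collection and fetch the report
    run("priv set admin")      # return to normal privilege (not in the post)
    return report


if __name__ == "__main__":
    # Dry run with a stand-in for the console so the sketch executes anywhere.
    sent = []

    def fake_run(command):
        sent.append(command)
        return "(statit report text)" if command == "statit -e" else ""

    print(capture_statit(fake_run, wait_seconds=0))
    print(sent)
```

Passing `run` and `sleep` in as parameters keeps the sequence itself separate from however the console is actually reached, which is site-specific.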

Is this NFS connected? If so, an nfsstat would help.

On a side note, 10k ops/s would have to be getting close to the max a FAS3020 could do? (Don't know, haven't really worked with them much.)


Shane,

Right now the CPU is running normally. In the statit output, which particular items should I focus on?

You mentioned that 10k ops/s would be getting close to the max for a FAS3020. How can I calculate the maximum it can handle?

Sorry for the newbie question.

Thanks


Hey

Never apologise for being a newbie; everyone was a newbie once. This is IT: face it, everyone becomes a newbie all over again every 3-5 years.

statit gathers a pile of useful stats for troubleshooting performance issues. It's not quite as hardcore as a perfstat, but it has a lot of good info for working out where a performance bottleneck may be.

Run it when you hit the next high CPU event.

There are a few questions to ask too: how often do they happen? Do they happen regularly? The last performance issue I saw was on a FAS2040; it turned out to be due to a DBA running a database dump into the snapinfo LUN at a certain time every day.

The ops/s comment was more of a guess based on what I've seen FAS3050s and 6080s do. How many ops a box can do is dependent on a lot of things, the least of which is what size.


Thanks for the advice.

According to DFM, the CPU spiked a few times last month and this month, but not at the same time of day.

Yes, the filer hosts Oracle running on VCS; I might need to check on Oracle after this.

Just another question: is it possible that a volume that is too large could cause slowness and latency?

It's just a wild guess, because I found one fairly large volume there.

# df -g
/vol/WDB/            1170GB  1084GB  85GB  93%  /vol/TIPITWDB/
/vol/WDB/.snapshot    130GB    49GB  80GB  38%  /vol/TIPITWDB/.snapshot

thanks for the feedback.


You should really post "sysstat -x 1" output taken during the peaks. Besides that, ONTAP 7.3.x achieves better multitasking performance.
