Re: CIFS performance problem

khairulanuar · ‎2010-09-06

Hi All,

My CIFS filer is having a performance issue. Daily CPU usage is 80-90%, and when the system takes a snapshot the CPU spikes to 100%.

sysstat

CPU   NFS  CIFS HTTP      Net kB/s     Disk kB/s   Tape kB/s Cache
                         in    out   read  write  read write   age
82%  7127  9003    0   8177  39226  31073   2606     0     0     1
91%  6785  9091    0  12189  37056  34302  12137     0     0     1
84%  9599  9360    0   9488  32152  29497   6466     0     0     1
96%  7827  9637    0  15959  38404  33400  16630     0     0     1
91%  9361 10466    0   6490  34312  27940   3994     0     0     1
90%  8826 10128    0   4958  32662  29347   4403     0     0     1
96%  6339  9484    0  10731  34338  32835   8322     0     0     1
94%  5309  9929    0  13300  35917  35079  15151     0     0     1

cifs stat 10

     GetAttr        Read      Write      Lock     Open/Cl     Direct    Other
113626497782 33474482130 1294930546 767394214 60353397035 9863767165 26313299
       46118       13394        475       301       24960       4007       11
       46332       14480        564       346       24191       3978       10
       42864       11059        402       243       20880       3534       26
       43499       12055        430       281       21502       3585       18
       43934       11888        403       297       21675       3573        9
       45586       13033        373       279       24417       3877       24
       45639       12739        432       276       24351       3889       10
       48137       14694        564       359       26145       4356        6
       50359       15435        541       326       29034       4658       25
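A rough way to read the cifs stat sample above is to turn the per-interval counts into an operation mix. The sketch below is only an illustration: the function name op_mix is hypothetical, the rows are copied from the post (the first row, which appears to be a cumulative total, is left out), and it assumes the columns appear in the order printed above.

```python
# Column names as printed by "cifs stat 10" in the post above.
COLUMNS = ["GetAttr", "Read", "Write", "Lock", "Open/Cl", "Direct", "Other"]

# Interval rows copied from the post (the large first row is skipped).
SAMPLE = """\
46118 13394 475 301 24960 4007 11
46332 14480 564 346 24191 3978 10
42864 11059 402 243 20880 3534 26
"""


def op_mix(text):
    """Return each column's share of all operations, in percent."""
    totals = [0] * len(COLUMNS)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        for i, value in enumerate(fields):
            totals[i] += int(value)
    grand_total = sum(totals)
    return {name: 100.0 * count / grand_total
            for name, count in zip(COLUMNS, totals)}


if __name__ == "__main__":
    for name, share in op_mix(SAMPLE).items():
        print(f"{name:8s} {share:5.1f}%")
```

On these three rows GetAttr comes out at roughly half of all operations, with Open/Cl next, which suggests (but does not prove) a metadata-heavy workload rather than bulk reads and writes.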

I also found that cifs.audit.enable was on. Can this cause the high CPU usage? Here are my cifs.audit options.

options cifs.audit

cifs.audit.account_mgmt_events.enable off
cifs.audit.autosave.file.extension
cifs.audit.autosave.file.limit 0
cifs.audit.autosave.onsize.enable off
cifs.audit.autosave.onsize.threshold
cifs.audit.autosave.ontime.enable off
cifs.audit.autosave.ontime.interval
cifs.audit.enable on
cifs.audit.file_access_events.enable on
cifs.audit.liveview.enable off
cifs.audit.logon_events.enable on
cifs.audit.logsize 524288
cifs.audit.nfs.enable off
cifs.audit.nfs.filter.filename
cifs.audit.saveas /etc/log/adtlog.evt

thanks for the help.

khairulanuar · ‎2010-09-07

Hi Guys,

Really need your assistance to solve this matter.

Please let me know if you need more info.

thanks

sprovince · ‎2010-09-07

I would suggest that you set cifs.audit.enable to off. Is there a reason why it was set to on, as the default setting is off? This option logs all the CIFS access from Windows clients on the controller. If you have a clustered system, check its partner and turn off cifs.audit if it is enabled there.

Also, you should check your /etc/log directory for event logs (this is specified in the option "cifs.audit.saveas /etc/log/adtlog.evt") as you may be filling up the /etc directory and the snapshots for /etc. The File Access and Protocols Management Guide has more information about configuring auditing for both CIFS and NFS.

Start with turning off the auditing to eliminate one potential bottleneck. Hope this helps.

Susan

khairulanuar · ‎2010-09-07

Thanks Susan

-I have turned off the audit option and am monitoring the result now. As for /etc space, it is not full yet, so no issue there.

-Is there any other possible cause of high CPU?

-Is there any way we can trace what is using up the CPU? As far as I can tell, CIFS is causing it. How can I narrow it down?

please help, Thanks

chinchilla · ‎2010-09-08

Is it high all the time, or just periodically?

What other purposes is the system used for?

Please check the syslog. The last time we had a similar issue there was an NVRAM failure; after replacement it was OK.

Another case we had was on a system serving as a destination for SnapVaults and tape backups (it was on DOT 7.2.4 at the time). A reboot helped; it might have been a memory leak, but we never got any confirmation from NetApp.

khairulanuar · ‎2010-09-08

-Yes, it is high all the time: 70-80%, spiking to 100% when snapshots are created and deleted.

-Mainly for CIFS.

-I couldn't find anything strange in the log.

The other option to trace it is to enable cifs.per_client_stats.enable and then use "cifs top".

But I don't dare to do it because of the overhead associated with collecting the per-client stats. This overhead may affect filer performance.

-Is there a way to trace it without affecting current performance? Please help.

Darkstar · ‎2010-09-10

What model of filer is this? Which ONTAP version?

The best way for you would probably be to file a support request with your reseller. We debug performance problems like this quite often and there are so many factors that could be involved.

Some examples:

*extensive CIFS logging/auditing

*volume fill rates >80-85%. check "df -h"

*volume fragmentation. check "reallocate measure /vol/<volname>"

*maybe it's simply too much I/O for your system

*more disks/shelves could also help improve I/O performance

*SMBv2 features that have vastly improved in newer versions of ONTAP

etc. etc. etc.

There's so much to consider which makes it very hard to debug via the community forum

-Michael

thomas_glodde · ‎2010-09-10

Hi there,

Run "sysstat -x 1" and check the "Disk util" column; it shows the utilization of the busiest single disk. If it is 80% or more, your disks are the bottleneck. Besides that, please post the "sysconfig -r" and "aggr status -v" output so we can check whether your aggregate and volume layout is correct.
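The 80% rule of thumb above can be checked mechanically. The following is a minimal sketch, not a NetApp tool: the helper names are hypothetical, the rows are copied from the "sysstat -s -u 1" sample posted later in this thread, and it assumes "Disk util" is the last field of each row, which holds for that sample but not necessarily for "sysstat -x" output, where further columns may follow.

```python
# Rule-of-thumb limit from the post above, in percent.
DISK_UTIL_LIMIT = 80

# Data rows copied from the "sysstat -s -u 1" sample later in this thread.
ROWS = [
    "95% 19929 5066 42300 37052 8 0 0 3 99% 0% - 70%",
    "93% 17804 4127 28514 32748 16799 0 0 3 99% 84% T 71%",
    "94% 16874 4423 32560 27702 2630 0 0 3 99% 16% : 61%",
    "89% 18591 4826 38286 23155 0 0 0 3 99% 0% - 51%",
    "90% 20254 5134 43737 22705 0 0 0 3 99% 0% - 54%",
]


def disk_util(row):
    """Return the trailing "Disk util" percentage of one data row.

    Assumption: "Disk util" is the last whitespace-separated field.
    """
    return int(row.split()[-1].rstrip("%"))


def busiest(rows):
    """Return the highest "Disk util" value seen across all rows."""
    return max(disk_util(row) for row in rows)


if __name__ == "__main__":
    peak = busiest(ROWS)
    if peak >= DISK_UTIL_LIMIT:
        print(f"peak disk util {peak}%: disks look like the bottleneck")
    else:
        print(f"peak disk util {peak}%: below the {DISK_UTIL_LIMIT}% limit")
```

For the rows shown, the peak stays below the limit, which is consistent with the later replies that point away from the disks.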

Kind regards

Thomas

basvanberkel · ‎2010-09-10

Could you please send us the output of options cifs?

Furthermore, cifs stat would be nice.

try:

cifs.smb2.signing.required off

cifs.max_mpx 50 (try increasing this to 126, 253 or 1124)

khairulanuar · ‎2010-09-20

Hi All,

I looked at the I/O and it looks OK. But the cache hit is 99%. Do I need to increase the cache, or what?

BTW, the version

version NetApp Release 7.2.4P7: Fri Apr 11 00:22:07 PDT 2008

Model Name: FAS3020

#sysstat -s -u 1

CPU  Total      Net kB/s     Disk kB/s   Tape kB/s Cache Cache    CP  CP  Disk
     ops/s     in    out   read  write  read write   age   hit  time  ty  util
95%  19929   5066  42300  37052      8     0     0     3   99%    0%   -   70%
93%  17804   4127  28514  32748  16799     0     0     3   99%   84%   T   71%
94%  16874   4423  32560  27702   2630     0     0     3   99%   16%   :   61%
89%  18591   4826  38286  23155      0     0     0     3   99%    0%   -   51%
90%  20254   5134  43737  22705      0     0     0     3   99%    0%   -   54%
93%  18424   5273  52143  24155     32     0     0     3   99%    0%   -   44%
93%  18457   4644  50198  25574      0     0     0     3   99%    0%   -   45%
90%  17536   4776  49377  30262      0     0     0     3   98%    0%   -   57%
84%  20655   5729  46937  15242     24     0     0     3   99%    0%   -   62%
--
Summary Statistics ( 9 samples 1 secs/sample)
CPU  Total      Net kB/s     Disk kB/s   Tape kB/s Cache Cache    CP  CP  Disk
     ops/s     in    out   read  write  read write   age   hit  time  ty  util
Min
84%  16874   4127  28514  15242      0     0     0     3   98%    0%   *   44%
Avg
91%  18724   4888  42672  26510   2165     0     0     3   99%   11%   *   57%
Max
95%  20655   5729  52143  37052  16799     0     0     3   99%   84%   *   71%

thanks for the help

basvanberkel · ‎2010-09-20

No, the cache hit is great. 99% is what you want.

The high CPU usage is a big concern.

Can you reply to my previous post with your output?

Do you have SnapMirror relationships?

timo · ‎2010-09-20

khairulanuar wrote:

Hi All,

I looked at the I/O and it looks OK. But the cache hit is 99%. Do I need to increase the cache, or what?

BTW, the version

thanks for the help

Hey mate

The higher the cache hit the better. It means it is reading from the cache instead of from disk. It is not a "cache is 99% full" indicator.

Remember, when showing us sysstat output, make sure to show us at least 10 seconds. The reason is the "CP" stat column. By default on an idle machine, a "consistency point" (CP) happens every 10 seconds. For NetApp analysts, it is important to see whether such a process completes within 10 seconds, before the next CP happens.

A CP is the moment when data written into the filer's NVRAM gets committed to disk. On some occasions with hammered systems, we can have "back-to-back CPs", meaning one CP isn't done before the next one wants to start.

In the sysstats I've seen here, this is not the case: your machine is working hard, but it's not jammed.

What other tasks are you doing with it? SnapMirrors/backups? This level of CPU usage is above the optimum and it might help to find the reason for it.
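The back-to-back check described above can be sketched against the "CP ty" column. This is only an illustration with hypothetical helper names. It assumes the 13-column layout of the "sysstat -s -u 1" sample in this thread, where "CP ty" is the second-to-last field, and it assumes, following the sysstat man page as I understand it, that a "CP ty" value starting with B or b marks a back-to-back CP.

```python
# Data rows copied from the "sysstat -s -u 1" sample in this thread.
ROWS = [
    "95% 19929 5066 42300 37052 8 0 0 3 99% 0% - 70%",
    "93% 17804 4127 28514 32748 16799 0 0 3 99% 84% T 71%",
    "94% 16874 4423 32560 27702 2630 0 0 3 99% 16% : 61%",
    "89% 18591 4826 38286 23155 0 0 0 3 99% 0% - 51%",
]

# Assumption: B and b are the codes for back-to-back CPs.
BACK_TO_BACK = ("B", "b")


def cp_types(rows):
    """Return the "CP ty" field (second-to-last column) of each row."""
    return [row.split()[-2] for row in rows]


def has_back_to_back(rows):
    """Return True if any interval reports a back-to-back CP."""
    return any(cp.startswith(BACK_TO_BACK) for cp in cp_types(rows))


if __name__ == "__main__":
    print("CP types seen:", " ".join(cp_types(ROWS)))
    print("back-to-back CPs:", "yes" if has_back_to_back(ROWS) else "no")
```

On the posted rows no back-to-back CP is found, which matches the reading given above; a longer capture would be needed to say so with confidence.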

khairulanuar · ‎2010-09-20

Thanks for the explanations. Please refer to the attachment for all the info. I have included the output of the commands below:

#sysstat -x 1

#sysconfig -r

#aggr status -v

#options cifs

#cifs stat 10

The filer only runs NFS and CIFS. SnapMirror is off on the system. Please let me know if you need more information on this.

thanks

nitish · ‎2010-09-21

Is this latency experienced at a particular time?

Do you have FPolicy or Vscan enabled on the system?

They might be causing CIFS latency. /etc/messages will give you an idea of whether FPolicy or Vscan is causing any issues.

Try to tweak your SnapMirror transfers.

Check for any network issues via ifstat (check the speed and current state).

khairulanuar · ‎2010-09-21

It doesn't have latency, as far as I understand. Only high CPU usage.

Yes, FPolicy and Vscan are enabled, but I didn't find anything about Vscan or FPolicy in the log.

And SnapMirror is disabled on the server.

ljason · ‎2010-09-27

It is important to remember that CPU utilization on a NetApp storage controller is not a "first-order" metric for performance analysis. Instead, it is recommended to use throughput (ops or bytes/second) coupled with latency. There are many "internal" operations (such as snapshot creation) that may cause CPU utilization to "spike" while an operation is occurring. That said, Data ONTAP will generally prioritize user work (e.g., your CIFS clients) over that of system work (in this case snapshot work). The end result is that while some variance in throughput and/or latency may be seen on your clients while snapshot operations are being completed, the impact should be minor.

Thanks,

-jbl

vreddypalli · ‎2010-09-28

Hi,

I am also facing almost the same issue.

Can you provide stats show -i cifs and sysstat -m?

khairulanuar · ‎2010-09-30

here is the output..

stats show cifs

cifs:cifs:cifs_ops:11647/s

cifs:cifs:cifs_latency:0.76ms
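One way to combine the two figures above, in the spirit of looking at throughput together with latency, is Little's law (average requests in flight = arrival rate x average time per request). This is a general queueing identity brought in for illustration, not something proposed in the thread, and the function name is hypothetical.

```python
def outstanding_ops(ops_per_second, latency_ms):
    """Little's law: average concurrency = rate x time in system."""
    return ops_per_second * (latency_ms / 1000.0)


if __name__ == "__main__":
    # Figures copied from the "stats show cifs" output above.
    in_flight = outstanding_ops(11647, 0.76)
    print(f"about {in_flight:.2f} CIFS operations in flight on average")
```

With 11647 ops/s at 0.76 ms this works out to roughly nine operations in flight, a modest figure that fits the observation that clients see low latency despite the high CPU reading.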

sysstat -m

ANY  AVG CPU0 CPU1
99%  87%  97%  78%
98%  81%  95%  68%
97%  82%  95%  69%
99%  86%  96%  76%
97%  83%  95%  70%
97%  81%  94%  67%
90%  64%  87%  41%
87%  61%  83%  39%
92%  69%  89%  49%
94%  68%  90%  46%
92%  66%  89%  44%
91%  67%  88%  46%
88%  63%  85%  42%

Did you get the solution for your problem yet?

vreddypalli · ‎2010-10-01

Hi,

I came to know that other activities such as SnapMirror, dedupe, and SQL backup are fired at the same time.

lhoffman1 · ‎2010-09-29

Hi,

Vscan is enabled and you don't see any Vscan entries in your messages log?

Strange...

Anyway, disable Vscan temporarily and watch the results.

And of course don't underestimate the FPolicy process within the CIFS environment, so you might disable FPolicy too for a moment.

let us know your results...

regards

Lutz Hoffmann