ONTAP Discussions

CIFS performance problem

khairulanuar

Hi All,

My CIFS service is having a performance issue. The daily CPU usage is 80-90%, and when the system takes a snapshot the CPU spikes to 100%.

sysstat
CPU    NFS   CIFS   HTTP      Net kB/s     Disk kB/s      Tape kB/s    Cache
                               in   out     read  write    read write     age
82%   7127   9003      0    8177 39226    31073   2606       0     0       1
91%   6785   9091      0   12189 37056    34302  12137       0     0       1
84%   9599   9360      0    9488 32152    29497   6466       0     0       1
96%   7827   9637      0   15959 38404    33400  16630       0     0       1
91%   9361  10466      0    6490 34312    27940   3994       0     0       1
90%   8826  10128      0    4958 32662    29347   4403       0     0       1
96%   6339   9484      0   10731 34338    32835   8322       0     0       1
94%   5309   9929      0   13300 35917    35079  15151       0     0       1

cifs stat 10
     GetAttr         Read       Write       Lock      Open/Cl      Direct     Other
113626497782  33474482130  1294930546  767394214  60353397035  9863767165  26313299
       46118        13394         475        301        24960        4007        11
       46332        14480         564        346        24191        3978        10
       42864        11059         402        243        20880        3534        26
       43499        12055         430        281        21502        3585        18
       43934        11888         403        297        21675        3573         9
       45586        13033         373        279        24417        3877        24
       45639        12739         432        276        24351        3889        10
       48137        14694         564        359        26145        4356         6
       50359        15435         541        326        29034        4658        25
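
To see which operation classes dominate the cifs stat samples above, the per-interval rows can be summed and turned into shares. This is only a rough sketch: the rows are copied from the output above (the first row of cumulative totals is left out), and the names COLUMNS, SAMPLES and op_mix are made up for illustration, not part of any NetApp tool.

```python
# Per-interval rows copied from the "cifs stat 10" output above.
# The first row (cumulative totals since boot) is deliberately excluded.
COLUMNS = ["GetAttr", "Read", "Write", "Lock", "Open/Cl", "Direct", "Other"]
SAMPLES = """
46118 13394 475 301 24960 4007 11
46332 14480 564 346 24191 3978 10
42864 11059 402 243 20880 3534 26
43499 12055 430 281 21502 3585 18
43934 11888 403 297 21675 3573 9
45586 13033 373 279 24417 3877 24
45639 12739 432 276 24351 3889 10
48137 14694 564 359 26145 4356 6
50359 15435 541 326 29034 4658 25
"""


def op_mix(text):
    """Return each column's share of all operations in the samples."""
    totals = [0] * len(COLUMNS)
    for line in text.strip().splitlines():
        for i, value in enumerate(line.split()):
            totals[i] += int(value)
    grand_total = sum(totals)
    return {name: total / grand_total for name, total in zip(COLUMNS, totals)}


mix = op_mix(SAMPLES)
# Print the classes from largest to smallest share.
for name, share in sorted(mix.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {share:6.1%}")
```

On these samples GetAttr and Open/Cl together account for most of the operations, while Write is well under 1%, so the load looks metadata-heavy rather than write-heavy.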

I also found that cifs.audit.enable was on. Can this cause the high CPU usage? Here are my cifs.audit options:
options cifs.audit
cifs.audit.account_mgmt_events.enable off
cifs.audit.autosave.file.extension
cifs.audit.autosave.file.limit 0
cifs.audit.autosave.onsize.enable off
cifs.audit.autosave.onsize.threshold
cifs.audit.autosave.ontime.enable off
cifs.audit.autosave.ontime.interval
cifs.audit.enable            on
cifs.audit.file_access_events.enable on
cifs.audit.liveview.enable   off
cifs.audit.logon_events.enable on
cifs.audit.logsize           524288
cifs.audit.nfs.enable        off
cifs.audit.nfs.filter.filename
cifs.audit.saveas            /etc/log/adtlog.evt
thanks for the help.

khairulanuar

Hi Guys,

Really need your assistance to solve this matter.

Please let me know if you need more info.

thanks

sprovince

I would suggest that you set cifs.audit.enable to off. Is there a reason why it was set to on, as the default setting is off? This option logs all the CIFS access from Windows clients on the controller. If you have a clustered system, check its partner and turn off cifs.audit if it is enabled.

Also, you should check your /etc/log directory for event logs (this is specified in the option "cifs.audit.saveas  /etc/log/adtlog.evt") as you may be filling up the /etc directory and the snapshots for /etc. The File Access and Protocols Management Guide has more information about configuring auditing for both CIFS and NFS.

Start with turning off the auditing to eliminate one potential bottleneck. Hope this helps.

Susan

khairulanuar

Thanks Susan

- I have turned off the audit option and am now monitoring the result. As for /etc space, it is not full yet, so no issue there.

- Is there any other possible cause of the high CPU?

- Is there any way to trace what is using up the CPU? As far as I can tell CIFS is causing it. How can I narrow it down?

please help, Thanks

chinchilla

Is it high all the time, or just periodically?

What other purposes is the system used for?

Please check the syslog. The last time we had a similar issue there was an NVRAM failure; after the replacement it was OK.

Another case we had was on a system serving as a destination for SnapVaults and tape backups (it was DOT 7.2.4 at that time). A reboot helped. It might have been a memory leak, but we never got any confirmation from NetApp.

khairulanuar

- Yes, it is high all the time: 70-80%, spiking to 100% when snapshots are created and deleted.

- Mainly for CIFS.

- I couldn't find anything strange in the log.

The other option is to enable cifs.per_client_stats.enable and then use "cifs top" to trace it.

But I don't dare do that, because collecting the per-client stats adds overhead.

This overhead may affect filer performance.

- Is there a way to trace it without affecting current performance? Please help.

Darkstar

What model of filer is this? Which ONTAP version?

The best way for you would probably be to file a support request with your reseller. We debug performance problems like this quite often and there are so many factors that could be involved.

Some examples:

*extensive CIFS logging/auditing

*volume fill rates >80-85%. check "df -h"

*volume fragmentation. check "reallocate measure /vol/<volname>"

*maybe it's simply too much I/O for your system

*more disks/shelves could also help improve I/O performance

*SMBv2 features that have vastly improved in newer versions of ONTAP

etc. etc. etc.

There's so much to consider, which makes it very hard to debug via the community forum.

-Michael

thomas_glodde

Hi there,

Run a "sysstat -x 1" and check "disk util"; it shows the highest utilization of any single disk. If it's 80%+, your disks are the bottleneck. Besides that, please post the "sysconfig -r" and "aggr status -v" output so we can check whether your aggr and volume layout is correct.

Kind regards

Thomas

basvanberkel

Could you please send us the output of "options cifs"?

Furthermore, the "cifs stat" output would be nice.

try:

cifs.smb2.signing.required   off

cifs.max_mpx 50 (try increasing this to 126, 253 or 1124)

khairulanuar

Hi All,

I looked at the I/O and it looks OK. But the cache hit is 99%. Do I need to increase the cache or what?

BTW, the version

version                NetApp Release 7.2.4P7: Fri Apr 11 00:22:07 PDT 2008

Model Name:         FAS3020

#sysstat -s -u 1

CPU   Total    Net kB/s    Disk kB/s    Tape kB/s Cache Cache  CP  CP Disk
       ops/s    in   out   read  write  read write   age   hit time ty util
95%   19929  5066 42300  37052      8     0     0     3   99%   0%  -  70%
93%   17804  4127 28514  32748  16799     0     0     3   99%  84%  T  71%
94%   16874  4423 32560  27702   2630     0     0     3   99%  16%  :  61%
89%   18591  4826 38286  23155      0     0     0     3   99%   0%  -  51%
90%   20254  5134 43737  22705      0     0     0     3   99%   0%  -  54%
93%   18424  5273 52143  24155     32     0     0     3   99%   0%  -  44%
93%   18457  4644 50198  25574      0     0     0     3   99%   0%  -  45%
90%   17536  4776 49377  30262      0     0     0     3   98%   0%  -  57%
84%   20655  5729 46937  15242     24     0     0     3   99%   0%  -  62%
--
Summary Statistics (    9 samples  1 secs/sample)
CPU   Total    Net kB/s    Disk kB/s    Tape kB/s Cache Cache  CP  CP Disk
       ops/s    in   out   read  write  read write   age   hit time ty util
Min
84%   16874  4127 28514  15242      0     0     0     3   98%   0%  *  44%
Avg
91%   18724  4888 42672  26510   2165     0     0     3   99%  11%  *  57%
Max
95%   20655  5729 52143  37052  16799     0     0     3   99%  84%  *  71%
thanks for the help
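
As a rough check of the sysstat rows above against the "80%+ disk util" rule of thumb mentioned earlier in the thread, the rows can be parsed and summarized. This is a minimal sketch: the rows are copied verbatim from the output above, the field positions are assumed from the column header shown there, and the names ROWS, DISK_UTIL_LIMIT and parse_rows are illustrative only.

```python
# Rows copied verbatim from the "sysstat -s -u 1" output above.
ROWS = """
95%   19929  5066 42300  37052      8     0     0     3   99%   0%  -  70%
93%   17804  4127 28514  32748  16799     0     0     3   99%  84%  T  71%
94%   16874  4423 32560  27702   2630     0     0     3   99%  16%  :  61%
89%   18591  4826 38286  23155      0     0     0     3   99%   0%  -  51%
90%   20254  5134 43737  22705      0     0     0     3   99%   0%  -  54%
93%   18424  5273 52143  24155     32     0     0     3   99%   0%  -  44%
93%   18457  4644 50198  25574      0     0     0     3   99%   0%  -  45%
90%   17536  4776 49377  30262      0     0     0     3   98%   0%  -  57%
84%   20655  5729 46937  15242     24     0     0     3   99%   0%  -  62%
"""

# Rule-of-thumb threshold (percent) quoted earlier in this thread.
DISK_UTIL_LIMIT = 80


def parse_rows(text):
    """Extract CPU, total ops/s and disk util from each whitespace-separated row.

    Assumes 13 fields per row, in the order of the header shown above:
    CPU is field 0, Total ops/s is field 1, Disk util is field 12.
    """
    samples = []
    for line in text.strip().splitlines():
        fields = line.split()
        samples.append({
            "cpu": int(fields[0].rstrip("%")),
            "ops": int(fields[1]),
            "disk_util": int(fields[12].rstrip("%")),
        })
    return samples


samples = parse_rows(ROWS)
peak_disk_util = max(s["disk_util"] for s in samples)
mean_cpu = sum(s["cpu"] for s in samples) / len(samples)
disk_bound = peak_disk_util >= DISK_UTIL_LIMIT

print(f"mean CPU {mean_cpu:.0f}%, peak disk util {peak_disk_util}%, "
      f"disk-bound by the 80% rule: {disk_bound}")
```

On these nine samples the peak disk utilization stays below the 80% mark while the CPU averages around 91%, which matches the summary lines printed by sysstat itself.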

basvanberkel

No, the cache hit is great. 99% is what you want.

The high CPU usage is a big concern.

Can you reply to my previous post with your output?

Do you have SnapMirror relationships?

timo

khairulanuar wrote:

Hi All,

I looked at the I/O and it looks OK. But the cache hit is 99%. Do I need to increase the cache or what?

BTW, the version

thanks for the help

Hey mate

The higher the cache hit the better. It means the filer is reading from the cache instead of from disk. It is not a "cache is 99% full" indicator.

Remember, when showing us sysstat outputs, make sure to show us at least 10 seconds. The reason is the "CP" stat column. By default on an idle machine, a "consistency point" process would happen every 10 seconds. For NetApp analysts, it is important to see whether such a process completes within 10 seconds, before the next CP happens.

A CP is the moment when stuff written into the filer's NVRAM gets committed to disk. On some occasions with hammered systems, we can have "back-to-back CPs", meaning one CP isn't done before the next one wants to start. In the sysstats I've seen here, this is not the case: your machine is working hard, but it's not jammed.

What other tasks are you doing with it? SnapMirrors/backups? This level of CPU usage is above the optimum and it might help to find the reason for it.
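
The back-to-back check described above can be sketched by tallying the "CP ty" column. The values below are the CP ty entries from the "sysstat -s -u 1" rows posted earlier in the thread. The letter codes are taken from memory of the sysstat man page ("B" and "b" for back-to-back CPs, "T" for a timer CP, ":" for a CP continuing from the previous interval, "-" for no CP), so please treat them as an assumption and check the man page for your release. CP_TYPES, BACK_TO_BACK and summarize_cp are illustrative names.

```python
from collections import Counter

# "CP ty" column from the nine sysstat rows posted earlier in this thread.
CP_TYPES = ["-", "T", ":", "-", "-", "-", "-", "-", "-"]

# Codes assumed (from the sysstat man page) to mark back-to-back CPs.
BACK_TO_BACK = {"B", "b"}


def summarize_cp(cp_types):
    """Count each CP type and how many samples show back-to-back CPs."""
    counts = Counter(cp_types)
    back_to_back = sum(counts[code] for code in BACK_TO_BACK)
    return counts, back_to_back


counts, back_to_back = summarize_cp(CP_TYPES)
print("CP types seen:", dict(counts))
print("back-to-back CP samples:", back_to_back)
```

With only nine one-second samples this covers less than one 10-second timer interval, which is why a longer capture gives a more reliable picture.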

khairulanuar

Thanks for the explanations. Please refer to the attachment for all the info. I have included the output of the commands below:

#sysstat -x 1

#sysconfig -r

#aggr status -v

#options cifs

#cifs stat 10

The filer only runs NFS and CIFS. SnapMirror is off on the system. Please let me know if you need more information on this.

thanks

nitish

Is this latency experienced at a particular time?

Do you have FPolicy or vscan enabled on the system?

They might be causing CIFS latency. /etc/messages will give you an idea of whether FPolicy or vscan is causing any issues.

Try to tweak your SnapMirror transfers.

Check for any network issues via ifstat (check the speed and current state).

khairulanuar

It doesn't have latency, as far as I understand. Only high CPU usage.

Yes, FPolicy and vscan are enabled, but I didn't find anything about vscan or FPolicy in the log.

And SnapMirror is disabled on the server.

ljason

It is important to remember that CPU utilization on a NetApp storage controller is not a "first-order" metric for performance analysis. Instead, it is recommended to use throughput (ops or bytes/second) coupled with latency. There are many "internal" operations (such as snapshot creation) that may cause CPU utilization to "spike" while an operation is occurring. That said, Data ONTAP will generally prioritize user work (e.g., your CIFS clients) over system work (in this case snapshot work). The end result is that while some variance in throughput and/or latency may be seen on your clients while snapshot operations are being completed, the impact should be minor.

Thanks,

-jbl

vreddypalli

Hi,

I am also facing almost the same issue.

Can you provide stats show -i cifs and sysstat -m?

khairulanuar

here is the output..

stats show cifs
cifs:cifs:cifs_ops:11647/s
cifs:cifs:cifs_latency:0.76ms
sysstat -m
ANY  AVG  CPU0 CPU1
99%  87%   97%  78%
98%  81%   95%  68%
97%  82%   95%  69%
99%  86%   96%  76%
97%  83%   95%  70%
97%  81%   94%  67%
90%  64%   87%  41%
87%  61%   83%  39%
92%  69%   89%  49%
94%  68%   90%  46%
92%  66%   89%  44%
91%  67%   88%  46%
88%  63%   85%  42%
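
One way to read the sysstat -m output above is to compare the per-CPU averages. The sketch below only averages the CPU0 and CPU1 columns copied from that output; it does not say anything about why one CPU is busier than the other. ROWS and per_cpu_means are illustrative names.

```python
# CPU0 and CPU1 columns copied from the "sysstat -m" output above (percent).
ROWS = [
    (97, 78), (95, 68), (95, 69), (96, 76), (95, 70), (94, 67), (87, 41),
    (83, 39), (89, 49), (90, 46), (89, 44), (88, 46), (85, 42),
]


def per_cpu_means(rows):
    """Return the mean utilization of each CPU over all samples."""
    n = len(rows)
    return {
        "CPU0": sum(cpu0 for cpu0, _ in rows) / n,
        "CPU1": sum(cpu1 for _, cpu1 in rows) / n,
    }


means = per_cpu_means(ROWS)
gap = means["CPU0"] - means["CPU1"]
print(f"CPU0 {means['CPU0']:.1f}%  CPU1 {means['CPU1']:.1f}%  gap {gap:.1f} points")
```

In these samples CPU0 is consistently the busier of the two, so the single "CPU" figure in plain sysstat hides an uneven split between the processors.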

Did you get the solution for your problem yet?

vreddypalli

Hi,

I found out that other activities are fired at the same time, such as SnapMirror, dedupe, and SQL backup.

lhoffman1

Hi,

vscan is enabled and you don't see any vscan entries in your messages log?

Strange.

Anyway, disable vscan temporarily and watch the results.

And of course don't underestimate the FPolicy process within the CIFS environment,

so you might disable FPolicy too for a moment.

let us know your results...

regards

Lutz Hoffmann
