Hi All,
My CIFS service is having a performance issue. Daily CPU usage is 80-90%, and when the system takes a snapshot,
the CPU spikes to 100%.
sysstat
CPU   NFS   CIFS  HTTP      Net kB/s     Disk kB/s   Tape kB/s Cache
                           in    out   read  write  read write   age
82%  7127   9003     0   8177  39226  31073   2606     0     0     1
91%  6785   9091     0  12189  37056  34302  12137     0     0     1
84%  9599   9360     0   9488  32152  29497   6466     0     0     1
96%  7827   9637     0  15959  38404  33400  16630     0     0     1
91%  9361  10466     0   6490  34312  27940   3994     0     0     1
90%  8826  10128     0   4958  32662  29347   4403     0     0     1
96%  6339   9484     0  10731  34338  32835   8322     0     0     1
94%  5309   9929     0  13300  35917  35079  15151     0     0     1
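As a rough cross-check of the sysstat samples above, the sketch below compares how many operations come from NFS and how many from CIFS. The numbers are copied from the output above; the helper name `protocol_share` is made up for illustration and is not part of any NetApp tool.

```python
# Samples copied from the sysstat output above: (CPU %, NFS ops, CIFS ops).
SAMPLES = [
    (82, 7127, 9003),
    (91, 6785, 9091),
    (84, 9599, 9360),
    (96, 7827, 9637),
    (91, 9361, 10466),
    (90, 8826, 10128),
    (96, 6339, 9484),
    (94, 5309, 9929),
]


def protocol_share(samples):
    """Return (average CPU %, NFS share of ops, CIFS share of ops)."""
    avg_cpu = sum(cpu for cpu, _, _ in samples) / len(samples)
    nfs = sum(n for _, n, _ in samples)
    cifs = sum(c for _, _, c in samples)
    total = nfs + cifs
    return avg_cpu, nfs / total, cifs / total


if __name__ == "__main__":
    avg_cpu, nfs_share, cifs_share = protocol_share(SAMPLES)
    print(f"average CPU: {avg_cpu:.1f}%")
    print(f"NFS share of ops: {nfs_share:.1%}, CIFS share: {cifs_share:.1%}")
```

In these eight samples NFS accounts for a bit under half of the NFS plus CIFS operations, so CIFS may not be the only contributor to the load.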
cifs stat 10
     GetAttr         Read        Write         Lock      Open/Cl       Direct        Other
113626497782  33474482130   1294930546    767394214  60353397035   9863767165     26313299
       46118        13394          475          301        24960         4007           11
       46332        14480          564          346        24191         3978           10
       42864        11059          402          243        20880         3534           26
       43499        12055          430          281        21502         3585           18
       43934        11888          403          297        21675         3573            9
       45586        13033          373          279        24417         3877           24
       45639        12739          432          276        24351         3889           10
       48137        14694          564          359        26145         4356            6
       50359        15435          541          326        29034         4658           25
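To see which kind of CIFS operation dominates, the interval rows of the cifs stat output above can be turned into percentages. This is only a sketch: the rows are copied from the output above (the first, cumulative row is left out), and the function name `op_mix` is hypothetical.

```python
COLUMNS = ["GetAttr", "Read", "Write", "Lock", "Open/Cl", "Direct", "Other"]

# Interval rows copied from the cifs stat output above (cumulative row omitted).
ROWS = [
    [46118, 13394, 475, 301, 24960, 4007, 11],
    [46332, 14480, 564, 346, 24191, 3978, 10],
    [42864, 11059, 402, 243, 20880, 3534, 26],
    [43499, 12055, 430, 281, 21502, 3585, 18],
    [43934, 11888, 403, 297, 21675, 3573, 9],
    [45586, 13033, 373, 279, 24417, 3877, 24],
    [45639, 12739, 432, 276, 24351, 3889, 10],
    [48137, 14694, 564, 359, 26145, 4356, 6],
    [50359, 15435, 541, 326, 29034, 4658, 25],
]


def op_mix(rows):
    """Return each operation type's share of all operations, in percent."""
    totals = [sum(column) for column in zip(*rows)]
    grand_total = sum(totals)
    return {name: 100.0 * t / grand_total for name, t in zip(COLUMNS, totals)}


if __name__ == "__main__":
    for name, share in sorted(op_mix(ROWS).items(), key=lambda kv: -kv[1]):
        print(f"{name:<8} {share:5.1f}%")
```

GetAttr and Open/Close together make up the large majority of operations in these samples, which suggests a metadata-heavy workload and not bulk reads and writes.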
I also found that cifs.audit.enable was on. Can this cause the high CPU usage? Here are my cifs.audit options.
options cifs.audit
cifs.audit.account_mgmt_events.enable off
cifs.audit.autosave.file.extension
cifs.audit.autosave.file.limit 0
cifs.audit.autosave.onsize.enable off
cifs.audit.autosave.onsize.threshold
cifs.audit.autosave.ontime.enable off
cifs.audit.autosave.ontime.interval
cifs.audit.enable on
cifs.audit.file_access_events.enable on
cifs.audit.liveview.enable off
cifs.audit.logon_events.enable on
cifs.audit.logsize 524288
cifs.audit.nfs.enable off
cifs.audit.nfs.filter.filename
cifs.audit.saveas /etc/log/adtlog.evt
thanks for the help.
19 REPLIES
Hi Guys,
Really need your assistance to solve this matter.
Please let me know if you need more info.
thanks
I would suggest that you set cifs.audit.enable to off. Is there a reason why it was set to on, given that the default setting is off? This option logs all CIFS access from Windows clients on the controller. If you have a clustered system, check its partner and turn off cifs.audit there too if it is enabled.
Also, you should check your /etc/log directory for event logs (this is specified in the option "cifs.audit.saveas /etc/log/adtlog.evt") as you may be filling up the /etc directory and the snapshots for /etc. The File Access and Protocols Management Guide has more information about configuring auditing for both CIFS and NFS.
Start with turning off the auditing to eliminate one potential bottleneck. Hope this helps.
Susan
Thanks Susan
- I have turned off the audit option and am monitoring the result now. As for the /etc space, it is not full yet, so there is no issue there.
- Is there any other possible cause of the high CPU usage?
- Is there any way to trace what is using up the CPU? As far as I can tell, CIFS is causing it. How can I narrow it down?
Please help. Thanks.
Is it high all the time, or just periodically?
What other purposes is the system used for?
Please check the syslog. The last time we had a similar issue, it turned out to be an NVRAM failure; after the replacement the system was OK.
In another case, on a system serving as a destination for SnapVaults and tape backups (it was DOT 7.2.4 at that time), a reboot helped. It might have been a memory leak, but we never got any confirmation from NetApp.
- Yes, it is high all the time: 70-80%, spiking to 100% when snapshots are created and deleted.
- The system is mainly used for CIFS.
- I couldn't find anything strange in the log.
The other option for tracing it is to enable cifs.per_client_stats.enable and then use "cifs top".
But I don't dare to do that, because collecting the per-client stats adds overhead,
and this overhead may affect filer performance.
- Is there a way to trace it without affecting current performance? Please help.
What model of filer is this? Which ONTAP version?
The best way for you would probably be to file a support request with your reseller. We debug performance problems like this quite often and there are so many factors that could be involved.
Some examples:
*extensive CIFS logging/auditing
*volume fill rates >80-85%. check "df -h"
*volume fragmentation. check "reallocate measure /vol/<volname>"
*maybe it's simply too much I/O for your system
*more disks/shelves could also help improve I/O performance
*SMBv2 features that have vastly improved in newer versions of ONTAP
etc. etc. etc.
There's so much to consider, which makes it very hard to debug via the community forum.
-Michael
Hi there,
Run "sysstat -x 1" and check the "disk util" column; it shows the utilization of the busiest single disk. If it is 80% or more, your disks are the bottleneck. Besides that, please post the "sysconfig -r" and "aggr status -v" output so we can check whether your aggregate and volume layout is correct.
Kind regards
Thomas
Could you please send us the output of "options cifs"?
Furthermore, the "cifs stat" output would be nice.
Try:
cifs.smb2.signing.required off
cifs.max_mpx 50 (try increasing this to 126, 253 or 1124)
Hi All,
I looked at the I/O and it looks OK. But the cache hit is 99%. Do I need to increase the cache, or what?
BTW, here is the version:
version NetApp Release 7.2.4P7: Fri Apr 11 00:22:07 PDT 2008
Model Name: FAS3020
#sysstat -s -u 1
CPU  Total      Net kB/s     Disk kB/s   Tape kB/s Cache Cache    CP  CP  Disk
     ops/s     in    out   read  write  read write   age   hit  time  ty  util
95%  19929   5066  42300  37052      8     0     0     3   99%    0%   -   70%
93%  17804   4127  28514  32748  16799     0     0     3   99%   84%   T   71%
94%  16874   4423  32560  27702   2630     0     0     3   99%   16%   :   61%
89%  18591   4826  38286  23155      0     0     0     3   99%    0%   -   51%
90%  20254   5134  43737  22705      0     0     0     3   99%    0%   -   54%
93%  18424   5273  52143  24155     32     0     0     3   99%    0%   -   44%
93%  18457   4644  50198  25574      0     0     0     3   99%    0%   -   45%
90%  17536   4776  49377  30262      0     0     0     3   98%    0%   -   57%
84%  20655   5729  46937  15242     24     0     0     3   99%    0%   -   62%
--
Summary Statistics ( 9 samples 1 secs/sample)
CPU  Total      Net kB/s     Disk kB/s   Tape kB/s Cache Cache    CP  CP  Disk
     ops/s     in    out   read  write  read write   age   hit  time  ty  util
Min
84%  16874   4127  28514  15242      0     0     0     3   98%    0%   *   44%
Avg
91%  18724   4888  42672  26510   2165     0     0     3   99%   11%   *   57%
Max
95%  20655   5729  52143  37052  16799     0     0     3   99%   84%   *   71%
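The summary rows can be reproduced from the nine samples, and the disk utilization can be compared against the "80%+" rule of thumb mentioned in an earlier reply. The sketch below does both; the values are copied from the output above, while `summarize` and `DISK_UTIL_THRESHOLD` are illustrative names and not part of sysstat.

```python
# (CPU %, disk util %) per one-second sample, copied from the output above.
SAMPLES = [
    (95, 70), (93, 71), (94, 61), (89, 51), (90, 54),
    (93, 44), (93, 45), (90, 57), (84, 62),
]

# Rule of thumb from an earlier reply: a disk at 80% or more is a bottleneck.
DISK_UTIL_THRESHOLD = 80


def summarize(values):
    """Return (min, rounded average, max) of a list of percentages."""
    return min(values), round(sum(values) / len(values)), max(values)


if __name__ == "__main__":
    cpu = [c for c, _ in SAMPLES]
    disk = [d for _, d in SAMPLES]
    print("CPU  min/avg/max:", summarize(cpu))
    print("Disk min/avg/max:", summarize(disk))
    print("disk-bound by the 80% rule:", max(disk) >= DISK_UTIL_THRESHOLD)
```

By that rule of thumb the disks in these samples stay below the threshold while the CPU averages above 90%, which fits the description of a CPU problem more than a disk problem.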
thanks for the help
No, the cache hit rate is great. 99% is what you want.
The high CPU usage is the big concern.
Can you reply to my previous post with your output?
Do you have SnapMirror relationships?
khairulanuar wrote:
Hi All,
i look at the I/O, looks OK. But, the cache hit is 99%. Do need to increase cache or what?
BTW, the version
thanks for the help
Hey mate,
The higher the cache hit rate, the better. It means data is being read from the cache instead of from disk.
It is not a "cache is 99% full" indicator.
Remember, when showing us sysstat output, make sure to show at least 10 seconds.
The reason is the "CP" columns. By default, on an idle machine, a "consistency point" (CP) happens
every 10 seconds.
For NetApp analysts, it is important to see whether each CP completes within 10 seconds,
before the next CP happens.
A CP is the moment when data written into the filer's NVRAM gets committed to disk.
On heavily loaded systems we can sometimes see "back-to-back CPs", meaning
one CP isn't done before the next one wants to start.
In the sysstat output I've seen here, this is not the case: your machine is working hard, but it's not jammed.
What other tasks are you running on it? SnapMirrors or backups? This level of CPU usage is above the optimum, and
it might help to find the reason for it.
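The back-to-back check described above can be sketched in a few lines. The (CP time, CP ty) pairs are copied from the "sysstat -s -u 1" samples posted earlier in the thread. The meaning of the codes ("T" for a timer CP, "B" and "b" for back-to-back CPs) is an assumption based on the sysstat man page, so verify it against the documentation for your release. The function names are hypothetical.

```python
# (CP time, CP ty) pairs copied from the sysstat -s -u 1 samples in this thread.
SAMPLES = [
    ("0%", "-"), ("84%", "T"), ("16%", ":"), ("0%", "-"), ("0%", "-"),
    ("0%", "-"), ("0%", "-"), ("0%", "-"), ("0%", "-"),
]

# Assumption: code meanings as recalled from the sysstat man page.
BACK_TO_BACK_CODES = {"B", "b"}  # back-to-back / deferred back-to-back CP
TIMER_CODE = "T"                 # CP started by the regular timer


def has_back_to_back_cp(samples):
    """Return True if any sample's CP type starts with a back-to-back code."""
    return any(cp_type[:1] in BACK_TO_BACK_CODES for _, cp_type in samples)


def count_timer_cps(samples):
    """Count the samples in which a timer-triggered CP started."""
    return sum(1 for _, cp_type in samples if cp_type[:1] == TIMER_CODE)


if __name__ == "__main__":
    print("back-to-back CPs seen:", has_back_to_back_cp(SAMPLES))
    print("timer CPs started:", count_timer_cps(SAMPLES))
```

Only the first character of the CP type is checked, on the assumption that some releases append a phase character after the type code.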
Thanks for the explanations. Please refer to the attachment for all the info. It contains the output of the commands below:
#sysstat -x 1
#sysconfig -r
#aggr status -v
#options cifs
#cifs stat 10
The filer only runs NFS and CIFS. SnapMirror is off on the system. Please let me know if you need more information on this.
thanks
Is this latency experienced at a particular time?
Do you have FPolicy or Vscan enabled on the system?
They might be causing CIFS latency. /etc/messages will give you an idea of whether FPolicy or Vscan is causing any issues.
Try to tweak your SnapMirror transfers.
Check for any network issues via ifstat (check the speed and current state).
It doesn't have latency, as far as I understand. Only high CPU usage.
Yes, FPolicy and Vscan are enabled, but I didn't find anything about Vscan or FPolicy in the log.
And SnapMirror is disabled on the server.
It is important to remember that CPU utilization on a NetApp storage controller is not a "first-order" metric for performance analysis. It is recommended to use throughput (ops or bytes/second) coupled with latency instead. There are many "internal" operations (such as snapshot creation) that may cause CPU utilization to "spike" while an operation is occurring. That said, Data ONTAP will generally prioritize user work (e.g., your CIFS clients) over system work (in this case, snapshot work). The end result is that while some variance in throughput and/or latency may be seen on your clients while snapshot operations are being completed, the impact should be minor.
Thanks,
-jbl
Hi,
I am also facing almost the same issue.
Can you provide stats show -i cifs and sysstat -m?
here is the output..
stats show cifs
cifs:cifs:cifs_ops:11647/s
cifs:cifs:cifs_latency:0.76ms
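Throughput and latency can be combined into one number with Little's law (average requests in flight = arrival rate x time in system). The sketch below parses the two counters shown above and applies that formula. It assumes the latency counter is an average per operation, and the function names are made up for illustration.

```python
import re

# The two counter lines copied from the stats show output above.
STATS_OUTPUT = """\
cifs:cifs:cifs_ops:11647/s
cifs:cifs:cifs_latency:0.76ms
"""


def parse_stats(text):
    """Extract (ops per second, latency in ms) from the counter lines."""
    ops = latency_ms = None
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"cifs:cifs:cifs_ops:([\d.]+)/s$", line)
        if match:
            ops = float(match.group(1))
        match = re.match(r"cifs:cifs:cifs_latency:([\d.]+)ms$", line)
        if match:
            latency_ms = float(match.group(1))
    return ops, latency_ms


def outstanding_ops(ops_per_sec, latency_ms):
    """Little's law: average in-flight operations = rate x time in system."""
    return ops_per_sec * latency_ms / 1000.0


if __name__ == "__main__":
    ops, latency_ms = parse_stats(STATS_OUTPUT)
    print(f"{ops:.0f} ops/s at {latency_ms} ms")
    print(f"average CIFS ops in flight: {outstanding_ops(ops, latency_ms):.2f}")
```

With sub-millisecond latency at this rate, only a handful of CIFS operations are in flight on average, which is consistent with the observation that clients see high CPU but no latency problem.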
sysstat -m
ANY AVG CPU0 CPU1
99% 87% 97% 78%
98% 81% 95% 68%
97% 82% 95% 69%
99% 86% 96% 76%
97% 83% 95% 70%
97% 81% 94% 67%
90% 64% 87% 41%
87% 61% 83% 39%
92% 69% 89% 49%
94% 68% 90% 46%
92% 66% 89% 44%
91% 67% 88% 46%
88% 63% 85% 42%
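The sysstat -m samples above can also be averaged per processor to see how evenly the load is spread. This is a sketch only: the values are copied from the output above and `per_cpu_average` is a hypothetical helper.

```python
# (ANY, AVG, CPU0, CPU1) percentages copied from the sysstat -m output above.
SAMPLES = [
    (99, 87, 97, 78), (98, 81, 95, 68), (97, 82, 95, 69), (99, 86, 96, 76),
    (97, 83, 95, 70), (97, 81, 94, 67), (90, 64, 87, 41), (87, 61, 83, 39),
    (92, 69, 89, 49), (94, 68, 90, 46), (92, 66, 89, 44), (91, 67, 88, 46),
    (88, 63, 85, 42),
]


def per_cpu_average(samples):
    """Return the average utilization of CPU0 and CPU1 across all samples."""
    count = len(samples)
    cpu0 = sum(sample[2] for sample in samples) / count
    cpu1 = sum(sample[3] for sample in samples) / count
    return cpu0, cpu1


if __name__ == "__main__":
    cpu0, cpu1 = per_cpu_average(SAMPLES)
    print(f"CPU0 average: {cpu0:.1f}%")
    print(f"CPU1 average: {cpu1:.1f}%")
    print(f"gap: {cpu0 - cpu1:.1f} points")
```

CPU0 is the busier processor in every sample here. That might mean some of the work is not spread evenly across both CPUs, but this is an interpretation of the numbers and not something the thread confirms.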
Did you get the solution for your problem yet?
Hi,
I found out that other activities are started at the same time, such as SnapMirror, dedupe, and SQL backup.
Hi,
Vscan is enabled and you don't see any vscan entries in your messages log?
Strange.
Anyway, disable vscan temporarily and watch the results.
And of course, don't underestimate the FPolicy process within the CIFS environment,
so you might disable FPolicy too for a moment.
Let us know your results.
regards
Lutz Hoffmann
