ONTAP Discussions
Hi All,
My CIFS setup is having performance issues. Daily CPU usage is 80-90%, and when the system is creating a snapshot the CPU spikes to 100%.
cifs stat 10
GetAttr Read Write Lock Open/Cl Direct Other
113626497782 33474482130 1294930546 767394214 60353397035 9863767165 26313299
46118 13394 475 301 24960 4007 11
46332 14480 564 346 24191 3978 10
42864 11059 402 243 20880 3534 26
43499 12055 430 281 21502 3585 18
43934 11888 403 297 21675 3573 9
45586 13033 373 279 24417 3877 24
45639 12739 432 276 24351 3889 10
48137 14694 564 359 26145 4356 6
50359 15435 541 326 29034 4658 25
Hi Guys,
Really need your assistance to solve this matter.
Please let me know if you need more info.
thanks
I would suggest that you set cifs.audit.enable to off. Is there a reason why it was set to on, as the default setting is off? This option logs all CIFS access from Windows clients on the controller. If you have a clustered system, check its partner and turn off cifs.audit if it is enabled.
Also, you should check your /etc/log directory for event logs (this is specified in the option "cifs.audit.saveas /etc/log/adtlog.evt") as you may be filling up the /etc directory and the snapshots for /etc. The File Access and Protocols Management Guide has more information about configuring auditing for both CIFS and NFS.
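If it helps, the console commands look roughly like this (a sketch for 7-mode; confirm the option names on your Data ONTAP release):
options cifs.audit   (lists all the audit-related options and their current values)
options cifs.audit.enable off   (stops the audit logging)
options cifs.audit.saveas   (shows where the .evt file is written, so you can check how large it has grown)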
Start with turning off the auditing to eliminate one potential bottleneck. Hope this helps.
Susan
Thanks Susan
- I have turned off the audit option and am now monitoring the result. As for /etc space, it hasn't filled up yet, so no issue there.
- Is there any other possible cause of the high CPU?
- Is there any way to trace what is using the CPU? As far as I can tell, CIFS is causing it. How do I narrow it down?
Please help, thanks.
Is it high all the time, or just periodically?
What other purposes is the system used for?
Please check the syslog. The last time we had a similar issue there was an NVRAM failure; after replacement it was OK again.
Another case was a system serving as a destination for SnapVaults and tape backups (DOT 7.2.4 at the time) - a reboot helped. It might have been a memory leak; we never got any confirmation from NetApp.
- Yes, it is high all the time, 70-80%, and it spikes to 100% during snapshot create and delete.
- The system is used mainly for CIFS.
- I couldn't find anything strange in the logs.
The other option to trace it is to enable cifs.per_client_stats.enable and then use "cifs top" to see which clients are generating the load.
But I don't dare to do that because of the overhead associated with collecting the per-client stats.
This overhead may affect filer performance.
- Is there a way to trace it without affecting current performance? Please help.
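For reference, a short sampling run would look something like this (only a sketch; the overhead applies only while the option is on, and cifs top behaviour may vary by release):
options cifs.per_client_stats.enable on
cifs top   (shows the busiest clients; run it a few times over a couple of minutes)
options cifs.per_client_stats.enable off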
What model of filer is this? Which ONTAP version?
The best way for you would probably be to file a support request with your reseller. We debug performance problems like this quite often and there are so many factors that could be involved.
Some examples:
* extensive CIFS logging/auditing
* volume fill rates above 80-85% - check "df -h"
* volume fragmentation - check "reallocate measure /vol/<volname>" (see the sketch after this list)
* maybe it's simply too much I/O for your system
* more disks/shelves could also help improve I/O performance
* SMBv2 features that have vastly improved in newer versions of ONTAP
etc. etc. etc.
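A rough sketch of the fragmentation check (hedged; check the reallocate man page on your release for the exact usage):
reallocate on   (enables the reallocation engine if it isn't already on)
reallocate measure /vol/<volname>   (starts a measurement scan on the volume)
reallocate status -v   (shows the optimization rating once the scan completes; higher numbers mean more fragmentation)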
There's so much to consider, which makes it very hard to debug via the community forum.
-Michael
Hi there,
run a "sysstat -x 1" and check for "disk util", it shows the highest utilization a single disks has. If its 80%+, your disks are the bottlenet. Besides that, please post a "sysconfig -r" and "aggr status -v" output for us to check if your aggr & volume layout is correctly.
Kind regards
Thomas
Could you please send us the output of "options cifs"?
Furthermore, "cifs stat" output would be nice.
try:
cifs.smb2.signing.required off
cifs.max_mpx 50 (try increasing this to 126, 253 or 1124)
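On the console that would look roughly like this (a sketch; the smb2 option only exists on releases that support SMB2):
options cifs.max_mpx   (shows the current value)
options cifs.max_mpx 253   (50, 126, 253 and 1124 are the usual accepted values)
options cifs.smb2.signing.required off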
Hi All,
I looked at the I/O and it looks OK. But the cache hit is 99%. Do I need to increase the cache or something?
BTW, the version:
version NetApp Release 7.2.4P7: Fri Apr 11 00:22:07 PDT 2008
#sysstat -s -u 1
No, the cache hit is great; 99% is what you want.
The high CPU usage is a big concern.
Can you reply to my previous post with your output?
Do you have SnapMirror relationships?
khairulanuar wrote:
Hi All,
I looked at the I/O and it looks OK. But the cache hit is 99%. Do I need to increase the cache or something?
BTW, the version
thanks for the help
Hey mate
The higher the cache hit the better. It means it is reading from the cache instead of from the disk.
It is not a "cache is 99% full" indicator.
Remember, when showing us sysstat outputs, make sure to show us at least 10 seconds.
The reason is the "CP" stat column. By default, on an idle machine a "consistency point" process happens
every 10 seconds.
For NetApp analysts, it is important to see whether such a process completes within 10 seconds,
before the next CP happens.
A CP is the moment when data written into the filer's NVRAM gets committed to disk.
On some occasions with hammered systems, we can have "back-to-back CPs", meaning
one CP isn't done before the next one wants to start.
In the sysstats I've seen here, this is not the case - your machine is working hard, but it's not jammed.
What other tasks are you doing with it? Snapmirrors/backups? This level of CPU usage is above the optimum and
it might help to find the reason for it.
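For reference, this is roughly how to watch for that (a sketch; the CP type codes can vary slightly by release):
sysstat -x 10   (10-second samples, so each line spans about one CP cycle)
In the "CP ty" column, a recurring "B" (back-to-back CP) is the sign of a jammed system; the other codes just indicate what triggered the CP.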
Thanks for the explanations. Please refer to the attachment for all the info. I have included the output of the commands below:
#sysstat -x 1
#sysconfig -r
#aggr status -v
#options cifs
#cifs stat 10
The filer only runs NFS and CIFS. SnapMirror is off on the system. Please let me know if you need more information on this.
Thanks
Is this latency experienced at a particular time?
Do you have FPolicy or Vscan enabled on the system?
They might be causing CIFS latency. /etc/messages will give you an idea if fpolicy or vscan is causing any issues.
Try to tweak your SnapMirror transfers.
Check for any network issues via ifstat (check speed and current state).
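Roughly, the checks would be (a sketch; substitute your own interface names):
fpolicy   (shows whether any file policies are enabled)
vscan   (shows virus-scan status and connected scanner servers)
ifconfig -a   (link state and media speed per interface)
ifstat -a   (per-interface receive/transmit error counters)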
It doesn't have latency, as far as I understand, only high CPU usage.
Yes, fpolicy and vscan are enabled, but I didn't find anything about vscan or fpolicy in the log.
And SnapMirror is disabled on the server.
It is important to remember that CPU utilization on a NetApp storage controller is not a "first-order" metric for performance analysis. Instead, it is recommended to use throughput (ops or bytes/second) coupled with latency. There are many "internal" operations (such as snapshot creation) that may cause CPU utilization to "spike" while the operation is occurring. That said, Data ONTAP will generally prioritize user work (e.g., your CIFS clients) over system work (in this case snapshot work). The end result is that while some variance in throughput and/or latency may be seen on your clients while snapshot operations are completing, the impact should be minor.
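To watch throughput and latency directly, something like this should work (a sketch; counter names may differ on your release):
sysstat -x 1   (CIFS ops/s and network throughput alongside CPU and disk util)
stats show cifs   (includes per-protocol counters such as CIFS ops and latency)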
Thanks,
-jbl
Hi,
I am also facing almost the same issue.
Can you provide "stats show -i cifs" and "sysstat -m" output?
Here is the output:
Did you get the solution for your problem yet?
Hi,
I came to know that other activities are fired at the same time, like SnapMirror, dedupe, and SQL backup.
Hi,
vscan is enabled and you don't see any vscan entries in your message log?
Strange...
Anyway, disable vscan temporarily and watch the results.
And of course don't underestimate the fpolicy process within the CIFS environment,
so you might disable fpolicy too for a moment.
Let us know your results.
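For reference, the commands would be roughly (a sketch; <policyname> is a placeholder for your actual policy name):
vscan off   (temporarily disables virus scanning)
fpolicy disable <policyname>   (repeat for each enabled policy)
Afterwards, "vscan on" and "fpolicy enable <policyname>" turn them back on.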
regards
Lutz Hoffmann