2011-05-06 02:11 AM
I was wondering if anybody has had the problem where the filer goes well over 60% CPU load for a while, then drops to below 10% for a few minutes, and then goes back over 60% again, without any major changes in the devices connected to it. It looks like a scheduled task, but not even support can say whether something is wrong.
We are using around 10 ESX servers with 120 VMs and 2 Brocade FC switches connected to it. The perfstat file just showed NetApp that I/O response times are okay. But why does the load go up and down all the time?
2011-05-06 02:44 AM
I guess you could run a stats or perfstat run to try to pinpoint things a bit closer. I'd check your reallocation schedules one more time. You might also want to run 'lun config_check' to see if you have some passive paths in use (in the case that you have a cluster... again... no details...). The last thing is to run through the list of fixed bugs. I ran 7.3.2P4 on a few boxes for a while. It's not the worst release, but 126.96.36.199 seems to be very good up to this point. You might (will probably) find some bugs on WAFL scanners that are running too often. Then you can schedule an upgrade.
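If you want to see whether the up/down pattern really is as regular as a scheduled task, one rough approach is to log the CPU busy percentage at a fixed interval and collapse the readings into high and low phases. The sketch below is only an illustration: the function name, the once-a-minute sampling, and the sample values are all made up, and the 60% threshold is simply taken from the original post.

```python
from itertools import groupby


def load_phases(samples, interval_s, high=60.0):
    """Collapse CPU samples into (is_high, duration_seconds) runs.

    samples    -- CPU busy percentages, one per sampling interval
    interval_s -- seconds between samples (assumed constant)
    high       -- threshold treated as "high load" (60% as in the post)
    """
    phases = []
    # groupby yields one group per consecutive run of equal keys.
    for is_high, run in groupby(samples, key=lambda pct: pct >= high):
        phases.append((is_high, sum(1 for _ in run) * interval_s))
    return phases


# Hypothetical readings taken once a minute, not real filer data.
samples = [65, 70, 68, 8, 9, 66, 71, 69, 7, 9]
for is_high, duration in load_phases(samples, interval_s=60):
    print("high" if is_high else "low ", duration, "s")
```

If the high and low phases come out with nearly identical lengths every time, that points toward something running on a timer rather than toward client workload.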
2011-05-06 02:50 AM
Could passive path usage produce high CPU load?
Do you think it could be a lack of free space in one of the 2 aggregates?
Where could I check the reallocation schedules?
Thanks for the help.
2011-05-06 03:19 AM
2011-05-06 04:12 AM
If you have a NetApp HA cluster, then incorrectly configured FC paths will cause you extra CPU load, yes. This is essentially what happens when the hosts try to access LUNs that are on the NetApp "partner" controller. Again, run 'lun config_check -v' on the CLI.
I don't know how much free space you have, so it is hard to comment on whether or not that is a problem.
To check your reallocation schedules, just run 'reallocate status -v'. I've managed to fat-finger the interval-based schedules so that they run minutes apart instead of days apart a few times before.
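As a quick sanity check against that kind of typo, you could copy the intervals by hand into a small script and flag anything shorter than a day. This is just a sketch: it assumes the intervals are written as a number followed by m, h, or d (for example 7d), it does not parse any command output, and the paths and values in `jobs` are hypothetical.

```python
import re

# Seconds per unit for the assumed m/h/d interval notation.
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def interval_seconds(text):
    """Convert an interval such as '30m', '12h' or '7d' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([mhd])\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognised interval: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def suspicious(intervals, minimum="1d"):
    """Return the entries whose interval is shorter than `minimum`."""
    floor = interval_seconds(minimum)
    return {
        path: value
        for path, value in intervals.items()
        if interval_seconds(value) < floor
    }


# Hypothetical schedule entries typed in by hand for illustration.
jobs = {"/vol/vol1": "7d", "/vol/vol2": "7m"}
print(suspicious(jobs))
```

Here the second entry would be flagged, since 7m was most likely meant to be 7d.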
It seems you really need to familiarize yourself with the NOW website as well. All of the documentation is there, along with a ton of knowledge and help. I think what you are seeing is a buggy WAFL scanner. Even though I'm not a fan of pushing upgrades for everything, I was glad to get away from 7.2.x and over to 188.8.131.52.
2011-05-06 04:34 AM
You can drive yourself crazy chasing CPU reasons in Data ONTAP. In 7.x it's a proprietary (made by NetApp) kernel that reprioritizes CPU usage based on actual workload. I've personally seen a system pegged at 100% CPU doing internal work drop to low numbers as soon as new client workload (NAS/SAN traffic) came in. The thinking behind the design is to use the maximum CPU for existing tasks to get them done quicker, while making sure that this never impacts the client protocols.
NetApp encourages you to investigate the cause of any increased or unacceptable latency. If it's a runaway process consuming CPU, then we can help you find out whether that's tied to a workload request, undersized hardware, or some type of bug in the software.