Solved: DFM 3.7 CPU Utilization Very High

swhitehead · ‎2010-03-15

Howdy-

We're using DFM and Protection Manager to manage about 20 Filers with Snapmirror replication and OSSV. Most of the time the processor is more or less pinned so it's difficult to manage. DFM database backups fail as a matter of course. Reporting is very slow.

The box is installed on a Windows 2k3 ESX VM with 4 vProcs and 4 GB of RAM. The management database is about 1.4 GB, which seems larger than the typical sizes referenced at the NOW site.

1) Anyone think 1.4 GB is large for the DFM monitor db?

2) Anyone have similar experience with performance?

Thanks,
Scott

adaikkap · ‎2010-03-15

Hi Scott,

I think you are hitting the following bug.

http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=301280

Kindly upgrade to 3.7D4 or later. My suggestion would be to upgrade to 3.8.1, which is the current GA release.

Regards

adai

swhitehead · ‎2010-03-15

Thanks for the post. That's great information but unfortunately I double checked my version and it's 3.7.1...

3.7.1.6014 (3.7.1)

Scott

adaikkap · ‎2010-03-16

Since the post said 3.7, that was my first take. Since you are on 3.7.1, it's not the product bug.

It looks like your VM is not able to handle the load. As evilensky suggested, can you check your ESX host and the performance of this VM?

Regards

adai

Message was edited by: Adaikkappan Arumugam (changed the title to reflect the correct version, i.e. 3.7.1)

evilensky · ‎2010-03-15

What does esxtop say about CPU scheduling efficiency? For poorly threaded workloads, adding vCPUs can actually increase physical CPU contention and degrade a virtual machine's performance:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1005362

Additional instrumentation from outside the VM would probably help. Might be chasing rabbits, but multiple vCPUs always raise eyebrows based on past experience.

http://communities.vmware.com/docs/DOC-5240

swhitehead · ‎2010-03-16

Thanks for your thoughts, folks. In troubleshooting this we did try dropping the CPUs one by one to see what would happen.

Basically, it got slower.

When I run esxtop I don't see a %CSTP counter, but I do see that it's the busiest guest on the ESX host, which is a pretty good trick considering the other guests. None of the other boxes are complaining or slow, either.

Is 1.4 GB large for a DFM db? I'm looking for tuning options to see how much of this can be spooled to RAM or if there are other steps that can be taken.

I wonder if there aren't DFM tasks that can be disabled, de-prioritized, or scheduled.

adaikkap · ‎2010-03-16

Have you changed any default monitoring interval options?

Can you paste a copy of the dfm diag output, especially the object counts and monitoring interval parts?

Regards

adai

swhitehead · ‎2010-03-16

We added a disk and CPU counter and lowered some of the retention schedules. In truth, the performance issue predates the new counters but I'm open. If we have to get rid of them then so be it.

Details attached - excerpts below.

Thanks again. I really appreciate you (both) digging into this with me.

Scott

Management Station
Version                    3.7.1.6014 (3.7.1)
                           15.5 GB free (51.7%)
Licensed Features          Operations Manager: installed
                           Protection Manager: installed

Object Counts
Object Type                    Count
Administrator                  6
Aggregate                      28
Configuration                  1
Data Set                       28
Directory                      89
Disk                           565
DP Policy                      38
DP Schedule                    55
DP Throttle                    4
Host                           72
Initiator Group                58
Interface                      118
Lun Path                       325
Mgmt Station                   1
Mirror                         92
Network                        15
OSSV Directory                 944
OSSV Hosts                     27
Primary Storage Systems        3
Qtree                          154
report schedule                1
Resource Group                 39
Resource Pool                  7
Role                           27
schedule                       2
Secondary Storage Systems      18
SnapMirror Rels                204
SnapVault Rels                 89
Storage Set                    66
UserQuota                      0
vFilers                        0
Volume                         490
Zapi Hosts                     44

Monitoring Timestamps
Timestamp Name       Interval    Default     Last Updated  Status  Error if older than ...
cacheTimestamp       5 minutes   5 minutes                         16 Mar 14:05
ccTimestamp          2 hours     4 hours                           16 Mar 12:10
cfTimestamp          2 minutes   5 minutes   16 Mar 14:10  Normal  16 Mar 14:08
cpuTimestamp         5 minutes   5 minutes   16 Mar 14:10  Normal  16 Mar 14:05
dfTimestamp          15 minutes  30 minutes  16 Mar 14:09  Normal  16 Mar 13:55
diskTimestamp        2 hours     4 hours     16 Mar 14:04  Normal  16 Mar 12:10
envTimestamp         5 minutes   5 minutes   16 Mar 14:10  Normal  16 Mar 14:05
fcTimestamp          5 minutes   5 minutes   16 Mar 14:10  Normal  16 Mar 14:05
fsTimestamp          15 minutes  15 minutes  16 Mar 14:10  Normal  16 Mar 13:55
hostPingTimestamp    1 minute    1 minute    16 Mar 14:10  Normal  16 Mar 14:09
ifTimestamp          5 minutes   15 minutes  16 Mar 14:10  Normal  16 Mar 14:05
licenseTimestamp     4 hours     4 hours     16 Mar 13:41  Normal  16 Mar 10:10
lunTimestamp         30 minutes  30 minutes  16 Mar 14:10  Normal  16 Mar 13:40
opsTimestamp         10 minutes  10 minutes  16 Mar 14:10  Normal  16 Mar 14:00
qtreeTimestamp       8 hours     8 hours                           16 Mar 06:10
rbacTimestamp        1 day       1 day       16 Mar 12:18  Normal  15 Mar 14:10
userQuotaTimestamp   1 day       1 day       16 Mar 14:07  Normal  15 Mar 14:10
sanhostTimestamp     5 minutes   5 minutes   16 Mar 14:10  Normal  16 Mar 14:05
snapmirrorTimestamp  10 minutes  30 minutes  16 Mar 14:10  Normal  16 Mar 14:00
snapshotTimestamp    30 minutes  30 minutes  16 Mar 13:59  Normal  16 Mar 13:40
statusTimestamp      5 minutes   10 minutes  16 Mar 14:10  Normal  16 Mar 14:05
sysInfoTimestamp     15 minutes  1 hour      16 Mar 14:10  Normal  16 Mar 13:55
svTimestamp          30 minutes  30 minutes  16 Mar 14:10  Normal  16 Mar 13:40
svMonTimestamp       8 hours     8 hours     16 Mar 07:05  Normal  16 Mar 06:10
xmlQtreeTimestamp    8 hours     8 hours     16 Mar 14:09  Normal  16 Mar 06:10
vFilerTimestamp      1 hour      1 hour                            16 Mar 13:10

Database
monitordb.db 1.75 GB
dbFileVersion 9

ConnCount                  33 connections
MaxCacheSize               392184 KBytes
CurrentCacheSize           350280 KBytes
PeakCacheSize              392184 KBytes
PageSize                   8192 Bytes

Logs
discovery      247 KB 16 Mar 14:04
DFMMonitor     2.34 MB 16 Mar 14:00
DFMEvent       1.07 MB 16 Mar 14:03
DFMServer      1.84 MB 16 Mar 13:56
DFMScheduler   401 KB 16 Mar 09:00
DFMWatchDog    300 KB 16 Mar 14:10
dfm            587 KB 16 Mar 14:10
sybase         9.23 MB 16 Mar 14:10
pingmon        264 KB 15 Mar 16:34
audit          613 KB 16 Mar 14:10

Services
sql        Normal  Started
http       Normal  Started
eventd     Normal  Started
monitor    Normal  Started
scheduler  Normal  Started
server     Normal  Started
watchdog   Normal  Started

Time Since Confirmed Alive
Eventd     6 seconds
Monitor    3 seconds
Scheduler  15 seconds
Server     14 seconds
Watchdog   3 seconds
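
On the question of how much of the database could be held in RAM, a rough calculation from the Database figures above may help. This is only a sketch: the variable and function names are made up for illustration, and it assumes the "1.75 GB" and "KBytes" figures use binary units (1 GB = 1024 * 1024 KB), which the diag output does not state.

```python
# Values copied from the Database section of the diag excerpt.
DB_SIZE_GB = 1.75          # monitordb.db
MAX_CACHE_KB = 392184      # MaxCacheSize
CURRENT_CACHE_KB = 350280  # CurrentCacheSize


def cache_fraction(cache_kb: float, db_gb: float) -> float:
    """Return the share of the database file that fits in the cache.

    Assumption: binary units, i.e. 1 GB = 1024 * 1024 KB.
    """
    db_kb = db_gb * 1024 * 1024
    return cache_kb / db_kb


if __name__ == "__main__":
    print(f"max cache:     {cache_fraction(MAX_CACHE_KB, DB_SIZE_GB):.1%} of the db file")
    print(f"current cache: {cache_fraction(CURRENT_CACHE_KB, DB_SIZE_GB):.1%} of the db file")
```

Under those assumptions the maximum cache covers roughly a fifth of the database file, so most of the database cannot be cached at the current setting.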

adaikkap · ‎2010-03-16

You are running the following monitors more aggressively than their default values.

Can you set them back to the defaults and see whether CPU utilization is still very high?

In the Web UI, go to Control Center -> Options -> Monitoring, set the following to blank values, and update.

ccTimestamp

swhitehead · ‎2010-03-18

I'm gonna say that fixed it. Setting the CC (Conformance Checking) to a bigger number means that the task runs less frequently and doesn't consume the processor as often. While we were at it we set some other tasks to run less frequently. Thanks for the help, I really appreciate it.

Scott

adaikkap · ‎2010-03-18

Replying by email clipped off part of my post.

These monitors are also running more frequently than their defaults:

cfTimestamp          Cluster Failover Monitoring Interval
dfTimestamp          Disk Free Space Monitoring Interval
diskTimestamp        Disk Monitoring Interval
ifTimestamp          Interface Monitoring Interval
snapmirrorTimestamp  SnapMirror Monitoring Interval
statusTimestamp      Global Status Monitoring Interval
sysInfoTimestamp     System Information Monitoring Interval

Bring them back to default values.
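
For anyone who wants to find such monitors in their own dfm diag output, here is a minimal sketch in Python. It is not a DFM tool: the function names are invented for illustration, and the rows are a hand-copied subset of the name, Interval, and Default columns from the Monitoring Timestamps table earlier in the thread. It flags any monitor whose interval is shorter than its default and reports how much more often it runs.

```python
# Name, Interval, Default columns copied by hand from the diag excerpt
# earlier in the thread (a subset; cpuTimestamp is included as a
# monitor that is still at its default).
ROWS = """\
ccTimestamp          2 hours     4 hours
cfTimestamp          2 minutes   5 minutes
cpuTimestamp         5 minutes   5 minutes
dfTimestamp          15 minutes  30 minutes
diskTimestamp        2 hours     4 hours
ifTimestamp          5 minutes   15 minutes
snapmirrorTimestamp  10 minutes  30 minutes
statusTimestamp      5 minutes   10 minutes
sysInfoTimestamp     15 minutes  1 hour
"""

UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def to_seconds(count: str, unit: str) -> int:
    """Convert e.g. ("15", "minutes") to 900 seconds."""
    return int(count) * UNIT_SECONDS[unit.rstrip("s")]


def faster_than_default(rows: str) -> list[tuple[str, float]]:
    """Return (name, default/interval) for monitors polling faster than default."""
    flagged = []
    for line in rows.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or incomplete lines
        name, n_cur, u_cur, n_def, u_def = parts[:5]
        interval = to_seconds(n_cur, u_cur)
        default = to_seconds(n_def, u_def)
        if interval < default:
            flagged.append((name, default / interval))
    return flagged


if __name__ == "__main__":
    for name, ratio in faster_than_default(ROWS):
        print(f"{name}: runs {ratio:g}x as often as the default")
```

Run against these rows, it flags ccTimestamp plus the seven monitors listed above and leaves cpuTimestamp alone.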

Regards

adai