2010-12-07 07:05 PM
Eight Physical Cores (Physical server, not a VM)
Dual-pathed 4G FCP SAN to HP-EVA (only application on this system at the time)
EVA shows no load
The upgrade, run without a backup (a manual backup was made by hand the day before, and -it- took 11.5 hours), took about 9.5 hours to perform.
During the whole process, perfmon showed 500-2500 disk operations in the queue and very little KB/sec of throughput. I spent the last 4 hours in a war room with the users trying to determine why this was taking so long.
Now that it's over, the user wants to know what this upgrade process is doing such that a fast SAN has nothing going on and there is no CPU load during the process (1-2% total on the Windows 2003 server), yet perfmon reports tons of IO in the queue.
Execution Throttle on the Windows QLogic HBA is set to 128; raising it appears to make no difference.
We are assuming the same problem is in play with the backups, which months ago took about 2 hours and now take 10-14 hours. When we test this next, we want to get data back to NetApp to diagnose _what_ is going on here.
What should we do or check next?
2010-12-08 11:06 AM
I assume you upgraded from DFM 3.8 or earlier to 4.0 or 4.0.1.
Can you give me the size of your perfdata under the perfdata dir?
If there is a lot of perf data, this is expected, as we widen some of the perf files for each counter group.
To see what really took the most time, go to the log folder and look for pa_upgrade.log.
Look for the following and take note of how long each one took.
Upgrading counter group.
Upgrading counter group
Widening counter group:
Widening counter group
Time for creating trendfiles.:
Started populating trend files
Completed populating trend files
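The log check above can be sketched in Python. This is only an illustration: it assumes every line of pa_upgrade.log starts with a timestamp shaped like "Dec 07 10:27:29" (the real layout may differ, so adjust TS_FORMAT and TS_LENGTH), and the function name, dictionary keys, and default year are hypothetical. The marker strings are the ones listed above.

```python
from datetime import datetime

# Marker substrings from the list above. The dictionary keys are made up.
MARKERS = {
    "upgrade": "Upgrading counter group",
    "widen": "Widening counter group",
    "trend_start": "Started populating trend files",
    "trend_end": "Completed populating trend files",
}

# ASSUMPTION: lines begin with a timestamp such as "Dec 07 10:27:29".
TS_FORMAT = "%Y %b %d %H:%M:%S"
TS_LENGTH = 15


def phase_spans(lines, year=2010):
    """Return {key: (first_seen, last_seen)} for every marker that appears."""
    spans = {}
    for line in lines:
        try:
            # The year is prepended because the assumed timestamp has none.
            stamp = datetime.strptime(f"{year} {line[:TS_LENGTH]}", TS_FORMAT)
        except ValueError:
            continue  # line does not start with the assumed timestamp
        for key, text in MARKERS.items():
            if text in line:
                first, _ = spans.get(key, (stamp, stamp))
                spans[key] = (first, stamp)
    return spans
```

Feeding it the file, for example `phase_spans(open("pa_upgrade.log"))`, and subtracting the first timestamp from the last for each key gives a rough duration per phase.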
Also, to reduce the size of your perfdata files, take a look at bug 439756.
Another suggestion, to reduce your backup time from hours to minutes, is to go for snapshot-based backup using SDW or SDU.
Also, using the PA feature for configuring data collection, you can set a different collection frequency and retention time for each counter group.
You can also enable or disable data collection of a particular counter for each storage system.
The same template can also be copied over to multiple storage systems.
2010-12-08 01:31 PM
Not surprising; when I upgraded my DFM from 3.8 to 4 it took 14 hours.
A few things that helped me were:
* DFM DB reload
* disabling unwanted performance counters
* smaller retention period for performance data, protection manager jobs and OM events.
Hope that helps.
2010-12-08 02:35 PM
But my user is going to have me go through a case to understand -why- it should take that long, considering that hardly any CPU was in use and no real disk IO to a very fast SAN was observed.
Basically, whatever affected this is also why backups take the same or more time when, with no changes, they used to take only a few hours. So we hope the question of "why so long" results in an ah-ha moment as to why nothing seems to happen for so long.
2010-12-08 04:15 PM
Perfdata size: 492G. Customer's counter retention requirements are all 1yr.
First CG update started at Dec 07 10:27:29, last one ended at Dec 07 16:23:14 (6hrs)
First CG widening started at Dec 07 16:23:15, last one ended at Dec 07 17:55:57 (1.5hrs)
First trending started at Dec 07 17:56:04, finished trending work at Dec 07 18:06:38..pretty fast.
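As a quick check of the rounded figures above, the exact durations can be worked out from the reported timestamps. This is a small sketch: the phase labels and the year 2010 (taken from the post dates) are filled in for illustration only.

```python
from datetime import datetime, timedelta

FMT = "%Y %b %d %H:%M:%S"

# Start/end timestamps exactly as reported above. The labels are made up.
phases = {
    "counter group upgrade": ("Dec 07 10:27:29", "Dec 07 16:23:14"),
    "counter group widening": ("Dec 07 16:23:15", "Dec 07 17:55:57"),
    "trend file population": ("Dec 07 17:56:04", "Dec 07 18:06:38"),
}


def elapsed(start, end, year=2010):
    """Time between two 'Mon DD HH:MM:SS' stamps in the same (assumed) year."""
    begin = datetime.strptime(f"{year} {start}", FMT)
    finish = datetime.strptime(f"{year} {end}", FMT)
    return finish - begin


durations = {name: elapsed(*span) for name, span in phases.items()}
total = sum(durations.values(), timedelta())

for name, spent in durations.items():
    print(f"{name}: {spent}")
print(f"total: {total}")
```

This works out to 5:55:45 for the counter group upgrade, 1:32:42 for widening, and 0:10:34 for trending, or 7:39:01 in total, so close to two hours of the roughly 9.5-hour upgrade fall outside these three phases.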
And our final goal is an understanding of why this takes so long with little observable IO.
Additional data: the dfm backup database command we started at 7pm last night is still running right now, 21hrs later. During that time, PA cannot log data or provide alerts and alarms. That's a 21hr P1 outage, not just a backup.
We're going to need some direct answers from Eng on what the heck is going on here. This is not a database backup, this is watching paint dry.
And while moving to NetApp SAN would resolve the backup speed issue, they feel it's only masking a real problem. (I'm on our side, just putting the issue in scope.)
Now, if the backup were shown to be hammering their SAN storage, which it very clearly is not, we'd have a different target to chase. But for now we're the target, and we require an RCA on what takes so long to do this.
What data from the Windows side and the HP-EVA side does Eng want us to gather when we run another backup next week (which will be painful), so we can properly isolate what the heck the environment is doing, from DFM IO itself down to storage statistics? We won't get more than one run at this, considering it's nearly a 24hr data outage to gather it.
And thanks, both, for your input.
2010-12-08 04:50 PM
Then you'd better do everything I have listed. I was in the same boat, and once I went through all of those steps my DFM came back alive, and my backup time and other things shortened significantly.
2010-12-08 04:55 PM
To summarize, there are 2 issues:
1. Database backup taking a lot of time (> 21 hours)
There are 3 steps that could be taking time, as follows:
a. database validation
b. waiting for the performance advisor jobs to stop storing data
c. zipping the database and performance advisor files.
Fix for burt 439756 could help reduce the time of step (c).
2. Upgrade took a long time (with steps for performance advisor taking
The issue (as Adai said earlier) is the file processing involved, with
one file processed at a time.
Processing files in parallel could reduce this time (I think Adai has
filed a defect already for
2010-12-09 01:01 AM
That's a huge amount of perf data you have collected. I am sure that with just 35 controllers you can't get there with the default 1-year retention.
I suspect there are a lot of stale entries in the db.
Please open a case against bug 439756 to prune your db as well as your perfdata.
If you want to open a case for the upgrade, open it against bug 432189. Also explore the possibility of going to snapshot-based backups.