Active IQ Unified Manager Discussions

DFM Upgrade Very very very....slow.

jmohler

Eight physical cores (physical server, not a VM)

32 GB RAM

Dual-pathed 4 Gb FCP SAN to an HP EVA (the only application on this system at the time)

EVA shows no load

The upgrade itself, run without a backup (a manual backup was made by hand the day before, and -it- took 11.5 hours), took about 9.5 hours to perform.

During the whole process, perfmon showed 500-2500 disk operations in the queue but very little KB/sec of throughput, at least during the last 4 hours, which I spent in a war room with the users trying to determine why this was taking so long.

Now that it's over, the user wants to know what this upgrade process is doing such that a fast SAN shows nothing going on, there is no CPU load during the process (1-2% total on the Windows 2003 server), and yet perfmon reports tons of I/O sitting in the queue.

Execution Throttle on the Windows QLogic HBA is set to 128; raising it appears to make no difference.

We assume the same problem is in play with the backups: months ago they took about 2 hours to perform, and now they take 10-14 hours. When we test this next, we want to get data back to NetApp to diagnose _what_ is going on here.

What should we do or check next?


adaikkap

I assume you upgraded from DFM 3.8 or earlier to 4.0 or 4.0.1.

Can you give me the size of the perf data under the perfdata directory?

If there is a lot of perf data, this is expected, as we widen some of the perf files for each counter group.

To see what really took the most time, go to the log folder and open pa_upgrade.log.

Look for the following markers and note how long each phase took (a sketch for extracting the timings follows the list):

Upgrading counter groups
========================
Upgrading counter group

Widening counter groups
========================
Widening counter group

Time for creating trend files
=============================
Started populating trend files
Completed populating trend files
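
As an illustration only, here is a minimal Python sketch for pulling those phase durations out of pa_upgrade.log; the "Dec 07 10:27:29"-style timestamp at the start of each line is an assumption based on the times quoted later in this thread, so adjust the pattern to match your actual log.

import re
from datetime import datetime

# Scan pa_upgrade.log for the phase markers listed above and report the
# first and last timestamp seen for each, giving a rough per-phase duration.
# Assumes each relevant line begins with a "Dec 07 10:27:29"-style timestamp.
MARKERS = [
    "Upgrading counter group",
    "Widening counter group",
    "populating trend files",
]
TS = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2})")

spans = {m: [] for m in MARKERS}
with open("pa_upgrade.log") as log:
    for line in log:
        match = TS.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%b %d %H:%M:%S")
        for marker in MARKERS:
            if marker in line:
                spans[marker].append(stamp)

for marker, stamps in spans.items():
    if stamps:
        print("%-26s %s -> %s (%s)"
              % (marker, stamps[0], stamps[-1], stamps[-1] - stamps[0]))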

Also, to reduce the size of your perfdata files, take a look at bug 439756.

Another suggestion, to reduce your backup time from hours to minutes, is to go for Snapshot-based backups using SDW or SDU.

Also, using the Performance Advisor data-collection configuration feature, you can set a different collection frequency and retention time for each counter group.

You can also enable or disable collection of a particular counter for each storage system.

The same template can also be copied over to multiple storage systems.

Regards

adai


jmohler

Perfdata size: 492 GB. The customer's counter retention requirements are all 1 year.

The first CG upgrade started at Dec 07 10:27:29, and the last one ended at Dec 07 16:23:14 (about 6 hours).

The first CG widening started at Dec 07 16:23:15, and the last one ended at Dec 07 17:55:57 (about 1.5 hours).

Trending started at Dec 07 17:56:04 and finished at Dec 07 18:06:38; pretty fast.

And our final goal is an understanding of why this takes so long with so little observable I/O.

Additional data: the dfm backup database command we started at 7 PM last night is still running right now, 21 hours later. During that time, PA cannot log data or provide alerts and alarms. That's a 21-hour P1 outage, not just a backup.

We're going to need some direct answers from Engineering on what the heck is going on here. This is not a database backup; this is watching paint dry.

And while moving to NetApp SAN might resolve the backup speed issue, they feel it would only mask the real problem. (I'm on our side; just putting the issue in scope.)

Now, if the backup were shown to be hammering their SAN storage, which it is decidedly not doing, we'd have a different target to chase. But for now we're the target, and they require an RCA on why this takes so long.

What data, from the Windows side and the HP EVA side, does Engineering want us to gather when we run another backup next week (it will be painful), to properly isolate what the environment is doing, from DFM I/O itself down to storage statistics? We won't get more than one run at this, considering it's nearly a 24-hour data outage to gather it.
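
For the Windows side, one low-impact option is a perfmon/typeperf capture over the whole backup window. A minimal Python sketch follows; the counter paths are the standard PhysicalDisk set, and the _Total instance is an assumption, so substitute the instance that holds the perfdata LUN.

import subprocess

# Sample disk queue depth, throughput, and per-transfer latency every 15
# seconds for the duration of the backup, writing a CSV to hand back for
# analysis. typeperf ships with Windows 2003 and later.
counters = [
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
    r"\PhysicalDisk(_Total)\Disk Bytes/sec",
    r"\PhysicalDisk(_Total)\Disk Transfers/sec",
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Transfer",
]
subprocess.run(
    ["typeperf", "-si", "15", "-f", "CSV", "-o", "dfm_backup_disk.csv"] + counters,
    check=True,
)

Comparing Disk Transfers/sec against Disk Bytes/sec would show whether the load is many small operations rather than bulk throughput, and Avg. Disk sec/Transfer would show whether each of those operations is waiting on the array; EVA-side statistics could then be collected over the same window for correlation.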

And..thanks both, for your inputs.

adaikkap

That's a huge amount of perf data you have collected; I am sure that with just 35 controllers you can't get there with the default 1-year retention.

I suspect there are a lot of stale entries in the db.

Please get a case opened against bug 439756 to prune your db as well as your perfdata.

If you want to open a case about the upgrade, open it against bug 432189. Also explore the possibility of going to Snapshot-based backups.

Regards

adai

lovik_netapp

Not surprising; when I upgraded my DFM from 3.8 to 4.0 it took 14 hours.

A few things that helped me were:

* a DFM DB reload

* disabling unwanted performance counters

* shorter retention periods for performance data, Protection Manager jobs, and OM events.

Hope that helps.

jmohler

Thanks Lovik...

But my user is going to have me go through a case to understand -why- it should take that long, considering that a distinct lack of CPU usage, as well as a lack of any real disk I/O to a very fast SAN, was observed.

Basically, whatever affected this is also why backups now take the same amount of time or more when, with no changes, they used to take only a few hours. We hope the question of "why so long" results in an a-ha moment about why nothing seems to happen for so very long.

lovik_netapp

How many systems are managed by this poor DFM that you got 492 GB of perfdata? That's way more than I have experienced.

jmohler

About 35, all 6080s with 6 full loops each, and many thousands of relationships in Protection Manager.

lovik_netapp

Then you had better do everything I listed. I was in the same boat, and once I went through all of it my DFM came back alive, and my backup time and other operations shortened significantly.

harish

To summarize, there are 2 issues:

1. Database backup taking a lot of time (> 21 hours)

There are 3 steps that could be taking time, as follows:

a. database validation

b. waiting for the performance advisor jobs to stop storing data

c. zipping the database and performance advisor files.

The fix for burt 439756 could help reduce the time of step (c).

2. The upgrade took a long time (with the Performance Advisor steps taking 7.5+ hours)

The issue (as Adai said earlier) is the file processing involved, with one file processed at a time. Processing files in parallel could reduce this time (I think Adai has already filed a defect for this); see the sketch below.
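
To make the sequential-versus-parallel point concrete, here is an illustrative Python sketch; this is not DFM's actual code, and widen_counter_file is a hypothetical stand-in for the per-file work.

import concurrent.futures

def widen_counter_file(path):
    # Hypothetical stand-in for the per-file "widening" work: read a
    # counter file and rewrite it with a larger record layout. Each call
    # costs a few small synchronous I/Os, not bandwidth.
    with open(path, "rb") as src:
        data = src.read()
    with open(path + ".widened", "wb") as dst:
        dst.write(data)  # a real widening would transform the records here

def widen_all_sequential(paths):
    # One file at a time: total time ~= per-file I/O latency * file count,
    # with the CPU and the array idle while each I/O completes.
    for path in paths:
        widen_counter_file(path)

def widen_all_parallel(paths, workers=8):
    # Overlapping the waits hides per-file latency; wall-clock time drops
    # roughly with the worker count until the disk queue saturates.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(widen_counter_file, paths):
            pass

With hundreds of thousands of small perf files, the sequential loop spends almost all of its wall-clock time waiting on individual I/O completions, which would match the low-CPU, low-KB/sec behavior reported above.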

Regards

Harish

yan
NetApp

We have just opened case 2001873880. The upgrade was from 3.8.1 to 4.0.1 on Tuesday (12/7). The customer did a backup on 12/6, and it took 14 hours to finish. The database backup after the upgrade has been running for more than 40 hours. Does DFM 4.0.1 require a longer time to back up?

pascalduk

yan wrote:

We have just opened case 2001873880. The upgrade was from 3.8.1 to 4.0.1 on Tuesday (12/7). The customer did a backup on 12/6, and it took 14 hours to finish. The database backup after the upgrade has been running for more than 40 hours. Does DFM 4.0.1 require a longer time to back up?

I also encountered a long upgrade time (about 4 hours) because of the performance data, but my backup time (3 hours) did not change after the upgrade.

jmohler

That BURT is a very good read, thank you very much.

soehlig

Hi Jeff, I own the case that was opened, and Mason Y. asked me to take a look at it. Did we get the 'perf data list -v' output from the customer, and did we get a directory listing from the DFM server showing the contents and size of the perfdata dir, so we can compare them and see whether 439756 is applicable here? (I am almost sure it is.) If we see a noticeable size difference, we can have the customer run the pruning script. A sketch of one way to produce that listing follows below.
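
For the directory-listing half of that comparison, a minimal Python sketch; the perfdata path below is an assumption, so point it at the actual DFM install location.

import os

# Total the on-disk size of each top-level subdirectory under perfdata,
# largest first, to compare against what 'dfm perf data list -v' reports.
PERFDATA_DIR = r"C:\Program Files\NetApp\DataFabric Manager\DFM\perfdata"  # assumed path

def dir_size(path):
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # skip files that vanish or are locked mid-walk
    return total

sizes = []
for entry in os.listdir(PERFDATA_DIR):
    full = os.path.join(PERFDATA_DIR, entry)
    if os.path.isdir(full):
        sizes.append((dir_size(full), entry))

for size, entry in sorted(sizes, reverse=True):
    print("%10d MB  %s" % (size // (1024 * 1024), entry))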

One other thing: look in the logs dir for a file called PA_Upgrade.log. This file might show us any errors that occurred with the perf data during the upgrade.

I am reviewing the case and will reply there as well.
