OnCommand 5.0 Semaphore leak problem

salindley · ‎2012-03-05

I'm with the rest of the crowd. Unfortunately we were recently forced to upgrade to OnCommand 5.0 in order to couple Protection Manager with SMVI, and it's been trouble ever since. First it flat out ran my Linux host out of semaphores, causing massive failures in backup jobs. We eventually fixed it by upping the count by a factor of 8. The secret is to up the number of arrays from 128 to 1024, AND up the system-wide number of semaphores from 32000 to 256000.
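For anyone wanting to reproduce that change, here is a minimal sketch of the arithmetic. On Linux the four fields of kernel.sem are SEMMSL, SEMMNS, SEMOPM and SEMMNI, in that order; the "number of arrays" is SEMMNI and the system-wide semaphore count is SEMMNS. The 250 and 32 used below for the other two fields are assumed values, not something stated in the post, and the function name is made up for illustration.

```python
# Sketch: build the kernel.sem line with the two limits scaled up.
# Field order in /proc/sys/kernel/sem: SEMMSL SEMMNS SEMOPM SEMMNI.

def scaled_sem_line(current: str, factor: int = 8) -> str:
    """Scale SEMMNS (system-wide semaphores) and SEMMNI (arrays) by factor."""
    semmsl, semmns, semopm, semmni = (int(x) for x in current.split())
    # SEMMSL and SEMOPM are left untouched; only the two limits
    # mentioned in the post are raised.
    return f"kernel.sem = {semmsl} {semmns * factor} {semopm} {semmni * factor}"


if __name__ == "__main__":
    # 250 and 32 are assumed values for the fields the post does not mention.
    print(scaled_sem_line("250 32000 32 128"))
```

The printed line is in the form used in /etc/sysctl.conf; check your own current values in /proc/sys/kernel/sem before changing anything.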

Another killer is the browser incompatibilities. I'd love to be able to run the GUI in Firefox on Linux, but it's a brick, and it's really tiresome to get the braindead browser warning every time you connect, especially given that it actually works in Firefox on Windows, even with version 10. I agree with a previous post - please add a 'STFU' (AKA 'don't show this mindless warning again') button.

I had also hoped that going from a 32-bit app to a 64-bit app would make the database manager a lot snappier, but it's the opposite - it drags the host down even more than 4.x. While I really like some of the new OnCommand features - like the flexibility to name a secondary qtree a billion times more intelligently than in 4.x - the pain almost isn't worth it. It's not really ready for prime time.

pascalduk · ‎2012-03-05

Scott Lindley wrote:

The secret is to up the number of arrays from 128 to 1024, AND up the system-wide number of semaphores from 32000 to 256000.

Can you share with us the size of the environment you monitor/manage with OnCommand?

salindley · ‎2012-03-06

Certainly. Our DFM server is a SunFire x86 host, I forget the model. It's got 4 quad-core AMD Opteron 8384 CPUs and 128 GB of RAM, and a fair amount of local storage mainly used for the DFM installation, with a NetBackup server on the side that is only used once a week or so. We host our DFM database locally, so backups take about 45 minutes - such is life. The DFM server is running RHEL 5.7, soon to be downgraded to RHEL 5.6 due to OS support issues with OnCommand 5.0.

As of this AM, we back up 577 primary hosts, 46 of which are NetApp Filers and the rest are Wintel and Unix OSSV clients. There are 2606 primary dirs and 1035 secondary volumes spread across 6 Filers, 2 at our AZ site and 2 each at our 2 sites in MA. Data at the 3 major sites is backed up to local Filers. Someday we hope to be able to cross-pollinate for DR purposes.

Our backup window runs from about 6PM to 6AM in the host's local time zone. We are hoping to compress our backups down to 6PM to midnight so our Operations people aren't paged at 3AM because of a stupid failed backup. We are also currently backing up most of our filesystems using Business Continuity, and are only now migrating to using Protection Manager. I must say, BC was a LOT easier to configure, drive and maintain. DPM has seriously hypercomplicated the backup process, and so far as I can see, with no particular advantages other than more flexible scheduling and better handling of backup snapshot retention. I'm not sure it's worth it, but I have no choice.

We run a lot of reports and general data gathering activities against our DFM installation and our Filers over the course of a day, so the DFM server is kept fairly busy. As many of these data gathering activities as possible are performed via Perl scripts using the NMSDK, which does a really good job but is a major PITA to use and is not very well documented. I've had to figure out way too many things by trial-and-error.

It'd be interesting to see how our site compares to others using the same toolsets. I'm sure we're neither the smallest nor the largest. I'm also wondering how other users validate that they are backing up all required data. That's been one of our major headaches since we started with BC several years ago. DPM was supposed to be able to assist us with our validation via reports such as "external relationships" and in particular "unprotected data", but so far those have been a great disappointment. They don't catch all cases by a long shot according to our testing, making them semi-useless.

adaikkap · ‎2012-03-07

Hi Scott,

Your main problem is that you are running DFM on an unsupported version of RHEL. Please move to one of the supported ones, where you shouldn't find any semaphore leaks. Also, your configuration seems to be very normal and quite a common one.

Regards

adai

salindley · ‎2012-03-07

We will be down-revving our kernel to the 5.6 version tomorrow AM at the same time the hardware weenies come in to swap out the motherboard. We'll see how it goes after that.

salindley · ‎2012-04-10

Thought that I would add to my previous post - we have downrevved the RHEL kernel to the 5.5 version, and are still running into semaphore issues. Looks like it's not the OS, it's the application. I'm being forced to reboot once a week to clear the semaphores so I can continue to use OnCommand. Sorta feels like the old ISS days on Winders 2000.
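To watch the leak build up between reboots, something like the following could be run from cron. It is only a sketch: it reads the standard Linux files /proc/sysvipc/sem (one row per semaphore array, with nsems in the fourth column) and /proc/sys/kernel/sem (the limits), and the function names are made up for illustration.

```python
import os

SEM_TABLE = "/proc/sysvipc/sem"    # one row per allocated semaphore array
SEM_LIMITS = "/proc/sys/kernel/sem"  # SEMMSL SEMMNS SEMOPM SEMMNI


def sem_usage(table_text: str):
    """Return (arrays_in_use, semaphores_in_use) from /proc/sysvipc/sem text."""
    # Skip the header line; columns are key, semid, perms, nsems, ...
    rows = [line.split() for line in table_text.splitlines()[1:] if line.strip()]
    return len(rows), sum(int(row[3]) for row in rows)


def sem_limits(limits_text: str):
    """Return (max_arrays, max_semaphores) from /proc/sys/kernel/sem text."""
    semmsl, semmns, semopm, semmni = (int(x) for x in limits_text.split())
    return semmni, semmns


if __name__ == "__main__" and os.path.exists(SEM_TABLE):
    with open(SEM_TABLE) as table, open(SEM_LIMITS) as limits:
        arrays, sems = sem_usage(table.read())
        max_arrays, max_sems = sem_limits(limits.read())
    print(f"arrays: {arrays}/{max_arrays}  semaphores: {sems}/{max_sems}")
```

Logging that one line every hour or so should show whether usage climbs steadily toward the limit or jumps during the backup window.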

All I can say is that OnCommand 5.0.1 had better be a LOT more stable than 5.0. Not that it would be hard, 5.0 tips over with very little provocation. The "server" service spontaneously stops at least once during the week and has to be restarted. Nothing else seems to be affected, and simply restarting it seems to work fine.

adaikkap · ‎2012-04-11

Hi Scott,

Do all your OSSV hosts have an NHA (NetApp Host Agent) as well? If so, you seem to be a victim of bug 556462. The public report for it is available below and has a workaround.

Has this workaround been done in your case? If not, can you tell us your observations after doing the specified workaround?

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=556462

Regards

adai

salindley · ‎2012-04-19

We used to have about 6 hosts with the host agent installed and configured. I disabled all of them a few weeks ago, and we're still running out of semaphores. It appears to me that OnCommand 5.0 just wasn't really ready for "prime time". I keep losing parts of the service set. I'll run "dfm service list" at random times, and one of the services will be "not started", in other words it died. So far it's been the eventd a couple of times, but it's most likely to be the "server" service. For either of these, issuing a "dfm service start" against them starts them right back up, but obviously this shouldn't be happening. We've been running DFM for years, and until 5.0 we've never had any of these issues. If anything, it's been almost amazingly stable, though we did run into some issues with the scheduler in 4.0.2 - it would randomly not schedule jobs, no apparent rhyme, reason or pattern.
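Rather than checking by hand, the same test could be scripted. This is a rough sketch only: it assumes "dfm service list" prints one "name: state" line per service, which is a guess at the layout and should be checked against real output, and the function name is made up for illustration.

```python
import shutil
import subprocess


def stopped_services(listing: str):
    """Return names of services whose state reads 'not started'.

    Assumes one 'name: state' line per service (an assumed layout).
    """
    stopped = []
    for line in listing.splitlines():
        name, sep, state = line.partition(":")
        if sep and state.strip().lower() == "not started":
            stopped.append(name.strip())
    return stopped


if __name__ == "__main__" and shutil.which("dfm"):
    result = subprocess.run(
        ["dfm", "service", "list"], capture_output=True, text=True
    )
    for name in stopped_services(result.stdout):
        # Only report here; restarting is left to the operator.
        print(f"service {name} is not started")
```

It only reports what it finds, so whether to restart a dead service automatically stays a separate decision.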

As I said, I'm anxiously awaiting OnCommand 5.0.1 - if it fixes half the bugs it's supposed to, it'll be coolness. Of course, it's already over 4 weeks overdue, but I'd rather have a working product than an on-schedule product. Apparently 5.0 is the latter...