Is it possible to monitor the status of the datasource acquisition, and also to alert if the acquisition fails or falls behind by a specific lag?
Wondering if you ever figured this out? I ask because I just ran across a situation today where I found that my WFA server (version 184.108.40.206.2 - I know I need to upgrade) hadn't updated its cache via DFM since July 29. Here's a screenshot (the update from today appeared after I manually initiated one and reset the interval to 0 and then back to 5):
I'm going to comb through the directory structure under the WFA install and see if I can find the log file that tracks this history. I may be able to have a process check for new entries and send an alert if one doesn't appear after a defined amount of time.
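A minimal sketch of that idea in Python, assuming the log lines begin with a `YYYY-MM-DD HH:MM:SS` timestamp like the server.log excerpts later in this thread (the function names and the 15-minute default threshold are my own, not anything WFA provides):

```python
import re
from datetime import datetime

# Assumed timestamp layout, matching the server.log excerpts in this thread
# ("2012-07-29 07:55:21,443 PDT ..."); adjust the regex if wfa_stats.log differs.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def last_entry_age_minutes(lines, now):
    """Return minutes elapsed since the newest timestamped line, or None if none found."""
    newest = None
    for line in lines:
        m = TS_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if newest is None or ts > newest:
                newest = ts
    if newest is None:
        return None
    return (now - newest).total_seconds() / 60.0

def is_stale(lines, now, threshold_minutes=15):
    """Alert condition: no entry has appeared within the threshold."""
    age = last_entry_age_minutes(lines, now)
    return age is None or age > threshold_minutes
```

A cron job could read the tail of the log, call `is_stale`, and send mail when it returns True.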
This is actually an important issue and I'm going to need some attention on this. Timely cache updates are vital to proper WFA operation and so I need some help troubleshooting it. I don't see any way to open a support case on this issue through typical channels because WFA isn't listed as a software option.
I need to know:
I have not seen an issue where the scheduled updates just stop working. I have seen issues where a cache update job hangs, either due to the JVM on the WFA server (which will affect workflow executions as well) or because DFM 'went away' during a cache update (this recently happened). In both of those cases, I would see a 'long'-running job in the cache results. The wfa_stats log contains information about job execution for cache query updates. Since you haven't had that many occur since the failure, you could likely look at that log and find the last execution. See if there are any issues with 'attempts' after the last success in July. I would also look at the wfa_error log and see if anything jumps out. I would be happy to take a look as well.
Regarding the 'how do I ensure that cache updates are happening' question from the OP: I don't see any option either to schedule a cache update externally (via web services) or to set up an SNMP trap on cache update failures. I would agree with both of you that this is something we should look at getting added, as it is vital to validation.
Note: unless the second poster's issue actually generated an error, SNMP would not have helped, per se.
Thank you, this is very helpful!
I'm looking at the wfa_stats logs and do see the class=CacheJobExecutorImpl entry occurring every 5 minutes, which I believe is this process (I also see the 3 ETL processes logged too). I don't have a wfa_error log file anywhere, but I do have an err.txt file, and there are only 10 entries in total there. However, I did find WARN and ERROR entries in the server.log file that correspond exactly, time-wise, to the last cache job update in the wfa_stats log (2012-07-29 07:55:21).
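That 5-minute cadence of class=CacheJobExecutorImpl entries could itself be checked for gaps. A hedged sketch follows; it relies only on the leading timestamp and the class marker, and everything else (function name, the 6-minute tolerance) is an assumption on my part:

```python
import re
from datetime import datetime, timedelta

# Assumes each relevant line carries a "YYYY-MM-DD HH:MM:SS" timestamp and the
# literal marker "class=CacheJobExecutorImpl", per the entries described above.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def cache_job_gaps(lines, max_gap=timedelta(minutes=6)):
    """Return (prev, curr) timestamp pairs whose spacing exceeds max_gap."""
    stamps = []
    for line in lines:
        if "class=CacheJobExecutorImpl" in line:
            m = TS_RE.search(line)
            if m:
                stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    stamps.sort()
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]
```

Any pair returned marks a window where the 5-minute schedule was missed, which is exactly the symptom seen here after 2012-07-29 07:55:21.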
Thanks for offering to look at these - how should I get the files to you? I don't see an option for attaching files in a private or direct email...
Reply to this thread and use the Advanced editor option (look at the top right of the reply box and you will see it). There is an option to attach files to your thread post. I will try to take a quick look.
Ah, that's it. Thank you. I've attached a zip file with the 3 logs:
wfa_stats.log = the current one that shows everything working well.
wfa_stats.log.2 = the log file that shows when the updates stopped.
server.log = corresponding errors at the same time the updates stopped in wfa_stats.log.2
It looks like the cache process actually hung. During the step to select the data from DFM, there was an error and the whole process just hung. I see multiple attempts to complete but never see it get past that point.
2012-07-29 07:55:21,443 PDT WARN [com.arjuna.ats.jta.logging.loggerI18N] (WorkManager(2)-18) [com.arjuna.ats.internal.jta.transaction.arjunacore.lastResource.disallow] [com.arjuna.ats.internal.jta.transaction.arjunacore.lastResource.disallow] Adding multiple last resources is disallowed. Current resource is org.jboss.resource.connectionmanager.TxConnectionManager$LocalXAResource@43c709
2012-07-29 07:55:21,443 PDT WARN [org.hibernate.util.JDBCExceptionReporter] (WorkManager(2)-18) SQL Error: 0, SQLState: null
2012-07-29 07:55:21,443 PDT ERROR [org.hibernate.util.JDBCExceptionReporter] (WorkManager(2)-18) Could not enlist in transaction on entering meta-aware object!; - nested throwable: (javax.transaction.SystemException: java.lang.Throwable: Unabled to enlist resource, see the previous warnings. tx=TransactionImple < ac, BasicAction: a838c15:f5cb:4fc391e8:b9a31d status: ActionStatus.ABORT_ONLY >); - nested throwable: (org.jboss.resource.JBossResourceException: Could not enlist in transaction on entering meta-aware object!; - nested throwable: (javax.transaction.SystemException: java.lang.Throwable: Unabled to enlist resource, see the previous warnings. tx=TransactionImple < ac, BasicAction: a838c15:f5cb:4fc391e8:b9a31d status: ActionStatus.ABORT_ONLY >))
2012-07-29 07:55:21,490 PDT ERROR [org.jboss.aspects.tx.TxPolicy] (WorkManager(2)-18) javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Cannot open connection
2012-07-29 07:55:21,490 PDT ERROR [org.jboss.resource.adapter.jms.inflow.JmsServerSession] (WorkManager(2)-18) Unexpected error delivering message delegator->JBossMessage:PERSISTENT, deliveryId=11217
javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Cannot open connection
We have logged an internal customer case and are looking at this. We will reply with our finding as soon as possible.
I have also added a feature request for notification based on acquisition issues (potentially no updates within a certain window of time, or a number of failures within a certain timeframe).
Thank you for your patience.
Sorry for not following up on it earlier.
We did not manage to figure out exactly why the issue happened.
What was clear from the evidence was that the acquisition job completed, and an attempt was made to post the next recurrence.
That attempt failed and somehow no further attempts were made.
We looked at this from several angles and came up with short- and long-term solutions:
The best thing to do is to alert on unscheduled recurring data sources, investigate the failure, and retry scheduling the next recurrence.
This mechanism will be fleshed out later, per the release roadmap.
We did tackle the second part: we created a watchdog that confirms (every 30 minutes) that all recurring data sources have an acquisition recurrence set for them. For those that somehow don't, it will attempt to schedule the following recurrence.
This part of the solution is already part of the 2.0 release.
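For illustration only, the watchdog logic described above might look roughly like this. Every name here (DataSource, watchdog_pass, the intervals) is hypothetical; the actual 2.0 implementation is internal to WFA:

```python
from datetime import datetime, timedelta

class DataSource:
    """Hypothetical stand-in for a recurring WFA data source."""
    def __init__(self, name, interval_minutes, next_run=None):
        self.name = name
        self.interval = timedelta(minutes=interval_minutes)
        self.next_run = next_run  # None models a lost recurrence

def watchdog_pass(sources, now):
    """One watchdog cycle: reschedule any recurring source with no pending
    recurrence, and return the names of the sources that were repaired."""
    repaired = []
    for src in sources:
        if src.next_run is None:
            src.next_run = now + src.interval
            repaired.append(src.name)
    return repaired
```

In the described design this pass would run every 30 minutes, so a data source whose next recurrence silently failed to post (as happened here) would be rescheduled within half an hour rather than staying stuck for weeks.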
Hope this helps prevent future occurrences.