OCUM Alert from OnCommand Unified Manager: Cluster Not Reachable (State: New)

J-L-B · ‎2018-08-23

Is there a way to raise the API timeout threshold for just this one cluster rather than all of them? Where would that setting be?

Running OCUM 7.3p1 with half a dozen 9.2 and 9.3 clusters. One of them, ClusterH, is on a network with close to 180 ms round-trip latency, while the rest of the clusters are between 10 and 100 ms. After a year of monitoring this same system without issues, I have recently been getting intermittent Cluster Monitoring Failed alerts. They always alert, then clear (go obsolete). There are no issues locally with the NetApp connections, and nothing can be done about the latency from the OCUM server to the cluster. I'm trying to avoid installing another OCUM instance closer to that one cluster.

Alert example:

Risk - Cluster Monitoring Failed

Impact Area - Availability

Severity - Warning

State - Obsolete

Source - ClusterH

Trigger Condition - Monitoring failed for cluster ClusterH. Reason: Cluster cannot be reached. Ensure that there is network connectivity to the cluster and the API has not timed out. The following settings must also be configured correctly for the cluster: hostname or cluster-management IP address, protocol, and port. If the issue persists, contact technical support.

RajeshPanda · ‎2018-09-24

@J-L-B

The OnCommand Ask The Expert session is live from today and we have posted this question for our Experts. You will receive a response shortly.

Follow our ATE forum for our Expert’s response.

Benjamin_P · ‎2018-11-09

Good morning, has there been any update on how to increase this monitoring timeout? We also have a site that is extremely latent from our central monitoring site, and constantly receive "New" alerts followed shortly by "Obsolete" messages for the cluster being monitored.

aattar · ‎2018-09-24

@J-L-B,

As a first troubleshooting step, could you please try the two options below:

Option 1:
Change the name in the cluster certificate to match the hostname or FQDN that DNS provides for the cluster when resolving - whether self-signed or imported from a CA.
For more information on this, see article 1014389: How to renew an SSL certificate in clustered Data ONTAP

Option 2:
Edit DNS (in the /etc/resolv.conf file) to properly reflect the name in the certificate, or edit the hosts file on the UM installation so that it can resolve the cluster (Admin Vserver) name in the certificate.

For Linux installations:
Edit the /etc/hosts file

For OVA deployment:
Access the diag shell and edit the /etc/hosts file. If you are unsure as to how to perform this, see the article 1030670

For Windows:
Edit the C:\Windows\System32\drivers\etc\hosts file

The new line should read as follows:
' clusterIP FQDN hostname'
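
One way to confirm the new entry is in place is to parse the hosts file and check that the cluster name maps to the expected address. The following is only a minimal sketch: the function name `hosts_maps_name`, the sample address, and the sample names are illustrative assumptions, not part of UM.

```python
def hosts_maps_name(hosts_text, ip, name):
    """Return True if any non-comment line in hosts_text maps `name` to `ip`."""
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        # First field is the address; the rest are the FQDN and aliases.
        if fields[0] == ip and name.lower() in (f.lower() for f in fields[1:]):
            return True
    return False


if __name__ == "__main__":
    # Illustrative values only; substitute the real cluster-management IP and names.
    sample = "127.0.0.1 localhost\n192.0.2.10 clusterh.example.com clusterh\n"
    print(hosts_maps_name(sample, "192.0.2.10", "clusterh"))
```

To check a real system, read the contents of the hosts file for your platform (see the paths above) and pass them as `hosts_text`.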

mbeattie · ‎2018-11-11

Hi Benjamin,

Have you logged a support case for this issue? If I've understood your requirement correctly, you have a central site that monitors remote datasources (clusters), and the remote cluster is located on a high-latency network, which can result in events being raised that otherwise wouldn't be (if you had the ability to customize the OCUM datasource timeout for the cluster on the remote network)?

It is possible to specify a timeout value when creating an OCUM ZAPI connection; however, this does not appear to be configurable for an OCUM datasource in the application. I think the default timeout is 60 seconds.

It might be possible to create a workaround by attaching a script to the alert which invokes the "cluster-iter" ZAPI with a high timeout. If a ZAPI connection is made and a response is received within that timeout, you can then acknowledge and/or delete the OCUM event.

/Matt
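
A rough sketch of the re-check logic such an alert script might use is shown below. It does not call the ZAPI: as a stand-in, it opens a TCP connection to the cluster-management HTTPS port with a generous timeout. The names `tcp_probe` and `recheck_cluster`, port 443, the 120-second timeout, and the retry values are all assumptions for illustration. A real script would swap in a probe that issues the ZAPI call and would then acknowledge or delete the event by whatever means your OCUM version provides.

```python
import socket
import time


def tcp_probe(host, port, timeout_s):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers timeouts, refused connections, DNS failures
        return False


def recheck_cluster(host, port=443, timeout_s=120.0, attempts=3,
                    pause_s=5.0, probe=tcp_probe):
    """Retry the probe; True suggests the alert was a transient timeout."""
    for attempt in range(attempts):
        if probe(host, port, timeout_s):
            return True
        if attempt < attempts - 1:
            time.sleep(pause_s)  # brief pause before the next attempt
    return False


if __name__ == "__main__":
    # "clusterh.example.com" is a placeholder for the cluster-management LIF.
    if recheck_cluster("clusterh.example.com"):
        print("Cluster reachable with extended timeout; event can be acknowledged.")
    else:
        print("Cluster still unreachable; leave the event open.")
```

The probe is passed in as a parameter so the TCP check can be replaced by a ZAPI-based check without changing the retry loop.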
