Intermittent Host Login Failed warnings

GRAEMEOGDEN · ‎2013-06-21

I am receiving occasional Host Login Failed warnings via dfm for the cluster admin account at different times of the day. These clear themselves automatically.

Is there any way to find out where these logins are coming from?

If it was DFM itself trying to access the cluster (bad password in config for example) I wouldn't expect the alarm to clear....

Thanks

Graeme

kryan · ‎2013-06-21

Greetings Graeme,

can you provide us the version and mode of UM you are using as well as specifically where you are observing the warnings?

Thanks,

Kevin

GRAEMEOGDEN · ‎2013-06-21

Hi Kevin,

OCUM version 5.1.0.15008

Clustered DataONTAP version 8.1.2P3

An example of the alarm email is below:

A Warning event at 20 Jun 18:53 GMT Daylight Time on Cluster SEN_SAN_CLU01:

Host Login Failed.

Host SEN_SAN_CLU01 user admin login failed

Click below to see the details of this event.

http://TMGNAOCMSRV01.gpgroup.com:8080/start.html#st=1&data=(eventID=9167)

*** Event details follow.***

General Information

-------------------

DataFabric Manager server Serial Number: 1-50-017635 Alarm Identifier: 2

Event Fields

-------------

Event Identifier: 9167

Event Name: Host Login Failed

Event Description: Host Login

Event Severity: Warning

Event Timestamp: 20 Jun 18:53

Source of Event

---------------

Source Identifier: 132

Source Name: SEN_SAN_CLU01

Source Type: Cluster

Source Status: Warning

Event Arguments

---------------

hostName: SEN_SAN_CLU01

hostLoginName: admin

My concern is with it being the admin account (cluster administrator) I'd like to identify if this is a bad password in a configuration somewhere or someone attempting to log in to the cluster.

Thanks

Graeme

kryan · ‎2013-06-21

Hi Graeme,

I would not expect that to be a configuration problem within UM since the hostPassword is only entered once for the entire cluster and if works one time it should continue to work.

This might be indicative of a node configuration problem or network access to a particular node(s).

Does the warning only occur for one node (SEN_SAN_CLU01) or all of them?

Kevin

GRAEMEOGDEN · ‎2013-06-21

I've had these alerts from two clusters in two separate datacentres at various times over the last week, which does sound like a network connectivity issue but I would've thought there would be a "host down" message in that case, not a failed login?

If there's no method of obtaining more information from the OCUM logs I'll see if I can get more info from the clusters themselves.

Thanks.

kryan · ‎2013-06-21

Using the default settings of UM 5.1, in order to produce a host down event/alert the host would need to be down for a significant amount of time. This behavior was changed in UM 5.2 under bug 614983 (no public report at this time).

OnCommand Unified Manager Core uses five different methods to identify if a host is down:

echo
http
snmp
ndmp
echo_snmp <== default

The default behavior for a host down monitor run is a ping using ICMP echo and then snmpwalk. UM will retry each method a pre-configured number of times with varying timeouts, as seen below.

While the ICMP retries and timeouts have remained the same over the 5.x code line, the SNMP timeouts were increased in UM 5.1 for 7DOT and even more for 5.1 cDOT installations.

Due to changes under bug 614983, if pingMonTimeout is set to less than or equal to 5 seconds, then the SNMP timeout for host down (pingmon) monitoring will be 5 seconds. If the pingMonTimeout is set to a value greater than 5 seconds, then the pingMonTimeout is used as the SNMP timeout. The global MonSNMPTimeout is used for all other SNMP connections. This applies to both 7DOTand cDOT versions of UM 5.2.

===============================================

UM 5.0.x default values:

monSNMPRetries 4
monSNMPTimeout 5

hostPingMethod                        echo_snmp
pingMonInterval                       1 minute
pingMonRetryDelay                     3
pingMonTimeout                        3

===============================================

UM 5.1 7DOT default values:

monSNMPRetries 4
monSNMPTimeout 60

hostPingMethod                        echo_snmp
pingMonInterval                       1 minute
pingMonRetryDelay                     3
pingMonTimeout                        3

===============================================

UM 5.1/5.2 cDOT default values:

monSNMPRetries 4
monSNMPTimeout 300

hostPingMethod                        echo_snmp
pingMonInterval                       1 minute
pingMonRetryDelay                     3
pingMonTimeout                        3

===============================================

Therefore, if a clustered ONTAP controller is down for less than 5 minutes, UM 5.1 will not report it as down as it would not have exceeded the first timeout value for the host down check. If the ping method is changed to to echo or http the node down event is logged.

Changing the monSNMPTimeout to the 5.0.x default value of 5 seconds allows UM to determine the host down status with the echo_snmp method. However, it is not recommend that this value be adjusted lower than the default for cDOT UM 5.1 servers as some SNMP transactions can take a few minutes to complete and should not be sent multiple times under 5 minutes.

GRAEMEOGDEN · ‎2013-06-21

Checking the cluster for auth failures suggests this isn't a user getting the password wrong!

SEN_SAN_CLU01::*> event status show -messagename login.auth.loginDenied

Node Message Occurs Drops Last Time

----------------- ---------------------------- ------ ----- -------------------

SEN_SAN_CLU01-02 login.auth.loginDenied 1 0 5/8/2013 13:30:33

I'll speak to the networking team!

Thanks for your assistance Kevin.

Cheers

Graeme

GRAEMEOGDEN · ‎2013-09-27

Just to close this thread off, it turns out this was due to OnCommand using HTTPS to communicate with the clusters.

Changing the host protocol to HTTP / Port 80 via the dfm command line stopped the warnings being generated.

dfm host set <CLUSTERNAME> hostAdminPort=80 hostAdminTransport=http

CHYSLOP192 · ‎2013-12-12

Hi all,

Thought I would add to this thread as its one of the few that helped me. I had a call open for 3 months with NetApp support on this error, we had set transport to HTTP but that did not resolve the error for us. It was only after adding a 3rd cluster to DFM we saw the new cluster error once and then no further Host Login Errors from the new cluster yet both original clusters were erroring 20 - 30 times a day.

Yesterday I removed one of the clusters from DFM, carried out a purge on DFM and added it back in and the Host Login Errors have stopped for that cluster.

I have informed NetApp and they want me to hold off removing and re-adding the last of the erroring clusters so they can get some information out of the system.

Hopefully this will help others too if they still have issues.

Nomecks · ‎2015-10-06

This error is caused when the SSL certificates are replaced on your cluster without being activated. Please see this KB which outlines how to activate the SSL certificates: https://kb.netapp.com/support/index?page=content&id=1014389