I've had these alerts from two clusters in two separate datacentres at various times over the last week. That does sound like a network connectivity issue, but I would have thought there would be a "host down" message in that case, not a failed login?
If there's no way of getting more information from the OCUM logs, I'll see if I can get more info from the clusters themselves.
Using the default settings of UM 5.1, in order to produce a host down event/alert the host would need to be down for a significant amount of time. This behavior was changed in UM 5.2 under bug 614983 (no public report at this time).
OnCommand Unified Manager Core uses five different methods to identify if a host is down:
echo_snmp <== default
The default behavior for a host down monitor run is a ping using ICMP echo and then snmpwalk. UM will retry each method a pre-configured number of times with varying timeouts, as seen below.
While the ICMP retries and timeouts have remained the same over the 5.x code line, the SNMP timeouts were increased in UM 5.1 for 7DOT and even more for 5.1 cDOT installations.
Due to changes under bug 614983, if pingMonTimeout is set to less than or equal to 5 seconds, then the SNMP timeout for host down (pingmon) monitoring will be 5 seconds. If pingMonTimeout is set to a value greater than 5 seconds, then pingMonTimeout is used as the SNMP timeout. The global MonSNMPTimeout is used for all other SNMP connections. This applies to both 7DOT and cDOT versions of UM 5.2.
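To make the UM 5.2 rule concrete, here is a rough sketch of the timeout selection as described in this post. This is only an illustration: the function name and the constant are hypothetical and are not taken from the UM code or CLI.

```python
# Illustrative only: models the UM 5.2 rule described in this post,
# not the actual UM implementation. All names here are hypothetical.
PINGMON_SNMP_FLOOR_SECONDS = 5.0


def pingmon_snmp_timeout(ping_mon_timeout: float) -> float:
    """Return the SNMP timeout (seconds) used for host down monitoring."""
    if ping_mon_timeout <= PINGMON_SNMP_FLOOR_SECONDS:
        # pingMonTimeout at or below 5 seconds: a fixed 5 second timeout applies.
        return PINGMON_SNMP_FLOOR_SECONDS
    # pingMonTimeout above 5 seconds: it is used directly as the SNMP timeout.
    return ping_mon_timeout


if __name__ == "__main__":
    for value in (3, 5, 10):
        print(value, "->", pingmon_snmp_timeout(value))
```

In other words, the pingmon SNMP timeout is never lower than 5 seconds, and the global SNMP timeout no longer affects host down checks.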
Therefore, if a clustered ONTAP controller is down for less than 5 minutes, UM 5.1 will not report it as down, because the outage would not have exceeded the first timeout value for the host down check. If the ping method is changed to echo or http, the node down event is logged.
Changing the monSNMPTimeout to the 5.0.x default value of 5 seconds allows UM to determine the host down status with the echo_snmp method. However, it is not recommended that this value be adjusted lower than the default for cDOT UM 5.1 servers, as some SNMP transactions can take a few minutes to complete and should not be sent multiple times within 5 minutes.
Thought I would add to this thread as it's one of the few that helped me. I had a call open for 3 months with NetApp support on this error. We had set transport to HTTP, but that did not resolve the error for us. It was only after adding a 3rd cluster to DFM that we noticed the new cluster errored once and then produced no further Host Login Errors, yet both original clusters were erroring 20 to 30 times a day.
Yesterday I removed one of the clusters from DFM, carried out a purge on DFM and added it back in, and the Host Login Errors have stopped for that cluster.
I have informed NetApp and they want me to hold off removing and re-adding the last of the erroring clusters so they can get some information out of the system.
Hopefully this will help others too if they still have issues.