Active IQ Unified Manager Discussions
Hello everyone
I have a client that is receiving an appliance-down message even though the appliance is operating just fine. I am looking for advice on the best way to debug why DFM believes FCSAN03 is offline. I have requested the latest AutoSupport so I can look through the messages file for anything suspicious that might trigger this event.
If I understand DFM correctly, it *pings* the monitored appliance every minute. What other methods does it use to check whether the appliance is "still there"?
SAMPLE ALERT:
Subject: dfm: Critical event on fcsan03 (Host Down)
A Critical event at 09 Mar 13:22 Pacific Daylight Time on Active/Active Controller fcsan03.na.gilead.com:
The Active/Active Controller is down.
Click below to see the source of this event.
http://fcdfm01.na.gilead.com:8080/dfm/report/view/appliance-details/49543?group=0
Click below to see details of this event.
http://fcdfm01.na.gilead.com:8080/dfm/report/view/event-details/643567?group=0
Click below to see all events.
http://fcdfm01.na.gilead.com:8080/dfm/report/view/events?group=0
Click below to see the source of this event in the DataFabric Manager server and access the related management page in FilerView.
http://fcdfm01.na.gilead.com:8080/dfm/report/view/appliance-details/49543?group=0&autoload-fv=0
*** Event details follow.***
General Information
-------------------
DataFabric Manager Serial Number: 1-50-004533
Alarm Identifier: 12
Event Fields
-------------
Event Identifier: 643567
Event Name: Host Down
Event Description: Up/down status of an appliance
Event Severity: Critical
Event Timestamp: 09 Mar 13:22
Source of Event
---------------
Source Identifier: 49543
Source Name: fcsan03.na.gilead.com
Source Type: Active/Active Controller
Source Status: Critical
--NetApp DataFabric Manager
Emanuel,
You can look at pingmon.log to check whether the issue was seen on multiple systems; if so, it was probably due to a network glitch.
You can also check the current method used for ping.
# dfm option list hostPingMethod
Option Value
--------------- ------------------------------
hostPingMethod echo_snmp
The supported values are listed in the error message below, which DFM prints when you try to set an invalid value. echo_snmp is recommended as it checks both ICMP ping and SNMP.
# dfm option set hostPingMethod=xxx
Error: hostPingMethod: xxx is not a valid ping method;
valid values are echo, http, snmp, ndmp, echo_snmp, all.
Thanks,
Raj.
Hello Raja
Here is a sample from my DFM server. If I read this right, an entry is made here only when a state changes:
Oct 27 12:52:22 [DFMMonitor: INFO]: 10.42.131.81 fas960c1-ps1.10.41.70.91 Unknown -> Up (echo_snmp; errno 0)
Oct 27 13:17:22 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Unknown -> Up (echo_snmp; errno 0)
Oct 27 13:20:41 [dfmserver: INFO]: 10.42.131.81 fas960c1-ps1.10.41.70.91 Unknown -> Up (echo_snmp; errno 0)
Oct 27 14:26:12 [dfmserver: INFO]: 10.42.131.81 fas960c1-ps1.10.41.70.91 Unknown -> Up (echo_snmp; errno 0)
Oct 27 14:26:33 [dfmserver: INFO]: 10.42.131.81 fas960c1-ps1.10.41.70.91 Unknown -> Up (echo_snmp; errno 0)
Oct 31 11:18:38 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Up -> Down (echo_snmp; errno 0)
Oct 31 11:33:53 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Down -> Up (echo_snmp; errno 0)
Nov 26 10:34:38 [DFMMonitor: INFO]: 10.42.131.81 fas960c1-ps1.10.41.70.91 Up -> Down (echo_snmp; errno 0)
Nov 26 10:54:07 [DFMMonitor: INFO]: 10.42.131.81 fas960c1-ps1.10.41.70.91 Down -> Up (echo_snmp; errno 0)
Mar 03 18:00:58 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Up -> Down (echo_snmp; errno 0)
Mar 03 18:14:27 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Down -> Up (echo_snmp; errno 0)
Mar 05 12:02:28 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Up -> Down (echo_snmp; errno 0)
Mar 05 13:07:57 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Down -> Up (echo_snmp; errno 0)
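As a rough illustration of reading these state-change entries without grep, the sketch below parses lines in the format shown above and pairs each "Up -> Down" with the next "Down -> Up" for one host. The function names are made up for this example, the line format is inferred only from the sample, and because the log lines carry no year, a placeholder year is used, so durations are only meaningful when both entries fall in the same year.

```python
import re
from datetime import datetime

# Matches lines such as:
# Oct 31 11:18:38 [DFMMonitor: INFO]: 10.42.131.91 host.example Up -> Down (echo_snmp; errno 0)
# The pattern is inferred from the sample in this thread, not from product documentation.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[(?P<src>[^:\]]+): \w+\]: "
    r"(?P<ip>\S+) (?P<host>\S+) (?P<old>\w+) -> (?P<new>\w+) "
    r"\((?P<method>[^;]+); errno (?P<errno>\d+)\)$"
)


def parse_transitions(lines):
    """Return one dict per line that looks like a state-change entry."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            found.append(match.groupdict())
    return found


def outages(transitions, host):
    """Pair each 'Up -> Down' with the following 'Down -> Up' for one host.

    Returns a list of (down_time, up_time, duration) tuples.
    """
    result = []
    down_at = None
    for entry in transitions:
        if entry["host"] != host:
            continue
        # Assumption: the log has no year, so a placeholder year (2000) is used.
        # Month names are parsed with the default English locale.
        stamp = datetime.strptime("2000 " + entry["ts"], "%Y %b %d %H:%M:%S")
        if entry["new"] == "Down":
            down_at = stamp
        elif entry["old"] == "Down" and entry["new"] == "Up" and down_at is not None:
            result.append((down_at, stamp, stamp - down_at))
            down_at = None
    return result


if __name__ == "__main__":
    sample = [
        "Oct 31 11:18:38 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Up -> Down (echo_snmp; errno 0)",
        "Oct 31 11:33:53 [DFMMonitor: INFO]: 10.42.131.91 r200-ps1.ca2k3dom.ngslabs.netapp.com Down -> Up (echo_snmp; errno 0)",
    ]
    for down, up, duration in outages(parse_transitions(sample), "r200-ps1.ca2k3dom.ngslabs.netapp.com"):
        print(f"{down:%b %d %H:%M} down for {duration}")
```

To run it against a real log, read pingmon.log into a list of lines and pass that list to parse_transitions in place of the sample. Comparing the outage windows of several hosts can show whether they went down together, which would point at the network rather than a single appliance.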
What is the best way on the command line to list the events for a monitored host that relate to an appliance being down? The event type seems to be "host-down". The customer's DFM server runs Windows, so grep is not available. Is there a specific form of "dfm event list" that can filter on a particular event type?
106 174 host-up Normal 27 Oct 13:17
108 174 host-snmp-ok Normal 27 Oct 13:19
109 174 host:identity-ok Normal 27 Oct 13:19
110 174 host-login:failed Warning 27 Oct 13:19
111 174 temperature-normal Normal 27 Oct 13:19
112 174 fans:normal Normal 27 Oct 13:19
113 174 power-supplies:normal Normal 27 Oct 13:19
114 174 nvram-battery:fully-charged Normal 27 Oct 13:19
228 174 disks:spares-available Normal 27 Oct 13:21
230 174 disks:none-reconstructing Normal 27 Oct 13:21
231 174 host-communication:ok Normal 27 Oct 13:21
232 174 host-login:ok Normal 27 Oct 13:21
233 174 ndmp-credentials-status:good Normal 27 Oct 13:21
234 174 host-modified Information 27 Oct 13:21
235 174 ndmp-communication-status:up Normal 27 Oct 13:21
236 174 ndmp-up Normal 27 Oct 13:21
237 174 cluster-cfmode-config-ok Normal 27 Oct 13:22
238 174 cpu-load-normal Normal 27 Oct 13:22
267 174 host-role-discovered Information 27 Oct 13:24
268 174 host-usergroup-discovered Information 27 Oct 13:24
269 174 host-user-discovered Information 27 Oct 13:24
514 174 cpu-too-busy Warning 29 Oct 07:30
664 174 nvram-battery:normal Normal 30 Oct 13:38
515 174 cpu-load-normal Normal 29 Oct 07:35
877 174 volume-destroyed Information 31 Oct 09:39
939 174 host-down Critical 31 Oct 11:18
940 174 host-up Normal 31 Oct 11:33
1212 174 global-status:noncritical Error 03 Nov 16:05
1213 174 global-status:noncritical Error 03 Nov 16:17
1916 174 global-status:noncritical Error 16 Nov 06:23
6799 174 host-down Critical 03 Mar 18:00
6866 174 host-down Critical 05 Mar 12:02
2991 174 global-status:noncritical Error 30 Nov 03:55
6800 174 host-up Normal 03 Mar 18:14
6801 174 global-status:noncritical Error 03 Mar 18:15
6869 174 host-up Normal 05 Mar 13:07
6975 174 global-status:noncritical Error 08 Mar 01:02
332 111 snapvault-relationship:create-failed Error 27 Oct 14:18
358 111 snapvault-relationship:created Information 27 Oct 14:29
333 276 snapvault-relationship:create-failed Error 27 Oct 14:18
357 276 snapvault-relationship:created Information 27 Oct 14:29
334 277 snapvault-relationship:create-failed Error 27 Oct 14:18
354 277 snapvault-relationship:created Information 27 Oct 14:27
1 1 management-station:node-limit-ok Normal 14 Aug 2008
2 1 management-station:license-not-expired Normal 14 Aug 2008
3 1 management-station:enough-free-space Normal 14 Aug 2008
4 1 management-station:load-ok Normal 14 Aug 2008
7 1 traplistener-start-ok Normal 27 Oct 12:17
8 1 traplistener-start-ok Normal 27 Oct 12:25
5336 1 traplistener-start-ok Normal 26 Jan 13:30
5335 1 database-backup-succeeded Information 26 Jan 11:32
5337 1 traplistener-start-ok Normal 26 Jan 14:48
6073 1 database-backup-succeeded Information 13 Feb 17:36
C:\Program Files\NetApp\DataFabric\DFM\log>dfm event list 174
There are no events.
C:\Program Files\NetApp\DataFabric\DFM\log>dfm event list
You can use:
C:\Documents and Settings\smdev1>dfm report view events-history <filer> | findstr "Host Up Down" | findstr "Up Down"
Normal 22 Host Up 10 Mar 19:22 80
Thanks,
Raj.
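If findstr is too coarse, a small script can do the filtering instead. The sketch below is only an illustration: it assumes the event listing shown earlier in the thread has the columns event ID, source ID, event name, severity and timestamp (the listing has no header, so this is a guess), and the function name filter_events is invented for this example.

```python
import sys


def filter_events(lines, source_id, names=("host-down", "host-up")):
    """Keep events for one source whose name is in `names`, sorted by event ID.

    Assumed column order (not confirmed by the listing itself):
    event ID, source ID, event name, severity, timestamp.
    """
    rows = []
    for line in lines:
        parts = line.split(None, 4)
        # Skip prompts, headers and anything that does not start with a numeric ID.
        if len(parts) < 5 or not parts[0].isdigit():
            continue
        event_id, src, name, severity, when = parts
        if src == str(source_id) and name in names:
            rows.append((int(event_id), name, severity, when.strip()))
    return sorted(rows)


if __name__ == "__main__":
    # Usage: pipe a saved event listing into the script, giving the source ID
    # as the only argument.
    wanted_source = sys.argv[1] if len(sys.argv) > 1 else "174"
    for event_id, name, severity, when in filter_events(sys.stdin, wanted_source):
        print(event_id, name, severity, when)
```

Sorting by event ID puts each host-down next to the host-up that followed it, since the raw listing is not strictly ordered.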
Hi Emanuel,
Just wondering how (or if) you resolved the issue, as we are experiencing identical symptoms. Was it network related? Is there anywhere in DFM where the timeout period can be increased? In our case the alert is being generated against a filer at our DR site. Any info or help would be appreciated.
Thanks
Hi,
You could increase the timeout and retry settings using the following options:
[root@rhel1 config]# dfm options list | grep -i ping
hostPingMethod echo_snmp
pingMonInterval 1 minute
pingMonRetryDelay 3
pingMonTimeout 3
[root@rhel1 config]#
Thanks,