
Best ways to debug a false alarm on appliance down

Hello everyone

I have a client that is receiving an appliance down message even though the appliance is operating just fine. I am looking for advice on the best way to debug why DFM believes FCSAN03 is offline. I have requested the latest AutoSupport so I can look through the messages file for anything suspicious that would trigger this event.

If I understand DFM right, we *ping* the monitored appliance every minute. What other methods do we use to check whether the appliance is "still there"?

SAMPLE ALERT:

Subject: dfm: Critical event on fcsan03 (Host Down)

A Critical event at 09 Mar 13:22 Pacific Daylight Time on Active/Active Controller fcsan03.na.gilead.com:

The Active/Active Controller is down.

Click below to see the source of this event.

http://fcdfm01.na.gilead.com:8080/dfm/report/view/appliance-details/49543?group=0

Click below to see details of this event.

http://fcdfm01.na.gilead.com:8080/dfm/report/view/event-details/643567?group=0

Click below to see all events.

http://fcdfm01.na.gilead.com:8080/dfm/report/view/events?group=0

Click below to see the source of this event in the DataFabric Manager server and access the related management page in FilerView.

http://fcdfm01.na.gilead.com:8080/dfm/report/view/appliance-details/49543?group=0&autoload-fv=0

*** Event details follow.***

General Information

-------------------

DataFabric Manager Serial Number: 1-50-004533

Alarm Identifier: 12

Event Fields

-------------

Event Identifier: 643567

Event Name: Host Down

Event Description: Up/down status of an appliance

Event Severity: Critical

Event Timestamp: 09 Mar 13:22

Source of Event

---------------

Source Identifier: 49543

Source Name: fcsan03.na.gilead.com

Source Type: Active/Active Controller

Source Status: Critical

--NetApp DataFabric Manager



Re: Best ways to debug a false alarm on appliance down

Emanuel,

You can look at pingmon.log to check whether the issue was seen on multiple systems, which would point to a network glitch.

You can also check the current method used for ping.

# dfm option list hostPingMethod
Option          Value                        
--------------- ------------------------------
hostPingMethod  echo_snmp

The supported values are listed in the error output below. echo_snmp is recommended because it checks both ICMP ping and SNMP.

# dfm option set hostPingMethod=xxx
Error: hostPingMethod: xxx is not a valid ping method;
    valid values are echo, http, snmp, ndmp, echo_snmp, all.

Thanks,

Raj.

Re: Best ways to debug a false alarm on appliance down

Hello Raja

Here is a sample from my DFM server. If I read this right, an entry is made here only when a state changes.

Oct 27 12:52:22 [DFMMonitor: INFO]: 10.42.131.81    fas960c1-ps1.10.41.70.91  Unknown -> Up (echo_snmp; errno 0)
Oct 27 13:17:22 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Unknown -> Up (echo_snmp; errno 0)
Oct 27 13:20:41 [dfmserver: INFO]: 10.42.131.81    fas960c1-ps1.10.41.70.91  Unknown -> Up (echo_snmp; errno 0)
Oct 27 14:26:12 [dfmserver: INFO]: 10.42.131.81    fas960c1-ps1.10.41.70.91  Unknown -> Up (echo_snmp; errno 0)
Oct 27 14:26:33 [dfmserver: INFO]: 10.42.131.81    fas960c1-ps1.10.41.70.91  Unknown -> Up (echo_snmp; errno 0)
Oct 31 11:18:38 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Up -> Down (echo_snmp; errno 0)
Oct 31 11:33:53 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Down -> Up (echo_snmp; errno 0)
Nov 26 10:34:38 [DFMMonitor: INFO]: 10.42.131.81    fas960c1-ps1.10.41.70.91  Up -> Down (echo_snmp; errno 0)
Nov 26 10:54:07 [DFMMonitor: INFO]: 10.42.131.81    fas960c1-ps1.10.41.70.91  Down -> Up (echo_snmp; errno 0)
Mar 03 18:00:58 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Up -> Down (echo_snmp; errno 0)
Mar 03 18:14:27 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Down -> Up (echo_snmp; errno 0)
Mar 05 12:02:28 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Up -> Down (echo_snmp; errno 0)
Mar 05 13:07:57 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Down -> Up (echo_snmp; errno 0)
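
Since the log only records state changes, each Up -> Down entry can be paired with the next Down -> Up entry for the same host to see how long DFM considered it down. Below is a minimal Python sketch of that idea. The regular expression is inferred only from the sample lines above, and because the log carries no year, the `year` argument is an assumption supplied by the caller. The function names are hypothetical and not part of DFM.

```python
import re
from datetime import datetime

# Pattern inferred from the sample pingmon.log lines above (an assumption,
# not a documented format).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[(?P<proc>[^:\]]+): \w+\]: "
    r"(?P<ip>\S+)\s+(?P<host>\S+)\s+(?P<old>\w+) -> (?P<new>\w+) "
    r"\((?P<method>[^;]+); errno (?P<errno>\d+)\)"
)


def parse_transitions(lines, year):
    """Return (timestamp, host, old_state, new_state) for each matching line.

    The log has no year field, so the caller must supply one.
    """
    out = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a state change
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        out.append((ts, m["host"], m["old"], m["new"]))
    return out


def outages(transitions, host):
    """Pair each transition to Down with the next transition to Up for one host."""
    result = []
    down_at = None
    for ts, h, _old, new in transitions:
        if h != host:
            continue
        if new == "Down":
            down_at = ts
        elif new == "Up" and down_at is not None:
            result.append((down_at, ts, ts - down_at))
            down_at = None
    return result


sample = [
    "Oct 31 11:18:38 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Up -> Down (echo_snmp; errno 0)",
    "Oct 31 11:33:53 [DFMMonitor: INFO]: 10.42.131.91    r200-ps1.ca2k3dom.ngslabs.netapp.com Down -> Up (echo_snmp; errno 0)",
]

# The year 2009 is an arbitrary placeholder; durations within one year do not depend on it.
for start, end, length in outages(parse_transitions(sample, 2009),
                                  "r200-ps1.ca2k3dom.ngslabs.netapp.com"):
    print(start, end, length)
```

Comparing the outage windows of several hosts this way shows quickly whether they went down at the same time, which would suggest a network problem rather than a problem on one appliance.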

What is the best way on the command line to list the events for a monitored host with regard to an appliance down? The event type seems to be "host-down". The customer's DFM server runs Windows, so grep is not available. Is there a specific dfm command, such as "dfm event list", that can filter on a specific event type?

   106    174 host-up                        Normal     27 Oct 13:17
   108    174 host-snmp-ok                   Normal     27 Oct 13:19
   109    174 host:identity-ok               Normal     27 Oct 13:19
   110    174 host-login:failed              Warning    27 Oct 13:19
   111    174 temperature-normal             Normal     27 Oct 13:19
   112    174 fans:normal                    Normal     27 Oct 13:19
   113    174 power-supplies:normal          Normal     27 Oct 13:19
   114    174 nvram-battery:fully-charged    Normal     27 Oct 13:19
   228    174 disks:spares-available         Normal     27 Oct 13:21
   230    174 disks:none-reconstructing      Normal     27 Oct 13:21
   231    174 host-communication:ok          Normal     27 Oct 13:21
   232    174 host-login:ok                  Normal     27 Oct 13:21
   233    174 ndmp-credentials-status:good   Normal     27 Oct 13:21
   234    174 host-modified                  Information 27 Oct 13:21
   235    174 ndmp-communication-status:up   Normal     27 Oct 13:21
   236    174 ndmp-up                        Normal     27 Oct 13:21
   237    174 cluster-cfmode-config-ok       Normal     27 Oct 13:22
   238    174 cpu-load-normal                Normal     27 Oct 13:22
   267    174 host-role-discovered           Information 27 Oct 13:24
   268    174 host-usergroup-discovered      Information 27 Oct 13:24
   269    174 host-user-discovered           Information 27 Oct 13:24
   514    174 cpu-too-busy                   Warning    29 Oct 07:30
   664    174 nvram-battery:normal           Normal     30 Oct 13:38
   515    174 cpu-load-normal                Normal     29 Oct 07:35
   877    174 volume-destroyed               Information 31 Oct 09:39
   939    174 host-down                      Critical   31 Oct 11:18
   940    174 host-up                        Normal     31 Oct 11:33
  1212    174 global-status:noncritical      Error      03 Nov 16:05
  1213    174 global-status:noncritical      Error      03 Nov 16:17
  1916    174 global-status:noncritical      Error      16 Nov 06:23
  6799    174 host-down                      Critical   03 Mar 18:00
  6866    174 host-down                      Critical   05 Mar 12:02
  2991    174 global-status:noncritical      Error      30 Nov 03:55
  6800    174 host-up                        Normal     03 Mar 18:14
  6801    174 global-status:noncritical      Error      03 Mar 18:15
  6869    174 host-up                        Normal     05 Mar 13:07
  6975    174 global-status:noncritical      Error      08 Mar 01:02
   332    111 snapvault-relationship:create-failed Error      27 Oct 14:18
   358    111 snapvault-relationship:created Information 27 Oct 14:29
   333    276 snapvault-relationship:create-failed Error      27 Oct 14:18
   357    276 snapvault-relationship:created Information 27 Oct 14:29
   334    277 snapvault-relationship:create-failed Error      27 Oct 14:18
   354    277 snapvault-relationship:created Information 27 Oct 14:27
     1      1 management-station:node-limit-ok Normal     14 Aug  2008
     2      1 management-station:license-not-expired Normal     14 Aug  2008
     3      1 management-station:enough-free-space Normal     14 Aug  2008
     4      1 management-station:load-ok     Normal     14 Aug  2008
     7      1 traplistener-start-ok          Normal     27 Oct 12:17
     8      1 traplistener-start-ok          Normal     27 Oct 12:25
  5336      1 traplistener-start-ok          Normal     26 Jan 13:30
  5335      1 database-backup-succeeded      Information 26 Jan 11:32
  5337      1 traplistener-start-ok          Normal     26 Jan 14:48
  6073      1 database-backup-succeeded      Information 13 Feb 17:36
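
A listing like the one above can also be filtered with a short script once it has been saved to a text file, which works the same on Windows and Linux. The Python sketch below assumes, based only on the sample rows, that the columns are event ID, source ID, event name, severity, and timestamp. That column interpretation and the function name `filter_events` are assumptions, not something documented by DFM.

```python
def filter_events(lines, source_id, names=("host-down", "host-up")):
    """Keep rows for one source ID whose event name is in `names`.

    Assumed column order (inferred from the sample listing):
    event ID, source ID, event name, severity, timestamp.
    """
    rows = []
    for line in lines:
        parts = line.split(None, 4)  # timestamp keeps its internal spaces
        if len(parts) < 5 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip headers, blank lines, and anything unexpected
        event_id, src, name, severity, when = parts
        if src == str(source_id) and name in names:
            rows.append((int(event_id), name, severity, when.strip()))
    return rows


sample = [
    "   939    174 host-down                      Critical   31 Oct 11:18",
    "   940    174 host-up                        Normal     31 Oct 11:33",
    "  1212    174 global-status:noncritical      Error      03 Nov 16:05",
    "  6799    174 host-down                      Critical   03 Mar 18:00",
    "   332    111 snapvault-relationship:create-failed Error      27 Oct 14:18",
]

for row in filter_events(sample, 174):
    print(row)
```

To use it on real data, the output of "dfm event list" could be redirected to a file and the file's lines passed to the function in place of `sample`.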

C:\Program Files\NetApp\DataFabric\DFM\log>dfm event list 174
There are no events.

C:\Program Files\NetApp\DataFabric\DFM\log>dfm event list

Re: Best ways to debug a false alarm on appliance down

You can use:

C:\Documents and Settings\smdev1>dfm report view events-history <filer> | findstr "Host Up Down" | findstr "Up Down"
Normal      22       Host Up                                                   10 Mar 19:22                                                80

Thanks,

Raj.

Re: Best ways to debug a false alarm on appliance down

Hi Emanuel,

Just wondering how (or if) you resolved the issue, as we are experiencing identical symptoms. Was it network related? Is there anywhere in DFM where the timeout period can be increased? In our case the alert is being generated against a filer at our DR site. Any info or help would be appreciated.

Thanks

Re: Best ways to debug a false alarm on appliance down

Hi,

You could increase the timeout and the retry delay using the following options.

[root@rhel1 config]# dfm options list | grep -i ping
hostPingMethod                        echo_snmp
pingMonInterval                       1 minute
pingMonRetryDelay                     3
pingMonTimeout                        3
[root@rhel1 config]#

Re: Best ways to debug a false alarm on appliance down

Thanks,