
Operations Manager SnapMirror lag alerts only triggered once

steamshipins1

Hi All,

I am currently setting up Ops Mgr to just monitor our filers (no protection set up) and have successfully got it alerting us for volume full events. I've set up an alert using snapmirror:out of date and have configured the discovered retention policies on our nightly mirrors to 24 hours.

All worked well on the first overnight run, in that it reported a lot of the mirrors as out of date as it was in the process of updating them (This was only to prove the alert. Once all is working OK, I'll exclude the monitoring between midnight and 06:00). I acknowledged the alerts but on the next run I didn't receive any alerts at all concerning the snapmirrors and there are no entries in the event log.

I also tested this out on a couple of hourly mirrors by changing the lag in the retention policy down to, say, 20 mins and it triggered an alert once only.

Any suggestions? Is there anything that I am missing?

As stated before, we are only using Ops Mgr for monitoring and are still relying on snapmirror.conf for scheduling. Obviously, alerts that only trigger once are not much good...

Thanks,

Simon.


kjag

Hi Simon,

Once you acknowledge the events, they are no longer treated as violations (because you acknowledged them), and no further alert will be raised until there is a state change.

Also, in Operations Manager an alarm can be configured to repeat by specifying the "repeat-notify" option.

-KJag

steamshipins1

Thanks kjag, that's how we have our volume full alerts set up: they repeat every 15 mins until acknowledged.

The issue here is that we can acknowledge them and they will trigger again the next time the volume reaches the threshold, but the same process doesn't seem to work for the snapmirror:out of date alert. The only difference between the two is that we are using vol almost full, which has a severity of warning, as opposed to snapmirror:out of date, which has a severity of error, but this shouldn't matter IMO.

adaikkap

Hi Simon,

As kjag said, did the state change from snapmirror out of date to nearly out of date or date ok? After you acknowledged the event, did the state of your SnapMirror ever change to any of the states below, other than out of date?

[root@ ~]# dfm eventtype list | grep -i sm.lag
snapmirror:date-ok                            Normal       sm.lag
snapmirror:deleted                            Information  sm.lag
snapmirror:nearly-out-of-date                 Warning      sm.lag
snapmirror:out-of-date                        Error        sm.lag
[root@ ~]#

If not, then no new event (and therefore no alert) will be triggered until there is a state change.
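The behaviour described above (an event is raised only when the lag state changes, not on every monitoring pass) can be modelled with a minimal Python sketch. This is an illustration only, not how DFM is implemented: the function names, the threshold parameters and the use of `>=` for the comparison are assumptions; only the three event names come from the `dfm eventtype list` output above.

```python
from enum import Enum


class LagState(Enum):
    # Event names as listed by "dfm eventtype list" in this thread.
    DATE_OK = "snapmirror:date-ok"
    NEARLY_OUT_OF_DATE = "snapmirror:nearly-out-of-date"
    OUT_OF_DATE = "snapmirror:out-of-date"


def classify(lag_hours, warn_hours, error_hours):
    """Map a lag value to a state (hypothetical thresholds, assumed >=)."""
    if lag_hours >= error_hours:
        return LagState.OUT_OF_DATE
    if lag_hours >= warn_hours:
        return LagState.NEARLY_OUT_OF_DATE
    return LagState.DATE_OK


def events_for(lag_samples, warn_hours, error_hours):
    """Return the events an edge-triggered monitor would raise.

    An event is emitted only when the state differs from the previous
    sample, so a relationship that stays out of date raises one event.
    """
    events = []
    previous = None
    for lag in lag_samples:
        state = classify(lag, warn_hours, error_hours)
        if state is not previous:
            events.append(state.value)
            previous = state
    return events
```

With this model, a mirror that goes out of date and stays there yields a single out-of-date event, whereas one that recovers in between yields a date-ok event followed by a fresh out-of-date event. That is why a missing date-ok event would suppress later alerts.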

Regards

adai

steamshipins1

Thanks, that's helped make things clearer.

No snapmirror:date-ok event appears in the logs once the snapmirror is updated and the lag goes back to normal, which is why the alarm is not triggered when/if the snapmirror lags again.

Any ideas?

Simon.

adaikkap

Hi Simon,

Can you get the output of the following CLI commands for the SnapMirror relationship that is lagging?

dfm report view events

dfm report view events-history

dfm host diag <filer id/ip> for both the source and destination filers of the SnapMirror relationship

dfm version, to show which version of dfm is running.

Regards

adai

steamshipins1

Adai,

Below is the output as requested. I've used just one vol as an example, but it's the same result on all our SnapMirror relationships:

dfm report view events 5053

Severity    Event ID Event                   Triggered    Ack'ed By Ack'ed       Source ID Source
----------- -------- ----------------------- ------------ --------- ------------ --------- --------------------------
Error       18042    SnapMirror: Out of Date 24 Jan 01:56                        5053      DRFILER1:/vol_vm_mobapps1d
Information 17493    SnapMirror: Discovered  19 Jan 12:09                        5053      DRFILER1:/vol_vm_mobapps1d

dfm report view events-history 5053

Severity    Event ID Event                         Triggered    Ack'ed By Ack'ed       Deleted By Deleted      Source ID Source
----------- -------- ----------------------------- ------------ --------- ------------ ---------- ------------ --------- --------------------------
Error       18042    SnapMirror: Out of Date       24 Jan 01:56                                                5053      DRFILER1:/vol_vm_mobapps1d
Normal      17494    SnapMirror: Date Ok           19 Jan 12:09                                                5053      DRFILER1:/vol_vm_mobapps1d
Information 17493    SnapMirror: Discovered        19 Jan 12:09                                                5053      DRFILER1:/vol_vm_mobapps1d
Normal      16602    Volume Space Reserve OK       19 Jan 12:04                                                5053      DRFILER1:/vol_vm_mobapps1d
Normal      16601    Volume Next Snapshot Possible 19 Jan 12:04                                                5053      DRFILER1:/vol_vm_mobapps1d
Normal      16600    Volume First Snapshot OK      19 Jan 12:04                                                5053      DRFILER1:/vol_vm_mobapps1d
Normal      16599    Inodes Utilization Normal     19 Jan 12:04                                                5053      DRFILER1:/vol_vm_mobapps1d
Normal      16598    Volume Space Normal           19 Jan 12:04                                                5053      DRFILER1:/vol_vm_mobapps1d
Normal      15067    Scheduled Snapshots Enabled   19 Jan 12:03                                                5053      DRFILER1:/vol_vm_mobapps1d
Normal      15066    Volume Online                 19 Jan 12:03                                                5053      DRFILER1:/vol_vm_mobapps1d

You will notice that the only Snapmirror: Date OK event for this vol is from when it was initially discovered.
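For anyone wanting to run the same check across many relationships, here is a rough Python sketch that scans pasted `dfm report view events-history` text and reports whether a Date Ok event follows the most recent Out of Date event. The regular expression, the function names and the assumption that event IDs increase over time are mine, not part of DFM; the sample rows are copied from the output above, truncated after the Triggered column.

```python
import re

# Rows copied from the events-history output in this thread (truncated).
REPORT = """\
Error       18042    SnapMirror: Out of Date       24 Jan 01:56
Normal      17494    SnapMirror: Date Ok           19 Jan 12:09
Information 17493    SnapMirror: Discovered        19 Jan 12:09
Normal      16602    Volume Space Reserve OK       19 Jan 12:04
"""

# severity, event ID, event name, then the "Triggered" timestamp.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(.+?)\s{2,}(\d{1,2} \w{3} \d{2}:\d{2})")


def snapmirror_events(report):
    """Return (event_id, name, triggered) for SnapMirror rows, oldest first.

    Ordering by event ID assumes IDs are allocated in increasing order.
    """
    rows = []
    for line in report.splitlines():
        m = ROW.match(line)
        if m and m.group(3).startswith("SnapMirror:"):
            rows.append((int(m.group(2)), m.group(3), m.group(4)))
    return sorted(rows)


def recovered_after_last_out_of_date(report):
    """True if a Date Ok follows the latest Out of Date, None if never out of date."""
    names = [name for _, name, _ in snapmirror_events(report)]
    if "SnapMirror: Out of Date" not in names:
        return None
    last = len(names) - 1 - names[::-1].index("SnapMirror: Out of Date")
    return "SnapMirror: Date Ok" in names[last + 1:]
```

For the rows above the function returns `False`, matching the observation that no Date Ok event was logged after the relationship went out of date.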

dfm host diag filer1 - This is the source filer

Network Connectivity
IP Address             xxx.xxx.xxx.xxx

Network                xxx.xxx.xxx.xxx/16 (last searched 24 Jan 10:51)
DNS Aliases            FILER1.simsl.com
DNS Addresses          xxx.xxx.xxx.xxx
SNMP Version in Use    SNMPv1
SNMPv1                 Passed (78 ms)
SNMP Community         public
SNMP sysName           FILER1.simsl.com
SNMP sysObjectID       .1.3.6.1.4.1.789.2.3 (Clustered Filer)
SNMP productId         1573839544
SNMPv3                 Failed: No SNMPv3 username specified.

SNMPv3 Auth Protocol
SNMPv3 Privacy Enabled No
SNMPv3 Username
ICMP Echo              Passed (0 ms)
HTTP                   Passed (0 ms)
NDMP Ping              Passed (port 10000, 0 ms)
NDMP Connect           Passed (1437 ms)
NDMP MD5 Passwd Check  Passed
RSH                    Skipped (rshBinary is empty in global option)
SSH                    Failed: Login not set for storage system FILER1.simsl.com (3673).
RLM                    Skipped (hostLogin and hostRLMAddress are empty)
XML                    Skipped (hostLogin is empty)

Host Details
According to:   DataFabric Manager server       Host
Host Name       FILER1.simsl.com               FILER1.simsl.com
System ID       1573839544                     1573839544
Model           FAS3240                        FAS3240
Type            Clustered Storage System       Clustered Storage System
OS Version      8.0.2 7-Mode                   8.0.2 7-Mode
Revisions       350,8.0.1,2.1.1                350,8.0.1,2.1.1

Monitoring Timestamps
Timestamp Name       Status   Interval     Default      Last Updated     Status   Error if older than ...
ccTimestamp          Normal   4 hours      4 hours                                24 Jan 06:52
cfTimestamp          Normal   5 minutes    5 minutes    24 Jan 10:51     Normal   24 Jan 10:47
clusterTimestamp     Normal   15 minutes   15 minutes                             24 Jan 10:37
cpuTimestamp         Normal   5 minutes    5 minutes    24 Jan 10:49     Normal   24 Jan 10:47
dfTimestamp          Error    15 minutes   30 minutes   24 Jan 10:45     Normal   24 Jan 10:37
diskTimestamp        Normal   4 hours      4 hours      24 Jan 10:50     Normal   24 Jan 06:52
envTimestamp         Normal   5 minutes    5 minutes    24 Jan 10:51     Normal   24 Jan 10:47
fsTimestamp          Normal   15 minutes   15 minutes   24 Jan 10:44     Normal   24 Jan 10:37
hostPingTimestamp    Normal   1 minute     1 minute     24 Jan 10:51     Normal   24 Jan 10:51
ifTimestamp          Normal   15 minutes   15 minutes   24 Jan 10:44     Normal   24 Jan 10:37
licenseTimestamp     Normal   4 hours      4 hours      24 Jan 10:41     Normal   24 Jan 06:52
lunTimestamp         Normal   30 minutes   30 minutes   24 Jan 10:39     Normal   24 Jan 10:22
opsTimestamp         Normal   10 minutes   10 minutes   24 Jan 10:50     Normal   24 Jan 10:42
qtreeTimestamp       Normal   8 hours      8 hours      24 Jan 10:39     Normal   24 Jan 02:52
rbacTimestamp        Normal   1 day        1 day        24 Jan 10:33     Normal   23 Jan 10:52
userQuotaTimestamp   Normal   1 day        1 day                                  23 Jan 10:52
sanhostTimestamp     Normal   5 minutes    5 minutes                              24 Jan 10:47
snapmirrorTimestamp  Error    5 minutes    30 minutes   24 Jan 10:51     Normal   24 Jan 10:47
snapshotTimestamp    Normal   30 minutes   30 minutes   24 Jan 10:43     Normal   24 Jan 10:22
statusTimestamp      Normal   10 minutes   10 minutes   24 Jan 10:40     Error    24 Jan 10:42
sysInfoTimestamp     Normal   1 hour       1 hour       24 Jan 10:06     Normal   24 Jan 09:52
svTimestamp          Normal   30 minutes   30 minutes   24 Jan 10:41     Normal   24 Jan 10:22
svMonTimestamp       Normal   8 hours      8 hours                                24 Jan 02:52
xmlQtreeTimestamp    Normal   8 hours      8 hours                                24 Jan 02:52
vFilerTimestamp      Normal   1 hour       1 hour       24 Jan 10:08     Normal   24 Jan 09:52
vserverTimestamp     Normal   1 hour       1 hour                                 24 Jan 09:52

Performance Advisor Checklist
perfAdvisorEnabled     Passed
hostType               Passed
hostRevision           Passed
hostLogin              Failed (hostLogin is empty)
perfAdvisorTransport   Passed

dfm host diag filer2 - this is the destination filer

Network Connectivity
IP Address             xxx.xxx.xxx.xxx
Network                xxx.xxx.xxx.xxx/16 (last searched 24 Jan 10:54)
DNS Aliases            FILER2.simsl.com
DNS Addresses          xxx.xxx.xxx.xxx
SNMP Version in Use    SNMPv1
SNMPv1                 Passed (78 ms)
SNMP Community         public
SNMP sysName           FILER2.simsl.com
SNMP sysObjectID       .1.3.6.1.4.1.789.2.3 (Clustered Filer)
SNMP productId         1573766421
SNMPv3                 Failed: No SNMPv3 username specified.

SNMPv3 Auth Protocol
SNMPv3 Privacy Enabled No
SNMPv3 Username
ICMP Echo              Passed (0 ms)
HTTP                   Passed (0 ms)
NDMP Ping              Passed (port 10000, 0 ms)
NDMP Connect           Passed (1437 ms)
NDMP MD5 Passwd Check  Passed
RSH                    Skipped (rshBinary is empty in global option)
SSH                    Failed: Login not set for storage system FILER2.simsl.com (3675).
RLM                    Skipped (hostLogin and hostRLMAddress are empty)
XML                    Skipped (hostLogin is empty)

Host Details
According to:   DataFabric Manager server       Host
Host Name       FILER2.simsl.com               FILER2.simsl.com
System ID       1573766421                     1573766421
Model           FAS3240                        FAS3240
Type            Clustered Storage System       Clustered Storage System
OS Version      8.0.2 7-Mode                   8.0.2 7-Mode
Revisions       350,8.0.1,2.1.1                350,8.0.1,2.1.1

Monitoring Timestamps
Timestamp Name       Status   Interval     Default      Last Updated     Status   Error if older than ...
ccTimestamp          Normal   4 hours      4 hours                                24 Jan 06:54
cfTimestamp          Normal   5 minutes    5 minutes    24 Jan 10:53     Normal   24 Jan 10:49
clusterTimestamp     Normal   15 minutes   15 minutes                             24 Jan 10:39
cpuTimestamp         Normal   5 minutes    5 minutes    24 Jan 10:50     Normal   24 Jan 10:49
dfTimestamp          Error    15 minutes   30 minutes   24 Jan 10:44     Normal   24 Jan 10:39
diskTimestamp        Normal   4 hours      4 hours      24 Jan 09:02     Normal   24 Jan 06:54
envTimestamp         Normal   5 minutes    5 minutes    24 Jan 10:50     Normal   24 Jan 10:49
fsTimestamp          Normal   15 minutes   15 minutes   24 Jan 10:38     Warning  24 Jan 10:39
hostPingTimestamp    Normal   1 minute     1 minute     24 Jan 10:54     Normal   24 Jan 10:53
ifTimestamp          Normal   15 minutes   15 minutes   24 Jan 10:44     Normal   24 Jan 10:39
licenseTimestamp     Normal   4 hours      4 hours      24 Jan 09:02     Normal   24 Jan 06:54
lunTimestamp         Normal   30 minutes   30 minutes   24 Jan 10:47     Normal   24 Jan 10:24
opsTimestamp         Normal   10 minutes   10 minutes   24 Jan 10:49     Normal   24 Jan 10:44
qtreeTimestamp       Normal   8 hours      8 hours      24 Jan 04:26     Normal   24 Jan 02:54
rbacTimestamp        Normal   1 day        1 day        23 Jan 12:08     Normal   23 Jan 10:54
userQuotaTimestamp   Normal   1 day        1 day                                  23 Jan 10:54
sanhostTimestamp     Normal   5 minutes    5 minutes                              24 Jan 10:49
snapmirrorTimestamp  Error    5 minutes    30 minutes   24 Jan 10:50     Normal   24 Jan 10:49
snapshotTimestamp    Normal   30 minutes   30 minutes   24 Jan 10:47     Normal   24 Jan 10:24
statusTimestamp      Normal   10 minutes   10 minutes   24 Jan 10:47     Normal   24 Jan 10:44
sysInfoTimestamp     Normal   1 hour       1 hour       24 Jan 10:07     Normal   24 Jan 09:54
svTimestamp          Normal   30 minutes   30 minutes   24 Jan 10:37     Normal   24 Jan 10:24
svMonTimestamp       Normal   8 hours      8 hours                                24 Jan 02:54
xmlQtreeTimestamp    Normal   8 hours      8 hours                                24 Jan 02:54
vFilerTimestamp      Normal   1 hour       1 hour       24 Jan 10:07     Normal   24 Jan 09:54
vserverTimestamp     Normal   1 hour       1 hour                                 24 Jan 09:54

Performance Advisor Checklist
perfAdvisorEnabled     Passed
hostType               Passed
hostRevision           Passed
hostLogin              Failed (hostLogin is empty)
perfAdvisorTransport   Passed

dfm version

dfbm.exe         5.0.0.7636 (5.0)
dfdrm.exe        5.0.0.7636 (5.0)
dfpm.exe         5.0.0.7636 (5.0)
dfm.exe          5.0.0.7636 (5.0)
dfmcheck.exe     5.0.0.7636 (5.0)
dfmconfig.exe    5.0.0.7636 (5.0)
dfmconsole.exe   5.0.0.7636 (5.0)
dfmmonitor.exe   5.0.0.7636 (5.0)
dfmperf.exe      5.0.0.7636 (5.0)
dfmscheduler.exe 5.0.0.7636 (5.0)
dfmserver.exe    5.0.0.7636 (5.0)
dfmwatchdog.exe  5.0.0.7636 (5.0)
eventd.exe       5.0.0.7636 (5.0)
grapher.exe      5.0.0.7636 (5.0)

Hope this helps!

Regards,

Simon.

adaikkap

Hi Simon,

There is an out-of-date event on 24 Jan:

Error       18042    SnapMirror: Out of Date   24 Jan 01:56    5053    DRFILER1:/vol_vm_mobapps1d

What I also see is that you have changed the SnapMirror monitoring interval from its default of 30 minutes to 5 minutes. Please reset it to the default. SnapMirror is a heavy monitor and does a lot of work.

The dfm host diag output shows an Error status for SnapMirror monitoring:

snapmirrorTimestamp Error    5 minutes    30 minutes   24 Jan 10:50     Normal   24 Jan 10:49
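As a quick way to spot every monitor whose interval has been changed from its default, the Monitoring Timestamps section of `dfm host diag` can be scanned with a short Python sketch like the one below. It assumes the columns are separated by two or more spaces, as in the output pasted in this thread; the function name is hypothetical and the sample rows are taken from the destination filer's output.

```python
import re

# A few rows from the "Monitoring Timestamps" section pasted in this thread.
DIAG = """\
cfTimestamp          Normal   5 minutes    5 minutes    24 Jan 10:53     Normal   24 Jan 10:49
dfTimestamp          Error    15 minutes   30 minutes   24 Jan 10:44     Normal   24 Jan 10:39
snapmirrorTimestamp  Error    5 minutes    30 minutes   24 Jan 10:50     Normal   24 Jan 10:49
svMonTimestamp       Normal   8 hours      8 hours                                24 Jan 02:54
"""


def non_default_intervals(diag):
    """Return (name, interval, default) for rows where the two differ."""
    changed = []
    for line in diag.splitlines():
        # Columns are assumed to be separated by runs of 2+ spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 4 or not cols[0].endswith("Timestamp"):
            continue
        name, _status, interval, default = cols[:4]
        if interval != default:
            changed.append((name, interval, default))
    return changed
```

On the sample rows this picks out `snapmirrorTimestamp` (5 minutes against a default of 30 minutes) and also `dfTimestamp` (15 minutes against 30 minutes), both of which show an Error status in the pasted output.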

Regards

adai

steamshipins1

adai,

As you will see from the event history, this is the first occurrence of this event. There is no subsequent date OK event.

I've set the snapmirror monitor back to 30 minutes as suggested and dfm diag is now showing as normal. I changed one of the hourly policies to trigger an event if the lag goes over 30 mins, ran a manual snapmirror update and then let the lag go over 30 mins. No events, either Snapmirror:out of date or date ok, were written to the log.

Going back to the original volume, you can see from the below that the volume is showing as out of date in the first screenshot of the Volume Details whereas the second screenshot of the retention policy shows a lag of 13.99 hours which is nowhere near the 1 day threshold.
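The comparison being made here is simple arithmetic, sketched below in Python. The strict greater-than comparison is an assumption about how the threshold is applied, and the function name is hypothetical; the 13.99 hour lag and the 1 day threshold are the figures quoted above.

```python
from datetime import timedelta


def is_out_of_date(lag, threshold):
    """True when the lag exceeds the threshold (strict comparison assumed)."""
    return lag > threshold


lag = timedelta(hours=13.99)   # lag reported for the relationship
threshold = timedelta(days=1)  # out-of-date threshold on the policy

print(is_out_of_date(lag, threshold))
```

Under that assumption the relationship should not currently count as out of date, so the Out of Date status on the volume looks like a stale state left over from the earlier event rather than a fresh evaluation.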

Thanks,

Simon
