
NetApp cDOT Nagios-Plugins

ALEEX4242

Hello,

Since we didn't find any useful Cluster-Mode plugins for Nagios but needed to monitor our new Cluster-Mode system with Nagios, we started building some Nagios plugins for checking our cDOT.

The first scripts can be found here: https://github.com/aleex42/netapp-cdot-nagios

The checks are working great and the first failures have been noticed using them - but now we need more feedback from other customers.

So feel free to try these checks on your setup, contribute to the code, or just give us some feedback.

Thanks in advance for any feedback, and have fun with these checks!

(There are also a couple of collectd plugins for capacity monitoring in another Git repository, and the collection of scripts is growing very fast, so check the repositories again in a few days.)

Greets,

Alex

8 REPLIES

bugfinder

Nagios also has a method of supplying performance data from a check, in the format "$output_string | $name1=$value1 ; $name2=$value2 ; ...", so that you can create graphs with pnp4nagios.

For check_cdot_aggr.pl I've changed this:

--- a/check_cdot_aggr.pl
+++ b/check_cdot_aggr.pl
@@ -50,6 +49,7 @@ if ($output->results_errno != 0) {
}

my $message;
+my @perf;
my $critical = 0;
my $warning = 0;

@@ -71,16 +71,20 @@ foreach my $aggr (@result){
     else {
         $message .= $aggr_name . " (" . $percent . "%)";
     }
+
+    push @perf,"'$aggr_name'=$percent";
}

+my $perf = "|".join("; ",@perf);
+
if($critical > 0){
-    print "CRITICAL: " . $message . "\n";
+    print "CRITICAL: " . $message . "$perf\n";
     exit 2;
} elsif($warning > 0){
-    print "WARNING: " . $message . "\n";
+    print "WARNING: " . $message . "$perf\n";
     exit 1;
} else {
-    print "OK: " . $message . "\n";
+    print "OK: " . $message . "$perf\n";
     exit 0;
}
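
For reference, the Nagios plugin guidelines describe each performance data item as 'label'=value[UOM];[warn];[crit];[min];[max], with multiple items separated by spaces. Below is a minimal sketch of building such an output line. It is written in Python rather than the Perl of the plugin, and the function name, thresholds and aggregate names are made up for illustration.

```python
def format_plugin_output(status, message, usage_percent, warn=80, crit=90):
    """Build a Nagios-style output line: "STATUS: message | perfdata".

    usage_percent maps a label (e.g. an aggregate name) to its used percentage.
    warn and crit are illustrative defaults, not values taken from the plugin.
    """
    items = []
    for label, percent in sorted(usage_percent.items()):
        # 'label'=value[UOM];warn;crit;min;max (label quoted in case it contains spaces)
        items.append(f"'{label}'={percent}%;{warn};{crit};0;100")
    # Multiple perfdata items are separated by spaces
    return f"{status}: {message} | {' '.join(items)}"


# Hypothetical aggregate names and values
print(format_plugin_output("OK", "aggr0 (45%), aggr1 (62%)", {"aggr0": 45, "aggr1": 62}))
```

Everything after the pipe is treated as performance data, which is what graphing add-ons such as pnp4nagios pick up.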

I've also got plugins ready for CPU usage (dynamically getting the list of CPUs of all nodes and their loads), including this perf data. Right now I'm experimenting with other counters from the system object and have a script ready for that, but I don't know where to put these yet. They might also end up on GitHub.

ALEEX4242

Yeah, sorry for the late response.

Someone already added the performance data - maybe that was you 🙂

For your CPU checks, feel free to commit them to the repository, and don't forget to add your name.

heinowalther

Hi there,

Thanks for starting this. I was just looking for something like this for some installations, so I will be testing it very soon, and maybe even expanding on it, because I need to monitor the lag on SnapMirror relationships. For example, if a SnapMirror relationship lags by more than 12 hours, a warning would be reported, and by more than 24 hours, a critical event.
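
The 12/24 hour rule above could be sketched roughly as follows. This is Python rather than the Perl of the existing plugins, it assumes the lag is already available in seconds per relationship, and the function names and relationship names are hypothetical. It does not query the cluster.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def check_lag(lag_seconds, warn_hours=12, crit_hours=24):
    """Return (exit_code, status_text) for one relationship's lag."""
    lag_hours = lag_seconds / 3600
    if lag_hours > crit_hours:
        return CRITICAL, "CRITICAL"
    if lag_hours > warn_hours:
        return WARNING, "WARNING"
    return OK, "OK"


def check_relationships(lags, warn_hours=12, crit_hours=24):
    """lags maps a relationship name to its lag in seconds.

    Returns the worst exit code plus a one-line message for Nagios.
    """
    worst = OK
    problems = []
    for name, lag_seconds in sorted(lags.items()):
        code, text = check_lag(lag_seconds, warn_hours, crit_hours)
        if code != OK:
            problems.append(f"{name} lag {lag_seconds / 3600:.1f}h ({text})")
        worst = max(worst, code)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[worst]
    detail = ", ".join(problems) if problems else f"{len(lags)} relationships within limits"
    return worst, f"{label}: {detail}"


# Hypothetical relationship names with lag values in seconds
code, message = check_relationships({"vs1:vol_a": 3600, "vs1:vol_b": 90000})
print(message)
```

A real plugin would print the message and then exit with the returned code.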

Also I need checks for:

Protocols: CIFS/iSCSI/NFS/FCP (service running or not)

Time: Check the time against a time server (CIFS may break if the skew exceeds 5 minutes)

LIF: Report on LIFs which are not on their home controller

CPU: Check CPU load (there has just been a bug in CIFS which loaded the CPU to about 60% all the time)

IOPS: Check IOPS load

vServers: General health Status

PING: Test a ping from inside a vserver towards a host (e.g. ping your ESX server's NFS interfaces)

LUNs: Check for offline LUNs and other LUN issues
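
As an illustration of how small some of these checks can be, here is a rough sketch of the LIF item from the list above, in Python rather than the Perl of the existing plugins. It assumes the LIF data has already been fetched. The dictionary keys "name", "home_node" and "current_node" are hypothetical, as is the choice of WARNING as the severity.

```python
def lifs_not_home(lifs):
    """Return the names of LIFs whose current node differs from their home node.

    Each LIF is a dict with the hypothetical keys
    "name", "home_node" and "current_node".
    """
    return [lif["name"] for lif in lifs if lif["current_node"] != lif["home_node"]]


def check_lifs(lifs):
    """Return (exit_code, message); WARNING is an assumed severity."""
    away = lifs_not_home(lifs)
    if away:
        return 1, "WARNING: LIFs not at home: " + ", ".join(away)
    return 0, f"OK: all {len(lifs)} LIFs are at home"


# Hypothetical sample data
sample = [
    {"name": "nfs_lif1", "home_node": "node-01", "current_node": "node-01"},
    {"name": "cifs_lif1", "home_node": "node-01", "current_node": "node-02"},
]
print(check_lifs(sample)[1])
```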

Something I think should be fixed is that the username and password have to be passed on each command. Maybe an encrypted credential file could be used?
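
One simple variant of this idea, sketched in Python rather than the Perl of the plugins: keep the credentials in a file that only its owner can access and read them from there instead of the command line. Note that this is a plain-text file protected by file permissions, not an encrypted one. The file layout, the section name "netapp" and the function name are assumptions for illustration.

```python
import configparser
import os
import stat


def load_credentials(path, section="netapp"):
    """Read username and password from an INI-style file.

    Refuses to use the file if group or others have any access to it.
    """
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} must not be accessible by group or others")
    # interpolation=None so that '%' characters in a password are taken literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    return parser[section]["username"], parser[section]["password"]
```

The file would look like this, with permissions set to 600:

```
[netapp]
username = monitor
password = secret
```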

Kind regards,

Heino Walther

ALEEX4242

Hello,

Do you use the 7-Mode or the cDOT checks?

All your feature requests sound good, but I need some time to complete all of them.

For SnapMirror relationships, there is already a check for both systems, but I think rewriting it and adding some features would be nice 🙂

Anyway, feel free to contribute to the scripts and supply patches 🙂

I'm looking forward to your feedback from testing the scripts.

Thanks so far,

Regards

Alex

heinowalther

I will be using both 7-Mode and cDOT 🙂

There are already some good plugins for 7-Mode that cover some of the additions I mentioned, so maybe we could borrow a bit from them 🙂

/Heino

heinowalther

Hi again,

Right now I have implemented this plugin that I found on the Nagios Exchange:

http://exchange.nagios.org/directory/Plugins/Hardware/Storage-Systems/SAN-and-NAS/NetApp/Netapp-Cluster-Mode/details

It works quite well, but one downside is the lack of performance counters, so you cannot create any graphs with it.

But maybe it can be an inspiration for further work on these scripts. I will give them a try in the coming days.

/Heino

ALEEX4242

Hello again,

For graphs, you should have a look at: https://github.com/aleex42/Collectd-Plugins-NetApp

The scripts are for collectd, but I think you could easily adapt them for use with other graphing software.

Regards,

Alex

heinowalther

Hi Alex

I'm not sure how to use collectd with Nagios, if there is a way at all.

I have implemented the checks I was able to find, including some of your checks.

But I still need some specific checks which I might have to do myself.

1. A check which alerts if any volumes exist with no snapshots on them, or if 250 snapshots exist (the maximum is 255). Of course I would need a way to exclude vol0 and other volumes which should not have any snapshots, but it would find and alert you to newly created volumes that are not set up correctly.

2. Along the lines of the snapshot checks, I would like a check which reveals volumes with no SnapMirror relationships on them, or where the lag is over x hours. The lag part has somewhat been done. What hasn't been done is a check on the primary side of the SnapMirror, so again such a check would catch new volumes without SnapMirror protection.

3. A check which mounts \\windozeserver\reports$, which points to SnapManager for SQL/Exchange, and then looks at the log files in the backup catalog for errors. (We have abandoned Snap Creator for the SME/SMSQL job schedules, as it proved way too much of a hassle to configure and maintain.)
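
The first two checks in this list boil down to a comparison over volume names, which might be sketched as below. This is Python rather than the Perl of the existing plugins, and it assumes the snapshot counts and the list of SnapMirror source volumes have already been fetched from the cluster. The function names, the sample volume names and the severities (WARNING for no snapshots, CRITICAL at 250 or more) are assumptions for illustration.

```python
def check_snapshot_counts(snapshot_counts, exclude=("vol0",), max_count=250):
    """snapshot_counts maps a volume name to its number of snapshots.

    Returns (exit_code, message), ignoring any volume named in exclude.
    """
    missing = sorted(v for v, n in snapshot_counts.items()
                     if n == 0 and v not in exclude)
    too_many = sorted(v for v, n in snapshot_counts.items()
                      if n >= max_count and v not in exclude)
    parts = []
    if too_many:
        parts.append(f"{max_count} or more snapshots: " + ", ".join(too_many))
    if missing:
        parts.append("no snapshots: " + ", ".join(missing))
    if too_many:
        return 2, "CRITICAL: " + "; ".join(parts)
    if missing:
        return 1, "WARNING: " + "; ".join(parts)
    return 0, "OK: snapshot counts look fine"


def unprotected_volumes(source_volumes, mirrored_sources, exclude=("vol0",)):
    """Return volumes on the primary side that are not the source of any relationship."""
    return sorted(set(source_volumes) - set(mirrored_sources) - set(exclude))


# Hypothetical sample data
counts = {"vol0": 0, "data1": 12, "new_vol": 0, "busy": 251}
print(check_snapshot_counts(counts)[1])
print(unprotected_volumes(counts.keys(), ["data1", "busy"]))
```

Passing the exclusion list as a parameter keeps vol0 and other intentionally unprotected volumes out of the alerts, while newly created volumes show up until they are either protected or explicitly excluded.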

I'm currently doing the check above on a daily basis, so I will throw myself into building the checks one way or the other 🙂 I'm not that great at modifying others' code, so I will start pretty much from scratch 🙂

But once in production I will of course share it all.

/Heino
