The checks are working well and the first failures have already been caught with them - but now we need more feedback from other users.
So feel free to try these checks in your setup, contribute to the code, or just give us some feedback.
Thanks for all the feedback so far, and have fun with these checks!
(There are also a couple of collectd plugins for capacity monitoring in another git repository - and the collection of scripts is growing very fast, so just check the git repositories again in a few days.)
I've also got plugins ready for CPU usage (dynamically fetching the list of CPUs on all nodes and their loads), including perf data. Right now I'm experimenting with other counters from the system object and have a script ready for that, but I don't know where to put these yet ... they might also end up on GitHub ...
Thanks for starting this, I was just looking for something like this for some installations, so I will very soon be testing it, and maybe even extending it, because I need to monitor the lag on SnapMirror relations. So if a SnapMirror relation's lag is over 12 hours, a warning should be reported, and over 24 hours a critical event should be reported...
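The threshold logic described above could be sketched like this (a minimal sketch; the function name and the mapping of lag to Nagios exit codes are my own assumptions, not part of any existing check):

```python
# Sketch: map a SnapMirror lag (in seconds) to a Nagios state.
# Thresholds follow the comment above: warning over 12 h, critical over 24 h.

WARN_SECONDS = 12 * 3600
CRIT_SECONDS = 24 * 3600

def lag_state(lag_seconds):
    """Return (nagios_exit_code, label) for a given lag in seconds."""
    if lag_seconds >= CRIT_SECONDS:
        return 2, "CRITICAL"
    if lag_seconds >= WARN_SECONDS:
        return 1, "WARNING"
    return 0, "OK"
```

A real plugin would fetch the lag per relation (e.g. via the ONTAP API) and exit with the worst state found.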
Also I need checks for:
Protocols: CIFS/iSCSI/NFS/FCP (service running or not)
Time: Check time against a time server (CIFS may break if the skew exceeds 5 minutes)
LIF: Report on LIFs which are not on their home controller
CPU: Check CPU load (there has just been a bug in CIFS which loaded the CPU to about 60% all the time)
IOPS: Check IOPS load
vServers: General health status
PING: Test a ping from inside a vserver towards a host (e.g. ping your ESX server's NFS interfaces)
LUNs: Check for offline LUNs and other LUN issues
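To illustrate one of the listed checks, here is a minimal sketch of the time-skew logic (the 5-minute limit follows the CIFS/Kerberos tolerance mentioned above; the function name and inputs are assumptions, and fetching the reference time from an NTP server is left out):

```python
# Sketch: compare a node's clock against a reference time and flag
# a skew beyond the ~5-minute Kerberos tolerance that breaks CIFS.

MAX_SKEW = 300  # seconds; default Kerberos clock-skew tolerance

def skew_state(local_epoch, reference_epoch, max_skew=MAX_SKEW):
    """Return ("OK"|"CRITICAL", skew_seconds) for a clock comparison."""
    skew = abs(local_epoch - reference_epoch)
    return ("CRITICAL" if skew > max_skew else "OK", skew)
```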
Something I think should be fixed is that the username and password have to be supplied on each command... maybe an encrypted credential file could be used?
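A minimal sketch of the credential-file idea (the `user:password` format, the permission check, and the function names are all assumptions; this only keeps the password off the command line, it does not encrypt it):

```python
# Sketch: read credentials from a mode-0600 file instead of passing
# them on every command line. Format assumed: a single "user:password" line.
import os
import stat

def parse_credentials(text):
    """Split a 'user:password' line into a (user, password) tuple."""
    user, _, password = text.strip().partition(":")
    return user, password

def read_credentials(path):
    """Parse the file at path, refusing group/world-accessible files."""
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} must not be group/world accessible")
    with open(path) as fh:
        return parse_credentials(fh.read())
```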
I'm not sure how to use collectd with Nagios, if there is a way at all?
I have implemented the checks I was able to find, including some of your checks.
But I still need some specific checks which I might have to do myself.
1. A check which alerts if any volumes exist with no snapshots on them, or if 250 snapshots exist (the max is 255). Of course I would need a way to exclude vol0 and other volumes which shouldn't have any snapshots, but it would find and alert you to newly created volumes that aren't set up correctly.
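The policy in point 1 could be sketched like this (function name, thresholds as stated above, and the exclude-list mechanism are assumptions):

```python
# Sketch: flag volumes with zero snapshots or close to the 255-snapshot
# limit, skipping an exclude list (e.g. vol0).

EXCLUDE = {"vol0"}
WARN_AT = 250  # the hard limit is 255 snapshots per volume

def snapshot_state(volume, snapshot_count, exclude=EXCLUDE):
    """Return an OK/WARNING verdict for one volume's snapshot count."""
    if volume in exclude:
        return "OK"
    if snapshot_count == 0:
        return "WARNING: no snapshots"
    if snapshot_count >= WARN_AT:
        return f"WARNING: {snapshot_count} snapshots (limit is 255)"
    return "OK"
```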
2. Along the lines of the snapshot checks, I would like a check which reveals volumes with no SnapMirror relations on them, or where the lag is over x hours. That has somewhat been done already; what hasn't been done is a check on the primary side of the SnapMirror, so such a check would also catch new volumes without SnapMirror protection.
3. A check which mounts \\windozeserver\reports$ (which points to SnapManager for SQL/Exchange) and then looks at the log files in the backup catalog for errors. (We have abandoned SnapCreator for the SME/SMSQL job schedules, as it proved way too much of a hassle to configure and maintain.)
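The log-scanning half of point 3 could start as something like this (the error pattern and the `*.log` glob are pure assumptions; SnapManager report formats vary, and mounting the share is left to the OS):

```python
# Sketch: scan report log files under an already-mounted directory
# and collect the lines that look like errors.
import re
from pathlib import Path

ERROR_RE = re.compile(r"\b(error|failed)\b", re.IGNORECASE)

def find_error_lines(text):
    """Return the lines of one report that match the error pattern."""
    return [line for line in text.splitlines() if ERROR_RE.search(line)]

def scan_reports(root):
    """Map each *.log file under root to its matching error lines."""
    result = {}
    for path in Path(root).rglob("*.log"):
        hits = find_error_lines(path.read_text(errors="replace"))
        if hits:
            result[path.name] = hits
    return result
```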
I'm currently doing the check above on a daily basis, so I will throw myself into fabricating the checks one way or the other 🙂 I'm not that great at modifying other people's code, so I will pretty much start from scratch 🙂
But once it's in production I will of course share it all...