Great surprise

Posted on 10/06/

Yesterday I stumbled upon a fact that made me proud of the work I publish to CPAN, and gave me the feeling that someone finds my modules useful.

Catalyst::Authentication::Credential::Authen::Simple, despite its long name, has been incorporated into the Debian package libcatalyst-modules-perl, which shipped with Lenny!

:)

Another Catalyst::Authentication::Credential::Authen::Simple release

Posted on 24/04/

Just uploaded v0.05 to CPAN!

Thanks to a suggestion from Tomas Doran, the module now has fewer dependencies. Instead of Module::Load, it uses Catalyst::Util::ensure_class_loaded.

I'm happy to see the community is using it and participating. Somehow it looks like the module is finding a place in people's applications! These people get a place in the module (see the THANKS section).

Now I have to catch up with a mailing list post from Matt S Trout...

On Wed, Oct 01, 2008 at 04:58:51PM +0200, Jose Luis Martinez wrote:
> Tomas Doran escribió:
> 
>>Unfortunately, there is no such thing as an LDAP credential module on 
>>CPAN at the moment.
> Catalyst::Authentication::Credential::Authen::Simple should do the trick. 
> http://search.cpan.org/~jlmartin/Catalyst-Authentication-Credential-Authen-Simple-0.02/lib/Catalyst/Authentication/Credential/Authen/Simple.pm 
> becasue Authen::Simple does support LDAP.

Fucking awesome.

This needs to be more widely publicised, do you think you could do doc
patches fr C::P::Authentication and a wiki write up?  :)

- 
      Matt S Trout       Need help with your Catalyst or DBIx::Class project?
   Technical Director                    http://www.shadowcat.co.uk/catalyst/
 Shadowcat Systems Ltd.  Want a managed development or deployment platform?
http://chainsawblues.vox.com/            http://www.shadowcat.co.uk/servers/

So let's see if the next thing is the doc patch and the wiki writeup!

Catalyst::Authentication::Credential::Authen::Simple up to 0.04

Posted on 23/04/

I've applied a couple of community suggestions to Catalyst::Authentication::Credential::Authen::Simple.

  • Tobjorn Lindahl pointed out I was using some log->debug calls without checking whether the app was in debug mode. That produced version 0.03
  • Dylan Martin pointed out that the Catalyst::Log object could be passed to Authen::Simple objects, so the log information from Authen::Simple could get logged through Catalyst (in version 0.04)

Thanks for the pointers!

Thought operator overloading was the devil's work?

Posted on 07/04/

Traditionally operator overloading has been criticized and avoided to the point where it has slipped to the back of our memories. Java didn't even implement it! But on a closer look, all the arguments against operator overloading seem to be aimed at statically typed languages. In summary: operator overloading only gives you syntactic sugar so you don't have to scatter a bunch of (ugly) method calls around your code, but you pay the price of high maintenance costs [1] and the potential to suffer pain [2]. But nothing is said about dynamically typed languages! And I think there are times when this "demonized" technique comes in handy in dynamically typed languages, and not only for aesthetic reasons. It provides functionality that static languages can't get from it: it lets the programmer outgrow the API.

The thought came to me when I was trying to out-trick a class that someone else had written. The class I was using was programmed in Perl (not rare in my case). It expects you to call methods with scalars that contain numbers, and operates on them with the normal operators (+ - / *). Perl scalars cannot handle integers bigger than 2^32 on a Perl compiled with 32 bit integers (note: bigger numbers are converted to floats, and therefore lose precision). I had to pass BIG integers. So... I looked up the options Perl gave me for big integers, and found Math::BigInt. But now my worry was: the classes I use don't have explicit support for BigInts, and I'm not the author of some of them! Luckily I found this in the Math::BigInt documentation: "All operators (including basic math operations) are overloaded". Bingo! Now I can pass BigInts into classes that never expected them, and they can operate without code changes. And everything works without a single hiccup.
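To make the trick concrete, here is a minimal sketch. The total_bytes function is hypothetical: it stands in for third-party code that was written with plain numeric scalars in mind and knows nothing about Math::BigInt.

```perl
use strict;
use warnings;
use Math::BigInt;

# Hypothetical "someone else's code": it only uses the normal operators.
sub total_bytes {
    my ($blocks, $block_size) = @_;
    return $blocks * $block_size + 1;
}

# 2**64 does not fit in a 32 bit integer (nor exactly in a float once multiplied).
my $blocks = Math::BigInt->new('18446744073709551616');

# The overloaded * and + keep the result a Math::BigInt, so no precision is lost.
my $result = total_bytes($blocks, 4096);
print "$result\n";
```

The function body never changes. The overloaded operators on the argument do all the work.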

So for some new classes I'll publish to CPAN, that have to do some basic mathematical operations on data structures, I'm thinking of keeping the functionality to the bare bones, and relying on clever programmers to do clever things with the inputs.

Let me explain: I have to calculate growth rates from measurements that get taken at different times. It's as simple as:

Given:
M1: measurement 1 at timestamp t1
M2: measurement 2 at timestamp t2
M is growing (like your kids' height), therefore M2 >= M1

I have to do two basic operations: difference (subtraction) and division to get rates (things measured per second)

(M2-M1)/(t2-t1) ... (yes... We're Deriving (calculating the "rate of change"). (It's like calculating velocity from distance, but with "things" instead of meters (feet for people that drive on the wrong side of the road :D)(loooove nested parentheses!))).

That's not complicated, and has nothing to do with overloading operators, but somehow, that's the point! My class only has to know how to subtract and divide "things". In practice I will be doing these operations with lots of magnitudes that have been measured at the same time. So I could choose to put them all in hashrefs and then implement this:

sub rates {
  my ($self) = @_;
  # time elapsed between the two sets of measurements
  my $delta_t = $self->{'t2'} - $self->{'t1'};
  my $rates = {};
  foreach my $key (keys %{$self->{'m2'}}){
      $rates->{$key} = ($self->{'m2'}->{$key} - $self->{'m1'}->{$key}) / $delta_t;
  }
  return $rates;
}

Leaving the implementation limited to one level deep hashrefs.

OR

sub rates {
  my ($self) = @_;
  return ($self->{'m2'} - $self->{'m1'}) / ($self->{'t2'} - $self->{'t1'});
}

And let m1 and m2 be MagicHashSets (for my use case) with the following operations defined:

MagicHashSet1 - MagicHashSet2: foreach key in MagicHashSet1: subtract the corresponding key's value in MagicHashSet2
MagicHashSet1 / scalar: foreach key in MagicHashSet1: divide the corresponding key's value by scalar.

You can observe that MagicHashSets are easily converted to MagicArraySets.

Note that you get one more property for free from this design... If the values of the keys in the MagicHashSets are themselves MagicHashSets... You get free n level deep operations for subtraction and division! Yay!
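A minimal sketch of what such a MagicHashSet could look like, using Perl's overload pragma. The class does not exist yet, so the package name, the constructor and the sample keys (rx, tx, disk, reads) are all assumptions made for illustration.

```perl
package MagicHashSet;
use strict;
use warnings;
use Scalar::Util qw(blessed);
use overload
    '-' => \&subtract,
    '/' => \&divide;

sub new {
    my ($class, %values) = @_;
    my $self = { %values };
    return bless $self, $class;
}

# set - set: subtract the right-hand value from the left-hand one, key by key
sub subtract {
    my ($left, $right) = @_;
    die "can only subtract another MagicHashSet\n"
        unless blessed($right) && $right->isa(__PACKAGE__);
    my %result;
    # If a value is itself a MagicHashSet, this "-" is overloaded too: free recursion
    $result{$_} = $left->{$_} - $right->{$_} for keys %$left;
    return __PACKAGE__->new(%result);
}

# set / scalar: divide every value by the scalar
sub divide {
    my ($set, $scalar, $swapped) = @_;
    die "cannot divide a scalar by a MagicHashSet\n" if $swapped;
    my %result;
    $result{$_} = $set->{$_} / $scalar for keys %$set;
    return __PACKAGE__->new(%result);
}

package main;

my $m1 = MagicHashSet->new(rx => 100, tx => 40, disk => MagicHashSet->new(reads => 10));
my $m2 = MagicHashSet->new(rx => 160, tx => 70, disk => MagicHashSet->new(reads => 40));

# The same expression as the short rates() sub, with t1 = 10 and t2 = 20
my $rates = ($m2 - $m1) / (20 - 10);
print "rx: $rates->{rx}, disk reads: $rates->{disk}{reads}\n";
```

The nested disk set is handled without any extra code, because subtracting or dividing its value simply triggers the same overloaded operators one level down.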

So expect me to be posting my findings on this adventure in my next postings :). I really hope they are good findings, and not very bad ones. After all... Maybe operator overloads are the devil's work }:D

CPU checking

Posted on 02/04/

I'm going to release the check_linux_cpu check that we've been beta testing at CAPSiDE. I looked around in Nagios Exchange and none of the existing plugins scratched my itch... So I made a new one. What was wrong with the other plugins?

  • No performance data. At CAPSiDE we want all the plugins we use to output perfdata.
  • Calculation of the CPU usage. Read below to find out why.
  • Dependency on external utilities (mpstat, iostat, Net::SNMP, etc)

How is the CPU usage percentage calculated in our plugin?

/proc/stat has the info needed to calculate the CPU usage. Every time you read it, it gives you the number of time slices each processor has spent on each kind of work (computing in user space, computing in kernel (system) space, attending interrupts, etc). But those slices are absolute (counted since the OS started running).

So if you're curious about what your processor has been doing, you just have to sum up all the time it has been doing something, and then calculate the proportion of time that it spent on what you're interested in.

For example, let's suppose a /proc/stat that reports user, system, nice and idle time in each column:

cpu 8000 2000 1000 9000

8000 + 2000 + 1000 + 9000 = 20000 time slices doing things.

How much of that time was spent in user? 8000/20000 = 0.4
And in system? 2000/20000 = 0.1
In nice? 1000/20000 = 0.05
Idle? 9000/20000 = 0.45

This information can be useful, but it can be misleading if you monitor it, because it accounts for all the time since the computer was turned on. That means that if you have little activity at night, idle will gain weight. Your user time can then spike to 100% for a long time, and the percentages will hardly move.

Think of an obsessive person that notes down all the time he has spent on all of his activities. When you ask him "what have you been doing all your life?", he'll tend to respond: "sleeping" :D.

A more useful metric would be: "what have you been doing since the last time I asked you". He could tell you: "working on the presentation for tomorrow".

Well let's do the same with our CPU! Since the kernel doesn't have any interface to query what it's been doing since the last time we were interested, we'll have to ask twice:

measure 1: cpu 8000 2000 1000 9000
measure 2: cpu 9500 2500 1500 9500

What has the CPU been doing between measure 1 and measure 2?

9500 - 8000 = 1500 in user
2500 - 2000 = 500 in system
1500 - 1000 = 500 in nice
9500 - 9000 = 500 in idle


1500 + 500 + 500 + 500 = 3000 in total

So... it's been working on:
1500/3000 = 0.5 in user
500/3000 ≈ 0.17 in system
500/3000 ≈ 0.17 in nice
500/3000 ≈ 0.17 in idle
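The two-read calculation above can be sketched in a few lines of Perl. This is not the code of check_linux_cpu, just an illustration; the cpu_usage name is made up, and the column order follows the simplified example in this post. A real /proc/stat line lists user, nice, system, idle and then more columns, so the field list is an assumption you would have to adapt.

```perl
use strict;
use warnings;

# Column order of the simplified example above (an assumption, see the text)
my @fields = qw(user system nice idle);

# Takes two array refs of absolute counters and returns the share of
# time spent in each state between the two readings.
sub cpu_usage {
    my ($before, $after) = @_;
    my (%delta, %usage);
    my $total = 0;
    for my $i (0 .. $#fields) {
        $delta{ $fields[$i] } = $after->[$i] - $before->[$i];
        $total += $delta{ $fields[$i] };
    }
    return \%usage unless $total > 0;    # no time slices elapsed between reads
    $usage{$_} = $delta{$_} / $total for @fields;
    return \%usage;
}

my $usage = cpu_usage([8000, 2000, 1000, 9000], [9500, 2500, 1500, 9500]);
printf "%-6s %.2f\n", $_, $usage->{$_} for @fields;
```

A plugin that runs every few minutes would store the first reading somewhere between executions, so that "the last time I asked" is the previous check.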

Some plugins calculate CPU usage over an X second interval (mostly the ones that depend on external utilities). I don't think this is an accurate way to do the measurement either, because that obsessive person will either say "I'm talking with you", or "I was doing the presentation", but just before that he had been having a snack :D.

Curiosity:
Execute top. Quickly look at the CPU usage. Does the first reading it displays seem familiar now? And the rest?

So... is the plugin fundamentally flawed in some way? Am I just plainly wrong? What do you think?

Squeezing the juice out of check_mysql

Posted on 01/04/

The check_mysql plugin from the Nagios Plugins project is useful, but at CAPSiDE we're quite obsessed with having performance data, and registering it, to later graph it. That way we have a better vision of the systems we're monitoring. Opsview will automatically detect performance data from the plugins and graph it. But in the case of check_mysql we're out of luck. It outputs useful data, but it's not performance data.

Luckily the rrdgraph tool that Opsview uses lets you do some tricks for checks that don't output performance data (like check_mysql). Its map file lets you specify a set of regular expressions to turn the output of plugins into graphable data (Opsview provides a standard set of mappings with its base installation).

The output of check_mysql looks like

Uptime: 801963  Threads: 5  Questions: 55210201  Slow queries: 246  Opens: 25611  Flush tables: 1  Open tables: 55  Queries per second avg: 68.843

If you paste the lines below into /usr/local/nagios/etc/map.local on your Opsview Master server,

/output:Uptime: \d+  Threads: (\d+)  Questions: (\d+)  Slow queries: (\d+)  Opens: (\d+)  Flush tables: (\d+)  Open tables: (\d+)  Queries per second avg: ([-.0-9]+)(?: Slave IO: (\w+) Slave SQL: (\w+) Seconds Behind Master: (\d+)|)/
and push @s, [ "mysql",
             [ "threads", GAUGE, $1 ],
             [ "questions", DERIVE, $2 ],
             [ "slow", DERIVE, $3 ],
             [ "opens", DERIVE, $4 ],
             [ "flush_tables", DERIVE, $5],
             [ "open_tables", GAUGE, $6],
             [ "avg_qps", GAUGE, $7],
             defined $8?[ 'slave_io', GAUGE, (lc($8) eq 'yes'?1:0) ]:(),
             defined $9?[ 'slave_running', GAUGE, (lc($9) eq 'yes'?1:0) ]:(),
             defined $10?[ 'sec_behind', GAUGE, $10 ]:()
             ];
Opsview will be able to generate graphs for all your configured check_mysql checks (note that after a couple of checks have run, you will have to reload Opsview to see the icon that links to the graphs).

This map file takes into account if you execute check_mysql with the -S option to monitor MySQL slave status, and creates the slave_io, slave_running and sec_behind data channels.
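Before touching map.local, the expression can be tried out on its own. A minimal sketch, with two assumptions: the "output:" prefix that the map file matches on is left out, and the sample lines separate the main fields with two spaces, which is what the expression expects.

```perl
use strict;
use warnings;

# Same expression as in the map entry, minus the leading "output:" prefix
my $re = qr/Uptime: \d+  Threads: (\d+)  Questions: (\d+)  Slow queries: (\d+)  Opens: (\d+)  Flush tables: (\d+)  Open tables: (\d+)  Queries per second avg: ([-.0-9]+)(?: Slave IO: (\w+) Slave SQL: (\w+) Seconds Behind Master: (\d+)|)/;

my $plain = 'Uptime: 801963  Threads: 5  Questions: 55210201  Slow queries: 246  '
          . 'Opens: 25611  Flush tables: 1  Open tables: 55  Queries per second avg: 68.843';
my $slave = $plain . ' Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0';

# In list context the match returns one element per capture group ($1 .. $10)
my @plain_fields = $plain =~ $re;
my @slave_fields = $slave =~ $re;

print "threads: $plain_fields[0], avg_qps: $plain_fields[6]\n";
print "slave_io: $slave_fields[7], sec_behind: $slave_fields[9]\n";
```

Without the -S option the last three captures stay undefined, which is why the map entry guards them with defined.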

You can get pretty graphs like this:

Try it out, tell me if it works for you, and please correct and criticize!

DISCLAIMER: Editing the map.local file can break the rrdgraph data collection. Please run

perl -c /usr/local/nagios/etc/map.local
to make sure that it says: /usr/local/nagios/etc/map.local syntax OK and then pay special attention to whether the other RRDs are still being updated correctly. Please read the Opsview docs for more information.

Nagios::Plugin::DieNicely 0.03

Posted on 31/03/

I'm glad I started it out as a module, because now and then I get the opportunity to find out it doesn't do what I (and probably you) expect it to do... I've squashed a bug in v0.03 (uploaded to CPAN during my "silence" in the blogosphere).

If you were using exception handling in Perl (maybe you don't do it explicitly, but a module you use does), Nagios::Plugin::DieNicely would catch the exception and exit your script, instead of letting the eval capture it.

I've discovered a couple more flaws in the module...
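For readers who haven't hit this kind of bug before, here is a minimal sketch of the general mechanism. It assumes a die handler installed through $SIG{__DIE__}, which is one common way to intercept die; it is not taken from the module's source. Perl calls such a handler even when the die happens inside an eval, so the handler has to check $^S and step aside.

```perl
use strict;
use warnings;

my @reported;

# Hypothetical handler. A real plugin would print the message and exit
# with a Nagios status code instead of just recording it.
$SIG{__DIE__} = sub {
    my ($message) = @_;
    return if $^S;    # true inside an eval: let the eval catch the exception
    push @reported, $message;
};

# Exception handling as a module you use might do it
my $ok = eval { die "recoverable problem\n"; 1 };
my $error = $@;
print "eval caught: $error" unless $ok;
```

Without the $^S check, the handler would treat the recoverable exception as fatal, which matches the symptom described above.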

  • perl -d my_nagiosplugin
    is broken when Nagios::Plugin::DieNicely is used.
  • Nagios::Plugin::DieNicely doesn't play well with the way the Math::BigInt module detects the library it wants to use. Declare the math library you want to use explicitly and everything works OK.
    use Math::BigInt lib => 'Calc';
    

I haven't had time to look into why these bugs are there. Any ideas?

Pause... But back to blogging

Posted on 31/03/

I haven't been updating the blog lately, but that doesn't mean I haven't been working :)

These days I'll be posting about some Nagios / Opsview related stuff I've been up to lately.

Catalyst::View::RRDGraph

Posted on 26/11/

I needed to render some RRD graphs from a Catalyst application. Before, I was using rrdcgi. Not that I couldn't use it together with WrapCGI, but I wanted to write the HTML templates in Template Toolkit (as always), because rrdcgi templating is not all that powerful.

So you get the RRDs perl module on one side, and you get Catalyst on the other, a bit of glue, and there you have it: Catalyst::View::RRDGraph

Just put the graph definition on the stash, and call the view. The view outputs images, so you can use them from an HTML page that you have templated in whatever language you want.

<img src="[% c.uri_for('controller/that/uses/rrdgraph') %]">

As always, feedback is welcome

Working with Ton Voon

Posted on 26/11/

Ton Voon, CTO of Altinity, was at CAPSiDE last week. He was here doing joint development on Opsview, and giving us an inside view of the bowels of the beast. I always say that to get involved with a project, having its source code is not enough. You have to have a "photo" of the project as a whole, and that is pretty hard to get, because most of the time it isn't documented anywhere. So that's what Ton has given us. One day I'll blog about having that "photo" of a project...

I have to say I enjoyed Ton's stay here, and it was a great pleasure to work together. His technical skills and personal qualities (good communicator, ability to envision solutions that can fit everyone's needs, will to work, etc) made his stay here in Barcelona a very productive one.

Being able to see how other projects are managed gives you a view of how your own projects are managed too, and of the problems and considerations that these other projects have, which maybe you don't have right now, but could have someday, or that you can apply to yourself. On that side I'm very happy with how we are managing projects at CAPSiDE and think we are on the right path. Of course there is always room for improvement.

At CAPSiDE we are committed to contributing to Opsview, and will try to do our bit so that it can evolve into an even better monitoring solution than it is.

Simple Cross-Domain Ajax Proxy

Posted on 26/11/

While developing a feature for one of our products we needed to retrieve pages from other domains via XMLHttpRequests from the browser. As you already know, browsers don't let you do cross domain requests as a security measure, so you have to use a proxy on the same domain that your application is running on.

There are a lot of ways of doing it, and I wanted one where I didn't have to install additional software and such. There are PHP proxies, Java proxies, etc. I didn't want to write a Perl proxy (just to not bloat the solution). There were people doing it with Apache (those I liked), but in an unmaintainable way (adding one configuration per domain to retrieve info from), and our application required data to be retrievable from any domain. So here is the recipe I whipped up:

<Proxy http>
    Order Deny,Allow
    Allow from all
</Proxy>

RewriteEngine on
RewriteRule ^/web-proxy/(.*)$ $1 [P]

Now you only have to make requests to http://webserver/web-proxy/DESIRED_URL. Please take into account that if you do not restrict the web-proxy location to authorized users, you have an open proxy (don't do that).

Opsview Single Sign-On coming soon

Posted on 26/11/

People have been asking on the list to be able to use the Single Sign-On feature implemented in Opsview to authenticate against LDAP, for example.

I've been trying to get it working with the current codebase, but I'm sad to say that it's not ready yet. While looking through the code, I found a comment that resolved my doubts:

  # This setting of the user_exists means that Opsview is the central
  # login point, not the authticket
  # Maybe possible in future to allow a trust from the external source
  # so the user can be given from the auth ticket
I love code comments that really help you see the decisions that were made (those are good code comments), although this one was a bit of a show stopper :(

So... the current codebase can't trust a ticket generated by a 3rd party source. You CAN use the ticket generated by Opsview to authenticate on other sources, though, as it's fully valid.

I've contributed changes to Opsview that are awaiting review. These changes let the Catalyst framework (that Opsview uses) log in the user that is provided through a 3rd party ticket, so if everything goes well, I will be able to show you how to use the Single Sign-On to authenticate Opsview users for the next Opsview release (the article is half-written ;))

Blosxom

Posted on 26/11/

New blog. New blogging software. The last one was Movable Type. Not bad... It was written in Perl! ;) But it's not totally free, and I didn't want something that complicated (some features I didn't need, didn't want, or didn't want to spend time figuring out). I had heard about a quite minimal blogging system, written in Perl, where almost all functionality is a plugin.

To create a blogging habit, things have to be easy (for me). Now I can blog from almost anywhere. I can open an SSH session and blog from vi, or install a backend for mobile devices (I'm writing this article from my PDA... while sitting on the couch!).

The blog is going to start out minimal. You can imagine based on the styling... I don't even want comments for now (spam blog comments are horrible to cope with). If you have any comment or suggestion, drop me a mail at pplu@capside.com

Opsview custom SMS notifications

Posted on 26/11/

Opsview can now use custom SMS notification methods. I've prepared a mini-howto guide on how to use this feature. Please send in comments, corrections and suggestions. This article will be contributed to the Opsview docs, so all of us will benefit.

Configuring Opsview

Put your custom SMS notification script into

/usr/local/nagios/libexec/notifications

Remember to make it executable by the nagios user. See below for recommendations on how to develop the notification methods.

Sync the plugins to the slaves:

/usr/local/nagios/bin/send2slaves

In Opsview interface

Go to: Advanced -> SMS Notifications -> Create new SMS Notification Methods

  • Name: give it an identifier (without spaces)
  • Run On:
    • Monitoring Server: The command will be run on the Master monitoring server. This is for scenarios where you have to notify from a special device, for example, that isn't available on the slaves. A cell phone attached via a serial cable, a server that is only accessible from the master, etc.
    • Slave: This means the command will be run on the slave that has detected the alert. This is for notification services that will not depend on the server that detected the alert, like an HTTP call to an SMS service.
  • Command: the name of the command in the /usr/local/nagios/libexec/notifications directory. Add extra parameters that are supposed to get to the script (parameters that nagios doesn't send you).

Go to: Advanced -> System Preferences, choose your new SMS method identifier, and submit the changes

Be sure to have a contact with the SMS number filled in (that will activate the SMS notifications for that contact). Note that the +CCNNNNNNNNN format is no longer enforced. In fact, no format is enforced, as it is the plugin's responsibility to verify that the number has the correct format for its use. Push the "send test SMS" link to try out your notification method.

Reload your Opsview configuration and you're running

More help is available in Opsview docs.

Script guide

The script will receive the SMS number in the NAGIOS_CONTACTPAGER environment variable. In fact, it can play around with all the environment variables listed in the Nagios Macros Reference. Look in the Service Notifications and the Host Notifications columns.

Non-Nagios variables can be expected from the command parameters. Things like --url_to_post_to, --serial-device-to-talk-to --baud-rate, etc, and can be passed when you define the "Command" of the new SMS method.

Do the notification magic, print a line of status to STDOUT to help out humans ;), and exit 0 on success, non-zero on failure.
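The guidelines above can be sketched as a small Perl script. Treat it as an illustration under assumptions: the notify function, the number format check and the sample values are made up, --url_to_post_to is just the example parameter name from this guide, and the actual delivery is left as a comment.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Returns a status line for humans and an exit code (0 on success).
sub notify {
    my ($env, $args) = @_;

    # Non-Nagios settings arrive as command parameters
    my $url;
    GetOptionsFromArray($args, 'url_to_post_to=s' => \$url)
        or return ('Bad command parameters', 2);

    # The SMS number arrives in the environment
    my $number = $env->{NAGIOS_CONTACTPAGER};
    return ('No SMS number in NAGIOS_CONTACTPAGER', 1)
        unless defined $number && length $number;

    # Opsview enforces no format, so the script checks what it needs
    # (digits with an optional leading +, a made-up rule for this sketch)
    return ("Invalid SMS number '$number'", 1)
        unless $number =~ /^\+?\d+$/;

    my $text = sprintf '%s on %s',
        $env->{NAGIOS_NOTIFICATIONTYPE} // 'ALERT',
        $env->{NAGIOS_HOSTNAME}         // 'unknown host';

    # The real delivery (HTTP call, serial device...) would go here
    return ("Would send '$text' to $number", 0);
}

# In a real script: my ($status, $code) = notify(\%ENV, [@ARGV]);
#                   print "$status\n"; exit $code;
my %sample_env = (
    NAGIOS_CONTACTPAGER     => '+34600000000',
    NAGIOS_NOTIFICATIONTYPE => 'PROBLEM',
    NAGIOS_HOSTNAME         => 'web01',
);
my ($status, $code) = notify(\%sample_env, ['--url_to_post_to', 'http://sms.example/send']);
print "$status\n";
```

Keeping the logic in a function that takes the environment and the arguments as parameters makes the script easy to try out without Nagios running.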

Note: The Opsview 2.11 standard notification scripts relied on getting the SMS number via the command line with the -n parameter (if I remember correctly). These were changed to expect it through the environment variables in Opsview 2.12.

Power to the users

Saying "custom SMS notifications" is really saying "notify however you want... you have the control". That is, as long as you fill in the SMS number for a contact, the "SMS" notification will be called for it. You can write a log file instead of sending an SMS if you want... Opsview won't care }:)

Conference feedback

Posted on 26/11/

Sorry for the late post, but I've been quite busy after the Nagios Konferenz. I was preparing one macro-post with all the new things I learned, but I'll just split them so they get published quicker!

The conference was really good, and I met lots of people that use Nagios in some way, as well as the main developers and developers of 3rd party software based on Nagios. The conference was sold out, and it was a pleasure to attend. I hope to be there next time.

I attended:

  • Ethan Galstad: Nagios - Current State, Future Plans and Development Roadmap
  • Geert Vanderkelen: Monitoring MySQL
  • Stefan Kaltenbrunner: PostgreSQL Monitoring - Introduction, Internals And Monitoring Strategies for postgresql.org
  • Ton Voon: An active check on the status of the Nagios Plugins
  • Satish Jonnavithula & Steven Neiman: Application Transaction Monitoring using Nagios
  • Malte Sussdorff: Integrating Nagios and ]project-open[
  • Tom De Cooman: Monitoring Tools Shootout
  • Julian Hein: FLEXible Realtime Graphing with the new NETWAYS Grapher v2

A big thank you to Netways for organizing this great event

Nagios::Plugin::DieNicely v0.02

Posted on 26/11/

Nagios::Plugin::DieNicely now lets you exit with the Nagios status that you like best. The feature was on the Todo list, and now that I'm confident that the tests pass on lots of different perls and platforms (thanks CPAN Testers!), that I have found out why there are some FAIL test results, and that there have been requests for it, I have decided to add the feature.

Compatibility should be assured (at least the test suite says so). If you use the module as in v0.01, the exit code will still be CRITICAL. But if you were not all that comfortable with CRITICAL, and you would like WARNINGs, now you can. Just:

use Nagios::Plugin::DieNicely qw/WARNING/;

You can pass in these identifiers:

  • CRITICAL: The default
  • WARNING: I suppose this one will be the most used...
  • OK: If you use this one, please tell me why you would want to do so. I added it just in case someone wants it (I have no crystal ball to say that it isn't useful), and I have not been creative enough to find a real use.
  • UNKNOWN: The purpose of the module is to NOT get UNKNOWNs in Nagios. Why have you done this? Well... If you specify UNKNOWN, you will get the exception in the Nagios output (instead of lost in limbo).

Give it a ride!

Test::SMTP

Posted on 26/11/

I'm announcing the release of Test::SMTP. This module aims to provide a framework that makes SMTP server testing easy. We were doing SMTP testing with an instance of Net::SMTP, and with Test::More methods, seeing if everything was as expected. All this logic has been encapsulated into Test::SMTP to make testing SMTP servers a little less of a pain.

Please note that this is a 0.01 version and is based on Net::SMTP as the client. Net::SMTP has its limitations as a client that gives the test full control. Don't get me wrong: as a "do the right thing for me when you can" client it's great. Try not to call Net::SMTP methods, as this class is a temporary bridge, just so the testing framework can be evaluated by the community (release early, release often).

Things in Test::SMTP that need to be addressed in the future:

  • Test::SMTP can't simulate plain old (HELO) SMTP clients if the server supports ESMTP. The underlying Net::SMTP auto negotiates EHLO/HELO when an instance is created.
  • The Net::SMTP supports method is called, although it is not documented in the Net::SMTP docs. Its name suggests it is public :p
  • No STARTTLS support, because Net::SMTP doesn't support it
  • Auto selected AUTH. See Net::SMTP for supported AUTH methods and code for how it selects the auth

Features:

  • You can simulate multiple clients in the same test. Just call connect_ok more times and you obtain more clients.
  • Simulation of misbehaving clients is supported. Test::SMTP inherits from Net::SMTP, so you have access to the methods of IO::Socket::INET and Net::Cmd. Because of the automatic HELO/EHLO you can't issue commands before the HELO phase, though.
  • Mail addresses passed to Net::SMTP methods to and mail are mangled by Net::SMTP to try to produce good commands to the server. These have been worked around adding mail_from and rcpt_to methods, that issue MAIL FROM and RCPT TO commands

Future plans are to implement a "don't do things automatically" client so you have all (or at least more) control over the client.

Introducing Catalyst::Authentication::Credential::Authen::Simple

Posted on 26/11/

Just got another module out!

This module isn't at all complicated. I'm even surprised that no one has already written it! Authen::Simple is a great authentication framework (thanks to the excellent work of Christian Hansen). We've been using it at CAPSiDE for quite some time now, but we hadn't developed a Catalyst module for it because we normally use mod_auth_tkt, so our Catalyst apps aren't authenticating directly. I recall the need for Catalyst apps to authenticate against external datastores coming up on the mailing lists, and a recent conversation with Ton Voon made me think that it's time to write the module so Catalyst can do fancy authentication.

Catalyst::Authentication::Credential::Authen::Simple is just glue between Authen::Simple and Catalyst. It reads the Catalyst app config, instantiates the appropriate Authen::Simple objects, and then just calls authenticate on the objects when you authenticate from within Catalyst.

It's that simple... Authen::Simple...

Opsview support for NagiosChecker

Posted on 26/11/

I've been using the Firefox NagiosChecker together with Opsview. I found this plugin because someone on the Opsview list asked if it was compatible with Opsview, and I tried it out. It worked well, aside from one little issue:

NagiosChecker authenticates with Basic HTTP Authentication, and Opsview doesn't like that. You configure NagiosChecker, and it doesn't work. Opsview needs a valid cookie to authenticate. If you log in to Opsview, you see NagiosChecker start to work. That's because Firefox stores the cookie needed to authenticate, and on the next request NagiosChecker makes, the cookie gets sent to Opsview!

So I only had to make NagiosChecker log in to Opsview the first time it requests the Nagios status screen. I added a checkbox to the NagiosChecker server setup screen so you can tell it that it's an Opsview server.

I've contributed the patch to the NagiosChecker project, but in the meantime, I've packaged NagiosChecker with my patch so you can try it out. Feedback is welcome ;)

Download the NagiosChecker with Opsview support. You can also patch your installation of NagiosChecker.

Be the ticket booth

Posted on 26/11/

Now that we know how mod_auth_tkt works, we are eager to implement our applications' authentication with it. The module never generates the ticket. Instead, ticket generation is delegated to the login URLs.

You can generate a ticket from your favourite language. The mod_auth_tkt distribution includes a Perl module, a Python module, and PHP helper functions. There is a login Perl CGI script that uses the Perl module, and is prepared to do a lot of things just by configuring it and filling in the "sub validate", so user and password get verified against any database you want. Look at the example: require the class that will do the validation, and then return true if the supplied credentials are correct, or false if they are not.

So... What does the application have to do to get into the single sign on world? In many cases: nothing. If you have been relying on Apache basic authentication, you have probably been receiving the already authenticated user in the REMOTE_USER environment variable. When a valid ticket is detected, the module takes the user for which the ticket was generated (remember that if the ticket was issued, the supplied credentials were correct) and sets REMOTE_USER. So if your application was using basic authentication, you are in luck: set the Apache config and let it run!

If you were authenticating within your application, you are less lucky. There is a forest of possibilities of how your system is working, but most probably you are just storing the logged in user in the session once authenticated, or getting the logged in user from one single point in your code. You can see where I'm going with this... Just start to rely on REMOTE_USER from that point.
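As a minimal sketch of that "one single point": the logged_in_user helper below is hypothetical, and so is the idea of keeping the old session value as a fallback while you migrate. Only the REMOTE_USER variable comes from the setup described above.

```perl
use strict;
use warnings;

# Hypothetical helper: the one place the application asks "who is logged in?"
sub logged_in_user {
    my ($env, $session) = @_;

    # Set by the web server once a valid ticket has been detected
    my $user = $env->{REMOTE_USER};
    return $user if defined $user && length $user;

    # Legacy in-application login, kept as a fallback during the migration
    return $session->{user};
}

print logged_in_user({ REMOTE_USER => 'alice' }, { user => 'bob' }), "\n";
```

Once every entry point sits behind the ticket check, the fallback can go away and the application no longer needs its own login code.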

Net::Server::Mail::ESMTP::SIZE

Posted on 26/11/

I've just released another module to CPAN. This one is Net::Server::Mail::ESMTP::SIZE.

When I developed tests for Test::SMTP, I had to implement a mini-SMTP server. Instead of reinventing the wheel, I chose to use the Net::Server::Mail distribution to quickly get what I wanted. But to test supports_cmp_ok and supports_like, there was no pre-built extension that reported parameters. So I stubbed in the SIZE extension, only for the tests to play around with.

On my list of "possible CPAN modules" appeared Net::Server::Mail::ESMTP::SIZE, but this time implementing actual functionality. As with the first attempt I made great progress, and everything was quite straightforward... It's done!

One problem I had was actually getting the module to plug into Net::Server::Mail::ESMTP... The documentation is quite short, so I had to "inspire" myself from the code of the modules that were already written, and do lots of Dumper(@_) to see what was going on... I hope I got it all right :p.

Faster isn't always scalable

Posted on 26/11/

Sometimes when designing we go to great lengths to optimize for speed, and don't always think about scalability. When thinking scalable, you have to think about letting operations run in parallel, and thus about locking as few common resources as possible so that the work can, in all probability, be done in parallel. And sometimes, to be fast, you hold a lock so you can assume that you are alone (you can skip synchronization with others, and thus its overhead). But that means that you are the only one that can be working.

As an example: MyISAM tables are fast for reading and writing, but scale badly under concurrent access. As concurrent reads go up, one single write blocks ALL the reads on the table, because writes hold a lock on the entire table until they are done. InnoDB, in exchange, is slower at updating rows, but because writes only lock the rows that they are writing, reads can still be done concurrently if they are addressed to unlocked rows.

The confusion normally comes from equating faster with fewer CPU cycles: since a CPU is a locked resource, the faster you do things, the more you can do in parallel.

Think before holding a lock ;)

Opsview Asterisk notification

Posted on 26/11/

After a couple of weeks in internal testing, I can now contribute a helpful notification method for Opsview, and more generally for Nagios.

We wanted Asterisk to wake up an on-call engineer if an alert was detected. What seems like a pretty trivial thing has a couple of subtleties that have to be treated with care so as not to make a nightmare out of it.

First: how to make asterisk call a phone number. There are a couple of documented ways to do it:

  • create a call file on the server. This would have to be done via ssh from the Nagios master server. I don't like this because it is touching the internals of Asterisk, although it is standard (much like creating a mail in the sendmail queue).
  • via the Asterisk Manager protocol (astman). This seemed better suited to the task. The downside is that the Perl API for astman (Asterisk::Manager) is quite spartan and not all that well documented.

When someone asks me to choose between two options that I don't like, I always ask them this question in return: "What do you prefer? Syphilis or Gonorrhea?". This time I was asking myself... So I chose the astman solution. I hope that this way the Asterisk::Manager module will get a bit more mature, and therefore be a better long-term solution.

Second: The calling gets done in asterisk. The notification script only calls an asterisk extension. We wanted a "human hunter" that would not stop calling until someone acknowledged the alert. Maybe someone would want a different behavior, so that is customizable via asterisk programming.

Lessons Learned

  • Astman was the right way to go: Astman has access rights per user and per host in the manager.conf file.
  • Alarms are random and can happen in parallel: The first day in alpha there was a connectivity problem. A lot of alerts were spawned. The poor guy at the phone had to acknowledge a lot of calls :S. We noticed that a "don't call me if you have just called me" mechanism was needed.

    A configurable lock out mechanism was added so that some calls could be made in parallel in a customizable way. Maybe you want to call a number two times in a row if the alert is for different hosts or host groups, or maybe you just want one time alert per phone number called, or just notify to one phone whatever the notification is.
  • Nagios kills "lazy" notifications: Because we hunt down someone, the call can get long. Another time... someone had to acknowledge a lot of calls that day... :S The call is not registered as successful until Asterisk says that it's successful and then it's registered in the notified database. Other notifications get queued up while a lock on a notification db is held. When tested out of the Nagios environment everything was OK. Debugging revealed that when the notification script was taking more than x seconds, Nagios was killing it, the lock was released, and Asterisk was continuing with the call. The next notification was kicking in (because the lock was released), and Asterisk was dialing again to the same contact.

    This was resolved by forking, and detaching the child from the parent process (just like a daemon does). The detached process does the calling. The parent returns immediately.
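The "don't call me if you have just called me" lockout from the lessons above can be sketched roughly like this. It is an illustration and not the code of the published script: the function name should_call, the in-memory hash and the key format are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical lockout table: lock key => time of the last call.
# A real notification script would have to persist this (file or database),
# because every notification runs as a separate process.
my %last_call;

# Returns true (and records the call) if nobody was called for this key in
# the last $lockout seconds; returns false if the call should be suppressed.
sub should_call {
    my ($key, $lockout, $now) = @_;
    $now = time unless defined $now;
    my $last = $last_call{$key};
    return 0 if defined $last && $now - $last < $lockout;
    $last_call{$key} = $now;
    return 1;
}
```

The key decides the policy: using only the phone number gives one alert per phone, while a key such as "phone|hostgroup" allows a second call to the same phone when the alert is for a different host group.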

So you can get the notification script here. In the end it got a little more complicated than it seemed at first :)

New design

Posted on 26/11/
This time it's a design! Thanks Pau.

Nagios Checker patch got through

Posted on 26/11/

I'm pleased to announce that my patch for Nagios Checker to support Opsview is now available in the official distribution of the plugin. You can get it here

I've been using Nagios Checker for quite some time now, and I like it very much, and now that it's patched for Opsview, I like it even more. Instead of having a browser window open, and going to look at it when an alert email gets in my mailbox, I get a nice warning sound, and an overview of the problems just hovering over the checker. Direct access to Opsview is granted just clicking on the alert. Need to add a host? Or curious to see the Opsview HH page? Just click on the 'go to Nagios' menu when you right click on the Nagios Checker status.

My office colleagues have been beta testing the patched plugin, and find it very useful, but they had a bit of trouble configuring NagiosChecker correctly to play with Opsview so here is how to do it:

When you're configuring your Opsview host in Nagios Checker:

  • URL of the Nagios Interface: http://my.opsview.server/
  • Type of server: Opsview
  • URL of status.cgi: Select manually, and fill in http://my.opsview.server/cgi-bin/status.cgi

I haz got Kwalitee

Posted on 26/11/

I've been trying to increase the kwalitee of my modules with every release. Looks like I got it right.

A couple of tips are:

  • use a recent Module::Build: It gives you kwalitee very easily, as it does tons of stuff for free. But use a current version. The first modules I contributed used the Module::Build from Debian Etch, and didn't generate a known-spec META.yml. I got that fixed for free just by upgrading Module::Build.
  • make manifest before doing the make dist.
  • use Test::Pod and Test::Pod::Coverage: Test::Pod will alert you if you have typos in your POD, and Test::Pod::Coverage will bug you when you don't document a function

Of course there is no guarantee that if a module has kwalitee then it's good... It has to have proper tests (Test::SMTP had 100% code coverage, and even that won't guarantee bug-freeness), and those tests have to run on as many platforms as possible (that won't assure anything either), and a bunch of other things which I'll write about in future articles... I hope I maintain my kwalitee (I like being on the first page of the "Authors with less than five dists" ;)

Comments activated

Posted on 26/11/

I've activated comments for the posts. You can rate them too... (so now I can know if you like the posts, and if I should stop blabbing about some topic :p)

Please be polite, and try to contribute some extra content to the posts, without flaming, insulting and such. These attitudes will not be tolerated, and comments will be deleted without any type of explanation.

Machine naming schemes

Posted on 26/11/

When you industrialize your systems management (you are a hosting provider), or you simply have LOTS of machines for whatever reason, you have the need of a naming scheme. You have probably been naming machines by:

  • Planets: Sun, Mercury, Venus, Pluto, ...
  • Constellations: Andromeda, Orion, ...
  • Winds: Tramuntana, Xaloc, Garbí, ... (Here in Catalonia these are used a lot!)
  • LOTR: Mordor, Shire, ...

So you start naming machines with a scheme that helps you locate them: r01p01.net.example.com means rack 01 position 01 (positions starting from the rack bottom), for example. The downside is that once you have standardized the machine names, you lose that special "think of a name" moment, and the geekiness of the thing altogether (people that are in IT usually don't know that machines even have names!)

I personally name my machines (and electronic devices that have computer-like functionality) with names of robots that appear in Futurama. So I have Bender, Flexo, Roberto, Calculon, etc. It's funny when I get into my boss's car and the hands-free display says "Kwanzabot", and to see the HELO in SMTP headers display "SINCLAIR-2K".

Of course this is not a new thing, and RFC 1178 has some interesting situations in the "what NOT to do" guidelines XD. I'm pretty sure that most of us have fallen into one of the situations described in the RFC.

The bottom line is "try to have fun naming!" (when you can)

New style

Posted on 26/11/

New style for the blog! Contributed by Pau Puig, one of CAPSiDE's workers.

Thanks Pau!

The connotation of PIN numbers

Posted on 26/11/

I discovered some time ago a neat "trick" for mobile phones that not many people know, and that I'm sure the "security paranoid" bunch of people will appreciate.

When your mobile phone prompts for a pin, strangely, it lets you insert more than 4 numbers. That's because: pins can be longer than 4 digits on mobile phones!

I investigated a little further and it turns out that Wikipedia has it documented! It's curious how we have associated the term "PIN code" with only 4 digits. Maybe phone manufacturers should have called it Unlock Number to take the 4-digit connotation away...

So now you know... you can have the worst type of PIN (for an attacker): one that is probably outside the attacker's mental scope. And if you're a developer and need to ask for a numeric password, be careful with the connotation of "PIN". Maybe you'll find yourself with all passwords being 4 digits long, although you support more ;)

Writing great Nagios plugins

Posted on 26/11/

So you want to write a Nagios plugin, and you want it to be a great one! A great plugin, aside from having some great functionality is one that provides good documentation and fits nicely into the Nagios ecosystem, that is, that nagios users will be comfortable with it.

Right now you are thinking: "how can I do that? I have to look at other plugins, read guidelines, learn a lot about the nagios way to do things, and what the community expects from a plugin, etc. It's a quite big task, and I just wanted to write a quick and dirty plugin!"

If you program your plugins in Perl you are a lucky man, because smart people have already done that for you! Nagios::Plugin helps you fit into that ecosystem and get a lot of functionality for the best cost: FREE. You get your plugins done in less time, with more features and with fewer bugs.

First step: Instantiate a Nagios::Plugin object

my $np = Nagios::Plugin->new(
        usage => "Usage: %s [-v|--verbose] [-t|--timeout=seconds] -c|--critical=<threshold>",
        version => '1.0',    # quoted, so it is not printed as "1"
        blurb => qq{Count the xxx's in yyy},
        extra => qq{
 -c 10
   returns CRITICAL if xxx's are greater than 10
 -c 20 -t 60
   returns CRITICAL if xxx's are greater than 20. Timeout in 60 seconds if it takes too long.},
        url => 'http://example.com',
);

You get:

  • standard parameters
    -V version info
    -h autogenerated help
    -v verbose output flag
    -t timeout
    nice features that you don't have to worry about, and that Nagios users will be very happy to have. Programs like Opsview will show the help on its web interface (again... for free).

  • plugin versioning
    version and url get outputted for free (too) in help and -V
  • help text
    the help text consists of the version info, license (GPL if not overridden), blurb (text describing what the plugin does), parameter help list (autogenerated with the add_arg() info), and extra info. The extra info is the ideal place to give the user a couple of usage examples with a small description of what the invocation of the plugin with those parameters does.
That's a lot for one statement!

Second step: add your parameters

$np->add_arg(
    spec => 'warning|w=s',
    help => "-w, --warning=RANGE\n     Range for returning WARNING"
);
$np->add_arg(
    spec => 'number|n=i',
    help => "-n, --number=INTEGER\n     Number of yyy's to xxx",
    required => 1
);
$np->add_arg(
    spec => 'filter|f=s',
    help => "-f, --filter=aaa\n    Filter by aaa",
    default => 'aaa'
);

# Parse @ARGV and process standard arguments (e.g. usage, help, version)
$np->getopts;
You get free parameter type validation, so if you declare that a parameter is an integer and the user passes something else, the plugin will not go past the $np->getopts statement. You also specify a string for each parameter that will be displayed when the user calls the plugin with --help. If you are going to have a critical and a warning threshold, tell the user that they are RANGE items (you'll see why below). Some standard parameter names are:
-c critical range
-w warning range
-C for parameters that start with "c" other than critical
-H hostname: for names of machines
-p port: for port numbers
-4 for using IPv4
-6 for using IPv6

Third step: do what your plugin does

Now you have to work (hey! you haven't broken a sweat yet!). To get the value of the parameters passed to your script, you have handy $np->opts->paramname accessors.

Fourth step: return performance data (it's free)

You have almost surely collected a measurable quantity to compare against a threshold. Output the collected data via performance data. I'm sure you will want to see how your collected data evolves through time with a nice graphing tool. Is it going up? down? is it high at work hours? is it low on weekends?

$np->add_perfdata(
    label => "size",
    value => $value,
    uom => "kB",
    warning => $np->opts->warning,
    critical => $np->opts->critical
);

Let UOM be:

  • no unit specified - assume a number (int or float) of things (eg, users, processes, load averages)
  • s - seconds (also us, ms)
  • % - percentage
  • B - bytes (also KB, MB, TB)
  • c - a continuous counter (such as bytes transmitted on an interface)
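To give an idea of what add_perfdata ends up printing, here is a small sketch of the perfdata text format described in the Nagios plugin guidelines (label=value[UOM];[warn];[crit];[min];[max]). The function format_perfdata is hypothetical and only illustrates the format; Nagios::Plugin builds the real string for you.

```perl
use strict;
use warnings;

# Build one perfdata item: label=value[UOM];[warn];[crit];[min];[max]
# Labels that contain spaces would additionally need single quotes.
sub format_perfdata {
    my (%p) = @_;
    my $uom = defined $p{uom} ? $p{uom} : '';
    my @thresholds = map { defined $_ ? $_ : '' } @p{qw(warning critical min max)};
    my $text = "$p{label}=$p{value}$uom;" . join ';', @thresholds;
    $text =~ s/;+$//;    # trailing empty fields may be dropped
    return $text;
}

# The status text and the perfdata are separated by a pipe character.
print "SIZE OK - 512 KB used | ",
    format_perfdata(label => 'size', value => 512, uom => 'KB',
                    warning => 800, critical => 1000), "\n";
```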

Fifth step: return the status

Now you decide if the plugin has to return CRITICAL, WARNING or OK. This code quickly springs to mind:

if ($value > $critical) {
    ...    # CRITICAL
}
elsif ($value > $warning) {
    ...    # WARNING: between warning and critical
}
else {
    ...    # OK
}
What if somebody wants OK between critical and warning? Again you can work less and get more: $np->check_threshold to the rescue! Nagios has a RANGE specification that check_threshold understands, so you can just pass the collected value, the critical parameter and the warning parameter. You get the status that has to be returned calculated for free!

my $status = $np->check_threshold(
    check => $value,
    warning => $np->opts->warning,
    critical => $np->opts->critical
);

Now just return the calculated status and a little single line text with the exit method. Don't be too verbose, though, because the output gets cut!

$np->nagios_exit( $status, "$value xxx's were found" );
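If you are curious about what a RANGE looks like, this is a rough sketch of the rules from the Nagios plugin guidelines: "10" alerts outside 0..10, "10:" alerts below 10, "~:10" alerts above 10, "10:20" alerts outside 10..20, and a leading "@" alerts inside the range instead. The function range_alerts is hypothetical, written only to illustrate the rules; it is not how Nagios::Plugin implements them, and in a real plugin you should keep using check_threshold.

```perl
use strict;
use warnings;

# Returns true if $value should raise an alert for the given RANGE string.
sub range_alerts {
    my ($range, $value) = @_;
    my $inside = ($range =~ s/^\@//) ? 1 : 0;    # leading @ inverts the test
    my ($start, $end) = (0, undef);              # undef end means +infinity
    if ($range =~ /^(~|-?[\d.]+)?:(-?[\d.]+)?$/) {
        $start = defined $1 ? ($1 eq '~' ? undef : $1) : 0;   # ~ is -infinity
        $end   = $2;
    }
    elsif ($range =~ /^-?[\d.]+$/) {
        $end = $range;                           # plain "10" means 0..10
    }
    else {
        die "Invalid range '$range'\n";
    }
    my $in_range = (!defined $start || $value >= $start)
                && (!defined $end   || $value <= $end);
    return $inside ? $in_range : !$in_range;
}
```

So somebody who wants OK only between 10 and 20 just passes -w 10:20, and nobody has to touch the plugin code.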

More neat (and free) details

  • verbosity
    $np->opts->verbose will return the number of -v flags in the parameters. Use it if you want to give the users a little more info (-v), or a little more (-vv), or a lot more (-vvv) :p.
  • Read the docs
    The docs will reveal all sorts of extra info. Read the helper classes (Nagios::Plugin::Xxx) documentation too, because not everything is exposed in the Nagios::Plugin documentation ;)

Summing up

Nagios::Plugin will save you time, and make your plugins better, with less effort.

Proud to see Opsview 2.12.1

Posted on 26/11/

The development work that got done when Ton Voon came to CAPSiDE has got through. I am proud of the add-ons that we have contributed, and hope to add more over time.

A lot of effort has gone into each feature by the CAPSiDE Team and by Altinity.

CAPSiDE added features are:

  • Single Sign-on
  • Event handlers
  • Customizable host check commands
  • Customizable SMS Notification methods

One thing from CAPSiDE that didn't make it into the 2.12.1 release (but we hope will come soon) is Nagvis integration, so you can map out your servers and see them the way you want to.

We are looking forward to hearing if these add-ons have been useful to the community, and if they are being used and how. Drop us a mail on the opsview users list ;)

Getting to the backends

Posted on 26/11/

As I already explained, simple web apps can start using mod_auth_tkt pretty fast if they were counting on HTTP basic authentication.

When you control the software being used (be it yours or open source) you can always take on parsing the ticket to get the info back, be it in a cookie, be it in a parameter via GET.

Let's examine a more complex scenario. Problems start arriving when using application servers, or proxying to non auth_tkt aware servers or applications. The frontend can validate the ticket (authenticating the user), but since mod_auth_tkt basically leaves the authenticated user in the REMOTE_USER environment variable, and these variables don't get proxied, you don't receive the logged-in user in the backend. So... let's try to find some ways of getting the info to the backends (thanks to the people on the mod_auth_tkt list for the pointers).

Using headers

Put the REMOTE_USER in an HTTP header. Use mod_headers.

ProxyPass /headertest/ http://backend/xxx/
ProxyPassReverse /headertest/ http://backend/xxx/

<Location /headertest/>
   AuthType Basic
   TKTAuthLoginURL /login
   TKTAuthTimeout 600s
   RequestHeader set X-AuthTkt-Remote-User "%{REMOTE_USER}e"
   RequestHeader set X-AuthTkt-Data        "%{REMOTE_USER_DATA}e"
   RequestHeader set X-AuthTkt-Tokens      "%{REMOTE_USER_TOKENS}e"
   require valid-user
</Location>

And in the backend, just pick up the results! If you are running a CGI on the backend, just look up the environment variables HTTP_X_AUTHTKT_REMOTE_USER, HTTP_X_AUTHTKT_TOKENS and HTTP_X_AUTHTKT_DATA. "Of course," you'll say, "but I have to modify the backend software to read from HTTP_X_AUTHTKT_REMOTE_USER." If the backend server is another Apache, you still have an ace up your sleeve: mod_setenvif.

    SetEnvIf X-AuthTkt-Remote-User "(.*)" REMOTE_USER=$1
    SetEnvIf X-AuthTkt-Data        "(.*)" REMOTE_USER_DATA=$1
    SetEnvIf X-AuthTkt-Tokens      "(.*)" REMOTE_USER_TOKENS=$1
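If the backend is a Perl CGI and not another Apache you control, picking up the headers could look roughly like this. The function authtkt_from_env is hypothetical. Two things in it are assumptions you should check against your own setup: that the tokens arrive as a comma-separated list, and that an unset variable may show up as the literal string "(null)".

```perl
use strict;
use warnings;

# CGI exposes the request header X-AuthTkt-Remote-User as the environment
# variable HTTP_X_AUTHTKT_REMOTE_USER (upper case, dashes become underscores).
sub authtkt_from_env {
    my ($env) = @_;
    my %info;
    for my $field (qw(REMOTE_USER DATA TOKENS)) {
        my $value = $env->{"HTTP_X_AUTHTKT_$field"};
        # Assumption: treat an empty value or the literal "(null)" as missing.
        $value = undef if defined $value && ($value eq '' || $value eq '(null)');
        $info{ lc $field } = $value;
    }
    # Assumption: tokens are a comma-separated list.
    $info{tokens} = [ defined $info{tokens} ? split(/\s*,\s*/, $info{tokens}) : () ];
    return \%info;
}

# In a CGI script: my $auth = authtkt_from_env(\%ENV);
```

Only trust these headers if the backend can be reached solely through the frontend, because otherwise any client could send them itself.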

Using URL GET parameters

You can rewrite the REMOTE_USER to a parameter in the URL, and fetch it in the backend. mod_rewrite can handle this with its eyes closed.

ProxyPass /headertest/ http://backend/xxx/
ProxyPassReverse /headertest/ http://backend/xxx/

<Location /headertest/>
   AuthType Basic
   TKTAuthLoginURL /login
   TKTAuthTimeout 600s

   RewriteEngine on
   RewriteRule  ^(.+)\??(.*)$   $1?remote_user=%{ENV:REMOTE_USER}$2    [QSA]

   require valid-user
</Location>

mod_rewrite can set environment variables too, so if you do the inverse process in the backend (set the environment variable to the value of the GET parameter), you get the same result. I like the header solution best because mod_rewrite is a heavy module, while the header solution adds just the one module that the frontend needs (mod_headers) and the one that the backend needs (mod_setenvif).

There was a comment on the list about getting the username and password to the backends (for apps that need the two on every request), but for that you have to store the password encrypted in the cookie. I'll have a shot at that one in another post (and maybe use the technique in the real world in an open source application... we'll see).

I wish I never hit the send button

Posted on 26/11/

Every day we send out lots of mails. I normally read a mail two times before sending it to a customer. And despite that, there have been times where I wished that a message had not gone out. Maybe I pressed the shortcut to send out the mail when it was half finished, maybe I got an afterthought on how to express something, or on how to solve an issue in another way, or to include someone in the conversation...

The other day I was talking with a project manager at one of our customers, and he told me he was going to send me a mail while we were on the phone. He told me that I would receive the mail in one minute, and the curious thing was that "one minute" was not just an expression. I got interested in the delay, and just had to ask why. Basically he has a rule that delays all outgoing mail for one minute before submitting it to the server. He gave me a nice and easy solution I hadn't ever seen. You can add a configurable delay to your outgoing messages with a very simple Outlook rule! Now you always have a second chance! After all, the customer probably won't notice the delay.

I'm not saying that this is the remedy for all mistakes, but for some reason, when you press the send button, a background process kicks in and you realize your mistakes, and this is a nice way of getting to the message before it really gets sent. Of course your brain can adapt and kick that background process in after the delay... you never know with brains! ;)

I liked the solution because it was a way to use Outlook rules that I had never thought of, although you can see that it isn't hidden at all (create an outgoing message rule).

Oh... wait... I wanted to blog about open source things! I tried to get the same functionality out of Thunderbird but it seems that rules only apply to incoming mail. Does anybody know of an Open Source mail client that can implement this sort of behaviour?

See you in NETWAYS Nagios Konferenz

Posted on 26/11/

I'll be attending Nagios Konferenz 2008 on September 11 and 12, in Nuremberg. I'll be presenting the Review of notification methods talk. In other words: how people are notifying and, if you want to develop your own notification methods, what we've learned the hard way about doing so.

Hope to see you there, and looking forward to hearing the very interesting presentations that are scheduled (It's a shame I don't know German!).

How we notify engineers on 24/7 [Feedback III]

Posted on 26/11/

After my talk, some people asked me if I could publish the Asterisk code that we use to notify our on-call engineers. We published asterisk_notify, which is the glue between Nagios and Asterisk, but didn't publish the logic that was in Asterisk to be sure that someone (human) picked up the phone, and got notified that Nagios detected a problem.

I am no Asterisk guru, and I have a hard time programming in Asterisk's "language" (kind of reminds me of BASIC). I cannot guarantee it will work for you in your environment. I can only say that it works for us. Suggestions and improvements are welcome (I said I was no Asterisk guru... I just went around the docs and searched and investigated a lot in VoIP Info).

; Custom Nagios Notify Extension
; (c) Jose Luis Martinez
; jlmartinez@capside.com
; Use at your own risk. 
[custom-nagiosnotify]
; s,1 defines where to get the number from. Select one of the s,1
; lines, and comment out the others...
;
; Just for hard-coding a list of numbers 
;exten => s,1,Set(SUPPORT_GROUP_NUMS=0666666666#0666666667#);
; The nagios notification script was setting SUPPORT_GROUP_NUMS, so
; the STUB=1 action was just to have an s,1 action (to not touch the rest
; of the extension 
;exten => s,1,Set(STUB=1);
; we use s,1 to setup a variable named SUPPORT_GROUP_NUMS that will contain
; the list of numbers that Asterisk will "hunt down"
; AGI nagios_notify.pl looks up who to call in our on-call database and
; sets that variable via the AGI interface. 
exten => s,1,AGI(nagios_notify.pl)
exten => s,2,NoOp()
exten => s,3,SetLanguage(es)
exten => s,n,Set(RingGroupMethod=hunt)
; make macro "nagios-pickup" handle when the user answers
; the timeout waiting for someone to answer is 30 seconds. 
exten => s,n(DIALGRP),Macro(dial,30,M(nagiospickup)m,${SUPPORT_GROUP_NUMS})
exten => s,n,Set(RingGroupMethod=)
exten => s,n,Goto(custom-nagiosnotify,s,2)

[macro-nagiospickup]
exten => s,1,Wait(1)
; playback a sound that says: "There is an Opsview alert. Please press 1"
exten => s,2,Playback(custom/alertaDOpsview&custom/premi1)
exten => s,3,Read(OneKey||1||1|5) ; Store in 'OneKey' the pressed key. timeout in 5 secs
exten => s,4,GotoIf($[${OneKey} = 1]?s|5:s|8) ; GoTo prio 5 if "1" was pressed; else to prio 8 
exten => s,5,NoOp(Caller marked 1) ; Called person pressed number 1
; playback "you have an alert" in a loop
exten => s,6,Playback(custom/alertaDOpsview)
exten => s,7,Goto(s,6)
; We got tired of waiting for the user to press 1. We'll continue down the hunt list...
exten => s,8,Set(MACRO_RESULT=CONTINUE)

One known bug is that if you press number one BEFORE hearing "There is an Opsview alert. Please press 1", the call doesn't get acknowledged. Of course, it will call you again, and you'll have the chance to be less impatient this time ;)

If you use or adapt this script, make it do interesting new things, fix bugs, etc. please give me feedback.

Notifications stall Nagios [Feedback I]

Posted on 26/11/

Notifications are NOT asynchronous, that is, nothing goes on in Nagios while your notification script is running. So try to be fast when notifying. Or simply return something immediately (Nagios does nothing with the return code from notification scripts for the moment). Note also that your script will be killed if it's taking too long.

If you still have a long-running notification script, you can opt to fork, detach the child process (like a daemon does), and do all the work in the child. Just return something immediately in the parent process. If your notification script is in Perl, just do this:

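use POSIX 'setsid';

# Detach the way a daemon does: redirect the handles, fork, start a new session.
open STDIN, '/dev/null'     or die "Can't read /dev/null: $!";
open(STDERR, "> /dev/null") or die "Can't write to /dev/null: $!";
defined(my $pid = fork)     or die "Can't fork: $!";
exit if $pid; # parent process just exits, so Nagios gets control back immediately
setsid or die "Can't start a new session: $!";
# from here on we are the detached child: do the slow notification work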

Nagios::Plugin::DieNicely Released

Posted on 26/11/

As your Nagios plugins get a bit more complicated, and depend on external CPAN modules you will find yourself with spontaneous UNKNOWN states on Nagios when the services that you monitor are faulty. This will probably come from the fact that different modules have different ways of notifying that something has gone wrong. Some return undef, and some call die or croak.

When they call die is when you have Nagios reporting UNKNOWN states, and "no output". Nagios will consider exit codes that it doesn't know as unknown states, and perl exits with 255 on die. And one more thing: the exception gets printed to STDERR, and Nagios will just discard it. So you never know what hit you.

Normally you program thinking that things go well, and if there is an unhandled exception the program is supposed to die. But we're monitoring... an unhandled exception can probably give some important info on what's going on. So you wrap the code you THINK will fail in an eval, and you exit with the appropriate Nagios exit code if there is an exception. But what will you do? Wrap everything in an eval? Ugly. And you have to remember to do it... Fear not. Just use Nagios::Plugin::DieNicely and program as always.

Nagios::Plugin::DieNicely will trap Perl's die (and Carp's croak and confess) for you. Then it will output the exception to STDOUT in Nagios format and exit with a Nagios CRITICAL exit code. So now you have one less thing to worry about.
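To illustrate the general idea (this is NOT the module's actual code, just a sketch of the technique with hypothetical function names), trapping die with a $SIG{__DIE__} handler could look something like this:

```perl
use strict;
use warnings;

# Turn an exception into a single Nagios status line.
sub nagios_line_for_exception {
    my ($exception) = @_;
    my $msg = "$exception";
    $msg =~ s/\s+/ /g;      # Nagios wants one line of output
    $msg =~ s/^ | $//g;     # trim the leftover leading/trailing space
    return "CRITICAL - plugin died: $msg";
}

# Install a handler so that an uncaught die ends up on STDOUT with exit code 2.
sub install_die_handler {
    $SIG{__DIE__} = sub {
        my ($exception) = @_;
        return if !defined $^S || $^S;   # compiling or inside an eval: not fatal
        print nagios_line_for_exception($exception), "\n";
        exit 2;                          # 2 is the Nagios CRITICAL exit code
    };
    return;
}
```

A plugin would call install_die_handler() once at startup. Exceptions that are caught by an eval are left alone, so modules that use die internally keep working.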

This module was motivated by a real case. We were (and actually are) monitoring web services with the CPAN SOAP::Lite module. These web services fail very often due to causes we can't control. So I have had the opportunity to see the Nagios check that hits them in a variety of cases when the web service / server is failing. I've gone through 4 (or so) revisions of the code that returned UNKNOWN states in corner cases where the client module would behave in unexpected ways, and a couple of them were "die cases" that I wrapped an eval around. But I finally thought that this could maybe be done in a better way.

Command Line arguments vs Environment Variables [Feedback II]

Posted on 26/11/

Use command line arguments for your service checks and notification scripts, as Nagios will be able to optimize them. Nagios 2 used to calculate all the values for all the macros before executing a command. The number of macros is quite big (see v2 vs v3), so there's a lot of time wasted calculating values that will never be used.

Nagios 3 will just look at the macros it has to resolve and only calculate those. Of course, it cannot look at or interpret what environment variables a check or notification needs. If you were relying on getting info from the environment variables, then you won't find any data, unless you tell Nagios to revert to the old behaviour (and pay the penalty of calculating everything every time). See for more info
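As a sketch of what "use command line arguments" can look like in a Perl notification script: the option names and the function parse_notification_args are made up for the example, and the macros in the comment are just one possible selection.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical options of a notification script. In the Nagios command
# definition they would be filled in from macros, for example:
#   notify.pl --host '$HOSTNAME$' --state '$SERVICESTATE$' --contact '$CONTACTPAGER$'
# so Nagios only has to resolve the macros that appear on the command line.
sub parse_notification_args {
    my (@argv) = @_;
    my %opt;
    GetOptionsFromArray(\@argv, \%opt, 'host=s', 'state=s', 'contact=s')
        or die "Bad arguments\n";
    for my $required (qw(host state contact)) {
        die "Missing --$required\n" unless defined $opt{$required};
    }
    return \%opt;
}

# In the script itself: my $opt = parse_notification_args(@ARGV);
```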

Being the guy at the door

Posted on 26/11/

Sometimes you just can't rely on the Apache module for what you are doing. Maybe because you are not using Apache... or maybe because you find yourself in a situation where your application runs on a separate server, and Apache is just proxying requests back to you. The backend server doesn't receive the famous REMOTE_USER environment variable because the environment isn't passed when requests are proxied. If you know a way of getting the ENV to the backend server, drop me a mail (pplu at capside.com).

So you are on your own! Just very recently the Apache::AuthTkt module got updated with a method to get info from a ticket back. That is: it used to be a one way ride: you could generate tickets, but from a ticket you couldn't get anything back, so you couldn't validate the tickets you generated (the module is supposed to do that).

Getting the application to properly handle the tickets is not that straightforward, so I'll detail what you have to do to get it (hopefully) right:

  • no ticket: Show the login screen. Verify the credentials supplied on the login screen against the credential db of your choice and issue a ticket if they are correct. Redirect to the original (protected) URL. This time you'll have a valid ticket and get past. Of course, instead of showing the login screen you can show contents for anonymous users, if you like.
  • ticket expiry: when you parse the ticket you get the timestamp of the time it was generated. You have to check whether it has expired (ticket.ts + seconds for which the ticket will be considered valid < now). If the ticket has expired: show the login screen.
  • ticket renewal: when the ticket is close to expiring, your application should renew it (generate another one with a new timestamp) so the user doesn't suddenly get logged out. If you don't, your logins will only last for a maximum of expiry seconds...
  • cross domain authentication: take into account that the ticket can be sent via GET instead of via cookie.
  • ticket tampering: the logged-in userid, timestamp, and tokens (if any... see docs for more details) are being transmitted in almost clear text. So what if someone changes the data in the ticket and submits that? Luckily there is a digest field in the cookie that gets formed with MD5(clear text info + ip address + the secret). The real implementation does more things, but this serves to make my point clear. The server can check whether any of the clear info or the IP was changed by just regenerating the digest and comparing it to the one that was received in the cookie. If the received digest doesn't match the regenerated one: show the login screen.
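Put together, the checks above could be sketched like this. It uses the simplified digest from the last point, not the real mod_auth_tkt ticket format, and the names ticket_digest, check_ticket and renew_after are all hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Simplified digest from the text: MD5(clear text info + ip address + secret).
# The real mod_auth_tkt digest is built differently; this only shows the idea.
sub ticket_digest {
    my ($t, $ip, $secret) = @_;
    return md5_hex(join '|', $t->{user}, $t->{ts}, $t->{tokens}, $ip, $secret);
}

# Returns 'login' (show the login screen), 'renew' (issue a fresh ticket)
# or 'ok'. $ticket is undef when the request carried no ticket at all.
sub check_ticket {
    my ($ticket, %arg) = @_;
    return 'login' unless $ticket;                               # no ticket
    my $expected = ticket_digest($ticket, $arg{ip}, $arg{secret});
    return 'login' unless $ticket->{digest} eq $expected;        # tampered
    my $age = $arg{now} - $ticket->{ts};
    return 'login' if $age > $arg{timeout};                      # expired
    return 'renew' if $age > $arg{timeout} * $arg{renew_after};  # close to expiring
    return 'ok';
}
```

With timeout => 600 and renew_after => 0.5, a ticket older than 5 minutes gets renewed on the next request, and one older than 10 minutes sends the user back to the login screen.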

On this last point we had a bit of a surprise. In Apache::AuthTkt you could call the new method parse_ticket, but it didn't return the digest, and it didn't do the validation. So if you were relying on that method to see if the ticket was valid, you would be accepting tampered tickets. So Ton Voon and I updated the module so it would have a new method, valid_ticket, that verifies the digest and only returns data if the ticket has not been tampered with. Hopefully the patch will get to the CPAN Apache::AuthTkt module soon. Ticket expiry and renewal are still the application's responsibility.

The contributed PHP and Python APIs cannot parse and validate the cookie. So if you are using those languages, consider extending those contributed modules to do ticket validation.

PHP is almost always running under Apache, and if it runs under FastCGI there is no problem: it will inherit the REMOTE_USER environment variable, and you won't even notice. I suppose that python boys can rapidly implement the parse_ticket & valid_ticket methods with their eyes closed.