My first YAPC

Posted on 25/08/

Back from vacation, and back from YAPC::EU. It's been two weeks since YAPC, and I'd like to share a bit of my experience. First: the YAPC was great! YAPCs are community-driven events, so they are more informal and far less commercial than other conferences, and that's great! Note that being informal is not being unprofessional. The organization and the talks were great.

Some of the things I learned/did/have to take a look at after the YAPC are:

  • GIT: I attended Scott Chacon's GIT 101 talk. I had worked with GIT for the last Catalyst::Plugin::Server update. Scott's talk was an eye-opener. I finally understood some of the things I was simply "typing in" when working with GIT.
  • Workflow.pm: I attended Jonas B. Nielsen's talk about Workflows. In the questions round somebody asked if there was integration with Catalyst, and Jonas replied that it was a question that appeared on the Workflow mailing list every now and then. I pointed out that we (at CAPSiDE) had integrated Workflow.pm as a Catalyst model, and we spent some time looking at it. I told Jonas that we could publish the model on CPAN, so now I have to clean it up a bit, and publish it.
  • YAPC in BCN: Jonas also suggested having a YAPC in Barcelona, so I'll have to bug BCN.pm at our next meetings :). I think we'll have to organize a Perl Workshop before going big with a YAPC.
  • Perl Data Warehouse Toolkit: Nelson Ferraz gave a talk about how Perl was ideal for data warehousing, but that there were no modules on CPAN referencing data warehouses. He proposed a framework, and I see he is hard at work.

I attended a lot more talks, and have a bunch more things to get into, but these are the most remarkable ones.

Also, I gave my talk about Writing Nagios Plugins in Perl. I hope we'll see more people writing their own plugins for Nagios with the help of Nagios::Plugin and the other modules out there. I had 50 minutes allotted, and ran a bit short (I think I finished in 40m or so...). I have to finish the test suite for Nagios::Plugin::Differences, and publish it too.

Nagios::Plugin::DieNicely 0.05

Posted on 03/06/

Got the tests for Nagios::Plugin::DieNicely working on all platforms except Windows...

Nothing really wrong with the module. The strangeness with the exit codes was about how die works.

die doesn't imply a fixed exit code. It will exit perl with the value of $! (the last system error code) if it's non-zero. Otherwise, the exit code will be ($? >> 8) (the status of the last child process) if that is non-zero. Finally, if both $! and $? are zero, the exit code will be 255.

And why was I so surprised? Because what was happening is that on some platforms $! has a value of 9 just after perl's initialization (in a BEGIN block), probably because some system call puts a 9 in $!.
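
You can see the behaviour from the shell with a couple of one-liners (a quick sketch following the perldoc -f die rules; the prompt is just illustrative):

bender:~$ perl -e '$! = 0; die "boom\n"'; echo $?
boom
255
bender:~$ perl -e '$! = 9; die "boom\n"'; echo $?
boom
9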

OSDC: Looks interesting!

Posted on 31/05/

The people at Netways are hard at work on this year's Open Source DataCenter Conference - OSDC.

I wanted to present a paper for it this year... but I arrived late. I'm sure I'll present a paper for the OSMC, though. This year's lineup of speakers for OSDC is really awesome! For calendar reasons, I don't think I'll be able to attend, although I'm tempted to juggle things around to get there.

Release of Nagios::Plugin::DieNicely with surprises

Posted on 28/05/

I released version 0.04 of Nagios::Plugin::DieNicely with an interesting bug fix that has generated an interesting surprise.

First: the bug: when debugging a script that uses N::P::DN, the debugger would exit immediately:

bender:~$ perl -d check_something

Loading DB routines from perl5db.pl version 1.3
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

...[SUPPRESSED TEXT]...

CRITICAL - Can't locate Term/ReadLine/Gnu.pm in @INC (@INC contains: t/lib/ lib/ /etc/perl /usr/local/lib/perl/5.10.0 /usr/local/share/perl/5.10.0 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl .) at (eval 4)[/usr/share/perl/5.10/Term/ReadLine.pm:320] line 1.
Nagios::Plugin::DieNicely::_nagios_die(lib//Nagios/Plugin/DieNicely.pm:35):
35:         die @_ if $^S;
Can't locate object method "new" via package "Term::ReadLine" at /usr/share/perl/5.10/perl5db.pl line 5998.
bender:~$

Strange... and also... if you did use Math::BigInt try => 'GMP'; with N::P::DN enabled, the script would die too...

The root cause was a bit harder to detect, until I ran into the documentation about the $^S variable (which N::P::DN uses to know whether the exception was generated in an eval {} or not). I originally thought that it could only have two values, but as stated in perldoc: it has three!
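
You can see the three values with a few one-liners (a quick sketch; perldoc perlvar has the authoritative description):

bender:~$ perl -e 'print defined $^S ? $^S : "undef", "\n"'
0
bender:~$ perl -e 'eval { print defined $^S ? $^S : "undef", "\n" }'
1
bender:~$ perl -e 'BEGIN { print defined $^S ? $^S : "undef", "\n" }'
undef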

So... it looks like the debugger throws an exception while loading... And playing around with why using N::P::DN with Math::BigInt was failing too, I started suspecting that the problem was with programs that dynamically load classes. So a quick test case:

bender:~$ perl -MNagios::Plugin::DieNicely -e"BEGIN { eval { die 'X' } }"
CRITICAL - X at -e line 1.
revealed everything. Note that although the exception is raised in an eval... N::P::DN intercepts it, thinks it wasn't in an eval block, and exits in a Nagios-friendly way. This is because when you're in a BEGIN block and the code dies, $^S will be undef. It doesn't matter whether it's in an eval or not, as
bender:~$ perl -I lib/ -MNagios::Plugin::DieNicely -e"BEGIN { die 'X' }"
CRITICAL - X at -e line 1.
demonstrates. So there is no alternative: I just have to propagate the exception... If it is caught, then everything will be OK. If it's not... then the plugin will die in a non-Nagios-compatible manner: no output, and exit code 255.

So I fix N::P::DN, write the tests and get a surprise!
bender:~/Nagios-Plugin-DieNicely-0.04$ perl -I lib/ -I t/lib/ t/bin/notok_using_die_in_begin.t
DIED!!! at t/lib//DieModule.pm line 4.
BEGIN failed--compilation aborted at t/lib//DieModule.pm line 5.
Compilation failed in require at t/bin/notok_using_die_in_begin.t line 4.
BEGIN failed--compilation aborted at t/bin/notok_using_die_in_begin.t line 4.
bender:~/Nagios-Plugin-DieNicely-0.04$ echo $?
9
bender:~/Nagios-Plugin-DieNicely-0.04$

It's exiting with exit code 9. I thought: "that's strange... but OK... Nagios doesn't care whether the exit code is 255 or 9...". So I upload the module to CPAN, and get yet another surprise! SOME tests start being reported as FAIL on CPAN Testers.

And the test that is failing:

#   Failed test 'Exit code 9 for ./t/bin/notok_using_die_in_begin.t'
#   at t/002_test_outputs.t line 87.
#          got: 255
#     expected: 9
# Looks like you failed 1 test of 108.

So now the question is: what's happening? It doesn't seem critical, but I don't like my tests failing, and I want to at least understand what's going on. Any suggestions?

Taking Nagios::Plugin::WWW::Mechanize for a ride

Posted on 20/05/

I love Nagios::Plugin because it simplifies (and beautifies) the plugins for Nagios that I write. Ton Voon released Nagios::Plugin::WWW::Mechanize some time ago... A really good helper for writing plugins that have to do web-browsing related things...

I used the module some time ago to write a couple of simple plugins, but these days I've been confronted with some trickier plugins, and Nagios::Plugin::WWW::Mechanize has been able to make me smile, because it has helped, and hasn't got in the way with the trickier bits.

So let me show the tricky stuff that Nagios::Plugin::WWW::Mechanize let me do:

Reading the headers of the response
Since the module lets you access the raw "mech" object, you can access the headers:

$np->mech->response->headers->as_string

Handling Gzipped content
WWW::Mechanize will announce that it accepts data encoded in gzip format... but when the server returns a gzipped body it does nothing about it! It just returns the gzipped stream of bytes in the body of the response. So if you were expecting to do something with $np->content, you get lots of garbage.
So you start to wander around CPAN and the interweb, and you find: WWW::Mechanize::GZip. It looks promising, but you think: there is no way Nagios::Plugin::WWW::Mechanize will integrate with it... Well... You're wrong:

my $np = Nagios::Plugin::WWW::Mechanize->new(
  'mech' => WWW::Mechanize::GZip->new(autocheck => 0)
);
Ton thought of your salvation... you can pass Nagios::Plugin::WWW::Mechanize an already-constructed mech object!

Making Nagios::Plugin::WWW::Mechanize go through a proxy
And last, but not least... the plugin I was developing had to request a URL (e.g. http://www.example.com/a/url) from another server (not the one that www.example.com resolves to). At first I found a pretty nasty solution browsing through PerlMonks:

use LWP::UserAgent;
use LWP::Protocol::http;
push(@LWP::Protocol::http::EXTRA_SOCK_OPTS, "PeerAddr" => "IP_OF_THE_WEB_SERVER");

I wouldn't recommend this way of tricking WWW::Mechanize (which is an LWP::UserAgent) into contacting another server... (I'm just documenting it for possible future use ;)).
I finally found an elegant solution: LWP lets you define proxies!
my $proxy = $np->opts->proxy;
if (defined $proxy){
  $np->mech->proxy(['http', 'https'], $proxy);
}
You can also make any plugin that uses Nagios::Plugin::WWW::Mechanize proxy through another server just by defining an environment variable:
http_proxy='http://PROXY_IP/' ./check_something_with_n_p_www_mech -url http://www.example.com/

I did find a couple of things that could maybe be addressed.

  • when ->get(URL) fails the module calls nagios_exit(CRITICAL) without giving you a chance to do anything with the failure.
    Even if you wrap the call to get in an eval {}, the script exits, and the contents of the request are output. The thing is: maybe you want to do something on failure, like
    eval {
      $np->get('http://x.x.x.x/url');
    };
    if ($@) {
      $np->nagios_exit('CRITICAL', "HTTP Status " . $np->....->status);
    }
    

    or even
    eval {
      $np->get('http://x.x.x.x/url');
    };
    if ($@) {
      $np->get('http://y.y.y.y/url');
    }
    # absolutely nothing bad happened...
    
    To solve this in a back-compatible way, I propose that get only calls nagios_exit if the calling code is not wrapped in an eval ($^S tells you this), and dies otherwise, so the user can catch the exception (http://perldoc.perl.org/perlvar.html).
  • the plugins identify themselves as WWW::Mechanize in the User-Agent header
    Why not make them identify as pluginname/version? I had to do this in my script (see the sketch after this list), but it would be nice if they did it automatically.
  • I had to use Nagios::Plugin; use Nagios::Plugin::WWW::Mechanize; to call
    $np->nagios_exit(CRITICAL, 'An error has occurred');

    If you don't use Nagios::Plugin, you won't get the constants CRITICAL, WARNING, etc. imported into your namespace. Maybe Nagios::Plugin::WWW::Mechanize should do that too.
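
For reference, this is all it took on my side for the User-Agent point (a one-line sketch; agent() is the standard LWP::UserAgent method that WWW::Mechanize inherits, and check_something/0.01 is a made-up name and version):

$np->mech->agent('check_something/0.01');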

Writing plugins for Nagios presentation

Posted on 30/04/

I'm just uploading the presentation of the talk I gave at the Perl Mongers BCN group meeting on how to write plugins with the Nagios::Plugin CPAN module.

This presentation is based on one of my blog articles: "Writing great Nagios plugins" (just find that text in this blog to read it). Hopefully it will give you enough text to follow the presentation.

Download the presentation

Great surprise

Posted on 10/06/

Yesterday I stumbled upon a fact that made me feel proud of the work that I publish to CPAN, and gave me the feeling that someone finds my modules useful.

Catalyst::Authentication::Credential::Authen::Simple, despite its long name, has been incorporated into the Debian package libcatalyst-modules-perl, which got shipped with Lenny!

:)

Another Catalyst::Authentication::Credential::Authen::Simple release

Posted on 24/04/

Just uploaded v0.05 to CPAN!

Thanks to a suggestion from Tomas Doran, the module now has fewer dependencies. Instead of Module::Load, it uses Catalyst::Utils::ensure_class_loaded.

I'm happy to see the community is using it and participating. Somehow it looks like the module is finding a place in people's applications! These people get a place in the module (see the THANKS section).

Now I have to catch up with a mailing list post from Matt S Trout...

On Wed, Oct 01, 2008 at 04:58:51PM +0200, Jose Luis Martinez wrote:
> Tomas Doran escribió:
> 
>>Unfortunately, there is no such thing as an LDAP credential module on 
>>CPAN at the moment.
> Catalyst::Authentication::Credential::Authen::Simple should do the trick. 
> http://search.cpan.org/~jlmartin/Catalyst-Authentication-Credential-Authen-Simple-0.02/lib/Catalyst/Authentication/Credential/Authen/Simple.pm 
> becasue Authen::Simple does support LDAP.

Fucking awesome.

This needs to be more widely publicised, do you think you could do doc
patches fr C::P::Authentication and a wiki write up?  :)

- 
      Matt S Trout       Need help with your Catalyst or DBIx::Class project?
   Technical Director                    http://www.shadowcat.co.uk/catalyst/
 Shadowcat Systems Ltd.  Want a managed development or deployment platform?
http://chainsawblues.vox.com/            http://www.shadowcat.co.uk/servers/

So let's see if the next thing is the doc patch and the wiki writeup!

Catalyst::Authentication::Credential::Authen::Simple up to 0.04

Posted on 23/04/

I've applied a couple of community suggestions to Catalyst::Authentication::Credential::Authen::Simple.

  • Tobjorn Lindahl pointed out I was using some log->debug calls without verifying whether the app was in debug mode. That produced version 0.03.
  • Dylan Martin pointed out that the Catalyst::Log object could be passed to Authen::Simple objects, so the log information from Authen::Simple could get logged with Catalyst (in version 0.04).

Thanks for the pointers!

Thought operator overloading was the devil's work?

Posted on 07/04/

Traditionally, operator overloading has been criticized and avoided, to the point where it has fallen into the back of our memories. Java didn't even implement it! But looking at it more closely... it seems that all the arguments against operator overloading are aimed at statically typed languages. In short: operator overloading only gives you syntactic sugar so you don't have to sprinkle a bunch of (ugly) method calls around your code, but you pay the price of high maintenance costs [1] and the potential to suffer pain [2]. But nothing is said of dynamically typed languages! And I think there are times when this "demonized" technique comes in handy in dynamically typed languages. There it has a use that is not only for "aesthetic" reasons: it provides functionality that static languages can't, because it lets the programmer outgrow a class's API.

The thought came to me when I was trying to out-trick a class that someone else had written. The class I was using was programmed in Perl (not rare in my case). It expects you to call methods with scalars that contain numbers, and operates on them with the normal operators (+ - / *). Perl scalars cannot handle integers bigger than 2^32 on a Perl compiled with 32-bit integers (note: bigger numbers are converted to floats, and therefore lose precision). I had to pass BIG integers. So... I looked up the options Perl gave me for big integers, and found Math::BigInt. But now my worry was: the classes I use don't have explicit support for BigInts, and I'm not the author of some of them! Luckily I found this in the Math::BigInt documentation: "All operators (including basic math operations) are overloaded". Bingo! Now I can pass BigInts into classes that never expected them, and they can operate without a code change. And everything works without a single hiccup.

So for some new classes I'll publish to CPAN, that have to do some basic mathematical operations on data structures, I'm thinking of keeping the functionality to the bare bones, and relying on clever programmers to do clever things with the inputs.

Let me explain: I have to calculate growth rates from measurements that get taken at different times. It's as simple as this:

Given:
M1: measurement 1 at timestamp t1
M2: measurement 2 at timestamp t2
M is growing (like your kids' height), therefore M2 >= M1

I have to do two basic operations: difference (subtraction) and division to get rates (things measured per second)

(M2-M1)/(t2-t1) ... (yes... we're deriving (calculating the "rate of change"). (It's like calculating velocity from distance, but with "things" instead of meters (feet for people that drive on the wrong side of the road :D) (loooove nested parentheses!))).

That's not complicated, and has nothing to do with overloading operators, but somehow, that's the point! My class only has to know how to subtract and divide "things". In practice I'll be doing these operations with lots of magnitudes that have been taken at the same time. So I could choose to put them all in hashrefs and then implement this:

sub rates {
  my ($self) = @_;
  my $delta_t = $self->{'t2'} - $self->{'t1'};
  my $rates = {};
  foreach my $key (keys %{$self->{'m2'}}){
      $rates->{$key} = ($self->{'m2'}->{$key} - $self->{'m1'}->{$key}) / $delta_t;
  }
  return $rates;
}

Leaving the implementation limited to one level deep hashrefs.

OR

sub rates {
  my ($self) = @_;
  return ($self->{'m2'} - $self->{'m1'}) / ($self->{'t2'} - $self->{'t1'});
}

And let m1 and m2 be MagicHashSets (for my use case) with the following operations defined:

MagicHashSet1 - MagicHashSet2: for each key in MagicHashSet1, subtract MagicHashSet2's value for that key.
MagicHashSet1 / scalar: for each key in MagicHashSet1, divide that key's value by the scalar.
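
A minimal sketch of what such a class could look like (MagicHashSet is the hypothetical name from this post, not a published module; error handling for mismatched keys is left out):

package MagicHashSet;
use strict;
use warnings;
use overload
    '-' => \&_subtract,
    '/' => \&_divide;

sub new {
    my ($class, %values) = @_;
    return bless { %values }, $class;
}

# MagicHashSet1 - MagicHashSet2: subtract key by key
sub _subtract {
    my ($self, $other) = @_;
    return __PACKAGE__->new(map { $_ => $self->{$_} - $other->{$_} } keys %$self);
}

# MagicHashSet / scalar: divide every value by the scalar
sub _divide {
    my ($self, $scalar) = @_;
    return __PACKAGE__->new(map { $_ => $self->{$_} / $scalar } keys %$self);
}

1;

If the values are themselves MagicHashSets, the overloaded operators recurse for free, which is the n-level-deep property mentioned below.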

You can observe that MagicHashSets are easily converted to MagicArraySets.

Note that you get one more property for free from this design... If the values of the keys in the MagicHashSets are themselves MagicHashSets... You get n-level-deep subtraction and division for free! Yay!

So expect me to be posting my findings on this adventure in my next posts :). I really hope they are good findings, and not very bad ones. After all... Maybe operator overloads are the devil's work }:D

CPU checking

Posted on 02/04/

I'm going to release the check_linux_cpu check that we've been beta testing at CAPSiDE. I looked around on Nagios Exchange and none of the existing plugins scratched my itch... So I made a new one. What was wrong with the other plugins?

  • No performance data. At CAPSiDE we want all the plugins we use to output perfdata.
  • Calculation of the CPU usage. Read below to find out why.
  • Dependency on external utilities (mpstat, iostat, Net::SNMP, etc.)

How is the CPU usage percentage calculated in our plugin?

/proc/stat has the info needed to calculate CPU usage. Every time you read it, it gives you the number of time slices each processor has spent doing what (computing in user space, computing in kernel (system) space, servicing interrupts, etc). But those slices are absolute (counted since the OS booted).

So if you're curious about knowing what your processor has been doing, you just have to sum up all the time it has been doing something, and then calculate the proportion of time that it was doing what you're interested in.

For example, let's suppose a /proc/stat that reports user, system, nice and idle time in each column:

cpu 8000 2000 1000 9000

8000 + 2000 + 1000 + 9000 = 20000 time slices doing things.

How much of that time was spent in user? 8000/20000 = 0.4
And in system? 2000/20000 = 0.1
In nice? 1000/20000 = 0.05
Idle? 9000/20000 = 0.45

This information can be useful, but it can be misleading if you monitor it, because it accounts for time since the computer was turned on. That means that if you have little activity at night, idle will gain weight. And therefore, your user time can spike up to 100% for a long time, and the percentages will not vary all that much.

Think of an obsessive person that notes down all the time he has spent on all of his activities. When you ask him "what have you been doing all your life?", he'll tend to respond: "sleeping" :D.

A more useful metric would be: "what have you been doing since the last time I asked you". He could tell you: "working on the presentation for tomorrow".

Well let's do the same with our CPU! Since the kernel doesn't have any interface to query what it's been doing since the last time we were interested, we'll have to ask twice:

measure 1: cpu 8000 2000 1000 9000
measure 2: cpu 9500 2500 1500 9500

What has the CPU been doing between measure 1 and measure 2?

9500 - 8000 = 1500 in user
2500 - 2000 = 500 in system
1500 - 1000 = 500 in nice
9500 - 9000 = 500 in idle


1500 + 500 + 500 + 500 = 3000 in total

So... it's been working on:
1500/3000 = 0.50 in user
500/3000 ≈ 0.17 in system
500/3000 ≈ 0.17 in nice
500/3000 ≈ 0.17 in idle
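
A minimal sketch of that delta arithmetic in Perl (this is not the actual check_linux_cpu code; it keeps the previous sample in a hypothetical state file so that each run compares against the previous one):

use strict;
use warnings;

my $state_file = '/tmp/cpu_check.state';   # hypothetical location for the previous sample

# read the aggregate "cpu" line from /proc/stat
open my $stat, '<', '/proc/stat' or die "cannot read /proc/stat: $!";
my ($cpu_line) = grep { /^cpu\s/ } <$stat>;
close $stat;
my (undef, @now) = split /\s+/, $cpu_line;

# load the previous sample, then store the current one for the next run
my @previous;
if (open my $prev, '<', $state_file) {
    my $line = <$prev>;
    @previous = split /\s+/, $line if defined $line;
    close $prev;
}
open my $save, '>', $state_file or die "cannot write $state_file: $!";
print {$save} "@now\n";
close $save;

exit 0 unless @previous;   # first run: nothing to compare against yet

my @delta = map { $now[$_] - $previous[$_] } 0 .. $#now;
my $total = 0;
$total += $_ for @delta;
printf "column %d: %.2f\n", $_, $delta[$_] / $total for 0 .. $#delta;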

Some plugins calculate CPU usage over an X-second interval (mostly the ones that depend on external utilities). I don't think this is an accurate way to do the measurement either, because that obsessive person will either say "I'm talking with you" or "I was doing the presentation", when just before that he had been having a snack :D.

Curiosity:
Execute top. Quickly look at the CPU usage. Does the first reading it displays seem familiar now? And the rest?

So... is the plugin fundamentally flawed in some way? Am I just plainly wrong? What do you think?

Squeezing the juice out of check_mysql

Posted on 01/04/

The check_mysql plugin from the Nagios Plugins project is useful, but at CAPSiDE we're quite obsessed with having performance data and recording it to graph it later. That way we have better visibility into the systems we're monitoring. Opsview will automatically detect performance data from the plugins and graph it. But in the case of check_mysql we're out of luck. It outputs useful data, but it's not performance data.

Luckily the rrdgraph tool that Opsview uses lets you do some tricks for checks that don't output performance data (like check_mysql). Its map file lets you specify a set of regular expressions to turn the output of plugins into graphable data (Opsview provides a standard set of mappings with its base installation).

The output of check_mysql looks like this:

Uptime: 801963 Threads: 5 Questions: 55210201 Slow queries: 246 Opens: 25611 Flush tables: 1 Open tables: 55 Queries per second avg: 68.843

Paste the lines below into /usr/local/nagios/etc/map.local on your Opsview Master server:

/output:Uptime: \d+  Threads: (\d+)  Questions: (\d+)  Slow queries: (\d+)  Opens: (\d+)  Flush tables: (\d+)  Open tables: (\d+)  Queries per second avg: ([-.0-9]+)(?: Slave IO: (\w+) Slave SQL: (\w+) Seconds Behind Master: (\d+)|)/
and push @s, [ "mysql",
             [ "threads", GAUGE, $1 ],
             [ "questions", DERIVE, $2 ],
             [ "slow", DERIVE, $3 ],
             [ "opens", DERIVE, $4 ],
             [ "flush_tables", DERIVE, $5],
             [ "open_tables", GAUGE, $6],
             [ "avg_qps", GAUGE, $7],
             defined $8?[ 'slave_io', GAUGE, (lc($8) eq 'yes'?1:0) ]:(),
             defined $9?[ 'slave_running', GAUGE, (lc($9) eq 'yes'?1:0) ]:(),
             defined $10?[ 'sec_behind', GAUGE, $10 ]:()
             ];
Opsview will then be able to generate graphs for all your configured check_mysql checks (note that after a couple of check results have come in, you will have to reload Opsview to see the icon that links to the graphs).

This map file takes into account whether you execute check_mysql with the -S option to monitor MySQL slave status, and creates the slave_io, slave_running and sec_behind data channels.

You can get pretty graphs like this:

Try it out! Tell me if it works for you, and please correct and criticize.

DISCLAIMER: Editing the map.local file can leave rrdgraph data collection broken. Please run

perl -c /usr/local/nagios/etc/map.local
to make sure it says /usr/local/nagios/etc/map.local syntax OK, and then pay special attention to whether the other RRDs are still working correctly. Please read the Opsview docs for more information.

Nagios::Plugin::DieNicely 0.03

Posted on 31/03/

I'm glad I started it out as a module, because now and then I get the opportunity to find out it doesn't do what I (and probably you) expect it to do... I've squashed a bug in v0.03 (uploaded to CPAN during my "silence" in the blogosphere).

If you were using exception handling in perl (maybe you don't do it explicitly, but a module you use does), Nagios::Plugin::DieNicely would catch the exception and exit your script instead of letting it be captured in the eval.

I've discovered a couple more flaws in the module...

  • perl -d my_nagiosplugin
    is broken when Nagios::Plugin::DieNicely is used.
  • Nagios::Plugin::DieNicely doesn't play well with the way the Math::BigInt module detects the library it wants to use. Declare the math library you want to use explicitly and everything works OK.
    use Math::BigInt lib => 'Calc';
    

I haven't had time to look into why these bugs are there. Any ideas?

Pause... But back to blogging

Posted on 31/03/

I haven't been updating the blog lately, but that doesn't mean I haven't been working :)

These days I'll be posting about some Nagios / Opsview related stuff I've been up to lately.

Catalyst::View::RRDGraph

Posted on 26/11/

I needed to render some RRD graphs from a Catalyst application. Before, I was using rrdcgi. Not that I couldn't use it together with WrapCGI, but I wanted to write the HTML templates in Template Toolkit (as always), because rrdcgi templating is not all that powerful.

So you get the RRDs perl module on one side, and you get Catalyst on the other, a bit of glue, and there you have it: Catalyst::View::RRDGraph

Just put the graph definition on the stash, and call the view. The view outputs images, so you can use them from an HTML page that you have templated in whatever language you want.
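
A controller action ends up looking more or less like this (a sketch from memory: the stash key and the exact argument format are documented in the module's POD, so double-check there; the RRD path and DS names are made up):

sub load_graph : Local {
    my ($self, $c) = @_;
    # the graph definition is just the list of arguments you would hand to RRDs::graph
    $c->stash->{graph} = [
        '--title'          => 'Load average',
        '--vertical-label' => 'load',
        'DEF:load=/var/lib/rrd/bender_load.rrd:load1:AVERAGE',
        'LINE2:load#0000FF:load1',
    ];
    $c->forward($c->view('RRDGraph'));
}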

<Img src="[% c.uri_for('controller/that/uses/rrdgraph') %]">

As always, feedback is welcome

Working with Ton Voon

Posted on 26/11/

Ton Voon, CTO of Altinity, was at CAPSiDE last week. He was here doing joint development on Opsview, and giving us an inside view of the bowels of the beast. I always say that to get involved with a project, having its source code is not enough. You have to have a "photo" of the project as a whole, and that is pretty hard to get, because most of the time it isn't documented anywhere. So that's what Ton has given us. One day I'll blog about having that "photo" of a project...

I have to say I enjoyed Ton's stay here, and it was a great pleasure to work together. His technological skills and personal aptitudes (good communicator, ability to envision solutions that can fit everyone's needs, willingness to work, etc) made his stay here in Barcelona a very productive one.

Being able to see how other projects are managed gives you a view of how your own projects are managed too, and of the problems and considerations that these other projects have, that maybe you don't have right now, but could have someday, or that you can apply to yourself. On that front I'm very fond of how we are managing projects at CAPSiDE and think we are on the right path. Of course there is always room for improvement.

At CAPSiDE we are committed to contributing to Opsview, and will try to do our bit so that it can evolve into an even better monitoring solution than it is.

Simple Cross-Domain Ajax Proxy

Posted on 26/11/

Developing a feature for one of our products, we needed to retrieve pages from other domains via XMLHttpRequests from the browser. As you already know, browsers don't let you do cross-domain requests as a security measure, so you have to use a proxy on the same domain that your application is running on.

There are a lot of ways of doing it, and I wanted one where I didn't have to install additional software and such. There are PHP proxies, Java proxies, etc. I didn't want to write a Perl proxy (just to not bloat the solution). There were people doing it with Apache (those I liked), but in an unmaintainable way (adding one configuration per domain to retrieve info from), and our application required data to be retrievable from any domain. So here is the recipe I whipped up:

<Proxy http>
    Order Deny,Allow
    Allow from all
</Proxy>

RewriteEngine on
RewriteRule ^/web-proxy/(.*)$ $1 [P]

Now you only have to make requests to http://webserver/web-proxy/DESIRED_URL. Please take into account that if you do not restrict the web-proxy location to authorized users, you have an open proxy (don't do that).

Opsview Single Sign-On coming soon

Posted on 26/11/

People have been asking on the list to be able to use the Single Sign-On feature implemented in Opsview to authenticate against an LDAP, for example.

I've been trying to get it working with the current codebase, but I'm sad to say that it's not ready yet. While looking through the code, I found a comment that resolved my doubts:

  # This setting of the user_exists means that Opsview is the central
  # login point, not the authticket
  # Maybe possible in future to allow a trust from the external source
  # so the user can be given from the auth ticket
I love code comments that really help you see the decisions that were made (those are good code comments), although this one was a bit of a show stopper :(

So... the current codebase can't trust a ticket generated by a 3rd party source. You CAN use the ticket generated by Opsview to authenticate against other sources, though, as it's fully valid.

I've contributed changes to Opsview that are awaiting review. These changes let the Catalyst framework (which Opsview uses) log in the user that is provided through a 3rd party ticket, so if everything goes well, I will be able to show you how to use Single Sign-On to authenticate Opsview users in the next Opsview release (the article is half-written ;)).

Blosxom

Posted on 26/11/

New blog. New blogging software. The last one was Movable Type. Not bad... it was written in perl! ;) But it's not totally free, and I didn't want something that complicated (there are features I don't need or want, or don't want to spend time discovering what they are). I had heard about a quite minimal blogging system, written in Perl, where almost all functionality is a plugin.

To create a blogging habit, things have to be easy (for me). Now I can blog from almost anywhere. I can open an SSH session and blog from vi, or install a backend for mobile devices (I'm writing this article from my PDA... while sitting on the couch!).

The blog is going to start out minimal. You can imagine that from the styling... I don't even want comments for now (blog comment spam is horrible to cope with). If you have any comment or suggestion, drop me a mail at pplu@capside.com.

Opsview custom SMS notifications

Posted on 26/11/

Opsview can now use custom SMS notification methods. I've prepared a mini-howto guide on how to use this feature. Please send in comments, corrections and suggestions. This article will be contributed to the Opsview docs, so all of us will benefit.

Configuring Opsview

Put your custom SMS notification script into

/usr/local/nagios/libexec/notifications

and remember to make it executable by the nagios user. See below for recommendations on how to develop the notification methods.

Sync the plugins to the slaves:

/usr/local/nagios/bin/send2slaves

In Opsview interface

Go to: Advanced -> SMS Notifications -> Create new SMS Notification Methods

  • Name: give it an identifier (without spaces)
  • Run On:
    • Monitoring Server: The command will be run on the Master monitoring server. This is for scenarios where you have to notify from a special device, for example, that isn't available on the slaves. A cell phone attached via a serial cable, a server that is only accessible from the master, etc.
    • Slave: This means the command will be run on the slave that has detected the alert. This is for notification services that will not depend on the server that detected the alert, like an HTTP call to an SMS service.
  • Command: the name of the command in the /usr/local/nagios/libexec/notifications directory. Add any extra parameters that should reach the script (parameters that Nagios doesn't send you).

Go to: Advanced -> System Preferences, choose your new SMS method identifier, and submit the changes

Be sure to have a contact with the SMS number filled in (that will activate the SMS notifications for that contact). Note that the +CCNNNNNNNNN format is no longer enforced; in fact, no format is enforced, as it is the plugin's responsibility to verify that the number's format is correct for its use. Click the "send test SMS" link to try out your notification method.

Reload your Opsview configuration and you're up and running.

More help is available in Opsview docs.

Script guide

The script will receive the SMS number in the NAGIOS_CONTACTPAGER environment variable; in fact, it can play around with all the environment variables listed in the Nagios Macros Reference. Look in the Service Notifications and the Host Notifications columns.

Non-Nagios values can be passed as command parameters: things like --url_to_post_to, --serial-device-to-talk-to, --baud-rate, etc, which you set when you define the "Command" of the new SMS method.

Do the notification magic, print a line of status to STDOUT to help out humans ;), and exit 0 on success, non-zero on failure.
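
As an illustration, a skeleton of such a script could look like this (a sketch only: the HTTP SMS gateway, its --url_to_post_to parameter and the form field names are all made up; your provider's API will differ):

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use LWP::UserAgent;

# extra, non-Nagios parameters arrive on the command line (set in the "Command" field)
my $gateway_url;
GetOptions('url_to_post_to=s' => \$gateway_url) or die "bad parameters\n";
die "need --url_to_post_to\n" unless $gateway_url;

# Nagios hands us everything else through environment variables
my $number  = $ENV{NAGIOS_CONTACTPAGER} or die "NAGIOS_CONTACTPAGER is not set\n";
my $message = join ' ', grep { defined && length }
              @ENV{qw(NAGIOS_HOSTNAME NAGIOS_SERVICEDESC NAGIOS_SERVICESTATE)};

my $response = LWP::UserAgent->new->post($gateway_url, { to => $number, text => $message });
if ($response->is_success) {
    print "SMS queued for $number\n";    # one status line for the humans
    exit 0;
}
print "SMS gateway error: ", $response->status_line, "\n";
exit 1;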

Note: The Opsview 2.11 standard notification scripts relied on getting the SMS number via the command line with the -n parameter (if I remember correctly). These were changed to take it from the environment variables in Opsview 2.12.

Power to the users

Who says "custom SMS notifications" says "do what you want to notify... you have the control". That is, as long as you fill in the SMS number for a contact, the "SMS" notification will be called for it. You can write a log file instead of sending an SMS if you want... Opsview won't care }:)

Conference feedback

Posted on 26/11/

Sorry for the late post, but I've been quite busy after the Nagios Konferenz. I was preparing one macro-post with all the new things I learned, but I'll just split them so they get published quicker!

The conference was really good, and I met lots of people that use Nagios in some way, as well as its main developers and developers of 3rd party software based on Nagios. The conference was sold out, and it was a pleasure to attend. I hope to be there next time.

I attended:

  • Ethan Galstad: Nagios - Current State, Future Plans and Development Roadmap
  • Geert Vanderkelen: Monitoring MySQL
  • Stefan Kaltenbrunner: PostgreSQL Monitoring - Introduction, Internals And Monitoring Strategies for postgresql.org
  • Ton Voon: An active check on the status of the Nagios Plugins
  • Satish Jonnavithula & Steven Neiman: Application Transaction Monitoring using Nagios
  • Malte Sussdorff: Integrating Nagios and ]project-open[
  • Tom De Cooman: Monitoring Tools Shootout
  • Julian Hein: FLEXible Realtime Graphing with the new NETWAYS Grapher v2

A big thank you to Netways for organizing this great event.

Nagios::Plugin::DieNicely v0.02

Posted on 26/11/

Nagios::Plugin::DieNicely now lets you exit with the Nagios status that you like most. The feature was on the TODO list, and now that I'm confident that the tests pass on lots of different perls and platforms (thanks, CPAN Testers!), that I have figured out why there are some FAIL test results, and that there have been requests for it, I have decided to add the feature.

Compatibility should be assured (at least the test suite says so). If you use the module as in v0.01, the exit code will still be CRITICAL. But if you were not all that comfortable with CRITICAL and you would rather have WARNINGs, now you can. Just:

use Nagios::Plugin::DieNicely qw/WARNING/;

You can pass in these identifiers:

  • CRITICAL: The default
  • WARNING: I suppose this one will be the most used...
  • OK: If you use this one, please comment why you would want to do so. I added it just in case someone would want it (I have no crystal ball to say that it isn't useful), and I have not been creative enough to find a real use.
  • UNKNOWN: The purpose of the module is to NOT get UNKNOWNs in Nagios. Why did I add it, then? Well... if you specify UNKNOWN, you will at least get the exception in the Nagios output (instead of it getting lost in limbo).

Give it a ride!

Test::SMTP

Posted on 26/11/

I'm announcing the release of Test::SMTP. This module aims to provide a framework for making SMTP server testing easy. We were doing SMTP testing with an instance of Net::SMTP and Test::More methods, seeing if everything was as expected. All this logic has been encapsulated into Test::SMTP to make testing SMTP servers a little less of a pain.

Please note that this is a 0.01 version and is based on Net::SMTP as the client. Net::SMTP has its limitations as a client that gives the test full control. Don't get me wrong: as a "do the right thing for me when you can" client it's great. Try not to call Net::SMTP methods directly, as this class is a temporary bridge, just so the testing framework can be evaluated by the community (release early, release often).

Things in Test::SMTP that need to be addressed in the future:

  • Test::SMTP can't simulate plain old (HELO) SMTP clients if the server supports ESMTP. The underlying Net::SMTP auto-negotiates EHLO/HELO when an instance is created.
  • Net::SMTP's supports method is called, although it isn't documented in the Net::SMTP docs. Its name does look like it's meant to be public :p
  • No STARTTLS support, because Net::SMTP doesn't support it
  • AUTH is auto-selected. See Net::SMTP for the supported AUTH methods, and its code for how it selects one.

Features:

  • You can simulate multiple clients in the same test. Just call connect_ok more times and you obtain more clients.
  • Simulation of misbehaving clients is supported. Test::SMTP inherits from Net::SMTP, so you have access to the methods of IO::Socket::INET and Net::Cmd. Because of the auto HELO/EHLO you can't issue commands before the HELO phase, though.
  • Mail addresses passed to the Net::SMTP methods to and mail are mangled by Net::SMTP to try to produce well-formed commands for the server. This has been worked around by adding mail_from and rcpt_to methods that issue raw MAIL FROM and RCPT TO commands.

Future plans are to implement a "don't do things automatically" client so you have all (or at least more) control over the client.

Introducing Catalyst::Authentication::Credential::Authen::Simple

Posted on 26/11/

Just got another module out!

This module isn't at all complicated. I'm even surprised that no one has already written it! Authen::Simple is a great authentication framework (thanks to the excellent work of Christian Hansen). We've been using it at CAPSiDE for quite some time now, but we hadn't developed a Catalyst module for it because we normally use mod_auth_tkt, so our Catalyst apps don't authenticate directly. I recall the need for Catalyst apps to authenticate against external datastores coming up on the mailing lists, and a recent conversation with Ton Voon made me think that it was time to write the module so Catalyst can do fancy authentication.

Catalyst::Authentication::Credential::Authen::Simple is just glue between Authen::Simple and Catalyst. It reads the Catalyst app config, instantiates the appropriate Authen::Simple objects, and then just calls authenticate on them when you authenticate from within Catalyst.

It's that simple... Authen::Simple...

Opsview support for NagiosChecker

Posted on 26/11/

I've been using the Firefox NagiosChecker extension together with Opsview. I found this plugin because someone on the Opsview list asked if it was compatible with Opsview, and I tried it out. It worked well, aside from one little issue:

NagiosChecker authenticates with Basic HTTP Authentication, and Opsview doesn't like that. You configure NagiosChecker, and it doesn't work. Opsview needs a valid cookie to authenticate. If you log in to Opsview, you see NagiosChecker start to work. That's because Firefox stores the cookie needed to authenticate, and on the next request NagiosChecker makes, the cookie gets sent to Opsview!

So I only had to make NagiosChecker log in to Opsview the first time it requests the Nagios status screen. I added a checkbox to the NagiosChecker server setup screen so you can tell it it's an Opsview server.

I've contributed the patch to the NagiosChecker project, but in the meantime, I've packaged NagiosChecker with my patch so you can try it out. Feedback is welcome ;)

Download the NagiosChecker with Opsview support. You can also patch your installation of NagiosChecker.

Be the ticket booth

Posted on 26/11/

Now that we know how mod_auth_tkt works, we are eager to implement our applications' authentication with it. The module never generates the ticket itself. Instead, ticket generation is delegated to the login URLs.

You can generate a ticket from your favourite language. The mod_auth_tkt distribution includes a perl module, a python module, and php helper functions. There is a login perl CGI script that uses the perl module, and it is prepared to do a lot of things just by configuring it and filling in the "sub validate" so the user and password get verified against any database you want. Look at the example: require the class that will do the validation, and then return true if the supplied credentials are correct, false otherwise.
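
For example, a filled-in validate could be as simple as this (a sketch: the Authen::Simple::Passwd backend and the passwd file path are just an illustration; the signature is the one described above, user and password in, true or false out):

sub validate {
    my ($username, $password) = @_;
    require Authen::Simple::Passwd;
    my $auth = Authen::Simple::Passwd->new(path => '/etc/httpd/app.passwd');
    return $auth->authenticate($username, $password) ? 1 : 0;
}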

So... what does the application have to do to get into the single sign-on world? In many cases: nothing. If you have been relying on Apache basic authentication, you have probably been receiving the already-authenticated user in the REMOTE_USER environment variable. When a valid ticket is detected, the module takes the user the ticket was generated for (remember that if the ticket was issued, the supplied credentials were correct) and sets REMOTE_USER. So if your application was using basic authentication, you are in luck: set up the Apache config and let it run!

If you were authenticating within your application, you are less lucky. There is a forest of possibilities for how your system works, but most probably you are just storing the logged-in user in the session once authenticated, or getting the logged-in user from one single point in your code. You can see where I'm going with this... just start to rely on REMOTE_USER at that point.

Net::Server::Mail::ESMTP::SIZE

Posted on 26/11/

I've just released another module to CPAN. This one is Net::Server::Mail::ESMTP::SIZE.

When I developed the tests for Test::SMTP, I had to implement a mini SMTP server. Instead of reinventing the wheel, I chose to use the Net::Server::Mail distribution to quickly get what I wanted. But to test supports_cmp_ok and supports_like there was no pre-built extension that reported parameters. So I stubbed in the SIZE extension just for the tests to play around with.

On my list of "possible CPAN modules" appeared the Net::Server::Mail::ESMTP::SIZE, but this time implementing actual functionality. As with the first attempt I made great progress, and everything went quite straightforward... It's done!

One problem I had was actually getting the module to plug in to Net::Server::Mail::ESMTP... The documentation is quite short, so I had to take "inspiration" from the code of the modules that were already written, and do lots of Dumper(@_) to see what was going on... I hope I got it all right :p.

Faster isn't always scalable

Posted on 26/11/

Sometimes when designing, we go to great lengths optimizing for speed, and don't always think about scalability. When thinking about scalability, you have to think about letting operations run in parallel, locking as few shared resources as possible so that the work has a good chance of actually being done in parallel. And sometimes, to be fast, you hold a lock so you can assume that you are alone (you can skip synchronization with others, and thus its overhead). But that means that you are the only one that can be working.

As an example: MyISAM tables are fast at reading and writing, but scale badly for writes. As concurrent reads go up, one single write locks up ALL the reads on the table, because writes hold a lock on the entire table until they are done. InnoDB, on the other hand, is slower at updating rows, but because writes only lock the rows that they are writing, reads can still be done concurrently as long as they hit unlocked rows.

The confusion normally comes from "faster" meaning fewer CPU cycles, and since the CPU is a locked resource, the faster you do things, the more you can do in parallel.

Think before holding a lock ;)

Opsview Asterisk notification

Posted on 26/11/

After a couple of weeks in internal testing, I can now contribute a helpful notification method for Opsview, and more generally for Nagios.

We wanted Asterisk to wake up the on-call engineer if an alert was detected. What seems like a pretty trivial thing has a couple of subtleties that have to be treated with care so as not to make a nightmare out of it.

First: how to make Asterisk call a phone number. There are a couple of documented ways to do it:

  • create a call file on the server. This would have to be done via ssh from the Nagios master server. I don't like this because it touches the internals of Asterisk, although it is standard (much like dropping a mail in the sendmail queue).
  • via the Asterisk Manager protocol (astman). This seemed more suited for the task. The downside is that the perl API for astman is quite spartan and not all that documented (Asterisk::Manager).

When someone asks me to choose between two options that I don't like, I always ask them this question in return: "What do you prefer? Syphilis or Gonorrhea?". This time I was asking it of myself... So I chose the astman solution. I hope that this way the Asterisk::Manager module will mature a bit more, and therefore be a better long-term solution.

Second: the calling gets done in Asterisk. The notification script only calls an Asterisk extension. We wanted a "human hunter" that would not stop calling until someone acknowledged the alert. Maybe someone would want a different behavior, so that is customizable via Asterisk programming.

Lessons Learned

  • Astman was the right way to go: Astman has access rights per user and per host in the manager.conf file.
  • Alarms are random and can happen in parallel: the first day in alpha there was a connectivity problem. A lot of alerts were spawned. The poor guy on the phone had to acknowledge a lot of calls :S. We noticed that a "don't call me if you have just called me" mechanism was needed.

    A configurable lock-out mechanism was added so that some calls could be made in parallel in a customizable way. Maybe you want to call a number twice in a row if the alerts are for different hosts or host groups, or maybe you want only one alert per phone number called, or to notify just one phone whatever the notification is.
  • Nagios kills "lazy" notifications: Because we hunt down someone, the call can get long. Another time... someone had to acknowledge a lot of calls that day... :S The call is not registered as successful until Asterisk says that it's successful and then it's registered in the notified database. Other notifications get queued up while a lock on a notification db is held. When tested out of the Nagios environment everything was OK. Debugging revealed that when the notification script was taking more than x seconds, Nagios was killing it, the lock was released, and Asterisk was continuing with the call. The next notification was kicking in (because the lock was released), and Asterisk was dialing again to the same contact.

    This was resolved by forking and detaching the child from the parent process (just like a daemon does). The detached process does the calling. The parent returns immediately.
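
The trick, roughly, looks like this (a minimal sketch, not the actual notification script):

use strict;
use warnings;
use POSIX qw(setsid);

sub call_in_background {
    my (@call_args) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    return if $pid;    # the parent returns immediately, so Nagios gets its exit code quickly

    # the child detaches itself so the Nagios notification timeout can't kill the call
    setsid() or die "setsid failed: $!";
    open STDIN,  '<', '/dev/null';
    open STDOUT, '>', '/dev/null';
    open STDERR, '>', '/dev/null';

    # ... talk to astman here and start the "human hunter" ...
    exit 0;
}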

So you can get the notification script here. In the end it got a little more complicated than it seemed at first :)

New design

Posted on 26/11/

This time it's a design! Thanks, Pau.

Nagios Checker patch got through

Posted on 26/11/

I'm pleased to announce that my patch for Nagios Checker to support Opsview is now available in the official distribution of the plugin. You can get it here.

I've been using Nagios Checker for quite some time now and I like it very much, and now that it's patched for Opsview, I like it even more. Instead of having a browser window open and going to look at it when an alert email lands in my mailbox, I get a nice warning sound and an overview of the problems just by hovering over the checker. Direct access to Opsview is granted just by clicking on the alert. Need to add a host? Or curious to see the Opsview HH page? Just click on the 'go to Nagios' menu when you right-click on the Nagios Checker status.

My office colleagues have been beta testing the patched plugin and find it very useful, but they had a bit of trouble configuring NagiosChecker correctly to play with Opsview, so here is how to do it:

When you're configuring your Opsview host in Nagios Checker:

  • URL of the Nagios Interface: http://my.opsview.server/
  • Type of server: Opsview
  • URL of status.cgi: Select manually, and fill in http://my.opsview.server/cgi-bin/status.cgi

I haz got Kwalitee

Posted on 26/11/

I've been trying to increase the kwalitee of my modules with every release. Looks like I got it right.

A couple of tips are:

  • use a recent Module::Build: it gives you kwalitee very easily, as it does tons of stuff for free. But use a current version. The first modules I contributed used Debian Etch's Module::Build, and didn't generate a spec-compliant META.yml. I got that fixed for free just by upgrading Module::Build.
  • make manifest before doing the make dist.
  • use Test::Pod and Test::Pod::Coverage: Test::Pod will alert you if you have errors in your POD, and Test::Pod::Coverage will bug you when you don't document a function.

Of course there is no guarantee that if a module has kwalitee then it's good... It has to have proper tests (Test::SMTP had 100% code coverage, and even that won't guarantee bug-freeness), and those tests have to run on as many platforms as possible (that won't assure anything either), and a bunch of other things which I'll write about in future articles... I hope I maintain my kwalitee (I like being on the first page of "Authors with less than five dists" ;)

Comments activated

Posted on 26/11/

I've activated comments for the posts. You can rate them too... (so now I can know if you like the posts, and if I should stop blabbing about some topic :p)

Please be polite, and try to contribute some extra content to the posts, without flaming, insulting and such. Those attitudes will not be tolerated, and comments will be deleted without any kind of explanation.

Machine naming schemes

Posted on 26/11/

When you industrialize your systems management (say, you are a hosting provider), or you simply have LOTS of machines for whatever reason, you need a naming scheme. You have probably been naming machines after:

  • Planets: Sun, Mercury, Venus, Pluto, ...
  • Constellations: Andromeda, Orion, ...
  • Winds: Tramuntana, Xaloc, Garbí, ... (Here in Catalonia these are used a lot!)
  • LOTR: Mordor, Shire, ...

So you start naming machines with a scheme that helps you locate them: r01p01.net.example.com means rack 01, position 01 (positions starting from the rack bottom), for example. The downside is that once you have standardized the machine names, you lose that special "think of a name" moment, and the geekiness of the whole thing (people that aren't in IT usually don't know that machines even have names!)

I personally name my machines (and electronic devices with computer-like functionality) after robots that appear in Futurama. So I have Bender, Flexo, Roberto, Calculon, etc. It's funny to get into my boss's car and see the hands-free display say "Kwanzabot", and to see the HELO in SMTP headers show "SINCLAIR-2K".

Of course this is not a new thing, and RFC 1178 has some interesting situations in the "what NOT to do" guidelines XD. I'm pretty sure that most of us have fallen into one of the situations described in the RFC.

The bottom line is "try to have fun naming!" (when you can)

New style

Posted on 26/11/

New style for the blog! Contributed by Pau Puig, one of CAPSiDE's workers.

Thanks Pau!

The connotation of PIN numbers

Posted on 26/11/

Some time ago I discovered a neat "trick" for mobile phones that not many people know about, and that I'm sure the "security paranoid" bunch will appreciate.

When your mobile phone prompts for a PIN, strangely, it lets you enter more than 4 digits. That's because PINs can be longer than 4 digits on mobile phones!

I investigated a little further and it turns out that Wikipedia has it documented! It's curious how we have associated the term "PIN code" with only 4 digits. Maybe phone manufacturers should have called it an Unlock Number to take the 4-digit connotation out...

So now you know... you can have the worst type of PIN: one that is probably outside an attacker's mental scope. And if you're a developer and need to ask for a numeric password, be careful with the connotation of "PIN". Maybe you'll find yourself with all passwords being 4 digits long, although you support more ;)

Writing great Nagios plugins

Posted on 26/11/

So you want to write a Nagios plugin, and you want it to be a great one! A great plugin, aside from having some great functionality, is one that provides good documentation and fits nicely into the Nagios ecosystem, that is, one that Nagios users will be comfortable with.

Right now you are thinking: "how can I do that? I have to look at other plugins, read guidelines, learn a lot about the Nagios way of doing things and what the community expects from a plugin, etc. It's quite a big task, and I just wanted to write a quick and dirty plugin!"

If you program your plugins in perl you are in luck, because smart people have already done that for you! Nagios::Plugin helps you fit into that ecosystem and get a lot of functionality at the best cost: FREE, and get your plugins done in less time, with more features and fewer bugs.

First step: instantiate a Nagios::Plugin object

my $np = Nagios::Plugin->new(
        usage => "Usage: %s [-v|--verbose] [-t|--timeout=seconds] -c|--critical=<threshold>"
        version => 1.0,
        blurb => qq{Count the xxx's in yyy},
        extra => qq{
 -c 10
   returns CRITICAL if xxx's are greater than 10
 -c 20 -t 60
   returns CRITICAL if xxx's are greater than 20. Timeout in 60 seconds if it takes too long.},
        url => 'http://example.com'
);

You get:

  • standard parameters
    -V version info
    -h autogenerated help
    -v verbose output flag
    -t timeout
    nice features that you don't have to worry about, and that Nagios users will be very happy to have. Programs like Opsview will show the help on its web interface (again... for free).

  • plugin versioning
    version and url get outputted for free (too) in help and -V
  • help text
    the help text consists of the version info, the license (GPL if not overridden), the blurb (text describing what the plugin does), the parameter help list (autogenerated from the add_arg() info), and the extra info. The extra info is the ideal place to give the user a couple of usage examples with a small description of what invoking the plugin with those parameters does.
That's a lot for one statement!

Second step: add your parameters

$np->add_arg(
    spec => 'warning|w=s',
    help => "-w, --warning=RANGE\n     Range for returning WARNING"
);
$np->add_arg(
    spec => 'number|n=i',
    help => "-n, --number=INTEGER\n     Number of yyy's to xxx",
    required => 1
);
$np->add_arg(
    spec => 'filter|f=s',
    help => "-f, --filter=aaa\n    Filter by aaa",
    default => 'aaa'
);

# Parse @ARGV and process standard arguments (e.g. usage, help, version)
$np->getopts;
You get free parameter type validation, so if you declare that a parameter is an integer and something else is passed, the plugin will not get past the $np->getopts statement. You also specify a string for each parameter that will be displayed when the user calls the plugin with --help. If you are going to have a critical and a warning threshold, tell the user that they are RANGE items (you'll see why below). Some standard parameter names are:
-c critical range
-w warning range
-C for parameters that start with "c" other than critical
-H hostname: for names of machines
-p port: for port numbers
-4 for using IPv4
-6 for using IPv6

Third step: do what your plugin does

Now you have to work (hey! you haven't broken a sweat yet!). To get the value of the parameters passed to your script, you have handy $np->opts->paramname accessors.
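
For example, with the parameters declared above (count_xxx is a hypothetical sub standing in for whatever your plugin really measures):

my $filter = $np->opts->filter;    # defaults to 'aaa', as declared in add_arg
my $number = $np->opts->number;    # required, and guaranteed to be an integer at this point
my $value  = count_xxx($filter, $number);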

Fourth step: return performance data (it's free)

You have almost surely collected a measurable quantity to compare against a threshold. Output that collected data as performance data. I'm sure you will want to see how it evolves over time with a nice graphing tool. Is it going up? Down? Is it high during work hours? Is it low on weekends?

$np->add_perfdata(
    label => "size",
    value => $value,
    uom => "KB",
    warning => $np->opts->warning,
    critical => $np->opts->critical
);

Let UOM be:

  • no unit specified - assume a number (int or float) of things (eg, users, processes, load averages)
  • s - seconds (also us, ms)
  • % - percentage
  • B - bytes (also KB, MB, TB)
  • c - a continuous counter (such as bytes transmitted on an interface)
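
To give an idea of what this buys you, the plugin output ends up looking roughly like this (values made up); everything after the | is the performance data that graphing tools pick up:

XXX OK - 1234 xxx's found | size=1234KB;2000;5000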

Fifth step: return the status

Now you decide if the plugin has to return CRITICAL, WARNING or OK. This code quickly springs to mind:

if ($value > $critical) {
    ...
} elsif ($value > $warning) {
    ...
} else {
    ...
}

What if somebody wants OK between critical and warning? Again, you can work less and get more: $np->check_threshold to the rescue! Nagios has a RANGE specification that check_threshold understands, so you can just pass it the collected value, the critical parameter and the warning parameter, and the status that has to be returned gets calculated for free!

my $status = $np->check_threshold(
    check => $value,
    warning => $np->opts->warning,
    critical => $np->opts->critical
);
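
The RANGE syntax is quite expressive; here is a quick sketch of the most common forms (numbers just for illustration):

# "10"      alert if the value is < 0 or > 10
# "10:"     alert if the value is < 10
# "~:10"    alert if the value is > 10
# "10:20"   alert if the value is < 10 or > 20
# "@10:20"  alert if the value is between 10 and 20 (inclusive)

# So a plugin called with -w 10:20 -c ~:50 and a collected value of 42
# would get WARNING back from check_threshold: 42 is outside 10..20,
# but not above 50.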

Now just return the calculated status and a short single-line message with the exit method. Don't be too verbose, though, because the output gets cut!

$np->nagios_exit( $status, "$value xxx's where found" );

More neat (and free) details

  • verbosity
    $np->opts->verbose will return the number of -v flags in the parameters. Use it if you want to give the users a little more info (-v), a bit more than that (-vv) or a lot more (-vvv) :p (see the sketch right after this list).
  • Read the docs
    The docs will reveal all sorts of extra info. Read the helper classes (Nagios::Plugin::Xxx) documentation too, because not everything is exposed in the Nagios::Plugin documentation ;)
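
As a tiny sketch of the verbosity idea (the messages are, of course, made up):

my $verbosity = $np->opts->verbose;   # number of -v flags passed by the user
print "Connecting to the backend...\n"          if $verbosity >= 1;
print "Parsing every line of the response...\n" if $verbosity >= 2;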

Summing up

Nagios::Plugin will save you time and make your plugins better, with less effort.

Proud to see Opsview 2.12.1

Posted on 26/11/

The development work that got done when Ton Voon came to CAPSiDE has made it into a release. I am proud of the add-ons that we have contributed, and hope to add more over time.

A lot of effort has gone into each feature by the CAPSiDE Team and by Altinity.

CAPSiDE added features are:

  • Single Sign-on
  • Event handlers
  • Customizable host check commands
  • Customizable SMS Notification methods

One thing from CAPSiDE that didn't make it into the 2.12.1 release (but will hopefully come soon) is NagVis integration, so you can map out your servers and see them the way you want to.

We are looking forward to hearing whether these add-ons have been useful to the community, and if and how they are being used. Drop us a mail on the opsview users list ;)

Getting to the backends

Posted on 26/11/

As I already explained, simple web apps that were counting on HTTP basic authentication can start using mod_auth_tkt pretty fast.

When you control the software being used (be it yours or open source), you can always take on parsing the ticket yourself to get the info back, be it from a cookie or from a GET parameter.

Let's examine a more complex scenario. Problems start arriving when you use application servers, or when you proxy to servers or applications that are not auth_tkt aware. The frontend can validate the ticket (authenticating the user), but since mod_auth_tkt basically leaves the ticket information in the REMOTE_USER environment variable, and environment variables don't get proxied, you don't receive the logged-in user in the backend. So... let's try to find some ways of getting the info to the backends (thanks to the people on the mod_auth_tkt list for the pointers).

Using headers

Put the REMOTE_USER in an HTTP header. Use mod_headers.

ProxyPass /headertest/ http://backend/xxx/
ProxyPassReverse /headertest/ http://backend/xxx/

<Location /headertest/>
   AuthType Basic
   TKTAuthLoginURL /login
   TKTAuthTimeout 600s
   RequestHeader set X-AuthTkt-Remote-User "%{REMOTE_USER}e"
   RequestHeader set X-AuthTkt-Data        "%{REMOTE_USER_DATA}e"
   RequestHeader set X-AuthTkt-Tokens      "%{REMOTE_USER_TOKENS}e"
   require valid-user
</Location>

And in the backend, just pick up the results! (If you are running a CGI on the backend, just look up the environment variables: HTTP_X_AUTHTKT_REMOTE_USER, HTTP_X_AUTHTKT_TOKENS and HTTP_X_AUTHTKT_DATA.) Of course, you'll say: I have to modify the backend software to read from HTTP_X_AUTHTKT_REMOTE_USER. If the backend server is another Apache, you still have an ace up your sleeve: mod_setenvif.

    SetEnvIf X-AuthTkt-Remote-User "(.*)" REMOTE_USER=$1
    SetEnvIf X-AuthTkt-Data        "(.*)" REMOTE_USER_DATA=$1
    SetEnvIf X-AuthTkt-Tokens      "(.*)" REMOTE_USER_TOKENS=$1
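
And for the plain CGI case mentioned above, picking the user up is just a matter of checking %ENV. A minimal sketch (the header names match the RequestHeader lines in the frontend config; adapt them to your own):

#!/usr/bin/perl
use strict;
use warnings;

# The frontend headers show up as HTTP_* environment variables in a CGI
my $user   = $ENV{HTTP_X_AUTHTKT_REMOTE_USER} || '';
my $tokens = $ENV{HTTP_X_AUTHTKT_TOKENS}      || '';

print "Content-Type: text/plain\n\n";
print "Logged in as: $user (tokens: $tokens)\n";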

Using URL GET parameters

You can rewrite the REMOTE_USER into a parameter in the URL. mod_rewrite can handle this with its eyes closed, and you can fetch the parameter in the backend.

ProxyPass /headertest/ http://backend/xxx/
ProxyPassReverse /headertest/ http://backend/xxx/

<Location /headertest/>
   AuthType Basic
   TKTAuthLoginURL /login
   TKTAuthTimeout 600s

   RewriteEngine on
   RewriteRule  ^(.+)\??(.*)$   $1?remote_user=%{ENV:REMOTE_USER}$2    [QSA]

   require valid-user
</Location>
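
On the backend side, the user now comes in as a plain query parameter. A minimal CGI sketch (the parameter name matches the RewriteRule above):

#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $q    = CGI->new;
my $user = $q->param('remote_user') || '';

print $q->header('text/plain');
print "Logged in as: $user\n";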

mod_rewrite can set environment variables too, so if you do the inverse process on the backend (set the value of the GET parameter into an environment variable), you get the same result. I like the header solution best, because mod_rewrite is a heavy module, while the headers approach only loads the module each side really needs: mod_headers on the frontend and mod_setenvif on the backend.

There was a comment on the list about getting the username and password to the backends (for apps that need both on every request), but for that you have to store the password encrypted in the cookie. I'll have a shot at that one in another post (and maybe use the technique in the real world in an open source application... we'll see).

I wish I never hit the send button

Posted on 26/11/

Every day we send out lots of mails. I normally read a mail twice before sending it to a customer. And despite that, there have been times when I wished that a message had not gone out. Maybe I pressed the shortcut to send the mail when it was only half finished, maybe I had an afterthought on how to express something, or on how to solve an issue in another way, or realized I should include someone in the conversation...

The other day, while on the phone with one of our customers' project managers, he told me he was going to send me a mail. He said I would receive it in one minute, and the curious thing: "one minute" was not a figure of speech. I got interested in the delay and just had to ask why. Basically, he has a rule that delays all outgoing mail for one minute before submitting it to the server. He gave me a nice and easy solution I had never seen before: you can add a configurable delay to your outgoing messages with a very simple Outlook rule! Now you always have a second chance! After all, the customer probably won't notice the delay.

I'm not saying that this is the remedy for all mistakes, but I don't know why, the moment you press the send button a background process kicks in and you realize your mistakes, and this is a nice way of getting to the message before it really gets sent. Of course, your brain can adapt and kick that background process in after the delay... you never know with brains! ;)

I liked the solution because it was a way of using Outlook rules that I had never thought of, although, as you can see, it isn't hidden at all (it's just an outgoing message rule).

Oh... wait... I wanted to blog about open source things! I tried to get the same functionality out of Thunderbird but it seems that rules only apply to incoming mail. Does anybody know of an Open Source mail client that can implement this sort of behaviour?