Nagios::Plugin::DieNicely Released

Posted on 23/05/

As your Nagios plugins get a bit more complicated and depend on external CPAN modules, you will find yourself with spontaneous UNKNOWN states in Nagios when the services that you monitor are faulty. This probably comes from the fact that different modules have different ways of signalling that something has gone wrong: some return undef, and some call die or croak.

It is when they call die that Nagios reports UNKNOWN states and "no output". Nagios treats any exit code outside the 0 to 3 range as an UNKNOWN state, and perl normally exits with 255 on die (or with the current errno, if one is set). And one more thing: the exception gets printed to STDERR, which Nagios just discards. So you never know what hit you.
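To see the behaviour described above for yourself, here is a minimal sketch that runs a child perl which dies, and looks at it the way Nagios would: only STDOUT and the exit code. The error message is made up for illustration, and the child clears $! first so that die falls back to its default exit code instead of reusing a leftover errno. It assumes a Unix-like shell for the redirection.

```perl
use strict;
use warnings;

# Code for the child process. The message is a made-up example.
# $! is cleared so die uses its default exit code rather than a stale errno.
my $child = q{$! = 0; die "backend unreachable\n"};

# $^X is the path of the running perl. STDERR is thrown away, which is
# roughly what happens to it when Nagios runs a plugin.
my $stdout = `$^X -e '$child' 2>/dev/null`;
my $exit   = $? >> 8;

printf "exit code: %d\n", $exit;       # typically 255, outside Nagios' 0..3
printf "stdout   : '%s'\n", $stdout;   # empty: the message went to STDERR
```

An exit code that Nagios does not know plus an empty STDOUT is exactly the UNKNOWN with no explanation that shows up in the web interface.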

Normally you program assuming that things go well, and if there is an unhandled exception the program is supposed to die. But we're monitoring... an unhandled exception can probably give some important info on what's going on. So you wrap the code you THINK will fail in an eval, and you exit with the appropriate Nagios exit code if there is an exception. But what will you do? Wrap everything in an eval? Ugly. And you have to remember to do it... Fear not. Just use Nagios::Plugin::DieNicely and program as always.
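For reference, the manual eval approach looks roughly like this. It is only a sketch: fetch_status is a hypothetical stand-in for whatever CPAN call might die, and run_check is a helper name invented for this example. Instead of really exiting, the script prints the code it would exit with, so it is easy to try out.

```perl
use strict;
use warnings;

# Standard Nagios plugin exit codes.
use constant { OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3 };

# Hypothetical stand-in for a CPAN call that signals failure with die.
sub fetch_status { die "connection refused\n" }

# Run a check and translate any exception into a Nagios result.
sub run_check {
    my ($check) = @_;
    my $status;
    # The eval block returns 1 only if the check ran to completion.
    my $ok = eval { $status = $check->(); 1 };
    if (!$ok) {
        my $err = $@ || "unknown error\n";
        chomp $err;
        return (CRITICAL, "CRITICAL - $err");
    }
    return (OK, "OK - $status");
}

my ($code, $message) = run_check(\&fetch_status);
print "$message\n";                # Nagios shows the first line of STDOUT
print "would exit with $code\n";   # a real plugin would end with: exit $code;
```

This works, but every call that might die needs the same treatment, and the one you forget is the one that will bite you.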

Nagios::Plugin::DieNicely will trap Perl's die (and Carp's croak and confess) for you. Then it will output the exception to STDOUT in Nagios format and exit with a Nagios CRITICAL exit code. So now you have one less thing to worry about.
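If you are curious how this kind of trapping can be done, here is a sketch built on Perl's $SIG{__DIE__} hook. It illustrates the general technique and is not necessarily how the module itself is implemented; the output format and the error message are assumptions made for the example. The "plugin" runs in a forked child so that the parent can show what Nagios would receive (forking open needs a Unix-like system).

```perl
use strict;
use warnings;

# Fork: in the child, STDOUT is connected to the parent's $out handle.
my $pid = open(my $out, '-|');
defined $pid or die "fork failed: $!";

if ($pid == 0) {
    # Child: this plays the role of the Nagios plugin.
    $SIG{__DIE__} = sub {
        my ($err) = @_;
        # Inside an eval (or while compiling) let die behave normally.
        return if !defined $^S || $^S;
        chomp $err;
        $err =~ s/\n/ /g;            # keep the message on a single line
        print "CRITICAL - $err\n";   # assumed output format
        exit 2;                      # Nagios CRITICAL
    };

    # A die inside eval is left alone by the handler above.
    eval { die "handled locally\n" };

    # An unhandled die: made-up message for illustration.
    die "SOAP endpoint timed out\n";
    exit 0;                          # not reached
}

# Parent: collect what Nagios would see.
my $output = do { local $/; <$out> };
close $out;                          # reaps the child and sets $?
my $exit = $? >> 8;

print "plugin said: $output";
print "plugin exit: $exit\n";
```

The $^S check matters: without it, a die that your own code already catches with eval would also be turned into a CRITICAL exit.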

This module was motivated by a real case. We were (and actually are) monitoring web services with the CPAN SOAP::Lite module. These web services fail very often due to causes outside our control. So I have had the opportunity to see the Nagios check that queries them in a variety of cases when the web service / server is failing. I've gone through 4 (or so) revisions of the code that returned UNKNOWN states in corner cases where the client module would behave in unexpected ways, and a couple of them were "die cases" that I wrapped an eval around. But I finally thought that this could maybe be done a better way.