Opsview Asterisk notification

Posted on 05/03/

After a couple of weeks in internal testing, I can now contribute a helpful notification method for Opsview, and more generally for Nagios.

We wanted Asterisk to wake up the on-call engineer whenever an alert was detected. What seems like a pretty trivial thing has a couple of subtleties that have to be handled with care so that it doesn't turn into a nightmare.

First: how to make Asterisk call a phone number. There are a couple of documented ways to do it:

  • Create a call file on the server. This would have to be done via ssh from the Nagios master server. I don't like this because it touches the internals of Asterisk, although it is standard (much like creating a mail in the sendmail queue).
  • Use the Asterisk Manager protocol (astman). This seemed better suited to the task. The downside is that the Perl API for astman (Asterisk::Manager) is quite spartan and not all that well documented.
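To give an idea of what the astman route looks like on the wire, here is a minimal sketch in Python rather than Perl. It is not the notification script from this post: the function names (ami_action, originate_call) and all the arguments are made up for illustration, and it assumes the manager interface is listening on its default port 5038. It logs in, sends an Originate action that dials a channel and connects the answered call to an extension, and logs off.

```python
import socket


def ami_action(name, **headers):
    """Serialize one manager action: 'Key: value' lines ended by a blank line."""
    lines = [f"Action: {name}"]
    lines += [f"{key}: {value}" for key, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def originate_call(host, username, secret, channel, context, exten,
                   port=5038, timeout=30.0):
    """Ask Asterisk to dial `channel` and connect it to exten@context.

    Returns the raw text Asterisk sent back, for the caller to inspect.
    All parameter values are placeholders chosen by the caller.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(ami_action("Login", Username=username, Secret=secret))
        conn.sendall(ami_action("Originate", Channel=channel, Context=context,
                                Exten=exten, Priority=1, Async="true"))
        conn.sendall(ami_action("Logoff"))
        # Collect everything until Asterisk closes the connection after Logoff.
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")
```

A real script would parse the responses and check that the login and the Originate were accepted before treating the notification as sent. This sketch only returns the raw text.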

When someone asks me to choose between two options that I don't like, I always ask them this question in return: "What do you prefer? Syphilis or Gonorrhea?". This time I was asking myself... So I chose the astman solution. I hope that this way the Asterisk::Manager module will get a bit more mature, and therefore be a better long-term solution.

Second: the calling gets done in Asterisk. The notification script only calls an Asterisk extension. We wanted a "human hunter" that would not stop calling until someone acknowledged the alert. Maybe someone would want a different behavior, so that is customizable via Asterisk programming.

Lessons Learned

  • Astman was the right way to go: access rights can be set per user and per host in the manager.conf file.
  • Alarms are random and can happen in parallel: On the first day of alpha testing there was a connectivity problem. A lot of alerts were spawned. The poor guy on the phone had to acknowledge a lot of calls :S. We noticed that a "don't call me if you have just called me" mechanism was needed.

    A configurable lock-out mechanism was added so that the handling of parallel calls can be customized. Maybe you want to call a number twice in a row if the alerts are for different hosts or host groups, or maybe you want just one alert per phone number called, or a single notification to one phone whatever the alert is.
  • Nagios kills "lazy" notifications: Because we hunt someone down, the call can take a long time. Once again, someone had to acknowledge a lot of calls that day... :S The call is not considered successful until Asterisk says it is, and only then is it recorded in the notified database. Other notifications queue up while a lock on the notification db is held. When tested outside the Nagios environment everything was OK. Debugging revealed that when the notification script took more than x seconds, Nagios killed it and the lock was released, while Asterisk carried on with the call. The next notification then kicked in (because the lock had been released), and Asterisk dialed the same contact again.

    This was resolved by forking and detaching the child from the parent process (just like a daemon does). The detached process does the calling. The parent returns immediately.
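The last two lessons can be sketched roughly as follows, again in Python and again only as an illustration, not the code of the actual script. The names lockout_key, should_call and run_detached are hypothetical, as is the LOCKOUT_SECONDS value, and a plain dict stands in for the notified database. The lock-out check decides whether a phone was called too recently (per phone and host, or per phone only), and run_detached uses the usual daemon-style double fork so that the caller returns at once while a detached process does the slow work.

```python
import os

LOCKOUT_SECONDS = 600  # assumed window; the real script makes this configurable


def lockout_key(phone, host, per_host=True):
    """Key under which a recent call is remembered.

    per_host=True allows separate calls for different hosts;
    per_host=False allows only one call per phone number.
    """
    return f"{phone}|{host}" if per_host else phone


def should_call(recent_calls, key, now, window=LOCKOUT_SECONDS):
    """True when no call with this key was placed within the window."""
    last = recent_calls.get(key)
    return last is None or now - last >= window


def run_detached(task):
    """Run task() in a detached grandchild; the caller returns right away."""
    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)  # reap the short-lived intermediate child
        return
    os.setsid()  # new session and process group, away from the parent's
    if os.fork() > 0:
        os._exit(0)  # intermediate child exits; grandchild is orphaned
    # Grandchild: drop the inherited stdio so nothing ties it to the parent.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)
    try:
        task()
    finally:
        os._exit(0)
```

In a notification script the parent would check should_call first, then hand the calling (and the recording of the result) to run_detached and exit, so Nagios never sees a long-running process.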

So you can get the notification script here. In the end it turned out a little more complicated than it seemed at first :)