Stupid SNMP trap design worked around in OpenNMS

Not too long ago, I was tasked with implementing notifications triggered by SNMP traps sent to OpenNMS from a telephone system. Every time I had to do a task like this, it felt like more or less a pain in the ass, but this one was definitely the worst …

In addition to the actual telephone business, the device also was connected to a bunch of electrical contacts in doors, smoke and water detectors and the like so it could send traps for those kinds of events, too.

The MIB for the device implemented exactly 2 traps (read: two), a so called “spontaneous alarm” and a “q3 alarm”. The spontaneous alarm was triggered whenever one of the electrical contacts disconnected or connected and the q3 was for telephone business not working as expected. Each trap had a parameter indicating the severity, a couple of parameters holding information about the actual event the device wanted to inform about and one parameter holding an ever increasing device-internal notificationId.

The problem was, the exact same spontaneous alarm trap OID could indicate a resolved minor event like an open door to the storage room or a new hazardous event like fire in the server room.

So the first task was to decide whether a trap held a “problem” or an “ok”. This was rather easily done by configuring a varbind filter on the severity parameter like

<varbind>
<vbnumber>4</vbnumber>
<vbvalue>~[01234]</vbvalue>
</varbind> 

for the “problem” events and

<varbind>
<vbnumber>4</vbnumber>
<vbvalue>5</vbvalue>
</varbind> 

for the “ok” events.

Whether the trap corresponded to a minor or major event was only visible through a “line” parameter, which was also matchable with a varbind filter after I had received a list of what was connected to which line. There were like a dozen “minor” lines which could be implemented like this:

<varbind>
<vbnumber>9</vbnumber>
<vbvalue>~.*(A|B|C|D|E|F|G|H|I|J|K|L|M).*</vbvalue>
</varbind> 

Any other line than those meant that something serious was wrong. So there was one event definition with the varbind mentioned above and another one without it.

By now, from only 2 traps, we already have 6 events:

  1. spontaneousAlarmsMinor
  2. spontaneousAlarmsMinorOkay
  3. spontaneousAlarmsMajor
  4. spontaneousAlarmsMajorOkay
  5. q3Alarm
  6. q3AlarmOkay
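
The fan-out from two traps to six events can be sketched in a few lines of Python. This is only an illustration of the filter logic, not OpenNMS code: the function name `classify` is made up, the line names A to M are the placeholders from the filter above, and I am assuming the `~` filters have to match the whole value.

```python
import re

# Assumption: varbind 4 carries the severity, varbind 9 the "line" name,
# and a "~" filter has to match the complete value (hence fullmatch).
PROBLEM_SEVERITY = re.compile(r"[01234]")
MINOR_LINES = re.compile(r".*(A|B|C|D|E|F|G|H|I|J|K|L|M).*")


def classify(trap_name, severity, line=""):
    """Return the event name a trap would end up as, or None if nothing matches."""
    is_ok = severity == "5"
    if not is_ok and not PROBLEM_SEVERITY.fullmatch(severity):
        return None  # neither the "problem" nor the "ok" filter matches
    if trap_name == "q3Alarm":
        base = "q3Alarm"
    elif MINOR_LINES.fullmatch(line):
        base = "spontaneousAlarmsMinor"
    else:
        # any line that is not on the "minor" list counts as serious
        base = "spontaneousAlarmsMajor"
    return base + ("Okay" if is_ok else "")
```

A spontaneous alarm with severity 2 on line "C" comes out as spontaneousAlarmsMinor, the same trap with severity 5 as spontaneousAlarmsMinorOkay.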

Technically, the small number of electrical connector lines corresponding to minor events could have been split into individual events, but since there were like a hundred other lines connected to some kind of “important” thingy, I wasn’t too keen on splitting it up any further. Plus: they tended to put new things there rather regularly, so there had to be something that caught any “line”, and I decided to stick with these 6.

Here’s how they wanted OpenNMS to handle those events:

First step: Notifications … The spontaneous minor problems should be notified with a delay of 5 minutes, the spontaneous major problems and q3 problems should be notified immediately. This was achieved with the configuration of two different destination paths, one with a delay of 5m, one without delay.

    <path name="email" initial-delay="0s">
        <target interval="0s">
            <name xmlns="">admin</name>
            <autoNotify xmlns="">auto</autoNotify>
            <command xmlns="">javaEmail</command>
        </target>
    </path>
    <path name="email5mindelay" initial-delay="5m">
        <target interval="0s">
            <name xmlns="">admin</name>
            <autoNotify xmlns="">auto</autoNotify>
            <command xmlns="">javaEmail</command>
        </target>
    </path>

These paths were then used in the notifications:

<notification name="q3Alarm" status="on" writeable="yes">
  <uei xmlns="">uei.opennms.org/vendor/q3Alarm</uei>
  <destinationPath xmlns="">email</destinationPath>
</notification> 
<notification name="spontaneousAlarms" status="on" writeable="yes">
  <uei xmlns="">uei.opennms.org/vendor/spontaneousAlarmsMajor</uei>
  <destinationPath xmlns="">email</destinationPath>
</notification>
<notification name="spontaneousAlarmsMinor" status="on" writeable="yes">
  <uei xmlns="">uei.opennms.org/vendor/spontaneousAlarmsMinor</uei>
  <destinationPath xmlns="">email5mindelay</destinationPath>
</notification>
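
Stripped of the XML, the routing boils down to two lookups. Here is a small Python sketch of it; the dictionaries just restate the configuration above, and `send_time` is a made-up helper, not something OpenNMS provides.

```python
from datetime import datetime, timedelta

# initial-delay of each destination path, as configured above
DESTINATION_PATHS = {
    "email": timedelta(0),
    "email5mindelay": timedelta(minutes=5),
}

# notification name -> destination path, as configured above
NOTIFICATIONS = {
    "q3Alarm": "email",
    "spontaneousAlarms": "email",
    "spontaneousAlarmsMinor": "email5mindelay",
}


def send_time(notification_name, received_at):
    """Return when the mail would go out, or None if there is no notification."""
    path = NOTIFICATIONS.get(notification_name)
    if path is None:
        return None
    return received_at + DESTINATION_PATHS[path]
```

Anything that gets acknowledged before `send_time` is reached never produces a mail, which is what makes the 5 minute delay useful for the minor stuff.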

Once the notifications were configured, the next challenge was to not only send notifications about problems but to also send recovery-notifications when a problem was fixed … and that’s where the fun began … Usually, automatic recovery-notifications are set up in the notifd configuration with a pattern of

<auto-acknowledge resolution-prefix="RESOLVED: "
uei="uei.opennms.org/my/events/problem"
acknowledge="uei.opennms.org/my/events/okay">
  <match xmlns="">nodeid</match>
  <match xmlns="">interfaceid</match>
</auto-acknowledge>

Unfortunately, OpenNMS (as of 1.8.1, which is what I used when developing this solution) can only match on UEIs and event parameters (which are not the same thing as trap parameters) in this step. And it is perfectly fine for $device to send three traps with the same OID indicating that someone switched on the light, that there’s water in room one and that there’s fire in room eight. So there would be three events with the same UEI meaning something completely different, and a notifd configuration that just ack’d based on the UEI might have acknowledged the fire event as soon as the lights were switched off again. Bad idea …
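
To make the problem concrete, here is a toy model in Python of what matching on UEI and node alone does. `Notification` and `ack_by_uei` are invented for this illustration and have nothing to do with the real notifd code.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    uei: str
    node_id: int
    text: str
    acked: bool = False


def ack_by_uei(pending, problem_uei, node_id):
    """Acknowledge every open notification with this UEI on this node."""
    matched = [n for n in pending
               if not n.acked and n.uei == problem_uei and n.node_id == node_id]
    for n in matched:
        n.acked = True
    return matched


MAJOR = "uei.opennms.org/vendor/spontaneousAlarmsMajor"
pending = [
    Notification(MAJOR, 1, "water in room one"),
    Notification(MAJOR, 1, "fire in room eight"),
]
# The "ok" trap for the water detector arrives: both notifications get ack'd.
cleared = ack_by_uei(pending, MAJOR, 1)
```

After that call the fire notification is acknowledged as well, although nobody has put out the fire.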

Instead, I had to figure out a way to auto-acknowledge the notifications by looking at the device-internal notificationId, which was sent as a trap parameter. To get to the point of the whole posting … here’s how I did it:

I configured every “ok” event to run through the event translator and had the translator call a PL/pgSQL procedure which acknowledged the original “problem” notification in the notifications table of the database by setting respondtime to “now()” and answeredby to “admin” (which, looking back, should have been a more descriptive name). This prevented the “minor” notifications that were still waiting for $delay to expire from being sent. Afterwards, the event translator created a new event on which I configured a new notification that simply wrote the same text as the original notification with an additional leading “RESOLVED: ” in the subject.

In order to be able to easily match the device-internal notificationId sent in the trap (which is just part of the trap, not necessarily of the notification) I put this ID into “numericmsg” of the notifications:

<numeric-message xmlns="">%parm[#3]%</numeric-message> 

So here goes the crazy stuff:

create or replace function getunackdspontaneousalarmsminor() returns void as $body$
declare
i record;
begin
-- ack every open minor notification whose notificationId shows up in a matching Okay event
for i in select numericmsg from notifications where answeredby is null and respondtime is null and eventuei like '%spontaneousAlarmsMinor' order by numericmsg LOOP
execute $$update notifications set answeredby='admin', respondtime='now()' where respondtime is null and answeredby is null and numericmsg=$$ || quote_literal(i.numericmsg) || $$ and exists ( select eventid from events where eventparms ~ '^.*$$ || i.numericmsg || $$[(].*' and eventuei = 'uei.opennms.org/vendor/spontaneousAlarmsMinorOkay' )$$;
END LOOP;
END
$body$ LANGUAGE 'plpgsql';

create or replace function getunackdspontaneousalarms() returns void as $body$
declare
i record;
begin
for i in select numericmsg from notifications where answeredby is null and respondtime is null and eventuei like '%spontaneousAlarmsMajor' order by numericmsg LOOP
execute $$update notifications set answeredby='admin', respondtime='now()' where respondtime is null and answeredby is null and numericmsg=$$ || quote_literal(i.numericmsg) || $$ and exists ( select eventid from events where eventparms ~ '^.*$$ || i.numericmsg || $$[(].*' and eventuei = 'uei.opennms.org/vendor/spontaneousAlarmsMajorOkay' )$$;
END LOOP;
END
$body$ LANGUAGE 'plpgsql';

create or replace function getunackdq3alarms() returns void as $body$
declare
i record;
begin
for i in select numericmsg from notifications where answeredby is null and respondtime is null and eventuei like '%q3Alarm' order by numericmsg LOOP
execute $$update notifications set answeredby='admin', respondtime='now()' where respondtime is null and answeredby is null and numericmsg=$$ || quote_literal(i.numericmsg) || $$ and exists ( select eventid from events where eventparms ~ '^.*$$ || i.numericmsg || $$[(].*' and eventuei = 'uei.opennms.org/vendor/q3AlarmOkay' )$$;
END LOOP;
END
$body$ LANGUAGE 'plpgsql';
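
If you want to play with the matching logic without touching a real OpenNMS database, the same idea fits into a few lines of Python with an in-memory SQLite database. This is a sketch under assumptions: `SCHEMA` is a heavily cut-down stand-in for the real tables, `ack_resolved` is a made-up name, and `instr()` replaces the PostgreSQL regex by looking for the ID followed by an opening parenthesis in eventparms.

```python
import sqlite3

# Cut-down stand-in for the OpenNMS tables (assumption, only the columns used here)
SCHEMA = """
CREATE TABLE notifications (
    notifyid INTEGER PRIMARY KEY, eventuei TEXT, numericmsg TEXT,
    answeredby TEXT, respondtime TEXT);
CREATE TABLE events (
    eventid INTEGER PRIMARY KEY, eventuei TEXT, eventparms TEXT);
"""


def ack_resolved(conn, problem_suffix, okay_uei):
    """Ack open notifications whose ID appears in an 'ok' event; return the count."""
    cur = conn.execute(
        """
        UPDATE notifications
           SET answeredby = 'admin', respondtime = CURRENT_TIMESTAMP
         WHERE answeredby IS NULL
           AND respondtime IS NULL
           AND eventuei LIKE '%' || ?
           AND EXISTS (
                 SELECT 1 FROM events
                  WHERE events.eventuei = ?
                    AND instr(events.eventparms,
                              notifications.numericmsg || '(') > 0)
        """,
        (problem_suffix, okay_uei),
    )
    conn.commit()
    return cur.rowcount
```

Like the regex in the procedures above, this is a plain substring match, so a short ID would also match inside a longer one. With an ever increasing notificationId that is unlikely to bite, but it is worth knowing.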

The event translation for the minor events looked like this, notice “lt” vs. “gt” in the sql statements:

<event-translation-spec uei="uei.opennms.org/vendor/spontaneousAlarmsMinorOkay">
  <mappings>
    <mapping>
      <assignment name="uei" type="field" >
        <value type="constant" result="uei.opennms.org/translator/spontaneousAlarmsMinorOkay5min" />
      </assignment>
      <assignment name="sleep" type="parameter">
        <value type="sql" result="select pg_sleep(3)" />
      </assignment>
      <assignment name="ack" type="parameter">
        <value type="sql" result="select getunackdspontaneousalarmsminor()" />
      </assignment>
      <assignment name="acktime" type="parameter">
        <value type="sql" result="select acktime from (select respondtime-pagetime acktime from notifications where notifyid=(select notifyid from notifications where eventid=(SELECT eventid FROM events WHERE eventparms ~ ? and eventuei ='uei.opennms.org/vendor/spontaneousAlarmsMinor'))) acktime where acktime &gt; '5 minutes'" >
          <value type="parameter" name="~^\.1\.3\.6\.1\.4\.1\.someenterprise\.7\.1\.3\.1\.1\.1\.3\.0$" matches=".*" result="${0}" />
        </value>
      </assignment>
    </mapping>
    <mapping>
      <assignment name="uei" type="field" >
        <value type="constant" result="uei.opennms.org/translator/spontaneousAlarmsMinorOkay" />
      </assignment>
      <assignment name="sleep" type="parameter">
        <value type="sql" result="select pg_sleep(3)" />
      </assignment>
      <assignment name="ack" type="parameter">
        <value type="sql" result="select getunackdspontaneousalarmsminor()" />
      </assignment>
      <assignment name="acktime" type="parameter">
        <value type="sql" result="select acktime from (select respondtime-pagetime acktime from notifications where notifyid=(select notifyid from notifications where eventid=(SELECT eventid FROM events WHERE eventparms ~ ? and eventuei ='uei.opennms.org/vendor/spontaneousAlarmsMinor'))) acktime where acktime &lt; '5 minutes'" >
          <value type="parameter" name="~^\.1\.3\.6\.1\.4\.1\.someenterprise\.7\.1\.3\.1\.1\.1\.3\.0$" matches=".*" result="${0}" />
        </value>
      </assignment>
    </mapping>
  </mappings>
</event-translation-spec>
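
The only difference between the two mappings is the comparison on acktime (respondtime minus pagetime), which decides which translated UEI comes out. Roughly, in Python, with `translated_uei` being a made-up name and on the assumption that a mapping whose SQL returns no row simply does not apply:

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=5)
UEI_AFTER_DELAY = "uei.opennms.org/translator/spontaneousAlarmsMinorOkay5min"
UEI_WITHIN_DELAY = "uei.opennms.org/translator/spontaneousAlarmsMinorOkay"


def translated_uei(pagetime, respondtime):
    """Pick the translated UEI from how long the notification stayed open."""
    acktime = respondtime - pagetime
    if acktime > DELAY:
        # first mapping ("gt"): the delayed problem mail has presumably gone out
        return UEI_AFTER_DELAY
    if acktime < DELAY:
        # second mapping ("lt"): ack'd while the problem mail was still waiting
        return UEI_WITHIN_DELAY
    return None  # exactly 5 minutes satisfies neither comparison
```

So a recovery notification can be hung on the 5min UEI only, and nobody gets a “RESOLVED: ” mail for a problem they never heard about. Note the gap at exactly 5 minutes, which the SQL above has as well.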

The major event translations looked like this:

<event-translation-spec uei="uei.opennms.org/vendor/spontaneousAlarmsMajorOkay">
  <mappings>
    <mapping>
      <assignment name="uei" type="field" >
        <value type="constant" result="uei.opennms.org/translator/spontaneousAlarmsOkay" />
      </assignment>
      <assignment name="sleep" type="parameter">
        <value type="sql" result="select pg_sleep(3)" />
      </assignment>
      <assignment name="egal" type="parameter">
        <value type="sql" result="select getunackdspontaneousalarms()" />
      </assignment>
    </mapping>
  </mappings>
</event-translation-spec>

<event-translation-spec uei="uei.opennms.org/vendor/q3AlarmOkay">
  <mappings>
    <mapping>
      <assignment name="uei" type="field" >
        <value type="constant" result="uei.opennms.org/translator/q3AlarmCleared" />
      </assignment>
      <assignment name="sleep" type="parameter">
        <value type="sql" result="select pg_sleep(3)" />
      </assignment>
      <assignment name="egal" type="parameter">
        <value type="sql" result="select getunackdq3alarms()" />
      </assignment>
    </mapping>
  </mappings>
</event-translation-spec> 

How much easier would this have been with a decent MIB … On the other hand, it showed once again that OpenNMS is an amazing platform that, most of the time, can be told to do what you want it to … at least in some way :)

OpenNMS UCE 2011

This Thursday and Friday I attended the OpenNMS Users Conference Europe 2011. In the last two years, the program consisted of talks that had been submitted before the conference, and as an attendee you only had to choose which ones you wanted to listen to. This year, they decided to hold a course covering the basics of the software on day one (they usually teach this course in 4 days) and a barcamp-style thing on day two.

To be honest, I didn’t expect much from day one since I’ve been using OpenNMS for over 4 years now … Once again, Tarus did a great job talking about OpenNMS and keeping everyone interested and awake by putting in an anecdote or a joke every now and then. And actually, I did learn quite a few things. So now I understand that an “RRA” is a round robin archive and what the numbers in such a configuration actually mean.

RRA:AVERAGE:0.5:1:2016
RRA:AVERAGE:0.5:12:148

So the first RRA would store 2016 entries, each holding the average value of 1 sample. The second one would store 148 entries, each holding the average value of 12 samples. The 0.5 means that it needs at least 6 of the 12 (0.5 or 50%) values in order to actually store a value.
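
To see what that means in terms of time, here is a little Python helper that does the arithmetic. `describe_rra` is a made-up name, and the 300 second step is an assumption (the usual 5 minute collection interval); it is not part of the RRA lines themselves.

```python
import math


def describe_rra(spec, step_seconds=300):
    """Break an RRA:CF:xff:steps:rows definition down into time spans."""
    kind, cf, xff, steps, rows = spec.split(":")
    if kind != "RRA":
        raise ValueError(f"not an RRA definition: {spec!r}")
    steps, rows, xff = int(steps), int(rows), float(xff)
    return {
        "cf": cf,
        # how much time one stored entry covers
        "seconds_per_row": steps * step_seconds,
        # how far back the whole archive reaches
        "retention_days": steps * rows * step_seconds / 86400,
        # samples that have to be known for an entry to be stored
        "min_known_samples": math.ceil(steps * (1 - xff)),
    }
```

With a 300 second step, `describe_rra("RRA:AVERAGE:0.5:1:2016")` gives a retention of 7 days, and for the second RRA each entry covers one hour.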

Aaand … I finally understood what an (drumroll) alarm is. I had never had a use for them since I apparently could do anything I wanted OpenNMS to do using just events, so I never really tried to understand the alarms concept. Turns out, afaiu, alarms are just something a user sees in the WebUI, so I’m not too interested in that. Although … there was one thing that sounded interesting: being able to keep only the most recent event of a certain type around instead of storing all of them. I’ll need to dig into that, I guess.

From day 2 I really did not know what to expect. I had never attended a barcamp before and what I read about it on the internet didn’t really give me a good idea of what was going to happen. Maybe I didn’t read carefully enough. So anyway, they had everybody get up and come to the front, introduce themselves and then briefly talk about what they were interested in regarding OpenNMS and what they might offer to talk about.

So I figured since I did the HA talk last year and the slides were still on my laptop, I could offer to give that presentation if anyone’d be interested in that. After everybody talked about what they’d like to hear or wanted to talk about, everyone got to vote on the available topics and the top 9 voted talks would then be held in the 3 rooms available for the conference. Turns out, my topic was among the top 5 of the offered talks. So like 10 minutes later, I started giving that talk from last year.

I hadn’t looked at the presentation since then, it was about 80 slides long and I only had 90 minutes to give the talk. So as soon as the projector was working, I started rambling about open source HA clusters and went through the setup I had created about a year ago. Actually, I think it went rather smoothly considering I had not thought about this talk for about a year, so I’m quite satisfied with how it turned out. Some guy (sorry, I’m _really_ bad with names) even gave me some positive feedback on the talk, which always feels good.

After lunch, there was a talk on provisioning, for which I had also volunteered to share my use case of provisioning with a DNS server as the source. While David did most of the talk, I was able to slide this in and I think there was some interest in it.

This barcamp approach was a completely new concept of a conference to me but I can’t say I didn’t like it. While you could see that some people were not that comfortable talking in front of the entire group of 60 (!!!) people, I think I kind of got used to it over the last couple of years and I’m quite happy with that.

So let’s get some sleep and go bike riding tomorrow :)