Stupid SNMP trap design worked around in OpenNMS

Not too long ago, I was tasked with implementing notifications triggered by SNMP traps sent to OpenNMS from a telephone system. Every task like this has felt like more or less a pain in the ass, but this one was definitely the worst …

In addition to the actual telephone business, the device also was connected to a bunch of electrical contacts in doors, smoke and water detectors and the like so it could send traps for those kinds of events, too.

The MIB for the device implemented exactly 2 traps (read: two), a so-called “spontaneous alarm” and a “q3 alarm”. The spontaneous alarm was triggered whenever one of the electrical contacts disconnected or connected, and the q3 alarm was for telephone business not working as expected. Each trap had a parameter indicating the severity, a couple of parameters holding information about the actual event the device wanted to report, and one parameter holding an ever-increasing device-internal notificationId.

The problem was, the exact same spontaneous alarm trap OID could indicate a resolved minor event like an open door to the storage room or a new hazardous event like fire in the server room.

So the first task was to decide whether a trap held a “problem” or an “ok”. This was rather easily done by configuring a varbind filter on the severity parameter like

<varbind>
<vbnumber>4</vbnumber>
<vbvalue>~[01234]</vbvalue>
</varbind> 

for the “problem” events and

<varbind>
<vbnumber>4</vbnumber>
<vbvalue>5</vbvalue>
</varbind> 

for the “ok” events.

Whether the trap corresponded to a minor or major event was only visible through a “line” parameter, which was also matchable with a varbind filter after I had received a list of what was connected to which line. There were like a dozen “minor” lines which could be implemented like this:

<varbind>
<vbnumber>9</vbnumber>
<vbvalue>~.*(A|B|C|D|E|F|G|H|I|J|K|L|M).*</vbvalue>
</varbind> 

Any other line than those meant that something serious was wrong. So there was one event definition with the varbind mentioned above and another one without it.

By now, from only 2 traps, we already have 6 events:

  1. spontaneousAlarmsMinor
  2. spontaneousAlarmsMinorOkay
  3. spontaneousAlarmsMajor
  4. spontaneousAlarmsMajorOkay
  5. q3Alarm
  6. q3AlarmOkay
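
To make the combination of the two varbind filters concrete, here is a minimal Python sketch of the decision that the four spontaneous-alarm event definitions encode together. It is not OpenNMS code: the function name is made up, the line names A to M are the placeholders from the filter above, and it assumes that a `~` pattern has to match the complete varbind value.

```python
import re

# Patterns copied from the varbind filters above (without the leading "~").
PROBLEM_SEVERITY = re.compile(r"[01234]")
MINOR_LINES = re.compile(r".*(A|B|C|D|E|F|G|H|I|J|K|L|M).*")


def classify_spontaneous_alarm(severity: str, line: str) -> str:
    """Return the event name a spontaneous alarm trap would end up as.

    Assumptions: severity 5 means "ok", 0-4 mean "problem", and the
    regular expressions must match the whole varbind value.
    """
    if severity == "5":
        suffix = "Okay"
    elif PROBLEM_SEVERITY.fullmatch(severity):
        suffix = ""
    else:
        raise ValueError(f"unexpected severity {severity!r}")
    # Any line that is not on the "minor" list counts as major.
    kind = "Minor" if MINOR_LINES.fullmatch(line) else "Major"
    return f"spontaneousAlarms{kind}{suffix}"
```

With these assumptions, `classify_spontaneous_alarm("2", "C")` gives `spontaneousAlarmsMinor` and `classify_spontaneous_alarm("5", "X")` gives `spontaneousAlarmsMajorOkay`. The two q3 events only need the severity check.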

Technically, the small number of electrical connector lines corresponding to minor events could have been split into individual events, but since there were like a hundred other lines connected to some kind of thingy considered “important”, I wasn’t too keen on splitting it up any further. Plus: they tended to put new things there rather regularly, so there had to be something that caught any “line”, and I decided to stick with these 6.

Here’s how they wanted OpenNMS to handle those events:

First step: Notifications … The spontaneous minor problems should be notified with a delay of 5 minutes, the spontaneous major problems and q3 problems should be notified immediately. This was achieved with the configuration of two different destination paths, one with a delay of 5m, one without delay.

    <path name="email" initial-delay="0s">
        <target interval="0s">
            <name xmlns="">admin</name>
            <autoNotify xmlns="">auto</autoNotify>
            <command xmlns="">javaEmail</command>
        </target>
    </path>
    <path name="email5mindelay" initial-delay="5m">
        <target interval="0s">
            <name xmlns="">admin</name>
            <autoNotify xmlns="">auto</autoNotify>
            <command xmlns="">javaEmail</command>
        </target>
    </path>

These paths were then used in the notifications:

<notification name="q3Alarm" status="on" writeable="yes">
  <uei xmlns="">uei.opennms.org/vendor/q3Alarm</uei>
  <destinationPath xmlns="">email</destinationPath>
</notification> 
<notification name="spontaneousAlarmsMajor" status="on" writeable="yes">
  <uei xmlns="">uei.opennms.org/vendor/spontaneousAlarmsMajor</uei>
  <destinationPath xmlns="">email</destinationPath>
</notification>
<notification name="spontaneousAlarmsMinor" status="on" writeable="yes">
  <uei xmlns="">uei.opennms.org/vendor/spontaneousAlarmsMinor</uei>
  <destinationPath xmlns="">email5mindelay</destinationPath>
</notification>

Once the notifications were configured, the next challenge was to not only send notifications about problems but to also send recovery-notifications when a problem was fixed … and that’s where the fun began … Usually, automatic recovery-notifications are set up in the notifd configuration with a pattern of

<auto-acknowledge resolution-prefix="RESOLVED: "
uei="uei.opennms.org/my/events/problem"
acknowledge="uei.opennms.org/my/events/okay">
  <match xmlns="">nodeid</match>
  <match xmlns="">interfaceid</match>
</auto-acknowledge>

Unfortunately, OpenNMS (as of 1.8.1, which is what I used when developing this solution) is only capable of matching UEIs and event parameters (which are something different from trap parameters) in this step. And it is perfectly fine for $device to have sent three traps with the same OID indicating that someone switched on the light, that there’s water in room one and that there’s fire in room eight. So there would be three events with the same UEI meaning completely different things, and a notifd configuration that just ack’d based on the UEI might have acknowledged the fire event when the lights were switched off again. Bad idea …
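
The difference between the two matching strategies can be sketched in a few lines of Python. This is only an illustration of the problem, not how notifd works internally; the `Notification` class, the function names and the IDs 101 and 102 are hypothetical, and all notifications are assumed to come from the same node and interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Notification:
    uei: str               # UEI of the "problem" event
    notification_id: str   # device-internal notificationId from the trap
    text: str
    answered_by: Optional[str] = None


def ack_by_uei(pending: List[Notification], problem_uei: str) -> List[str]:
    """UEI-only matching: acknowledges every open notification with that UEI."""
    acked = []
    for n in pending:
        if n.answered_by is None and n.uei == problem_uei:
            n.answered_by = "auto"
            acked.append(n.text)
    return acked


def ack_by_notification_id(pending: List[Notification], problem_uei: str,
                           notification_id: str) -> List[str]:
    """Acknowledges only the notification the "ok" trap actually refers to."""
    acked = []
    for n in pending:
        if (n.answered_by is None and n.uei == problem_uei
                and n.notification_id == notification_id):
            n.answered_by = "auto"
            acked.append(n.text)
    return acked


MAJOR_UEI = "uei.opennms.org/vendor/spontaneousAlarmsMajor"


def sample() -> List[Notification]:
    # Two open problems that share one UEI but mean different things.
    return [Notification(MAJOR_UEI, "101", "water in room one"),
            Notification(MAJOR_UEI, "102", "fire in room eight")]
```

Here `ack_by_uei(sample(), MAJOR_UEI)` acknowledges both notifications, fire included, while `ack_by_notification_id(sample(), MAJOR_UEI, "101")` only touches the water one.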

Instead, I had to figure out a way to auto-acknowledge the notifications by looking at the device-internal notificationId, which was sent as a trap parameter. To get to the point of the whole posting … here’s how I did it:

I configured every “ok”-event to run through the event translator and had the translator call a PL/pgSQL procedure which updated the original “problem”-notification in the notifications table of the database and acknowledged it by setting respondtime to “now()” and answeredby to “admin” (which, looking back, should have been a better name). This prevented the “minor” notifications that were still waiting for $delay to expire from being sent. Afterwards, the event translator created a new event on which I configured a new notification that simply wrote the same text as the original notification with an additional leading “RESOLVED: ” in the subject.

In order to be able to easily match the device-internal notificationId sent in the trap (which is just part of the trap, not necessarily of the notification) I put this ID into “numericmsg” of the notifications:

<numeric-message xmlns="">%parm[#3]%</numeric-message> 

So here goes the crazy stuff:

-- For each unacknowledged spontaneousAlarmsMinor notification, acknowledge the
-- open notifications carrying its notificationId (numericmsg) if a
-- spontaneousAlarmsMinorOkay event with that ID in its eventparms exists.
create or replace function getunackdspontaneousalarmsminor() returns void as $body$
declare
  i record;
begin
  for i in select numericmsg from notifications
           where answeredby is null and respondtime is null
             and eventuei like '%spontaneousAlarmsMinor'
           order by numericmsg
  loop
    execute $$update notifications set answeredby='admin', respondtime='now()' where respondtime is null and answeredby is null and numericmsg=$$ || quote_literal(i.numericmsg) || $$ and exists ( select eventid from events where eventparms ~ '^.*$$ || i.numericmsg || $$[(].*' and eventuei = 'uei.opennms.org/vendor/spontaneousAlarmsMinorOkay' )$$;
  end loop;
end
$body$ LANGUAGE 'plpgsql';

-- Same logic for the major spontaneous alarms.
create or replace function getunackdspontaneousalarms() returns void as $body$
declare
  i record;
begin
  for i in select numericmsg from notifications
           where answeredby is null and respondtime is null
             and eventuei like '%spontaneousAlarmsMajor'
           order by numericmsg
  loop
    execute $$update notifications set answeredby='admin', respondtime='now()' where respondtime is null and answeredby is null and numericmsg=$$ || quote_literal(i.numericmsg) || $$ and exists ( select eventid from events where eventparms ~ '^.*$$ || i.numericmsg || $$[(].*' and eventuei = 'uei.opennms.org/vendor/spontaneousAlarmsMajorOkay' )$$;
  end loop;
end
$body$ LANGUAGE 'plpgsql';

-- Same logic for the q3 alarms.
create or replace function getunackdq3alarms() returns void as $body$
declare
  i record;
begin
  for i in select numericmsg from notifications
           where answeredby is null and respondtime is null
             and eventuei like '%q3Alarm'
           order by numericmsg
  loop
    execute $$update notifications set answeredby='admin', respondtime='now()' where respondtime is null and answeredby is null and numericmsg=$$ || quote_literal(i.numericmsg) || $$ and exists ( select eventid from events where eventparms ~ '^.*$$ || i.numericmsg || $$[(].*' and eventuei = 'uei.opennms.org/vendor/q3AlarmOkay' )$$;
  end loop;
end
$body$ LANGUAGE 'plpgsql';
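
One thing to watch in the `exists` clauses: the notificationId is spliced into the regular expression without an anchor on its left, so a shorter ID also matches inside a longer one. The following Python sketch reproduces the pattern on a made-up eventparms string. The `name=value(type,encoding)` layout, the OIDs and the stricter variant anchored on `=` are assumptions for illustration, not something taken from the original setup.

```python
import re


def okay_event_matches(eventparms: str, notification_id: str) -> bool:
    """Mimic the SQL test: eventparms ~ '^.*<id>[(].*'."""
    pattern = "^.*" + notification_id + "[(].*"
    return re.search(pattern, eventparms) is not None


def okay_event_matches_anchored(eventparms: str, notification_id: str) -> bool:
    """Stricter variant: the ID has to be a complete parameter value."""
    pattern = "=" + re.escape(notification_id) + "[(]"
    return re.search(pattern, eventparms) is not None


# Hypothetical eventparms of an "ok" event carrying notificationId 1234.
SAMPLE = (".1.3.6.1.4.1.99999.1=5(string,text);"
          ".1.3.6.1.4.1.99999.3=1234(string,text)")
```

Both functions accept `"1234"` for `SAMPLE`, but only the original pattern also accepts `"234"`. With an ever-increasing ID this only matters once old and new IDs differ in length, which may be why it never showed up in practice.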

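The event translation for the minor events looked like this. Notice “lt” vs. “gt” in the SQL statements: the query in the first mapping only returns a row if more than five minutes passed between the notification and its acknowledgement (so the delayed mail had already gone out), the query in the second only if less than five minutes passed: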

<event-translation-spec uei="uei.opennms.org/vendor/spontaneousAlarmsMinorOkay">
  <mappings>
    <mapping>
      <assignment name="uei" type="field" >
        <value type="constant" result="uei.opennms.org/translator/spontaneousAlarmsMinorOkay5min" />
      </assignment>
      <assignment name="sleep" type="parameter">
        <value type="sql" result="select pg_sleep(3)" />
      </assignment>
      <assignment name="ack" type="parameter">
        <value type="sql" result="select getunackdspontaneousalarmsminor()" />
      </assignment>
      <assignment name="acktime" type="parameter">
        <value type="sql" result="select acktime from (select respondtime-pagetime acktime from notifications where notifyid=(select notifyid from notifications where eventid=(SELECT eventid FROM events WHERE eventparms ~ ? and eventuei ='uei.opennms.org/vendor/spontaneousAlarmsMinor'))) acktime where acktime &gt; '5 minutes'" >
          <value type="parameter" name="~^\.1\.3\.6\.1\.4\.1\.someenterprise\.7\.1\.3\.1\.1\.1\.3\.0$" matches=".*" result="${0}" />
        </value>
      </assignment>
    </mapping>
    <mapping>
      <assignment name="uei" type="field" >
        <value type="constant" result="uei.opennms.org/translator/spontaneousAlarmsMinorOkay" />
      </assignment>
      <assignment name="sleep" type="parameter">
        <value type="sql" result="select pg_sleep(3)" />
      </assignment>
      <assignment name="ack" type="parameter">
        <value type="sql" result="select getunackdspontaneousalarmsminor()" />
      </assignment>
      <assignment name="acktime" type="parameter">
        <value type="sql" result="select acktime from (select respondtime-pagetime acktime from notifications where notifyid=(select notifyid from notifications where eventid=(SELECT eventid FROM events WHERE eventparms ~ ? and eventuei ='uei.opennms.org/vendor/spontaneousAlarmsMinor'))) acktime where acktime &lt; '5 minutes'" >
          <value type="parameter" name="~^\.1\.3\.6\.1\.4\.1\.someenterprise\.7\.1\.3\.1\.1\.1\.3\.0$" matches=".*" result="${0}" />
        </value>
      </assignment>
    </mapping>
  </mappings>
</event-translation-spec>
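
The effect of the two mappings can be sketched as a small decision function. This assumes that the translator skips a mapping whose SQL value returns no row and then tries the next one; the function name is hypothetical and the five minutes are the delay of the `email5mindelay` path.

```python
from datetime import timedelta
from typing import Optional

NOTIFICATION_DELAY = timedelta(minutes=5)


def translated_uei(acktime: timedelta) -> Optional[str]:
    """Pick the UEI of the translated event from respondtime - pagetime."""
    if acktime > NOTIFICATION_DELAY:
        # First mapping: the delayed notification has already been sent.
        return "uei.opennms.org/translator/spontaneousAlarmsMinorOkay5min"
    if acktime < NOTIFICATION_DELAY:
        # Second mapping: the problem was resolved before anything was sent.
        return "uei.opennms.org/translator/spontaneousAlarmsMinorOkay"
    # Exactly five minutes satisfies neither strict comparison.
    return None
```

A door that was open for two minutes ends up as `spontaneousAlarmsMinorOkay`, one that was open for ten minutes as `spontaneousAlarmsMinorOkay5min`.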

The translations for the major and q3 events looked like this:

<event-translation-spec uei="uei.opennms.org/vendor/spontaneousAlarmsMajorOkay">
  <mappings>
    <mapping>
      <assignment name="uei" type="field" >
        <value type="constant" result="uei.opennms.org/translator/spontaneousAlarmsOkay" />
      </assignment>
      <assignment name="sleep" type="parameter">
        <value type="sql" result="select pg_sleep(3)" />
      </assignment>
      <assignment name="egal" type="parameter">
        <value type="sql" result="select getunackdspontaneousalarms()" />
      </assignment>
    </mapping>
  </mappings>
</event-translation-spec>

<event-translation-spec uei="uei.opennms.org/vendor/q3AlarmOkay">
  <mappings>
    <mapping>
      <assignment name="uei" type="field" >
        <value type="constant" result="uei.opennms.org/translator/q3AlarmCleared" />
      </assignment>
      <assignment name="sleep" type="parameter">
        <value type="sql" result="select pg_sleep(3)" />
      </assignment>
      <assignment name="egal" type="parameter">
        <value type="sql" result="select getunackdq3alarms()" />
      </assignment>
    </mapping>
  </mappings>
</event-translation-spec> 

How much easier would this have been with a decent MIB … On the other hand, it showed once again that OpenNMS is an amazing platform that most of the time can be told to do what you want it to … at least in some way :)

5 responses to “Stupid SNMP trap design worked around in OpenNMS”

  1. In which file would I configure this PL/pgSQL procedure? I have a similar problem with auto-acknowledging notifications, too.

  2. Very nice and useful article. I’m trying to do something similar, but in order to correlate a breach with an alarm clearing, I need to consider %nodeid% and 3 additional parameters. So my idea is to serialize the parameters into “numericmsg”, but my question is how to unserialize them and compare them individually with each of the parameters. Do you know how this could be done, or whether it is even possible via the event translation? The documentation is not very explicit about the capabilities of OpenNMS, which I think are very flexible. Many thanks, Alex

    • I haven’t used OpenNMS for half a year due to a job switch, so I doubt I can help here. Try the mailing list or the IRC channel – those guys are really helpful. Tell ’em I said hi :)
