Nullqueue Sampling

One of the first things the average Splunk administrator has to learn the hard way is how to send traffic to the Splunk nullQueue.  It’s almost a rite of passage: you configure a new data source, somewhat unaware of the tens of thousands of mostly-useless events it produces.  It blows out your license for a day or two, and then you hit up answers, #splunk on IRC, or file a support case and quickly learn how to use nullQueue.  With a few minutes of configuration, the mostly-useless events are filtered entirely, and you move on to the next challenge.

In some cases, this is not optimal.  Perhaps most of the tens of thousands of events are useless, but removing them entirely hides a solvable problem from your operations team.  How can we rate-limit certain messages so that we still see the event, without spending vast quantities of license volume to do so?  Until Splunk adds proper support for rate-limiting events, here is an approach you can take.

Suppose we have this message, occurring many thousands of times per day:

2014-03-03T23:29:00 INFO [Thread-11] java.lang.NullPointerException refreshing the flim-flam combobulator

We can’t directly tell Splunk “index a max of 1 of these per minute”, but we can use a clever application of regular expressions to accomplish roughly the same thing.  Suppose we build our nullQueue filter as follows:

(props.conf)

[mysourcetype]
TRANSFORMS-null1 = sampledNull

(transforms.conf)
[sampledNull]
REGEX = ^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:(?!00).*java\.lang\.NullPointerException refreshing the flim-flam
DEST_KEY = queue
FORMAT = nullQueue
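The lookahead behavior can be sanity-checked outside of Splunk. Here is a small Python sketch (Python’s `re` module is close enough to PCRE for this pattern); the second sample event is a hypothetical variant of the log line above with a non-zero seconds field:

```python
import re

# Same pattern as the transforms.conf REGEX: the (?!00) negative lookahead
# refuses to match when the seconds field is exactly "00".
pattern = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:(?!00)"
    r".*java\.lang\.NullPointerException refreshing the flim-flam"
)

# One event in the :00 second, one hypothetical variant at :17.
at_zero = ("2014-03-03T23:29:00 INFO [Thread-11] "
           "java.lang.NullPointerException refreshing the flim-flam combobulator")
at_17 = ("2014-03-03T23:29:17 INFO [Thread-11] "
         "java.lang.NullPointerException refreshing the flim-flam combobulator")

# A match means "send to nullQueue"; no match means "index it".
print(pattern.search(at_zero))  # None: the :00 event survives and is indexed
print(pattern.search(at_17))    # a match object: this event is discarded
```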

What we’ve done is use a negative lookahead assertion in PCRE to nullQueue any of these messages that do not occur during the :00 second of a minute. It’s not a perfect rate limit, but it should greatly reduce the quantity of indexed messages of this type without filtering them away entirely. We’re counting on a statistical property: these events occur all the time, roughly evenly distributed across each minute, so “some” will land in the “:00” second. In theory, this samples these events at roughly 1/60th of their original throughput.
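To see the 1/60 figure fall out of that even-distribution assumption, here is a rough simulation in Python (a sketch, not a measurement; the event count is made up): if each event’s seconds field is uniformly random, roughly one in sixty survives the filter.

```python
import random

random.seed(0)  # deterministic, for the sake of the example
n = 600_000     # hypothetical daily volume of the noisy event

# An event is kept (indexed) only if it lands in the :00 second of its minute;
# everything else matches the transform and goes to nullQueue.
kept = sum(1 for _ in range(n) if random.randrange(60) == 0)

print(kept, kept / n)  # roughly n/60 events survive, i.e. about 1.7%
```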

If you find this useful, or if you can think of a better way of accomplishing this please leave a comment.