Searching date-time values in Splunk

If you’ve worked with Splunk for a little while, you’re probably familiar with the field _time.  With Splunk being a time series data store, it makes sense that every event has a time.  Internally, Splunk parses the timestamp from your event and converts it to epoch (seconds since Jan 1 1970 00:00:00 UTC).  When you use the time range picker to select a time range, that is also converted internally to epoch and used to control what data is searched.
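
If you want to see that epoch value for yourself, here is a quick throwaway search (just a sketch, not tied to any of the data below).  The makeresults command gives you one event stamped with the current time, and copying _time into another field exposes the raw epoch number:

| makeresults
| eval epoch_seconds=_time
| eval human_readable=strftime(_time, "%Y-%m-%d %H:%M:%S")
| table _time epoch_seconds human_readable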

Sometimes, though, you may have events with multiple timestamps.  While this is less common in your typical infosec dataset, it can happen in other types of data.  For argument’s sake, let’s suppose we have an event with two timestamps in it.  One is the main _time of the event, and the other is some other related timestamp.  What we want to do is be able to filter on BOTH of these timestamps efficiently.

Coming from an RDBMS world, this other timestamp might be defined as a DATETIME data type (MySQL, DB2), a TIMESTAMP (Oracle, PostgreSQL), or some other similar data type.  Defined in this way, you can build indexes that include that column and efficiently filter based upon it.  Splunk does not work in this way.  To begin with, Splunk doesn’t even have the concept of “columns”.  Also, all of the data in your event is stored as text in a field called _raw.  We can enable indexed extractions to have Splunk create indexed fields from all of your data at index time, but even then there is no concept of data type.  This means that your secondary timestamp is stored as a text string, which makes filtering on it incredibly difficult.

Here’s our example data.  It’s lame, I know; bear with me.

date,time,seqno,datefield
2020-06-27,21:48:00,1,1995-12-31 18:35:35
2020-06-27,21:49:00,2,2005-01-01 12:00:00

In this super simple / lame CSV, we have two timestamps.  For argument’s sake, let’s say this is HR data, with the “date” and “time” fields representing the hire event and “datefield” being the person’s birthdate.  (Sorry, I’m a horrible example-maker.)  We have configured INDEXED_EXTRACTIONS in props.conf per the below to create indexed fields from each column of this CSV.

[epoch_example]
INDEXED_EXTRACTIONS=csv
TIMESTAMP_FIELDS=date,time
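
Once that’s in place and the file has been indexed, a quick sanity check that each CSV column really did become a field (field names straight from the header row above) can be as simple as:

index="epoch_example"
| table _time seqno datefield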

Cool.  Now we want to search for employees born in the 1990s.

index="epoch_example" datefield=199*

Looking at this in the job inspector, we see the lispy generated is relatively efficient.  Because we used a wildcard, Splunk is able to scan for text strings matching 199* and use that as filter criteria.

base lispy: [ AND 199* index::epoch_example ]

But this is absolutely treating datefield as nothing more than a text string.  If we wanted to do more elaborate filtering on it, we might try something like this:

index="epoch_example" datefield < "2003-02-04 13:11:11"

We get results back fast … because there are only two events to look at.  Let’s compare the lispy to the one above.

base lispy: [ AND index::epoch_example ]

Oops.  We’re now scanning EVERYTHING in the index at the lispy stage, and then having to post-filter once events are brought back from raw on disk.  It works, but it’s incredibly inefficient.  As I said above, Splunk does not store datefield in any particular data type other than a text string, and its inverted index scheme does not know how to do range comparisons on text strings.  This is unfortunate, and a whole lot of build-up to a neat trick I’d like to share.
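
For completeness: the usual search-time workaround is to convert the string with strptime and compare in a where clause, along the lines of the sketch below.  It works, but it has the same problem; every event still has to be pulled back off disk before the filter runs.

index="epoch_example"
| where strptime(datefield, "%Y-%m-%d %T") < strptime("2003-02-04 13:11:11", "%Y-%m-%d %T")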

Splunk can do range comparisons on indexed fields when those indexed fields are integers.  Let’s demonstrate.  I’m going to add a new line to my props.conf to wire up an index-time transformation.

[epoch_example]
INDEXED_EXTRACTIONS=csv
TIMESTAMP_FIELDS=date,time
TRANSFORMS-evals = epoch_example_datefield_epoch

And in my transforms.conf I’m going to use a neat new thing called INGEST_EVAL to create a new indexed field at index time.  Yes, this means I have to re-index this data.  Nothing’s perfect, sorry.

[epoch_example_datefield_epoch]
INGEST_EVAL = datefield_epoch=strptime(datefield,"%Y-%m-%d %T")
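
If you want to sanity-check the strptime format string before committing to a re-index, you can preview the same eval at search time with a throwaway makeresults search (this is just a preview using one of the sample values above, not part of the config):

| makeresults
| eval datefield="1995-12-31 18:35:35"
| eval datefield_epoch=strptime(datefield, "%Y-%m-%d %T")
| table datefield datefield_epoch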

So now – at index time – Splunk will store my datefield twice: once as its normal string value, and once in another field called datefield_epoch that holds the epoch value.  We can now use fields.conf to tell Splunk that datefield_epoch is an indexed field, and do range queries against it.

[datefield_epoch]
INDEXED=true
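
As a nice side effect, once datefield_epoch is an indexed field it is also visible to tstats, which makes for a quick way to confirm it actually landed in the index:

| tstats count where index="epoch_example" by datefield_epoch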

Now let’s run a search and compare the lispy we get:

index="epoch_example" datefield_epoch < 123456789

And its lispy:

base lispy: [ AND index::epoch_example [ LT datefield_epoch 1234567890 ] ]

Ooh, look at that.  A range search on the epoch field.  Now we’re cookin’ with gas.  The only truly unfortunate thing about this is that now my user has to know to convert times to epoch before putting them into the search string.  Maybe there’s an easy way around that with a macro.  I originally had a really ugly macro, but the ever-helpful Martin Müller showed me a much more elegant way using an eval-based macro (defined in macros.conf):

[epoch(1)]
args = arg1
definition = printf("%d",round(strptime("$arg1$", "%Y-%m-%d %T")))
iseval = 1

Now I can use this macro to make my search easier:

index="epoch_example" datefield_epoch < `epoch("2009-02-13 17:31:30")`

And my resultant lispy still has the range comparison in it:

base lispy: [ AND index::epoch_example [ LT datefield_epoch 1234567890 ] ]
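
Since the macro is eval-based, it expands to the literal epoch number before the search is parsed.  If you ever want to eyeball the expansion itself, something like this (another throwaway sketch) will show it:

| makeresults
| eval expanded=`epoch("2009-02-13 17:31:30")`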

I’m pretty pleased with this.  If you want to play with / use this on your own, I’ve put all of the configs above in a Splunk app on GitHub at https://github.com/duckfez/splunk-epoch-example.  Enjoy, and be sure to smash those like and subscribe buttons.
