Splunk – bucket lexicons and segmentation

About Segmentation

Event segmentation is a key part of how Splunk processes your data, both as it is indexed and as it is searched.  At index time, the segmentation configuration determines the rules Splunk uses to extract segments (or tokens) from the raw event and store them as entries in the lexicon.  Understanding what’s in your lexicon, and the part segmentation plays in it, can help you make your Splunk installation use less disk space, and possibly even run a little faster.

Peering into a tsidx file

Tsidx files are a central part of how Splunk stores your data in a fashion that makes it easily searchable.  Each bucket within an index has one or more tsidx files.  Every tsidx file has two main components – the values list and the lexicon.  The values list is a list of pointers (seek locations) to every event within a bucket’s rawdata.  The lexicon is a list of all of the segments found at index time, each with a “posting list” of the values-list entries to follow to find the rawdata of events containing that segment.
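
Conceptually, this is a classic inverted index.  Here is a toy sketch in Python (the names and the whitespace-only tokenization are my own illustration, not Splunk’s actual on-disk format or segmentation rules):

```python
# Toy model of a tsidx file: a "values list" of seek offsets into the
# rawdata, plus a lexicon mapping each segment to a posting list of
# values-list entries for the events containing that segment.
# (Conceptual only -- the real on-disk format is far more compact.)

def build_toy_tsidx(events):
    values_list = []   # entry number -> seek offset into rawdata
    lexicon = {}       # segment -> posting list of values-list entries
    offset = 0
    for event in events:
        entry = len(values_list)
        values_list.append(offset)
        offset += len(event) + 1   # +1 for a newline between events
        for segment in sorted(set(event.lower().split())):
            lexicon.setdefault(segment, []).append(entry)
    return values_list, lexicon

values_list, lexicon = build_toy_tsidx([
    "Built outbound TCP connection",
    "Teardown TCP connection",
])
print(lexicon["tcp"])     # posting list [0, 1]: both events contain "tcp"
print(lexicon["built"])   # posting list [0]: only the first event
```

To find the raw events for a term, you follow its posting list into the values list and seek to those offsets in the rawdata – which is exactly the lexicon-to-posting-list relationship walklex lets you inspect.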

Splunk includes a not-very-well-documented utility called walklex.  Based on some comments in the docs, it should appear in the list of Command line tools for use with Support, but it’s not there yet.  Keep an eye on that topic for more official details – I’ll bet they fix that soon.  There’s not a whole lot to walklex: you run it, feeding it a tsidx file name and a single term to search for, and it dumps the matching lexicon terms from the tsidx file, along with a count of the number of rawdata postings that contain each term.

Segmentation example

I have a sample event from a Cisco ASA, indexed into an entirely empty index.  Let’s look at how the event is segmented by Splunk’s default segmentation rules.  Here is the raw event, followed by the output of walklex for the bucket in question.

2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 9986454 for outside:101.123.123.111/443 (101.123.123.111/443) to vlan9:192.168.120.72/57625 (172.16.1.2/64974)
$ splunk cmd walklex 1399698005-1399698005-17952229929964206551.tsidx ""
my needle:
0 1  host::firewall.example.com
1 1  source::/home/dwaddle/tmp/splunk/cisco_asa/firewall.example.com.2014-05-10.log
2 1  sourcetype::cisco_asa
3 1 %asa-6-302013:
4 1 00
5 1 00:00:05.700433
6 1 05
7 1 1
8 1 10
9 1 101
10 1 101.123.123.111/443
11 1 111
12 1 120
13 1 123
14 1 16
15 1 168
16 1 172
17 1 172.16.1.2/64974
18 1 192
19 1 2
20 1 2014
21 1 2014-05-10
22 1 302013
23 1 443
24 1 57625
25 1 6
26 1 64974
27 1 700433
28 1 72
29 1 9986454
30 1 _indextime::1399829196
31 1 _subsecond::.700433
32 1 asa
33 1 built
34 1 connection
35 1 date_hour::0
36 1 date_mday::10
37 1 date_minute::0
38 1 date_month::may
39 1 date_second::5
40 1 date_wday::saturday
41 1 date_year::2014
42 1 date_zone::local
43 1 for
44 1 host::firewall.example.com
45 1 linecount::1
46 1 outbound
47 1 outside
48 1 outside:101.123.123.111/443
49 1 punct::--_::._%--:_______:.../_(.../)__:.../_(.../)
50 1 source::/home/dwaddle/tmp/splunk/cisco_asa/firewall.example.com.2014-05-10.log
51 1 sourcetype::cisco_asa
52 1 tcp
53 1 timeendpos::26
54 1 timestartpos::0
55 1 to
56 1 vlan9
57 1 vlan9:192.168.120.72/57625

Some things stick out immediately – all uppercase has been folded to lowercase; indexed fields (host, source, sourcetype, punct, linecount, etc.) are of the form name::value; and some tokens, like IP addresses, are stored both in pieces and whole.  But let’s look at a larger example.
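
A rough sketch of what’s happening (my own approximation of the default rules – the real behavior is governed by segmenters.conf and is more involved):

```python
import re

# Rough approximation of default segmentation: major segmenters (here
# just whitespace) produce whole tokens; minor segmenters (here . : / -)
# additionally break each whole token into pieces.  Both the whole token
# and its pieces land in the lexicon, lowercased.
MINOR = r"[.:/\-]"

def segment(event):
    terms = set()
    for token in event.lower().split():
        terms.add(token)                                       # whole token
        terms.update(t for t in re.split(MINOR, token) if t)   # the pieces
    return terms

print(sorted(segment("outside:101.123.123.111/443")))
# ['101', '111', '123', '443', 'outside', 'outside:101.123.123.111/443']
```

This is why both the numeric pieces and the whole token show up as separate lexicon entries above (the real segmenter also emits some intermediate combinations, such as 101.123.123.111/443 on its own).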

I’ve indexed a whole day’s worth of the above firewall log – 5,707,878 events.  The original unindexed file is about 782MB, and the resulting Splunk bucket is 694MB.  Within the bucket, the rawdata is 156MB and the tsidx file is 538MB.

Cardinality and distribution within the tsidx lexicon

When we look at the lexicon for this tsidx file, we can see that the cardinality (number of unique entries) of the lexicon is about 11.8 million.  The average lexicon keyword occurs in about 26 events.

$ splunk cmd walklex 1399784399-1399698000-17952400407545127995.tsidx ""  | egrep -v "^my needle" | wc -l
11801764
$ splunk cmd walklex 1399784399-1399698000-17952400407545127995.tsidx ""  | 
      egrep -v "^my needle" | 
      awk ' BEGIN { X=0; }  { X=X+$2; } END { print X, NR, X/NR } '
309097860 11801764 26.1908

Almost 60% of the lexicon entries (7,047,286) appear in only a single event – and of those, 5,707,878 are the textual versions of timestamps.

$ splunk cmd walklex 1399784399-1399698000-17952400407545127995.tsidx ""  | 
      egrep -v "^my needle" | 
      awk '$2==1 { print $0 }' | 
      grep -P "\d\d:\d\d:\d\d\.\d{6}" | 
      wc -l
5707878

Do we need to search on textual versions of timestamps?

Probably not.  Remember that within Splunk, the time (_time) is stored as a first-class dimension of the data.  Every event has a value for _time, and that value is used at search time to decide which buckets are interesting.  It would be rare (if ever) that you would search for the string “20:35:54.271819”.  Instead, you would set your search time range to cover 20:35:54.  The textual representation of timestamps may be something you can trade off for smaller tsidx files.

Configuring segmenters.conf to filter timestamps from being added to the lexicon

I created a $SPLUNK_HOME/etc/system/local/segmenters.conf as follows:

[ciscoasa]
INTERMEDIATE_MAJORS = false
FILTER= ^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{6} (.*)$
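
A quick way to sanity-check the FILTER regex outside of Splunk (an illustrative check in Python – not how Splunk applies the filter internally, but the first capture group is what survives to be segmented):

```python
import re

# The part of the event captured by FILTER's first group is what gets
# segmented; everything outside the group is skipped.  So the timestamp
# at the front of each event never reaches the lexicon.
FILTER = r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{6} (.*)$"

raw = ("2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP "
       "connection 9986454 for outside:101.123.123.111/443")

m = re.match(FILTER, raw)
to_segment = m.group(1) if m else raw
print(to_segment)   # "%ASA-6-302013: Built outbound TCP connection ..."
```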

Then I added to $SPLUNK_HOME/etc/system/local/props.conf a reference to this segmenter configuration:

[cisco_asa]
BREAK_ONLY_BEFORE_DATE=true
TIME_FORMAT=%Y-%m-%d %H:%M:%S.%6N
MAX_TIMESTAMP_LOOKAHEAD=26
SEGMENTATION = ciscoasa

Starting with a clean index, I indexed the same file over again.  Now the same set of events requires 494MB of space in the bucket – 156MB of compressed rawdata and 339MB of tsidx files – saving about 200MB of tsidx space for the same data.  The lexicon now has 5,115,535 entries (down from 11.8 million), and of those, 1,332,323 occur only once in the raw data.  Looking at the items occurring once, a large fraction (1,095,570) are of the form 123.123.124.124/12345 – that is, an IPv4 address and a port number.  The same IP addresses occur with many different port numbers – can we do anything to improve this?  Again, back to segmenters.conf:

[ciscoasa]
INTERMEDIATE_MAJORS = false
FILTER= ^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{6} (.*)$
MAJOR = / [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = : = @ . - $ # % \\ _

This changes the defaults so that “/” becomes a major segmenter.  Now each IP address and port number is stored in the lexicon as a separate entry, instead of there being an entry for every combination of IP and port.  My lexicon (for the same data) now has 2,767,084 entries – 23% of the original cardinality – and the average lexicon entry now occurs in 94 events.  My tsidx file size is down to 277MB – just over half of its original size.
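
The cardinality math behind this is easy to demonstrate with a toy example (made-up addresses; with N addresses and M ports, whole-token entries grow as N×M while split entries grow as N+M):

```python
# Made-up sample: 10 source addresses, each seen with 100 different ports.
ips = [f"10.0.0.{i}" for i in range(1, 11)]
ports = [str(p) for p in range(57000, 57100)]
tokens = [f"{ip}/{port}" for ip in ips for port in ports]

# "/" as a minor segmenter: each whole IP/port combination is kept as a
# lexicon entry, so entries grow multiplicatively (10 x 100 = 1000).
whole_entries = set(tokens)

# "/" as a major segmenter: IPs and ports become separate entries that
# many events share, so entries grow additively (10 + 100 = 110).
split_entries = set()
for t in tokens:
    split_entries.update(t.split("/"))

print(len(whole_entries), len(split_entries))   # 1000 110
```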

Conclusions

What have I gained?  What have I lost?  I’ve lost the ability to search specifically for a textual timestamp.  I’ve gained a reduction in disk space used for the same data indexed.  I’ve slightly reduced the amount of work required to index this data.  I’ve made the job of splunk-optimize easier.

The improvement in disk space usage is significant and easily measured; the other effects are harder to quantify.  Any data going into Splunk that exhibits high cardinality in the lexicon can make your tsidx files as large as (if not larger than) the original data.  As Splunk admins, we don’t expect this because it is atypical for IT data.  By knowing how to measure (and possibly affect) the cardinality of the lexicon within your Splunk index buckets, you will be better equipped to deal with atypical data and the demands it places on your Splunk installation.