Splunk 7.2.2 and systemd

Consider this a draft.  I’ll update it as I have time, but I’m posting now because it may help someone.

Updated 2019-04-07:  Some improvements thanks to Red Hat support.  I am also trying to collect the knowledge and experience of other SplunkTrust and Splunk community people in order to document this more completely.  Many thanks to automine, jkat54, xpac, and Gareth for sharing issues and ideas.

Splunk 7.2.2 brought along new features (which previously didn’t happen in a “maintenance release” – but that’s another topic for another time).  One of the new features is “systemd support”.  It didn’t take long before folks were on Splunk Answers wondering where their cheese had been moved to.  Some workarounds were provided, some of which work in some cases but not others.  So, @automine and I dug into it a little more late today.  (We’re not done yet, though.)

 

Systemd changes how Splunk does start up / shut down

In prior versions of Splunk, running the splunk enable boot-start command would create the files and symlinks in /etc/rc.d and its subdirectories that cause Splunk to start as part of multi-user initialization (runlevel 3).  But when Splunk 7.2.2 is installed on a systemd-based system and you run splunk enable boot-start, it instead creates a systemd unit file, usually /etc/systemd/system/Splunkd.service.

Another thing that happens is when the Splunk CLI detects that splunkd is running “under systemd” it changes its mode of operation for the start, stop, and restart commands.   Specifically, the /opt/splunk/bin/splunk binary calls systemctl to see if Splunkd.service exists.  Then, if so, it passes stop / start / restart calls through as calls to systemctl.  Below is a snippet of an strace capture of me running /opt/splunk/bin/splunk stop.  (I snipped a lot out for brevity.)

17169 execve("/opt/splunk/bin/systemctl", ["systemctl", "show", "Splunkd", "--property=Type,ExecStart,LoadSt"...], [/* 34 vars */]) = -1 ENOENT (No such file or directory)
17169 execve("/usr/local/sbin/systemctl", ["systemctl", "show", "Splunkd", "--property=Type,ExecStart,LoadSt"...], [/* 34 vars */]) = -1 ENOENT (No such file or directory)
17169 execve("/sbin/systemctl", ["systemctl", "show", "Splunkd", "--property=Type,ExecStart,LoadSt"...], [/* 34 vars */]) = -1 ENOENT (No such file or directory)
17169 execve("/bin/systemctl", ["systemctl", "show", "Splunkd", "--property=Type,ExecStart,LoadSt"...], [/* 34 vars */]) = 0
17169 brk(NULL)                         = 0x55f6c62a5000
29384 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4f84fb000
29384 write(1, "Stopping splunkd...\n", 20) = 20
29384 write(1, "Shutting down.  Please wait, as "..., 61) = 61
29384 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc4f84f2a10) = 29417
29384 wait4(29417,  <unfinished ...>
29417 set_robust_list(0x7fc4f84f2a20, 24) = 0
29417 execve("/opt/splunk/bin/systemctl", ["systemctl", "stop", "Splunkd"], [/* 30 vars */]) = -1 ENOENT (No such file or directory)
29417 execve("/usr/local/bin/systemctl", ["systemctl", "stop", "Splunkd"], [/* 30 vars */]) = -1 ENOENT (No such file or directory)
29417 execve("/bin/systemctl", ["systemctl", "stop", "Splunkd"], [/* 30 vars */]) = 0
29417 brk(NULL)                         = 0x55c9c4485000

 

We see it fork a new child and exec “systemctl stop Splunkd”.  Notice there is no call to sudo or anything like it here.  The hand-off to systemctl is necessary because when something is running as a systemd service, systemd must be the one to start it and stop it.  Splunk is attempting to do us all a favor by transparently passing legacy calls over to systemctl on our behalf.  But there is a gotcha here:  calling systemctl means that you may have to authenticate as a privileged user, depending on how systemd + polkit are configured.  In a lot of customer environments I see/work in, the “Splunk Team” and the “OS team” sit on opposite sides of an organizational wall.  As the Splunk Admin, getting the needed privileged access can be difficult.

In Splunk 7.2.1, you could easily use the splunk user as a service account and issue stop/start/restart commands to your heart’s content, and it mostly just worked.  In 7.2.2, those commands no longer just work from that account because Splunk MUST ask systemd to handle the stops and starts for it, so that systemd knows what is happening and can do process restarts and so forth.
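
Here is a rough shell approximation of what the trace above suggests the CLI is doing.  This is a sketch inferred only from the strace output, not Splunk’s actual code.  The function names are made up, and checking only LoadState is my assumption, since the real property list is truncated in the trace.

```shell
# Rough approximation of the CLI's dispatch logic, inferred from strace only.
# Function names are made up for illustration.

# Succeeds if `systemctl show` output (read from stdin) says the unit is loaded.
unit_is_loaded() {
    grep -qx 'LoadState=loaded'
}

# Print the command that would handle "splunk <verb>".
# $1 = verb (start|stop|restart), $2 = unit name, $3 = `systemctl show` output
dispatch() {
    if printf '%s\n' "$3" | unit_is_loaded; then
        echo "systemctl $1 $2"    # systemd-managed: hand off to systemctl
    else
        echo "legacy $1"          # not systemd-managed: old in-process behavior
    fi
}
```

On a real host you could feed it live data with something like dispatch stop Splunkd "$(systemctl show Splunkd --property=LoadState)".  The point is that the decision is made before any privilege check happens, which is why the authentication prompt comes from systemctl and not from Splunk.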

Sudo based workaround

One reasonable workaround here is adding sudo rules, and retraining the Splunk Team to use them.  Some sudo rules like these (courtesy of automine) make it possible for the splunk service account to issue the needful commands to systemd in order to stop/start/restart splunk:

splunk ALL=(root) NOPASSWD: /usr/bin/systemctl restart Splunkd.service
splunk ALL=(root) NOPASSWD: /usr/bin/systemctl stop Splunkd.service
splunk ALL=(root) NOPASSWD: /usr/bin/systemctl start Splunkd.service 
splunk ALL=(root) NOPASSWD: /usr/bin/systemctl status Splunkd.service

These don’t help without retraining though!  If your Splunk Admins continue to try to use the classic bin/splunk restart command that worked before, they will continue to be asked to authenticate as a wheel user each time.
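
One way to soften the retraining is a small shell function in the splunk user’s profile.  This is a minimal sketch; the name splunkctl and the SPLUNKCTL_DRY_RUN variable are hypothetical, and it assumes the sudo rules above are installed exactly as written.

```shell
# Hypothetical helper for the splunk user's shell profile.  It maps a verb
# onto the exact command lines permitted by the sudo rules above.
# Set SPLUNKCTL_DRY_RUN to any non-empty value to print instead of run.
splunkctl() {
    case "$1" in
        start|stop|restart|status) ;;
        *) echo "usage: splunkctl {start|stop|restart|status}" >&2; return 2 ;;
    esac
    if [ -n "${SPLUNKCTL_DRY_RUN:-}" ]; then
        echo "sudo /usr/bin/systemctl $1 Splunkd.service"
    else
        # Path and unit name must match the sudoers entries character for character.
        sudo /usr/bin/systemctl "$1" Splunkd.service
    fi
}
```

With this in place, splunkctl restart does what bin/splunk restart used to do, without a password prompt.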

Polkit changes

Another workaround provided on Splunk Answers by twinspop adds rules to polkit so that systemd allows the splunk user to make these calls without authenticating.  In this way, the classic bin/splunk restart would be transparently proxied to systemctl restart Splunkd, and systemctl would say “oh cool I don’t have to authenticate for this” and it would just happen.  Sadly, this workaround does not work on RHEL or CentOS (tested at 7.6) because the version of systemd there is too old to provide the context that the policy needs.  Neither does it work on Ubuntu 18.04 because the version of Polkit on 18.04 is (best I can tell) too old to support JavaScript polkit rules.

Splunk support has been suggesting a similar workaround, using the simpler Polkit policy seen below.  The main problem with this policy is that it is overbroad – the splunk user would be permitted to stop/start/restart any systemd-managed service on the system.

polkit.addRule(function(action, subject) {
    var debug = true;
    if (action.id == "org.freedesktop.systemd1.manage-units" &&
        subject.user == "splunk") {
            return polkit.Result.YES;
    }
});

Better Polkit Changes

I opened a case with Red Hat support, initially pushing for them to backport changes from a more recent systemd.  Red Hat support got back to me and said that backporting these into RHEL7 would be both painful and brittle.  These features should be in RHEL8 once it releases.  In the meantime, they offered up an alternative in the form of https://access.redhat.com/solutions/3100591.  (Sorry, it’s behind Red Hat’s paywall.)  To summarize the article, the idea is to use Polkit’s spawn operation.  The spawn function launches an external program and throws an exception back into the Polkit policy if that program fails.  From the Polkit docs:

The spawn() method spawns an external helper identified by the argument vector argv and waits for it to terminate. If an error occurs or the helper doesn’t exit normally with exit code 0, an exception is thrown. If the helper does not exit within 10 seconds it is killed. Otherwise, the program’s standard output is returned as a string. The spawn() method should be used sparingly as helpers may take a very long or indeterminate amount of time to complete and no other authorization check can be handled while the helper is running. Note that the spawned programs will run as the unprivileged polkitd system user.

Let’s make a polkit rule that uses it:

polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.systemd1.manage-units" &&
        subject.user == "splunk") {
        try {
            polkit.spawn(["/usr/local/bin/polkit_splunk", ""+subject.pid]);
            return polkit.Result.YES;
        } catch (error) {
            return polkit.Result.AUTH_ADMIN;
        }
    }
});

So our goal here is to shell out to /usr/local/bin/polkit_splunk if the splunk user is attempting to manage systemd units.  In terms of arguments, our program will receive the PID of the program that is attempting to take the action.  That is, we’ll get the pid of systemctl itself.  The part of this that stinks is we won’t get directly told “splunk user is trying to restart Splunkd”, more like “splunk user is running a command w/ pid 12345 that is attempting to manage systemd units”.  From there, we have to go parse the process command line.

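#!/bin/bash
# Called by polkit with the PID of the requesting process (systemctl).
# Exit 0 to allow the action; any other exit code makes polkit.spawn() throw.

# Split the process command line into words:
# COMM[0] = program, COMM[1] = verb, COMM[2] = unit name
COMM=($(ps --no-headers -o cmd -p "$1"))

# Require exactly three words so a second unit cannot ride along
# (e.g. "systemctl restart Splunkd sshd").
[[ ${#COMM[@]} -eq 3 ]] || exit 1

if [[ "${COMM[1]}" == "start" ]] ||
   [[ "${COMM[1]}" == "stop"  ]] ||
   [[ "${COMM[1]}" == "restart" ]]; then

        if [[ "${COMM[2]}" == "Splunkd" ]] ||
           [[ "${COMM[2]}" == "Splunkd.service" ]]; then
                exit 0
        fi
fi

exit 1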
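
Because the helper only runs inside polkit, it is awkward to test.  Here is a minimal sketch of the same decision as a function that takes the command line as a string, so you can try inputs by hand.  The name is_allowed is made up, and the check for exactly three words is my own tightening so that an extra unit name on the command line is rejected.

```shell
# Decide whether a systemctl command line should be allowed for the splunk user.
# $1 = full command line as one string, e.g. "systemctl restart Splunkd"
# is_allowed is a made-up name for illustration.
is_allowed() {
    set -f          # no filename globbing while splitting
    set -- $1       # split on whitespace: $1 = program, $2 = verb, $3 = unit
    set +f
    [ $# -eq 3 ] || return 1    # extra words (such as a second unit) are rejected
    case "$2" in
        start|stop|restart) ;;
        *) return 1 ;;
    esac
    case "$3" in
        Splunkd|Splunkd.service) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_allowed "systemctl restart Splunkd" succeeds, while is_allowed "systemctl restart sshd" fails.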

Other systemd-related weirdness

Splunkd hard kill on stop

Gareth, xpac, and some others have been tracking how Splunk will not be politely shut down when you issue a systemctl stop Splunkd.service.  Hard-kill signals to splunkd are definitely not ideal.  I am not well informed on the issues or solution, but Gareth’s post on Splunk Answers has some solid background and solutions.

Service name weirdness

The name of the systemd service that is created by splunk enable boot-start depends on the value of SPLUNK_SERVER_NAME in splunk-launch.conf.  The docs are a little misleading because they say the “default” for Unix systems is splunkd.  But, Splunk 7.2.5.1 (and possibly others) ship with SPLUNK_SERVER_NAME=Splunkd set.  If you remove that value, then the (apparently in code) default of splunkd is used.
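
If you want to know which unit name a given install will use, you can read it out of splunk-launch.conf.  This is a small sketch: the function name is made up, and it assumes the unit is simply the SPLUNK_SERVER_NAME value with .service appended, falling back to splunkd when the setting is absent.

```shell
# Print the systemd unit name implied by a splunk-launch.conf file.
# $1 = path to splunk-launch.conf
# unit_name_from_conf is a made-up helper name; the fallback to "splunkd"
# follows the behavior described above.
unit_name_from_conf() {
    local name
    # Strip trailing whitespace, then print the value of any uncommented
    # SPLUNK_SERVER_NAME line; the last one in the file wins.
    name=$(sed -n -e 's/[[:space:]]*$//' \
                  -e 's/^[[:space:]]*SPLUNK_SERVER_NAME[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1)
    echo "${name:-splunkd}.service"
}
```

On a default install in /opt/splunk, you would call it as unit_name_from_conf /opt/splunk/etc/splunk-launch.conf.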

This also has an impact on when you try to run more than one Splunk instance on the same host (like you crazy kids are apt to do).  Even if “splunk2” is not managed by systemd at all (say you’re manually starting and stopping it), you need to change the value of SPLUNK_SERVER_NAME from its default or your attempts to start/stop “splunk2” will actually result in systemctl calls targeted at the other, systemd-managed Splunk service.  Fun, eh!

This is probably a good time to remind you that, technically, multiple instances on a single OS instance is not a supported configuration.  Some (many?) customers do it for what could be very good reasons.  But strictly speaking, Splunk support frowns on it.

Wrap-up

The Polkit changes above certainly work on RHEL/CentOS 7.6.  Check out my GitHub repo with the pieces and parts in it.  I included a CentOS-based Vagrant build to allow you to try it out for yourself.  It will get better in RHEL 8, but I don’t know what happens in other Linux distributions.  If you’re using something else, comment and provide feedback so we can improve this.