I spent some time yesterday trying to understand why one of the machines here seemed to always take 90s before shutting down or rebooting. This article is a summary of the wild goose chase I embarked on to try and understand what was happening.

When systemd machines hang during shutdown, it usually means that there's a service that is not stopping fast enough or it could be that systemd is waiting for the DefaultTimeoutStopSec timeout before killing the remaining processes after all services have been stopped.

In my case, what I was seeing is the following:

[ OK ] Finished Reboot.
[ OK ] Reached target Reboot.
[ 243.652848 ] systemd-shutdown[1]: Waiting for process: containerd-shim, containerd-shim, containerd-shim, fluent-bit

These processes are part or k3s, they are the parent processes of containers or processes running in the containers themselves. Clearly, it has something to do with k3s.

Update: the systemd bugfix was backported to Ubuntu 20.04.

This service is a little bit special, when restarting it, the containers that it started should continue to run unperturbed (Otherwise upgrades would be fun...). To accomplish that, it uses KillMode=process in its systemd service definition. This tells systemd to only kill the main process that was spawned by ExecStart and not to bother with all the others.

This is a case where, during shutdown, systemd will wait up to DefaultTimeoutStopSec before killing all the remaining processes. This parameter is set to 5s on my machines to avoid exactly that, but somehow on this particular machine it didn't seem to be sufficient.

Systemd v245 Bug

I'll spare all the details, but I ended up having to hack up systemd-shutdown to add some debugging printfs in order to convince myself that the timeout was indeed the default 90s and not the 5s that it should be. Roughly:

apt-get source systemd
# add debug
meson build/
cp /lib/systemd/usr/lib/systemd/systemd-shutdown /lib/systemd/usr/lib/systemd/systemd-shutdown.ori
cp build/systemd-shutdown /lib/systemd/usr/lib/systemd/systemd-shutdown

# I guess I could have learned how to configure this properly, but "I was in a hurry"
chrpath -r /lib/systemd/usr/lib/systemd/systemd-shutdown

reboot  # and pray

Once I was convinced systemd was really using the default timeout value instead of the configured one, I set out to (re-)read its code to figure out where it got confused and maybe fix it. In the end, it was not too bad to spot:

finish:
        /* snip snip snip */

        reset_arguments();

        /* snip snip snip */

        if (shutdown_verb) {
                r = become_shutdown(shutdown_verb, retval);
                log_error_errno(r, "Failed to execute shutdown binary, %s: %m", getpid_cached() == 1 ? "freezing" : "quitting");
                error_message = "Failed to execute shutdown binary";
        }

Basically, all the parameters were reset to their default value by reset_arguments right before calling become_shutdown. Just to make sure, I looked at the main branch and sure enough, the reset_arguments call isn't there anymore. The bug was already fixed: https://github.com/systemd/systemd/commit/7d9eea2bd3d4f83668c7a78754d201b22

That is all well and good, less work for me. But my machine is still slow to shutdown and there's ~~zero chance that systemd will get an update for that particular issue on Ubuntu 20.04...~~ Actually it was eventually fixed!

Fast shutdown workaround

After much fighting with systemd, I ended up with the following workaround:

[Unit]
Description=Kill cgroup procs on shutdown for %i
DefaultDependencies=false
Before=shutdown.target umount.target
[Service]
# Instanced units are not part of system.slice for some reason
# without this, the service isn't started at shutdown
Slice=system.slice
ExecStart=/bin/bash -c 'pids=$(cat /sys/fs/cgroup/unified/system.slice/%i/cgroup.procs); echo $pids | xargs -r kill;'
ExecStart=/bin/sleep 5
ExecStart=/bin/bash -c 'pids=$(cat /sys/fs/cgroup/unified/system.slice/%i/cgroup.procs); echo $pids | xargs -r kill -9;'
Type=oneshot
[Install]
WantedBy=shutdown.target

/etc/systemd/system/cgroup-kill-on-shutdown@.service

This instanced service unit will signal every process in the specified service's cgroup to stop, wait 5 seconds, then kill the survivors just before the filesystems are unmounted during shutdown.

To enable it for the k3s-agent service:

sudo systemctl enable cgroup-kill-on-shutdown@k3s-agent.service.service

(replace k3s-agent.service with whichever service needs cleaning up. Mind the double .service.service)

Debugging systemd shutdown process

Figuring how to set that up was an exercise in patience and bricolage...

I rebooted the machine way too many time and resorted to sleeps, prints and redirections to /dev/tty1 to be able to get any debug info out of the service. Not recommended.

Pitfalls

One of the thing that bit pretty hard was that I was using this construct to kill the processes individually:

while read pid; do
  echo killing $pid
  kill -9 $pid
done </sys/fs/cgroup/unified/system.slice/k3s-agent.service/cgroup.procs

It kept leaving processes behind... What happens is that the file being read is changing with every call to kill and thus bash was skipping/seeking over parts of it with every iteration. Very annoying, also, it was pure luck that the pids were all the same length...

Using the weird pids=$(cat blah); echo $pids | ... construct avoids that issue by reading the whole file only once before killing everything.

Better tool for debugging the shutdown process

systemd has a test mode that is quite handy to figure out what will actually get called when reaching a target.

I used it to understand why my instanced service was not being called, but a regular service with the same content was called.

/usr/lib/systemd/systemd --test --system --unit=shutdown.target

First, all the unit files are listed and then all the jobs. The last part is of interest here since it tells which unit will actually be stopped.

From this I could see that systemd was not even trying to stop the cgroup-kill-on-shutdown@k3s-agent.service.service instanced service.

By comparing the units description with a regular service and the faulty service, I realized that they were not using the same Slice. Setting Slice=system.slice in the instanced service fixed it.

Conclusion

That was quite a detour... I just wanted to reboot my machine.

At least I learned a few things:

Always read the code you're actually running (and compare it to the main branch after)
Reading files exposed by the kernel can/will change while you read them
systemd has a lot of moving parts, but its tooling seems good, once you figure out that there's actually an option to dump what you want.

Other people seemed to be having that problem too but without much in terms of solutions: https://github.com/k3s-io/k3s/issues/2400