tl;dr: PID 1 is special on Linux: signals whose default action would terminate a regular process have no effect on it unless it has installed a handler for them. In other words, PID 1 must handle SIGTERM explicitly for the usual semantics to apply. I keep rediscovering this with containers...

Yesterday I spent way too much time trying to understand why a process (flexo) didn't die immediately when I tried to kill it. The process was running in a container in a pod in k3s, and it would take up to 30s (the default value of terminationGracePeriodSeconds) before it died.

SIGTERM behaviors

In my recollection, the behavior for SIGTERM on Linux could be described as follows:

  • If there is no handler for the signal, the process is killed by the kernel.
  • If there's a handler, the process is free to do whatever it wants with the signal (e.g. it could do nothing at all).
  • A process can also ignore the signal. (By setting the handler to SIG_IGN in a call to sigaction)
  • (the process can also block the signal, in which case the signal becomes 'pending' and is delivered when it is unblocked)
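
The first three cases are easy to reproduce from a shell. Here is a minimal sketch, assuming bash and a regular (non-PID-1) process; try_term is a made-up helper name and the sleeps are only crude synchronization:

```shell
#!/bin/bash
# Start a child with the given SIGTERM setup, signal it, and report its fate.
# try_term is a made-up helper name; the sleeps are crude synchronization.
try_term() {
    local label=$1 setup=$2 status=0
    # The child applies the setup, idles for a second, then exits normally.
    bash -c "$setup; sleep 1; exit 0" >/dev/null 2>&1 &
    local pid=$!
    sleep 0.5                 # let the child apply its setup
    kill -TERM "$pid"
    wait "$pid" || status=$?
    # A process killed by signal N is reported with exit status 128 + N.
    if (( status == 128 + $(kill -l TERM) )); then
        echo "$label: terminated by SIGTERM"
    else
        echo "$label: survived (exit status $status)"
    fi
}

try_term "no handler" ":"
try_term "handler" "trap 'echo caught SIGTERM' TERM"
try_term "ignored" "trap '' TERM"
```

On a regular Linux box I would expect the first child to be reported as terminated and the other two as survivors. Note that bash only runs the trap once the foreground sleep has finished, which is why the "handler" child still exits normally.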

There's a crucial piece missing in this list...

Observing signal handling for a process

All of these cases can be observed by looking at the /proc/<pid>/status file:

...
SigPnd: 0000000000000000    # pending, thread
ShdPnd: 0000000000000000    # pending, process (shared by all threads)
SigBlk: 0000000000000000    # blocked signals
SigIgn: 0000000000001000    # ignored signals
SigCgt: 0000000180000440    # caught/handled signals
...

The format of this file is described in proc(5).

These are hexadecimal signal masks which can be converted to a human readable format by the following snippet:

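#!/bin/bash
# https://stackoverflow.com/a/61365083
# Decode the signal masks in /proc/<pid>/status into signal names.

pid=${1:?Missing pid}
grep -E '^(Sig|Shd)(Pnd|Blk|Ign|Cgt):' "/proc/$pid/status" | while read -r name mask; do
    # bc needs upper-case hex digits; convert the mask to a binary string
    bin=$(echo "ibase=16; obase=2; ${mask^^}" | bc)
    echo -n "$name $mask $bin "
    i=1
    # Walk the bits from the least significant one: bit (i - 1) is signal i
    while [[ -n $bin ]]; do
        if [[ ${bin: -1} == 1 ]]; then
            kill -l "$i" | tr '\n' ' '
        fi
        bin=${bin::-1}
        ((i++))
    done
    echo
done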

Output:

SigPnd: 0000000000000000 0
ShdPnd: 0000000000000000 0
SigBlk: 0000000000000000 0
SigIgn: 0000000000001000 1000000000000 PIPE
SigCgt: 0000000180000440 110000000000000000000010001000000 BUS SEGV

In my case (shown above), only SIGPIPE was ignored and SIGTERM didn't have a handler installed.
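
When only one signal matters, the check boils down to testing a single bit: signal number n is bit (n - 1) of the mask. A small sketch, using the two masks from the output above (mask_has_signal is a name I made up for the example):

```shell
#!/bin/bash
# Succeed if signal number $2 is set in the hexadecimal mask $1.
# Bit (n - 1) of the mask corresponds to signal number n.
mask_has_signal() {
    local mask=$((16#$1)) signo=$2
    (( (mask >> (signo - 1)) & 1 ))
}

sigterm=$(kill -l TERM)       # signal number of SIGTERM, 15 on Linux

# SigIgn and SigCgt values taken from the status output above
for mask in 0000000000001000 0000000180000440; do
    if mask_has_signal "$mask" "$sigterm"; then
        echo "$mask: SIGTERM is set"
    else
        echo "$mask: SIGTERM is not set"
    fi
done
```

For a live process, the mask can be pulled straight out of the status file, for instance with awk '/^SigCgt:/ {print $2}' /proc/<pid>/status.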

I tried to observe the behavior with strace, perf record, gdb, etc., and I could see the signal being delivered, but the process didn't die. Oh, and if the process was started under strace (strace prog as opposed to strace -p <pid-of-prog>), it would die when it received the signal. Very amusing.

Pid 1 is special

After staring at the traces for a while, I remembered that I had debugged this in a former professional life, like 5-6 years ago:

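PID 1 is special on Linux: within its PID namespace, it only receives signals for which it has installed a handler, so signals that would terminate a regular process by default do nothing to it. (A SIGKILL sent from outside the namespace still works, which is why the pod did eventually die once the grace period expired.)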

So, in a container, the first process that is started really must install handlers for SIGTERM, otherwise it will stick around.
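
If the program itself can't be changed, one way to get a handler in place is a small wrapper that runs as PID 1 and forwards the signal. This is only a sketch under my own assumptions (run_as_init, child_pid and the entrypoint.sh file name are made up), and unlike a real init it does not reap orphaned grandchildren:

```shell
#!/bin/bash
# Hypothetical entrypoint wrapper: run the real command as a child and
# forward SIGTERM/SIGINT to it. run_as_init and child_pid are made-up names.
run_as_init() {
    local status=0
    "$@" &
    child_pid=$!              # global on purpose: the trap below reads it
    # Having a handler installed is what lets SIGTERM reach PID 1 at all.
    trap 'kill -TERM "$child_pid" 2>/dev/null' TERM INT
    while :; do
        status=0
        # wait returns early (status > 128) when a trapped signal arrives,
        # so keep waiting until the child is really gone.
        wait "$child_pid" || status=$?
        kill -0 "$child_pid" 2>/dev/null || break
    done
    trap - TERM INT
    return "$status"
}

# Act as a wrapper only when a command was given, e.g. ./entrypoint.sh flexo
if (( $# > 0 )); then
    run_as_init "$@"
    exit $?
fi
```

The child is a regular process, so the forwarded SIGTERM kills it through the default action, and the wrapper exits with the child's status.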

Solution

In my case I enabled shareProcessNamespace, which makes the pause process PID 1 in the pod. This process is very simple: it reaps child processes so they don't linger as zombies, and it handles SIGTERM properly.
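
For reference, the setting lives at the pod spec level. A minimal sketch that prints such a manifest (pod_manifest is a made-up helper, and the pod name, container name and image are placeholders, not my actual deployment):

```shell
#!/bin/bash
# Print a minimal Pod manifest with a shared process namespace.
# The pod name, container name and image are hypothetical placeholders.
pod_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: flexo
spec:
  shareProcessNamespace: true   # pause becomes PID 1 for every container
  containers:
    - name: flexo
      image: example.org/flexo:latest
EOF
}

pod_manifest
# To apply it: pod_manifest | kubectl apply -f -
```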

With that change in place, deleting the pod causes the main process of every container in the pod to receive a SIGTERM. pause has a handler and exits quickly; flexo doesn't have a handler, but it is no longer PID 1, so it gets killed, since that's the default action for SIGTERM.

Docker

For processes running under Docker, the easiest solution is the --init flag, which runs tini as the init (PID 1) process in the container.

Docker compose

For docker-compose, the same behavior can be enabled by activating the init parameter:

services:
  web:
    image: alpine:latest
    init: true