Needles and haystacks: Finding the one bad request among billions with tcpdump
This week, we had a few weird crashes with an HTTP server which we could not easily reproduce and we had a hard time pin-pointing the source of the issue. We knew the problems were triggered by bad input, but since the process was continuously receiving around 3000 requests per second at the time, it was pretty hard to isolate the exact request(s) that made it crash.
The idea we had was to capture HTTP requests data up to the point where the process crashed. Then, we would open the trace and look for the last successful requests, the faulty one would be in there somewhere.
The naive approach would have been to run tcpdump until the process crashed, then to stop tcpdump and to look into the packet capture file (pcap). This solution seemed impractical because the crashes were very infrequent and random. The process could run for hours/days without crashing... we would have ended up with a massive pcap file that would have been too big to transfer or work with.
After a good look at the tcpdump man page, we figured that we could leverage a few of its parameters:
tcpdump -i interface -w file -G rotate_seconds -W filecount -C file_size -z command pcap-filter
This command line does the following:
- Capture packets on
interface
matchingpcap-filter
- Write the matching packets into files named file with a file number appended (from
0
tofilecount-1
) - Switching to a new file if
rotate_seconds
have elapsed or iffile_size
has been reached for the current file.
When reaching filecount-1 number of files, tcpdump will go back to file 0, overwriting its content. Thereby creating a circular buffer of pcap files. - After switching to a new file,
tcpdump
will executecommand
and pass the old pcap filename as its first argument.
With this, we have a circular buffer of pcap files of reasonable size and we can run some code when switching to a new file.
Usually, the 'code' part is used to compress the pcap file (ie. -z gzip
), but we used it to check for the error condition we are interested in and have it kill tcpdump to stop the capture, for example:
#!/bin/bash
ROUTER_PID=8761
CURRENT_PID=$(pgrep router_runner)
if [ $ROUTER_PID != "$CURRENT_PID" ]; then
kill $PPID
fi
In this case, we were chasing a crash, so the quick and dirty way to know if the process had crashed was to check whether its PID
was still the same.
If it wasn't or if the process didn't exits, the script would kill its parent (tcpdump) to stop the packet capture.
The command run by tcpdump can check for whatever condition we're interested in. For example, we had another interresting case were the symptoms were a lot of sockets in CLOSE_WAIT
state. The script could check for this and signal the parent...
For the crash we were chasing, the full command we used was:
tcpdump -i any -G 60 -C 10 -W 15 -z $PWD/router_pid.sh -vv -w 9973.pcap dst port 9973
It created 15 files named 9973.pcap0
through 9973.pcap14
, each file containing at most 10mb or 60 seconds worth of requests and the router_pid.sh
script was executed each time tcpdump switched to a new output file.
We let the capture run for a few hours and eventually the process crashed. Sure enough the culprit was in the pcap files. For the record, the crashed happened in a json parser which choked on input similar to this:
POST / HTTP/1.1
Host: 127.0.0.1
Content-Length: 30
Accept: */*
Content-Type: application/json
{"url":"http\u00253A\u00252F\u
AppArmor
By default, on Ubuntu 12.04 at least, AppArmor is configured to deny syscalls that tcpdump requires to call out to the external command passed via the -z
parameter.
The quick fix for this was to set apparmor to 'complain' mode for tcpdump:
aa-complain /usr/sbin/tcpdump
The aa-complain
command is part of the apparmor-utils
package, in case you need to install it.
Caveat
The -G rotate_seconds
argument works only if tcpdump has captured at least one packet. If the filter doesn't match anything, files won't be rotated.
In this case it doesn't matter much, but if tcpdump is started only with -G rotate_seconds
, one would assume that it would exit after rotate_seconds, which will not be the case if tcpdump has not captured any packet, it will just hang there until it gets at least one packet.
Of course we did NOT discover this by running tcpdump -G 60 ...
in a cronjob and ending up with 200 tcpdump running. that never happened ;-)