This is a (hopefully) short piece explaining the process of debugging ObjectId collisions in MongoDB, in an environment involving Ruby and containers (GCP Cloud Run Jobs -- basically Kubernetes jobs).

MongoDB is a document database: JSON documents can be stored in it, and each document has its own unique ObjectId. (This is a very simplified summary; see the MongoDB docs for more.)

An ObjectId is a 12-byte number that uniquely identifies a document. It is generated on the client, before inserting a new document into the db. While the format is standard, each client library has to implement it.

The ObjectId format is as follows:

  • 4 bytes: timestamp, in seconds since the Unix epoch
  • 5 bytes: random value, generated once per process. This value is unique to the machine and process.
  • 3 bytes: counter, initialized to a random value and incremented every time an ObjectId is generated.

I like these IDs: they're sortable-ish across machines, the 5 random "process id" bytes should be enough to avoid collisions, they're fast to generate, and you can mint 16M of them per second before the counter repeats.
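
To make that concrete, here's a minimal C sketch of the layout (my own illustration, not any particular driver's code), assuming getentropy is available for the one-time random bytes:

#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>  /* getentropy (glibc >= 2.25) */

static uint8_t process_random[5];  /* generated once per process */
static uint32_t counter;           /* only the low 3 bytes are used */

void objectid_init(void) {
  uint8_t seed[3];
  getentropy(process_random, sizeof(process_random));
  getentropy(seed, sizeof(seed));  /* random starting point, per the spec */
  counter = ((uint32_t) seed[0] << 16) | (seed[1] << 8) | seed[2];
}

void objectid_next(uint8_t out[12]) {
  uint32_t t = (uint32_t) time(NULL);
  uint32_t c = counter++ & 0xFFFFFF;  /* wraps after 2^24 ids */
  out[0] = t >> 24; out[1] = t >> 16; out[2] = t >> 8; out[3] = t;  /* big-endian time */
  memcpy(&out[4], process_random, 5);
  out[9] = c >> 16; out[10] = c >> 8; out[11] = c;  /* big-endian counter */
}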

ObjectIds hit the fan

In the environment where this happened, there were multiple Cloud Run Jobs running a Ruby program, each of them inserting a few hundred thousand unique documents into mongo as fast as they could.

From time to time, we'd get an error such as:

E11000 duplicate key error [...] index: _id_ dup key: { _id: ObjectId('64c14f7e421aa90001214e55')}

and one process would halt and exit.

I went fishing in the logs and in the db to figure out what was going on, and sure enough, those duplicates were always coming from 2 different jobs / processes.

What jumped out at me is that they always looked similar; more precisely, they always had 421aa90001 in the middle, in the "5 random bytes / machine / process identification" part. This was rather surprising.
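
Splitting one of the colliding ids along the spec's 4/5/3 layout makes the pattern obvious (a throwaway snippet, nothing more):

#include <stdio.h>

int main(void) {
  const char *id = "64c14f7e421aa90001214e55";
  printf("timestamp: %.8s\n", id);       /* 4 bytes */
  printf("random:    %.10s\n", id + 8);  /* 5 bytes -- the repeated 421aa90001 */
  printf("counter:   %.6s\n", id + 18);  /* 3 bytes */
  return 0;
}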

I went looking for the source in the bson-ruby package and saw this (paraphrased):

static char rb_bson_machine_id[256];
static char rb_bson_machine_id_hash[HOST_NAME_HASH_MAX];

void init() {
...
  // Get the object id machine id and hash it.
  rb_require("digest/md5");
  gethostname(rb_bson_machine_id, sizeof(rb_bson_machine_id));
  rb_bson_machine_id[255] = '\0';
  rb_bson_generate_machine_id(rb_md5_class, rb_bson_machine_id);
...
}

void rb_bson_generate_machine_id(VALUE rb_md5_class, char *rb_bson_machine_id)
{
  VALUE digest = rb_funcall(rb_md5_class, rb_intern("digest"), 1, rb_str_new2(rb_bson_machine_id));
  memcpy(rb_bson_machine_id_hash, RSTRING_PTR(digest), RSTRING_LEN(digest));
}


VALUE rb_bson_object_id_generator_next(int argc, VALUE* args, VALUE self)
{
  char bytes[12];
  uint32_t t;
  uint32_t c;
  uint16_t pid = BSON_UINT16_TO_BE(getpid());

  if (argc == 0 || (argc == 1 && *args == Qnil)) {
    t = BSON_UINT32_TO_BE((int) time(NULL));
  }
  else {
    t = BSON_UINT32_TO_BE(NUM2ULONG(rb_funcall(*args, rb_intern("to_i"), 0)));
  }

  c = BSON_UINT32_TO_BE(rb_bson_object_id_counter << 8);

  memcpy(&bytes, &t, 4);                          /* 4-byte timestamp */
  memcpy(&bytes[4], rb_bson_machine_id_hash, 3);  /* 3 bytes of md5(hostname) */
  memcpy(&bytes[7], &pid, 2);                     /* 2-byte pid */
  memcpy(&bytes[9], &c, 3);                       /* 3-byte counter */
  rb_bson_object_id_counter++;
  return rb_str_new(bytes, 12);
}

Basically, it constructs the next ObjectId by:

  1. Taking the current time (4 bytes)
  2. Taking the first 3 bytes of the MD5 of the hostname
  3. Taking the PID (2 bytes)
  4. Incrementing a 3-byte counter
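
Laid over the colliding id from earlier, that 4/3/2/3 split reads (again a throwaway snippet of mine):

#include <stdio.h>

int main(void) {
  const char *id = "64c14f7e421aa90001214e55";
  printf("time:    %.8s\n", id);       /* 4 bytes */
  printf("machine: %.6s\n", id + 8);   /* 3 bytes, md5(hostname) prefix */
  printf("pid:     %.4s\n", id + 14);  /* 2 bytes */
  printf("counter: %.6s\n", id + 18);  /* 3 bytes */
  return 0;
}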

I had alarm bells ringing in my mind by that point, but just to confirm my fears, I ran this:

$ echo -n localhost | md5sum | cut -c 1-6
421aa9

Hmmm... looks familiar!

In our env, the hostname inside the container is localhost (421aa9), and the PID is always 1 (0001).

So basically, I have no idea how this ever worked: we'd been using the same 5 "random" bytes for every process, leaving only the timestamp and the 3-byte counter to tell ids from different processes apart.
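
As a very rough back-of-envelope (my numbers, under the simplifying assumption that each counter starts at a random per-process value, as the spec above describes, and that two processes insert n documents each within the same wall-clock second): once the time/machine/pid bytes are identical, a duplicate appears as soon as the two 3-byte counter ranges overlap, which happens with probability about 2n / 2^24.

#include <stdio.h>

int main(void) {
  double space = 1 << 24;  /* 3-byte counter space */
  for (unsigned n = 1000; n <= 1000000; n *= 10)
    printf("n = %7u ids/s -> p ~= %.4f per second\n", n, (2.0 * n - 1) / space);
  return 0;
}

With several jobs inserting a few hundred thousand documents as fast as they can, the occasional E11000 stops being surprising.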

The fix

The solution to this mess was to deploy a version of bson-ruby that initializes the 5 "machine id" bytes with actual random bytes via getentropy.
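
Conceptually the init boils down to something like this (my sketch of the idea, not the actual patch):

#include <stdlib.h>
#include <unistd.h>  /* getentropy (glibc >= 2.25) */

static unsigned char rb_bson_machine_id_hash[5];  /* now 5 truly random bytes */

void init(void) {
  /* No hostname, no md5, no pid: just entropy from the kernel. */
  if (getentropy(rb_bson_machine_id_hash, sizeof(rb_bson_machine_id_hash)) != 0)
    abort();  /* better to die than to hand out colliding ids */
}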

Upstream bson-ruby merged a PR 3 weeks ago that replaces the suboptimal implementation with a proper one, but it hasn't been released yet, and they bumped the major version of the package.

I didn't want to go through all the changes to see what warranted a major bump, so I opted for a local fork with their changes backported, tweaked to use getentropy instead of arc4random_buf, since the latter isn't available in the glibc version we're using at the moment (it only landed in glibc 2.36).

I plan on keeping that fork while waiting for them to release the new version (and for the rest of the ecosystem to catch up to it).

The end.