· 7 min read ·

Beyond Stop-the-World: What Building a Second Garbage Collector Teaches You

Source: lobsters

The canonical starting point for learning garbage collector implementation is Bob Nystrom’s Baby’s First Garbage Collector, roughly 150 lines of C that implement a stop-the-world mark-and-sweep collector with an intrusive linked list threading through every heap object. It is excellent pedagogy: short enough to hold in your head, complete enough to run, and just complex enough to force real thinking. Matheus Moreira’s Baby’s Second Garbage Collector picks up from that foundation, and the gap between the first and second step is where most of the interesting engineering lives.

What the First GC Teaches (and What It Skips)

The mark-and-sweep algorithm is conceptually straightforward. Every object on the heap is reachable or it isn’t. You walk from root objects (stack variables, globals) following pointers, marking everything you touch. Then you sweep through the entire heap, freeing anything unmarked. A typical implementation in C looks like this:

typedef struct Object {
    ObjectType type;
    unsigned char marked;
    struct Object* next;  /* intrusive linked list of all objects */
    /* ... payload */
} Object;

void mark(Object* obj) {
    if (obj->marked) return;
    obj->marked = 1;
    /* recurse into children */
}

void sweep(VM* vm) {
    Object** obj = &vm->firstObject;
    while (*obj) {
        if (!(*obj)->marked) {
            Object* unreached = *obj;
            *obj = unreached->next;
            free(unreached);
        } else {
            (*obj)->marked = 0;
            obj = &(*obj)->next;
        }
    }
}

The algorithm collects cycles, which reference counting cannot handle without auxiliary machinery. It requires no special cooperation from the allocator beyond maintaining the linked list. The core limitation is the stop-the-world pause: to collect, you halt the entire program, walk the heap, and resume. For a toy language interpreter this is acceptable. For anything with latency requirements it becomes a design constraint that forces different choices.

The first GC teaches you what garbage collection is. The second teaches you what makes it hard.

Tri-Color Marking and Why It Exists

The classical solution to stop-the-world pauses is tri-color marking, formalized by Dijkstra, Lamport, and collaborators in their 1978 paper “On-the-fly Garbage Collection: An Exercise in Cooperation.” The core idea is to make marking incremental: instead of one monolithic traversal, you do a small amount of work each time the program allocates or at regular intervals, interleaving GC work with the running program.

Objects are classified into three colors:

  • White: not yet seen, or not reachable. Objects start white; those remaining white after a complete collection cycle are garbage.
  • Gray: discovered but not fully processed. Their children haven’t been traced yet.
  • Black: fully processed. All direct children are at least gray.

The invariant that must hold throughout collection: no black object points directly to a white object. This is the tri-color invariant. Maintaining it means you can pause and resume marking at any point, because the color classification accurately reflects what is known about reachability at that moment.

The incremental marking loop in practice:

void mark_gray(GC* gc, Object* obj) {
    if (obj->color != WHITE) return;
    obj->color = GRAY;
    worklist_push(&gc->worklist, obj);
}

void process_one_gray(GC* gc) {
    Object* obj = worklist_pop(&gc->worklist);
    for (int i = 0; i < obj->num_children; i++) {
        mark_gray(gc, obj->children[i]);
    }
    obj->color = BLACK;
}

Push all roots to gray initially, call process_one_gray incrementally interleaved with program execution, and when the worklist empties, sweep the white objects. The incremental structure distributes the pause across many small interruptions rather than one long one. Pause time becomes proportional to how much work you do per increment, not to heap size.

The Write Barrier Problem

Incremental collection introduces a correctness problem that stop-the-world collection sidesteps entirely. If the program (the mutator in GC literature) modifies the heap while marking is in progress, it can break the tri-color invariant. A black object can acquire a reference to a white object: the black object won’t be visited again since it has already been processed, so the white object will be incorrectly freed even though it is now reachable. This is a use-after-free bug that the GC introduces on itself.

The solution is a write barrier, code that runs whenever a pointer is stored, restoring the invariant after each mutation. There are two main families.

The Dijkstra insertion barrier shades the new referent gray when a pointer is stored:

void write_barrier(GC* gc, Object* container, Object* newval) {
    if (gc->phase == GC_MARKING && newval != NULL && newval->color == WHITE) {
        newval->color = GRAY;
        worklist_push(&gc->worklist, newval);
    }
    container->field = newval;
}

Any newly reachable object gets queued before the black object that references it is considered complete. The alternative, the Yuasa deletion barrier, shades the old referent gray when a reference is overwritten, preserving snapshot-at-the-beginning semantics. The difference matters: the insertion barrier guarantees that nothing newly reachable is missed; the deletion barrier guarantees that nothing reachable at the start of the cycle is missed, even if the mutator deletes references during marking.

Go’s runtime uses a hybrid write barrier combining properties of both to eliminate the need to re-scan goroutine stacks during stop-the-world phases. Before the hybrid barrier was introduced in Go 1.9, the GC had to stop all goroutines a second time to re-scan stacks that might have been modified during concurrent marking. The hybrid design removed that second STW phase and brought pause times down to sub-millisecond territory on typical workloads.

Write barriers carry real overhead. Every pointer store in the program passes through extra code. This is not an abstraction that compiles away; it is a conditional check and potential worklist push on every heap write. For write-heavy workloads this adds up, which is one reason that GC tuning in production runtimes is a sustained engineering discipline rather than a configuration knob.

Copying Collection: A Different Path

Tri-color incremental marking is one direction from the simple mark-and-sweep baseline. Copying collection, specifically the algorithm C.J. Cheney described in 1970, is another. Instead of marking objects in place and sweeping, a copying collector maintains two equal semi-spaces of memory. Live objects are copied from from-space into to-space during collection, and from-space is then discarded wholesale.

Allocation becomes a pointer bump: to allocate N bytes, advance a free pointer by N. No free lists, no fragmentation, no coalescing. Collection cost is proportional to live data rather than total heap size, which is a significant advantage when most objects are short-lived.

Object* evacuate(Cheney* gc, Object* obj) {
    if (obj == NULL) return NULL;
    if (is_forwarding_pointer(obj->header)) {
        return get_forward(obj->header);  /* already evacuated */
    }
    size_t sz = object_size(obj);
    Object* new_loc = (Object*)gc->free;
    memcpy(new_loc, obj, sz);
    gc->free += sz;
    set_forward(&obj->header, new_loc);
    return new_loc;
}

The forwarding pointer is the key mechanism: when you copy an object, you leave a breadcrumb in the old location so that other references to the same object find the new address. Without it, two pointers to the same object would produce two separate copies, breaking object identity. Cheney’s specific contribution was recognizing that to-space itself can serve as an implicit breadth-first search queue. You maintain a scan pointer and a free pointer into to-space and process objects between them until the pointers converge, turning what was a recursive algorithm requiring a separate stack into an iterative one using only two integers.

The cost is memory: half the heap is always reserved for the next collection. If most objects are long-lived, you spend time copying data that didn’t need to move, and you pay double the footprint for no benefit. This tradeoff is what motivates generational collection. The generational hypothesis, an empirical observation about most programs, states that most objects die young. If you apply copying collection only to a small nursery generation and promote survivors to an older region with a different collection strategy, you get cheap allocation and cheap collection for the common case without paying the full semi-space overhead on the entire heap.

The Distance to Production

Both tri-color marking and Cheney copying are pedagogically valuable precisely because they surface complexity that stop-the-world mark-and-sweep hides. Stop-the-world collection sidesteps write barriers behind a single pause. Copying collection forces you to confront pointer fixup, forwarding, and the memory overhead of semi-spaces. Generational collectors combine these concerns and add remembered sets to track pointers from old generations into the nursery without scanning the entire old generation on each minor collection.

Production collectors go considerably further. V8 uses Orinoco, combining a Scavenger for the young generation with concurrent mark-and-compact for the old generation, using parallel marking threads to overlap GC work with JavaScript execution. Incremental marking in V8 uses a combination of write barriers and a marking deque, with the entire system tuned to bound individual marking steps at around 1ms to avoid jank. The JVM’s ZGC targets sub-millisecond pauses on heaps measured in terabytes, using colored pointer bits and load barriers rather than write barriers to handle concurrent relocation, so that the GC can move objects while the program reads them. These outcomes are the product of decades of incremental engineering on top of the same foundational algorithms.

The value of building a second garbage collector is not that the code will ship anywhere. Each algorithm exposes a problem that the previous one concealed. Write barriers only become necessary when marking is incremental. Forwarding pointers only appear when objects move. Remembered sets only matter when generations exist. Concurrent relocation only becomes a problem when you try to compact without pausing. Building sequentially through these layers means each new problem arrives in a context small enough to debug in isolation, which is not a property that reading production GC source code provides. The production code solves all these problems simultaneously, in a form optimized for performance rather than comprehension.

Was this interesting?