· 6 min read ·

What Building a Second Garbage Collector Actually Teaches You

Source: lobsters

The first garbage collector everyone builds is a mark-and-sweep. It is elegant in its simplicity: walk the object graph from the roots, mark everything reachable, then sweep through the heap and free what is unmarked. Robert Nystrom’s “Baby’s First Garbage Collector” is probably the most-read introduction to the subject, and it does a good job of making GC feel approachable. But there is a reason it is called “first” — the algorithm, as written, stops the world.

Matheus Moreira’s “Baby’s Second Garbage Collector” picks up where that leaves off. The goal is to make the collector incremental: instead of halting all execution until marking is complete, you do a bounded amount of GC work per allocation, interleaved with the mutator. This is where things get genuinely interesting, and genuinely hard.

The Problem with Stopping the World

A stop-the-world collector’s pause time grows with the size of the live set. In a small interpreter with a few dozen heap objects, this is irrelevant. In a real application, it is not. A 100ms GC pause is noticeable in a game loop. A 500ms pause is a latency spike in a web server. As heap sizes have grown from kilobytes to gigabytes, the demand for sub-millisecond pauses has pushed every production runtime away from simple STW designs.

The straightforward fix is to run GC incrementally: pause for a small quantum of work (say, process 100 objects), yield to the mutator, repeat. This spreads GC overhead across many tiny pauses instead of one large one. But this immediately breaks the simple two-color model of mark-and-sweep.

The problem: while the GC is mid-mark and the mutator is running, the mutator can create new pointers. It can take an already-scanned (“black”) object and write a reference to an unvisited (“white”) object into it. If the GC has already finished scanning the black object and never revisits it, it will finish marking and conclude the white object is unreachable. It frees it. The black object now holds a dangling pointer. This is the lost object problem, and it is the central challenge of incremental and concurrent GC.

Tri-Color Marking

The solution, formalized by Dijkstra, Lamport, and others in their 1978 paper “On-the-fly Garbage Collection: An Exercise in Cooperation”, is tri-color marking. Instead of a single mark bit, objects are assigned to one of three sets:

  • White: not yet visited; presumed garbage until proven otherwise.
  • Grey: discovered as reachable, but children not yet scanned.
  • Black: fully scanned; all children have been processed.

The GC maintains a worklist of grey objects. Each increment pops a grey object, scans its children (pushing white ones to grey), and colors the object black. When the worklist empties, all white objects are unreachable and can be swept.

The key invariant: no black object may point directly to a white object. As long as this holds, the algorithm is correct regardless of how many increments it takes to complete a cycle.

Write Barriers: The Crux

Tri-color marking does not solve the lost object problem on its own. The mutator can still write a pointer from a black object to a white one, violating the invariant. The fix is a write barrier: a small piece of code that executes on every pointer store and restores the invariant.

There are two fundamental approaches, and understanding the difference between them clarifies a lot of production GC design.

The insertion barrier (Dijkstra-style) re-greys the target when a pointer is written into a black object:

void write_barrier(Object *obj, Object *value) {
    if (obj->color == BLACK && value->color == WHITE) {
        value->color = GREY;
        worklist_push(value);
    }
    obj->field = value;
}

This prevents the invariant violation at the point of creation: you cannot write a black-to-white pointer because the write barrier immediately makes the target grey.

The deletion barrier (Yuasa-style) greys the old referent when it is overwritten:

void write_barrier(Object *obj, Object *value) {
    Object *old = obj->field;
    if (old != NULL && old->color == WHITE) {
        old->color = GREY;
        worklist_push(old);
    }
    obj->field = value;
}

This implements “snapshot-at-the-beginning” (SATB) semantics: everything that was live when the GC cycle started will be treated as live by this cycle, even if the mutator later drops all references to it. The SATB approach can retain some floating garbage (objects that died after the snapshot), but it simplifies concurrent correctness proofs.

Lua’s incremental GC uses an insertion barrier. Go’s GC uses a hybrid barrier that combines both styles to eliminate the rescan phase that was required in earlier versions.

Each write barrier approach has different costs. Insertion barriers are cheaper per-write but may require a full rescan of the root set at the end of the mark phase. Deletion barriers (SATB) add overhead on writes that remove references, but allow the GC to terminate without rescanning.

Pacing the Collector

With incremental marking in place, a new problem emerges: if the mutator allocates faster than the collector can mark, the heap grows without bound. The collector has to be paced against the allocation rate.

A simple approach is to trigger a GC increment on every N bytes allocated, sizing N based on the current heap:

void *gc_alloc(VM *vm, size_t size) {
    vm->bytes_allocated += size;
    if (vm->bytes_allocated > vm->next_gc) {
        gc_increment(vm);  // do a bounded chunk of work
    }
    return allocate(size);
}

The threshold next_gc is typically set to some multiple of the live set size after the previous collection, using a growth factor similar to what Go uses (the GOGC environment variable, defaulting to 100%, means “collect when heap doubles”). The pacing formula matters for tail latency: too aggressive and you waste CPU; too conservative and you accumulate garbage.

What This Is Actually Teaching You

Building this collector from scratch reveals something that reading about production GCs often obscures: the algorithmic constraints are tight, but the engineering is where complexity accumulates.

The tri-color invariant is a clean mathematical property. Maintaining it in practice requires write barriers on every pointer store in the entire runtime. In a real VM, pointer stores happen in object field writes, array stores, closure captures, upvalue mutations, and anywhere the runtime itself moves data between GC-managed slots. Missing a single store site is a latency-delayed memory corruption bug that appears intermittently under GC pressure. Lua’s GC is around 1,000 lines of C and every line carries this responsibility.

There is also the sweep phase. Incremental marking does not make sweep incremental by default. Sweeping still walks the entire object list. For large heaps, this can be a significant pause even if marking was fully incremental. Production collectors address this with lazy sweeping (sweep during allocation, one page at a time) or by using a bitmap allocator that can identify free regions without walking every object.

Generational collectors add another layer: a remembered set or card table to track old-generation pointers into the young generation, so minor collections do not need to scan the entire old heap. V8’s Scavenge and the JVM’s young-generation collector both depend on this. The write barrier for a generational collector records cross-generational writes, and these barriers compose with the tri-color barriers in non-obvious ways when both are needed simultaneously.

The Production Lineage

The collector Moreira builds sits squarely in the lineage of Lua’s incremental GC, which was redesigned from stop-the-world to incremental in Lua 5.1 (2006). Lua’s approach is worth reading directly: it is small enough to understand in full, uses a clear insertion barrier, and the incremental design has been stable and correct for two decades.

Go’s GC took longer to get there. Go 1.5 (2015) introduced concurrent marking, and the GC has been progressively improved through several design iterations to reach its current hybrid barrier design. CPython uses reference counting for most objects, with a supplemental cyclic garbage collector to handle cycles, organized into three generations.

The conceptual path from “Baby’s First” to “Baby’s Second” is precisely the path these production runtimes walked, just compressed into a tutorial. The first collector gives you the structure. The second gives you the invariant that every incremental and concurrent collector must preserve, and the write barrier mechanism that enforces it.

If you work in a language with a managed runtime and have ever wondered why GC pauses are bounded but not zero, why GOGC exists, or why Lua’s GC behaves differently under allocation pressure, building both of these collectors is the most direct path to a real answer. The code is not large. The concepts transfer directly to every production system you will ever debug.

Was this interesting?