· 6 min read ·

From Mark-and-Sweep to Something Worth Keeping: Building Your Second Garbage Collector

Source: lobsters

There is a well-worn path for learning garbage collection. You read Bob Nystrom’s “Baby’s First Garbage Collector”, you implement a mark-and-sweep in C in an afternoon, and you feel like you understand the thing. Then you try to make it fast, or incremental, or generational, and you discover that you understood the concept but not the problem.

Matheus Moreira’s follow-up article picks up exactly where that first tutorial leaves off, and it’s worth reading because the jump from a first GC to a second one is where the real education happens. The first collector you write is correct. The second one has to be good.

Why Mark-and-Sweep Falls Apart Under Pressure

A basic mark-and-sweep collector has two phases. In the mark phase, you walk the object graph from your roots (globals, stack slots, registers) and set a flag on every live object. In the sweep phase, you walk every allocated object and free the ones that weren’t marked. It works. The correctness argument is straightforward. The implementation fits in a few hundred lines.

The problems are structural.

First, fragmentation. When you free objects scattered throughout the heap, you leave holes. The allocator has to manage a free list of variable-size chunks, and over time that list becomes a long sequence of small unusable gaps. Allocating a 1KB object when you have 10KB free in 50-byte fragments fails. You can mitigate this with coalescing and splitting strategies, but you’re fighting entropy.

Second, cache behavior. Surviving objects stay where they were allocated. After a few collection cycles, your live set is scattered across the heap with no locality. Iterating over objects in a hot loop means cache misses on every access.

Third, pause time. Both phases take time proportional to the size of the heap, not just the live set. A large heap with few live objects still requires sweeping the entire allocation space. Your program stops for the duration.

The Copying Collector: Trade Memory for Everything Else

The semi-space copying collector, described by Fenichel and Yochelson in 1969 and made efficient by C.J. Cheney in 1970, takes a different approach. You split the heap into two equal halves: “from-space” and “to-space”. All allocations happen in from-space. When it fills up, you copy every live object into to-space, update all pointers, then swap the roles of the two spaces. The old from-space is now empty and ready for the next round.

This sounds wasteful, and it is: you can only use half your heap at any time. But what you get in return is significant.

Allocation becomes a pointer bump. No free list, no coalescing, no metadata overhead per object beyond what you need for the GC itself. Just increment a pointer and return the old value:

void *gc_alloc(GC *gc, size_t size) {
    if (gc->free + size > gc->to_end) {
        gc_collect(gc);
    }
    void *ptr = gc->free;
    gc->free += size;
    return ptr;
}

After collection, live objects are packed contiguously at the start of the new from-space. Fragmentation is structurally impossible. Cache locality for objects that were allocated together and tend to be accessed together is excellent.

Collection time is proportional to the live set, not the heap size. If you have 100MB heap with 1MB of live objects, you copy 1MB and you’re done.

Cheney’s Algorithm: BFS Without a Queue

The clever part of Cheney’s algorithm is how it avoids needing a separate queue for the breadth-first traversal. The to-space itself serves as the queue.

You maintain two pointers into to-space: scan and free. Initially, both point to the start. When you copy the roots, you copy them into to-space at free and advance free. Then you start scanning from scan. For each object at scan, you copy all the objects it references into to-space at free, update the reference to point to the new location, and advance scan. When scan catches up to free, you’re done.

void gc_collect(GC *gc) {
    // Swap spaces
    char *tmp = gc->from;
    gc->from = gc->to;
    gc->to = tmp;
    gc->free = gc->to;
    gc->scan = gc->to;

    // Copy roots
    for (int i = 0; i < gc->num_roots; i++) {
        *gc->roots[i] = copy(gc, *gc->roots[i]);
    }

    // Scan copied objects
    while (gc->scan < gc->free) {
        Object *obj = (Object *)gc->scan;
        // For each pointer field in obj:
        for (int i = 0; i < obj->num_fields; i++) {
            obj->fields[i] = copy(gc, obj->fields[i]);
        }
        gc->scan += obj->size;
    }
}

The copy function checks whether the object has already been copied by looking for a forwarding pointer, a marker left in the old location pointing to where the object now lives:

Object *copy(GC *gc, Object *obj) {
    if (obj == NULL) return NULL;
    if (obj->tag == FORWARDED) return obj->forward;

    size_t size = obj->size;
    Object *new_obj = (Object *)gc->free;
    memcpy(new_obj, obj, size);
    gc->free += size;

    obj->tag = FORWARDED;
    obj->forward = new_obj;
    return new_obj;
}

This is a complete, correct copying collector in about 50 lines of real logic.

The Write Barrier Problem

Copying collectors introduce a correctness requirement that mark-and-sweep doesn’t have in its basic form: you must update every pointer when you move an object. In a simple interpreter where you control all pointer storage, this is manageable. In a more realistic runtime with a C stack, native handles, or embedded pointers in compiled code, it becomes genuinely hard.

This is where write barriers enter the picture. A write barrier is a piece of code that runs every time a pointer is stored into a heap object. In a generational collector (the natural evolution from a simple copying collector), the write barrier records any pointer from an old-generation object into a young-generation object, because such pointers are additional roots for young-generation collection. The data structure that records these is called a remembered set.

Generational collection applies the copying strategy selectively. Most objects die young, the generational hypothesis observed by empirical study, so you collect the young generation frequently and cheaply. Objects that survive long enough get promoted to an older generation collected less often. The JVM’s G1 and ZGC collectors, V8’s Orinoco, and CPython’s upcoming nogil GC all build on this foundation.

Tri-Color Marking and Incremental Collection

Both mark-and-sweep and copying collectors are stop-the-world: the mutator (your program) pauses for the entire collection. For latency-sensitive applications, this is unacceptable.

Tri-color marking, formalized by Dijkstra, Lamport et al. in 1978, makes incremental and concurrent collection possible. Objects are colored:

  • White: not yet reached; will be collected if still white at the end
  • Gray: discovered but not fully traced; still has unscanned references
  • Black: fully traced; all references have been followed

The invariant is that no black object may point directly to a white object. As long as you maintain this invariant, you can interleave marking with mutator execution. When the mutator creates a pointer from a black object to a white object, a write barrier intercepts it and re-grays the black object (or shades the white object gray), restoring the invariant.

This is the theoretical basis for collectors like Go’s concurrent mark-and-sweep, which has gone through multiple iterations of write barrier design to reduce pause times from hundreds of milliseconds to well under a millisecond.

What the Second Collector Actually Teaches You

Building a copying or generational collector after a mark-and-sweep forces you to think about things the first implementation hid.

Object layout matters. To scan an object’s pointer fields, you need to know where they are. A copying collector needs a type map or tag-based dispatch to find and update every pointer in every object. This is what runtime type information is actually for in managed languages, and it’s why object headers exist.

Precision matters. A conservative collector (like Boehm-Demers-Weiser) can treat any word that looks like a pointer as one. A moving collector can’t: if you misidentify an integer as a pointer, you corrupt it when you update it after moving the object. Moving collectors require precise roots and precise heap maps.

Cooperation between the allocator and collector matters. The bump-pointer allocator that makes copying so fast requires the GC to completely own the heap layout. You can’t mix it with malloc.

None of these are reasons not to build one. The opposite: building a second garbage collector is the thing that turns GC from a black box into a set of concrete engineering tradeoffs you can reason about. The first one proves you can do it. The second one starts to show you why the production implementations look the way they do.

Was this interesting?