· 6 min read ·

From Mark-and-Sweep to Something Better: The Road Every GC Implementation Takes

Source: lobsters

Bob Nystrom’s “Baby’s First Garbage Collector” has been the canonical entry point into GC implementation for years. It fits in a few hundred lines of C, teaches the mark-and-sweep algorithm, and leaves you with a working collector that you actually understand. Matheus Moreira’s “Baby’s Second Garbage Collector” picks up where that leaves off, and the gap between a first and second garbage collector is where most of the interesting engineering lives.

The gap is not just about adding features. It’s about confronting the structural limitations of your first design and deciding which tradeoffs you’re willing to make to fix them.

What Mark-and-Sweep Actually Costs

Mark-and-sweep works in two phases. First, starting from a set of roots (stack variables, globals, registers), you traverse the object graph and mark everything reachable. Then you sweep through the heap and free everything that wasn’t marked. It’s correct, it’s simple, and it handles cycles without special cases, which is why it’s the canonical teaching GC.

But the costs are real. The sweep phase touches every object in the heap, live or dead, which means GC time scales with total heap size rather than just live set size. More importantly, mark-and-sweep does not move objects. After several collection cycles, the heap develops fragmentation: many small free blocks scattered between live objects. Allocation becomes slower because the allocator has to search for a block large enough to satisfy each request. Cache locality degrades because objects allocated together over time end up physically scattered across memory.

There’s also the pause problem. A basic mark-and-sweep collector stops the world for the entire mark and sweep phase. For interactive applications or anything with latency requirements, this is a serious constraint.

These three problems, fragmentation, cache pressure, and stop-the-world pauses, are what motivate every GC design that comes after the basic mark-and-sweep.

The Copying Collector

The most elegant solution to fragmentation is to eliminate it structurally. Copying collectors, based on Cheney’s 1970 algorithm, split the heap into two equal halves called semi-spaces. Allocation always happens in one half (the “from-space”). When it fills up, the collector copies all live objects into the other half (the “to-space”), compacting them together, then swaps the roles of the two spaces.

The implementation in C looks roughly like this:

typedef struct {
    char *from_space;
    char *to_space;
    char *scan;       /* next unscanned object in to-space */
    char *free;       /* next free slot in to-space */
    size_t size;      /* semi-space size */
} SemiSpaceHeap;

void *copy(SemiSpaceHeap *heap, void *obj) {
    /* If already copied, return forwarding pointer */
    Header *hdr = (Header *)obj - 1;
    if (hdr->forwarded) return hdr->forward;

    /* Copy object to to-space */
    size_t sz = hdr->size;
    memcpy(heap->free, hdr, sz);
    hdr->forwarded = true;
    hdr->forward = heap->free + sizeof(Header);
    heap->free += sz;
    return hdr->forward;
}

After collection, all live objects are packed at the start of to-space with no gaps. The allocator becomes a simple pointer bump: increment free by the requested size, no free list traversal required. Allocation in a copying collector is among the fastest possible, often just two or three instructions.

The obvious cost is memory: you can only use half the heap at any time. For memory-constrained environments this is a non-starter. But for systems with headroom, the improved allocation speed, compaction, and cache locality often more than compensate.

The Forwarding Pointer Trick

The subtle part of Cheney’s algorithm is handling the object graph correctly during copying. When you copy an object, other objects may still hold pointers to the original location. You need to update all those pointers.

Cheney’s solution is the forwarding pointer. When an object is copied, you overwrite part of its original header in from-space with a pointer to the new location in to-space. Any subsequent attempt to copy the same object finds the forwarding pointer and returns the already-copied location instead of duplicating it. This handles shared references and cycles correctly without any external bookkeeping.

Cheney also eliminated the recursion that naive copying would require. Instead of depth-first traversal (which needs a stack that can grow as deep as the object graph), he uses a breadth-first scan. The scan pointer chases the free pointer through to-space: everything between scan and free has been copied but not yet scanned for references. Scanning an object means copying all objects it references, advancing free, and then advancing scan past the just-scanned object. When scan catches up to free, collection is complete.

void collect(SemiSpaceHeap *heap, Roots *roots) {
    heap->scan = heap->free = heap->to_space;

    /* Copy roots */
    for (int i = 0; i < roots->count; i++)
        roots->ptrs[i] = copy(heap, roots->ptrs[i]);

    /* Scan copied objects */
    while (heap->scan < heap->free) {
        Object *obj = (Object *)heap->scan;
        scan_references(heap, obj);  /* calls copy() on each reference */
        heap->scan += object_size(obj);
    }

    /* Swap spaces */
    char *tmp = heap->from_space;
    heap->from_space = heap->to_space;
    heap->to_space = tmp;
}

This is a genuinely beautiful algorithm. It uses the to-space itself as its work queue, requires no auxiliary data structures, and runs in time proportional to the live set rather than total heap size.

Where Generational GCs Come From

Copying collection solves fragmentation but doesn’t help with pause times if you have a large live set. Copying a gigabyte of live objects takes time proportional to a gigabyte, regardless of how fast each individual copy is.

The generational hypothesis offers a way out. In most programs, most objects die young. Objects allocated together tend to die together. Garbage collected in the same batch. This observation, supported empirically across a wide range of languages and workloads, motivates splitting the heap by age.

A generational collector maintains at least two regions: a small, frequently collected “young generation” and a larger, infrequently collected “old generation.” Most allocations happen in the young generation. Collection of the young generation (a “minor GC”) is fast because the region is small and most of its objects are dead. Objects that survive enough minor collections get promoted to the old generation.

The engineering challenge is the write barrier. If an old object holds a reference to a young object, the young object must be treated as a root during minor collection even though it’s reachable only through the old generation. To track these cross-generational references, every write to a heap object must execute a small piece of code that records the write if it creates an old-to-young pointer. This is the write barrier.

void write_ref(Object *target, Object **field, Object *value) {
    *field = value;
    if (is_old(target) && is_young(value)) {
        record_in_remembered_set(target);
    }
}

Write barriers have a non-trivial cost in throughput. Interpreted languages absorb it more easily; compiled languages sometimes use card tables or other coarser-grained tracking to reduce per-write overhead.

How Production GCs Combine These Ideas

V8’s garbage collector, Orinoco, uses a generational design with a semi-space young generation (Cheney-style copying) and a mark-compact old generation. The young generation uses copying collection for its compaction and allocation speed. The old generation uses incremental and concurrent marking to reduce pause times, with compaction done selectively on pages with high fragmentation.

Go’s garbage collector takes a different path entirely. It’s a non-generational, non-compacting, concurrent mark-and-sweep collector. Go’s designers chose not to compact because the language’s unsafe pointer arithmetic makes it difficult to move objects reliably. Instead, they focus on reducing pause times through concurrent marking and a tricolor invariant with write barriers. Go 1.5 brought pauses down to the sub-millisecond range, and they’ve continued improving since.

The JVM’s G1 collector divides the heap into fixed-size regions rather than contiguous semi-spaces or generations, allowing it to collect the regions with the most garbage first (hence “Garbage First”). ZGC and Shenandoah go further with concurrent compaction, using load barriers to handle object movement without stopping application threads.

Each of these is a different answer to the same set of tradeoffs: throughput vs. latency, memory overhead, implementation complexity, and the constraints imposed by the host language’s semantics.

Building the Second One Is the Education

The first garbage collector teaches you how collection works. The second one teaches you why it’s hard. You have to make concrete decisions about memory overhead, pause characteristics, and implementation complexity, and you have to live with those decisions as you write the code.

The semi-space copying collector is a natural second project because it solves real problems (fragmentation, allocation speed) with an algorithm that’s still small enough to hold in your head. From there, the path to generational collection, incremental marking, and concurrent collection is a sequence of incremental refinements, each motivated by a specific cost you’ve already seen.

For anyone who’s worked through Nystrom’s tutorial and wants to keep going, Moreira’s follow-up is exactly the next step. And Wilson’s 1992 survey “Uniprocessor Garbage Collection Techniques” remains the best map of the entire design space if you want to see where all the paths lead.

Was this interesting?