· 7 min read ·

Past Mark-and-Sweep: What Your Second Garbage Collector Forces You to Understand

Source: lobsters

Bob Nystrom’s Baby’s First Garbage Collector has been a rite of passage for systems programmers for over a decade. It walks through a complete mark-and-sweep collector in roughly 150 lines of C, building a tiny VM that manages memory without ever calling free() manually. Matheus Moreira’s Baby’s Second Garbage Collector picks up where that left off, and the jump in complexity it represents is worth examining carefully.

The first collector is deceptively approachable because mark-and-sweep maps almost directly onto the mental model most programmers already carry. You walk all the live objects, mark them, then free everything else. The heap stays in place. Pointers stay valid. You need to know where your roots are (the VM stack, in Nystrom’s example) and be able to enumerate each object’s outgoing references. Everything else follows from those two facts.

What Mark-and-Sweep Leaves Unsolved

The simplicity comes with real costs that become visible as soon as you push the collector beyond a toy context.

Fragmentation is the most immediate problem. After many collection cycles, the heap fills with small gaps where short-lived objects used to live. An allocator using a free list can usually find a slot for a new object, but not always from a contiguous region. Cache behavior degrades because related objects that were allocated together drift apart as some survive and some get freed. A long-running process with high allocation rates can end up with a heap that appears full to any request larger than a certain size, even when total free space would be sufficient if contiguous.

Pause time scales with heap size, not with the amount of live data. The sweep phase must visit every object in the heap to determine what to free. If your heap is 2 GB and only 100 MB is live, you still pay to walk 2 GB of object headers on every collection cycle.

A second GC forces you to confront these trade-offs directly.

Copying Collection: Moving Objects as a Feature

The Cheney copying collector, published in 1970, addresses both fragmentation and pause proportionality in one move. The heap is divided into two equal semi-spaces: from-space and to-space. All allocation happens in from-space using a bump pointer, a single integer that increments on each allocation. No free list, no coalescing, no header manipulation.

When from-space is exhausted, collection begins. The collector copies all live objects into to-space, updates every pointer that referred to the old location, and then swaps the spaces. From-space becomes the new to-space, empty and ready; the old to-space, now densely packed with live objects, becomes the new from-space.

Cheney’s key insight was that you can manage the gray object worklist without any auxiliary data structure by maintaining two pointers into to-space:

void collect(VM *vm) {
    Object *scan = to_space;
    Object *free_ptr = to_space;

    // Copy all roots first
    for (int i = 0; i < vm->stack_size; i++) {
        vm->stack[i] = copy(vm->stack[i], &free_ptr);
    }

    // scan chases free through to-space like a queue
    while (scan < free_ptr) {
        for each reference field f in *scan:
            *f = copy(*f, &free_ptr);
        scan += object_size(scan);
    }

    swap(&from_space, &to_space);
}

Object *copy(Object *obj, Object **free_ptr) {
    if (obj == NULL) return NULL;
    if (obj->forwarding != NULL) return obj->forwarding; // already moved

    Object *new_obj = *free_ptr;
    *free_ptr += object_size(obj);
    memcpy(new_obj, obj, object_size(obj));
    obj->forwarding = new_obj;
    return new_obj;
}

By using scan and free_ptr as implicit queue boundaries, Cheney’s algorithm performs a breadth-first traversal of the live object graph without allocating any auxiliary memory. The to-space itself serves as the worklist. Pause time becomes proportional to the live set rather than heap size, and the heap is completely defragmented after every collection.

The cost is that you need twice as much address space as your maximum live set requires. More significantly, every object must carry metadata that lets the collector determine its size and enumerate all its pointer fields.

That second requirement is what makes implementing a real copying collector harder than implementing mark-and-sweep. In Nystrom’s toy VM, every object is either an integer or a pair, handled with a simple switch statement. In a real runtime, objects can be strings, arrays, closures, hash maps, and native objects with finalizers. The collector needs a type-specific traversal function or an embedded pointer map for every object shape the language can produce.

The Forwarding Pointer Problem

The subtlest part of a copying collector is managing forwarding pointers correctly. When an object is copied to to-space, the original slot in from-space is overwritten with the address of the new copy. Any subsequent attempt to copy that object must detect the forwarding pointer and return the already-copied address instead of making a second copy.

This means every object needs either a dedicated forwarding pointer field or some in-band signaling scheme. One common approach is to use a sentinel bit in the object’s tag or type field. Another is to dedicate the first word of every object: if the first word has a specific bit pattern, treat it as a forwarding address; otherwise treat it as the beginning of normal object data. Both approaches constrain your object memory layout before you write a line of GC code, which is the kind of systemic design coupling that a first collector lets you ignore entirely.

The Generational Hypothesis

Copying collection is fast for workloads where most objects are short-lived, because pause cost is proportional to what survives rather than what dies. Generational collection makes this observation structural and exploits it.

The generational hypothesis, developed by David Ungar and others in the mid-1980s, holds that most objects become unreachable within a few million instructions of allocation. Empirical measurements across many workloads consistently confirmed this pattern. A collector focused on newly allocated objects can recover most garbage for very little work.

A generational collector divides the heap into a small nursery (young generation) and a larger old generation. New objects are allocated in the nursery using bump-pointer allocation. When the nursery fills, a minor collection copies surviving nursery objects into the old generation. Full collections of the old generation happen much less frequently. Minor collections are cheap because the nursery is small, pauses are short, and most objects die before surviving.

Write Barriers: The Hidden Cost of Generations

Generational collection introduces a problem that does not exist in simpler collectors. A minor collection must find all nursery objects that are still reachable. Some will be referenced from the VM stack, some from other nursery objects, but some may be referenced from old-generation objects. Without accounting for those cross-generational pointers, the collector would incorrectly identify nursery objects referenced only from old-generation as garbage.

The solution is the write barrier: code that executes on every pointer store and records any write that creates a reference from an old-generation object to a young-generation object. The collector uses this record as an additional root set during minor collections.

// Card table: the heap is divided into fixed-size cards (typically 512 bytes).
// A byte array indexed by card number tracks which cards contain
// cross-generational pointers.

#define CARD_SHIFT 9  // 512-byte cards
uint8_t card_table[HEAP_SIZE >> CARD_SHIFT];

void write_barrier(Object *host, Object **field, Object *new_value) {
    *field = new_value;
    if (is_old_gen(host) && is_young_gen(new_value)) {
        card_table[(uintptr_t)host >> CARD_SHIFT] = DIRTY;
    }
}

During a minor collection, the GC scans all dirty cards for old-generation objects that might point into the nursery, treats those objects as additional roots, then clears the dirty flags. The overhead is a few instructions on every pointer store, which in practice costs roughly one to five percent throughput for typical object-oriented workloads.

The alternative is a remembered set: an explicit list of old-generation locations containing young-generation pointers. Remembered sets are more precise (no false positives from non-pointer writes that happen to touch a dirty card) but more expensive to maintain per write. HotSpot uses a card table for its generational boundary. V8 uses a combination of write barriers and remembered sets depending on the memory space in question.

Where Toy Implementations End

Nystrom’s first collector and Moreira’s second work on the same pedagogical foundation: get the algorithm correct in a context where you can see the whole system at once. The VM stack is the only root set. Objects have a fixed set of shapes. The heap is a single contiguous region.

Real runtimes are considerably messier. The JVM has to handle roots in native method frames, JNI handles, class statics, JIT-compiled code with embedded object references, and weak references with finalization semantics. V8 handles JavaScript’s dynamic property model, the boundary between JavaScript objects and C++ objects managed by the embedder API, and the constraint that collection must be possible while JavaScript execution continues in some configurations.

The algorithms underlying all of these systems are recognizably the same: tri-color marking, copying collection, generational boundaries, write barriers. But the implementation surface area expands enormously once you add the requirement that the collector must handle every object the runtime can possibly produce.

Building the toy versions is not wasted effort. Understanding why forwarding pointers must be checked before each copy, why write barriers are necessary rather than optional in a generational collector, and why pause time scales differently for different algorithm families gives you the conceptual vocabulary to read the literature and understand the engineering decisions in production runtimes. Jones, Hosking, and Moss’s The Garbage Collection Handbook is the standard reference for anyone who wants to go beyond the toy stage.

The second collector is where the interesting friction starts. The first one shows you the shape of the problem. The second one shows you how much the first one was quietly leaving out.

Was this interesting?