
The Cache Miss Problem That Green Tea GC Was Designed to Solve


Go’s garbage collector has used concurrent tri-color mark-and-sweep since Go 1.5. The algorithm is well understood, the concurrency model is solid, and the engineering team has spent years tuning it. So when the Go team announced Green Tea GC claiming up to 40% reductions in GC CPU time, the natural question is: what changed?

The algorithm did not change. The data structures did. That distinction matters more than it might seem.

What the Old GC Was Actually Doing

Go’s tri-color mark-and-sweep GC works by maintaining a worklist of heap objects that need to be traced. During the mark phase, the collector pulls an object off the worklist, scans its pointers, and adds any referenced objects it hasn’t seen yet. Repeat until the worklist is empty. Everything not marked is garbage.
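That loop can be sketched in a few lines. Everything here is illustrative: the real collector works on raw addresses, spans, and per-P buffers, not a struct and slice like this.

```go
package main

import "fmt"

// Object is a stand-in for a heap object: an identity plus outgoing pointers.
type Object struct {
	ID     int
	Refs   []*Object
	marked bool
}

// mark is the classic worklist-based mark phase: pop an object, scan its
// pointers, push any referents not yet seen. The worklist here is a stack;
// the ordering discipline varies by implementation.
func mark(roots []*Object) int {
	worklist := append([]*Object(nil), roots...)
	for _, r := range roots {
		r.marked = true
	}
	reached := 0
	for len(worklist) > 0 {
		obj := worklist[len(worklist)-1]
		worklist = worklist[:len(worklist)-1]
		reached++
		for _, ref := range obj.Refs {
			if !ref.marked {
				ref.marked = true
				worklist = append(worklist, ref)
			}
		}
	}
	return reached
}

func main() {
	a, b, c := &Object{ID: 1}, &Object{ID: 2}, &Object{ID: 3}
	garbage := &Object{ID: 4}
	a.Refs = []*Object{b}
	b.Refs = []*Object{c, a} // cycle back to a: the marked bit prevents looping
	fmt.Println(mark([]*Object{a})) // 3 objects reachable from the root
	fmt.Println(garbage.marked)     // false: never reached, would be swept
}
```

Note that each `ref` dereference in the inner loop is a dependent load to a potentially arbitrary heap address, which is exactly the access pattern the next paragraphs describe.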

The worklist is a queue of object addresses. Each address points somewhere in the heap. The heap is not arranged to make those addresses close to each other; objects are allocated based on size classes and span availability, not on any notion of traversal order. When the GC follows a pointer from object A to object B, object B might be in a completely different region of memory.

This creates a predictable problem for the CPU. Modern processors hide memory latency through caching and prefetching. Hardware prefetchers are good at detecting regular access patterns: sequential reads through an array, for instance, or accesses at a fixed stride. Random pointer chasing through scattered heap memory defeats both mechanisms. Every pointer follow is potentially a cache miss, which means a stall waiting on main memory.

The Go team measured the actual cost of this. Approximately 35% of GC marking CPU time was spent stalled on main memory due to cache misses. A third of the GC’s CPU budget was going to waiting, not working.

The Page-Based Shift

Green Tea GC changes the fundamental unit of tracking from individual objects to memory pages. Instead of a worklist of object addresses, the collector maintains a FIFO queue of pages. Each page contains contiguous slots of the same size class.

The metadata changes accordingly. Rather than per-object state stored inline with heap allocations, Green Tea uses per-page bitmaps stored off-heap. Each object slot on a page gets two bits:

  • A “seen” bit, indicating the object is reachable from roots
  • A “scanned” bit, indicating the object’s pointers have been fully processed

These bitmaps are small and tightly packed. Scanning a page means reading through a contiguous block of bitmap data, which is exactly the kind of access pattern that hardware prefetchers handle well. The CPU can load a single 64-byte cache line of bitmap data and process many objects’ worth of state from it before needing to go back to memory.
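The two-bit-per-slot scheme can be sketched with packed words. All names here (`pageBits`, `pendingSlots`, the 512-slot page) are illustrative, not the runtime’s actual metadata layout:

```go
package main

import (
	"fmt"
	"math/bits"
)

// slotsPerPage is a hypothetical page capacity for this sketch.
const slotsPerPage = 512

// pageBits packs two bits of state per object slot: "seen" (reachable)
// and "scanned" (pointers fully processed), stored as contiguous words.
type pageBits struct {
	seen    [slotsPerPage / 64]uint64
	scanned [slotsPerPage / 64]uint64
}

func (p *pageBits) setSeen(slot int)    { p.seen[slot/64] |= 1 << (slot % 64) }
func (p *pageBits) setScanned(slot int) { p.scanned[slot/64] |= 1 << (slot % 64) }

// pendingSlots returns slots that are seen but not yet scanned. The loop is
// a straight sequential read through packed words: the prefetcher-friendly
// pattern the article describes.
func (p *pageBits) pendingSlots() []int {
	var out []int
	for i := range p.seen {
		pending := p.seen[i] &^ p.scanned[i]
		for pending != 0 {
			bit := bits.TrailingZeros64(pending)
			out = append(out, i*64+bit)
			pending &= pending - 1 // clear lowest set bit
		}
	}
	return out
}

func main() {
	var p pageBits
	p.setSeen(3)
	p.setSeen(70)
	p.setScanned(3) // slot 3 already processed
	fmt.Println(p.pendingSlots()) // only slot 70 remains to scan
}
```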

On hardware with AVX-512 support, the implementation uses VGF2P8AFFINEQB to process 64 bytes of per-page bitmap data in a single instruction, which yields an additional roughly 10% improvement on top of the baseline gains.

The mark-and-sweep algorithm is otherwise unchanged. The collector still does concurrent marking, still uses write barriers for the same purposes, still sweeps dead objects after marking completes. The insight is that the algorithm’s correctness and concurrency properties were fine; the data layout was losing performance on the memory subsystem.

Why Not Generational GC

The obvious comparison is to generational garbage collectors. The JVM’s G1 GC and .NET’s generational GC both divide the heap into generations, collecting young objects frequently and promoting survivors to older generations. The rationale is the generational hypothesis: most objects die young.

Go has resisted this approach, and the resistance has a concrete technical reason. Generational collectors require write barriers to track pointers from old generations to young ones. Whenever a program writes a pointer from an older region to a younger region, the runtime has to record that write so the collector knows to check that pointer when it collects the young generation.

Write barriers add overhead to every pointer write in a program. For many workloads this is acceptable; the JVM and .NET accept this cost because the generational collection benefits outweigh it. For Go programs, the calculus is different. Go programs do heavy pointer manipulation through interface values, closures, goroutine stacks, and channel operations. The Go team has evaluated generational designs and found that write barrier overhead eats significantly into the gains from generational collection in typical Go workloads.
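For illustration, a toy card-marking barrier shows the per-store bookkeeping a generational collector imposes. The card size and all names are hypothetical; real barriers are compiler-emitted and far more compact:

```go
package main

import "fmt"

// cardSize is the (hypothetical) number of bytes each dirty-card bit covers.
const cardSize = 512

// cardTable records which regions of the old generation have had pointer
// writes, so a young-generation collection knows where to look for
// old-to-young references.
type cardTable struct {
	dirty []bool
}

// writePointer simulates storing a pointer at heap address addr. The extra
// table update on every single pointer store is the overhead that makes
// generational collection a poor fit for pointer-heavy Go workloads.
func (ct *cardTable) writePointer(addr uintptr) {
	ct.dirty[addr/cardSize] = true
}

func main() {
	ct := &cardTable{dirty: make([]bool, 16)}
	ct.writePointer(0x600) // a store at offset 0x600 dirties card 3
	fmt.Println(ct.dirty[3])
}
```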

Green Tea stays within the concurrent tri-color model that Go already uses, avoiding write barrier costs entirely while still addressing the core cache inefficiency problem. It is a different bet on where the performance is.

What 40% Actually Means

The 40% reduction is in GC CPU time, not wall-clock pause time, and it applies to allocation-heavy workloads, not all workloads. Typical workloads see closer to a 10% reduction in GC CPU time. The headline figure applies to programs doing heavy allocation of short-lived small objects: HTTP servers processing request contexts, programs allocating many small structs in hot paths, applications with high channel throughput.

Translating GC CPU time savings to application-level impact requires knowing what fraction of total CPU time was going to GC in the first place. A service spending 5% of its cycles on GC that gets a 40% GC reduction saves roughly 2% of total CPU time. A service spending 20% on GC saves around 8%. The savings compound with multi-core workloads, where reduced cache contention during marking has throughput effects that exceed what the raw GC time measurement captures.
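The arithmetic is just a product of two fractions:

```go
package main

import "fmt"

// appSavings translates a GC CPU reduction into whole-application CPU
// savings: the fraction of cycles spent in GC times the GC-level reduction.
func appSavings(gcFraction, gcReduction float64) float64 {
	return gcFraction * gcReduction
}

func main() {
	fmt.Printf("%.0f%%\n", appSavings(0.05, 0.40)*100) // 5% GC share, 40% cut: 2%
	fmt.Printf("%.0f%%\n", appSavings(0.20, 0.40)*100) // 20% GC share, 40% cut: 8%
}
```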

Go 1.25 shipped Green Tea as an experimental opt-in via GOEXPERIMENT=greenteagc. The Go team ran it in production at Google during the 1.25 cycle to validate it against real workloads before making it the default. Go 1.26 makes it the default collector.
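Concretely, the opt-in is a build-time environment variable. The opt-out spelling below is an assumption based on GOEXPERIMENT’s usual no- prefix convention, not something stated above:

```shell
# Go 1.25: opt in to the experimental collector at build time
GOEXPERIMENT=greenteagc go build ./...

# Go 1.26: Green Tea is the default; the no- prefix should
# disable it (assumed, per standard GOEXPERIMENT convention)
GOEXPERIMENT=nogreenteagc go build ./...
```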

The Broader Pattern

There is a recurring observation in systems programming that cache-friendly data layout often matters as much as algorithmic complexity. The famous example is the comparison between a pointer-chasing linked list and a flat array for traversal: the array can be several times faster at realistic sizes, even though both traversals are O(n), because the array fits in cache and the linked list does not.
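The contrast is easy to show in miniature. This toy example computes the same sum both ways; the speed difference only materializes at sizes that spill out of cache, so it is a shape illustration, not a benchmark:

```go
package main

import "fmt"

// node is one element of a singly linked list.
type node struct {
	val  int
	next *node
}

// sumSlice walks contiguous memory: stride-1 reads the prefetcher can stream.
func sumSlice(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// sumList chases pointers: each hop is a dependent load that may land
// anywhere in the heap, defeating the prefetcher.
func sumList(head *node) int {
	total := 0
	for n := head; n != nil; n = n.next {
		total += n.val
	}
	return total
}

func main() {
	xs := []int{1, 2, 3, 4}
	var head *node
	for i := len(xs) - 1; i >= 0; i-- { // build the list back-to-front
		head = &node{val: xs[i], next: head}
	}
	fmt.Println(sumSlice(xs), sumList(head)) // same result, different memory behavior
}
```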

Green Tea applies this observation to garbage collection infrastructure. The tri-color marking algorithm is not cache-unfriendly by design; it just happened to be implemented with a data structure, the object-address worklist, that produces random access patterns. Switching to page-level tracking creates sequential access patterns that hardware is built to accelerate, without any change to the algorithm’s correctness or its interaction with Go’s concurrency primitives.

This is also why the improvement is clean. There is no new algorithmic complexity to reason about, no new edge cases in concurrent marking, no change to the invariants that the rest of the runtime depends on. The GC does the same thing it always did; it just does it with better memory access patterns.

For programs where GC overhead is a meaningful fraction of CPU time, upgrading to Go 1.26 is likely the lowest-effort performance improvement available. The work has already been done; you just have to build with the new toolchain.
