Sequential Scans Over Pointer Chasing: Inside Go's Green Tea Garbage Collector
Source: go
Go has had sub-millisecond stop-the-world pauses since Go 1.8. The concurrent tri-color mark-and-sweep introduced in Go 1.5 addressed the latency side of GC, and the hybrid write barrier in Go 1.8 eliminated the stack re-scan that produced the remaining long pauses. By 2017, GC pause time was effectively solved for most workloads, but CPU consumption during marking had remained largely unaddressed.
Green Tea, introduced experimentally in Go 1.25 (released in August 2025), targets that remaining cost: how much compute the GC burns concurrently with the application, regardless of whether anything is stopped. With Go 1.26 in development and Google already running Green Tea in production, this is a good moment to look at what the algorithm actually changes and why the design works.
The Mark Phase and Cache Pressure
Go’s current GC uses what the Green Tea blog post calls a graph flood algorithm. During the mark phase, workers pull gray objects from a shared work queue, scan the pointer fields of each object, and enqueue any newly discovered white objects. This is the concurrent tri-color mark-and-sweep described by Dijkstra et al. in 1978, faithfully implemented with Go’s hybrid write barrier maintaining the tri-color invariant while the application continues running.
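The graph flood loop can be sketched in a few lines of Go. This is a single-threaded toy model, not the runtime's code: the object type, its marked flag, and markGraphFlood are hypothetical names invented for illustration, and the real collector runs many workers concurrently with a write barrier.

```go
package main

import "fmt"

// object is a hypothetical heap object: a name plus outgoing pointers.
type object struct {
	name   string
	marked bool // false = white; true = gray (queued) or black (scanned)
	ptrs   []*object
}

// markGraphFlood marks everything reachable from roots using a work list
// of individual objects and returns the order in which objects were scanned.
func markGraphFlood(roots []*object) []string {
	var order []string
	var work []*object // the gray set
	for _, r := range roots {
		if r != nil && !r.marked {
			r.marked = true
			work = append(work, r)
		}
	}
	for len(work) > 0 {
		// Pop one gray object. Its address is unrelated to the previous
		// one, which is where the cache misses come from.
		obj := work[len(work)-1]
		work = work[:len(work)-1]
		order = append(order, obj.name)
		for _, p := range obj.ptrs {
			if p != nil && !p.marked {
				p.marked = true
				work = append(work, p)
			}
		}
	}
	return order
}

func main() {
	d := &object{name: "D"}
	c := &object{name: "C", ptrs: []*object{d}}
	b := &object{name: "B", ptrs: []*object{c, d}}
	a := &object{name: "A", ptrs: []*object{b}}
	x := &object{name: "X", ptrs: []*object{a}} // never reached from the root

	fmt.Println("scan order:", markGraphFlood([]*object{a}))
	fmt.Println("X marked:", x.marked)
}
```

Every iteration of the loop dereferences an object chosen by the shape of the graph rather than by its location in memory.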
Pointer chasing degrades badly on modern cache hierarchies. Object A points to B, which points to C, which points to D. Each dereference may land in a different cache line, a different page, or a different NUMA node. On current hardware, an L3 cache miss costs 50-100 nanoseconds; a NUMA remote access costs considerably more. The Green Tea blog post puts a concrete number on this: at least 35% of mark time in the current GC is stalled on memory accesses.
That fraction has been growing as hardware evolves. Per-core memory bandwidth has not scaled proportionally with core counts. As Go programs use more cores, each core competes for the same memory bus, making effective per-core bandwidth worse over time. An algorithm designed in 2015 now runs on hardware where the pointer-chasing penalty is proportionally higher than it was then.
How Green Tea Changes the Work List
Green Tea does not modify the tri-color invariant, the write barrier, or the concurrent marking model. The change is confined to the work list discipline within the mark phase.
Instead of a queue of individual gray objects, Green Tea maintains a queue of memory pages. When a live object is discovered on a page, the page itself is enqueued. Mark workers then scan pages sequentially, left-to-right, in one or more passes. Each object now requires two bits of state: a “seen” bit (a live pointer to this object has been found) and a “scanned” bit (this object’s pointer fields have been walked). Under the graph flood model, one bit per object sufficed because an object was scanned exactly once. Under Green Tea, an object may be encountered across multiple passes of a page before it transitions from gray to black.
The practical effect on memory access is significant. Scanning a page means reading sequentially from its base address to its end, so the page and its associated metadata are likely to stay in L1 or L2 cache for the duration of the pass. The page-level work queue also has far fewer entries than an object-level queue, which reduces contention between parallel mark workers and cache-line bouncing on the shared queue structure.
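The page-oriented work list and the two bits per object can be modelled as follows. This is a minimal single-threaded sketch under stated assumptions: the heap, page, shade, and mark names are hypothetical, pages hold eight slots, and pointers are represented as global slot numbers rather than addresses.

```go
package main

import (
	"fmt"
	"math/bits"
)

const slotsPerPage = 8 // hypothetical: 8 equal-sized objects per page

// page models one page: per-object seen/scanned bits plus, for each slot,
// the global slot numbers that object points to.
type page struct {
	seen, scanned uint8
	queued        bool
	ptrs          [slotsPerPage][]int
}

type heap struct {
	pages []page
	queue []int // work list of page indexes, not objects
}

// shade sets the seen bit for the object at addr and enqueues its page
// if the page has unscanned work and is not already queued.
func (h *heap) shade(addr int) {
	pi, slot := addr/slotsPerPage, addr%slotsPerPage
	p := &h.pages[pi]
	p.seen |= 1 << slot
	if p.seen&^p.scanned != 0 && !p.queued {
		p.queued = true
		h.queue = append(h.queue, pi)
	}
}

// mark drains the page queue. Each pass scans, in address order, every
// object that is seen but not yet scanned. It returns the number of passes.
func (h *heap) mark(roots []int) int {
	for _, r := range roots {
		h.shade(r)
	}
	passes := 0
	for len(h.queue) > 0 {
		pi := h.queue[0]
		h.queue = h.queue[1:]
		p := &h.pages[pi]
		p.queued = false
		todo := p.seen &^ p.scanned // gray objects on this page
		p.scanned |= todo           // they turn black in this pass
		passes++
		for ; todo != 0; todo &= todo - 1 {
			slot := bits.TrailingZeros8(todo)
			for _, target := range p.ptrs[slot] {
				h.shade(target)
			}
		}
	}
	return passes
}

func main() {
	h := &heap{pages: make([]page, 2)}
	h.pages[0].ptrs[0] = []int{1, 9} // object 0 -> objects 1 and 9
	h.pages[0].ptrs[1] = []int{2}    // object 1 -> object 2
	h.pages[1].ptrs[1] = []int{3}    // object 9 -> object 3 (back on page 0)

	passes := h.mark([]int{0})
	fmt.Printf("passes=%d page0.scanned=%08b page1.scanned=%08b\n",
		passes, h.pages[0].scanned, h.pages[1].scanned)
}
```

In this toy heap, five objects are scanned in four page passes: objects 2 and 3 accumulate on page 0 while it waits in the queue and are handled together. The denser the live objects on a page, the more work each dequeue amortizes.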
Span Bitmaps
Go’s allocator organizes the heap into mspans: contiguous ranges of pages where every object belongs to the same size class. Go defines roughly 70 size classes ranging from 8 bytes to 32 KB; objects above 32 KB receive their own dedicated span. This structure already existed for the allocator’s benefit, and it turns out to be exactly the right shape for Green Tea’s bitmaps.
Because every object in a span is the same size, the seen and scanned bitmaps for a page are dense and regular. For an 8 KB page holding 128-byte objects (64 objects per page), each bitmap is exactly 64 bits, so both bitmaps together occupy 16 bytes, a fraction of a single 64-byte cache line. Even in the densest case, a page of 8-byte objects, each bitmap is 1,024 bits (128 bytes, or two cache lines), which on AVX-512-capable hardware fits in two 512-bit vector registers.
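The bitmap sizes for a single page follow from simple division. The sketch below assumes an 8 KB page and one bit per object in each of the two bitmaps; pageBytes and bitmapBits are illustrative names, not runtime identifiers.

```go
package main

import "fmt"

const pageBytes = 8 << 10 // 8 KB page, as in the text

// bitmapBits returns how many bits one per-object bitmap (seen or scanned)
// needs for a page filled with objects of the given size.
func bitmapBits(objBytes int) int {
	return pageBytes / objBytes
}

func main() {
	for _, size := range []int{8, 16, 128, 1024} {
		n := bitmapBits(size)
		fmt.Printf("%4d-byte objects: %4d bits per bitmap, %3d bytes for seen+scanned\n",
			size, n, 2*n/8)
	}
}
```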
Size-class uniformity provides another benefit: the pointer/scalar layout map, which identifies which words in an object hold pointers, is identical for every object in a span. Under the graph flood model, the GC loads type information once per object. Under Green Tea, the layout loads once per page scan and applies uniformly to every object on that page.
The AVX-512 Acceleration Path
The blog post describes a vector acceleration path planned for Go 1.26 that adds roughly 10% further GC CPU reduction on supporting hardware. The implementation relies on VGF2P8AFFINEQB, part of Intel’s GFNI (Galois Field New Instructions) extension.
The scanning kernel needs to expand per-object bitmaps into per-word bitmaps. If a size class contains 4-word objects, the single “seen” bit for each object must replicate into 4 bits before it can be ANDed against the pointer/scalar layout map. VGF2P8AFFINEQB performs an affine transformation over GF(2): given an 8x8 bit matrix and an 8-bit input, it computes the matrix-vector product using AND for multiplication and XOR for addition. By choosing the matrix appropriately for each size class, one input bit fans out to exactly N output positions in a single instruction. Each size class gets a pre-computed expansion matrix, generated at build time by src/internal/runtime/gc/scan/mkasm.go.
With this path active, the entire scanning kernel processes a page’s GC state almost entirely in registers, touching heap memory only at the final step when it gathers pointer values for the mark buffer.
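The bit expansion can be emulated in portable Go to show what the matrices do. This is a software model only: it ignores the instruction's actual operand layout and immediate operand, and matrix8, mulGF2, expansionMatrices, and expand are hypothetical names. Because one 8x8 product yields only 8 output bits, the sketch uses one matrix per output byte and is limited to objects of at most 8 words.

```go
package main

import (
	"fmt"
	"math/bits"
)

// matrix8 is an 8x8 bit matrix over GF(2); row i selects the input bits
// that are XORed together to form output bit i.
type matrix8 [8]uint8

// mulGF2 computes the matrix-vector product over GF(2): AND plays the role
// of multiplication and XOR (parity) the role of addition.
func mulGF2(m matrix8, in uint8) uint8 {
	var out uint8
	for i, row := range m {
		out |= uint8(bits.OnesCount8(row&in)&1) << i
	}
	return out
}

// expansionMatrices builds one matrix per output byte for a hypothetical
// size class of wordsPerObj words (1..8). Overall output bit j copies
// input bit j/wordsPerObj.
func expansionMatrices(wordsPerObj int) []matrix8 {
	ms := make([]matrix8, wordsPerObj) // 8 objects * wordsPerObj bits
	for j := 0; j < 8*wordsPerObj; j++ {
		ms[j/8][j%8] = 1 << (j / wordsPerObj)
	}
	return ms
}

// expand replicates each of 8 per-object bits into wordsPerObj per-word bits.
func expand(seen uint8, ms []matrix8) uint64 {
	var out uint64
	for b, m := range ms {
		out |= uint64(mulGF2(m, seen)) << (8 * b)
	}
	return out
}

func main() {
	ms := expansionMatrices(4)        // precomputed once per size class
	perWord := expand(0b00000101, ms) // objects 0 and 2 are seen
	layout := uint64(0x33333333)      // assumed: words 0 and 1 of each object are pointers
	fmt.Printf("per-word bitmap: %#x\n", perWord)
	fmt.Printf("pointer words to load: %#x\n", perWord&layout)
}
```

With objects 0 and 2 seen and four words per object, the per-word bitmap has bits 0-3 and 8-11 set (0x0f0f). ANDing it with the assumed layout mask leaves only the pointer words of those two objects.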
Comparison With Java’s GC Family
Java’s GC history covers similar ground from a different angle, and the comparison clarifies what Green Tea is and is not trying to do.
G1 GC, the default since Java 9, divides the heap into fixed-size regions and tracks live data at region granularity using per-region bitmaps. The motivation parallels Green Tea: working at coarser-than-object granularity yields better cache behavior during marking. G1, however, is a compacting collector; it moves live objects out of partially filled regions to reclaim space. Green Tea operates strictly within Go’s non-moving model.
ZGC and Shenandoah focus on near-zero pause latency by performing relocation concurrently with the application. ZGC uses colored pointer bits and load barriers; Shenandoah uses forwarding pointers. Both are solutions to the latency problem. Green Tea targets throughput: total GC CPU consumed, independent of pause duration.
Go cannot readily adopt a compacting collector. The language exposes raw pointers, permits interior pointers into struct fields, and interoperates with C through cgo. Moving an object requires updating every pointer to it, which fails when pointers are held in C stack frames or registered with external systems. This is a deliberate structural boundary in Go’s design. Within it, Green Tea pushes the non-moving mark phase about as far toward cache efficiency as the constraint allows.
Academic Precedent
The insight that coarser marking granularity improves cache behavior predates Green Tea by nearly two decades. The Immix GC (Blackburn and McKinley, PLDI 2008) demonstrated this empirically by dividing the heap into 32 KB blocks subdivided into 128-byte lines, tracking liveness at line granularity during the trace phase. Immix showed that line-level marking outperforms object-level marking on cache metrics and also enables opportunistic defragmentation by evacuating objects out of fragmented blocks during marking.
Green Tea applies the same principle within Go’s constraints: mspans instead of Immix blocks, size-class-aligned bitmaps instead of line-level bitmaps, and non-moving collection throughout. The span structure Go’s allocator maintained for allocation efficiency turned out to be the right abstraction for this GC optimization as well.
Performance and Current Status
The Go team reported a 10% median reduction in GC CPU cost across typical workloads, with best-case improvements reaching 40%. Google deployed Green Tea internally before the public release and observed production results consistent with those benchmarks. Go 1.26 is planned to make Green Tea the default, with the AVX-512 vector path arriving in the same release.
Workloads with sparse heaps, where only one or two live objects occupy each page at scan time, see smaller gains; the page-accumulation advantage requires multiple live objects per page to materialize. The implementation includes a fallback path for single-object pages that recovers near-graph-flood behavior in the sparse case.
To try it on Go 1.25:
GOEXPERIMENT=greenteagc go test -bench=. ./...
GOEXPERIMENT=greenteagc go build -o myapp ./cmd/myapp
The experiment flag is baked into the binary at build time and visible via go version -m. The standard tuning parameters, GOGC and GOMEMLIMIT, continue to work without changes. The Go team is collecting production reports via issue #73581.
Green Tea is the most consequential architectural change to Go’s mark phase since concurrent marking arrived in Go 1.5. A decade of GC work closed the latency gap; Green Tea addresses the CPU overhead that persisted after that work was done.