
From tcache to mcache: How Go's Runtime and jemalloc Converged on the Same Architecture

Source: hackernews

Meta’s renewed investment in jemalloc is worth reading alongside Go’s runtime allocator documentation, because the two implementations converged on nearly identical architectures through entirely separate development paths. jemalloc was solving the problem for C/C++ server processes; Go’s runtime was solving it for a garbage-collected language with different constraints. The similarities are striking, and the divergences explain why C/C++ allocators require continued engineering investment that managed runtimes get for free.

The Architecture Both Reached

Go’s allocator, derived from Google’s tcmalloc with substantial modifications, uses a three-level hierarchy. At the base is mheap: a global structure that acquires large spans of memory from the OS via mmap. Above that are per-size-class mcentral structures, each managing a list of spans for one particular size class. At the top is mcache: a per-P (processor) local cache that holds a working set of objects for each size class, accessible without acquiring any lock.
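The three levels can be sketched as simplified Go types. The names mirror the runtime's internal structures (see runtime/mheap.go and runtime/mcache.go in the Go source), but the fields here are illustrative stand-ins, not the real definitions:

```go
package main

import "fmt"

// Hypothetical, heavily simplified mirrors of the runtime's allocator
// layers; the real structures carry far more bookkeeping.

type span struct {
	base      uintptr // start address of the span's memory
	npages    int     // span length in pages
	sizeclass int     // which size class this span serves
}

type mheap struct { // global: grows the heap from the OS via mmap
	free []span
}

type mcentral struct { // one per size class: spans with free slots
	sizeclass int
	partial   []span
}

type mcache struct { // one per P: the lock-free fast path
	alloc [68]*span // 67 small size classes plus class 0 (large objects)
}

func main() {
	c := mcache{}
	fmt.Println(len(c.alloc)) // prints 68
}
```

An allocation first tries the P's mcache; on a miss it refills from the matching mcentral, which in turn grows from mheap, so the lock scope widens only as the fast path fails.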

This maps almost directly to jemalloc’s structure. jemalloc’s arenas correspond to the mheap-plus-mcentral layer: independently managed allocation domains with their own free lists and extent trees. jemalloc’s tcache (thread cache) corresponds to mcache. Go uses 67 size classes, up to a 32 KB threshold where large objects are handled differently; jemalloc uses approximately 87 on 64-bit Linux, with its slab-based path handling objects up to roughly 14 KB. Both use logarithmically spaced size classes to bound internal fragmentation, and both handle large objects through a path that goes more directly to the backing store.
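The logarithmic spacing can be illustrated with a toy class generator. The table below is hypothetical, using four classes per power-of-two band (similar in spirit to the real, hand-tuned tables in both allocators), which bounds worst-case internal waste to roughly 20%:

```go
package main

import "fmt"

// sizeClasses builds an illustrative table: a few quantum-spaced small
// classes, then four classes per power-of-two band up to max.
func sizeClasses(max int) []int {
	classes := []int{8, 16, 32, 48, 64}
	for base := 64; base < max; base *= 2 {
		step := base / 4 // spacing grows with size: logarithmic bands
		for s := base + step; s <= base*2 && s <= max; s += step {
			classes = append(classes, s)
		}
	}
	return classes
}

// roundUp returns the smallest class that fits n bytes.
func roundUp(n int, classes []int) int {
	for _, c := range classes {
		if c >= n {
			return c
		}
	}
	return 0 // past the threshold: would take the large-object path
}

func main() {
	classes := sizeClasses(32 << 10) // 32 KB, Go's small-object threshold
	fmt.Println(roundUp(100, classes)) // 112: a 100-byte request wastes 12 bytes
}
```

Because each band's step is a fixed fraction of the class size, the wasted fraction stays bounded no matter how large the object, which is the point of logarithmic spacing.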

The convergence is not coincidental. Both allocators were solving the same problem: eliminating contention on a global free list under high concurrency. A single global lock protecting the heap is the bottleneck that glibc’s ptmalloc2 hit, and that prompted both jemalloc’s arena design and Go’s per-P mcache layer. Independent development produced nearly identical structures because the problem constrained the solution space.

Both runtimes expose their allocator state through introspection APIs that reveal the same underlying layering. In Go:

var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("HeapAlloc:    %d MB\n", m.HeapAlloc>>20)    // live heap bytes
fmt.Printf("HeapSys:      %d MB\n", m.HeapSys>>20)      // obtained from OS
fmt.Printf("HeapIdle:     %d MB\n", m.HeapIdle>>20)     // idle, available to the scavenger
fmt.Printf("HeapReleased: %d MB\n", m.HeapReleased>>20) // already returned to OS

The HeapAlloc versus HeapSys gap in Go is conceptually the same as jemalloc’s stats.allocated versus stats.resident gap: live bytes versus total footprint. Watching both ratios in production, in either runtime, tells you how efficiently your allocation patterns are using the memory the process holds.

Where the JVM Takes a Different Path

The JVM’s allocation model is at first glance much simpler. In a generational collector with Eden space, allocation within a Thread-Local Allocation Buffer (TLAB) is bump-pointer: each thread has a reserved chunk of Eden, and allocation is a pointer increment plus a bounds check. There are no size classes in the hot allocation path, no free lists, no arenas to select. The JIT inlines TLAB allocation as a handful of machine instructions.

The cost of that simplicity is deferred to collection time. When a TLAB fills, it’s retired and the thread gets a new one from Eden. When Eden fills, a minor GC runs, copying surviving young objects to the survivor space and resetting Eden. The heap fragmentation that jemalloc and Go’s allocator work to prevent structurally never has time to accumulate; the collector bulldozes it at each collection cycle.

This is the asymmetry that matters for the rest of this comparison. A managed runtime can move objects because it maintains the metadata to find and update all pointers to a given object. C and C++ cannot do this: once a pointer to an allocation is handed to application code, its value is fixed. An allocator cannot relocate the object without the program’s cooperation, and no existing C++ codebase written with raw pointers, member pointers, and external storage offers that cooperation. The option of compaction is structurally unavailable.

So C/C++ allocators must control fragmentation through design choices that make compaction unnecessary. That responsibility falls entirely on the allocator’s architecture.

Why Slabs Substitute for Compaction

Consider what happens to a slab containing objects of a single size class. When a service allocates a batch of 1000 128-byte objects, those fill some number of slabs, each divided into 128-byte slots tracked by a free-slot bitmap. When the service frees 990 of them, 99% of the slots in most slabs become free. jemalloc can identify slabs where every slot is free and return those entire slabs to the OS immediately. The condition for reclamation is binary: a slab is either fully free or it is not.
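The binary reclamation condition is easy to model. A hypothetical 64-slot slab with a one-word free-slot bitmap (real jemalloc slabs size the bitmap to the slot count of the class):

```go
package main

import (
	"fmt"
	"math/bits"
)

// slab holds 64 fixed-size slots tracked by a bitmap.
type slab struct {
	used uint64 // bit i set => slot i is live
}

// alloc claims the first free slot, or returns -1 if the slab is full.
func (s *slab) alloc() int {
	i := bits.TrailingZeros64(^s.used)
	if i == 64 {
		return -1
	}
	s.used |= 1 << i
	return i
}

func (s *slab) free(i int) { s.used &^= 1 << i }

// reclaimable is the binary condition from the text: a slab can be
// returned to the OS only when every slot is free.
func (s *slab) reclaimable() bool { return s.used == 0 }

func main() {
	var s slab
	a, b := s.alloc(), s.alloc()
	s.free(a)
	fmt.Println(s.reclaimable()) // false: slot b is still live
	s.free(b)
	fmt.Println(s.reclaimable()) // true: the whole slab can go back to the OS
}
```

A single live slot pins the whole slab, but only that slab; the damage is bounded at slab granularity rather than spreading across a variable-size free region.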

This differs structurally from a coalescing allocator managing variable-size free regions. In a coalescing allocator, a single small live allocation anywhere in a large free region prevents that region from being returned, because the free space is non-contiguous. The more mixed the allocation sizes and lifetimes throughout a process’s history, the worse this fragmentation becomes. Long-running server processes suffer this accumulation across days of mixed allocation traffic.

jemalloc’s slab-per-size-class design converts the fragmentation problem into the isolation problem. Objects of the same size class are isolated together. Server workloads with coherent allocation phases, where a batch of objects of one type is allocated, processed, and freed together, produce slabs that drain cleanly rather than leaving persistent holes. This does not eliminate fragmentation in the worst case, but for typical server workloads the slab structure keeps it bounded without any compaction pass.

The JVM doesn’t need any of this because GC handles the worst case retroactively. jemalloc needs the structural design to avoid the worst case proactively. Both approaches work; they impose their costs at different points in the system.

Go’s Scavenger vs jemalloc’s Decay

After objects are freed, both Go’s runtime and jemalloc face the same question: when and how aggressively to return memory to the OS. Holding freed memory accelerates future allocations; returning it reduces RSS and infrastructure cost.

Go’s background scavenger calls madvise with MADV_DONTNEED or MADV_FREE on spans that have been idle for long enough, targeting a heap that stays within the headroom defined by GOGC. Go 1.19 added GOMEMLIMIT, a soft memory ceiling that the scavenger and GC use jointly to prevent RSS from exceeding a configured bound. The scavenger runs proportional to heap growth and idle time, adjusting its aggressiveness dynamically.

jemalloc’s decay model does the same thing at a lower level of abstraction. Freed extents pass through dirty, muzzy, and retained states, with madvise calls triggering transitions at configurable time intervals. The background thread introduced in jemalloc 5.0 runs this asynchronously, exactly as Go’s scavenger does, to avoid landing purge work on allocation threads as latency spikes. Setting dirty_decay_ms and muzzy_decay_ms in MALLOC_CONF controls the tradeoff between RSS and allocation latency.
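A hypothetical MALLOC_CONF along those lines; the decay values are illustrative, not recommendations, and ./my-server is a placeholder binary name:

```shell
# Purge dirty pages after 10 s and muzzy pages after 30 s, with the
# background thread (jemalloc 5.0+) issuing the madvise calls
# asynchronously instead of on allocation threads.
export MALLOC_CONF="background_thread:true,dirty_decay_ms:10000,muzzy_decay_ms:30000"
./my-server
```

Shorter decay times trade allocation latency (more pages must be re-faulted) for lower steady-state RSS; setting a decay time to 0 purges eagerly, and -1 disables purging for that state entirely.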

The difference is the granularity of control. jemalloc's decay timers have historically been process-wide tunables, and one of the changes in Meta's current investment is per-arena decay parameters, so that components of the same process can have independent retention policies. An ML inference server where a large model-weight region and a request-handling region have different allocation cadences could give each arena its own decay settings. Go's scavenger has GOGC and GOMEMLIMIT as its primary knobs but offers no per-goroutine or per-component tuning. The Go model assumes a reasonably uniform heap profile, which holds for many programs and fails for the heterogeneous workloads Meta is specifically targeting.

NUMA and Huge Pages: A Shared Catch-Up Problem

Both Go’s runtime and jemalloc are presently being updated for the same hardware reality: modern servers are multi-socket NUMA machines, and 4 KB page granularity is expensive for large heaps under TLB pressure.

Go has added NUMA awareness incrementally. The P scheduler can be influenced by CPU affinity at the OS level, and the runtime’s heap span allocation has been made more NUMA-local over successive releases. Go 1.21 improved transparent huge page support by aligning span sizes and allocation cadence to cooperate with the kernel’s THP collapse heuristics. The intent is the same as what Meta is building into jemalloc: reduce remote memory accesses and increase the proportion of heap memory backed by 2 MB huge pages rather than 4 KB pages.

Meta’s announced changes to jemalloc include NUMA-aware arena assignment, where arenas are bound to specific NUMA nodes and thread-to-arena assignment respects CPU affinity, and extent layout changes to cooperate with 2 MB THP alignment requirements. These are not novel ideas; Go’s runtime and the JVM’s G1 and ZGC collectors have had hardware-topology awareness for years. jemalloc is closing a gap relative to where managed runtimes have been for some time.

The reason managed runtimes got here first is that GC already requires maintaining a complete object graph, understanding which pages contain live data, and running background threads. Adding NUMA-awareness to GC is an extension of existing infrastructure. jemalloc has no such infrastructure to extend; the work requires building NUMA-awareness into the allocator from a foundation that never needed to track object topology.

What the Comparison Shows

The convergence of Go’s allocator and jemalloc on size classes, thread-local caches, and background scavenging is evidence that these design choices address the right problems, not that one project copied the other. Different teams, different constraints, same problem, same structural answers.

The divergence at compaction explains the sustained engineering investment that C/C++ allocators require. Managed runtimes absorb the cost of fragmentation into GC, which is already a complex subsystem running background work and maintaining full pointer provenance. Adding compaction to GC is a natural extension. Adding compaction to a C/C++ allocator is impossible without application cooperation that the existing ecosystem of C and C++ code does not provide.

Meta’s announcement is, in this light, a statement about the permanent cost of not having a garbage collector. Every improvement that managed runtimes get for free from compaction, jemalloc must earn through structural design, careful tuning, and periodic re-engineering as hardware changes. NUMA topology and transparent huge pages are the current hardware gap; future hardware will introduce others. The investment is not a one-time project to close a known list of issues; it’s ongoing maintenance of infrastructure that compensates for an absence that is itself permanent.

The rest of the ecosystem that depends on jemalloc, including FreeBSD, Redis, RocksDB, and production Rust services using tikv-jemallocator, benefits from that maintenance without bearing its cost. That arrangement has worked for twenty years and will presumably continue.
