On a dual-socket server with 64 physical cores, memory access latency is not uniform. Memory attached to socket 0 is local to that socket’s cores; accessing memory on socket 1 from a core on socket 0 costs roughly 30 to 60 ns more per access, depending on the processor generation and interconnect speed. For most software this non-uniform memory access (NUMA) is invisible. For an allocator making hundreds of millions of decisions per second about which memory to hand to which thread, it matters significantly.
jemalloc’s arena model was designed around a different kind of locality: thread isolation. The default arena count is 4 * ncpus, and threads are assigned to arenas in round-robin order: thread 0 gets arena 0, thread 1 gets arena 1, and so on. This eliminates lock contention between threads on different arenas, which was the original problem jemalloc was built to solve on SMP hardware in 2005. What it does not do is ensure that an arena serving a thread on socket 0 allocates only from memory physically attached to socket 0. Meta’s renewed investment is fixing that, upstream, for everyone.
The NUMA Problem in Detail
NUMA topology on modern x86 servers is determined at boot time and exposed to the OS through ACPI tables. Linux makes it available through the /sys/devices/system/node/ hierarchy and through the libnuma API. The distance matrix, inspectable via numactl --hardware, uses 10 for local access; one-hop remote values typically fall in the 20 to 32 range depending on the platform, and more complex four-socket topologies push remote distances higher.
For a workload where threads stay on one socket for their operational lifetime, which is the common case for thread-pool services, the current round-robin arena assignment produces predictable NUMA violations. Thread 1 is on core 4 (socket 0) and assigned arena 1. Arena 1 calls mmap() without NUMA constraints; the kernel places the resulting pages on whichever node has free capacity, which may be socket 1. Every subsequent allocation that thread 1 makes from arena 1 may be touching remote memory. The 30 to 60 ns penalty per access is invisible in any single operation but accumulates across millions of cache-miss accesses over the lifetime of a request.
The problem grows with working set size. L3 cache capacity per socket is typically 32 to 96 MB on current processors. Any hot data that exceeds L3 goes to DRAM. When that DRAM is remote, the latency penalty is unavoidable and constant for the duration of the workload.
The Fix: Topology-Aware Arena Assignment
The NUMA work Meta is contributing upstream uses numa_node_of_cpu() to determine which NUMA node owns the CPU a thread is currently running on, then assigns the thread to an arena whose extents are allocated from that same node. The arena-level mmap() calls are either replaced with NUMA-aware variants using mbind() to bind the resulting pages to the local node, or prefixed with a set_mempolicy(MPOL_BIND) call that constrains where the kernel places physical pages.
The result is that threads on socket 0 allocate from socket 0 memory, and threads on socket 1 allocate from socket 1 memory. The arena isolation model that eliminated lock contention now also eliminates remote memory access as a default outcome. The two isolation properties are orthogonal and maintained simultaneously.
There is a complication: threads can migrate between CPUs and sometimes between NUMA nodes, particularly on systems without explicit CPU affinity pinning. The implementation handles this by periodically re-checking the thread’s current NUMA node and re-evaluating arena assignment on migration. For workloads that pin threads with pthread_setaffinity_np() or Linux cgroup CPU pinning, migration does not occur, so re-assignment never triggers and the periodic check is a negligible fixed cost.
Transparent Huge Pages and the Extent Allocator
The second major improvement targets transparent huge pages (THP), and it connects directly to the extent allocator redesign from jemalloc 5.0.
The x86-64 architecture supports 4 KB base pages and 2 MB huge pages for general-purpose memory (1 GB pages exist but are rarely used for heap allocations). The TLB caches virtual-to-physical address translations; a single 2 MB TLB entry covers 512 times the virtual address range of a 4 KB entry. For workloads with large heaps and high access rates, TLB coverage is a genuine bottleneck: a TLB miss triggers a page-table walk that can cost tens to hundreds of nanoseconds.
The kernel’s transparent huge page machinery can back a virtual memory range with 2 MB physical pages automatically, without application involvement, but only when the virtual range is both 2 MB-aligned and at least 2 MB in size. jemalloc 5.0 introduced variable-size extents, tracked per arena, which replaced the fixed 2 MB chunks of earlier versions and were designed in part with huge-page alignment in mind. But the extent allocator does not consistently produce 2 MB-aligned, 2 MB-sized extents for all large allocations, so the kernel’s THP machinery applies opportunistically rather than reliably.
Meta’s work modifies the extent allocator to prefer 2 MB-aligned, 2 MB-sized extents for allocations above a configurable threshold. When alignment is consistent, THP coverage becomes reliable. Meta reports double-digit latency improvements from reliable THP on heap memory, a figure consistent with what other large deployments observe when they enforce huge-page alignment on heap regions through explicit mmap() hints.
The two improvements interact directly. NUMA-aware arena assignment ensures that the 2 MB extents land on the correct NUMA node’s memory. Without NUMA awareness, the THP improvement reduces TLB pressure but leaves remote-access latency intact. With both, allocations are locally sourced and TLB-covered.
Who Else Benefits
jemalloc is not Meta-specific infrastructure. FreeBSD ships it as the default system allocator. Redis and Valkey explicitly recommend it for fragmentation control; the Redis documentation notes that jemalloc 5.x is the tested and recommended option for production deployments. RocksDB, used as the storage engine in TiKV, MyRocks, and dozens of other systems (CockroachDB used it as well before switching to Pebble, its Go reimplementation), inherits jemalloc from its typical deployment environments. All of these projects run on multi-socket server hardware. All of them will benefit from topology-aware arena assignment without changing a line of their own code.
Rust removed jemalloc as its default global allocator in 1.32, citing binary size and build complexity, but production Rust services commonly re-add it. The tikv-jemallocator crate provides a #[global_allocator] implementation, and tikv-jemalloc-ctl exposes the mallctl API from safe Rust, giving Rust services the same observability surface as C++ services:
use tikv_jemalloc_ctl::{epoch, stats};
// Advance the epoch to get fresh statistics
let e = epoch::mib().unwrap();
e.advance().unwrap();
let allocated = stats::allocated::mib().unwrap().read().unwrap();
let resident = stats::resident::mib().unwrap().read().unwrap();
println!("fragmentation + decay overhead: {} bytes", resident - allocated);
For Rust services on multi-socket hardware, the NUMA improvements land automatically once they reach a stable jemalloc release.
The upstream-first commitment matters here. Large companies routinely maintain internal patches to open-source dependencies, and those patches stay internal until someone does the explicit work of upstreaming them. The 2026 announcement explicitly commits to contributing changes to the public repository as the primary workflow, not as a secondary backport. That means FreeBSD, Redis deployments, CockroachDB clusters, and Rust services all receive the improvements on the same timeline as Meta’s own infrastructure rather than on a separate delay.
Diagnosing NUMA Allocation Problems Today
The NUMA improvements are not yet in a stable release. Until they ship, numastat -p <pid> shows per-NUMA-node memory distribution for a running process. A process using eight arenas on a two-node system, where each node should ideally own four arenas’ worth of allocation, will show lopsided node distribution if jemalloc is ignoring topology.
jemalloc’s mallctl stats expose per-arena resident memory through stats.arenas.<n>.resident. Cross-referencing per-arena resident values against numastat output identifies which arenas are allocating from which nodes. Advancing the epoch is required before reading any stats:
#include <jemalloc/jemalloc.h>  /* je_ prefix depends on build configuration */

uint64_t epoch = 1;
size_t sz = sizeof(epoch);
/* Writing any value to "epoch" refreshes the snapshot behind stats.* */
je_mallctl("epoch", &epoch, &sz, &epoch, sz);

size_t arena_resident;
sz = sizeof(arena_resident);
/* Resident bytes for arena 0; substitute the arena index of interest. */
je_mallctl("stats.arenas.0.resident", &arena_resident, &sz, NULL, 0);
For services where NUMA effects are significant, the combination of perf mem to measure memory access latency distributions and numactl to run experiments with explicit node binding confirms whether topology-blind allocation is contributing to latency. In most cases where working sets fit in L3 cache, NUMA distance does not matter. In cases where they do not fit, topology-blind allocation can be a consistent source of tail latency that is easy to miss because it appears as general slowness rather than a discrete event.
The gap between what jemalloc’s arena model was designed to do and what modern server hardware requires has been narrowing incrementally since 5.0. The NUMA and THP work closes the most significant remaining piece.