How PostgreSQL, LLVM, and Meilisearch All Arrived at the Same Two-Allocator Architecture
Source: lobsters
PostgreSQL solved the two-phase allocation problem in the early 1990s. Meilisearch, through three allocator experiments with jemalloc, bumpalo, and mimalloc, arrived at the same structural answer: a general-purpose allocator for persistent state, an arena allocator for temporary work. Clément Renault’s post on the experience is a useful incident report, documenting each allocator’s failure mode in production. The pattern it exposes is older than any of the three allocators under evaluation.
The Phase Problem
Search engines run two allocation profiles that barely resemble each other. Indexing allocates heavily and irregularly: posting list entries of variable size, sort buffers that expand and contract, per-document string normalization, merge scratch space. The allocations are interspersed and freed in unpredictable order, which is the textbook external fragmentation scenario. Query serving allocates more predictably and at high volume: small score arrays, per-query candidate lists, filter bitmaps, tokenized slices. These are uniform in size, short-lived, and turn over at query throughput rates.
jemalloc handles both profiles competently, but it handles them together less well than it handles them separately. Indexing frees large amounts of varied-size memory in bursts; jemalloc’s dirty page decay timer (dirty_decay_ms, default 10 seconds) holds those pages in a decay queue rather than returning them to the OS immediately. This amortizes the cost of madvise syscalls across time, which is the right call for a general workload. The consequence for a search engine is that RSS stays elevated after indexing bursts even when the working set has shrunk substantially. jemalloc is making a correct heuristic decision. It cannot know the phase is over because no one has told it.
PostgreSQL’s Answer
PostgreSQL’s memory context system, present since the early 1990s, addresses the phase problem by making phase boundaries explicit in code rather than inferred from allocation patterns. A MemoryContext is an arena scoped to an operation. The query executor creates a context at query start, a child context for portal execution, and a per-tuple context that is reset between rows:
MemoryContext per_tuple_ctx = AllocSetContextCreate(
    CurrentMemoryContext,
    "per-tuple context",
    ALLOCSET_SMALL_SIZES);

foreach(tup, tuples) {
    MemoryContext old = MemoryContextSwitchTo(per_tuple_ctx);
    /* Evaluate expressions, coerce types, compute projections */
    Datum result = ExecEvalExpr(expr, econtext, &isnull);
    MemoryContextSwitchTo(old);
    /* Consume or copy result before the reset: anything allocated while
     * per_tuple_ctx was current is gone afterwards */
    MemoryContextReset(per_tuple_ctx);
}
MemoryContextReset frees every allocation in the context in one call, regardless of how many individual allocations the phase produced. The caller knows the phase is over and says so. There is no decay queue and no heuristic; the allocator is not guessing at boundaries from usage patterns. The code states them directly.
PostgreSQL’s design is hierarchical: contexts form a tree rooted at TopMemoryContext, with transaction, query, and execution contexts as nested children. MemoryContextDelete destroys a subtree. The econtext->ecxt_per_tuple_memory context is reset after each row; the surrounding query context persists for the query’s full lifetime. The granularity of each context matches the granularity of the phase it serves, and the boundaries between them are stated in code rather than approximated by a timer.
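The tree structure can be sketched in a few lines of Rust. This is a toy model under stated assumptions, not PostgreSQL's implementation: the `Context` type, its methods, and `simulate_query` are all hypothetical names, and `reset` is modeled on the documented behavior of MemoryContextReset in recent PostgreSQL versions (release the context's own allocations and delete its descendants).

```rust
/// Toy model of a hierarchical memory context (hypothetical, not PostgreSQL code).
#[derive(Default)]
struct Context {
    /// Allocations owned by this context; clearing the Vec frees them all.
    chunks: Vec<Box<[u8]>>,
    /// Child contexts; dropping a child drops its whole subtree.
    children: Vec<Context>,
}

impl Context {
    fn new() -> Self {
        Context::default()
    }

    /// Allocate `size` zeroed bytes owned by this context.
    fn alloc(&mut self, size: usize) -> &mut [u8] {
        self.chunks.push(vec![0u8; size].into_boxed_slice());
        &mut self.chunks.last_mut().unwrap()[..]
    }

    /// Create a child context and return a handle to it.
    fn create_child(&mut self) -> &mut Context {
        self.children.push(Context::new());
        self.children.last_mut().unwrap()
    }

    /// Bytes held by this context and all of its descendants.
    fn total_bytes(&self) -> usize {
        let own: usize = self.chunks.iter().map(|c| c.len()).sum();
        own + self.children.iter().map(Context::total_bytes).sum::<usize>()
    }

    /// Release everything in this context and its subtree; the context
    /// itself stays usable for the next phase.
    fn reset(&mut self) {
        self.chunks.clear();
        self.children.clear();
    }
}

/// Simulates a query: a query-lifetime context with a per-tuple child that is
/// reset after every row. Returns (bytes held after the row loop, bytes held
/// after the top-level reset).
fn simulate_query(rows: usize) -> (usize, usize) {
    let mut top = Context::new();
    let query = top.create_child();
    query.alloc(1024); // state that lives for the whole query
    let per_tuple = query.create_child();
    for _ in 0..rows {
        per_tuple.alloc(64); // per-row scratch
        per_tuple.alloc(200);
        per_tuple.reset(); // phase boundary stated explicitly
    }
    let held_after_rows = top.total_bytes();
    top.reset(); // tears down the query subtree in one call
    (held_after_rows, top.total_bytes())
}
```

However many rows are processed, only the query-lifetime allocation is still held when the loop ends, because the per-tuple context is emptied at each explicit boundary.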
The Same Pattern Everywhere
LLVM’s BumpPtrAllocator does this for compilation. AST nodes, IR instructions, and metadata objects for a compilation unit are allocated into a bump arena that lives exactly as long as the compilation unit. LLVM’s documentation notes 5-10x allocation throughput improvement for IR node creation compared to individual new calls, which is the expected result when allocation is a pointer increment and deallocation is a single free at the end.
V8’s Zone allocator serves the same purpose for Turbofan IR. Each compilation pass creates a Zone, allocates its IR nodes into it, and destroys the Zone when the pass completes. The Zone’s lifetime is the pass’s lifetime, a natural phase boundary in the JIT pipeline. Zone memory never fragments the general heap because it never touches the general heap as individual objects; it enters as one backing slab and leaves the same way.
RocksDB’s Arena backs memtable allocations in the same structure. When a memtable fills and is flushed to an L0 SSTable, the arena is destroyed and all its memory is reclaimed in one operation. The arena’s lifetime is the write buffer’s lifetime, a phase boundary that the LSM write path makes unambiguous.
The structural answer repeats because it addresses a genuine constraint. A general-purpose allocator handles every allocation without knowing why it was made or how long it will last. When an indexing batch ends and thousands of varied-size allocations are freed, the allocator sees thousands of individual frees, manages their effect on free lists and size bins, and eventually decides through heuristics when to return pages. An arena allocator is handed the missing information explicitly. When drop(bump) or MemoryContextReset runs, the phase is over and every allocation from that phase is gone: one operation, no inference required.
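To make the cost difference concrete, here is a minimal bump arena in safe Rust. It is a sketch, not how any of the allocators above are implemented: `ToyBump` and its methods are hypothetical, it hands out byte ranges rather than pointers so that no unsafe code is needed, it aligns offsets relative to the buffer start rather than absolute addresses, and it refuses to grow where a real arena would request another chunk.

```rust
use std::ops::Range;

/// Toy bump arena over one fixed buffer (hypothetical illustration).
struct ToyBump {
    buf: Vec<u8>,
    /// Offset of the first free byte.
    next: usize,
}

impl ToyBump {
    fn with_capacity(cap: usize) -> Self {
        ToyBump { buf: vec![0; cap], next: 0 }
    }

    /// Allocation: round the cursor up to `align`, bounds-check, advance.
    /// There is no free list and no size class. `align` must be a power of two.
    fn alloc(&mut self, size: usize, align: usize) -> Option<Range<usize>> {
        debug_assert!(align.is_power_of_two());
        let start = (self.next + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // a real arena would allocate a new chunk here
        }
        self.next = end;
        Some(start..end)
    }

    /// Access the bytes behind a previously returned range.
    fn bytes_mut(&mut self, range: Range<usize>) -> &mut [u8] {
        &mut self.buf[range]
    }

    /// End of phase: every allocation is released by resetting one integer.
    fn reset(&mut self) {
        self.next = 0;
    }

    /// Bytes consumed so far, including alignment padding.
    fn used(&self) -> usize {
        self.next
    }
}
```

The reset cost is independent of how many allocations the phase made, which is the property the general-purpose allocator cannot offer when it sees each free individually.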
What bumpalo Provides in Rust
bumpalo is the Rust incarnation of this pattern. Bump::alloc is a bounds check and a pointer bump (bumpalo bumps downward through its chunk, but the cost is the same). Individual deallocation is unsupported. Everything is released at once when the Bump drops or reset() is called.
use bumpalo::Bump;
use bumpalo::collections::Vec as BVec;

fn index_document<'bump>(doc: &Document, bump: &'bump Bump) -> IndexEntry {
    let mut tokens: BVec<&str> = BVec::new_in(bump);
    let mut postings: BVec<u32> = BVec::new_in(bump);
    tokenize_into(&doc.body, &mut tokens);
    for token in &tokens {
        postings.extend(lookup_positions(token));
    }
    // IndexEntry owns its data on the global heap; no arena references escape
    IndexEntry::build(&postings)
}

fn index_batch(docs: &[Document]) {
    let bump = Bump::new();
    for doc in docs {
        let entry = index_document(doc, &bump);
        write_to_lmdb(entry);
    }
    // bump drops here; all intermediate allocations freed in one operation
}
bumpalo’s improvement over PostgreSQL’s memory contexts is in where the correctness guarantee lives. In PostgreSQL, nothing prevents allocating into a short-lived context and storing the result somewhere that outlives it. The bug manifests at runtime, usually far from the allocation site. In Rust, the borrow checker rejects this at compile time. References vended by a Bump carry the 'bump lifetime and cannot outlive the Bump itself. If IndexEntry::build tried to return a reference into the arena rather than owned data, the compiler would reject the code before it ever runs.
This is a genuine narrowing of the error surface. The class of PostgreSQL memory context bug where a short-lived context is used when a long-lived one was intended cannot arise in safe Rust; the borrow checker catches it at the reference level.
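The same lifetime rule can be shown without bumpalo. In the sketch below a plain String buffer stands in for the arena; `Scratch`, `first_token`, and `index_title` are hypothetical names chosen for illustration. The accepted version copies data out before the scratch space is dropped, and the rejected version is left in a comment because it does not compile.

```rust
/// Stand-in for arena-backed scratch space that lives for one phase.
struct Scratch {
    normalized: String,
}

/// Borrows from the scratch buffer; the result cannot outlive `scratch`.
fn first_token<'s>(scratch: &'s Scratch) -> &'s str {
    scratch.normalized.split_whitespace().next().unwrap_or("")
}

/// Accepted: the borrowed token is copied into owned data before the
/// scratch buffer is dropped at the end of the function.
fn index_title(title: &str) -> String {
    let scratch = Scratch { normalized: title.to_lowercase() };
    let token = first_token(&scratch);
    token.to_owned()
}

// Rejected at compile time: the returned &str would point into `scratch`,
// which is a local that is dropped when the function returns.
//
// fn index_title_leaky(title: &str) -> &str {
//     let scratch = Scratch { normalized: title.to_lowercase() };
//     first_token(&scratch)
// }
```

The equivalent mistake in C compiles cleanly and fails later, which is the difference the paragraph above describes.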
What the borrow checker cannot enforce is where the arena itself sits in the ownership hierarchy. This is the failure Meilisearch hit: a Bump stored inside an Arc<IndexState> that lived for the full lifetime of an open index. Every search request allocated into it; nothing ever freed it. The borrow checker confirmed that references into the arena were correctly scoped relative to the arena. It had no opinion on whether the arena was correctly scoped relative to the operations it was meant to serve. That judgment belongs to the programmer, as it does when choosing the right MemoryContextSwitchTo target in PostgreSQL code. The type system covers half the problem; the other half is design review.
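The placement problem compiles either way, which is why it needs review rather than a type error. The sketch below uses a byte-counting stand-in instead of a real arena; `CountingArena`, `IndexState` as written here, the function names, and the 4096-byte scratch size are all assumptions for illustration, and a `&mut` borrow replaces the Arc to keep the example short.

```rust
/// Byte-counting stand-in for an arena: it only tracks how much it holds.
#[derive(Default)]
struct CountingArena {
    held: usize,
}

impl CountingArena {
    fn alloc(&mut self, bytes: usize) {
        self.held += bytes;
    }
}

/// Hypothetical long-lived state that owns the arena (the failing layout).
#[derive(Default)]
struct IndexState {
    arena: CountingArena,
}

/// Assumed scratch memory needed by one query.
const SCRATCH_PER_QUERY: usize = 4096;

/// Arena owned by long-lived state: scratch from every query accumulates
/// because nothing ever resets or drops the arena.
fn search_with_shared_arena(state: &mut IndexState) {
    state.arena.alloc(SCRATCH_PER_QUERY);
}

/// Arena owned by the request: returns the bytes held during the request,
/// all of which are released when `arena` drops at the end of the function.
fn search_with_request_arena() -> usize {
    let mut arena = CountingArena::default();
    arena.alloc(SCRATCH_PER_QUERY);
    arena.held
}

/// Bytes still held by the shared arena after `n` queries.
fn held_after_queries(n: usize) -> usize {
    let mut state = IndexState::default();
    for _ in 0..n {
        search_with_shared_arena(&mut state);
    }
    state.arena.held
}
```

Both layouts satisfy the borrow checker. Only the second ties the arena's lifetime to the operation it serves, so only the second has bounded memory use.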
Why jemalloc Stays
tikv-jemallocator wires jemalloc in as Rust’s global allocator with minimal ceremony. For everything that does not fit neatly into a scoped arena, jemalloc’s size-class bins and arena model handle long-running mixed workloads better than the system allocator. Meilisearch’s experiments documented fragmentation ratios around 1.3:1 (resident to allocated bytes) with jemalloc versus roughly 3:1 with glibc malloc under indexing-heavy traffic. That difference is operationally meaningful.
The decisive production advantage is the mallctl API, accessible from Rust through tikv-jemalloc-ctl:
use tikv_jemalloc_ctl::{epoch, stats};

// jemalloc caches its statistics; advancing the epoch refreshes them
epoch::mib().unwrap().advance().unwrap();
let allocated = stats::allocated::mib().unwrap().read().unwrap();
let resident = stats::resident::mib().unwrap().read().unwrap();
let ratio = resident as f64 / allocated as f64;
Expose that ratio as a Prometheus gauge, correlate it with indexing activity, and you can distinguish fragmentation from retention from a genuine leak without restarting the process. Neither mimalloc nor bumpalo offers equivalent introspection. This observability gap was the deciding factor in keeping jemalloc as Meilisearch’s Linux default even after mimalloc showed better raw allocation throughput on synthetic benchmarks.
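Turning the two counters into a scrapeable gauge takes little code. The following is a sketch that assumes the byte counts have already been read as above; the function names and the metric name are hypothetical, and the output follows the Prometheus text exposition format.

```rust
/// Resident-to-allocated ratio, or None when nothing is allocated yet
/// (avoids a division by zero at startup).
fn fragmentation_ratio(resident: usize, allocated: usize) -> Option<f64> {
    if allocated == 0 {
        None
    } else {
        Some(resident as f64 / allocated as f64)
    }
}

/// Renders the ratio as a Prometheus gauge in text exposition format.
/// Returns an empty string when the ratio is undefined.
/// The metric name is a placeholder, not one Meilisearch is known to export.
fn render_gauge(resident: usize, allocated: usize) -> String {
    let name = "allocator_resident_to_allocated_ratio";
    match fragmentation_ratio(resident, allocated) {
        Some(ratio) => format!("# TYPE {name} gauge\n{name} {ratio:.3}\n"),
        None => String::new(),
    }
}
```

Plotted against indexing activity, a ratio that rises during a burst and decays afterwards points at retention, while one that stays high long after the burst suggests fragmentation.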
mimalloc’s failure under mixed load illustrates why benchmark throughput is not the binding constraint. Its delayed-free mechanism accumulates cross-thread deallocations in a per-page free list and processes them on the owning thread’s next allocation. Under concurrent indexing and query traffic, query-serving threads periodically stall to drain accumulated cross-thread frees from indexing threads. p99 latency degrades while the median stays flat. Synthetic benchmark suites like mimalloc-bench miss this because they run workloads in isolation; the regression only appears under interleaved load.
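The way a small fraction of stalled requests moves p99 while leaving the median untouched is easy to check numerically. The sketch below uses synthetic numbers that are assumptions, not measurements from Meilisearch: the helper names are hypothetical, and percentiles use the nearest-rank definition.

```rust
/// Nearest-rank percentile (p in 1..=100) over latency samples in microseconds.
fn percentile(samples: &[u64], p: usize) -> Option<u64> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // 1-based rank = ceil(p / 100 * n), computed in integer arithmetic
    let rank = (p * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// Synthetic latencies: `total` requests at `base_us`, of which the first
/// `stalled` pay an extra `stall_us` (e.g. draining cross-thread frees).
fn synthetic_latencies(total: usize, stalled: usize, base_us: u64, stall_us: u64) -> Vec<u64> {
    (0..total)
        .map(|i| if i < stalled { base_us + stall_us } else { base_us })
        .collect()
}
```

With 1000 requests at 1000 µs each and 20 of them stalled for an extra 50000 µs, `percentile(&samples, 50)` is still 1000 while `percentile(&samples, 99)` is 51000. A benchmark that reports only mean or median throughput would not register the change.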
What the Stabilizing Allocator API Changes
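The nightly-only Allocator trait, once stabilized, will make the two-tier pattern available through standard library collections without bumpalo’s parallel types. Vec<T, &Bump> already works on nightly when bumpalo’s allocator_api feature is enabled, and other containers would take the allocator as an extra type parameter in the same way, as hashbrown’s HashMap<K, V, S, A> does today. Arena-backed standard containers, with lifetimes enforced at compile time, make the pattern idiomatic rather than crate-specific.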
What stabilization will not change is the architectural question of where arenas belong relative to the operations they serve. PostgreSQL’s codebase has an extensive contributor guide covering which memory context is correct for different kinds of executor state, because getting it wrong produces bugs that are recurring and difficult to trace even for experienced contributors. The failure mode is the same as the Bump stored inside a long-lived Arc<IndexState>: a context or arena scoped too broadly, accumulating memory that is not leaked in the safety sense but is not being reclaimed when the work ends.
The type system handles reference scoping. The programmer handles arena placement. This split has been the case since PostgreSQL memory contexts were first introduced, and it remains the case in every language that has adopted the arena pattern since. The abstraction boundary has not moved; it is just enforced on one side now where it was previously advisory.