· 7 min read ·

Phase Blindness: What Meilisearch's Allocator Experiments Reveal About Memory Under Real Load

Source: lobsters

The memory allocator is one of those components that most programs never need to think about. Pick the system default, ship it, and the allocator quietly handles every Box::new, Vec::push, and HashMap::insert for the life of the process. For most programs this is the right call. For a search engine doing sustained indexing under concurrent query load, it is almost certainly wrong, and the recent investigation by Clément Renault (kerollmops) into Meilisearch’s allocator behavior illustrates why in concrete terms.

The title signals what the experiments found: one allocator behaved well, one misbehaved, and one leaked. Working out which is which, and why the outcomes were predictable, requires understanding the shape of the problem those allocators are being asked to solve.

The Bimodal Memory Problem in Search Engines

Meilisearch has two fundamentally different workloads running in the same process. During indexing, the engine ingests documents, tokenizes them, constructs intermediate posting list data, merges sorted runs, and writes everything to LMDB. During search, the engine parses queries, intersects Roaring Bitmap posting lists from LMDB’s memory-mapped pages, and scores and ranks candidates.

These two workloads produce almost opposite allocation patterns.

Indexing allocates heavily and variably: short token strings, variable-length posting list entries, merge buffers that expand and contract, per-document metadata blobs. The allocations are ephemeral, but they span a wide range of sizes and live for the duration of processing one document or batch. When a batch finishes and those allocations are freed, they leave holes of many different sizes scattered through the heap.

Search allocates predictably and repetitively: small score accumulators, Roaring Bitmap scratch buffers, result ranking structures. These are short-lived, turn over rapidly, and are much more uniform in size. If the heap is already fragmented from indexing, the spatial locality of these small search allocations degrades, and RSS sits elevated even when most of the heap is technically “free.”

This is the textbook external fragmentation scenario, made worse by the fact that search and indexing can run concurrently. Indexing’s large transient allocations interleave with search’s small persistent ones in the heap, and when the indexing allocations are freed, the heap looks like a gap-filled street: each small search allocation is an inhabited building with an empty lot beside it.

jemalloc: The Reliable Default and Its Decay Behavior

jemalloc (Jason Evans malloc, originally written for FreeBSD in 2005 and maintained primarily by Meta) is the canonical choice for high-performance Rust servers. Meilisearch sets it as the global allocator via tikv-jemallocator:

use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

This single declaration routes all alloc::* operations through jemalloc for the lifetime of the process. The arena model assigns threads to independent heaps (4 × CPU count by default), each with its own free lists, slab metadata, and dirty page tracking. The per-thread tcache makes allocation lock-free on the hot path: small allocations hit the tcache with essentially no synchronization. For search-phase workloads, this is excellent behavior.

The problem surfaces after heavy indexing. jemalloc’s dirty page decay mechanism, controlled by dirty_decay_ms (default 10,000 ms) and muzzy_decay_ms (default 5,000 ms), intentionally holds freed memory in a decay queue rather than immediately returning it to the OS. This amortizes the cost of mmap/munmap syscalls across time. The consequence is that RSS stays elevated for up to ten seconds after a large indexing burst, and under continuous indexing load, the decay queue never fully drains.

You can observe this divergence using tikv-jemalloc-ctl:

use tikv_jemalloc_ctl::{epoch, stats};

epoch::mib().unwrap().advance().unwrap();
let allocated = stats::allocated::mib().unwrap().read().unwrap();
let resident  = stats::resident::mib().unwrap().read().unwrap();
println!("fragmentation ratio: {:.2}", resident as f64 / allocated as f64);

Ratios of 2.0 to 4.0 are commonly reported for search engines under sustained mixed load, even when jemalloc is otherwise performing correctly. The heap is not broken; it is doing exactly what it was designed to do. The disconnect is that jemalloc has no knowledge of which allocations were transient indexing work and which were long-lived search structures. It treats all freed memory with the same decay policy.

One lever Meilisearch can pull is embedding a custom malloc_conf string at build time through tikv-jemalloc-sys to tune decay aggressiveness:

JEMALLOC_SYS_WITH_MALLOC_CONF="background_thread:true,dirty_decay_ms:1000,muzzy_decay_ms:0"

Setting dirty_decay_ms aggressively low reduces RSS peaks at the cost of more frequent syscalls. Whether that trade-off is acceptable depends entirely on the indexing throughput requirements.

mimalloc: Different Architecture, Familiar Challenges

mimalloc (from Microsoft Research, authored by Daan Leijen in 2019) takes a structurally different approach. Rather than global arenas, it uses per-thread heaps with sharded free lists within each page. The sharding separates the owning thread’s local free list, which is entirely lock-free, from a thread_free list that other threads post freed objects to via atomic compare-and-swap. The owning thread merges them periodically. This design makes mimalloc particularly strong under workloads with high cross-thread free rates, which map well to web servers and request-response patterns.

The Rust bindings are available via the mimalloc crate:

#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

mimalloc uses 128 fine-grained size classes for small objects (compared to jemalloc’s roughly 30), which can reduce internal fragmentation for variable-size workloads in theory. Its segment-based memory management operates in 64 MiB chunks, and page headers are stored separately from data to avoid false sharing. The original paper reports ~0.2% overhead per allocated byte, lower than jemalloc for small allocations.

For search workloads specifically, the behavior under bimodal indexing and search load depends on how the allocator handles cross-slab fragmentation after variable-size indexing bursts. mimalloc’s segment structure handles medium and large allocations somewhat differently than jemalloc’s extent model, which can produce better or worse fragmentation outcomes depending on the exact size distribution of a given document corpus. Neither allocator has a structural advantage here without workload-specific tuning; both are making the same fundamental trade-off of speed against memory returned to the OS.

bumpalo: Right Tool, Wrong Scope

bumpalo (authored by Nick Fitzgerald at Fastly) is a bump-pointer allocator for Rust. Allocation is three instructions: check if the remaining space in the current chunk fits the request, return the current pointer, advance by the aligned size. Deallocation of individual objects is not supported; you free everything at once by dropping the Bump or calling reset().

The appeal for Meilisearch’s indexing pipeline is immediate. Tokenization produces many short-lived objects. Per-document intermediate structures are created, used, and discarded in a tight batch loop. This is the use case bumpalo is designed for:

use bumpalo::Bump;
use bumpalo::collections::Vec as BumpVec;

let bump = Bump::new();
let mut tokens: BumpVec<&str> = BumpVec::new_in(&bump);
// fill tokens, build intermediate structures into bump...
drop(bump); // all memory reclaimed at once, zero fragmentation in global heap

Zero allocation reaches the global heap during this phase. Zero holes get punched into the global heap when the phase ends. The indexing intermediates simply vanish.

The catch in a long-running server is chunk retention combined with chunk size ratcheting. reset() does not free bumpalo’s underlying memory chunks; it resets the bump pointer back to the start so the memory can be reused. This is correct and efficient for steady workloads. The problem is that bumpalo doubles chunk sizes when it needs more space. One anomalously large document or batch forces bumpalo to allocate a large chunk, and that chunk persists for the lifetime of the Bump even after reset(). In a server running for days, the arena’s committed memory slowly ratchets upward on each large document and never comes down. This is the “leaky” behavior: not a true leak in the sense that the memory is untracked, but a monotonically growing RSS that is nearly indistinguishable from a real leak when you are staring at a memory graph at 3am.

The committed size is inspectable:

let committed = bump.allocated_bytes_including_metadata();

The idiomatic fix is to pool arenas and discard them when they grow past a threshold:

fn checkout_arena(pool: &mut Vec<Bump>) -> Bump {
    let bump = pool.pop().unwrap_or_default();
    if bump.allocated_bytes_including_metadata() > MAX_ARENA_BYTES {
        Bump::new()
    } else {
        bump
    }
}

This bounds committed memory at the cost of occasional re-mmap on anomalously large documents. The with_capacity constructor helps too: pre-allocating the expected batch size avoids the doubling ratchet for common cases.

The Structural Lesson

What the Meilisearch experiments expose is less about which allocator wins a throughput benchmark and more about the mismatch between workload phase and allocator scope. A general-purpose allocator like jemalloc or mimalloc is designed to serve an entire process with no knowledge of phase boundaries. It cannot know that the spike of variable-size allocations it just handled was transient indexing work that can be aggressively reclaimed. It makes conservative decisions, and conservatism accumulates.

bumpalo knows, but only because you explicitly tell it by calling reset(). The engineering challenge is doing that reliably in a system where indexing and search interleave, where “one batch” has variable and sometimes extreme size, and where the arena itself can be a source of growth if not monitored.

The correct architecture for search engines that care about memory predictability is to segregate allocation by phase: bump arenas for indexing intermediates, the global allocator for persistent search structures, and LMDB’s memory-mapped regions for indexed data (which lives outside the allocator heap entirely). This is harder to maintain than swapping in a better global allocator, but it addresses the fragmentation problem structurally rather than tuning around it.

This is not a problem unique to Meilisearch. Elasticsearch’s indexing-induced GC pressure on the JVM heap has the same root cause: phase-correlated allocations mixed with persistent ones in a single managed heap. In a garbage-collected language you call it GC pressure; in a Rust process you call it RSS fragmentation; but the underlying workload mismatch is the same shape. The allocator is not the problem. The problem is asking one allocator to serve two workloads that differ in every dimension that matters to allocator design.

Was this interesting?