Three Allocators, Three Failure Modes: What Meilisearch Found Under the Hood

Clément Renault published an investigation into Meilisearch’s memory allocator choices that is worth reading carefully, because it surfaces something most allocator comparisons skip. The interesting result is not which allocator wins on throughput benchmarks. It is that jemalloc, bumpalo, and mimalloc each fail in a categorically different way, and the tools that diagnose one failure mode will not help you find the others.

This matters if you are running a Rust service in production and watching RSS climb. The tools and intuitions you bring to the investigation depend on which failure mode you are dealing with.

The Workload That Makes Allocators Matter

Meilisearch has a memory profile that stresses allocators in two distinct phases. During indexing, the pipeline allocates many short-lived structures of widely varying sizes: inverted index builders, field norms, prefix trees, intermediate sort buffers. That mix of sizes and lifetimes is exactly the workload that exposes fragmentation. During search, allocations are smaller and more uniform, scoped to individual queries. The system also uses LMDB for its primary data store, which relies on mmap, so some RSS is file-backed page cache mapped into the process rather than memory handed out by the heap allocator. Separating those two sources is the first diagnostic challenge.
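One way to separate the two sources on Linux is to read the per-process RSS breakdown that the kernel exposes in /proc/self/status. The sketch below assumes a kernel new enough to report the RssAnon and RssFile fields (anonymous memory, which includes the heap, versus file-backed mappings such as an mmap’d LMDB file). The function names are hypothetical and not taken from Meilisearch.

```rust
use std::fs;

/// Extracts a `kB` field such as `RssAnon` or `RssFile` from the text of
/// `/proc/<pid>/status`. Returns the value in KiB, or `None` if the field
/// is absent or malformed.
fn status_field_kib(status: &str, field: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        let (name, rest) = line.split_once(':')?;
        if name.trim() != field {
            return None;
        }
        // The value looks like "   123456 kB"; take the leading number.
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Reads the current process's (anonymous, file-backed) RSS in KiB.
/// Linux-only: returns `None` where the file or the fields do not exist.
fn rss_breakdown_kib() -> Option<(u64, u64)> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let anon = status_field_kib(&status, "RssAnon")?;
    let file = status_field_kib(&status, "RssFile")?;
    Some((anon, file))
}
```

If the anonymous figure is flat while the file-backed figure grows, the growth is page cache from the memory-mapped store, and no allocator change will move it.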

The team’s baseline was glibc’s malloc on Linux. For a representative workload, they observed RSS/allocated ratios near 3:1 after indexing, meaning the process held roughly three times as much resident memory as it had allocated to live objects. This is a fragmentation signature: the allocator has reserved pages from the OS that contain a mix of live and freed objects, and it cannot return those pages until the freed regions are large enough and contiguous enough to give back.

Jemalloc: Leaky by Design

Jemalloc was originally developed for FreeBSD and is now used at scale by Meta, Mozilla, and a wide range of Rust infrastructure projects via the tikv-jemallocator crate.

[dependencies]
tikv-jemallocator = "0.5"

#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

Jemalloc reduces fragmentation through a size-class system: small allocations are bucketed into fixed size classes and packed tightly into runs. This brought Meilisearch’s fragmentation ratio from ~3:1 down to around 1.3:1, a substantial improvement.
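To make the size-class idea concrete, here is a simplified sketch of rounding a request up to a class. It assumes 16-byte spacing up to 128 bytes and four classes per power-of-two doubling above that, which resembles jemalloc’s layout but is not its actual table. The function names are illustrative.

```rust
/// Rounds a request up to a size class. Simplified illustration only:
/// 16-byte spacing up to 128 bytes, then four classes per doubling
/// (160, 192, 224, 256, 320, 384, ...).
fn size_class(request: usize) -> usize {
    let request = request.max(1);
    let step = if request <= 128 {
        16
    } else {
        // For (128, 256] the step is 32; for (256, 512] it is 64; and so on.
        request.next_power_of_two() / 8
    };
    // Round up to the next multiple of `step`.
    (request + step - 1) / step * step
}

/// Bytes wasted inside one allocation (internal fragmentation).
fn internal_waste(request: usize) -> usize {
    size_class(request) - request
}
```

Under this scheme a request above 128 bytes wastes at most about a fifth of its class, which is the trade the allocator makes: a small, bounded amount of internal waste in exchange for freed slots that any later request of the same class can reuse.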

The failure mode is different from fragmentation. After a large indexing batch completes, jemalloc holds freed pages in per-arena dirty lists and thread-local caches. From the OS perspective, the process RSS stays elevated. The pages are not fragmented; they are simply retained for reuse. This behavior is intentional: future allocations will draw from those dirty pages without going back to the kernel, which is faster. But to an operator watching a dashboard, the process looks like it is leaking.

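You can tune this with MALLOC_CONF (when jemalloc is built with a symbol prefix, which is tikv-jemallocator’s default on most platforms, the variable is named _RJEM_MALLOC_CONF instead):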

MALLOC_CONF=narenas:4,background_thread:true,dirty_decay_ms:5000,muzzy_decay_ms:10000

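The dirty_decay_ms and muzzy_decay_ms settings control how long jemalloc holds freed pages before returning them to the OS: dirty_decay_ms is roughly how long an unused dirty page waits before it is purged, and muzzy_decay_ms is the same for pages that have already been marked lazily reclaimable. Setting both to zero forces immediate return, but destroys the caching benefit and drops throughput. With background_thread:true, a dedicated thread does the purging asynchronously so application threads do not stall on it.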

Jemalloc also exposes a mallctl API via tikv-jemalloc-ctl that lets you read fragmentation metrics at runtime:

use tikv_jemalloc_ctl::{epoch, stats};

fn log_frag_ratio() {
    // jemalloc caches its statistics; advancing the epoch refreshes them.
    epoch::mib().unwrap().advance().unwrap();
    let allocated = stats::allocated::mib().unwrap().read().unwrap();
    let resident = stats::resident::mib().unwrap().read().unwrap();
    println!("frag_ratio={:.2}", resident as f64 / allocated as f64);
}

This observability advantage is real. When Meilisearch needed to understand whether an RSS spike was fragmentation, arena retention, or something else, jemalloc’s profiling support gave them answers that the other allocators could not.
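A rough way to tell fragmentation from retention is to compare three counters rather than two. The sketch below assumes the inputs come from jemalloc’s stats.allocated, stats.active, and stats.resident. The types, the threshold, and the classification rule are hypothetical illustrations, not Meilisearch’s logic.

```rust
/// A snapshot of three jemalloc counters, all in bytes.
#[derive(Debug, Clone, Copy)]
struct HeapSnapshot {
    allocated: u64, // bytes the application currently has allocated
    active: u64,    // bytes in pages that back those allocations
    resident: u64,  // bytes physically resident in allocator-managed memory
}

#[derive(Debug, PartialEq)]
enum Diagnosis {
    Healthy,
    Fragmentation,
    Retention,
}

impl HeapSnapshot {
    /// Space lost inside active pages: partly filled slabs and size-class rounding.
    fn fragmentation_bytes(&self) -> u64 {
        self.active.saturating_sub(self.allocated)
    }

    /// Resident memory that backs no live allocation: mostly dirty pages kept
    /// for reuse, plus allocator metadata. (Not jemalloc's `stats.retained`,
    /// which counts virtual address space.)
    fn idle_resident_bytes(&self) -> u64 {
        self.resident.saturating_sub(self.active)
    }

    /// Classifies the snapshot. `max_ratio` is the resident/allocated ratio
    /// the operator is willing to tolerate, for example 1.5.
    fn diagnose(&self, max_ratio: f64) -> Diagnosis {
        let ratio = self.resident as f64 / self.allocated.max(1) as f64;
        if ratio < max_ratio {
            Diagnosis::Healthy
        } else if self.fragmentation_bytes() >= self.idle_resident_bytes() {
            Diagnosis::Fragmentation
        } else {
            Diagnosis::Retention
        }
    }
}
```

The design choice is to report which component dominates rather than a single ratio, because the two remedies differ: retention responds to decay tuning, while fragmentation does not.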

Bumpalo: Leaky by Accident

Bumpalo is an arena allocator. It maintains a slab of memory and satisfies allocations by bumping a pointer. Deallocation of individual items is a no-op; the entire arena is freed at once when it is dropped. For workloads that fit this model, the performance advantage is significant: allocation costs around 2-5 ns versus 20-100 ns for a general-purpose allocator, with zero fragmentation within the arena’s lifetime.

use bumpalo::Bump;

let bump = Bump::new();
let x = bump.alloc(42u32);
let s = bump.alloc_str("transient scratch");
// All memory freed when `bump` is dropped
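The pointer-bumping mechanism itself fits in a few lines. The following is a toy model, not bumpalo’s implementation: it uses a single fixed buffer and hands out offsets instead of references so that the example needs no unsafe code. The ToyBump type and its methods are made up for illustration.

```rust
/// A toy bump arena over one fixed buffer.
struct ToyBump {
    buf: Vec<u8>,
    next: usize, // offset of the first unused byte
}

impl ToyBump {
    fn with_capacity(capacity: usize) -> Self {
        ToyBump { buf: vec![0; capacity], next: 0 }
    }

    /// Reserves `size` bytes at an offset that is a multiple of `align`
    /// (a power of two). Returns the offset, or `None` if the buffer is full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round the cursor up to the requested alignment.
        let start = (self.next + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.next = end; // the "bump": allocation is one add and one compare
        Some(start)
    }

    /// Access to a previously reserved region.
    fn slice_mut(&mut self, start: usize, size: usize) -> &mut [u8] {
        &mut self.buf[start..start + size]
    }

    /// There is no per-object free; resetting reclaims everything at once.
    fn reset(&mut self) {
        self.next = 0;
    }

    fn used(&self) -> usize {
        self.next
    }
}
```

Note that used() only ever grows between resets. That property is what makes the arena fast, and it is also what makes a misplaced arena lifetime expensive.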

Meilisearch used bumpalo for per-request scratch space: query parsing, facet computation, intermediate sort buffers. The pattern is sound. The failure mode is a scope bug that is easy to introduce and hard to notice.

In one case, an Arc<Bump> was stored in a struct that lived for the entire lifetime of an open index. Every search request that allocated into that arena added to the live set permanently. The arena never dropped. RSS climbed steadily with query volume, with no error messages, no allocation failures, and no obvious cause. Tools that look for leaked pointers or dangling references will not catch this because nothing is technically wrong with the memory; it is all correctly owned by a live arena.

The fix is architectural:

// Scope the arena tightly to the operation that needs it
fn handle_search(index: &Index, query: &str) -> Results {
    let arena = Bump::new();
    let parsed = parse_with_arena(&arena, query);
    // Results must own its data; nothing borrowed from the arena may escape
    compute_results(index, parsed)
    // arena drops here; all scratch memory is released
}

The diagnosis for this failure mode requires tracing arena lifetimes through the ownership graph, not inspecting allocator internals. Heap profilers that sample allocations by call site will eventually show the accumulation, but you need to correlate the growth rate with query volume, not time, to identify the pattern.
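The difference between the two shapes shows up if you track arena size against request count. The sketch below uses a stand-in arena that only counts bytes; every type and function in it is hypothetical. With real bumpalo, Bump::allocated_bytes() could plausibly serve as the equivalent metric.

```rust
/// Stand-in for a bump arena: it counts the bytes handed out and never
/// frees individual allocations.
#[derive(Default)]
struct CountingArena {
    allocated_bytes: usize,
}

impl CountingArena {
    fn alloc_scratch(&mut self, bytes: usize) {
        self.allocated_bytes += bytes;
    }
}

/// Leaky shape: one arena owned by a handle that lives as long as the index.
#[derive(Default)]
struct LongLivedIndex {
    arena: CountingArena,
}

impl LongLivedIndex {
    /// Returns the arena size after serving the request.
    fn search(&mut self, query: &str) -> usize {
        self.arena.alloc_scratch(query.len());
        self.arena.allocated_bytes
    }
}

/// Scoped shape: a fresh arena per request, dropped when the function returns.
/// Returns the arena size at the end of the request.
fn scoped_search(query: &str) -> usize {
    let mut arena = CountingArena::default();
    arena.alloc_scratch(query.len());
    arena.allocated_bytes
}
```

Plotting the first number against cumulative queries gives a straight line with a positive slope, while the second stays flat. That is the correlation with query volume rather than wall-clock time that identifies this class of bug.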

Mimalloc: The Trade-off That Changes Shape

Mimalloc came out of Microsoft Research in 2019 with a paper describing its core innovation: free list sharding. Instead of one large free list per size class, each page carries its own free lists: lists that only the owning thread touches, plus a separate thread-free list that other threads push onto atomically for cross-thread frees. This structure eliminates most lock contention on the hot path.

[dependencies]
mimalloc = "0.1"

use mimalloc::MiMalloc;
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
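Free list sharding is easier to picture with a toy model of a single page. This sketch is an assumption-laden simplification: blocks are plain indices, and a Mutex<Vec<_>> stands in for the lock-free list that real mimalloc uses for cross-thread frees. None of these names come from mimalloc’s API.

```rust
use std::sync::Mutex;

/// Toy model of one allocator page with sharded free lists.
struct ToyPage {
    free: Vec<usize>,               // blocks ready to hand out (owner only)
    local_free: Vec<usize>,         // blocks freed by the owning thread
    thread_free: Mutex<Vec<usize>>, // blocks freed by other threads
}

impl ToyPage {
    fn new(blocks: usize) -> Self {
        ToyPage {
            free: (0..blocks).rev().collect(),
            local_free: Vec::new(),
            thread_free: Mutex::new(Vec::new()),
        }
    }

    /// Fast path: pop from `free` with no synchronization.
    /// Slow path: only when `free` is empty, gather the other two lists.
    fn alloc(&mut self) -> Option<usize> {
        if self.free.is_empty() {
            self.collect();
        }
        self.free.pop()
    }

    /// Free performed by the owning thread: no synchronization needed.
    fn free_local(&mut self, block: usize) {
        self.local_free.push(block);
    }

    /// Free performed by any other thread: goes to the shared list only.
    fn free_from_other_thread(&self, block: usize) {
        self.thread_free.lock().unwrap().push(block);
    }

    /// Moves all deferred frees back into the allocation list.
    fn collect(&mut self) {
        self.free.append(&mut self.local_free);
        let mut remote = self.thread_free.lock().unwrap();
        self.free.append(&mut *remote);
    }

    /// Number of blocks immediately available on the fast path.
    fn ready_blocks(&self) -> usize {
        self.free.len()
    }
}
```

The point of the split is that the common operations touch only owner-private lists. The cost is that freed blocks do not become reusable until the owner next runs the slow path, so the work of absorbing them lands on whichever allocation happens to trigger it.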

Mimalloc’s RSS behavior is the best of the three for operators who care about memory returning to the OS quickly. It uses madvise aggressively to hand freed pages back to the kernel, so RSS drops after indexing spikes faster than with jemalloc’s decay-based approach. For Meilisearch’s workload, mimalloc reached the lowest settled RSS of any allocator tested.

The failure mode here is latency, not memory. Mimalloc’s “delayed free” mechanism processes cross-thread deallocations on the owning thread’s next allocation, which introduces non-determinism into deallocation timing. In Meilisearch’s search path, this showed up as p99 latency spikes that did not appear under steady load but emerged under mixed indexing-and-search workloads. The median latency was comparable to jemalloc; the tail was not.
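Catching this kind of regression means tracking tail percentiles alongside the median. A minimal nearest-rank percentile helper might look like the following. The function name and the integer-percent interface are choices made for this sketch, not something described in the investigation.

```rust
/// Nearest-rank percentile of latency samples (any unit).
/// `pct` is an integer percentage in 1..=100. Returns `None` for an empty
/// slice or an out-of-range percentage.
fn percentile(samples: &[u64], pct: usize) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: ceil(pct / 100 * n), computed in integers to avoid
    // floating-point rounding at the boundaries.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}
```

A run in which a small share of requests absorb a burst of deferred frees can have exactly the same p50 as a clean run while its p99 is many times higher, which is why a dashboard that only shows the mean or the median would miss this failure mode.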

Mimalloc is also lighter on introspection. Where jemalloc exposes a rich mallctl interface for production monitoring, mimalloc offers basic statistics through its extended API. If you are debugging a production memory issue, that matters.

What the Three Cases Have in Common

Each allocator surfaces a different dimension of the same underlying problem: the gap between what the application thinks it is using and what the OS observes. Jemalloc’s arena retention is visible in RSS but not in allocated bytes. Bumpalo’s scope bug is visible in allocated bytes but only if you profile at the right granularity. Mimalloc’s delayed free is visible in latency histograms, not memory graphs at all.

The practical lesson from Meilisearch’s investigation is that allocator selection should be paired with a monitoring strategy tuned to that allocator’s failure modes. Switching allocators without also changing how you observe memory will leave some failure modes invisible.

Meilisearch settled on jemalloc as the default global allocator for Linux, kept bumpalo for short-lived hot paths with enforced lifetime discipline, and recommends mimalloc for Windows builds and for deployments where fast OS page return outweighs tail latency concerns. That is not a single answer; it is a taxonomy of the problem.

The mimalloc-bench suite from Microsoft shows mimalloc ahead of jemalloc by 5-15% on multi-threaded allocation benchmarks. In practice, for a mixed workload like Meilisearch’s, that gap is secondary to fragmentation characteristics and observability. The allocator you can reason about in production is often worth more than the one that wins in a benchmark.