· 8 min read ·

Three Allocators, Two Workloads, One Search Engine: What Meilisearch Learned from jemalloc, bumpalo, and mimalloc

Source: lobsters

Clément Renault’s post “The Good, the Bad, and the Leaky” covers Meilisearch’s allocator history honestly: they shipped jemalloc, experimented with mimalloc, and eventually settled on a hybrid strategy that pairs a global allocator with arena allocation for short-lived query work. The post is a good incident report. It is also a useful anchor for a deeper question: why does allocator choice matter so much in a search engine specifically, and what does the two-tier pattern actually buy you in Rust?

The answer comes down to the shape of the workload. Search engines do two fundamentally different things: they index documents, and they serve queries. These two operations have almost nothing in common from an allocator’s perspective.

Indexing vs. Query: Two Different Heap Profiles

Indexing is a batch process. You parse documents, build inverted index structures, merge posting lists, compress them, and write them to disk. The allocations vary enormously in size, they live for different durations, and they interleave with each other in patterns that are hard to predict. You might allocate a 1 MB buffer for a merge operation, then a few hundred small posting list entries, then a 4 MB working buffer for HNSW graph construction. These allocations come from different code paths, at different times, and are freed at different times.

This is exactly the workload that causes heap fragmentation. When allocations of different sizes are interspersed and freed in unpredictable order, a general-purpose allocator ends up with a heap that looks like Swiss cheese: free blocks that are the wrong size for the next request, virtual address space that is committed to the process but not fully used, RSS that climbs and stays high.

Query serving is the opposite. A single query arrives, your code builds a handful of candidate sets, scores them, does some filtering, assembles a result payload, and returns it. The entire scratch space used to compute that result is short-lived. All of it can be discarded the moment the response is sent. The allocations are many and small: temporary score arrays, candidate lists, string slices for tokenization, intermediate filter bitmaps. They live for milliseconds.

The fragmentation problem for queries is not about size mismatch; it is about churn. Thousands of queries per second means thousands of allocation/free cycles, each one interacting with the global heap. Even a fast allocator spends real time managing free lists across that volume.

jemalloc: Arenas and the Thread Cache

jemalloc’s design is built around two mechanisms that address multi-threaded allocation directly: arenas and the thread-local cache (tcache).

By default, jemalloc creates 4 * ncpus independent arenas. Each arena has its own free lists, its own bookkeeping, and its own lock. Threads are assigned to arenas in round-robin at startup, so threads in different arenas never contend on each other’s state. The lock granularity drops from one global lock to one lock per arena, and with enough arenas, most lock operations are uncontested.

Above the arena sits the tcache. When a thread frees an object, it goes into the tcache first: a per-thread bin organized by size class. The next allocation of the same size class pulls from the tcache with no lock, no atomic, no shared state at all. Only when a bin fills does jemalloc flush it back to the arena in a single lock acquisition. The fast path for most allocations in a long-running service is: compare bin pointer against bin end, store pointer, increment.

jemalloc also has a configurable decay model for returning memory to the OS. Dirty pages (freed but not yet returned) are tracked with a decay timer (opt.dirty_decay_ms, default 10 seconds), after which jemalloc calls madvise(MADV_DONTNEED) to release them. A background decay thread, opt-in since jemalloc 5.1, runs independently of allocation activity so the decay happens even during idle periods.

In Rust, you get all of this via tikv-jemallocator:

use tikv_jemallocator::Jemalloc;

#[global_allocator]
static ALLOC: Jemalloc = Jemalloc;

That declaration is the entire integration. Every Box, Vec, String, and HashMap in your process now goes through jemalloc.

mimalloc: Segmented Free Lists and the Memory Retention Problem

mimalloc, from Microsoft Research, takes a different approach. Instead of arenas with size-class bins, mimalloc organizes memory into 4 MiB segments, each subdivided into pages of a fixed size. Each page within a segment serves one size class. Free lists are per-page rather than per-size-class globally, which means short free-list walks and good cache locality for the free-list metadata.

The design has real advantages on allocation-heavy benchmarks. The original mimalloc paper showed roughly 7% improvement over jemalloc on Redis workloads and meaningful gains on the Lean theorem prover. On tight allocation microbenchmarks the gap can be 10-30% in mimalloc’s favor. These numbers are genuine.

The issue, which Renault describes directly, is memory retention. mimalloc tends to hold onto memory pages rather than returning them to the OS. When a segment has been partially freed, mimalloc keeps it around on the expectation that future allocations of the same size class will fill it again. This is the right bet for an allocator optimizing for throughput: OS memory reclamation via madvise or munmap is expensive, and re-acquiring pages from the OS is more expensive still.

For a search engine under variable load, this retention behavior means RSS climbs during indexing and does not come back down meaningfully when you stop indexing and switch to query serving. The process holds virtual address space that is fragmented and mostly empty. Depending on your deployment environment, this can trigger OOM conditions or simply waste memory capacity that other services on the same host need.

mimalloc offers mi_collect(true) and related calls, but the retention tendency is more deeply baked into its segment model than jemalloc’s decay model. jemalloc’s approach of treating memory return as a first-class, configurable concern produces better behavior in workloads with distinct high-allocation phases followed by steady-state serving.

The bumpalo Answer: Remove Query Work from the Global Heap Entirely

The insight that changes the picture is that query scratch space does not need to participate in the global heap at all. If every allocation made while serving a query will be freed when the query finishes, you can use an arena allocator and reset the entire arena in one operation.

bumpalo implements exactly this. The Bump type owns a slab of memory and hands out allocations by bumping a pointer forward. There are no free lists, no size classes, no bookkeeping per-object. Allocation is a bounds check and a pointer increment. Freeing individual objects is a no-op. When you are done with the arena, reset() restores the bump pointer to the start, reusing the underlying memory for the next query without any OS interaction.

use bumpalo::Bump;
use bumpalo::collections::Vec;

fn serve_query<'bump>(
    query: &str,
    bump: &'bump Bump,
) -> SearchResults {
    // All scratch allocations come from the arena
    let mut candidates: Vec<u32> = Vec::new_in(bump);
    let mut scores: Vec<f32> = Vec::new_in(bump);
    let mut tokens: Vec<&str> = Vec::new_in(bump);

    tokenize_into(query, &mut tokens);

    for token in &tokens {
        fetch_candidates(token, &mut candidates);
    }

    score_candidates(&candidates, &mut scores);

    // Build results that escape the arena using the global allocator
    SearchResults::from_scores(&candidates, &scores)
}

fn handle_request(query: &str) -> SearchResults {
    let bump = Bump::new();
    serve_query(query, &bump)
    // bump drops here; entire arena freed in one deallocation
}

bumpalo::collections::Vec is bumpalo’s arena-aware vector. It is layout-compatible with std::vec::Vec but backed by the Bump allocator via new_in. The SearchResults type above must own its data on the global heap, not hold references into the arena; Rust’s borrow checker enforces this via the 'bump lifetime parameter, rejecting any attempt to let an arena reference escape at compile time.

For a high-concurrency server, the common pattern is to keep a pool of Bump arenas and reuse them across requests. bumpalo’s reset() preserves the backing allocation while clearing the arena:

let mut bump = Bump::with_capacity(512 * 1024); // pre-allocate 512 KiB

loop {
    let query = receive_query();
    let results = serve_query(&query, &bump);
    send_response(results);
    bump.reset(); // O(1), no OS interaction, reuses backing memory
}

The throughput improvement over routing query scratch space through the global allocator can be substantial. From the global allocator’s perspective, the entire query is one allocation (the arena’s backing buffer, amortized across many resets) and one free. Thousands of small internal allocations disappear from its view entirely.

The C++ Parallel

This is not a new idea. C++ programmers working on latency-sensitive systems have been doing this for years. C++17 standardized it with std::pmr::memory_resource and std::pmr::monotonic_buffer_resource, which are the standard library’s equivalent of bumpalo’s Bump:

std::array<std::byte, 512 * 1024> stack_buf;
std::pmr::monotonic_buffer_resource arena(
    stack_buf.data(), stack_buf.size()
);

std::pmr::vector<uint32_t> candidates(&arena);
std::pmr::vector<float> scores(&arena);

// arena falls out of scope: all memory released in one operation

monotonic_buffer_resource does what bumpalo does: bump a pointer, never free individual allocations, release the backing buffer on destruction. The std::pmr containers accept a memory_resource* at runtime, so the allocator relationship is dynamic rather than static.

The structural difference from the Rust version is where the safety guarantee lives. In C++, nothing prevents you from letting an arena-allocated pointer escape its lifetime; that is a use-after-free waiting to happen. In Rust, the 'bump lifetime on arena-borrowed references is checked at compile time. The borrow checker rejects escaping references before the code runs.

The Nightly Allocator Trait

On nightly Rust, the Allocator trait formalizes what bumpalo does informally today. Standard library collections that are Allocator-generic can accept any conforming allocator:

#![feature(allocator_api)]

use std::alloc::Allocator;
use bumpalo::Bump;

fn collect_tokens<A: Allocator>(
    input: &str,
    alloc: A,
) -> std::vec::Vec<&str, A> {
    let mut tokens = std::vec::Vec::new_in(alloc);
    // tokenization logic
    tokens
}

// With the bump arena
let bump = Bump::new();
let tokens = collect_tokens(query, &bump);

// With the global allocator
let tokens = collect_tokens(query, std::alloc::Global);

The new_in pattern is already stable in bumpalo’s own collections. When Allocator stabilizes in the standard library, the same pattern will work with std::collections without needing bumpalo’s parallel collection types. The two-tier pattern becomes idiomatic rather than crate-specific.

What the Combination Buys in Practice

Meilisearch’s settled approach, as Renault describes it, is jemalloc as the global allocator for indexing work combined with bumpalo arenas for query-scope scratch space. The combination addresses both workload profiles: jemalloc’s arena model and decay mechanics handle the fragmented, long-lived indexing allocations; bumpalo removes query scratch space from the global allocator’s view entirely.

The mimalloc experiment taught a specific lesson: raw allocation throughput is not the binding constraint, and an allocator that retains memory pages can produce worse outcomes than a slower one that releases them. RSS in production is a first-class resource, especially on shared hosts.

The bumpalo lesson is more fundamental: when you have a clear scope boundary, an allocator that reflects that boundary buys you both throughput and simplicity. The global allocator never sees the allocations bumpalo handles. There is no fragmentation from query scratch space because the arena either holds its backing buffer or resets it; it never fragments it.

The mimalloc-rust crate makes the experiment easy to reproduce. Swap in one global allocator declaration, run your workload benchmark, and watch RSS over time. The performance numbers in isolation may look good. The memory retention behavior under a realistic indexing-then-serving workload is what tells the real story.

Was this interesting?