When RSS Lies: What Meilisearch's Allocator Benchmarks Actually Reveal
Source: lobsters
Clement Renault, a core Meilisearch engineer, recently published a detailed investigation comparing three allocators across Meilisearch’s indexing workload: jemalloc via tikv-jemallocator, bumpalo, and mimalloc. The benchmark results are interesting. The architectural lesson underneath them is more interesting.
The short version of the findings is that jemalloc reduces fragmentation compared to glibc’s ptmalloc, bumpalo delivers excellent performance for scoped intermediate allocations inside the indexing pipeline, and mimalloc is fast but peaks higher on this particular workload. What the article actually surfaces, though, is a problem that trips up many operators running long-lived Rust services: RSS is not a reliable signal for whether your allocator is misbehaving, and the default configuration of jemalloc on Linux makes RSS look worse than real usage as a side effect of how it hands pages back to the kernel.
The MADV_FREE Problem
On Linux 4.5 and later, jemalloc prefers MADV_FREE over MADV_DONTNEED when releasing pages back to the OS. The distinction matters enormously for what you see in /proc/<pid>/status or in your container’s memory metrics.
With MADV_DONTNEED, the kernel drops the pages from the process’s resident set immediately, and any later access faults in fresh zero-filled pages. RSS drops at once. With MADV_FREE, the kernel marks the pages as available for reclaim but leaves them mapped. The pages stay in RSS until the kernel actually needs the physical RAM for something else. From the kernel’s perspective this is smarter: if the application writes to those pages again before they are reclaimed, it simply keeps them and avoids the cost of faulting in new ones. From an operator’s perspective, it looks like the process is holding memory it no longer needs, and in a containerized environment with memory limits, it can count against the limit and contribute to OOM kills on memory that is technically free.
This is exactly what Meilisearch’s investigation found. After a large indexing job completes and jemalloc has internally freed the pages, RSS stays elevated. The allocator is not leaking. The kernel is being lazy about reclaiming pages it has already been told it can have back. The result looks like a leak and is not one.
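One way to check whether elevated RSS is lazily freed memory rather than a leak is to compare the resident figure with the kernel's own count of MADV_FREE pages. The following is a minimal sketch using only the standard library, assuming a Linux kernel recent enough to expose /proc/self/smaps_rollup with a LazyFree field; the function names are illustrative and do not come from the Meilisearch article.

```rust
use std::fs;

/// Extracts a "<name>:   <value> kB" field from /proc-style text, in KiB.
/// Only an exact field name matches ("Rss" does not match "RssAnon").
fn field_kib(text: &str, name: &str) -> Option<u64> {
    text.lines().find_map(|line| {
        let rest = line.strip_prefix(name)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse().ok()
    })
}

/// Returns (resident, lazily freed) in KiB for the current process.
/// Linux only; returns None if either file or field is unavailable.
fn rss_and_lazyfree_kib() -> Option<(u64, u64)> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let rollup = fs::read_to_string("/proc/self/smaps_rollup").ok()?;
    Some((field_kib(&status, "VmRSS")?, field_kib(&rollup, "LazyFree")?))
}
```

If the second number accounts for most of the gap between RSS and what the application believes it has allocated, the process is not leaking; the kernel simply has not needed the pages back yet.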
jemalloc’s Three-Phase Decay Pipeline
jemalloc handles freed memory through a staged decay process. When the application frees memory and whole pages become unused, those pages are marked dirty within jemalloc’s arena. After roughly dirty_decay_ms milliseconds (default: 10,000 ms), jemalloc issues a lazy madvise to the kernel, transitioning those pages to “muzzy”. After roughly another muzzy_decay_ms milliseconds (also 10,000 ms in the defaults described here, though defaults can differ between jemalloc releases), the pages are fully released.
This means with default settings, memory freed at the end of an indexing burst can sit for around ten seconds before jemalloc even tells the kernel it is available, and up to twenty seconds before it is fully released. And even then, because MADV_FREE is used for that first step, RSS may not drop until the kernel has a reason to reclaim it.
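The timing can be made concrete with a small model. This is a deliberately simplified sketch, not jemalloc's implementation: real jemalloc decays pages gradually rather than in a single step, and the type and method names below are invented for illustration. It reports which state a batch of pages freed at time zero would be in after a given delay, assuming purging runs on schedule.

```rust
/// Simplified page states in the decay pipeline described above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PageState {
    Dirty,    // freed by the application, still owned by jemalloc
    Muzzy,    // handed to the kernel lazily, may still count in RSS
    Released, // fully given back
}

/// Hypothetical model of the two decay settings, in milliseconds.
struct DecayConfig {
    dirty_decay_ms: u64,
    muzzy_decay_ms: u64,
}

impl DecayConfig {
    /// State of pages freed at t = 0, observed `elapsed_ms` later.
    fn state_at(&self, elapsed_ms: u64) -> PageState {
        if elapsed_ms < self.dirty_decay_ms {
            PageState::Dirty
        } else if elapsed_ms < self.dirty_decay_ms + self.muzzy_decay_ms {
            PageState::Muzzy
        } else {
            PageState::Released
        }
    }
}
```

With both settings at 10,000 ms, the model reports Dirty at 5 seconds, Muzzy at 15 seconds, and Released only at 20 seconds. With 1,000 ms and 0 ms, the pages are Released after one second and the Muzzy state never appears.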
There is a second problem layered on top: without a background thread, the decay logic only runs when the application’s own allocator activity (allocations and frees) triggers a check. An idle server after a burst never purges. The background thread feature exists precisely for this scenario, but it must be opted into at both compile time and runtime.
Enabling it in Rust with tikv-jemallocator looks like this:
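// Route all heap allocations through jemalloc.
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
// jemalloc reads its options from this NUL-terminated string at startup.
// The `_rjem_` prefix matches tikv-jemallocator's default prefixed symbols;
// a build with unprefixed symbols would export `malloc_conf` instead.
#[allow(non_upper_case_globals)]
#[export_name = "_rjem_malloc_conf"]
pub static _rjem_malloc_conf: &[u8] =
    b"background_thread:true,dirty_decay_ms:1000,muzzy_decay_ms:0\0";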
Setting dirty_decay_ms to 1000 and muzzy_decay_ms to 0 pushes jemalloc to aggressively purge dirty pages within a second and skip the muzzy phase entirely. Combined with background_thread:true, this configuration means that within a few seconds of an indexing burst finishing, jemalloc has issued its madvise calls and the kernel has everything it needs to reclaim the pages. Because the muzzy phase is the lazily purged one, skipping it should also shrink the amount of memory waiting on kernel reclaim. RSS can still lag wherever pages were released with MADV_FREE semantics, but the gap closes substantially in practice because physical memory pressure causes reclaim quickly.
This is not a new finding. Redis has documented jemalloc configuration for similar reasons, and TiKV, whose team maintains the tikv-jemallocator crate, configures jemalloc with background threads by default in their own deployments. The Meilisearch investigation usefully quantifies the effect on a search engine’s specific workload.
bumpalo Is Not a General Allocator Replacement
bumpalo by Nick Fitzgerald occupies a fundamentally different position than jemalloc or mimalloc. It is not a general-purpose allocator. It is a bump allocator: each allocation advances a pointer through a contiguous chunk of memory, with no per-object bookkeeping at all. Allocation is essentially free. The cost is that individual deallocation is impossible. Memory can only be freed wholesale: dropping the Bump arena releases all of its memory at once, and resetting it makes the space available for reuse.
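To make the pointer-bump idea concrete, here is a toy arena written with only the standard library. It is a sketch of the general technique and not bumpalo's implementation: bumpalo hands out real references and grows by adding chunks, while this version uses a fixed buffer and returns offsets so it can stay in safe Rust. The name ToyBump and its methods are invented for illustration.

```rust
/// Toy bump arena over a fixed-size buffer.
struct ToyBump {
    buf: Vec<u8>,
    next: usize, // offset of the first unused byte
}

impl ToyBump {
    fn with_capacity(bytes: usize) -> Self {
        ToyBump { buf: vec![0; bytes], next: 0 }
    }

    /// Reserves `size` bytes aligned to `align` (a power of two, measured
    /// from the start of the buffer). Returns the offset of the
    /// reservation, or None if the arena is full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        let start = self.next.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.next = end; // the whole cost of an allocation: one bump
        Some(start)
    }

    /// Frees everything at once by rewinding the pointer.
    fn reset(&mut self) {
        self.next = 0;
    }
}
```

There is no free list and no per-object header, which is why there is also no way to release a single object: the only bookkeeping is the one offset.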
This makes bumpalo useless as a global allocator but extremely useful for bounded, batch-scoped phases of work. Meilisearch’s indexing pipeline has exactly this shape: parse documents, build intermediate inverted index structures, write to LMDB, discard everything intermediate. The intermediate phase allocates a large number of small objects with the same lifetime. Routing that phase through a Bump arena means all of those allocations are pointer bumps, and the entire arena is dropped in one operation after the LMDB write transaction commits.
The design pattern is worth being explicit about:
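use bumpalo::{collections::Vec as BumpVec, Bump};

// tokenize_into, build_index_entries, and commit_to_lmdb are hypothetical helpers.
let bump = Bump::new();
// Arena-backed vectors (bumpalo's `collections` feature) keep their buffers
// in the arena too; a std Vec would still go through the global allocator.
let terms: BumpVec<'_, &str> = tokenize_into(&bump, &document);
let entries: BumpVec<'_, Entry> = build_index_entries(&bump, &terms);
commit_to_lmdb(&mut wtxn, &entries)?;
// when `bump` goes out of scope, the entire arena is freed at once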
Nothing about this pattern requires the global allocator to handle thousands of small short-lived allocations. The pressure on jemalloc or mimalloc is reduced exactly where the allocation rate is highest. When the arena goes out of scope, its memory goes back to the global allocator as a handful of large chunks rather than thousands of scattered small frees, which leaves far less fragmentation behind for any decay pipeline to clean up.
mimalloc’s Architecture and Where It Fits
mimalloc, developed by Daan Leijen at Microsoft Research and described in the paper “mimalloc: Free List Sharding in Action”, uses a different approach than jemalloc’s arena model. Its central innovation is free list sharding: instead of one free list per size class, each individual page (roughly 64KB) has its own local free list. Each thread has a thread-local heap containing pages organized by size class, and cross-thread frees go to a per-page thread_free list that the owning thread collects periodically.
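The sharding idea can be sketched with a toy page type. This is a simplified model under stated assumptions, not mimalloc's code: real mimalloc uses intrusive linked lists and an atomic list for cross-thread frees, and keeps more than two lists per page. Here a Mutex-guarded Vec stands in for the thread_free list, and ToyPage and its methods are invented names.

```rust
use std::sync::Mutex;

/// Toy model of one page with its own free lists. Blocks are identified
/// by index within the page.
struct ToyPage {
    free: Vec<usize>,               // touched only by the owning thread
    thread_free: Mutex<Vec<usize>>, // blocks freed by other threads
}

impl ToyPage {
    fn new(blocks: usize) -> Self {
        ToyPage {
            free: (0..blocks).rev().collect(),
            thread_free: Mutex::new(Vec::new()),
        }
    }

    /// Owning thread: pop from the local list without synchronization.
    fn alloc(&mut self) -> Option<usize> {
        if self.free.is_empty() {
            // Only when the local list runs dry are remote frees collected.
            let mut remote = self.thread_free.lock().unwrap();
            self.free.append(&mut remote);
        }
        self.free.pop()
    }

    /// Owning thread frees a block: a plain push.
    fn free_local(&mut self, block: usize) {
        self.free.push(block);
    }

    /// Another thread frees a block: it goes on the shared list instead.
    fn free_remote(&self, block: usize) {
        self.thread_free.lock().unwrap().push(block);
    }
}
```

The point of the split is that the common path, an owning thread allocating and freeing its own blocks, never touches shared state; synchronization is paid only when remote frees are collected.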
This sharding reduces contention dramatically for multi-threaded allocation-heavy workloads. mimalloc also uses a segment alignment trick: segments are aligned to their own size (typically 4MB on 64-bit), which means the segment metadata can be located from any pointer with a single bitwise mask, keeping per-object overhead near zero.
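The alignment trick comes down to one mask operation. The sketch below assumes the 4 MiB segment size quoted above and works on plain integer addresses; the constant and function names are illustrative rather than mimalloc's own.

```rust
/// Assumed segment size for illustration: 4 MiB.
const SEGMENT_SIZE: usize = 4 * 1024 * 1024;

/// Base address of the segment containing `addr`, assuming every segment
/// starts at a multiple of SEGMENT_SIZE (which must be a power of two).
fn segment_base(addr: usize) -> usize {
    addr & !(SEGMENT_SIZE - 1)
}
```

For example, segment_base(0x1234_5678) is 0x1200_0000. Because the metadata lives at that base address, no per-object header or lookup table is needed to find it from a pointer.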
Meilisearch’s benchmarks found mimalloc faster for raw throughput in some phases but with higher peak RSS than tuned jemalloc. This fits what mimalloc’s architecture predicts: the segment-level granularity of OS memory return means a partially-filled segment holds all of its pages even if most objects within it have been freed. jemalloc’s slab and arena structure can return memory at finer granularity once configured aggressively.
For workloads that are uniformly allocation-heavy across many threads with high object churn and relatively steady memory use, mimalloc is frequently the fastest option available. For workloads with sharp allocation bursts followed by long idle periods, jemalloc with tuned decay settings is a better fit. Meilisearch, being a search server with periodic heavy indexing jobs interspersed with read queries, falls into the second category.
The Shape of the Problem
The allocator comparison in the Meilisearch article is a good reminder that the right allocator is determined by the shape of your allocation pattern, not just by throughput benchmarks on synthetic workloads. A general allocator optimized for steady-state multi-threaded allocation will behave differently than one designed around bursty batch workloads in a long-lived server process.
Search engine indexing is a particularly awkward allocation pattern. It is not a steady-state server workload. It is not a short-lived process that exits when done. It is a burst of write-heavy work inside a process that also needs to handle read queries, often on constrained memory budgets in cloud or container deployments. The allocator needs to be fast during the burst and then get out of the way.
jemalloc handles this well when configured to do so, bumpalo handles the innermost allocation-heavy phases without any decay overhead at all, and the combination of both is more capable than either alone. The article’s real finding is that the defaults are not enough, the metrics can mislead you, and the configuration surface exists precisely for situations like this one.