Your Allocator Choice Determines How You Debug the Next Memory Problem

Clément Renault published a detailed post-mortem on Meilisearch’s allocator experiments that reads, on the surface, like a performance comparison. Dig deeper and it is really about something else: three distinct failure modes, each requiring a completely different diagnostic approach, and each made harder or easier to diagnose depending on which allocator you chose. The throughput differences between jemalloc, mimalloc, and a bump allocator matter. The observability differences matter more.

Most allocator comparisons cite the mimalloc-bench suite, which shows mimalloc outperforming jemalloc by 10-30% on allocation-heavy synthetic benchmarks. In a vacuum those numbers look decisive. In production, they often tell you nothing about the failure you are about to diagnose at 2 AM.

Why Meilisearch Is a Good Allocator Test Bed

Meilisearch runs a workload that stresses allocators in two distinct and conflicting ways. Indexing produces many varied-size, short-lived allocations: inverted index builders, field norm tables, prefix trees, sort buffers. The size distribution is irregular, which is the classic fragmentation trigger. Search produces smaller, more uniform allocations bounded to individual queries. The system uses LMDB as its storage engine, which relies on mmap for data access, meaning some of the process RSS comes from OS page cache entirely outside the heap allocator’s jurisdiction.

That last point is underappreciated. Before you can even compare allocators on a metric like RSS, you need to separate mmap-backed memory from heap-allocated memory. A process that indexes a large dataset and then searches it will have RSS inflated by both page cache reads and allocator retention. Without that separation, every allocation experiment starts with a measurement that conflates multiple causes.
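A minimal sketch of that separation on Linux, assuming a kernel recent enough (4.5 or later) to report the RssAnon, RssFile, and RssShmem fields in /proc/<pid>/status. The type and function names are illustrative and not taken from Meilisearch. Anonymous memory is roughly the heap plus thread stacks; file-backed memory includes LMDB’s mapped pages along with the binary and shared libraries.

```rust
/// Resident memory split by backing, in kibibytes, as reported by the
/// RssAnon / RssFile / RssShmem fields of /proc/<pid>/status.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RssBreakdown {
    pub anon_kib: u64,  // heap, stacks, other anonymous mappings
    pub file_kib: u64,  // file-backed mappings (mmap'd database pages, binaries)
    pub shmem_kib: u64, // shared memory
}

/// Parse the three Rss* fields out of the text of /proc/<pid>/status.
/// Lines look like "RssAnon:\t  123456 kB"; fields that are missing stay at 0.
pub fn parse_rss_breakdown(status: &str) -> RssBreakdown {
    let mut out = RssBreakdown::default();
    for line in status.lines() {
        let (key, rest) = match line.split_once(':') {
            Some(pair) => pair,
            None => continue,
        };
        // The first whitespace-separated token after the colon is the number.
        let kib = match rest.split_whitespace().next().and_then(|v| v.parse::<u64>().ok()) {
            Some(kib) => kib,
            None => continue,
        };
        match key.trim() {
            "RssAnon" => out.anon_kib = kib,
            "RssFile" => out.file_kib = kib,
            "RssShmem" => out.shmem_kib = kib,
            _ => {}
        }
    }
    out
}

/// Read the breakdown for the current process (Linux only).
pub fn current_rss_breakdown() -> std::io::Result<RssBreakdown> {
    let status = std::fs::read_to_string("/proc/self/status")?;
    Ok(parse_rss_breakdown(&status))
}
```

Comparing allocators on anon_kib rather than total RSS keeps page-cache effects from the mmap-backed store out of the measurement.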

The glibc baseline gave them RSS/allocated ratios around 3:1 after an indexing cycle. Three bytes resident for every one byte in a live object. This is textbook fragmentation: freed allocations of varied sizes leave gaps in pages that cannot be returned to the OS. You are paying for memory that is not doing useful work.

Jemalloc’s Failure Mode Is Designed In

Jemalloc dropped that ratio to around 1.3:1 through its size-class system, which buckets small allocations into fixed-size slots packed tightly into slabs (called runs before jemalloc 5). That alone makes it worth the dependency on tikv-jemallocator:

# Cargo.toml
[dependencies]
tikv-jemallocator = "0.5"
tikv-jemalloc-ctl = "0.5"

// src/main.rs
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

But jemalloc’s failure mode is that after a large indexing burst, the process RSS stays elevated even after the allocations are freed. Jemalloc holds freed pages in per-arena dirty lists and thread-local caches, deliberately. Future allocations draw from those retained pages without a kernel call, which is faster. The process is not fragmented and it is not leaking. It is speculatively retaining capacity.

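The decay timers control how aggressively jemalloc returns pages. dirty_decay_ms defaults to 10 seconds. muzzy_decay_ms also defaulted to 10 seconds through jemalloc 5.2 and defaults to 0 from 5.3 onward, so check the version you build against. You can tune them: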

MALLOC_CONF=narenas:4,background_thread:true,dirty_decay_ms:5000,muzzy_decay_ms:10000

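Setting dirty_decay_ms:0 forces immediate page return but eliminates the caching benefit. Setting background_thread:true (available since jemalloc 5.0) moves purging onto a dedicated thread, so decay work does not stall application threads in the middle of an allocation. One caveat: tikv-jemallocator prefixes jemalloc’s symbols by default, and in that configuration the variable is read as _RJEM_MALLOC_CONF rather than MALLOC_CONF, unless the crate is built with its unprefixed_malloc_on_supported_platforms feature.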

Here is what makes jemalloc diagnosable: the mallctl API, exposed in Rust through tikv-jemalloc-ctl, lets you read stats.allocated versus stats.resident at runtime:

use tikv_jemalloc_ctl::{epoch, stats};

// refresh the stats epoch first
epoch::mib().unwrap().advance().unwrap();

let allocated = stats::allocated::mib().unwrap().read().unwrap();
let resident  = stats::resident::mib().unwrap().read().unwrap();
let ratio = resident as f64 / allocated as f64;

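That ratio, tracked over time, tells you which kind of problem you are looking at. If allocated itself keeps climbing between comparable idle points, suspect an actual leak. If allocated is flat while resident stays high, the gap is allocator overhead, and stats.active splits it further: active far above allocated points to fragmentation, and resident far above active points to retention. No other allocator tested for Meilisearch offered this. You can expose it as a Prometheus gauge, alert on it, correlate it with indexing activity. This is the diagnostic capability that keeps jemalloc as Meilisearch’s Linux default despite mimalloc’s better raw numbers.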
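As a rough illustration of that triage, the sketch below classifies two samples of jemalloc’s counters taken at comparable idle points. The struct, the enum, and all three thresholds are hypothetical and would need tuning against a real baseline. In a service the numbers would come from stats.allocated, stats.active, and stats.resident after an epoch advance; here they are plain integers so the logic stands on its own.

```rust
/// One sample of jemalloc's counters, in bytes.
#[derive(Debug, Clone, Copy)]
pub struct HeapSample {
    pub allocated: u64, // bytes in live objects
    pub active: u64,    // bytes in pages that hold at least one live object
    pub resident: u64,  // bytes physically resident, including dirty pages
}

#[derive(Debug, PartialEq, Eq)]
pub enum Diagnosis {
    Healthy,
    LikelyLeak,
    Fragmentation,
    Retention,
}

// Hypothetical thresholds, chosen only for illustration.
const GROWTH_LIMIT: f64 = 1.5; // allocated grew by more than 50% between samples
const FRAG_LIMIT: f64 = 1.5;   // active / allocated
const RETAIN_LIMIT: f64 = 1.5; // resident / active

/// Compare two samples taken at comparable idle points in the workload.
pub fn diagnose(before: HeapSample, after: HeapSample) -> Diagnosis {
    let growth = after.allocated as f64 / before.allocated.max(1) as f64;
    let frag = after.active as f64 / after.allocated.max(1) as f64;
    let retain = after.resident as f64 / after.active.max(1) as f64;

    if growth > GROWTH_LIMIT {
        // Live bytes keep climbing: the application is holding on to objects.
        Diagnosis::LikelyLeak
    } else if frag > FRAG_LIMIT {
        // Pages are in use but mostly empty.
        Diagnosis::Fragmentation
    } else if retain > RETAIN_LIMIT {
        // The allocator is keeping freed pages resident.
        Diagnosis::Retention
    } else {
        Diagnosis::Healthy
    }
}
```

The checks run in that order on purpose: a leak also inflates resident, so growth in live bytes is ruled out before the overhead ratios are interpreted.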

Bumpalo’s Failure Mode Is Invisible to Standard Profilers

Bumpalo is a bump allocator: it carves allocations out of a chunk of memory by bumping a pointer. Individual deallocation is a no-op. All of the memory is released at once when the Bump is dropped, or made reusable by calling reset().

Allocation cost is around 2-5 nanoseconds versus 20-100 nanoseconds for a general-purpose allocator. For hot paths like query parsing, facet computation, or temporary sort buffers scoped to a single request, this is a meaningful difference. The pattern works correctly when the arena’s lifetime matches the operation’s lifetime:

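use bumpalo::Bump;

// parse_query, compute_facets, rank_and_return, and SearchHit stand in for
// application code; only the arena's scope matters here.
fn handle_query(query: &str) -> Vec<SearchHit> {
    let bump = Bump::new();
    let parsed = parse_query(&bump, query);
    let candidates = compute_facets(&bump, parsed);
    // the returned hits must be owned values, not references into the arena
    rank_and_return(candidates)
    // bump drops here; all arena memory reclaimed at once
}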

The failure mode Meilisearch hit was a Bump arena wrapped in an Arc and stored in a struct that lived for the full lifetime of an open index. Every search request grew the arena and nothing ever shrank it. RSS climbed steadily with query volume. No allocations returned errors. No individual allocation looked suspicious. A heap profiler would show memory growing in the arena but couldn’t tell you that the arena was never being reset, because from the allocator’s perspective it wasn’t leaking, just accumulating.

Diagnosis requires tracing ownership through the struct graph to find where the Bump’s Drop is actually called, or never called. This is not something stats.resident from jemalloc helps with: the arena requests large chunks from the global allocator and carves individual allocations out of them itself, so jemalloc sees only a handful of big, live allocations that look like legitimate usage. The fix is enforced lifetime discipline: the arena must be created at the start of an operation and dropped at the end. If you cannot guarantee that, a bump allocator is wrong for that use case regardless of its allocation cost.
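The difference between the two lifetimes can be shown with a toy model. The sketch below is not bumpalo’s API: ToyArena is a hypothetical stand-in that only counts the bytes it has handed out, and run_query is invented per-query work. It is enough to show why an arena owned by a long-lived struct grows with query volume while a per-operation arena cannot.

```rust
/// Toy stand-in for a bump arena: it only tracks how many bytes have been
/// handed out. Nothing is ever freed individually.
pub struct ToyArena {
    used: usize,
}

impl ToyArena {
    pub fn new() -> Self {
        ToyArena { used: 0 }
    }
    /// "Allocate" `n` bytes by bumping the counter.
    pub fn alloc(&mut self, n: usize) {
        self.used += n;
    }
    pub fn used_bytes(&self) -> usize {
        self.used
    }
}

/// Hypothetical per-query work: scratch space proportional to the query,
/// returning the number of search terms.
fn run_query(arena: &mut ToyArena, query: &str) -> usize {
    arena.alloc(query.len() * 16);
    query.split_whitespace().count()
}

/// Anti-pattern: the arena lives as long as the index, so every query adds
/// to it and nothing ever shrinks it.
pub struct LongLivedIndex {
    arena: ToyArena,
}

impl LongLivedIndex {
    pub fn new() -> Self {
        LongLivedIndex { arena: ToyArena::new() }
    }
    pub fn search(&mut self, query: &str) -> usize {
        run_query(&mut self.arena, query)
    }
    pub fn arena_bytes(&self) -> usize {
        self.arena.used_bytes()
    }
}

/// Scoped pattern: the arena is created and dropped inside the call.
/// Returns (term count, bytes used by this one query).
pub fn search_scoped(query: &str) -> (usize, usize) {
    let mut arena = ToyArena::new();
    let terms = run_query(&mut arena, query);
    (terms, arena.used_bytes())
    // arena drops here, so usage cannot accumulate across queries
}
```

After 100 calls to LongLivedIndex::search the arena holds 100 queries’ worth of scratch space; search_scoped never holds more than one.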

Bumpalo’s design here is actually correct; the Rust borrow checker prevents references into the arena from outliving the arena itself. The bug was about the arena’s own lifetime, not the references it vends. That distinction matters when you are searching for the cause.

Mimalloc’s Failure Mode Only Appears Under Mixed Load

Mimalloc (2019, Microsoft Research) introduced free list sharding: instead of one free list per size class, it keeps free lists per page, so each thread has a local free list for every page it owns that no other thread touches. Cross-thread deallocations are pushed atomically onto a separate per-page thread-free list and are collected by the owning thread the next time its allocation path runs out of free blocks on that page. This removes most cross-thread contention and is the mechanism behind mimalloc’s benchmark results.
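A single-threaded model of one page makes the deferred collection easier to see. This is a simplified sketch, not mimalloc’s implementation: block indices stand in for pointers, plain vectors stand in for intrusive lists, and the list that receives cross-thread frees (thread_free below) would be updated atomically in the real allocator.

```rust
/// Toy model of a single allocator page with sharded free lists.
pub struct Page {
    /// Blocks ready to hand out; touched only by the owning thread.
    free: Vec<usize>,
    /// Blocks freed by the owning thread since the last collection.
    local_free: Vec<usize>,
    /// Blocks freed by other threads, waiting for the owner to collect them.
    thread_free: Vec<usize>,
}

impl Page {
    pub fn with_blocks(n: usize) -> Self {
        Page {
            free: (0..n).rev().collect(), // pop() hands out block 0 first
            local_free: Vec::new(),
            thread_free: Vec::new(),
        }
    }

    /// Fast path: pop from `free`. Slow path: when `free` is empty, first
    /// collect both deferred lists. Returns the block together with the
    /// number of deferred frees collected during this call.
    pub fn alloc(&mut self) -> Option<(usize, usize)> {
        let mut collected = 0;
        if self.free.is_empty() {
            collected = self.local_free.len() + self.thread_free.len();
            self.free.append(&mut self.local_free);
            self.free.append(&mut self.thread_free);
        }
        self.free.pop().map(|block| (block, collected))
    }

    /// Free performed by the owning thread.
    pub fn free_local(&mut self, block: usize) {
        self.local_free.push(block);
    }

    /// Free performed by another thread (modelled without real concurrency).
    pub fn free_remote(&mut self, block: usize) {
        self.thread_free.push(block);
    }
}
```

The collected count is the point of the model: the cost of draining deferred frees is paid by whichever allocation happens to hit the slow path, not by the thread that did the freeing.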

Mimalloc returns unused pages to the OS eagerly through madvise (MADV_FREE or MADV_DONTNEED, depending on version and configuration), so RSS after an indexing burst drops faster than with jemalloc’s decay-based approach. In Meilisearch’s experiments, mimalloc produced the lowest settled RSS of any allocator tested.

The failure appeared only under mixed workloads where indexing and search traffic overlapped. Mimalloc’s delayed free mechanism processes cross-thread deallocations on the owning thread’s next allocation. When indexing threads produce a burst of cross-thread frees and search threads are simultaneously allocating, search threads periodically stall to process pending cross-thread deallocations from the indexing side. The p99 latency degraded. The median did not. This is the kind of failure that benchmarks miss, because mimalloc-bench runs workloads in isolation, not interleaved.
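The arithmetic behind "p99 degraded, median did not" is worth making concrete. The sketch below uses invented numbers, not Meilisearch measurements: 1,000 requests at 200 microseconds each, with every 50th request paying a hypothetical 5,000 microsecond stall in the mixed case. Percentiles use the nearest-rank method.

```rust
/// Nearest-rank percentile of latency samples (microseconds).
/// `p` is in (0, 100]; returns None for an empty slice.
pub fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Rank is 1-based; clamp guards against rounding at the extremes.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Hypothetical workload: `total` requests at `base_us`, where every
/// `stall_every`-th request pays an extra `stall_us` (0 disables stalls).
pub fn synthetic_latencies(
    total: usize,
    base_us: u64,
    stall_every: usize,
    stall_us: u64,
) -> Vec<u64> {
    (1..=total)
        .map(|i| {
            if stall_every != 0 && i % stall_every == 0 {
                base_us + stall_us
            } else {
                base_us
            }
        })
        .collect()
}
```

With those inputs only 2% of requests stall, so the median stays at 200 microseconds in both runs while the p99 moves from 200 to 5,200. A benchmark that reports only mean or median throughput would barely register the change.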

Mimalloc also offers limited introspection compared to jemalloc. Basic statistics are available but there is no mallctl-equivalent namespace, no custom arena support, and no extent hooks API for advanced integration. When the p99 regression appeared, diagnosing its source required comparing profiles between isolated and mixed workloads rather than reading it out of the allocator directly.

The Criterion That Benchmark Suites Don’t Capture

Meilisearch’s final configuration was jemalloc as the Linux global allocator, bumpalo for short-lived hot paths with enforced per-operation scoping, and mimalloc recommended for Windows builds where fast OS page return matters and the mixed-workload latency issue is less acute.

The result that stands out is not which allocator is fastest. It is that jemalloc’s mallctl namespace was a decisive factor in keeping it as the default despite mimalloc’s better settled RSS. The ability to compute stats.resident / stats.allocated at runtime, expose it as a metric, and correlate it with workload phases transforms a memory dashboard from a mystery signal into an actionable one.

When Rust removed jemalloc as the default global allocator in version 1.32, the stated reasons were binary size, reduced platform porting surface, and the fact that most programs don’t need jemalloc. Those are real concerns. The tradeoff was that services with allocation-heavy workloads lost a default that had decent fragmentation behavior and excellent observability. Opting back in via tikv-jemallocator is not just a performance decision. It is a decision about what you will be able to see when something goes wrong six months after deployment.

The allocator that looks fastest in a benchmark and the allocator that makes your next memory incident resolvable in an hour rather than a week are not always the same allocator. The Meilisearch experiments make that concrete in a way that synthetic benchmarks cannot.