
When the Bump Allocator Won't Let Go: Lessons from Meilisearch's Three-Allocator Test

Source: lobsters

Meilisearch is one of the more technically serious Rust projects in production, not because it does anything exotic with unsafe code, but because it runs a genuinely demanding workload. A search engine at Meilisearch’s scale has to process document ingestion at high throughput, maintain in-memory data structures for LMDB-backed indexes, and serve queries with tight latency targets, all in a single long-running process. Memory management matters here in a way it does not for a short-lived CLI tool or a web service that restarts every deploy.

Clément Renault’s recent deep-dive into Meilisearch’s allocator choices is valuable precisely because it comes from someone who actually watched RSS grow in ways that did not match expectations. The three allocators examined, jemalloc, bumpalo, and mimalloc, each represent a distinct theory of how memory should be managed, and their behavior in a search engine exposes those theoretical differences in a concrete and instructive way.

Why the Default Allocator Fails Here

Rust on Linux ships with the system allocator by default, which on glibc systems means ptmalloc2. ptmalloc2 is adequate for many workloads, but it fragments badly under the kind of access pattern a search engine produces.

The indexing pipeline for a search engine like Meilisearch involves millions of small, mixed-lifetime allocations: string buffers for term normalization, vectors for posting lists, hash maps for deduplication and scoring. These are allocated and freed in overlapping, non-LIFO order across multiple threads. ptmalloc2 handles this with size-class bins and per-thread arenas, but fragmentation worsens steadily over hours of mixed traffic. You end up with a heap full of small holes that can satisfy future small allocations but cannot coalesce into large contiguous blocks. RSS grows well past the working set and stays there.

This is the core reason Meilisearch moved to jemalloc, and why that choice has held up. jemalloc’s extent-based allocation, where memory is managed in large extents subdivided and tracked with careful per-size-class metadata, produces far lower fragmentation on long-running, mixed-size workloads. The tikv-jemallocator crate wires it in as Rust’s global allocator with a small amount of boilerplate:

use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

After that, all Box, Vec, String, HashMap, and anything else that allocates on the heap routes through jemalloc. On indexing workloads with millions of small allocations per batch, the RSS difference between jemalloc and the system allocator is measurable enough to matter operationally.
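To make the routing concrete, here is a minimal sketch of the same #[global_allocator] hook. It wraps the standard library’s System allocator instead of jemalloc so that it builds without external crates. CountingAlloc and total_allocated are hypothetical names introduced for illustration; they are not part of tikv-jemallocator.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards every request to the system allocator
/// and keeps a running total of the bytes requested.
struct CountingAlloc;

static TOTAL_ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        TOTAL_ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// The same attribute the jemalloc snippet uses: from here on, Box, Vec,
// String and friends all go through `CountingAlloc`.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Cumulative bytes requested since process start (never decreases).
fn total_allocated() -> usize {
    TOTAL_ALLOCATED.load(Ordering::Relaxed)
}
```

Reading total_allocated() before and after building a Vec shows the allocation passing through the wrapper. Swapping in a different allocator works the same way: only the type behind the static changes.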

Where Bumpalo Enters

Bumpalo is a different class of allocator entirely. The bumpalo crate provides an arena allocator meant for use by one thread at a time (a Bump can be sent to another thread but not shared between threads): you create a Bump arena, allocate objects into it with lifetimes tied to the arena, and free everything at once when the arena drops. The backing mechanism is a linked list of chunks; when a chunk fills, a new one is requested from the global allocator, jemalloc in Meilisearch’s case, and allocation continues.

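use bumpalo::Bump;

fn main() {
    // One arena; every allocation below borrows from it.
    let bump = Bump::new();
    let x = bump.alloc(42u32);
    let terms = bump.alloc_slice_copy(&["hello", "world"]);
    assert_eq!((*x, terms.len()), (42, 2));
    // everything is freed (without running Drop impls) when `bump` drops at end of scope
}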

For parsing and intermediate representation tasks in an indexing pipeline, bumpalo is a legitimate choice. You build a parse tree, walk it to construct index structures, then drop the arena. The per-allocation cost is near zero: no free-list lookup, no per-allocation metadata, just a pointer increment bounded by chunk capacity. For temporary data structures that all share the same lifetime, it is the right tool.

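The design limit is that an arena only grows while it is in use. Chunks obtained from the underlying allocator stay allocated until the Bump is reset or dropped, no matter how many of the objects in them are dead. Calling bump.reset() ends all object lifetimes and permits fresh allocations, but it does not empty the arena: per bumpalo’s documentation, reset() returns excess chunks to the global allocator and keeps the current chunk, which is normally the largest because chunk sizes grow as the arena fills. A large part of the arena’s RSS contribution therefore survives a reset.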
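The pointer-increment idea can be sketched in a few lines. This is a toy, not bumpalo’s implementation: ToyBump is a hypothetical type that manages a single fixed-size chunk and hands out offsets rather than references, which keeps the example free of unsafe code. Alignment is computed relative to the start of the buffer.

```rust
/// Toy single-chunk bump allocator (hypothetical, for illustration only).
struct ToyBump {
    buf: Vec<u8>,
    next: usize, // offset of the first unused byte
}

impl ToyBump {
    fn with_capacity(bytes: usize) -> Self {
        ToyBump { buf: vec![0; bytes], next: 0 }
    }

    /// Reserve `size` bytes aligned to `align` (a power of two).
    /// Returns the start offset, or `None` when the chunk is full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round `next` up to the requested alignment.
        let start = self.next.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // a real arena would request a new chunk here
        }
        self.next = end; // the whole "allocation" is this one store
        Some(start)
    }

    /// Mass-free: earlier offsets become invalid, the buffer is kept.
    fn reset(&mut self) {
        self.next = 0;
    }
}
```

There is no free list and no per-object header, which is where the speed comes from. It is also why nothing can be given back piecemeal: the only ways to release space are to reset the offset or to drop the whole buffer.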

Why This Looks Like a Leak

In a long-running service, the distinction between “memory allocated but available for future use” and “memory allocated and unreachable” matters for monitoring and capacity planning. RSS-based memory tracking, the metric most production dashboards expose, cannot distinguish the two.

Consider a pattern that appears in indexing workers: a Bump arena is held across multiple indexing batches and reset between them to avoid repeated chunk allocation overhead. On the first batch, the arena grows to, say, 150 MB of backing chunks as it processes a large document set. After reset(), the arena keeps its current chunk, normally the largest, so a large part of that 150 MB stays reserved inside the arena. The next batch draws from the retained chunk and allocates new chunks only if it needs more room than that. Over several batches, the arena’s retained capacity tracks the largest single batch it has ever processed rather than the current load.
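A rough worked example of that high-water-mark effect follows. It rests on a simplifying assumption: the capacity an arena keeps across a reset equals the largest demand it has seen so far. Real chunk sizing is coarser. The function name and the batch sizes are made up for illustration.

```rust
/// Reserved arena capacity (in MB) after each batch when one arena is
/// kept alive and reset between batches. Assumption: capacity grows only
/// when a batch needs more than the arena already holds, and it never
/// shrinks.
fn reserved_when_reused(batch_demands_mb: &[u64]) -> Vec<u64> {
    let mut reserved = 0;
    batch_demands_mb
        .iter()
        .map(|&demand| {
            reserved = reserved.max(demand);
            reserved
        })
        .collect()
}
```

For batches needing 20, 150, 30, and 10 MB, the model reports 20, 150, 150, and 150 MB reserved. The last two batches are small, yet the arena still holds what the second batch required.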

From a memory dashboard’s perspective, this looks like a leak. Nothing is unreachable; the chunks are owned by the arena and valid. But nothing is returned to the operating system either. The process RSS reflects the peak arena usage across all batches, not the current allocation load.

Rust’s borrow checker enforces that you do not use memory after freeing it. It does not enforce that you free memory in a timely fashion. The compiler is satisfied as long as the arena outlives everything allocated from it. What the arena does with its backing chunks once objects are dropped is not a safety question, so the type system has no opinion on it.

The fix is straightforward but requires deliberate attention: drop and recreate the Bump instead of resetting it, or structure work so arenas live only within the scope of a single batch. Neither is complicated, but neither is automatic.
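The two shapes can be compared with a standard-library stand-in. This is an analogy rather than bumpalo code: a Vec<u8> scratch buffer plays the role of the arena, and Vec::clear(), which is documented to leave capacity untouched, plays the role of a reset. Both function names are hypothetical.

```rust
/// Held-and-cleared scratch buffer: `clear()` drops the contents but
/// keeps the allocation, much like resetting a long-lived arena.
fn capacities_when_reused(batches: &[usize]) -> Vec<usize> {
    let mut scratch: Vec<u8> = Vec::new();
    let mut seen = Vec::new();
    for &len in batches {
        scratch.clear();
        scratch.resize(len, 0);
        seen.push(scratch.capacity());
    }
    seen
}

/// Per-batch scratch buffer: the allocation is released when `scratch`
/// goes out of scope at the end of each loop iteration.
fn capacities_when_scoped(batches: &[usize]) -> Vec<usize> {
    let mut seen = Vec::new();
    for &len in batches {
        let scratch = vec![0u8; len];
        seen.push(scratch.capacity());
    }
    seen
}
```

With batches of 1,000, 1,000,000, and 10 bytes, the reused buffer still holds at least a megabyte during the third batch, while the scoped buffer holds only what that batch needs. The scoped version pays for a fresh allocation per batch, which is the overhead the hold-and-reset pattern was trying to avoid.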

mimalloc as the Contender

Microsoft Research’s mimalloc, published in 2019, occupies a different point in the design space from jemalloc. Its core insight, described in detail in the original paper, is to shard free lists per page rather than keeping one per size class. Each logical 64 KB page belongs to a single thread and holds blocks of one size class. Within a page there are three lists: a free list that serves allocation requests, a local-free list for blocks freed by the owning thread, and a thread-free list onto which other threads push remote frees with an atomic operation instead of a lock. Allocating from the free list and freeing to the local-free list require no synchronization at all.
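How blocks move between the per-page lists can be sketched with a single-threaded model. This is not mimalloc’s code: Page and its methods are hypothetical, block ids stand in for addresses, and the list that other threads push to is a plain Vec here, whereas the real allocator uses an atomic push.

```rust
/// Single-threaded model of one page holding blocks of one size class.
struct Page {
    free: Vec<usize>,        // serves allocation requests
    local_free: Vec<usize>,  // blocks freed by the owning thread
    thread_free: Vec<usize>, // blocks freed by other threads
}

impl Page {
    fn new(blocks: usize) -> Self {
        Page {
            // Reversed so that block 0 is handed out first.
            free: (0..blocks).rev().collect(),
            local_free: Vec::new(),
            thread_free: Vec::new(),
        }
    }

    /// Fast path: pop from `free`. Slow path: when `free` is empty,
    /// collect both deferred lists into it and retry.
    fn alloc(&mut self) -> Option<usize> {
        if self.free.is_empty() {
            self.free.append(&mut self.local_free);
            self.free.append(&mut self.thread_free);
        }
        self.free.pop()
    }

    fn free_local(&mut self, block: usize) {
        self.local_free.push(block);
    }

    fn free_remote(&mut self, block: usize) {
        self.thread_free.push(block);
    }
}
```

The point of the split is that the common operations, allocating and freeing on the owning thread, touch only lists that no other thread ever writes to. Freed blocks become allocatable again only when the allocation list runs dry and the deferred lists are collected.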

The mimalloc Rust crate wires this in as a global allocator in the same way as tikv-jemallocator. On pure allocation-throughput benchmarks, mimalloc often matches or edges past jemalloc: the original paper reports it as 7% faster than TCMalloc on redis and 14% faster than jemalloc on cfrac. For a workload that looks like a burst of allocations followed by a burst of frees, mimalloc’s page-local free lists reduce contention effectively.

What mimalloc does not have is jemalloc’s observability infrastructure. jemalloc exposes a deep mallctl API that lets you query live allocation stats, trigger heap profile dumps in jeprof-compatible format, and inspect arena state at runtime. When memory usage behaves unexpectedly in a jemalloc process, you can interrogate it without restarting. mimalloc’s statistics interface is more limited. For a production search engine where memory behavior needs to be diagnosed under live traffic, that observability gap matters in a way that benchmark throughput numbers do not capture.

Fragmentation behavior over extended runtimes is also worth scrutinizing. mimalloc performs well on benchmark suites designed around typical application allocation patterns, but a search engine running for days combines indexing spikes, query traffic, LMDB memory-mapped I/O, and background compaction. The fragmentation properties of any allocator are harder to predict on that kind of long-duration mixed workload than on a structured benchmark.

The Split-Allocator Pattern

What Meilisearch’s experiments point toward is not a single allocator that wins uniformly. The productive outcome is a split. jemalloc serves as the global allocator, because its fragmentation behavior and observability hold up over extended operation with mixed traffic. bumpalo is used carefully for well-scoped intermediate work, where the arena is dropped explicitly at the end of each logical unit of work rather than being held and reset.

This split is natural to express in Rust. The global allocator handles all heap allocation by default; bumpalo arenas are explicit opt-ins at specific call sites. The required discipline is the same discipline good systems code requires in any context: identify the lifetime of your data, and match your allocation strategy to it. An indexing batch is a natural scope for a bumpalo arena; a worker thread that processes many batches is not, unless the arena is scoped inside the per-batch work and dropped before the next batch begins.

mimalloc deserves consideration if allocation throughput is the dominant bottleneck and the service has less need for runtime memory diagnostics. For a persistent service that needs to be understood and debugged in production, jemalloc’s instrumentation is a concrete advantage.

The “leaky” label in Renault’s article title is accurate but not alarming. Bumpalo arenas are not leaking in the safety sense; no memory is unreachable. They are retaining backing chunks that will eventually be freed when the arena drops. The distinction matters because the right response is structural: scope arenas to match the actual lifetime of the work they support, rather than holding them long and relying on reset() to keep costs down. In a search engine that processes batches with varying sizes, that structure also protects against unbounded arena growth driven by occasional large batches setting a high-water mark that the arena never relinquishes.
