
The Copy That Cost Three Times: LMDB's Overflow Pages and the Vector Indexing Tax

Source: lobsters

Meilisearch’s vector storage stack

Meilisearch stores all of its data in LMDB (Lightning Memory-Mapped Database), the embedded B-tree key-value store developed by Howard Chu at Symas Corporation. LMDB underpins not just Meilisearch’s document store and inverted indexes but its vector search capability as well. The vector side is implemented through arroy, a Rust crate Meilisearch maintains for approximate nearest neighbor search. arroy stores its index structure and raw embedding vectors as LMDB key-value pairs, accessed via the heed safe Rust wrapper around the LMDB C API.

Every embedding stored during indexing becomes a value written to LMDB. That is where the problem starts.

How LMDB organizes data on disk

LMDB maps a file directly into the process address space using mmap. The file is organized as fixed-size pages, defaulting to 4096 bytes (the OS page size on most platforms). A B-tree of those pages stores all key-value data: branch pages hold separator keys and child page pointers, leaf pages hold the actual key-value pairs for a range of keys.

When a value is small enough, it lives inline in the leaf page alongside its key and neighboring pairs. Multiple key-value entries share a single 4096-byte page, with offsets tracked in a small header at the top of the page.

When a value cannot fit into the remaining space in a leaf page, LMDB allocates what it calls overflow pages: a consecutive sequence of pages dedicated entirely to that one value. The leaf page stores a page number pointing to the first overflow page rather than the value bytes. The cutoff follows from LMDB's requirement that every page hold at least two entries: a node (key, value, and header) larger than roughly half the usable page space cannot share a page and goes to overflow pages instead. On a 4096-byte page, that inline limit works out to around 2000 bytes.

Embedding vectors are always larger than this threshold. A 768-dimensional float32 vector (BERT, many sentence transformers) is 3072 bytes. OpenAI’s text-embedding-3-large outputs 1536 dimensions, 6144 bytes. Cohere’s Embed v3 models produce 1024 dimensions, 4096 bytes. Voyage AI’s voyage-3-large uses 1024 dimensions by default and supports up to 2048. Every embedding stored in arroy goes through the overflow page path. There is no configuration option that avoids it short of quantizing to lower-precision formats.
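The arithmetic behind those sizes is just the dimension count times the 4-byte width of an f32. A quick sketch, using the model dimensions listed above:

```rust
// An embedding's raw payload is simply dimensions * 4 bytes
// (before any per-entry overhead LMDB adds on top).
fn embedding_bytes(dims: usize) -> usize {
    dims * std::mem::size_of::<f32>()
}

fn main() {
    assert_eq!(embedding_bytes(768), 3072);  // BERT-sized models
    assert_eq!(embedding_bytes(1536), 6144); // text-embedding-3-large
    assert_eq!(embedding_bytes(1024), 4096); // Cohere Embed v3
}
```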

The write path and where the extra copy happens

The normal mdb_put() call works like this for a non-overflow value: LMDB finds or allocates a dirty leaf page in the current write transaction, copies the value bytes into the available space on that page, and updates the page’s entry count and offset table. The page is written on transaction commit.

For overflow values, the steps are:

  1. Calculate how many pages are needed: ceil(value_size / page_size)
  2. Allocate that many consecutive pages from the free list or end-of-file
  3. Copy the value bytes across those pages, one page at a time
  4. Store the first overflow page number in the leaf node

The copy in step 3 is the issue. The application holds a buffer containing the serialized vector. LMDB copies from that buffer into its memory-mapped pages. The buffer is then freed. For a single 6144-byte vector this is fast. At indexing time, Meilisearch processes thousands or hundreds of thousands of embeddings, and the cost of allocating a buffer, serializing the vector into it, copying it into LMDB’s pages, and freeing the buffer repeats for every one of them. That accumulates.

MDB_RESERVE: the zero-copy escape hatch

LMDB provides a mechanism to eliminate this copy: the MDB_RESERVE flag on mdb_put(). Instead of passing a pointer to data, you pass a size. LMDB allocates the space, sets data.mv_data to point into the allocated region, and returns. You then write into that pointer directly.

MDB_val key, data;
key.mv_data = "some_key";
key.mv_size = 8;
data.mv_size = 6144;

/* LMDB allocates the space and fills in data.mv_data */
int rc = mdb_put(txn, dbi, &key, &data, MDB_RESERVE);
if (rc != MDB_SUCCESS)
    return rc; /* e.g. MDB_MAP_FULL */

/* Write directly into LMDB's memory-mapped pages */
memcpy(data.mv_data, source_vector, 6144);

With this approach, the intermediate application buffer disappears. If the code generating the embedding can write into an arbitrary output pointer, the data goes straight from its origin into LMDB’s pages without ever being staged in a separate allocation. For code that builds embeddings in-place (packing float values into a buffer sequentially), you can point it directly at data.mv_data and skip even the final memcpy.

The problem, documented by the Meilisearch team, is that MDB_RESERVE did not work correctly for overflow pages in the version of LMDB they were using. The flag’s semantics applied correctly for values that fit into normal leaf pages but broke down when the value required the overflow allocation path. The pointer returned did not point into the correctly mapped region, or the implementation performed a second copy despite the flag being set, negating the optimization entirely. Every vector write was going through the slow path regardless of whether MDB_RESERVE was specified.

The fix was a patch to LMDB’s C source to correctly propagate MDB_RESERVE semantics through the overflow page code path. With the patch applied, writing a 6144-byte vector does a single allocation at the page level, and the application writes directly into the mapped address. The result was roughly a 3x improvement in vector indexing throughput.

On the Rust side, with heed

heed exposes MDB_RESERVE through its write API. The ergonomics require knowing the value size before calling, which is straightforward for fixed-size embedding formats: dimensions * mem::size_of::<f32>(). Once you have a mutable byte slice pointing into LMDB’s pages, you can write a vector directly. The method name and signature below are illustrative; they vary across heed versions, and recent versions express this as a put_reserved call whose callback receives a ReservedSpace rather than returning a slice:

// Reserve space; heed hands back writable bytes inside LMDB's
// mmap'd region (method shape illustrative, see note above)
let reserved: &mut [u8] = txn.put_reserve(&db, &key, byte_len)?;

// Cast and write without an intermediate Vec<u8>
let floats: &mut [f32] = bytemuck::cast_slice_mut(reserved);
floats.copy_from_slice(&embedding);

bytemuck makes the cast safe as long as the slice is aligned for f32, and it panics rather than misbehaving if that assumption is ever violated. For overflow values the alignment holds: the value begins just past the page header of a page-aligned overflow page, comfortably beyond f32’s 4-byte requirement. This pattern eliminates both the allocation and the copy that the standard put path would incur.
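If you would rather not rely on alignment at all, or want the on-disk byte order pinned down explicitly, the standard library alone can fill the reserved bytes. A minimal sketch, with a plain byte slice standing in for the reserved region:

```rust
// Fill a reserved byte region with f32 values without any pointer
// cast: no alignment assumption, and the byte order is explicit.
fn write_embedding(dest: &mut [u8], embedding: &[f32]) {
    assert_eq!(dest.len(), embedding.len() * 4);
    for (chunk, &f) in dest.chunks_exact_mut(4).zip(embedding) {
        chunk.copy_from_slice(&f.to_le_bytes());
    }
}

fn main() {
    let embedding = [1.0f32, -2.5, 0.25];
    let mut dest = vec![0u8; 12]; // stand-in for the reserved slice
    write_embedding(&mut dest, &embedding);
    assert_eq!(&dest[0..4], &1.0f32.to_le_bytes());
}
```

The per-chunk copy compiles down to straight stores, so on little-endian targets this costs about the same as the memcpy it replaces.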

Why this surface area was invisible for so long

General-purpose key-value workloads rarely exercise the overflow page path. Configuration data, inverted index postings, document metadata, session state: most of these values are tens or hundreds of bytes, well within the leaf page capacity. LMDB’s overflow path has existed since the beginning, but it was never a hot path for the users who drove LMDB’s development and testing.

OpenLDAP uses LMDB as its database backend, and directory entries are compact. Embedded applications using LMDB for local storage work with small records. The MDB_RESERVE + overflow combination was a code path that almost nobody exercised frequently enough to notice the bug.

Vector stores changed that entirely. Meilisearch’s arroy stores every indexed embedding as an LMDB value, and every one of those values is an overflow value. A dataset of one million documents with 1536-dimensional embeddings is one million overflow page writes. The bug that was invisible in traditional workloads was a dominant cost here.
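The scale is easy to put numbers on. A back-of-envelope sketch, assuming 4096-byte pages, float32 storage, and the simplified per-value page count from earlier (the real total is slightly higher once page headers and B-tree nodes are counted):

```rust
const PAGE_SIZE: u64 = 4096;

// Total overflow pages consumed by raw embeddings for a corpus.
fn corpus_pages(docs: u64, dims: u64) -> u64 {
    let bytes_per_vec = dims * 4;
    docs * bytes_per_vec.div_ceil(PAGE_SIZE)
}

fn main() {
    // One million 1536-dim vectors: 2 overflow pages each.
    let pages = corpus_pages(1_000_000, 1536);
    assert_eq!(pages, 2_000_000);
    // ~8.2 GB of raw vector pages before any ANN tree or metadata.
    assert_eq!(pages * PAGE_SIZE, 8_192_000_000);
}
```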

This is a broader pattern in storage engine development. RocksDB recognized the same pressure from large values and built BlobDB, which separates values above a configurable size threshold from the LSM-tree and stores them in separate blob files. This avoids write amplification from compaction on large values, but it adds complexity and read path overhead. LMDB’s architecture is different enough that there is no direct equivalent: the B-tree structure does not have compaction, and overflow pages are a natural extension of the existing model. The fix was to make the existing mechanism work correctly rather than to add a new storage strategy.

What the 3x number means in practice

A 3x throughput improvement on vector indexing is significant because vector indexing is the slow part of building a Meilisearch index with embeddings enabled. Keyword indexing involves tokenization, normalization, and inverted index construction. Vector indexing involves all of that plus storing the raw embedding and updating the ANN data structure. Making the storage layer 3x faster on the dominant workload shifts where the time goes and enables indexing larger datasets in reasonable time without scaling hardware.

The improvement came from fixing a bug, not from an algorithmic change. arroy’s tree-based ANN structure, the serialization format, the Rust abstractions in heed and arroy: none of it changed. One incorrect behavior in LMDB’s C source was silently imposing a 3x penalty on every vector write.

For anyone building on embedded storage engines for vector or other large-value workloads: the engine’s behavior under normal workloads, with small records, is not a reliable guide to its behavior with embedding-sized data. The overflow path is a different code path, and it may have different characteristics, different bugs, and different optimization history than the common case. That gap is worth testing explicitly before committing to an architecture.
