Dirty, Muzzy, and Retained: What jemalloc Knows About Your Memory That RSS Doesn't
Source: hackernews
A service’s RSS is 3.2 GB. A heap dump shows 900 MB of live objects. The allocator is holding the other 2.3 GB, and understanding exactly where that memory is, and why, requires knowing how jemalloc models memory state internally.
Most developers’ mental model of heap memory has two states: allocated (in use by the application) and freed (returned to the OS). jemalloc’s internal model is more granular than that, and the difference matters for diagnosing memory behavior in production.
The Five States of a Freed Allocation
When your code calls free() or drops an allocation in Rust, jemalloc does not immediately call munmap or madvise. The freed memory passes through several states:
tcache: The per-thread cache holds recently freed small objects in a LIFO stack per size class. A freed object may sit here until the thread’s cache overflows or the GC cycle runs (every 512 allocation events by default). No lock acquired; no arena involved. This is purely per-thread state.
Dirty: Once an object is returned to the arena (either directly for large allocations, or via a tcache flush for small ones), the backing pages enter a dirty state. jemalloc holds them as available for future allocations. Dirty pages are counted in stats.resident but not in stats.allocated. The OS sees them as used.
Muzzy: After dirty_decay_ms milliseconds (10 seconds by default), dirty extents receive MADV_FREE on Linux. This tells the kernel it can reclaim the pages under memory pressure, but the virtual address range stays mapped. Muzzy pages may still show in RSS depending on kernel reclaim activity.
Retained: After muzzy_decay_ms (another 10 seconds by default), extents receive MADV_DONTNEED, which drops the pages outright rather than merely hinting; the next access refaults them as zero-filled pages. RSS drops at this point, though the virtual address range remains mapped for potential reuse.
Returned: Eventually the virtual address range itself may be released via munmap, though jemalloc tends to retain mapped ranges for performance.
The background thread introduced in jemalloc 5.0 runs the dirty-to-muzzy-to-retained transitions asynchronously, so this purge work doesn’t land on allocation threads as latency spikes.
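Background purging is opt-in. Enabling it is a one-line configuration change (option names from the jemalloc 5.x MALLOC_CONF interface; the thread count here is an arbitrary example):

```shell
# Enable background purge threads (off by default in upstream builds)
MALLOC_CONF="background_thread:true,max_background_threads:2" ./your-service
```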
Reading the Stats
The four most useful mallctl statistics correspond directly to these states. In Rust, using the tikv-jemalloc-ctl crate:
use tikv_jemalloc_ctl::{epoch, stats};

// Assumes jemalloc is the process allocator, e.g. registered with
// #[global_allocator] via the tikv-jemallocator crate.
fn heap_stats() {
    // jemalloc caches stats internally; advance the epoch to refresh
    epoch::mib().unwrap().advance().unwrap();

    let allocated = stats::allocated::mib().unwrap().read().unwrap();
    let active = stats::active::mib().unwrap().read().unwrap();
    let resident = stats::resident::mib().unwrap().read().unwrap();
    let retained = stats::retained::mib().unwrap().read().unwrap();

    println!("allocated: {:.1} MB", allocated as f64 / 1_000_000.0);
    println!("active:    {:.1} MB", active as f64 / 1_000_000.0);
    println!("resident:  {:.1} MB", resident as f64 / 1_000_000.0);
    println!("retained:  {:.1} MB", retained as f64 / 1_000_000.0);
    println!(
        "overhead:  {:.1} MB (resident - allocated)",
        (resident - allocated) as f64 / 1_000_000.0
    );
}
The epoch advance is easy to forget and produces stale results if omitted. In C:
#include <stdint.h>
#include <stdio.h>
#include <jemalloc/jemalloc.h>

void heap_stats(void) {
    // Writing any value to "epoch" refreshes the cached statistics.
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);
    je_mallctl("epoch", &epoch, &sz, &epoch, sz);

    size_t allocated, active, resident, retained;
    sz = sizeof(size_t);
    je_mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    je_mallctl("stats.active", &active, &sz, NULL, 0);
    je_mallctl("stats.resident", &resident, &sz, NULL, 0);
    je_mallctl("stats.retained", &retained, &sz, NULL, 0);

    printf("allocated: %.1f MB\n", allocated / 1e6);
    printf("active:    %.1f MB\n", active / 1e6);
    printf("resident:  %.1f MB\n", resident / 1e6);
    printf("retained:  %.1f MB\n", retained / 1e6);
    printf("overhead:  %.1f MB\n", (double)(resident - allocated) / 1e6);
}
The interpretive key:
stats.allocated: bytes in live allocations; what your application's data structures actually occupy.
stats.active: allocated bytes plus alignment padding, rounded up to whole pages; slightly higher than allocated.
stats.resident: actual RSS from jemalloc-managed memory; includes dirty and muzzy pages.
stats.retained: virtual memory whose pages have been returned to the OS via madvise but whose address ranges jemalloc has kept mapped for reuse.
The number to watch in production is resident - allocated. When this is stable, decay is keeping pace with the workload’s freed-object rate. When it trends upward continuously, freed pages are accumulating faster than they decay, eventually pushing RSS into OOM territory. When it spikes and drops rhythmically, you have a bursty workload whose decay settings don’t match the batch cadence.
Emit these as metrics. They tell you things about your service that nothing else can.
Decay Configuration for Real Workloads
Default decay timers (10 seconds dirty, 10 seconds muzzy) were calibrated for services with fairly steady allocation rates. Two common workloads break that assumption.
ML inference and batch processing: A batch arrives, memory spikes for object construction and intermediate results, the batch completes, everything should free cleanly. With default timers, pages from that batch stay resident for up to 20 seconds after the batch is done. If batches arrive every 5 seconds, you spend most of your time at peak RSS even when the service is effectively idle between batches. Shorter timers help:
MALLOC_CONF="dirty_decay_ms:3000,muzzy_decay_ms:3000" ./inference-service
But faster decay introduces a tradeoff. If the next batch arrives after its predecessor's pages have already been purged, the batch's first allocations must refault those pages, adding latency to its start. The correct value depends on the batch interval and how latency-sensitive the cold-start is. It is worth measuring both RSS and batch latency under several settings rather than picking a number from intuition.
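A sweep along these lines is straightforward to script. In this sketch, ./inference-service --bench stands in for your own workload, and /usr/bin/time -v (GNU time) reports peak RSS per run:

```shell
# Sweep decay settings; compare peak RSS and batch latency per run.
# ./inference-service --bench is a placeholder for your own benchmark.
for decay_ms in 1000 3000 10000; do
  echo "== dirty/muzzy decay: ${decay_ms} ms =="
  MALLOC_CONF="dirty_decay_ms:${decay_ms},muzzy_decay_ms:${decay_ms}" \
    /usr/bin/time -v ./inference-service --bench 2>&1 |
    grep 'Maximum resident set size'
done
```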
Latency-critical services that cannot tolerate variance: Some services prioritize tail latency consistency over RSS efficiency. Decay and the subsequent page faults are a source of latency jitter. Disabling decay entirely trades RSS for predictability:
MALLOC_CONF="dirty_decay_ms:-1,muzzy_decay_ms:-1" ./latency-service
With decay disabled, freed pages stay resident until explicitly reclaimed, and allocations always return pages that are already warm in the TLB. This works if the service has a predictable steady-state memory footprint. For services whose working set grows over the process lifetime, it’s a path to OOM.
This is precisely the problem Meta’s renewed jemalloc investment addresses under the “decay tuning for bursty workloads” area. The current design uses global decay parameters; different components of the same process (request handling, background work, ML batches) may need different settings. Per-arena decay parameters, part of what Meta is building, would let a single process apply appropriate decay policies to different allocation domains independently.
The Profiler: jemalloc’s Most Underused Feature
Meta’s announcement lists profiling improvements as one of its four investment areas, and this is the part with the broadest practical impact beyond large fleet operators.
jemalloc’s sampling-based heap profiler can be activated at runtime without recompiling your application, provided the jemalloc library itself was built with --enable-prof. The profiler intercepts allocation calls and records call stacks for a sampled subset, controlled by lg_prof_sample (sample one allocation per 2^N bytes of cumulative allocations):
# Profile with ~512 KB average sampling interval, dump profile on exit
MALLOC_CONF="prof:true,prof_prefix:/tmp/heap,lg_prof_sample:19,prof_final:true" \
./your-service
The output is pprof-compatible. Analyzing it with jeprof or pprof:
jeprof --show_bytes --pdf ./your-service /tmp/heap.0.heap > heap.pdf
This gives you a call graph weighted by which allocation sites are responsible for the most sampled live bytes. For finding what’s holding memory in a C++ service, it’s the most tractable tool available. The sampling overhead at lg_prof_sample:19 is low enough that many teams run it continuously in production on a subset of servers.
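If you have Google's pprof installed, the same dump can be explored interactively; pprof understands the legacy heap format jeprof consumes (command sketch, flag from pprof's own documentation):

```shell
# Interactive web UI with flame graph, top, and source views
pprof -http=:8080 ./your-service /tmp/heap.0.heap
```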
The prof_recent API, introduced in jemalloc 5.x and exposed under the experimental mallctl namespace, extends this toward always-on production sampling. Rather than a cumulative profile that grows unboundedly, prof_recent maintains a fixed-size ring buffer of the most recent allocation stacks:
// Configure the ring buffer size at startup (ssize_t; -1 = unlimited)
ssize_t recent_alloc_max = 1000;
je_mallctl("experimental.prof_recent.alloc_max",
           NULL, NULL,
           &recent_alloc_max, sizeof(recent_alloc_max));
// Dump recent allocations to a file for inspection
const char *dump_path = "/tmp/prof_recent.json";
je_mallctl("experimental.prof_recent.alloc_dump",
           NULL, NULL,
           &dump_path, sizeof(dump_path));
This gives a bounded-overhead snapshot of what the allocator has been doing recently, without the cost of a full heap profile. Meta’s investment includes extending this API with richer query capabilities and lower sampling overhead, moving toward practical always-on production profiling.
The broader vision is heap observability that integrates naturally with the rest of a service’s metrics pipeline. Right now, using jemalloc’s profiling effectively requires knowing which mallctl keys exist, remembering to advance the epoch before reading stats, and understanding the allocator’s internal terminology. Meta’s improvements should make this more accessible without requiring deep allocator expertise.
Why This Matters Beyond Meta
The memory state model and profiling capabilities described here are available in current jemalloc (5.3.0 as of writing) and work today in any C, C++, or Rust service that links the library. Meta’s work will improve the tooling around these features, but the fundamentals are already there.
For Rust, tikv-jemallocator and tikv-jemalloc-ctl provide idiomatic access. Redis documentation has recommended jemalloc for years because of its fragmentation characteristics. RocksDB users benefit directly from better decay tuning. The ecosystem depending on jemalloc is wide, and improvements to observability and configuration flow through all of it once they land in the upstream repository.
The immediate practical value for most developers is simpler: add the four stat queries to your metrics collection, watch the resident - allocated gap, and set decay parameters based on your actual workload profile. The allocator has been tracking this information the whole time.