
Caching Without Locks: Using thread_local to Patch Legacy C++ Bottlenecks

Source: isocpp

Working in a codebase you can’t easily refactor is a situation most developers know well. The interfaces are fixed, the call sites are numerous, and somewhere in the middle there’s a function getting called far too often with arguments that produce the same result each time.

Daniel Lemire wrote about this back in December 2025, focusing on a specific pattern: using a thread_local cache to wrap a bottleneck you can’t remove at the source. It’s worth revisiting.

The problem

The scenario Lemire describes is common in legacy C++ code: a function that accepts an index and performs a lookup on a std::map. Maps give you O(log n) lookups, which is fine for occasional access, but if you’re calling that function thousands of times per second with the same inputs, you’re paying that cost repeatedly for no reason.

You might reach for a global cache, but then you need locks. A mutex per lookup is often worse than the original problem.
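For contrast, here is a minimal sketch of the global-cache-with-lock approach the article argues against. The names (lookupByIndexLocked, the std::map<int, int> standing in for the real map type) are illustrative assumptions, not from the original post. Every call, hit or miss, serializes on the mutex.

```cpp
#include <map>
#include <mutex>
#include <unordered_map>

// Shared cache guarded by a mutex: correct, but every thread queues up
// on g_cacheMutex even for cache hits.
static std::unordered_map<int, int> g_cache;
static std::mutex g_cacheMutex;

int lookupByIndexLocked(const std::map<int, int>& m, int index) {
    std::lock_guard<std::mutex> guard(g_cacheMutex);  // serializes all threads
    auto it = g_cache.find(index);
    if (it != g_cache.end()) {
        return it->second;  // hit still pays the lock cost
    }
    int result = m.at(index);  // stand-in for the expensive lookup
    g_cache[index] = result;
    return result;
}
```

Under contention, the lock acquisition can dominate the O(log n) lookup it was meant to hide, which is exactly why it can be worse than no cache at all.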

Where thread_local fits

thread_local storage gives each thread its own instance of a variable. That means you get the caching benefit without any synchronization overhead, because no two threads share the same cache object.

int lookupByIndex(const MyMap& m, int index) {
    // One cache per thread: no locking needed, nothing shared between threads.
    // Requires <unordered_map>.
    thread_local std::unordered_map<int, int> cache;

    auto it = cache.find(index);
    if (it != cache.end()) {
        return it->second;  // cache hit: O(1) average
    }

    int result = expensiveLookup(m, index);  // the original O(log n) map lookup
    cache[index] = result;
    return result;
}

Each thread builds its own cache over time. The first call for a given index pays the full cost; subsequent calls on the same thread hit the local map. No locks, no contention.

What to watch for

Thread-local caches have a few real considerations. Cache invalidation is the obvious one: if the underlying data can change, your cached values become stale. In Lemire’s original context, the data is stable enough that this isn’t a concern, but that assumption doesn’t hold universally.
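If the underlying data can change, one way to keep thread-local caches coherent (my sketch, not something from Lemire's post) is a global generation counter: writers bump an atomic when the data changes, and each thread clears its cache when it notices the counter has moved. All names here are hypothetical.

```cpp
#include <atomic>
#include <unordered_map>

// Bumped by writers whenever the underlying data changes.
static std::atomic<unsigned> g_generation{0};

void invalidateAllCaches() {
    g_generation.fetch_add(1, std::memory_order_relaxed);
}

int cachedLookup(int index, int (*expensive)(int)) {
    thread_local std::unordered_map<int, int> cache;
    thread_local unsigned seenGeneration = 0;

    unsigned current = g_generation.load(std::memory_order_relaxed);
    if (current != seenGeneration) {
        cache.clear();            // data changed since this thread last looked
        seenGeneration = current;
    }

    auto it = cache.find(index);
    if (it != cache.end()) {
        return it->second;
    }
    int result = expensive(index);
    cache[index] = result;
    return result;
}
```

The check costs one relaxed atomic load per call, which is far cheaper than a lock, though threads may briefly serve stale values between the bump and their next lookup.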

There’s also memory usage. Each thread accumulates its own cache independently, so in a system with many threads and a wide key space, you can end up with significant overhead. For workloads with bounded inputs, this is manageable; for unbounded ones, you’d want some form of eviction.
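For unbounded key spaces, a crude but effective eviction policy is to drop the whole per-thread cache once it exceeds a fixed size. This is an assumed sketch (the cap and names are mine); a full reset avoids any LRU bookkeeping at the cost of occasional cold restarts.

```cpp
#include <cstddef>
#include <unordered_map>

// Arbitrary cap chosen for illustration; tune against real memory budgets.
constexpr std::size_t kMaxCacheEntries = 10'000;

int boundedCachedLookup(int index, int (*expensive)(int)) {
    thread_local std::unordered_map<int, int> cache;

    auto it = cache.find(index);
    if (it != cache.end()) {
        return it->second;
    }

    if (cache.size() >= kMaxCacheEntries) {
        cache.clear();  // full reset: bounded memory, zero per-entry metadata
    }
    int result = expensive(index);
    cache[index] = result;
    return result;
}
```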

A thread_local variable is constructed lazily, on each thread's first access, and that per-thread initialization has a cost. For very hot paths where the cache starts cold, it shows up in the numbers. Lemire's benchmarks in the original post make the tradeoffs concrete.

My take

This is the kind of optimization I find genuinely satisfying, not because it’s clever, but because it’s honest about the constraints. You have an interface you can’t change, a bottleneck you can measure, and a targeted fix that doesn’t require restructuring anything. The scope of the change matches the scope of the problem.

In my own work, I’ve used thread_local for per-thread string buffers and locale state, where the alternative would have been either repeated allocation or global locking. The pattern generalizes well anywhere you have repeated, deterministic computation with stable inputs.
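The per-thread buffer pattern mentioned above can be sketched like this; formatId is a hypothetical name for illustration, not code from my projects or Lemire's post.

```cpp
#include <string>

// Reuse one thread_local std::string instead of allocating a fresh
// string on every call. Capacity grows once and is kept across calls.
const std::string& formatId(int id) {
    thread_local std::string buffer;
    buffer.clear();                   // keeps the allocated capacity
    buffer = "id-";
    buffer += std::to_string(id);
    return buffer;                    // valid until this thread calls again
}
```

The returned reference is only safe to use until the same thread calls formatId again, which is the usual caveat with reused buffers.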

If you’re profiling a C++ codebase and finding hotspots in lookup-heavy functions, it’s worth checking whether the call site could benefit from a thread-local cache before reaching for anything more complex.
