When Your Abstraction Becomes an Optimization Wall: std::function in Tight Loops

Source: isocpp

Daniel Lemire’s writeup on function call overhead uses a clear example: a plain function calling another plain function, and the overhead this introduces in a tight loop. That scenario is easy to reason about. The harder case in production C++ is when you have introduced an abstraction that creates an identical opacity boundary without ever explicitly writing a call to an unknown function.

std::function<void(float)> is the most common example. It wraps any callable behind a type-erased interface. The convenience is real. So is the cost, and the cost is less visible than it should be.

What std::function Does to the Optimizer

At the implementation level, std::function uses type erasure to store a callable of any concrete type behind a uniform interface. A small callable may be stored directly inside the std::function object via the small buffer optimization; larger callables are heap-allocated. In both cases, invocation goes through a stored function pointer or an equivalent indirect dispatch mechanism that the optimizer at the call site cannot see through.

The consequence is identical to the non-inlined call scenario Lemire describes. The compiler must treat the invocation as opaque: the stored callable might read and write arbitrary memory, might have side effects, and cannot be reordered with the surrounding code. Auto-vectorization requires visibility into the entire loop body to confirm that iterations are independent and that the operation maps to a SIMD instruction. A call through std::function blocks that analysis entirely.

// This loop will not be auto-vectorized
std::function<float(float)> transform = [](float x) { return x * 2.5f; };
for (int i = 0; i < n; i++) {
    arr[i] = transform(arr[i]);
}

// This loop will be auto-vectorized
auto transform = [](float x) { return x * 2.5f; };
for (int i = 0; i < n; i++) {
    arr[i] = transform(arr[i]);
}

In the first version, the lambda’s type is erased when stored in std::function. At the call site, the compiler sees an indirect invocation through an opaque pointer. In the second version, auto deduces the lambda’s unique concrete type, making the full body visible to the compiler at the call site.

With -O3 -mavx2, the second loop compiles to VMULPS instructions processing eight floats at a time. The first generates a call instruction per element. You can verify this on Godbolt Compiler Explorer in a few minutes by toggling between the two forms and comparing the disassembly.

The raw overhead of a std::function call runs roughly 5 to 15 nanoseconds for a tiny callable, combining the indirect dispatch cost with the destroyed inlining opportunity. That is already comparable to or worse than the non-inlined direct call Lemire benchmarks. But the larger loss is SIMD: a scalar loop handling one element per iteration versus an AVX2 loop handling eight is the real gap between the two code paths, and std::function forecloses the vectorized option regardless of the callable’s complexity.

Templates Preserve the Full Optimization Surface

C++ function templates require their definitions to be visible at the point of instantiation, which in practice means appearing in headers. The concrete callable type is fully known when the template is instantiated, and the compiler has the body available to inline. For small callables, automatic inlining at -O2 handles this without further annotation; the same cost-model thresholds that govern ordinary function inlining apply here.

template <typename F>
void apply(float* arr, int n, F transform) {
    for (int i = 0; i < n; i++) {
        arr[i] = transform(arr[i]);
    }
}

// Caller:
apply(arr, n, [](float x) { return x * 2.5f; });

Here F is deduced as the lambda’s concrete type. The loop compiles with the lambda body visible, the vectorizer proceeds, and the abstraction compiles away. No heap allocation, no indirect call, no opaque boundary.

C++20 concepts let you constrain the type parameter without changing the performance profile:

#include <concepts>

template <std::invocable<float> F>
void apply(float* arr, int n, F transform) {
    for (int i = 0; i < n; i++) {
        arr[i] = transform(arr[i]);
    }
}

The concept documents the interface contract and produces better compiler error messages when the constraint is not met. The instantiation behavior is identical. Each distinct callable type passed to apply produces a separate instantiation, which is the tradeoff: monomorphization increases binary size, and for a function called with many different lambda types across a large codebase, the code size growth can create icache pressure of its own. The compiler’s cost model accounts for this; for very large callables, it may decline to inline even when the type is fully visible.

Virtual Dispatch After Spectre

Virtual functions are the other common source of an opaque call boundary. The dispatch sequence (loading the vptr from the object, loading the function pointer from the vtable slot, then executing an indirect call) costs 5 to 10 cycles over a direct call on a warm cache. That is comparable to the call overhead Lemire quantifies for non-inlined direct calls.

Since the Spectre and Meltdown disclosures in January 2018, indirect branches carry an additional cost on hardened systems. Retpoline mitigations, standard in the Linux kernel and present in various user-space security-focused builds, replace indirect branches with a serializing sequence that blocks speculative execution through the branch target. Under retpoline, each virtual call costs 30 to 80 cycles regardless of cache state or prediction accuracy. Agner Fog’s microarchitecture documentation covers the per-microarchitecture breakdown. A hot loop over a polymorphic interface in a retpoline-patched binary can lose the majority of its throughput to mitigation overhead.

Devirtualization avoids this entirely when the compiler can prove the concrete type at the call site. The final specifier is the clearest signal:

class FastTransform final : public Transform {
    float apply(float x) override { return x * 2.5f; }
};

void process(FastTransform* t, float* arr, int n) {
    for (int i = 0; i < n; i++) {
        arr[i] = t->apply(arr[i]);  // devirtualized: FastTransform is final
    }
}

With final, the indirect dispatch disappears. The method call resolves to a direct call, which the compiler can then inline and vectorize through the usual path. Link-time optimization extends devirtualization across translation unit boundaries via whole-program type analysis; if LTO determines only one concrete implementation of a virtual method is visible in the entire binary, it will devirtualize speculatively with a runtime type guard.

Where to Draw the Line

None of this is an argument against std::function or virtual interfaces. It is an argument for knowing where they appear in the runtime profile.

A callback registered once at initialization and invoked rarely costs nothing measurable to store in std::function. A filtering predicate called 50 million times per second in a processing loop is a different situation. The design pattern most performance-sensitive C++ libraries use separates these concerns explicitly:

class Pipeline {
public:
    // Cold path: configured once, type erasure is acceptable
    void set_error_handler(std::function<void(std::string_view)> handler);

    // Hot path: template for full inlining
    template <typename F>
    void process(float* data, int n, F transform);
};

simdjson, from Lemire and colleagues, takes the hot-path visibility argument to its conclusion. The library defines #define really_inline __attribute__((always_inline)) inline and annotates essentially every function on the hot parse path with it. The goal is to present the optimizer with a single large inlined function body covering the entire parsing loop, enabling SIMD vectorization throughout. simdjson reaches 2.5 to 3.5 GB/s JSON parsing throughput; parsers built with conventional function decomposition and the same underlying algorithm typically measure around 0.5 GB/s. The difference is not algorithmic. It is compiler visibility over the hot path.

C++26 introduces std::function_ref, a non-owning callable wrapper that avoids heap allocation and ownership semantics, useful for passing callbacks by reference without the cost of constructing a std::function. It is lighter for transient callback passing, but it still uses type erasure internally. Invocation still goes through an indirect call. For passing callbacks to configuration APIs or event handlers, it is a better default than std::function; for hot inner loops, the inlining picture is unchanged.

The Pattern

std::function, virtual dispatch, and non-inlined direct calls are all instances of the same root cause: the optimizer cannot transform what it cannot see. The visible overhead of 5 to 15 nanoseconds per call is only part of the cost. The invisible overhead is the SIMD vectorization, constant folding, and loop optimization that become possible only when the call boundary disappears.

For tight loops that process arrays or perform repeated per-element operations, the choice between a template parameter and a std::function is a performance decision. The template compiles the abstraction away; std::function preserves it at runtime and pays for that preservation on every iteration. Check the compiler’s inlining reports with -fopt-info-inline on GCC or -Rpass=inline on Clang, or inspect the assembly directly, to confirm which path the compiler took. If you see a call instruction inside your innermost loop, the vectorizer saw it too and stopped there.