Type Erasure at the Wrong Layer: What std::function Does to Your Tight Loops
Source: isocpp
Most discussions of function call overhead in C++ focus on direct calls: the CALL/RET pair, register saves, stack frame setup. Daniel Lemire’s analysis on isocpp.org covers that ground well, showing that even trivial function calls can dominate tight loops when inlining fails. What gets less attention is how std::function makes the problem substantially worse, and why the idiomatic C++ solution for “pass any callable” is a poor choice when that callable gets invoked in a hot path.
What std::function Actually Is
std::function<R(Args...)> is a type-erased callable wrapper. It can hold any callable that matches a given signature: lambdas, function pointers, member function pointers, or arbitrary objects with operator(). Implementations store small callables in an internal buffer (the small-object optimization) and larger ones on the heap, and dispatch every call through a function pointer stored alongside the object.
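That flexibility is easy to demonstrate. A minimal sketch (the names half, Scaler, and make_op are illustrative, not from any library):

```cpp
#include <functional>

// Three different callable kinds, all assignable to one std::function type.
float half(float x) { return x * 0.5f; }          // free function

struct Scaler {                                    // stateful function object
    float k;
    float operator()(float x) const { return x * k; }
};

std::function<float(float)> make_op() {
    std::function<float(float)> f = half;          // function pointer
    f = Scaler{3.0f};                              // function object
    f = [](float x) { return x + 1.0f; };          // lambda
    return f;                                      // last assignment wins
}
```

Each of these assignments routes every subsequent call through the same type-erased dispatch path described next.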
The dispatch path is the performance story. Every invocation of a std::function goes through an indirect call: load the internal function pointer, call through it. The indirect call is opaque to the compiler. Even if the std::function holds a lambda with a body of two instructions, the compiler cannot see through the indirection at the call site and inline the body. The optimizer treats it as an external function with unknown behavior, which is exactly the category of call that Lemire’s article describes as dominating tight loop runtime.
The consequence is identical to any other non-inlined call from a loop: auto-vectorization fails, alias analysis assumptions reset, constant propagation stops at the boundary.
Here is the concrete pattern that trips developers up:
#include <functional>
#include <vector>

// This looks like a generic, reusable transform. It is also an optimization wall.
void transform(std::vector<float>& v, std::function<float(float)> fn) {
    for (float& x : v) x = fn(x);
}
// The call site looks innocent:
transform(data, [](float x) { return x * x; });
With -O3, GCC and Clang will not vectorize the loop in transform. The fn(x) call goes through std::function’s internal dispatch, which the optimizer treats as an indirect call to unknown code. Inspect the assembly and you find a scalar loop with an indirect call per element, regardless of how simple the actual lambda body is. Compiler Explorer makes this immediately verifiable: paste the code, add -O3 -mavx2, and look for call inside the loop body.
Measuring the Cost
The overhead has two components. First, the indirect call itself: loading the function pointer from the std::function object and jumping through it adds roughly 5 to 15 cycles beyond a direct call, depending on branch predictor state. A warm, monomorphic call site with a stable function pointer is toward the lower end of that range; a cold or polymorphic site is toward the upper end.
Second, and more important, the blocked SIMD vectorization leaves the loop running at around 0.3 to 0.5 nanoseconds per element for scalar arithmetic instead of 0.04 to 0.12 nanoseconds per element for AVX2-vectorized code. For a loop processing one million floats, that difference is roughly 400 microseconds on the slow path versus under 100 microseconds with vectorization, a 4 to 8x gap depending on the specific operation.
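The gap is straightforward to reproduce. Below is a crude timing harness (a sketch: the function names, the element count, and the returned pair are arbitrary choices made here, and absolute numbers vary by machine and flags):

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Type-erased: fn is an opaque indirect call inside the loop.
void transform_erased(std::vector<float>& v, std::function<float(float)> fn) {
    for (float& x : v) x = fn(x);
}

// Templated: fn's body is visible, so the loop can inline and vectorize.
template <typename Fn>
void transform_inlined(std::vector<float>& v, Fn fn) {
    for (float& x : v) x = fn(x);
}

// Time a single call to f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Run both versions over n elements; returns {erased_ms, inlined_ms}.
std::pair<double, double> benchmark(std::size_t n) {
    std::vector<float> a(n, 1.5f), b = a;
    auto square = [](float x) { return x * x; };
    double erased  = time_ms([&] { transform_erased(a, square); });
    double inlined = time_ms([&] { transform_inlined(b, square); });
    return {erased, inlined};
}
```

Compile both at -O3 and the erased version's per-element cost is the scalar-plus-indirect-call figure above; the templated version approaches the vectorized figure.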
Under Spectre mitigations, the indirect call overhead grows further. With retpoline active, which has been the default on patched Linux kernels since early 2018, indirect calls go through a trampoline sequence that prevents speculative execution of arbitrary code. Phoronix measured retpoline overhead at 10 to 30% on workloads with frequent indirect calls, and on some benchmarks substantially higher. A loop calling through std::function per element on a Spectre-mitigated system pays that penalty on every iteration.
The C++ Core Guidelines note that std::function “can impose a non-trivial overhead for each invocation” and recommend template parameters where performance matters. The guidance exists; it does not always reach developers who learned to use std::function as the default callback type.
The Fix: Template Parameters
Making the callable type a template parameter gives the compiler full visibility into the callable’s body and eliminates the dispatch overhead entirely:
// The optimizer sees through the callable's type and can inline its body.
template<typename Fn>
void transform(std::vector<float>& v, Fn fn) {
    for (float& x : v) x = fn(x);
}
// Same call site, radically different assembly:
transform(data, [](float x) { return x * x; });
With this change, -O3 -mavx2 causes the loop to auto-vectorize into vmulps instructions operating on eight floats per iteration. The loop body changes from a scalar indirect call per element to a vector multiply. The difference is immediately visible on Compiler Explorer by switching between the two signatures.
C++20 adds a cleaner syntax for constraining templated callables without the ceremony of a separate template parameter:
void transform(std::vector<float>& v, std::invocable<float> auto fn) {
    for (float& x : v) x = fn(x);
}
This reads clearly as a constrained template, and the compiler treats it identically to the explicit template parameter version for optimization purposes. The callable’s type is known at every call site and the function is instantiated concretely.
The trade-off is compile time and binary size when transform is called with many different callable types. For most code, this is acceptable. For a library where exposing the implementation is not feasible and header-only design is undesirable, LTO becomes the alternative path to giving the compiler visibility across translation unit boundaries.
When You Actually Need Type Erasure
There are genuine cases where std::function is the right tool. Storing callbacks registered at runtime, managing heterogeneous collections of callables, and plugin systems where callable types are not known at compile time all require type erasure. The mistake is using std::function for every callback situation, including the ones where the callable type is fixed at compile time and invocation happens in a hot path.
For cases that require type erasure but not ownership, std::function_ref (adopted for C++26 via P0792) provides a non-owning type-erased reference to a callable. It is smaller than std::function, never allocates, and is cheap to copy, but it does not keep the referenced callable alive. For C++23 and earlier, the TartanLlama function_ref library and the function2 library implement the same concept with additional capabilities. None of these eliminate the indirect call; they reduce object overhead and simplify lifetime semantics without changing the fundamental dispatch mechanism.
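The non-owning idea is easy to sketch. The following is a simplified illustration, not the P0792 interface; the name func_ref is deliberately made up here so it is not mistaken for the standard type:

```cpp
#include <memory>
#include <type_traits>
#include <utility>

// Minimal non-owning callable reference (illustrative sketch only).
template <typename Signature>
class func_ref;

template <typename R, typename... Args>
class func_ref<R(Args...)> {
    void* obj_;                    // points at the caller-owned callable
    R (*thunk_)(void*, Args...);   // type-erased trampoline
public:
    // The caller must keep the callable alive for the lifetime of func_ref.
    template <typename F>
    func_ref(F&& f)
        : obj_((void*)std::addressof(f)),
          thunk_([](void* o, Args... args) -> R {
              return (*static_cast<std::remove_reference_t<F>*>(o))(
                  std::forward<Args>(args)...);
          }) {}

    R operator()(Args... args) const {
        // Still an indirect call: the win is no ownership or allocation,
        // not a cheaper dispatch.
        return thunk_(obj_, std::forward<Args>(args)...);
    }
};
```

Two pointers, no buffer, no allocation; the dispatch mechanism is the same indirect call std::function makes.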
When the callable set is small and closed at compile time, std::variant with std::visit avoids the indirect call for the innermost dispatch:
#include <variant>
#include <vector>

// Example alternatives; each is a concrete, inlinable callable.
struct Multiply { float k; float operator()(float x) const { return x * k; } };
struct Add { float k; float operator()(float x) const { return x + k; } };
struct Clamp { float lo, hi; float operator()(float x) const { return x < lo ? lo : x > hi ? hi : x; } };

using Op = std::variant<Multiply, Add, Clamp>;

void transform(std::vector<float>& v, Op op) {
    std::visit([&](auto& fn) {
        for (float& x : v) x = fn(x);
    }, op);
}
std::visit generates one instantiation of the lambda per variant alternative. The dispatch uses a jump table over a small closed set, and inside each branch the compiler has a concrete callable type it can inline, so each alternative gets its own fully vectorizable loop. This adds code size proportional to the number of alternatives but keeps the inner loop fully visible to the optimizer.
Finding the Problem Before Fixing It
Before adding template parameters or switching to a different callable type, it is worth confirming that inlining failure is actually the bottleneck. Clang’s -Rpass-missed=loop-vectorize emits per-loop diagnostics with a specific reason for each vectorization failure. A loop containing a std::function invocation typically reports something like:
foo.cpp:7:5: remark: loop not vectorized: call instruction cannot be vectorized
GCC’s equivalent is -fopt-info-vec-missed. Both flags attach to specific source lines, so identifying which loop failed and why is straightforward. Running these on a bottleneck before reaching for __attribute__((always_inline)) or restructuring the code often clarifies that the problem is exactly what it looks like: an opaque call inside the hot path.
The underlying mechanism is the same one Lemire’s article describes for direct function calls. Any construct that creates an indirect call the compiler cannot reason about becomes an optimization boundary at exactly the wrong place. In C++, std::function is the most common example in production code, but function pointers stored in structs, virtual methods in polymorphic base classes, and type-erased interfaces all create the same obstacle for the same reasons. The diagnostic tools identify them uniformly; the fix depends on whether the type erasure is genuinely necessary or is incidental to the design.