
Why C++ Libraries Live in Headers: The Inlining Constraint Behind Modern C++ Design

Source: isocpp

The call barrier that Daniel Lemire describes in his isocpp.org article is a constraint that touches the design of entire libraries. The mechanical cost of a call, a few cycles per iteration, matters in tight loops. The compiler’s inability to optimize across an opaque boundary matters everywhere. Over decades, this constraint has shaped how C++ code is written at scale: why so many libraries are header-only, why templates exist as a performance primitive, why std::function has a reputation for being slower than it looks, and why Rust handles this differently by default.

The std::function Cost

std::function is the standard library’s type-erased callable wrapper. It accepts anything callable: functions, lambdas, member function pointers, objects with operator(). The cost for this generality is indirection. Internally, std::function stores the callable in a small buffer for small callables or on the heap for larger ones, and dispatches through a virtual-table-like mechanism. The compiler sees a call through an opaque pointer and cannot inline the body.

#include <functional>

// Template version: compiler can see through the callable body
template<typename F>
void apply_template(float* out, const float* in, int n, F f) {
    for (int i = 0; i < n; ++i)
        out[i] = f(in[i]);  // inlineable, vectorizable
}

// std::function version: opaque dispatch, no inlining possible
void apply_erased(float* out, const float* in, int n,
                  std::function<float(float)> f) {
    for (int i = 0; i < n; ++i)
        out[i] = f(in[i]);  // one call per element, scalar only
}

The template version lets the compiler see the callable’s body at the call site. A passed lambda compiles as if its body were written directly in the loop, enabling the same inlining and vectorization opportunities. The std::function version calls through a pointer at runtime; the compiler cannot see what that pointer holds, so vectorization cannot happen.

Community benchmarks and notes in Abseil’s callable documentation consistently show std::function call overhead in the range of 5 to 10 nanoseconds per call versus sub-nanosecond for an inlined lambda. For a tight loop over millions of elements where the callable body is trivial, the call overhead swamps the work.

The standard library itself reflects this constraint. std::sort, std::transform, std::for_each, and all the algorithms that accept callbacks take them as template parameters, not as std::function. This is not a stylistic preference; it is a prerequisite for the compiler to emit competitive code. Wherever generality matters more than throughput in an API, std::function is appropriate. Wherever throughput is the constraint, a template parameter is the tool.

Templates as Forced Visibility

C++ templates are generally understood as a generics mechanism. At the compiler level, they are also a forced visibility mechanism. A function template must be fully defined at each point of instantiation, which means its body must appear in a header or in the same translation unit as its callers. That visibility is exactly what enables inlining.

The Eigen linear algebra library is almost entirely header-only and achieves performance competitive with hand-tuned BLAS implementations. The technique is expression templates: rather than computing intermediate results into temporary arrays, Eigen builds a lazy expression tree and evaluates the entire chain in a single loop. A statement like c = a + b * 2.5f generates a single pass over the arrays, not three. This fusion requires the compiler to see every operation in the chain at once. If any piece were compiled into a separate translation unit, the fusion would not be possible.

C++20 ranges use the same approach (the final std::ranges::to step below arrived in C++23, but the view machinery is C++20). A pipeline like:

auto result = input
    | std::views::transform([](float x) { return x * 2.5f + 1.0f; })
    | std::views::filter([](float x) { return x > 0.0f; })
    | std::ranges::to<std::vector>();

looks like three sequential passes. With the full source visible at the call site, the compiler can fuse these into a single loop with the lambda bodies inlined. The result is code that reads like a comprehension and compiles like a hand-written loop. If transform and filter took their callables as std::function rather than template parameters, this fusion would not be possible and the three-pass cost would be literal.

The header-only, template-heavy nature of high-performance C++ libraries is a direct response to the visibility constraint. Libraries like range-v3, {fmt}, and Abseil’s hot-path internals follow this pattern not by convention but because the alternative forfeits the optimizations that make them worth using.

CRTP: Polymorphism Without the Barrier

For cases where virtual dispatch would introduce the opaque indirect call, the Curiously Recurring Template Pattern provides static polymorphism. The pattern threads the derived type through the base class as a template parameter, letting the compiler resolve method calls to a concrete type at compile time.

// Virtual dispatch: opaque indirect call per element
struct EffectBase {
    virtual float process(float x) const = 0;
    void apply(float* out, const float* in, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = process(in[i]);  // indirect call, no vectorization
    }
};

// CRTP: static dispatch, compiler sees process_impl body
template<typename Derived>
struct EffectCRTP {
    void apply(float* out, const float* in, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = static_cast<const Derived*>(this)->process_impl(in[i]);
    }
};

struct Gain : EffectCRTP<Gain> {
    float factor;
    float process_impl(float x) const { return x * factor; }
};

The static_cast in the CRTP version resolves to a known type at compile time. The compiler can inline process_impl into apply, and the loop over n elements becomes a vectorization candidate. The virtual version cannot achieve this without Profile-Guided Optimization or the final keyword.

final, introduced in C++11, provides a lighter-weight alternative for concrete leaf classes. Marking a class or method final tells the compiler that no derived type will override it, which is enough information to devirtualize calls through pointers or references of that concrete type without requiring template metaprogramming.

How Rust Approaches This

Rust’s default behavior reflects a different set of trade-offs around the same constraint. Within a single crate, the compiler can inline freely because the crate is compiled as one unit, similar in effect to running LTO over that crate. Across crate boundaries, the equivalent of the C++ translation-unit problem applies: the compiler sees only a compiled artifact unless #[inline] is present on the function. (Generic functions are the exception: like C++ templates, they are instantiated in the calling crate, so their bodies are always visible.)

// Without #[inline]: body is not available to callers in other crates
pub fn scale(x: f32) -> f32 {
    x * 2.5 + 1.0
}

// With #[inline]: body is included in the compiled crate for callers to use
#[inline]
pub fn scale_inlineable(x: f32) -> f32 {
    x * 2.5 + 1.0
}

// Forces inlining regardless of size
#[inline(always)]
pub fn scale_always(x: f32) -> f32 {
    x * 2.5 + 1.0
}

Rust also supports LTO through Cargo profiles. lto = true in [profile.release] enables full cross-crate LTO; lto = "thin" uses LLVM’s ThinLTO for faster link steps while preserving most of the cross-crate inlining benefit.
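
In Cargo terms, the two settings the paragraph describes look like this (a minimal fragment; other profile keys are unaffected):

```toml
# Cargo.toml
[profile.release]
lto = "thin"   # LLVM ThinLTO: most of the cross-crate inlining, faster links
# lto = true   # full cross-crate LTO: maximum visibility, slowest links
```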

The practical difference from C++: Rust makes inlining opt-in at crate boundaries but defaults to more aggressive whole-crate optimization than C++ achieves without LTO. C++ requires either header-only definitions or explicit LTO to reach the same cross-module visibility. Rust libraries that expose hot inner-loop functions, like rayon or packed_simd, annotate those functions with #[inline] for the same reason Eigen puts everything in headers.

The Design Tension

The standard advice for readable, maintainable code is to decompose logic into small functions with clear responsibilities. The inlining constraint pulls in the opposite direction: every function boundary in a hot inner loop is a potential optimization gate. These goals conflict directly in numerical and data processing code.

The resolution is that these contexts rarely overlap in the same codebase. Application logic handling Discord bot commands, HTTP routing, and database queries is not exercising this trade-off at a meaningful scale. The visibility constraint becomes material only in code that processes large arrays in tight loops: numerical computation, audio and video processing, physics simulation, data serialization at high throughput.

For that category, the practical habits follow from the constraint. Hot, small functions called in inner loops belong in headers or in the same translation unit as their callers. LTO, specifically ThinLTO for non-trivial codebases, turns the entire build into a single optimization domain. std::function belongs in APIs where generality matters more than throughput; template parameters belong where throughput is the priority. And the assembly output, checked through Compiler Explorer before and after a change, tells you whether the compiler actually did what you expected.

Lemire’s article focuses on the call itself. The design implications extend further: the header-only library, the template callback, the CRTP base class, and the Rust #[inline] attribute are all responses to the same underlying constraint. Understanding that constraint makes these patterns look like natural consequences rather than arbitrary conventions.
