
How C++'s `inline` Keyword Lost Its Meaning

Source: isocpp

Daniel Lemire’s post on isocpp.org demonstrates something elegant: a two-line add function called from add3 compiles, with optimization, to three instructions instead of two full call-and-return sequences. The compiler saw the function body, decided the call was not worth it, and collapsed everything. What the article leaves implicit is the part worth dwelling on: the inline keyword you have been writing on functions to achieve this effect has nothing to do with whether it happens.
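The example has roughly this shape (reconstructed for illustration, not copied verbatim from the post):

```cpp
int add(int a, int b) {
    return a + b;
}

int add3(int a, int b, int c) {
    // With optimization enabled, the compiler substitutes add's body at
    // both call sites, and the whole function collapses to register
    // additions with no call instructions at all.
    return add(add(a, b), c);
}
```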

A Brief History of a Misleading Keyword

The inline keyword appeared in early C++ as a hint to the compiler: please substitute this function’s body at every call site instead of generating a call. In the early 1990s, compilers were simple enough that the hint was useful. They lacked cost models sophisticated enough to make their own decisions, so programmer annotation filled the gap.

By the late 1990s that had changed. Compiler optimization passes became capable of estimating function size, call frequency, and caller code-growth, and they started making better-informed inlining decisions than programmers could express through annotations. The hint became vestigial.

What the keyword does instead, and what it has done as its primary purpose since the C++98 standard, is grant an exemption from the One Definition Rule. Normally a function can be defined in exactly one translation unit. Define int square(int x) { return x * x; } in a header included by ten .cpp files, and you get a linker error about ten conflicting definitions. Mark the function inline, and the linker understands all ten definitions are identical and merges them. No error.

This is why so many small functions in headers carry the inline specifier. The purpose is preventing ODR violations, not requesting inlining from the compiler. cppreference’s documentation on the inline specifier states this plainly: compilers are free to use inline substitution for any function not marked inline, and are free to generate function calls to any function that is marked inline. The keyword imposes no obligation in either direction.
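In header form, the exemption looks like this (a minimal sketch; the header name is made up):

```cpp
// math_utils.h (hypothetical header, included by many .cpp files)
#pragma once

// Without `inline`, every translation unit that includes this header
// would emit its own definition of square, and the linker would reject
// the duplicates. With `inline`, the linker treats all copies as one.
inline int square(int x) {
    return x * x;
}
```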

What Actually Decides Whether a Function Gets Inlined

Compilers use threshold-based cost models. GCC at -O2 weighs the approximate instruction count of a candidate function against --param max-inline-insns-single and limits how much a caller may grow via --param inline-unit-growth; the exact defaults vary by GCC version (max-inline-insns-single has historically been around 400 weighted instructions, inline-unit-growth around 20-40%). Clang uses a similar inline threshold, defaulting to 225, tunable with -mllvm -inline-threshold=<N>. Both compilers raise these thresholds at -O3 and factor in loop nesting depth and call frequency from profile data.
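Tuning those knobs looks like this (a sketch; hot_loop.cpp is a made-up file name, and since the defaults shift between compiler versions, treat the numbers as starting points to measure against, not recommendations):

```shell
# GCC: raise the per-function size limit and the caller growth budget.
g++ -O2 --param max-inline-insns-single=800 \
        --param inline-unit-growth=40 \
        -c hot_loop.cpp

# Clang: raise the global inline threshold.
clang++ -O2 -mllvm -inline-threshold=500 -c hot_loop.cpp
```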

The critical prerequisite for any of this is visibility: the compiler must be able to see the function body at the call site. If a function is defined in a separate .cpp file compiled as a separate translation unit, the compiler sees only the declaration. It must generate a real function call. Inlining is impossible regardless of keyword annotations.

When you define a function in a header, the body is visible at every call site, and the compiler considers it for inlining based on its own heuristics. The inline keyword enables this arrangement by granting the ODR exemption that allows the definition to exist across multiple translation units without a linker error. But visibility is what drives inlining, not the keyword.

The Vectorization Consequence

This distinction matters because inlining is not itself the optimization. It is the gate to other optimizations, and the most significant one in tight loops is auto-vectorization.

Consider a loop applying a scalar function to a float array:

float bias_and_scale(float x) {
    return x * 2.5f + 1.0f;
}

void transform(float* data, size_t n) {
    for (size_t i = 0; i < n; ++i)
        data[i] = bias_and_scale(data[i]);
}

When bias_and_scale is visible at the call site, the compiler inlines the body, sees the full loop, recognizes the fused multiply-add pattern, and on AVX2-capable hardware emits vfmadd213ps, processing eight floats per instruction. When bias_and_scale lives in a separate object file compiled without LTO, the compiler sees an opaque call inside the loop. It cannot prove the function is free of side effects or pointer aliasing. The auto-vectorizer gives up and emits a scalar loop: one call per element.

The throughput difference is typically 4-8x. The 3-5 cycle overhead of the call instruction itself, documented in Agner Fog’s optimization manuals, is almost irrelevant by comparison. The SIMD width you are leaving on the table is the real cost.

You can observe this directly on Compiler Explorer: define a function in the same file, compile with -O2 -mavx2, and look for packed instructions like vfmadd213ps or vpaddd in the loop. Add __attribute__((noinline)) to the function and recompile. The vectorized instructions disappear and call instructions appear inside the loop body.
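The noinline half of that experiment looks like this (same computation as above, renamed so both versions can sit side by side):

```cpp
#include <cstddef>

// Same math as bias_and_scale, with inlining deliberately blocked.
// Compiled with -O2 -mavx2, the loop below contains one call per
// element instead of packed vfmadd213ps instructions.
__attribute__((noinline))
float bias_and_scale_blocked(float x) {
    return x * 2.5f + 1.0f;
}

void transform_blocked(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = bias_and_scale_blocked(data[i]);
}
```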

Controlling Inlining Explicitly

When profiling identifies a specific hot-path function that the compiler is declining to inline, __attribute__((always_inline)), or [[gnu::always_inline]] in C++ attribute syntax, bypasses the cost model and forces inlining at every call site regardless of function size. MSVC uses __forceinline. These exist for the cases where you have measured the impact and the compiler’s default heuristic is wrong for your workload.

[[gnu::always_inline]] inline float fast_reciprocal(float x) {
    return 1.0f / x;
}

For a kernel that calls several small helpers, GCC’s __attribute__((flatten)) recursively inlines all calls within the annotated function, bypassing per-callee budget limits. A hot loop calling clamp(normalize(quantize(x))) on each element can be collapsed to a single visible body, giving the vectorizer a clean view of the entire computation.
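A sketch of that pattern (the helper bodies are invented for illustration):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical per-element helpers.
static float quantize(float x)  { return static_cast<float>(static_cast<int>(x * 255.0f)) / 255.0f; }
static float normalize(float x) { return x * 0.5f + 0.5f; }
static float clamp01(float x)   { return std::max(0.0f, std::min(1.0f, x)); }

// flatten asks GCC (and Clang) to inline every call inside process,
// bypassing the usual per-callee size budget, so the vectorizer sees
// one straight-line loop body.
__attribute__((flatten))
void process(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = clamp01(normalize(quantize(data[i])));
}
```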

The inverse is equally useful. __attribute__((noinline)) and [[clang::noinline]] prevent inlining, which serves two practical purposes: keeping cold-path code out of the L1 instruction cache to reduce pressure on the hot path, and preserving stack frames for readable profiler output. Firefox and Chrome have both documented cases where deliberately reducing inlining aggressiveness in their hot loops improved throughput because instruction cache pressure from over-inlined code outweighed the savings from avoided calls.
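A sketch of the cold-path use (the function names and the overflow check are invented):

```cpp
#include <cstdio>

// Cold error path: noinline keeps this code out of the caller's
// instruction stream, so the hot loop stays compact in the I-cache
// and the function appears as its own frame in profiler output.
__attribute__((noinline))
static void report_overflow(long long sum) {
    std::fprintf(stderr, "sum exceeded limit: %lld\n", sum);
}

// Hot path: a tight summation loop with a rarely-taken branch.
long long accumulate_checked(const int* data, int n, long long limit) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += data[i];
        if (sum > limit) {        // cold branch
            report_overflow(sum);
            return -1;
        }
    }
    return sum;
}
```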

Cross-File Inlining With LTO

When moving a function to a header is impractical, link-time optimization bridges the translation unit boundary. With -flto, GCC and Clang emit intermediate representation into object files instead of machine code. At link time the backend processes the full program IR and can inline and optimize across files as if they were one.

Clang ThinLTO (-flto=thin) builds lightweight per-module summaries and performs cross-module inlining based on those summaries without loading the entire program into memory at once. It delivers most of full LTO’s runtime benefit at substantially lower link-time cost, which makes it practical for large codebases. On hot cross-module call paths, LTO typically yields 5-15% runtime improvement, sometimes significantly more when the inlined body unlocks vectorization.
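The build steps look like this (a sketch; the file names are made up, and depending on the system linker a ThinLTO link may also need -fuse-ld=lld):

```shell
# Full LTO: each object file carries IR, and the link step inlines and
# optimizes across both translation units as if they were one.
g++ -O2 -flto -c helper.cpp -o helper.o
g++ -O2 -flto -c main.cpp   -o main.o
g++ -O2 -flto helper.o main.o -o app

# Clang ThinLTO: per-module summaries, cross-module inlining at link time.
clang++ -O2 -flto=thin -c helper.cpp main.cpp
clang++ -O2 -flto=thin helper.o main.o -o app
```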

The Practical Upshot

If you have been writing inline on performance-sensitive functions expecting it to prevent function calls, you have been writing ODR bookkeeping, not optimization instructions. The compiler’s cost model decides. Visibility at the call site decides. For functions that must be in headers to be visible, the inline specifier serves its actual purpose of avoiding linker errors. For functions where you have profiled and confirmed the compiler is wrong, always_inline is the real lever. For functions that need to stay in .cpp files but sit on hot cross-module paths, LTO is the appropriate tool.

The Lemire example with add and add3 is the clearest possible demonstration that, given visibility, the compiler will eliminate the call and produce better code than you wrote. Understanding that visibility, not the inline keyword, is what grants the compiler that opportunity changes how you think about header placement, translation unit structure, and build flags.
