When Inlining Stops Being a Hint and Starts Being a Prerequisite

The standard mental model for function call cost goes like this: calling a function places arguments in registers or on the stack, transfers control, sets up a frame, does work, tears down the frame, and returns. On modern x86-64, a direct call costs maybe 3-5 cycles. That sounds negligible, and for most call sites it is. In a tight numeric loop processing millions of elements, though, those cycles are almost never where the performance goes.

Daniel Lemire’s recent writeup makes the case concisely: function calls are cheap but not free, and their real cost in tight loops often has nothing to do with the call mechanics. The cost is what the call prevents your compiler from doing.

What the `inline` Keyword Actually Means

Most C++ programmers learn that inline tells the compiler to substitute the function body at the call site, avoiding call overhead. That was the original intent when the keyword was introduced in C++ (C only adopted it later, in C99; C89 has no inline). In modern compilers, the keyword is primarily an ODR (One Definition Rule) relaxation mechanism, not an inlining command.

The C++ standard says the inline specifier indicates that inline substitution is to be preferred, but that an implementation is not required to perform this inline substitution. What inline actually guarantees: the function can be defined identically in multiple translation units without violating the ODR, and the linker will merge those definitions. It does not guarantee the compiler will substitute the body at call sites. GCC and Clang regularly decline to inline functions marked inline when they are too large or called indirectly, and routinely inline functions not marked inline when they are visible, small, and called in a hot path.

When you need an actual guarantee rather than a hint, you want __attribute__((always_inline)) in GCC/Clang:

__attribute__((always_inline)) inline float scale(float x) {
    return x * 2.0f;
}

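If GCC cannot honor always_inline on a direct call, for example because the function is recursive or its definition is not visible at the call site, it reports an error rather than silently falling back to a normal call. Indirect calls through a function pointer are the exception: GCC documents that these may or may not be inlined, and that a failure may or may not be diagnosed. Clang tends to be more lenient and can skip the inlining without an error. Clang also accepts [[clang::always_inline]] in the C++11 attribute syntax. MSVC offers __forceinline, which is a stronger request rather than a guarantee: the compiler can still decline and issue warning C4714.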
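Because the spelling differs per compiler, projects often hide it behind a macro. The following is a minimal sketch of that idea, not a standard facility: the macro name FORCE_INLINE and the functions scale_forced and scale_all are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// FORCE_INLINE is a hypothetical macro name; no compiler defines it for you.
#if defined(_MSC_VER)
  #define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define FORCE_INLINE inline __attribute__((always_inline))
#else
  #define FORCE_INLINE inline  // unknown compiler: fall back to the plain hint
#endif

// Small hot-path helper that we want folded into every caller.
FORCE_INLINE float scale_forced(float x) {
    return x * 2.0f;
}

// Applies the helper to n elements of in, writing the results to out.
void scale_all(const float* in, float* out, int n) {
    assert(in != nullptr && out != nullptr && n >= 0);
    for (int i = 0; i < n; ++i)
        out[i] = scale_forced(in[i]);
}
```

Calling scale_all(in, out, 4) on {1, 2, 3, 4} fills out with {2, 4, 6, 8}. Whether the loop also vectorizes still depends on the optimization level and target flags.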

The shift in what inline means is not just trivia. It explains why naive header-based C++ libraries with inline everywhere do not always produce maximally fast code: the compiler accepted the ODR relaxation but chose not to inline based on its own heuristics.

The Optimization Barrier

A non-inlined call to a function whose definition the compiler cannot see is, from the optimizer’s perspective, an opaque operation with possible side effects. Unless the function is declared noexcept, it might throw. It might also read or write any memory it can reach: globals, anything reachable through its pointer or reference arguments, and any object whose address has escaped elsewhere. The optimizer must conservatively respect all of these possibilities.

Alias analysis breaks down across call boundaries. If a value loaded from memory the callee could reach (a global, or an object whose address has escaped) is held in a register before a call, the compiler must assume the callee might have modified that memory, so it reloads the value after the call returns instead of reusing the register copy. Locals whose address never escapes are exempt from this.
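A small sketch of why the reload is required for correctness. The names counter, opaque_bump, and read_twice are hypothetical, and the GCC/Clang noinline attribute is only a stand-in for "defined in another translation unit": within one file the compiler can still analyze the body, so this illustrates the semantics rather than reproducing the exact code generation.

```cpp
#include <cassert>

int counter = 0;  // global: reachable from any function in the program

// Stand-in for a function the caller cannot see into (GCC/Clang attribute).
__attribute__((noinline)) void opaque_bump() {
    counter += 1;
}

// Reads counter, makes the call, then reads counter again.
int read_twice() {
    const int before = counter;  // first load from memory
    opaque_bump();               // may have written to counter
    const int after = counter;   // must be a fresh load, not the old register value
    assert(after == before + 1); // reusing 'before' here would be a miscompile
    return after - before;
}
```

If the compiler reused the register holding before for the second read, read_twice would return 0 instead of 1. With a truly opaque callee it has no way to rule the write out, so it has to reload.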

Constant propagation stops at call boundaries. Calling compute(42) with compute defined in another translation unit emits a real call with 42 in a register. With compute inlined, the compiler may fold the entire result to a constant at compile time, eliminating the computation entirely.
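To make the folding concrete, here is a sketch with a visible definition. The body of compute is invented for illustration (the text above does not define it), and constexpr is used only so that the fold can be checked with static_assert; a plain inline function would typically be folded the same way at -O1 and above. The loop inside a constexpr function assumes C++14 or later.

```cpp
#include <cassert>

// Hypothetical body: sum of 1..x. Visible to every caller in this file.
constexpr int compute(int x) {
    int total = 0;
    for (int i = 1; i <= x; ++i)
        total += i;
    return total;
}

// 42 * 43 / 2 == 903, evaluated by the compiler with no runtime call.
static_assert(compute(42) == 903, "compute(42) folds to a constant");

int answer() {
    const int result = compute(42);  // optimizers typically reduce this to a constant
    assert(result == 903);           // runtime mirror of the static_assert above
    return result;
}
```

Had compute been declared here and defined in another translation unit, neither the static_assert nor the fold would be possible without LTO.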

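Loop-invariant code motion is blocked when calls appear to have side effects. The compiler cannot hoist a call out of a loop unless it can prove the call has no side effects and depends only on loop-invariant inputs. Short of an explicit annotation such as __attribute__((pure)) or __attribute__((const)) in GCC and Clang, it can usually prove that only after inlining reveals the body.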
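The sketch below shows the transformation in source form. The helper gain and both loop functions are hypothetical; apply_gain_hoisted is what the optimizer is free to produce from apply_gain once it can see that gain has no side effects.

```cpp
#include <cassert>

// Hypothetical helper: the result depends only on the argument.
inline float gain(float level) {
    return level * 0.5f + 1.0f;
}

// gain(level) does not depend on i, so it is loop-invariant.
void apply_gain(const float* in, float* out, int n, float level) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * gain(level);
}

// The same loop with the invariant call hoisted by hand.
void apply_gain_hoisted(const float* in, float* out, int n, float level) {
    assert(n >= 0);
    const float g = gain(level);  // computed once, outside the loop
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * g;
}
```

When the callee is opaque and unannotated, writing the hoisted form by hand is the portable way to get the same effect.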

These matter, but none of them are the dominant factor in numeric or data-processing workloads.

Vectorization Is Where It Hurts

Modern CPUs offer wide data-level parallelism through SIMD. AVX2 can process 8 floats or 8 32-bit integers per SIMD instruction; AVX-512 doubles that. Clang auto-vectorizes loops at -O2 and above; GCC has done so at -O2 only since GCC 12 (with a conservative cost model) and at -O3 before that. Either way, the conversion from scalar loops to SIMD loops happens only when the compiler can see the entire loop body.

A function call in the loop body is a vectorization killer. Consider:

float external_scale(float x);  // defined in another translation unit

void process_slow(float* a, float* b, int n) {
    for (int i = 0; i < n; i++)
        b[i] = external_scale(a[i]);  // one scalar call per iteration
}

inline float scale(float x) { return x * 2.0f; }

void process_fast(float* __restrict__ a, float* __restrict__ b, int n) {
    for (int i = 0; i < n; i++)
        b[i] = scale(a[i]);  // inlined → vmulps ymm0 → 8 floats per instruction
}

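With -O3 -march=native on an AVX2 machine, process_fast emits something like the following (the exact instructions vary; compilers often rewrite x * 2.0f as x + x and emit vaddps instead of vmulps):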

.loop:
    vmulps  ymm0, ymm1, YMMWORD PTR [rsi+rax]
    vmovups YMMWORD PTR [rdi+rax], ymm0
    add     rax, 32
    cmp     rax, rdx
    jne     .loop

The slow version emits one call to external_scale per loop iteration. For a compute-bound loop the throughput difference can approach the vector width: roughly 8x on AVX2 and up to 16x on AVX-512, with memory bandwidth capping the gain once the arrays no longer fit in cache. Very little of that gap comes from saving the 3-5 cycle call overhead. It comes almost entirely from what the compiler can prove about the loop once the function body is visible.

This is the same mechanism behind std::sort’s well-known performance advantage over C’s qsort. The qsort comparator is a function pointer resolved at runtime; the compiler generally cannot inline it, so it emits an indirect call per comparison. The std::sort comparator is a template parameter resolved at compile time; the compiler inlines it, sees the full comparison logic, and can apply branch elimination and register-level optimizations across comparisons. On large inputs, std::sort with a lambda is commonly measured at around 2-3x faster than qsort on the same data, a difference that has persisted across decades of hardware improvements.
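The two call styles look like this side by side. This is a minimal sketch: compare_ints, sort_with_qsort, and sort_with_std_sort are hypothetical names, and the block shows only the structural difference, not a benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// qsort comparator: reached through a function pointer on every comparison.
int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow risk of returning x - y
}

std::vector<int> sort_with_qsort(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
    assert(std::is_sorted(v.begin(), v.end()));
    return v;
}

std::vector<int> sort_with_std_sort(std::vector<int> v) {
    // The lambda's type is a template argument, so its body is visible
    // to the compiler inside the std::sort instantiation.
    std::sort(v.begin(), v.end(), [](int x, int y) { return x < y; });
    assert(std::is_sorted(v.begin(), v.end()));
    return v;
}
```

Both functions return the same sorted vector. The difference is only in what the optimizer can see while compiling the sorting loop.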

Register Spilling Compounds the Problem

There is a secondary cost to non-inlined calls that compounds in loops: caller-saved register spilling. Under the x86-64 System V ABI, the general-purpose registers rax, rcx, rdx, rsi, rdi, r8, r9, r10, and r11 are caller-saved, and so are all of the vector registers (xmm0-xmm15 and their ymm/zmm extensions), which is where floating-point loop state lives. Any value that must survive a call has to sit in a callee-saved register or be stored to the stack before the call and reloaded afterward. In a loop body with multiple live variables, this generates extra store and load pairs around every call, adding instructions and load/store traffic to each iteration.

Inlining eliminates this cost entirely because the compiler performs register allocation across the combined function body and only spills what it genuinely cannot keep live across iterations.

Cross-Translation-Unit Inlining via LTO

Source-level inlining only works when the function definition is visible in the same translation unit. For large codebases organized across many TUs, that is a real constraint, and putting everything in headers is not always practical.

Link-Time Optimization addresses this. With -flto, the compiler writes its intermediate representation into the object files (GIMPLE for GCC, LLVM bitcode for Clang) instead of final machine code, and the link step runs an optimization pass over the merged IR before emitting the final binary. This enables inlining across TU boundaries at link time without requiring source changes.

ThinLTO (-flto=thin, a Clang/LLVM feature) is the scalable version developed at Google: rather than a single monolithic optimization pass over all merged IR, each module is optimized in parallel using function summaries exported from other modules. Chrome reports 3-9% speedup from ThinLTO; LLVM itself gains around 5%. For large codebases where annotating every hot function in headers is impractical, ThinLTO recovers most of the cross-TU inlining benefit automatically and in parallel.

The one limitation neither Full LTO nor ThinLTO can overcome is inlining across dynamic library boundaries. If external_scale lives in a .so or .dll, no amount of LTO lets the compiler see its body. This is one reason performance-critical C++ libraries, from Abseil to Eigen, ship their hot paths in headers rather than compiled translation units.

Measuring What Inlining Is Actually Buying You

The compiler’s inlining heuristics at -O2 are deliberately conservative to keep compile times reasonable and prevent code-size blowup. For functions you know sit in hot loops, __attribute__((always_inline)) provides the guarantee that inline only hints at.

For existing codebases where you cannot easily annotate every hot path, -O3 -march=native -flto=thin is often the most effective combination with Clang; with GCC, plain -flto plays the same role. It raises the compiler’s inlining threshold, enables target-specific SIMD, and recovers cross-TU inlining at link time.

When you want to measure the actual contribution of inlining to performance, -fno-inline disables inlining across a build, except for functions marked always_inline. The difference between your benchmark with and without that flag tells you what inlining is contributing in total, and that gap in numeric loops is almost always dominated by vectorization loss, not the cycles spent on call and return mechanics.
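For a single function, a narrower experiment than a build-wide flag is to time the same loop against a copy of the callee that is barred from inlining. The harness below is a rough sketch under that assumption: NOINLINE, scale_visible, scale_call, run_visible, run_call, and time_ms are all hypothetical names, and a single wall-clock sample like this is noisy, so treat any numbers as indicative only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// NOINLINE is a hypothetical macro wrapping the compiler-specific spelling.
#if defined(_MSC_VER)
  #define NOINLINE __declspec(noinline)
#else
  #define NOINLINE __attribute__((noinline))
#endif

inline float scale_visible(float x) { return x * 2.0f; }  // free to be inlined
NOINLINE float scale_call(float x) { return x * 2.0f; }   // stays a real call

void run_visible(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = scale_visible(in[i]);
}

void run_call(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = scale_call(in[i]);
}

// Returns the elapsed wall-clock time of one invocation of f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A typical use is to fill a large input vector, call time_ms([&] { run_visible(in, out); }) and time_ms([&] { run_call(in, out); }) several times each, and compare the medians. Inspecting the generated assembly for both loops shows whether the difference is explained by vectorization.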

Lemire’s point cuts to the core of how to think about this: function call cost in modern systems is primarily a compiler visibility problem, not a CPU microarchitecture problem. The CPU handles a direct call in a handful of cycles. The compiler’s inability to see across the call boundary can cost an order of magnitude in throughput. The cycles from the call instruction are a rounding error; the SIMD instructions that never got emitted are where the time went.