
Inlining Is Not About Removing the Call

Source: isocpp

The surface metric is seductive. A direct function call on modern x86-64 hardware costs roughly 3 to 5 cycles: the call instruction pushes a return address onto the stack and jumps to the target, and ret pops that address and jumps back. At 3 GHz that is roughly 1 to 2 nanoseconds. For most code, this is irrelevant noise.

But in tight loops, the call itself is the least of your problems. The real overhead is what the call prevents the compiler from doing. Daniel Lemire’s recent post on isocpp.org illustrates this with a simple add/add3 pair, and the conclusion generalizes much further than the example suggests.

What the call instruction actually costs

On x86-64, the System V ABI designates rax, rcx, rdx, rsi, rdi, r8, r9, r10, and r11 as caller-saved registers. A callee is free to overwrite them, so any value the caller still needs after the call must either live in a callee-saved register (rbx, rbp, r12 through r15) or be spilled to the stack before the call and reloaded after. Each spill-reload pair costs a store and a load. The reload usually hits L1, but if it misses, you add 4 to 40 extra cycles depending on which cache level you hit. A loop with four live variables in caller-saved registers can accumulate 15 to 25 cycles of spilling overhead per iteration before the call body even executes.

Agner Fog’s optimization manuals put the raw call/ret throughput at 3 to 5 cycles on Skylake and Zen 3 when the CPU’s return stack buffer is warm and the instruction cache holds the target. Cold instruction cache misses add 50 to 200 cycles on top of that. For deeply recursive code or code that calls into many different functions, icache pressure alone can dominate runtime.

None of this is the main story.

Vectorization: where the real cost lives

Modern compilers auto-vectorize loops using SIMD instructions. An AVX2 loop over a float array processes eight elements per vector instruction; an SSE2 loop processes four. A well-vectorized loop over a million floats completes in roughly 0.3 to 0.5 milliseconds. The same loop, scalar, takes 2 to 4 milliseconds. That is a factor of roughly six to ten from a single optimization decision, and inlining is often the gate that decision is locked behind.

The auto-vectorizer needs to see the entire loop body as a single unit. When a loop iteration calls a function that is not inlined, the compiler treats that call as an opaque memory barrier. It cannot assume the function has no side effects. It cannot analyze the callee’s memory accesses for aliasing. It cannot fuse the callee’s arithmetic with surrounding loop operations into a SIMD kernel. The loop stays scalar.

// scale() is defined in a separate translation unit (as: return x * 2.5f;),
// so only this declaration is visible here and the compiler cannot see its body
float scale(float x);

void process(float* arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] = scale(arr[i]);  // scalar loop: call blocks vectorization
}

With scale visible at the call site, either by placing it in the same translation unit or via LTO, the vectorizer sees:

void process(float* arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] = arr[i] * 2.5f;  // vmulps ymm0, ymm0, ymm1 — 8 floats per iteration
}

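GCC’s -fopt-info-vec-missed flag reports why a loop was not vectorized, and in this situation it typically points at the function call in the loop body (plain -fopt-info-vec lists only the loops that were vectorized). Clang produces equivalent diagnostics via -Rpass-missed=loop-vectorize, with further detail available from -Rpass-analysis=loop-vectorize. These are underused flags. Running them on a hot path before worrying about anything else is a good habit.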
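The contrast can be reproduced in a single file. The following is a minimal sketch, not Lemire's benchmark: the names scale_opaque, scale_visible, process_opaque, process_visible, and same_results are hypothetical, and the GCC/Clang noinline attribute is used as a stand-in for a function whose body lives in another translation unit. Compiling it with something like g++ -O3 -fopt-info-vec-missed or clang++ -O3 -Rpass-missed=loop-vectorize should show which loop the vectorizer gives up on, though the exact wording of the report depends on the compiler version.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a function in another translation unit: noinline keeps the
// call in the loop, much as an invisible body would.
// (GCC/Clang syntax; MSVC spells this __declspec(noinline).)
__attribute__((noinline)) float scale_opaque(float x) { return x * 2.5f; }

// Same arithmetic, but the body is visible and inlinable at the call site.
inline float scale_visible(float x) { return x * 2.5f; }

void process_opaque(float* arr, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        arr[i] = scale_opaque(arr[i]);   // one real call per element
}

void process_visible(float* arr, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        arr[i] = scale_visible(arr[i]);  // candidate for a SIMD multiply
}

// Both loops compute the same values; only the generated code differs.
bool same_results(std::size_t n) {
    std::vector<float> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] = static_cast<float>(i) * 0.5f;
    process_opaque(a.data(), n);
    process_visible(b.data(), n);
    return a == b;
}
```

Timing the two process functions on a large array, or comparing their disassembly, is the quickest way to see whether the gap applies on a given compiler and target.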

What inline actually does in C++

Most developers assume inline is a hint that tells the compiler to inline the function. The C++ standard says otherwise.

inline relaxes the One Definition Rule. It permits a function to be defined in multiple translation units without causing a linker error, as long as all definitions are identical. This is why small utility functions in header files are marked inline: without it, including the header in multiple .cpp files produces duplicate symbol errors at link time.

Whether the compiler actually inlines a specific call is determined by its cost model. GCC’s inlining documentation describes the relevant parameters, such as max-inline-insns-single and inline-min-speedup, whose defaults have changed across GCC releases (the figures commonly quoted are roughly 400 estimated instructions and 10 percent). Clang uses an inline threshold of 225 at -O2, raised to 250 at -O3. The inline keyword is at most a mild nudge to these heuristics: both compilers decline to inline a function marked inline when the cost model decides it is too large. Conversely, both freely inline functions that are not marked inline, provided the body is visible and fits within threshold.

The actual forced-inline mechanisms are __attribute__((always_inline)) on GCC and Clang, and __forceinline on MSVC. These bypass the cost model, although not unconditionally: a recursive call cannot be fully expanded, and MSVC documents __forceinline as a strong request rather than a guarantee. Use them sparingly: forcing a large function inline everywhere increases code size, pollutes the instruction cache, and can hurt performance for reasons that are harder to diagnose than the original call overhead.
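Codebases that target several compilers usually hide these spellings behind a macro. Here is a minimal sketch of that pattern; the macro name FORCE_INLINE and the helpers rotl32 and mix are illustrative assumptions, not something from the original post.

```cpp
#include <cstdint>

// Hypothetical portability macro: picks the forced-inline spelling
// for the compiler in use, falling back to plain inline elsewhere.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// A small, hot helper: the kind of function where forcing is defensible.
FORCE_INLINE std::uint32_t rotl32(std::uint32_t v, unsigned r) {
    r &= 31u;                                    // keep the shift count in range
    return (v << r) | (v >> ((32u - r) & 31u));  // rotate left by r bits
}

// Caller: with rotl32 expanded in place, the rotate can be combined
// with the surrounding arithmetic.
std::uint32_t mix(std::uint32_t h, std::uint32_t k) {
    return rotl32(h ^ k, 13) * 5u + 0xe6546b64u;
}
```

Keeping the macro on genuinely tiny functions like this one limits the code-size risk described above.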

The C++17 extension of inline to variables, enabling inline static data members, further illustrates that the keyword is fundamentally about linkage semantics rather than call-site expansion.
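To make the linkage role concrete, here is a sketch of what a shared header might contain. The names clamp_percent, Registry, and g_call_count are hypothetical, and the inline variables assume a C++17 compiler.

```cpp
#include <cassert>

// Imagine this block is a header included from many .cpp files.

// inline function: each translation unit may contain this definition,
// and the linker keeps a single copy instead of reporting duplicates.
inline int clamp_percent(int v, int lo = 0, int hi = 100) {
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

struct Registry {
    // C++17 inline static data member: defined here in the header,
    // with no separate out-of-class definition in a .cpp file.
    inline static int next_id = 0;

    static int allocate() { return next_id++; }
};

// C++17 inline variable at namespace scope: one object shared program-wide.
inline int g_call_count = 0;
```

None of these declarations says anything about call-site expansion. Whether clamp_percent or Registry::allocate is actually inlined is still up to the optimizer.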

LTO: inlining across translation unit boundaries

The classic limitation of inlining is that it only works within a single translation unit. A function defined in math_utils.cpp is opaque to renderer.cpp at compile time without additional intervention.

Link-Time Optimization solves this. With -flto on GCC or Clang, the compiler emits GIMPLE IR (GCC) or LLVM bitcode (Clang) into object files instead of native code. At link time, a plugin performs whole-program optimization across all translation units, enabling inlining, constant propagation, and dead-code elimination across the entire binary.

Full LTO requires loading all IR into memory simultaneously, which is expensive for large codebases. LLVM’s ThinLTO, available via -flto=thin, addresses this by generating lightweight per-module summaries and compiling modules in parallel. ThinLTO captures roughly 80 to 90 percent of full LTO’s performance benefit at three to five times faster link speeds. Firefox, Chrome, and LLVM itself ship with ThinLTO in their release builds. Typical improvements on real-world C++ programs fall in the 5 to 15 percent range, with programs rich in small cross-TU utility functions seeing larger gains.

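GCC’s closest equivalent is WHOPR, its partitioned LTO mode, which is what -flto uses by default. Passing -flto=N runs the link-time compilation with N parallel jobs, and -flto=auto picks the job count automatically.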
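As a rough sketch of what this looks like in practice, the snippet below shows the two translation units from the example above collapsed into one listing, with typical build commands in comments. The function names clamp01 and normalize are hypothetical, and the commands are assumptions about a common setup (ThinLTO with Clang generally needs an LTO-aware linker such as lld), so adjust them for your toolchain.

```cpp
#include <cassert>

// ---- math_utils.cpp (hypothetical file) ----
float clamp01(float x) {
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

// ---- renderer.cpp (hypothetical file) ----
float clamp01(float x);  // declaration only: without LTO the body is invisible here

void normalize(float* px, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        px[i] = clamp01(px[i]);  // can be inlined across files only at link time
}

// Build sketches:
//   g++     -O2 -flto      -c math_utils.cpp renderer.cpp
//   g++     -O2 -flto      math_utils.o renderer.o -o app
//
//   clang++ -O2 -flto=thin -c math_utils.cpp renderer.cpp
//   clang++ -O2 -flto=thin -fuse-ld=lld math_utils.o renderer.o -o app
```

The same optimization flags go on both the compile and the link step, because code generation is deferred until link time.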

Virtual calls and what Spectre changed

Virtual function calls add a layer of indirection that compounds all of the above. The dispatch sequence loads the vtable pointer from the object, loads the function pointer from that vtable, then performs an indirect call through a register. Two pointer dereferences precede the call, and if either is cold in cache, you add 4 to 40 cycles per miss before any useful work begins.

The larger problem is indirect branch misprediction. At a monomorphic call site where one concrete type dominates, the CPU’s branch target buffer stays warm and the effective overhead is close to a direct call. At a genuinely polymorphic site with multiple concrete types rotating through, the predictor thrashes. A mispredicted indirect branch on Skylake costs 15 to 30 cycles.

Spectre mitigations worsened this substantially. Retpoline, the dominant mitigation strategy deployed after 2018, replaces indirect jumps with a return-based trampoline to prevent speculative execution of the indirect target. Even with a warm branch target buffer, retpoline-mitigated indirect calls cost 15 to 30 cycles, compared to 3 to 5 cycles for a direct call on unmitigated hardware. Virtual call-heavy code on mitigated systems can show 2x to 4x regressions relative to pre-Spectre baselines, purely from the mitigation overhead on call dispatch.

Compilers can devirtualize virtual calls when the concrete type is provably known at compile time: for example a stack-allocated object, a call through a final class, or, under Clang with LTO and -fwhole-program-vtables enabled, a class hierarchy the optimizer can see in full. With PGO data, compilers can emit speculative devirtualization: an inline guard that tests the most common concrete type and dispatches directly for that type, falling back to virtual dispatch only for the rest.
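Both ideas can be sketched in source form. In the following illustration the types Shape, Square, and Rect are hypothetical. The first function relies on final so the compiler can bind the call statically. The second is a hand-written analogue of the guard a compiler emits for speculative devirtualization. Real compilers typically compare vtable or function pointers rather than using typeid, so treat the guard as a model of the control flow, not of the generated code.

```cpp
#include <cassert>
#include <typeinfo>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual float area() const = 0;
};

// final: no further overrides can exist, so a call made through a Square
// can be resolved at compile time and inlined.
struct Square final : Shape {
    float side;
    explicit Square(float s) : side(s) {}
    float area() const override { return side * side; }
};

struct Rect : Shape {
    float w, h;
    Rect(float w_, float h_) : w(w_), h(h_) {}
    float area() const override { return w * h; }
};

// Concrete final type: no vtable lookup is needed for s.area().
float total_square_area(const std::vector<Square>& squares) {
    float sum = 0.0f;
    for (const Square& s : squares) sum += s.area();
    return sum;
}

// Manual guard: test for the expected hot type, call it directly,
// and fall back to virtual dispatch for everything else.
float total_area(const std::vector<const Shape*>& shapes) {
    float sum = 0.0f;
    for (const Shape* s : shapes) {
        assert(s != nullptr);
        if (typeid(*s) == typeid(Square))
            sum += static_cast<const Square*>(s)->Square::area();  // direct call
        else
            sum += s->area();                                      // virtual call
    }
    return sum;
}
```

Writing the guard by hand is rarely worth it. The point is that PGO lets the compiler choose the hot type from measured data instead of a guess.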

Constant folding: the least appreciated dimension

This is where Lemire’s analysis is most useful, and where developers most often underestimate the value of function visibility.

When the compiler can see a function body at the call site, it can propagate constants from the call arguments into the function body. A function computing a value from a known constant may reduce entirely to a single immediate. A function branching on an argument may have entire branches eliminated because the compiler can prove which path is taken.

inline bool fits_in_byte(int x) {
    return x >= 0 && x < 256;
}

// Caller passes a value the compiler can prove is always in [0, 255]
// With visibility: the entire call folds to 'true', the branch disappears
// Without visibility: a real comparison and branch remain

Without that visibility, the compiler must treat all arguments as runtime unknowns. The function’s internal branches remain. The arithmetic remains. The result must be computed and returned through a register. Every secondary optimization that would have cascaded from seeing the full body simply does not happen.
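A small sketch makes the folding visible. The version of fits_in_byte below is marked constexpr, which is an addition for illustration so that the constant case can be checked with static_assert. The caller encode is hypothetical. Note that static_assert demonstrates compile-time evaluation by the front end, which is related to but not the same as the optimizer's constant propagation after inlining.

```cpp
// constexpr functions are implicitly inline, so the body is visible
// at every call site that includes this definition.
constexpr bool fits_in_byte(int x) {
    return x >= 0 && x < 256;
}

// Constant arguments: the whole call reduces to a compile-time boolean.
static_assert(fits_in_byte(200), "200 lies in [0, 255]");
static_assert(!fits_in_byte(256), "256 is out of range");

// On platforms with 8-bit bytes an unsigned char is always in [0, 255].
// Once fits_in_byte is inlined, the optimizer can prove the check is
// always true and is expected to drop the error branch entirely.
int encode(unsigned char c) {
    if (!fits_in_byte(c))
        return -1;   // dead code when the callee body is visible
    return c + 1;
}
```

If fits_in_byte were only declared here and defined in another translation unit, encode would have to keep the call, the comparison on its result, and the error path.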

This is the framing that matters. Inlining is not primarily about eliminating a call instruction. It is about providing the compiler enough context to do the optimizations it could not otherwise prove were safe. The 3-cycle call is a rounding error. What sits behind that call, invisible to the optimizer, is where the real cost lives.

Practical guidance

For performance-critical code paths, the priority ordering is roughly:

  1. Keep hot functions in the same translation unit as their callers, or enable LTO. This is the highest-leverage change and requires no source modification.
  2. Use -fopt-info-vec-missed (GCC) or -Rpass-missed=loop-vectorize (Clang) on hot loops before doing anything else. If a loop fails to vectorize because of an opaque call, you now have a specific target.
  3. Prefer concrete types and final classes at high-frequency virtual call sites to enable devirtualization. final is a performance annotation as much as a semantic one.
  4. Reserve __attribute__((always_inline)) for cases where profiling confirms the compiler is refusing to inline something it should, and where the function is genuinely small.
  5. Measure with LTO off and on as separate configurations. The improvement is often larger than expected and comes at essentially zero source-code cost.
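For the measurement step, a minimal timing harness is often enough to compare two build configurations of the same source. The sketch below uses hypothetical helper names (best_time_us, scale_in_place) and takes the best of several runs to reduce noise. It is an assumption about a reasonable setup, not a substitute for a proper benchmarking library.

```cpp
#include <chrono>
#include <vector>

// Hypothetical helper: runs fn() reps times and returns the fastest
// wall-clock time in microseconds (or -1.0 if reps is not positive).
template <class Fn>
double best_time_us(Fn&& fn, int reps) {
    using clock = std::chrono::steady_clock;
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        auto t0 = clock::now();
        fn();
        auto t1 = clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (best < 0.0 || us < best) best = us;
    }
    return best;
}

// Example workload: scales the vector in place and returns the sum.
// Writing back to v keeps the loop's work observable to the optimizer.
float scale_in_place(std::vector<float>& v) {
    float sum = 0.0f;
    for (float& x : v) {
        x *= 2.5f;
        sum += x;
    }
    return sum;
}
```

Build the same file once with and once without -flto (or with and without the forced-inline attribute), run best_time_us around the hot function in each binary, and compare the two numbers.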

The cost of a function call has two components: the cycles the call instruction consumes, and the optimizations the call boundary prevents. The first is measured in single-digit nanoseconds. The second is measured in multipliers on loop throughput. Most performance tuning attention goes to the first. Most of the performance lives in the second.
