The Real Cost of a Function Call Is What the Compiler Can No Longer See
Source: isocpp
The number people quote when discussing function call overhead is typically 3 to 10 cycles on modern x86. That figure is roughly correct for a direct call with a warm instruction cache: CALL pushes the return address and jumps, the callee runs its prologue, does some work, runs an epilogue, and RET pops back. On a 3 GHz processor you are looking at about a nanosecond, maybe three at the outer limit. In isolation, that is fast enough to seem negligible.
Daniel Lemire’s recent post on isocpp.org shows something more interesting: a trivial increment function forced non-inline with __attribute__((noinline)) runs four to eight times slower than its inlined equivalent inside a tight loop. The function body is identical in both cases, and the raw call overhead is the same. What changed is that the compiler lost visibility into the operation, and that lost visibility blocked several optimizations that have nothing to do with the CALL instruction itself.
What Breaks at a Call Boundary
Modern compilers perform several analyses that require full visibility into the code being executed. An opaque function call breaks all of them simultaneously.
Auto-vectorization is the most consequential. SIMD instructions on x86 operate on 256-bit or 512-bit registers, processing 8 or 16 floats in a single instruction. The vectorizer generates this code only when it can inspect the entire loop body and determine that operations on adjacent elements are independent and of compatible types. An opaque function call inside a loop makes each iteration a black box. The calling convention passes a single float in XMM0; a vectorized version would need YMM0 carrying eight values at once. Those are fundamentally different interfaces, and the compiler cannot bridge them automatically.
Alias analysis degrades to worst-case assumptions. After an opaque call, the compiler must assume the callee may have modified any memory it could plausibly reach. Values held in registers must be flushed to memory; values in memory must be reloaded on the next access. Every pointer in scope becomes a potential hazard.
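A minimal sketch of the effect, assuming GCC or Clang (the helper touch is hypothetical; any out-of-line function triggers the same pessimization):

```cpp
#include <cassert>

// Hypothetical opaque helper: the compiler cannot see that it touches nothing.
__attribute__((noinline)) void touch() {}

// After each call to touch(), the compiler must assume memory reachable
// through p may have changed, so p[i] cannot stay cached in a register
// across iterations; it is reloaded every time.
int sum_with_call(const int* p, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += p[i];
        touch();
    }
    return s;
}
```

Delete the call and the loop reduces to straight register arithmetic; with the call, every load comes back from memory.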
Constant propagation stops at the boundary. If you call a function with a compile-time-known value, the compiler could, with visibility, evaluate the result statically. Without it, the return value is unknown and downstream computations cannot be simplified.
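A sketch of the difference, assuming GCC or Clang (the function names are illustrative):

```cpp
#include <cassert>

__attribute__((noinline))
int add_opaque(int a, int b) { return a + b; }   // call survives to runtime

static inline
int add_visible(int a, int b) { return a + b; }  // body visible: folds away

// With visibility, the compiler evaluates add_visible(2, 3) * 10 down to the
// constant 50 at compile time; the opaque version computes the same value
// through an actual call, and nothing downstream can be simplified.
int via_opaque()  { return add_opaque(2, 3) * 10; }
int via_visible() { return add_visible(2, 3) * 10; }
```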
Loop invariant code motion becomes conservative. A computation inside a loop that does not depend on the loop variable can normally be hoisted above the loop. If it involves an opaque call, the compiler cannot prove the call has no side effects, so it stays inside and executes on every iteration.
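The pattern, sketched with a hypothetical helper under GCC/Clang attribute syntax:

```cpp
#include <cassert>

__attribute__((noinline))
float half(float k) { return k * 0.5f; }  // hypothetical invariant helper

// half(k) does not depend on i, but the compiler cannot prove the opaque
// call is side-effect-free, so it executes on every iteration.
void scale_all(float* a, int n, float k) {
    for (int i = 0; i < n; i++)
        a[i] *= half(k);
}

// Manual hoisting restores what loop-invariant code motion could not do.
void scale_all_hoisted(float* a, int n, float k) {
    const float f = half(k);  // called exactly once, outside the loop
    for (int i = 0; i < n; i++)
        a[i] *= f;
}
```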
None of these are exotic optimizations. They are standard passes that run at -O2 on every loop in your codebase. A single opaque call disables all of them for the affected loop.
The Assembly Makes It Concrete
Consider a simple array squaring operation with a non-inlined helper:
__attribute__((noinline))
float square(float x) { return x * x; }

void square_array(float* dst, const float* src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = square(src[i]);
}
With -O3 -mavx2, GCC generates a scalar loop:
.loop:
    vmovss xmm0, [rsi + rax*4]
    call square
    vmovss [rdi + rax*4], xmm0
    inc rax
    cmp rax, rbx        ; n, kept in a callee-saved register across the call
    jl .loop
One element per iteration. Remove the noinline attribute, or write the multiplication directly in the loop body, and the same compiler at the same flags emits:
.loop:
    vmovups ymm0, [rsi + rax]
    vmulps ymm0, ymm0, ymm0
    vmovups [rdi + rax], ymm0
    add rax, 32
    cmp rax, rcx        ; n * 4, the precomputed byte bound
    jl .loop
Eight floats per iteration. With AVX-512 the factor is sixteen. You can verify this interactively on Compiler Explorer by toggling the noinline attribute; the vectorized inner loop appears and disappears as you flip it. Measured throughput differences in practice range from 4x to 20x depending on arithmetic intensity and memory access patterns. That is not call overhead. That is the vectorizer running when it has what it needs.
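A minimal harness along these lines makes the gap measurable on your own hardware. This is not Lemire's benchmark; the array size and timing method are illustrative, and the lambda bodies stay visible to the optimizer while the noinline call does not:

```cpp
#include <chrono>
#include <utility>
#include <vector>

__attribute__((noinline))
float square_opaque(float x) { return x * x; }

// Times one pass of dst[i] = f(src[i]) and returns elapsed seconds.
template <typename F>
double time_loop(F f, std::vector<float>& dst, const std::vector<float>& src) {
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < src.size(); i++) dst[i] = f(src[i]);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Returns {opaque_seconds, visible_seconds} for n elements.
std::pair<double, double> compare(size_t n) {
    std::vector<float> src(n, 1.5f), dst(n);
    double t_opaque  = time_loop([](float x) { return square_opaque(x); }, dst, src);
    double t_visible = time_loop([](float x) { return x * x; }, dst, src);
    return {t_opaque, t_visible};
}
```

Built with -O3 -mavx2, the visible version is the one the vectorizer can rewrite; the opaque version stays scalar.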
Lemire’s earlier work on inlining and vectorization established this relationship clearly. The SIMD code path does not exist at all when the loop body is opaque. His follow-up on raw function call cost benchmarks the scalar penalty directly, putting numbers to what the assembly already suggests.
Controlling Inlining Explicitly
The inline keyword in C++ does not control whether a function is inlined. Its actual role is an ODR exemption, allowing a function to appear in multiple translation units without a linker error. Compilers have documented this for decades. GCC’s threshold at -O2 is around 600 pseudoinstructions; Clang uses a cost unit model with a default threshold around 225, with substantial bonuses when inlining would enable constant folding.
For actual control, the platform-specific attributes are the only reliable mechanism:
#if defined(__GNUC__) || defined(__clang__)
# define FORCE_INLINE __attribute__((always_inline)) inline
# define NO_INLINE    __attribute__((noinline))
#elif defined(_MSC_VER)
# define FORCE_INLINE __forceinline
# define NO_INLINE    __declspec(noinline)
#else
# define FORCE_INLINE inline
# define NO_INLINE
#endif
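Usage then looks like the following (the function names are illustrative; the macro block is repeated so the sketch is self-contained):

```cpp
#include <cassert>

#if defined(__GNUC__) || defined(__clang__)
# define FORCE_INLINE __attribute__((always_inline)) inline
# define NO_INLINE    __attribute__((noinline))
#elif defined(_MSC_VER)
# define FORCE_INLINE __forceinline
# define NO_INLINE    __declspec(noinline)
#else
# define FORCE_INLINE inline
# define NO_INLINE
#endif

// Hot-path helper: small enough that forced inlining is safe.
FORCE_INLINE float lerp(float a, float b, float t) { return a + t * (b - a); }

// Cold path: keeping it out of line protects the caller's icache footprint.
NO_INLINE float cold_fallback(float a) { return a; }
```

Note the directions are asymmetric in cost: a wrongly forced NO_INLINE shows up immediately in profiles, while a wrongly forced FORCE_INLINE quietly grows code size.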
GCC also provides __attribute__((flatten)), which forces the compiler to inline every call within the annotated function recursively. It is useful when a hot kernel composes several small operations that individually fall below the inlining threshold but collectively prevent vectorization. The C++ attribute syntax [[gnu::always_inline]] is portable across GCC and Clang without a macro:
[[gnu::always_inline]] inline float square(float x) { return x * x; }
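A sketch of flatten on a composed kernel, assuming GCC or Clang (the helper names are illustrative):

```cpp
#include <cassert>

static float clamp01(float x) { return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x); }
static float gain(float x, float g) { return x * g; }

// flatten inlines every call inside process(), recursively, even when the
// helpers individually fall below the compiler's inlining threshold. With
// both bodies visible, the loop becomes a candidate for vectorization.
__attribute__((flatten))
void process(float* a, int n, float g) {
    for (int i = 0; i < n; i++)
        a[i] = clamp01(gain(a[i], g));
}
```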
simdjson, the SIMD JSON parser that Lemire co-authored, defines really_inline as __attribute__((always_inline)) inline and applies it throughout the hot parsing path. The result is 2.5 to 3.5 GB/s of parsing throughput; conventionally structured parsers with opaque call boundaries typically achieve around 0.5 GB/s on the same inputs. The difference is optimizer visibility, not algorithmic cleverness.
Cross-Module Inlining with LTO
Within a single translation unit, the compiler has full visibility. Across .cpp file boundaries it does not, and function calls to code defined in another file are opaque by default. This means a small helper in a separate source file can silently prevent vectorization of every loop that calls it, without any warning.
Link-time optimization addresses this. Full LTO (-flto) stores intermediate representation in object files and runs a whole-program optimization pass at link time, enabling inlining and alias analysis across the entire codebase. ThinLTO (-flto=thin) instead records compact per-function summaries and imports only the functions each module actually needs, so modules optimize in parallel; it typically delivers 80 to 90% of full LTO's benefit at substantially lower link-time cost. The LLVM ThinLTO documentation covers the configuration details.
clang++ -O2 -flto=thin foo.cpp bar.cpp -o app
Measured speedups on call-heavy code range from 5% to 20%. Combined with profile-guided optimization, browsers like Chrome and Firefox see 10 to 15% improvements over plain -O2 builds, and both use ThinLTO in production.
One cross-language note worth flagging: Rust functions without #[inline] are not inlined across crate boundaries even with LTO enabled, because the IR is not exported into crate metadata without the attribute. A frequently called utility function in a library crate silently prevents vectorization in all downstream callers unless marked #[inline] or #[inline(always)]. The semantics differ enough from C++ that it surprises developers who cross between the two.
The Tradeoff: Instruction Cache Pressure
Inlining every hot function unconditionally is not the right strategy. Each inlined call copies the function body to the call site, growing the binary’s code size. The L1 instruction cache on modern hardware is 32 to 64 KB per core. Once the hot path no longer fits, instruction fetch misses add 10 to 40 cycles each, which can exceed the overhead of the calls you were trying to eliminate.
Both Firefox and Chrome have documented cases where reducing inlining aggressiveness improved throughput due to instruction cache pressure. Agner Fog’s CPU optimization guides discuss this tradeoff in depth with specific measurement methodology. The workable heuristic is to favor force-inlining functions with fewer than roughly 10 instructions when they appear inside confirmed hot loops. Profile first with something like perf stat -e instructions,L1-icache-load-misses before reaching for always_inline broadly.
Diagnostic Output
The compilers will report inlining decisions on request:
# Clang: every inline decision and every missed opportunity
clang++ -O2 -Rpass=inline -Rpass-missed=inline file.cpp
# GCC equivalent
g++ -O2 -fopt-info-inline -fopt-info-inline-missed file.cpp
# Missed vectorizations (frequently shows "call instruction cannot be vectorized")
clang++ -O3 -Rpass-missed=loop-vectorize file.cpp
g++ -O3 -fopt-info-vec-missed file.cpp
The missed vectorization output is particularly actionable. It identifies the specific call site preventing the vectorizer from running. At that point the options are: force-inline the function, restructure the code to move the call outside the loop, or accept scalar performance as intentional.
The Right Mental Model
A function call in a tight loop is not a 3-nanosecond tax. It is a boundary at which the optimizer resets its assumptions and starts fresh with no knowledge of what just happened. Auto-vectorization requires loop body visibility, so it stops. Alias analysis requires call-site information, so it pessimizes. Constant propagation requires operand values, so it stalls. The compiled code on the far side of that boundary is conservative by necessity.
Inlining removes the boundary. The optimizer sees one continuous body, applies all its analyses jointly, and frequently produces code that is qualitatively different from the scalar equivalent. The overhead number that matters is not the cycle count of the CALL instruction. It is the factor-of-ten throughput difference between a scalar loop and a vectorized one, which only exists when the compiler can see your code.