The Real Cost of a Function Call Is What the Compiler Can No Longer See
Source: isocpp
Daniel Lemire’s recent post on isocpp.org opens with a clean example: a function add3 that calls add twice, versus an equivalent that just does x + y + z inline. The mechanical point is correct. The deeper story is more interesting.
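For readers without the post at hand, the comparison can be sketched roughly as follows. This is a reconstruction from the description above, not necessarily Lemire's exact code, and add3_inline is a name chosen here for illustration.

```cpp
// Sketch of the two variants described above.
int add(int x, int y) { return x + y; }

// Calls add twice; if add is not inlined, that is two CALL/RET pairs.
int add3(int x, int y, int z) { return add(add(x, y), z); }

// Same arithmetic written directly, with no calls (hypothetical name).
int add3_inline(int x, int y, int z) { return x + y + z; }
```

When add is visible, an optimizing compiler will typically turn add3 into the same code as add3_inline.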
The CALL and RET instruction pair costs somewhere between 6 and 10 cycles on a modern Skylake core, per Agner Fog’s instruction tables. Add register spills for caller-saved values (RAX, RCX, RDX, RSI, RDI, R8-R11 in the System V AMD64 ABI), and a non-trivial call can cost 15-25 cycles before the callee executes a single instruction. In a loop, that cost is paid on every iteration. That is the number people usually think about.
It is not the number that matters most.
What the Compiler Cannot See Across a Call Boundary
A function defined in a separate translation unit is opaque to the compiler processing the caller. The compiler does not know whether the callee reads global state, writes to pointers visible to the caller, or has any side effects at all. So it must assume the worst on all counts.
This has several consequences. The compiler cannot reorder memory accesses across the call. It cannot hoist a loop-invariant expression out of the loop if the call might observe or modify the result. It cannot eliminate a duplicate call even if you call the same function with identical arguments, because the second call might return a different value or have different side effects than the first. And it cannot vectorize a loop that contains the call.
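A small sketch of why the duplicate call cannot simply be dropped. The names next_ticket and two_calls are hypothetical; the point is only that a function can look like a pure function of its argument from the outside while depending on hidden state.

```cpp
// Hidden state that callers in another translation unit cannot see.
static int counter = 0;

// Looks like a function of `base` alone, but also reads and writes counter.
int next_ticket(int base) { return base + counter++; }

int two_calls() {
    int a = next_ticket(5);  // uses the current counter value
    int b = next_ticket(5);  // same argument, but counter is now one higher
    return b - a;            // always 1, so the two calls are not redundant
}
```

Without the body, the compiler has to assume every external function might behave like next_ticket, so it keeps both calls.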
That last point is where the real performance gap lives.
Consider a loop that adds two arrays element-by-element:
int add(int x, int y) { return x + y; }
void sum_arrays(const int* a, const int* b, int* out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = add(a[i], b[i]);
    }
}
If add is defined in a separate .cpp file with no LTO, the compiler has no choice but to generate a scalar loop with a function call on every iteration:
.loop:
    mov edi, DWORD PTR [rbx+rax*4]
    mov esi, DWORD PTR [r12+rax*4]
    call add
    mov DWORD PTR [r13+rax*4], eax
    add rax, 1
    cmp rax, rcx
    jne .loop
Make add visible at the call site — put it in a header, or use LTO — and the compiler sees a body it can reason about. With -O3 -mavx2, the vectorizer takes over:
.loop:
    vmovdqu ymm0, YMMWORD PTR [rbx+rax]
    vpaddd ymm0, ymm0, YMMWORD PTR [r12+rax]
    vmovdqu YMMWORD PTR [r13+rax], ymm0
    add rax, 32
    cmp rax, rdx
    jne .loop
Eight 32-bit integers per iteration, no function call anywhere. Throughput goes from roughly 1.0 ns per element to 0.08-0.12 ns per element. The call overhead (those 15-25 cycles) is not the gap. The vectorization difference is. For a million-element array, at those per-element rates, that is the difference between roughly 1 millisecond and roughly 80-120 microseconds.
AVX-512, where available, doubles the width again.
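To check the gap on your own machine, a rough harness along these lines may help. It is a sketch under stated assumptions: noinline (GCC/Clang syntax) stands in for a function defined in another .cpp file without LTO, and add_opaque, add_visible, sum_opaque, sum_visible, and ns_per_element are illustrative names. A single pass is noisy, so a serious benchmark would repeat the measurement and take a minimum or median.

```cpp
#include <chrono>
#include <vector>

// noinline approximates "defined in another translation unit, no LTO".
__attribute__((noinline)) int add_opaque(int x, int y) { return x + y; }
inline int add_visible(int x, int y) { return x + y; }

void sum_opaque(const int* a, const int* b, int* out, int n) {
    for (int i = 0; i < n; i++) out[i] = add_opaque(a[i], b[i]);
}

void sum_visible(const int* a, const int* b, int* out, int n) {
    for (int i = 0; i < n; i++) out[i] = add_visible(a[i], b[i]);
}

// Times one pass of fn over n elements (n > 0) and returns ns per element.
template <typename Fn>
double ns_per_element(Fn fn, int n) {
    std::vector<int> a(n, 1), b(n, 2), out(n, 0);
    auto start = std::chrono::steady_clock::now();
    fn(a.data(), b.data(), out.data(), n);
    auto stop = std::chrono::steady_clock::now();
    // Read one result through a volatile so the work is not discarded.
    volatile int sink = out[n - 1];
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / n;
}
```

Calling ns_per_element(sum_opaque, 1000000) and ns_per_element(sum_visible, 1000000) in a build with -O3 -mavx2 should show the direction of the effect, though the exact ratio depends on compiler, flags, and hardware.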
The Vectorizer’s Actual Constraint
GCC and Clang’s loop vectorizers require that every operation in the loop body be something the vectorizer can reason about. An opaque function call is both an alias analysis barrier and a side-effect barrier. The vectorizer does not know whether the function modifies any pointer in scope, so it cannot legally reorder or batch the operations. You can ask GCC to tell you about this directly:
g++ -O3 -fopt-info-vec-missed hot_loop.cpp
The output will include messages to the effect that a function call may clobber memory (the exact wording varies by GCC version) for every loop the vectorizer attempted and gave up on. Clang has an equivalent:
clang++ -O3 -Rpass-missed=loop-vectorize hot_loop.cpp
These diagnostic modes are underused. They make the optimization barrier concrete rather than theoretical.
Vectorization is also not the only thing a call blocks. Inlining enables common subexpression elimination across what was formerly a call boundary, loop invariant code motion for hoisting computations out of loops, dead store elimination, constant propagation when call arguments are compile-time constants, and improved register allocation. The function call is an optimization wall in every direction, not just for SIMD.
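As one illustration of loop-invariant code motion, consider the sketch below. The names combined_gain and apply are hypothetical. Because the helper's body is visible and its arguments never change inside the loop, the optimizer is free to compute its result once, before the loop starts.

```cpp
// Hypothetical helper, visible at the call site.
inline double combined_gain(double gain, double offset) {
    return gain * gain + offset;
}

void apply(double* data, int n, double gain, double offset) {
    for (int i = 0; i < n; i++) {
        // gain and offset are loop-invariant, so after inlining this
        // expression can be hoisted out of the loop.
        data[i] *= combined_gain(gain, offset);
    }
}
```

If combined_gain were opaque, the compiler would have to call it n times, since it could not rule out side effects or a dependence on data.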
What inline Actually Does
The C++ inline keyword does not tell the compiler to inline a function. This is one of the more persistent misconceptions in the language. What it actually does is grant an ODR (One Definition Rule) exemption, allowing a function to be defined in multiple translation units with identical definitions without a linker error. That is why functions defined in headers are marked inline — to prevent the linker from complaining when the header appears in a dozen .cpp files.
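A minimal sketch of the header case, assuming a hypothetical header named math_utils.h that is included from several .cpp files:

```cpp
// Contents of a hypothetical header, math_utils.h.
#ifndef MATH_UTILS_H
#define MATH_UTILS_H

// Every .cpp file that includes this header gets its own copy of the
// definition. The inline keyword tells the linker these copies are the
// same function; without it, the link fails with a duplicate symbol error.
inline int add(int x, int y) { return x + y; }

#endif  // MATH_UTILS_H
```

The include guard prevents repeated definitions within one translation unit; inline handles the repetition across translation units.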
The compiler is under no obligation to honor the inline keyword as an inlining hint, and by the late 1990s compilers had developed cost models sophisticated enough that they stopped paying much attention to it anyway. GCC’s inliner applies a budget of roughly 400 weighted pseudo-instructions at -O2 per its parameter documentation. Clang uses a cost model around 225 units at -O2. These defaults have shifted between compiler releases, so treat them as orders of magnitude rather than exact limits. Both compilers also adjust the numbers based on context: a function that would unlock vectorization gets a bonus that increases the threshold.
If you want to force the issue, GCC and Clang offer __attribute__((always_inline)), which bypasses the cost model entirely:
__attribute__((always_inline)) inline int add(int x, int y) { return x + y; }
With GCC, this fails loudly with a compile error if the function genuinely cannot be inlined (a function that calls itself recursively, for instance). The simdjson library defines its own really_inline alias for this attribute and applies it throughout its hot path, achieving 2.5-3.5 GB/s JSON parsing throughput where a version with opaque call boundaries achieves around 0.5 GB/s.
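MSVC has __forceinline. The same attribute can also be written in C++11 attribute syntax as [[gnu::always_inline]] or [[clang::always_inline]]. These are vendor extensions rather than standard attributes, and they work only on compilers that recognize them.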
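Projects that build on more than one compiler often wrap these spellings in a macro. The sketch below shows one way to do that; HOT_INLINE is a hypothetical name, not taken from any library.

```cpp
// HOT_INLINE is an illustrative macro name.
#if defined(_MSC_VER)
#  define HOT_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define HOT_INLINE inline __attribute__((always_inline))
#else
#  define HOT_INLINE inline  // unknown compiler: fall back to the keyword
#endif

HOT_INLINE int add(int x, int y) { return x + y; }

int add3(int x, int y, int z) { return add(add(x, y), z); }
```

The fallback branch keeps the code compiling on other toolchains, at the cost of leaving the inlining decision to their cost models.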
The Cross-Translation-Unit Problem: LTO
The fundamental issue is visibility. A function’s body must be visible at the call site for inlining to happen at compile time. That is the actual gate, not the inline keyword.
For large codebases where you cannot or do not want to put everything in headers, Link-Time Optimization solves this. With -flto, GCC emits GIMPLE IR into object files instead of machine code, and Clang emits LLVM bitcode. At link time, the full program IR is available for a whole-program optimization pass, enabling inlining, constant propagation, and devirtualization across translation unit boundaries.
Full LTO has substantial link-time cost. Clang’s ThinLTO (-flto=thin) builds lightweight per-module summaries at compile time, then performs cross-module inlining using those summaries in parallel at link time. It captures roughly 80-90% of full LTO’s runtime benefit at 3-5x faster link speeds. Chrome and Firefox have shipped release builds with ThinLTO, and the Linux kernel can be built with it under Clang; reported runtime improvements over plain -O2 are in the 10-15% range on hot paths.
# ThinLTO with Clang
clang++ -O2 -flto=thin foo.cpp bar.cpp -o app
# CMake
set_property(TARGET my_target PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
Rust handles this differently. The #[inline] attribute serializes a function’s MIR into crate metadata, making it available for inlining across crate boundaries without requiring whole-program LTO. Without it, a non-generic function in one crate generally cannot be inlined into another. Generic functions such as Vec::push are a separate case: they are instantiated in the calling crate, so their bodies are already visible there. The standard library applies #[inline] pervasively on small functions for this reason.
When You Do Not Want Inlining
Aggressive inlining has a real cost: binary size and instruction cache pressure. The L1 instruction cache on Skylake is 32 KB. Inlining a function at N call sites creates N copies of its body. Agner Fog’s optimization manuals document cases where inlining across cache line boundaries caused measurable regressions compared to a non-inlined version with better hot code density. Firefox and Chrome have both documented cases where reducing inlining aggressiveness improved throughput on L1i-constrained workloads.
There is also __attribute__((noinline)), which is useful for: keeping cold-path code (error handlers, rarely-taken branches) out of the instruction cache; preserving readable stack traces and clean per-function profiling boundaries; and creating deliberate optimization barriers in microbenchmarks, where without it the benchmark may be measuring an inlined version with zero call overhead rather than the overhead you intended to measure.
__attribute__((noinline)) void report_error(const char* msg);
GCC’s __attribute__((cold)) combined with noinline places the function in a separate text section (.text.unlikely on typical ELF targets), away from the hot code, so it does not share instruction cache lines with it.
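A short sketch of the cold-path pattern, using GCC/Clang attribute syntax. The parse_digit function is a hypothetical hot-path caller added for illustration.

```cpp
#include <cstdio>

// Cold path: never inlined, and marked as unlikely to execute.
__attribute__((noinline, cold)) void report_error(const char* msg) {
    std::fprintf(stderr, "error: %s\n", msg);
}

// Hot path: a parser step that rarely fails.
int parse_digit(char c) {
    if (c < '0' || c > '9') {
        report_error("not a digit");  // rarely taken branch
        return -1;
    }
    return c - '0';
}
```

The error formatting code stays out of parse_digit's body, so the common case remains a compare and a subtract.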
The middle path, when you cannot inline but want to give the optimizer more room, is the const and pure attributes:
[[gnu::const]] double fast_rsqrt(double x); // pure math, no memory reads
[[gnu::pure]] int count_valid(const Record* r, int n); // reads memory, no writes
These let the compiler hoist calls out of loops, eliminate duplicate calls with identical arguments, and reorder freely — without requiring the function body to be visible. The SLEEF library uses these attributes so the vectorizer can replace scalar math calls with SIMD equivalents without needing to see the function internals.
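A sketch of the effect on duplicate calls, under the assumption that Record has the trivial layout shown and that twice_valid is a hypothetical caller. The definition appears here only to keep the example self-contained; in real code it would sit in another .cpp file and callers would see just the declaration.

```cpp
struct Record { bool valid; };  // hypothetical record type

// The only thing callers would normally see: reads memory, writes nothing.
[[gnu::pure]] int count_valid(const Record* r, int n);

// Definition, normally in a separate translation unit.
int count_valid(const Record* r, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) count += r[i].valid ? 1 : 0;
    return count;
}

// Identical arguments and no intervening writes to memory, so the
// compiler may evaluate count_valid once and reuse the result.
int twice_valid(const Record* r, int n) {
    return count_valid(r, n) + count_valid(r, n);
}
```

The attribute is a promise, not a check: marking a function pure or const when it actually writes memory can lead to miscompiled code.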
The Practical Summary
The mechanical cost of CALL/RET is real but secondary. What matters is that a function call is an optimization boundary. The compiler cannot vectorize across it, hoist across it, or eliminate redundancy across it. The difference between a scalar loop with call overhead and a vectorized loop is not 15-25 cycles per iteration — it is an 8-12x throughput multiplier on modern AVX2 hardware.
For hot loops, make the function body visible at the call site. Put it in a header, use LTO, or apply always_inline. Use vectorization diagnostic flags to confirm what the compiler actually did. For cold paths and profiling boundaries, noinline is your tool. And if you are building with a separate compilation model you cannot change, ThinLTO adds far less link time than full LTO and recovers most of what the translation unit boundary takes away.