
What the Compiler Cannot See: Function Calls as Optimization Boundaries

Source: isocpp

The benchmark in Daniel Lemire’s article on isocpp.org uses the simplest possible case: a function that adds two integers, forced to never be inlined with __attribute__((noinline)). In a tight loop, that function runs at about 1.5 to 2.5 nanoseconds per iteration. The inlined version runs at about 0.3 ns. The arithmetic is identical and the CPU is the same; the five- to eight-times slowdown is not the cost of a function call but the cost of what the compiler stops doing once it cannot see through the call.
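The shape of the benchmark can be sketched like this (a minimal reconstruction, not Lemire's exact harness; the function names and the test loop are illustrative, and `__attribute__((noinline))` is the GCC/Clang spelling):

```cpp
// Opaque at every call site: the attribute forbids inlining, so a loop
// calling this pays the full call/return cost on every iteration.
__attribute__((noinline)) int add(int a, int b) { return a + b; }

// The same arithmetic, but visible to the optimizer at the call site.
static inline int add_inline(int a, int b) { return a + b; }

int sum_noinline(const int* data, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s = add(s, data[i]);        // one CALL per element
    return s;
}

int sum_inline(const int* data, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s = add_inline(s, data[i]); // folds into the loop body
    return s;
}
```

Both functions compute the same sum; only the second gives the optimizer a loop body it can transform.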

What the CPU actually does

A function call on x86-64 involves the CALL instruction pushing an 8-byte return address, a jump to the callee, some stack frame setup in the prologue, and the reverse on return. Modern Intel microarchitecture predicts return addresses via the Return Stack Buffer, which on Golden Cove holds 24 entries. When call depth stays within RSB capacity, return prediction is nearly perfect and the misprediction penalty is effectively zero.

The raw cost is 6 to 12 cycles under ideal conditions: 4 to 8 for the call/return pair, more if the callee saves and restores callee-saved registers. At 3 GHz, 12 cycles is about 4 nanoseconds. That is a real cost in a loop running millions of iterations. It is also not the dominant one.

The optimizer stops at the call boundary

Modern compilers work on entire function bodies at once. Loop vectorization, in particular, requires the optimizer to confirm that iterations are independent, that memory reads and writes do not alias problematically, and that the operations in the body map to SIMD instruction sequences. A non-inlined call breaks every one of those assumptions.

When the compiler encounters a call to an external function, it must assume the worst: the callee may read or write any memory reachable through any pointer, it may have side effects on global state, and it creates a potential dependency between iterations. The vectorizer sees the loop, cannot model the call, and abandons the attempt.
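A concrete illustration of why the assumption must be this conservative (hypothetical names; the point is the potential aliasing, not the API):

```cpp
float scale = 2.0f;

// In a real program this is defined in another translation unit, so the
// optimizer sees only the declaration. For all it knows, the function
// writes to `scale`, to the memory behind `data`, or to both.
void opaque_callback();

void transform(float* data, int n) {
    for (int i = 0; i < n; ++i) {
        data[i] *= scale;    // `scale` must be reloaded: the call might change it
        opaque_callback();   // kills vectorization: iterations may now depend on each other
    }
}

// Defined here only so the sketch links; when this body lives in a
// separate .cpp file, the call site above cannot see it.
void opaque_callback() {}
```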

The performance gap in Lemire’s benchmark is not primarily those 6 to 12 cycles of call overhead. It is 8-wide AVX2 vectorization that never happens: a loop that could have processed 8 elements per VADDPS instead runs as scalar code, one element at a time. That is the difference between 0.3 ns and 2.0 ns per iteration.

The ABI enforces the wall

The reason the compiler cannot vectorize through a call is not purely a conservative analysis decision. The System V AMD64 ABI, which governs function signatures on Linux and macOS, defines that a single float argument is passed in XMM0. A vectorized call would need to pass 8 floats simultaneously, requiring a completely different function signature. The ABI has no mechanism for that without changing what the function looks like from the outside.
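In signature terms, the scalar contract versus what a vectorized call would need (the wide variant is a hypothetical illustration; the ABI defines nothing of the kind for a `float(float)` function):

```cpp
// Scalar contract under System V: the argument arrives in XMM0,
// the result leaves in XMM0.
float scale_one(float x) { return x * 2.0f; }

// A "vectorized call" would need a different contract entirely: 8 floats
// in, 8 floats out. That is a different function signature, which is why
// the compiler cannot conjure a wide version of scale_one at a call site.
void scale_eight(const float* in, float* out) {
    for (int i = 0; i < 8; ++i) out[i] = in[i] * 2.0f;
}
```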

Inlining eliminates the ABI boundary entirely. There is no longer a function being called; there is just code emitted at the call site, visible to every optimization pass that runs afterward. The loop vectorizer sees a body with no external calls, independent iterations, no aliasing, and produces SIMD code.

On Windows x64, there is an additional cost. The calling convention mandates a 32-byte shadow space on the stack before every call, even zero-argument ones, allocated and released by the caller. That is 32 bytes of extra stack adjustment per call that does not exist on Linux. For hot loops calling functions with five or more arguments, the Windows ABI also spills the fifth argument to the stack where System V would still keep it in a register. These are per-iteration costs that function-level profilers do not surface.

Abstraction overhead: std::function and virtual dispatch

The same analysis applies to type-erased callable wrappers. A std::function<float(float)> stores its target behind an opaque interface and invokes it through a function pointer. The compiler cannot inline through that indirection; the type of the actual callable is invisible at the call site, and vectorization is impossible for exactly the same reason it fails with a non-inlined external call.

// This can be vectorized:
auto transform = [](float x) { return x * 2.0f; };
for (int i = 0; i < N; ++i) data[i] = transform(data[i]);

// This cannot:
std::function<float(float)> transform = [](float x) { return x * 2.0f; };
for (int i = 0; i < N; ++i) data[i] = transform(data[i]);

The lambda body is identical. Wrapping it in std::function erases the concrete type, the optimizer loses visibility, and throughput drops by the same factor as a forced non-inline call. std::function_ref, coming in C++26, removes the ownership semantics and heap allocation while keeping the indirection. It is lighter, but for hot inner loops the vectorization barrier remains.

Virtual dispatch carries an additional post-Spectre cost on older hardware. Without eIBRS support, indirect branches on patched kernels go through retpoline, a software trampoline that serializes the branch. The cost rises to 30 to 80 cycles per indirect call regardless of prediction accuracy. Intel Cascade Lake and later, with Enhanced IBRS, bring this back to 4 to 6 cycles. Virtual dispatch benchmarks measured on hardened systems before 2020 are therefore not reliable numbers to cite on current hardware.
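The dispatch pattern in question, reduced to its essentials (illustrative types):

```cpp
struct Transform {
    virtual float apply(float x) const = 0;
    virtual ~Transform() = default;
};

struct Doubler : Transform {
    float apply(float x) const override { return x * 2.0f; }
};

// Unless the compiler can devirtualize, every iteration loads the vtable
// pointer, loads the target address, and makes an indirect call -- exactly
// the branch that retpoline serializes on pre-eIBRS hardware.
void run(const Transform& t, float* data, int n) {
    for (int i = 0; i < n; ++i) data[i] = t.apply(data[i]);
}
```

When the concrete type is provable at the call site, compilers can devirtualize and inline; the cost applies to genuinely polymorphic call sites.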

Cross-language: Rust, Go, and visibility

Rust uses the LLVM backend, so the inlining machinery is identical to Clang at the IR level. The distinction is crate boundaries: without #[inline], a function’s IR is not embedded in crate metadata, and another crate calling it cannot inline it even if the optimizer would otherwise choose to. A utility function in a library crate, exported as part of a hot API, silently blocks vectorization in all downstream users unless it carries #[inline]. The Rust standard library marks hot paths with #[inline] aggressively for this reason.

Rust also benefits from something C++ lacks by default: noalias semantics on mutable references, derived automatically from the borrow checker. When the compiler can confirm that two pointers in a loop do not alias, vectorization succeeds in cases where C++ code must be pessimistic without explicit __restrict__. Inlining and aliasing analysis reinforce each other.
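What Rust gets by construction can be spelled manually in C++ (GCC/Clang spelling shown; `__restrict__` is a compiler extension, not standard C++):

```cpp
void scale_may_alias(float* dst, const float* src, int n) {
    // dst could point into src, so the compiler must preserve
    // element-by-element ordering or emit a runtime overlap check.
    for (int i = 0; i < n; ++i) dst[i] = src[i] * 2.0f;
}

void scale_noalias(float* __restrict__ dst, const float* __restrict__ src, int n) {
    // The programmer promises no overlap -- the guarantee a Rust &mut
    // reference carries automatically -- and the vectorizer can proceed.
    for (int i = 0; i < n; ++i) dst[i] = src[i] * 2.0f;
}
```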

Go takes a structurally different approach. The gc compiler uses a budget of roughly 80 weighted AST nodes to decide whether to inline a function. There is no always_inline equivalent. If a function exceeds the budget, the only option is to simplify it. Go 1.17 switched from a stack-based to a register-based calling convention, yielding 5 to 15 percent improvement on integer-heavy benchmarks without changing the optimizer at all. Go 1.21 added PGO support, letting the inliner exceed its normal budget for call sites identified as hot in profile data. Crucially, Go has no auto-vectorizer in the standard gc compiler. There is no SIMD throughput to lose from missed inlining; libraries like gonum use hand-written assembly for SIMD numeric kernels.

Cross-translation-unit inlining and LTO

Within a single translation unit, the compiler sees everything and makes good decisions. The problem is that real programs span multiple files, and a function defined in one .cpp file is opaque to the optimizer when compiled separately from the file that calls it. Link-time optimization solves this by deferring optimization to link time, when the entire program’s IR is visible.

ThinLTO, available in Clang with -flto=thin, performs cross-TU inlining at link time with roughly full-LTO quality but much better incremental build performance through a cache. Google, Mozilla, Apple, and Meta all run ThinLTO in production. Chromium reports 10 to 15 percent improvement over plain -O2 from the combination of cross-TU inlining and downstream dead code elimination.

For the hottest paths, __attribute__((always_inline)) (GCC/Clang) and __forceinline (MSVC) force inlining unconditionally, bypassing the cost model. GCC also has __attribute__((flatten)), which recursively inlines all callees of the marked function. The simdjson project uses a really_inline macro mapping to always_inline across its entire hot parse path, and achieves 2.5 to 3.5 GB/s parsing throughput where a conventionally structured JSON parser reaches around 0.5 GB/s. That is not a measurement of how cheap function calls are; it is a measurement of what vectorization is worth once inlining makes it possible.
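The portable shape of such a macro looks roughly like this (modeled on simdjson's pattern; the project's actual macro may differ in detail, and `parse_digit` is an illustrative stand-in):

```cpp
#if defined(_MSC_VER)
  #define really_inline __forceinline
#else
  #define really_inline inline __attribute__((always_inline))
#endif

// Inlined into every caller regardless of the compiler's cost model,
// keeping the whole hot path visible to later optimization passes.
really_inline int parse_digit(char c) { return c - '0'; }
```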

Where this matters in practice

The gap between inlined and non-inlined code in Lemire’s benchmark only shows up in tight inner loops running hundreds of millions of iterations. For most application code, function call overhead is immaterial. The places where it matters are numeric kernels, parsers, serializers, compression, rendering, and anything running a short loop body across a large dataset.

The practical guidance is narrow: keep hot inner loop bodies visible to the optimizer, prefer templates and concrete lambdas over std::function in hot paths, annotate public library functions with #[inline] in Rust, and run ThinLTO on release builds of large C++ projects. Compiler diagnostics help identify missed opportunities. GCC’s -fopt-info-vec-missed and Clang’s -Rpass-missed=loop-vectorize report every loop the vectorizer considered and rejected, including the reason.
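A loop worth trying those diagnostics on (compile with `clang++ -O2 -Rpass-missed=loop-vectorize` or `g++ -O2 -fopt-info-vec-missed`; the function is an illustrative example):

```cpp
float sum(const float* data, int n) {
    float s = 0.0f;
    // Without -ffast-math, vectorizing this reduction would reorder
    // floating-point additions, so the vectorizer rejects it -- and the
    // missed-optimization report states that reason explicitly.
    for (int i = 0; i < n; ++i) s += data[i];
    return s;
}
```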

Lemire’s benchmark confirms something that CPU performance counters alone will not show: compilers optimize by seeing the full loop body, and call boundaries limit what they can see. The raw call cycles are a minor tax. The vectorization and transformation opportunities that evaporate at every opaque boundary are the actual cost.
