Inlining Across Boundaries: Why Function Call Cost Is Really an Optimization Visibility Problem
Source: isocpp
Daniel Lemire has a recent post on isocpp.org showing that function calls in tight loops are not free, and that compilers eliminate the overhead through inlining. The post is a good primer. What it leaves open is the more interesting question: what exactly is the compiler losing when it cannot inline a call, and how does that loss compound across different languages and build configurations?
The short answer is that the 3–5 cycles of call/ret overhead on x86-64 are rarely the dominant cost. What hurts is the visibility boundary the call creates.
The Call as an Opacity Barrier
Consider the canonical example from Lemire’s article:
int add(int x, int y) {
    return x + y;
}

int add3(int x, int y, int z) {
    return add(add(x, y), z);
}
When add is inlined into add3, the compiler sees the whole computation and can fold it into x + y + z. When it is not inlined, each call crosses an ABI boundary: the compiler must assume the callee may clobber any caller-saved register, may alias any writable memory, and produces an opaque result. The optimizer has to work around those assumptions, not through them.
That constraint matters most in loops. A loop that calls a non-inlined function in its body is a loop the compiler cannot vectorize. Auto-vectorization requires the loop body to be fully visible at the IR level. The vectorizer needs to know: are the memory accesses contiguous, are there loop-carried dependencies, and can the scalar operations be expressed as SIMD instructions? An opaque function call answers none of those questions.
Here is what the difference looks like concretely. Take a loop that applies a small transformation to an array:
__attribute__((noinline))
float bump(float x) { return x + 1.0f; }

float sum_bumped(float* a, int n) {
    float total = 0;
    for (int i = 0; i < n; i++) total += bump(a[i]);
    return total;
}
With -O3 and the noinline attribute in place, the inner loop executes a scalar call to bump every iteration. Remove noinline and the compiler inlines bump, sees that the body is a single addss, and can auto-vectorize the loop to process eight floats at once on AVX2 using vaddps (for this floating-point reduction, vectorization also needs -ffast-math or an equivalent flag, since reassociating float addition is not value-preserving by default). The vectorized version runs roughly 8–16 times faster on large arrays, not 1.001 times faster. The call overhead itself was never the point.
Agner Fog’s optimization manuals document the x86 instruction latencies in detail. A direct, well-predicted CALL/RET pair costs around 3–5 cycles. A vaddps on a 256-bit register adds eight floats with a latency of about 4 cycles and a reciprocal throughput of 0.5 cycles. The ratio is not even close.
How the Compiler Decides to Inline
At -O2 and -O3, GCC and Clang inline eagerly within a translation unit. Clang uses an LLVM IR cost model with a default threshold of around 225 cost units; GCC uses its own internal “insns” metric. Both expose diagnostic flags that let you see what they decided.
With Clang:
clang++ -O2 -Rpass=inline -Rpass-missed=inline foo.cpp
This emits remarks like:
foo.cpp:12:10: remark: add inlined into main [-Rpass=inline]
foo.cpp:18:10: remark: bar not inlined into main: cost=300 > threshold=225 [-Rpass-missed=inline]
With GCC, -fopt-info-inline-missed does the same. These flags are underused. Running them on a hot path before reaching for __attribute__((always_inline)) often reveals that the function was already being inlined and the performance problem lies elsewhere.
When a function genuinely needs to be forced inline, __attribute__((always_inline)) in GCC/Clang and __forceinline in MSVC will override the cost model. Use these sparingly: they can inflate code size enough to hurt instruction cache behavior, trading one performance problem for another.
The Cross-Translation-Unit Problem and LTO
Inlining within a single .cpp file is straightforward. The difficulty is that real programs span many files. Without link-time optimization, each translation unit compiles independently to machine code before the linker combines them. At link time, there is no IR left to inline across; the optimization window is closed.
LTO reopens that window. With -flto in GCC or -flto=thin in Clang, the compiler emits IR into the object files instead of (or alongside) machine code. At link time, the combined IR undergoes a full optimization pass, including cross-TU inlining, constant propagation, and dead code elimination. Google reports 10–20% performance improvements for production workloads using Clang ThinLTO combined with profile-guided optimization versus plain -O2.
ThinLTO, Clang’s scalable variant, is worth understanding separately. Full LTO merges all modules into one giant IR blob before optimizing, which is slow and memory-intensive. ThinLTO performs a fast summary scan of all modules to build a call graph, then optimizes each module in parallel, importing only the function definitions that analysis identified as worth inlining. The result is comparable code quality with compile time that scales with available CPU cores. Chrome, Firefox, and the Linux kernel use ThinLTO in production.
Rust’s Specific Footgun
Rust uses LLVM as its backend, so the mechanics of inlining and vectorization are identical to Clang at the IR level. The difference is at the language boundary: crates.
In C++, LTO enables inlining across translation units without any source-level annotation. In Rust, a non-generic function without #[inline] is normally not a candidate for inlining across crate boundaries in a build without LTO. The reason is that Rust’s crate metadata includes MIR (mid-level IR) only for functions marked for cross-crate use: generic functions, #[inline] functions, and, in recent compilers, some trivially small ones picked up automatically. Everything else reaches downstream crates only as compiled machine code, which their codegen cannot look inside. Enabling LTO restores cross-crate inlining at the LLVM level, but a library author cannot assume users build with LTO.
// In a library crate — without the attribute, this is not
// inlinable into callers in a non-LTO build:
#[inline]
pub fn clamp_positive(x: f32) -> f32 {
    x.max(0.0)
}
This is a concrete library-authoring mistake. A hot utility function in a widely-used crate, missing #[inline], silently prevents vectorization of every tight loop that calls it in every downstream crate built without LTO. The fix is one line. The cost of missing it is invisible unless someone runs the equivalent of -Rpass-missed=inline and notices.
#[inline(always)] and #[inline(never)] map directly to LLVM’s alwaysinline and noinline attributes, with the same semantics and the same trade-offs as in C++.
Java’s Runtime Answer
Java takes an entirely different approach to this problem. The HotSpot JVM profiles running code and inlines at runtime based on observed call frequencies and receiver types.
The C2 server JIT compiler inlines methods smaller than 35 bytecodes unconditionally (when the caller is hot) and methods up to 325 bytecodes if the call site itself is frequently executed, controlled by -XX:MaxInlineSize and -XX:FreqInlineSize respectively. For virtual method calls, HotSpot tracks receiver type profiles. A call site that has only ever dispatched to one concrete type gets inlined speculatively with a guard:
if (receiver.getClass() == KnownType.class) {
    // inlined body
} else {
    // deoptimize
}
This converts a costly indirect dispatch into an inlined sequence with a single comparison, enabling escape analysis, null check elimination, and scalar replacement of objects. The -XX:+PrintInlining JVM flag dumps inlining decisions for every JIT-compiled method.
The runtime approach has a real advantage: it uses actual execution data rather than static heuristics. A polymorphic call site that is monomorphic in practice gets the same treatment as a statically-dispatched call. The cost is compilation latency and the complexity of managing deoptimization when assumptions are violated.
Go’s Different Stakes
Go inlines within its cost model (functions under roughly 80 AST-node cost units in recent versions), and -gcflags="-m" reports inlining decisions. The stakes are lower than in C++ or Rust for one specific reason: the standard Go compiler does not auto-vectorize. There is no SIMD throughput to lose when a function fails to inline, because the compiler was never going to generate vaddps in the first place.
This does not mean inlining is irrelevant in Go. Constant propagation, escape analysis (which determines whether a value lives on the stack or heap), and call overhead elimination all depend on inlining. But the 8–16x SIMD multiplier that makes the difference between a slow and a fast numerical loop in C++ or Rust simply does not exist in Go’s compiler output. Libraries like gonum work around this with hand-written assembly routines.
A Practical Summary
The framing of “function call overhead” suggests a minor tax on cycles. The more useful framing is: every non-inlined call boundary is a point where the compiler’s optimization visibility ends. Within a translation unit, modern compilers handle this automatically. Across translation units, LTO is the right tool in C++ and Rust. In Rust specifically, library authors should put #[inline] on small public hot-path functions so that they can be inlined across crate boundaries even when users build without LTO.
The diagnostic tools exist and are not hard to use. -Rpass-missed=inline in Clang, -fopt-info-inline-missed in GCC, and -gcflags="-m" in Go all tell you what the compiler decided and, in some cases, why. Running them on a bottleneck before adding manual annotations is the right order of operations.
Lemire’s original article puts the basic mechanism cleanly. The larger point is that the mechanism is not C++-specific, it is not even about the call instruction itself, and the tools to diagnose and fix it exist across the major compiled languages.