Crossing the Function Boundary: Calling Conventions, Inlining, and the SIMD You Never Got
Source: isocpp
The cost of a function call is usually framed in terms of cycles: the setup of a stack frame, the saving of caller-saved registers, the branch predictor’s attempt to track the return address. On a modern Skylake-class processor, a minimal call-and-return on a hot, predicted path costs around three to five cycles. With register saves, that climbs to ten to fifteen cycles. Daniel Lemire’s article on the subject walks through the basics cleanly, but the cycle count is only part of the story.
The more significant cost is what the compiler cannot optimize once a function boundary stays opaque. A function call is an optimization barrier: the compiler must assume the callee might read or write any accessible memory, clobber any caller-saved register, or produce side effects that cannot be predicted from the call site. That conservative assumption is correct, but it forecloses a class of transformations that modern CPUs depend on for peak throughput. The most significant victim is auto-vectorization.
The Calling Convention Primer
Before getting to vectorization, it helps to understand what the CPU must actually do at a call site. The x86-64 System V ABI (used on Linux, macOS, and most Unix systems) passes the first six integer arguments in rdi, rsi, rdx, rcx, r8, and r9, and the first eight floating-point arguments in xmm0 through xmm7. Return values come back in rax (integer) or xmm0 (float). The Windows x64 ABI uses a different register assignment: rcx, rdx, r8, r9 for the first four integer arguments, with an unconditional 32-byte shadow space that the caller must always allocate even when fewer than four arguments are passed.
The register categories matter because they determine what the compiler must save before a call. Caller-saved registers (rax, rcx, rdx, rsi, rdi, r8-r11, and all XMM registers on System V) must be spilled to the stack if they hold live values the caller needs after the call returns. Callee-saved registers (rbx, rbp, r12-r15) are the callee’s responsibility to preserve. A function with several live values at a call site can easily spend more time on register spills than on the call instruction itself.
The RET instruction pops from the CPU’s Return Stack Buffer, a hardware predictor with 16 entries on Intel and 32 on AMD Zen. A well-predicted RET costs nothing beyond its latency. But indirect calls through function pointers or virtual dispatch rely on the Indirect Branch Predictor, which has fewer entries and higher miss penalties. After Spectre, retpoline sequences replace indirect branches with a trampoline pattern that can add 20 to 50 cycles per virtual call depending on the mitigation mode in effect. This is a non-trivial tax on object-oriented code in security-sensitive environments.
The Vectorization Barrier
Here is the part of function call cost that cycle-count benchmarks tend to understate. Consider a processing loop:
void process(float* out, const float* in, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = transform(in[i]);
    }
}
If transform is defined in another translation unit, or otherwise not inlined, the auto-vectorizer sees a call to an opaque function. It cannot prove that transform is safe to apply to eight elements simultaneously. It cannot reason about aliasing. The loop compiles to scalar code, processing one element per iteration.
Inline transform and the picture changes:
inline float transform(float x) { return x * 2.0f + 1.0f; }
The loop body is now visible, a multiply and an add. The vectorizer recognizes the pattern and emits something like:
vmulps ymm0, ymm1, [rsi + rax] ; 8 floats at once
vaddps ymm0, ymm0, ymm2
vmovups [rdi + rax], ymm0
With AVX2, this processes eight floats per iteration. With AVX-512, sixteen. The throughput difference is not incremental: Lemire’s benchmarks and related measurements consistently show non-inlined loops running at 1.5 to 4 nanoseconds per element, while inlined-and-vectorized equivalents reach 0.05 to 0.08 nanoseconds per element. The gap, roughly a factor of twenty to fifty, comes almost entirely from vectorization being blocked, not from raw call overhead.
The C++ standard library exploits this through templates. std::transform, std::for_each, and comparison-based algorithms accept callable arguments as template parameters rather than function pointers. The lambda or functor gets inlined at the call site, the loop body becomes visible, and the vectorizer goes to work. A std::sort with an inlinable comparator runs two to four times faster than one with an opaque function pointer comparator passing the same comparison logic.
How Compilers Decide What to Inline
LLVM’s inliner assigns costs to IR instructions in the callee and compares against a threshold. The default threshold is 225 cost units at -O2, raised at -O3. Several adjustments apply:
- A bonus of up to 10,000 units applies when inlining would allow constant folding because an argument is a known constant at the call site. This is the most powerful bonus because it enables further optimization chains.
- A 150-unit bonus applies when there is only one call site, since inlining carries no code-size cost in that case.
- Cold call sites receive a reduced threshold of 45, keeping rarely executed code from inflating hot code regions.
GCC uses analogous parameters tunable via --param. The max-inline-insns-single parameter defaults to 400 at -O2, and drops significantly at -Os. The inline keyword in C++ does not guarantee inlining; it is a linkage mechanism with historical roots, and modern compilers treat it as a weak hint at best. __attribute__((always_inline)) on GCC and Clang overrides the cost model entirely. __attribute__((flatten)) forces all calls within a function to be inlined recursively, which is useful for entry points that orchestrate several small operations that need to compose into a single visible unit for the vectorizer.
Recursion, virtual dispatch through unknown types, calls through function pointers, and functions marked __attribute__((noinline)) all prevent inlining. The std::function wrapper is a particularly expensive case: it uses type erasure to store any callable behind an internal virtual call interface, preventing devirtualization and inlining. Measurements consistently show std::function adding 20 to 50 nanoseconds per call compared to an equivalent inlined lambda. In a loop over a million elements, that is 20 to 50 milliseconds of overhead for the abstraction layer alone.
Cross-Module Inlining: LTO and PGO
By default, each translation unit compiles independently. The linker sees object files with function symbols but no intermediate representation. A function defined in utils.cpp and called from hot_loop.cpp cannot be inlined regardless of how small it is, unless LTO is enabled.
With full LTO (-flto), the compiler emits LLVM bitcode or GCC IR into object files instead of machine code. The linker runs an optimization pass over the combined IR and inlines across the entire program. The cost is longer link times and higher peak memory. Thin LTO (-flto=thin) addresses this by writing per-function summaries at compile time and loading only inlining candidates at link time. It achieves 80 to 90 percent of full LTO’s benefit at a fraction of the link-time overhead. Most performance-critical C++ projects that care about this (Chromium, Firefox, LLVM itself) use Thin LTO in their release builds.
Profile-guided optimization adds runtime frequency data to the cost model. After an instrumented run, the compiler knows which call sites are hot and can inline past the normal threshold for those sites, while leaving cold sites as calls to keep hot code compact. PGO also enables indirect call promotion: when profiling shows that a virtual call always dispatches to DerivedFoo::method(), the compiler generates a guarded direct call:
if (LIKELY(vtable == &DerivedFoo::vtable))
    DerivedFoo::method(this, args);  // inlined direct call
else
    (*vtable->method)(this, args);   // slow path
This turns a non-inlinable polymorphic dispatch into an inlinable monomorphic one for the common case.
The Cross-Language Picture
Rust exposes the cross-crate inlining problem explicitly. Without #[inline], a function in one crate cannot be inlined into another because the compiler omits the function body from crate metadata. Marking a function #[inline] causes the body to be included. #[inline(always)] maps to LLVM’s alwaysinline. Rust’s monomorphization of generic functions handles the common case automatically: a generic function specialized to a concrete type is compiled fresh for that type and inlines freely.
Go’s inliner operates on its own AST-level budget in the gc frontend, not through LLVM. Functions containing defer, goroutine spawns, or certain closure patterns have historically not been inlinable, though recent releases have relaxed many of these restrictions. Go 1.21 added PGO-driven devirtualization of interface calls, a significant addition for a language where interface dispatch is ubiquitous.
Java’s HotSpot JIT starts from a different position than AOT compilers. It observes runtime type information and inlines virtual and interface calls that are monomorphic in practice, even across package boundaries, without source-level annotations. A call site that always dispatches to the same concrete class gets an inlined direct call with a type guard, compiled after roughly 10,000 invocations. If the program later loads a new implementation that changes the profile, the JIT deoptimizes and recompiles. This dynamic inlining is why Java’s steady-state throughput on compute-intensive workloads often matches C++ code that lacks PGO, despite the overhead of managed memory and the JVM runtime.
When Inlining Hurts
Inlining a 20-instruction function at 500 call sites adds 10,000 instructions to the binary. The L1 instruction cache on a modern core is 32 to 64 kilobytes. A hot loop that fits entirely in L1 runs at full throughput. After aggressive inlining, the same loop might exceed L1 capacity and start incurring instruction cache misses on every iteration. The loop becomes slower precisely because the compiler inlined more.
The practical rule: functions under about ten instructions, called frequently in hot paths, are strong candidates for forced inlining. Functions in the 20 to 50 instruction range benefit from inlining when the loop is the known bottleneck. Functions over 100 instructions are usually better left as calls. PGO is the right tool for the ambiguous middle cases because it makes the decision based on measured execution frequency rather than heuristic thresholds.
The register pressure angle is also worth keeping in mind. Merging the register allocation domains of caller and callee can force spills if the combined function needs more registers than x86-64’s 16 general-purpose registers provide. A microbenchmark showing a speedup from inlining may not reflect a real workload where many hot functions compete for registers simultaneously.
The call instruction costs almost nothing. The optimization barrier it creates, between the compiler and the computation it cannot see, is where the real expense lies.