Inlining, Vectorization, and the Real Cost of Function Calls in Tight Loops
Source: isocpp
The premise in Daniel Lemire’s recent post on isocpp.org is simple: function calls are not free. For most code that is an academic concern, but in tight loops the overhead compounds in ways that go far beyond the cost of a single call instruction. What is worth understanding is the full mechanism behind the cost, and more importantly, what inlining unlocks once the compiler can see both sides of a function boundary.
What Happens at the CPU Level
When a processor executes a call instruction on x86-64, several things happen in sequence. The return address gets pushed onto the stack. The instruction pointer jumps to the function body. On return, the address gets popped and the jump reverses. In raw instruction terms, that is one call and one ret, each costing around 1-3 cycles on a modern out-of-order CPU.
For functions doing any meaningful work, that sounds negligible, and it is. The problem lies in what surrounds the call.
The System V AMD64 ABI used on Linux, macOS, and most Unix-like systems divides registers into caller-saved and callee-saved categories. Caller-saved registers (rax, rcx, rdx, rsi, rdi, r8-r11) can be overwritten by the callee, so the caller must save any live values before the call and reload them after. Callee-saved registers (rbx, rbp, r12-r15) must be preserved by the callee, which means the callee has to push and pop them if it wants to use them. Calling even a trivial function forces the compiler to emit spills and reloads around the call site. The stack also needs 16-byte alignment before the call.
For a function that adds two integers, the register bookkeeping can cost more than the arithmetic itself:
; add3 calling add(x, y) and then add(result, z), without inlining.
; z must survive the first call, so it gets stashed in rbx
; (a callee-saved register), which must itself be saved and restored.
push rbx
mov ebx, edx ; save z across the first call
call add ; add(x, y); args already in edi/esi, result in eax
mov edi, eax ; result is first arg to second call
mov esi, ebx ; z is second arg
call add ; add(result, z)
pop rbx
ret
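For reference, here is a plausible C++ shape for this pattern (the exact code in Lemire's post may differ slightly):

```cpp
// Hypothetical source matching the assembly above: two calls to a
// tiny function, with z forced to survive the first call.
int add(int x, int y) { return x + y; }

int add3(int x, int y, int z) {
    return add(add(x, y), z);
}
```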
When the compiler inlines add, this collapses to a pair of add instructions with no stack manipulation at all. Lemire’s example in the article is exactly this pattern, and the assembly difference is stark.
The Instruction Cache Problem
There is another cost that synthetic benchmarks frequently understate: instruction cache pressure. Modern CPUs have L1 instruction caches of 32-64 KB. When a called function resides in a different cache line from the call site, fetching that cache line may evict other hot code. In a tight loop that calls a small function on every iteration, you are either paying intermittent cache miss penalties or consuming two distinct regions of instruction memory instead of one contiguous region.
This is part of why compiler inlining heuristics track function size. A compiler weighing an inlining decision has to balance code size growth against call overhead elimination. Inlining too aggressively inflates the instruction footprint of hot code paths, which can increase cache misses and undo the benefit. GCC exposes this budget via -finline-limit (default 600 pseudo-instructions) and --param max-inline-insns-auto. Clang has equivalent controls under -mllvm -inline-threshold.
The Real Payoff: Vectorization
The most significant benefit of inlining is not the elimination of call overhead. It is what becomes possible once the optimizer can see the caller and callee together in a single context.
Consider a loop that applies a small function to each element of an array:
int clamp(int x, int lo, int hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

void clamp_array(int* arr, int n, int lo, int hi) {
    for (int i = 0; i < n; i++) {
        arr[i] = clamp(arr[i], lo, hi);
    }
}
Without inlining, the compiler sees a loop body containing an opaque function call. It cannot vectorize the loop because it cannot determine whether clamp has side effects, modifies global state, or creates aliasing problems. The auto-vectorizer gives up and emits a scalar loop.
With inlining, the body of clamp is visible inside clamp_array. The compiler can now confirm there are no side effects or aliasing issues, recognize the pattern as a pair of integer min/max operations, and emit SIMD instructions that process 4, 8, or 16 elements per instruction instead of one. On AVX2-capable hardware, vpminsd and vpmaxsd handle eight 32-bit integers at a time; on SSE4.1, pminsd and pmaxsd do four.
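The scalar form the optimizer recognizes can be written out explicitly; this sketch is equivalent to clamp and maps one-to-one onto the packed min/max instructions:

```cpp
#include <algorithm>

// Equivalent formulation of clamp as a min/max pair. Each operation
// corresponds directly to a packed SIMD instruction (vpmaxsd then
// vpminsd on AVX2), which is what the vectorizer emits once clamp
// is inlined into the loop.
int clamp_minmax(int x, int lo, int hi) {
    return std::min(std::max(x, lo), hi);
}
```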
The throughput difference can be a factor of 8 to 16 in elements processed per instruction, which dwarfs any individual function call cost. Lemire’s post illustrates this with the add example, but the vectorization story is where the numbers become genuinely significant in real workloads.
You can verify this directly with Compiler Explorer: write clamp_array with and without __attribute__((noinline)) on clamp, compile with -O2 -march=native, and compare the assembly. The inlined version produces a vectorized loop body with packed integer instructions; the non-inlined version produces a scalar loop with repeated call/ret pairs. The transformation is not subtle.
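A minimal setup for that experiment might look like this; the noinline attribute is the only difference between the two versions you would compare:

```cpp
// Sketch of the Compiler Explorer experiment: the noinline attribute
// turns clamp back into an opaque call and defeats vectorization.
__attribute__((noinline))
int clamp_opaque(int x, int lo, int hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

void clamp_array_scalar(int* arr, int n, int lo, int hi) {
    // Compiled with -O2 -march=native, this loop stays scalar:
    // each iteration pays a call/ret round trip into clamp_opaque.
    for (int i = 0; i < n; i++) {
        arr[i] = clamp_opaque(arr[i], lo, hi);
    }
}
```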
The same principle applies to any loop calling a small function with predictable memory access patterns: string processing, pixel operations, physics integration steps, audio sample manipulation. The function boundary is the barrier that prevents the optimizer from reasoning across iterations.
When the Compiler Cannot Inline
Inlining has hard limits beyond the budget heuristics. Recursive functions cannot be fully inlined without loop unrolling or tail-call transformation. Virtual function calls cannot be inlined unless the compiler can first devirtualize them, which requires proving the dynamic type: through a final specifier, local type analysis, whole-program analysis, or profiling data. Functions in separate translation units are not visible to each other at compile time, which means any function call crossing a .cpp file boundary is opaque unless you use link-time optimization.
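A small sketch of the virtual-call case (the types here are hypothetical): a call through a base reference is an indirect call the optimizer cannot inline in general, while a final class can be devirtualized by purely local analysis:

```cpp
// Hypothetical example: apply() through a Transform reference is an
// indirect call; without knowing the dynamic type, the compiler
// cannot inline it.
struct Transform {
    virtual int apply(int x) const = 0;
    virtual ~Transform() = default;
};

// 'final' lets the compiler prove the dynamic type at call sites that
// see a Doubler directly, enabling devirtualization and inlining.
struct Doubler final : Transform {
    int apply(int x) const override { return x * 2; }
};

int run(const Transform& t, int x) {
    return t.apply(x);  // opaque indirect call in the general case
}
```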
LTO addresses the cross-translation-unit case. With -flto on GCC or Clang, all translation units are fed through a combined optimization pass at link time, allowing inlining across .cpp boundaries as if the entire program were compiled as one unit. Clang’s -flto=thin offers a faster approximation via per-module summaries rather than a full program IR, which makes it practical for large codebases where full LTO would impose prohibitive build times. In programs where hot functions are separated across module boundaries, LTO can recover 5-15% runtime performance. The actual gain depends heavily on how much cross-module call overhead is present in hot paths.
Controlling Inlining Explicitly
There are cases where you want to prevent inlining despite the compiler’s inclination. Keeping certain functions out-of-line preserves readable stack traces and enables clean per-function profiling:
__attribute__((noinline)) void record_slow_path(Event e);
The [[clang::noinline]] and [[gnu::noinline]] C++ attributes serve the same purpose without reaching for GCC-specific syntax. Conversely, if a small function is on the critical path and the compiler is declining to inline it due to budget constraints:
__attribute__((always_inline)) inline int fast_path(int x);
always_inline bypasses the inlining budget entirely. The compiler’s inlining budget exists because excessive inlining bloats code size and increases instruction cache pressure; forcing it off for a critical path function makes sense, but doing it indiscriminately trades one performance problem for another.
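One common way to keep these annotations portable is a small macro wrapper; the macro names here are an assumption for illustration, not standard identifiers:

```cpp
// Portability sketch: hide compiler-specific inlining controls behind
// macros so call sites stay uniform. NOINLINE and ALWAYS_INLINE are
// hypothetical names chosen for this example.
#if defined(__GNUC__) || defined(__clang__)
#  define ALWAYS_INLINE __attribute__((always_inline)) inline
#  define NOINLINE      __attribute__((noinline))
#elif defined(_MSC_VER)
#  define ALWAYS_INLINE __forceinline
#  define NOINLINE      __declspec(noinline)
#else
#  define ALWAYS_INLINE inline
#  define NOINLINE
#endif

ALWAYS_INLINE int fast_path(int x) { return x + 1; }
NOINLINE int slow_path(int x) { return x - 1; }
```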
C++20’s [[likely]] and [[unlikely]] attributes interact with inlining indirectly. Once a function containing branches is inlined, the compiler can use call-site context to estimate branch probabilities, which it cannot do when the callee is opaque. This means inlining and branch prediction hints compose in ways that each individually cannot achieve.
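As a sketch of how the two compose, a hint placed inside a small function is only a static guess on its own; once the function is inlined, the compiler can fold the hint together with call-site knowledge:

```cpp
// C++20 branch hint inside a small function. Standalone, [[likely]]
// is a blanket guess; once saturating_inc is inlined, the compiler
// can combine the hint with whatever it knows about 'limit' at each
// call site.
int saturating_inc(int x, int limit) {
    if (x < limit) [[likely]] {
        return x + 1;
    }
    return limit;
}
```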
The GCC __attribute__((flatten)) Case
GCC offers a less well-known attribute worth mentioning here: __attribute__((flatten)). Applied to a function, it instructs the compiler to inline all calls within that function, regardless of inlining budget. This is a blunt instrument, but it is useful for hot kernels where you want to guarantee that the compiler sees all operations in one context without having to annotate every individual callee:
__attribute__((flatten))
void process_block(float* data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = normalize(quantize(filter(data[i])));
    }
}
With flatten, normalize, quantize, and filter are all inlined into the loop body, giving the vectorizer a complete view of the computation. Without it, each call is an optimization barrier.
When It Matters
For most application code, function call overhead is irrelevant. A function that queries a database, formats a string, or handles a network packet will not be measurably affected by the cost of a call instruction. The analysis applies specifically when a function is called millions or billions of times in a loop, when the function body is small relative to the call setup cost, or when the function boundary is preventing vectorization of the surrounding loop.
In practice this comes up in parsers, serializers, signal processing pipelines, image processing kernels, and physics engines: domains where throughput is the primary metric and the hot path runs for millions of iterations. The standard advice is to profile first and not assume, but understanding the mechanism helps you interpret what the profiler is showing. When a small function in a hot loop shows up disproportionately in samples, the question to ask is not just how expensive the function is but whether its presence as an opaque call is preventing the surrounding loop from vectorizing.
Lemire’s article is a useful reminder that the compiler is doing substantial work to make abstraction cheap. When it can see both sides of a function boundary, abstraction is often free in the generated code. When it cannot, because of separate compilation, virtual dispatch, or budget limits, the cost compounds inside loops in ways that go well beyond the call instruction itself.