
The Real Reason Your Compiler Inlines Functions

Source: isocpp

When people talk about function call overhead, they usually mean the mechanical cost: pushing arguments onto the stack or loading them into registers, saving and restoring the frame pointer, the call instruction itself, and the eventual ret. On modern x86-64 hardware, a non-inlined call to a trivial function costs somewhere in the neighborhood of 1 to 5 nanoseconds depending on cache state and branch prediction warmth. Daniel Lemire’s recent post on function call costs quantifies this and serves as a useful starting point, but the mechanical overhead is almost beside the point.

The more significant cost is what the compiler cannot see when a call boundary is opaque.

The x86-64 Calling Convention

Before getting to compiler behavior, the mechanics are worth understanding. On Linux and macOS, the System V AMD64 ABI governs how functions are called. The first six integer arguments go into registers RDI, RSI, RDX, RCX, R8, and R9, in that order; the return value comes back in RAX. Windows uses its own x64 ABI, which passes the first four arguments in RCX, RDX, R8, and R9.

What this means in practice: a call like add(x, y) requires the compiler to load x into RDI and y into RSI, emit a call instruction (which pushes the return address onto the stack and jumps), execute the function body, and then read the result from RAX. If the callee uses any of the callee-saved registers (RBX, RBP, and R12 through R15), it must also save and restore them. All of this is overhead the inlined version avoids entirely.

For Lemire’s add3 example, the non-inlined version calls add twice. Each call carries this overhead. The inlined version compiles to something like:

; int add3(int x, int y, int z) - inlined
lea eax, [rdi + rsi]
add eax, edx
ret

Three inputs, two arithmetic instructions, and a ret. The non-inlined version needs two full call/ret pairs around essentially the same arithmetic.

When Compilers Decide to Inline

GCC and Clang use different heuristics but share the same general approach: they estimate a function’s size in terms of weighted instructions and inline it if the caller-side benefit is worth the code size cost.

GCC exposes this through --param max-inline-insns-single (default 400 weighted instructions) and --param inline-unit-growth (default 20%, limiting how much a caller can grow). Clang/LLVM uses an inline threshold that defaults to 225 at -O2 and scales up at -O3. Both compilers apply additional logic based on call frequency, loop nesting depth, and whether the function sits on a hot path.

In GCC and Clang, __attribute__((always_inline)) overrides these heuristics and forces inlining regardless of size (MSVC's equivalent is __forceinline). The inverse, __attribute__((noinline)), prevents inlining, which is useful when benchmarking or when you want to preserve readable stack traces in production builds.

// Forces inlining at every call site
__attribute__((always_inline)) inline int add(int x, int y) {
    return x + y;
}

// Prevents inlining - useful for isolated benchmarks
__attribute__((noinline)) int add_noinline(int x, int y) {
    return x + y;
}

A key subtlety: a function defined in a header file is visible at every call site, making it a candidate for inlining. A function defined in a separate .cpp file is not visible during compilation of the caller, so inlining is impossible unless you enable Link-Time Optimization.

LTO and Cross-Translation-Unit Inlining

In a typical C++ build without LTO, each source file is compiled independently. The compiler sees only what is defined or included in the current translation unit. If add lives in math.cpp and add3 lives in main.cpp, the compiler processing main.cpp sees only the declaration of add, not its body. Inlining cannot happen.

LTO solves this by emitting an intermediate representation into object files (LLVM IR for Clang, GIMPLE for GCC), then performing optimization across all of them at link time. With LTO enabled, the linker can inline add into add3 across translation unit boundaries.

GCC enables LTO with -flto. Clang supports both full LTO (-flto) and ThinLTO (-flto=thin). ThinLTO performs a scalable subset of cross-module optimization including inlining, and is usually the practical choice for large codebases because full LTO’s link step can become a memory and time bottleneck.

# Full LTO with GCC
g++ -O2 -flto -o program main.cpp math.cpp

# ThinLTO with Clang
clang++ -O2 -flto=thin -o program main.cpp math.cpp

ThinLTO imports function summaries from all modules and performs inlining based on those summaries without requiring the entire program to be loaded into memory simultaneously. For projects with dozens or hundreds of translation units, this matters.

The Connection to Auto-Vectorization

The connection between inlining and auto-vectorization is where the real performance gap appears. The mechanical overhead of a function call in a tight loop might cost a few cycles per iteration. The vectorization you lose because the compiler cannot see through the call boundary can cost you 4x, 8x, or more, depending on the available SIMD width.

Consider a loop summing two arrays element-wise:

// Defined in a separate translation unit (no LTO):
int add(int x, int y) {
    return x + y;
}

// In the caller:
void sum_arrays(int* a, int* b, int* c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = add(a[i], b[i]);  // compiler cannot vectorize this
    }
}

Without inlining, the compiler treats add as an opaque call with unknown side effects. It must call it once per iteration, in scalar mode, one integer at a time. With the function body visible, it sees a simple addition and can auto-vectorize using SSE2 (4 ints per iteration), AVX2 (8 ints per iteration), or AVX-512 (16 ints per iteration) depending on the target.

On a machine with AVX2, the vectorized loop looks roughly like:

.loop:
vmovdqu  ymm0, [rdi + rax]       ; load 8 ints from a
vpaddd   ymm0, ymm0, [rsi + rax] ; add 8 ints from b
vmovdqu  [rdx + rax], ymm0       ; store to c
add      rax, 32
cmp      rax, rcx
jl       .loop

Eight integer additions per iteration rather than one. A 4 ns call overhead on a 10 ns-per-element scalar loop eats roughly 40% of each iteration's time; missing AVX2 vectorization on the same loop forfeits about 87% of the achievable throughput (an 8x speedup left on the table). The relative weight of these two costs is the opposite of what most people expect when they first think about function call overhead.

You can verify this kind of codegen difference directly with Compiler Explorer, which shows assembly output for any compiler version and flag combination without setting up a local benchmark. It is particularly useful for confirming whether the compiler is actually vectorizing a loop and for checking whether inlining is happening where you expect it to.

Practical Implications

A few habits follow from all of this.

Put hot, small functions in headers. If a function is called in a tight loop and its body is small, defining it in a header gives every caller the chance to inline it. The code size cost is usually negligible for small functions, and the optimization surface is dramatically larger.

Enable LTO on release builds. -flto=thin is straightforward to add and the link-step overhead is manageable for most projects. It turns your entire codebase into a single optimization domain, enabling inlining and other interprocedural optimizations that are otherwise impossible across translation unit boundaries.

Use noinline when benchmarking individual functions. Without it, a microbenchmark may measure an already-inlined version and misreport the cost of calling a function that, in the scenario you actually care about, would be called from a separate translation unit without LTO.

Check the assembly when performance matters. If a hot loop contains a call instruction, that is a signal to investigate. The cause might be a translation unit boundary, a function that exceeded the compiler’s inline threshold, or a missing optimization flag. Compiler Explorer makes this check fast.

The Lemire article quantifies the raw overhead numbers clearly. The broader point is that a function call boundary restricts what the compiler can see, and therefore what it can do. Inlining eliminates the overhead of the call itself, but that is secondary to the optimization opportunities it unlocks: vectorization, constant folding, and dead code elimination across what was previously an opaque boundary. In tight loops, the ability to vectorize is often worth more than any individual cycle-counting exercise.
