What a Function Call Hides from Your Compiler

The standard mental model of function call cost goes roughly like this: push arguments, jump, do work, return. You pay for the prologue and epilogue, maybe some register spills, and you get back a result. On modern hardware, that is a handful of cycles, so the conventional wisdom is that function calls are cheap and you should not worry about them unless they are pathologically frequent.

That mental model is mostly right for isolated calls. It fails badly in tight loops, and not for the reason most people assume.

The Call Instruction Is Not the Problem

When you call a function inside a loop, the direct overhead (the call and ret instructions, the stack frame setup) is real but often small relative to the work being done. On x86-64, a call to a trivial function and the matching ret cost on the order of a few cycles under ideal conditions, which is around a nanosecond at typical clock speeds. Modern out-of-order processors can overlap much of this with surrounding work. The return stack buffer (RSB) predicts return addresses effectively for shallow call depths, so even prediction overhead is modest in typical code.

The real cost is something else: the function call boundary is an optimization barrier.

When the compiler encounters a function call it cannot see through, it must assume the function could do anything. It might read or write any memory it can reach. It might have side effects. It might modify global state. Without visibility into the function body, the compiler cannot apply most of the optimizations it would otherwise use.

This is the insight at the core of Daniel Lemire’s write-up on isocpp: a small function like add(x, y) looks like it should cost almost nothing, but when it appears inside a loop, the compiler loses the ability to see what is happening inside that loop. The call is cheap; the lost visibility is not.

What Gets Blocked

Consider a loop that sums an array:

int sum = 0;
for (int i = 0; i < N; i++) {
    sum = add(sum, arr[i]);
}

If add is defined in a separate translation unit, the compiler sees a black box. It cannot use SIMD instructions, because it does not know whether add just adds integers or does something else entirely. It cannot reorder iterations, since the call might have side effects whose order matters. It cannot unroll the loop in any way that simplifies the work across the function boundary. And if arr or N is reachable from outside the enclosing function (a global, say), it cannot even assume they are unchanged across the call, because add might modify them through a path the compiler cannot see.

Now inline the function, so the compiler sees sum = sum + arr[i] directly. The loop becomes a textbook vectorization candidate. On a system with AVX2, the compiler can process eight 32-bit integers at once using 256-bit registers, turning what was a sequential loop into something that can run several times faster; the exact factor depends on the array size and on memory bandwidth. Not because the function call overhead was that expensive, but because making the function opaque disabled every interesting optimization.

The assembly difference is dramatic. The non-inlined version produces a scalar loop with a call instruction in the body. The inlined version, compiled with -O3 (or -O2 on compilers that vectorize at that level) and with AVX2 enabled through a flag such as -mavx2 or -march=native, will often produce code using vmovdqu, vpaddd, and similar SIMD instructions. You can verify this directly on Compiler Explorer by toggling between a version where add is defined in the same translation unit and one where it is only declared, with the definition living elsewhere.
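A rough way to reproduce that comparison in a single file is sketched below. The names add_opaque, add_visible, sum_opaque, and sum_visible are hypothetical, and [[gnu::noinline]] (a GCC/Clang extension) is used here only to approximate a function defined in another translation unit. It is not a perfect stand-in, because the compiler can still inspect the body, but it does keep a real call inside the loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an add() whose body lives elsewhere.
// noinline keeps a genuine call instruction in the loop body; the
// compiler may still learn facts about the body, so this only
// approximates a truly opaque function.
[[gnu::noinline]] int add_opaque(int x, int y) {
    return x + y;
}

// Same body, but the inliner is free to expand it at the call site.
inline int add_visible(int x, int y) {
    return x + y;
}

// Expected to compile to a scalar loop containing a call.
int sum_opaque(const std::vector<int>& arr) {
    int sum = 0;
    for (std::size_t i = 0; i < arr.size(); ++i) {
        sum = add_opaque(sum, arr[i]);
    }
    return sum;
}

// After inlining this is a plain reduction, which optimizers can
// usually vectorize when SIMD code generation is enabled.
int sum_visible(const std::vector<int>& arr) {
    int sum = 0;
    for (std::size_t i = 0; i < arr.size(); ++i) {
        sum = add_visible(sum, arr[i]);
    }
    return sum;
}
```

Both functions return the same result for the same input; the difference shows up only in the generated assembly and in timing on large arrays.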

Constant Propagation and Dead Code Elimination

Vectorization is the most dramatic example, but inlining enables several other optimizations that compound in practice.

Constant propagation is among the most valuable. When the compiler can see the body of a function and knows that one of its arguments is a compile-time constant, it can simplify or eliminate most of the function body. A function that handles both a zero case and a nonzero case becomes much simpler when the compiler knows it will always be called with a nonzero argument. Without inlining, the compiler has no way to know.

Alias analysis improves too. The compiler must be conservative when it cannot see what a function does with its pointer arguments. Two pointers that appear unrelated might be aliased through some path the compiler cannot trace. With inlining, the compiler can follow the full provenance of each pointer and often prove they do not alias, which lets it keep values in registers and eliminate redundant loads and stores.
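A minimal sketch of the aliasing point, using hypothetical names (accumulate, sum_local): compiled on its own, accumulate has to allow for total pointing into values, since both refer to int objects. Once it is inlined into a caller that passes the address of a fresh local, the optimizer can typically see that no such overlap is possible.

```cpp
// Compiled in isolation, the optimizer must allow for `total` aliasing
// some element of `values`, so it either writes *total back to memory
// on every iteration or guards a faster path with a runtime overlap check.
inline void accumulate(int* total, const int* values, int n) {
    for (int i = 0; i < n; ++i) {
        *total += values[i];
    }
}

int sum_local(const int* values, int n) {
    // `total` is a fresh local, so `values` (which existed before it)
    // cannot point at it. After inlining, the running sum can stay in
    // a register for the whole loop.
    int total = 0;
    accumulate(&total, values, n);
    return total;
}
```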

Dead code elimination follows naturally from both. When constant propagation reveals that a branch condition is always true or always false in a given calling context, the dead branch disappears entirely. Without inlining, both branches stay in the compiled code.
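As an illustration of constant propagation feeding dead code elimination, here is a small sketch. The helper safe_div and its fallback value of 0 are assumptions made up for this example, not something taken from a particular library.

```cpp
// Hypothetical helper with a zero case and a nonzero case.
inline int safe_div(int numerator, int denominator) {
    if (denominator == 0) {
        return 0;  // zero case: arbitrary fallback chosen for this sketch
    }
    return numerator / denominator;  // nonzero case
}

int eighth(int value) {
    // Once safe_div is inlined, the compiler knows denominator == 8.
    // The zero check is provably false and can be deleted, and the
    // division by a constant can be strength-reduced (for example to
    // shifts), so neither a branch nor a divide needs to remain.
    return safe_div(value, 8);
}
```

If safe_div were only declared here and defined elsewhere, eighth would have to pass 8 as an ordinary argument, and both the check and the general division would run on every call.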

The inline Keyword and What It Actually Does

There is a persistent misconception that adding inline to a function declaration in C++ forces the compiler to inline it. Early compilers leaned on the hint more heavily, but it was never a guarantee, and modern compilers treat inline primarily as a linkage directive, not an optimization command.

What inline does today: it allows a function to be defined in multiple translation units, which is necessary for header-defined functions, and it tells the linker to expect multiple identical definitions and merge them. The compiler can still choose not to inline the function body at a call site, and frequently does so when the function is large or when the call site is not in a hot path.

Compilers have cost models for this. GCC, Clang, and MSVC all weigh the code size increase against the expected benefit. A function with complex control flow may be too expensive to inline everywhere. GCC’s -finline-limit flag gives coarse control over the maximum size the inliner will consider, measured in pseudo-instructions. Older GCC documentation listed the default as 600; current releases derive the effective limits from --param settings such as max-inline-insns-single.

To actually force inlining, you need compiler-specific attributes. GCC and Clang support __attribute__((always_inline)), and C++11 attribute syntax gives [[gnu::always_inline]] for code targeting those compilers. MSVC provides __forceinline. None of these are standard C++, but they are widely used in performance-critical library code:

[[gnu::always_inline]] inline int add(int x, int y) {
    return x + y;
}

Use these sparingly. Forced inlining everywhere increases code size, which increases instruction cache pressure, which can hurt performance in code that calls many different functions in patterns the hardware prefetcher cannot predict.
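Because the attributes differ per compiler, projects often hide them behind a macro. The sketch below shows one common shape for that; the macro name FORCE_INLINE and the function sum_array are hypothetical choices for this example.

```cpp
// Pick the strongest inlining request the current compiler understands.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  // fallback: an ordinary inline hint
#endif

FORCE_INLINE int add(int x, int y) {
    return x + y;
}

// Hypothetical hot loop that relies on add being expanded in place.
int sum_array(const int* arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum = add(sum, arr[i]);
    }
    return sum;
}
```

Keeping the request in one macro also makes it easy to switch off forced inlining globally when measuring its effect on code size.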

Link-Time Optimization Changes the Equation

For years, one of the main constraints on inlining was the translation unit boundary. If add was defined in math.cpp and called from main.cpp, the compiler working on main.cpp had no visibility into add’s body. The function had to be treated as opaque.

Link-Time Optimization (LTO) removes this constraint. With LTO enabled (GCC’s -flto, Clang’s -flto=thin or -flto=full, MSVC’s /GL together with /LTCG at link time), the compiler emits an intermediate representation (LLVM IR, or GCC’s GIMPLE) instead of, or alongside, final object code. At link time the optimizer runs again with visibility into all participating translation units. Functions can be inlined across what were previously hard boundaries.
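To make the setup concrete, the sketch below shows two hypothetical source files in one listing, with a GCC build recipe in the trailing comments. The file names, the sum_array function, and the main.o object are assumptions for illustration; only the -flto flag comes from the text above.

```cpp
// ---- math.cpp (hypothetical file) ----
int add(int x, int y) {
    return x + y;
}

// ---- sum.cpp (hypothetical file) ----
// Only this declaration is visible when sum.cpp is compiled alone,
// so without LTO the loop below must call add as an opaque function.
int add(int x, int y);

int sum_array(const int* arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum = add(sum, arr[i]);
    }
    return sum;
}

// Build sketch with GCC (main.o stands for the rest of the program):
//   g++ -O2 -flto -c math.cpp sum.cpp
//   g++ -O2 -flto math.o sum.o main.o -o app
// The flag has to appear at both the compile and the link step so the
// intermediate representation is emitted and then consumed.
```

With the flag present, add becomes a candidate for inlining into sum_array at link time; without it, the call stays in the loop.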

The trade-off is compile and link time. A full LTO build of a large project can be substantially slower, because the linking stage now performs the work that was previously spread across many parallel compilation units. Clang’s ThinLTO is a more practical option for large codebases: it uses a summary-based approach that captures enough information for cross-module inlining without requiring the entire program to be loaded into memory at link time.

For library code that ships as compiled objects or shared libraries, LTO is not an option for callers. This is one reason that performance-critical libraries and tight inner-loop utilities are often written as header-only or with inline definitions exposed to callers. The entire point is to give the compiler visibility. Header-only template libraries such as Eigen take this approach precisely because the operations they expose must be composable with the caller’s loop structure to benefit from vectorization.

Knowing When to Care

Most code has no need to think about any of this. Application logic, I/O handling, string processing in response to user input: these are rarely bottlenecked by call overhead, and optimizing function calls in those contexts is usually a waste of effort.

The cases where it matters are predictable: inner loops processing large arrays or buffers, numerical kernels, parsers that scan byte-by-byte, compression and hashing routines. These are the places where tight loops run millions of iterations and where SIMD can provide large absolute speedups.

The right workflow is to profile first, identify the hot loops, then look at the generated assembly to confirm whether vectorization happened. Compiler Explorer makes this practical during development. If a loop is hot according to the profiler and the assembly shows a scalar loop with a call instruction inside it, inlining that called function is worth testing.
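For the assembly-checking step, compilers can also report vectorization decisions directly. The sketch below pairs a small byte-scanning loop (the names is_newline and count_newlines are hypothetical) with the GCC and Clang diagnostic flags in comments; the source file name scan.cpp is likewise an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Small predicate defined where the loop can see it, so it can be inlined.
inline bool is_newline(std::uint8_t byte) {
    return byte == '\n';
}

// Hypothetical hot loop: counts newline bytes in a buffer.
std::size_t count_newlines(const std::uint8_t* data, std::size_t len) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; ++i) {
        count += is_newline(data[i]) ? 1 : 0;
    }
    return count;
}

// Asking the compiler what it did with the loop:
//   g++     -O3 -march=native -fopt-info-vec      -c scan.cpp
//   g++     -O3 -march=native -fopt-info-vec-missed -c scan.cpp
//   clang++ -O3 -march=native -Rpass=loop-vectorize -c scan.cpp
//   clang++ -O3 -march=native -Rpass-missed=loop-vectorize -c scan.cpp
// The first form of each pair lists loops that were vectorized; the
// second lists loops that were not, often with a reason.
```

If is_newline were moved to another translation unit without LTO, the missed-vectorization report would be expected to flag this loop, which is the signal to try inlining.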

The underlying principle is worth internalizing independent of any specific optimization target. A function call does not just move a few bytes on the stack; it draws a line around a piece of code and tells the compiler to treat everything inside as a mystery. That boundary is cheap to cross at runtime, but it is expensive in terms of what the compiler is allowed to assume on your behalf.