What a function call actually costs
The cost a function call adds to a program is not one cost. There is the mechanical overhead of the call itself, and then there is the secondary cost: everything the compiler cannot do because the call is there.
The mechanical part is well understood. On x86-64 with the System V ABI, calling a function involves shuffling arguments into registers (RDI, RSI, RDX, RCX, R8, R9 for the first six integer arguments), executing a call instruction that pushes the return address onto the stack, and jumping to the callee. The callee sets up its own stack frame, saves any callee-saved registers it uses (RBX, RBP, R12 through R15), does its work, and executes ret. None of this is expensive in isolation. A modern CPU handles a call/ret pair in a handful of cycles, and the return-stack buffer predicts return addresses accurately as long as calls and returns stay properly nested.
Daniel Lemire’s article on isocpp.org illustrates the starting point with a minimal example:
int add(int x, int y) {
    return x + y;
}

int add3(int x, int y, int z) {
    return add(add(x, y), z);
}
versus the inlined equivalent:
int add3(int x, int y, int z) {
    return x + y + z;
}
With -O2 or higher, any competent compiler inlines add into add3 automatically because the function is trivially small. The interesting case is when inlining cannot happen.
The visibility problem
When the compiler encounters a call to a function whose definition it cannot see, it must make conservative assumptions. It assumes the callee might read or modify any globally reachable memory, and it cannot reason about whether the function has side effects. This is the correct assumption from a correctness standpoint, but it is devastating for optimization.
Consider this loop:
void sum_arrays(int *result, const int *a, const int *b, int n) {
    for (int i = 0; i < n; i++) {
        result[i] = add(a[i], b[i]);
    }
}
If the compiler can see the definition of add and confirm it is a pure function that reads two integers and returns their sum, it can then analyze whether result, a, and b overlap in memory. If they do not (or if the programmer annotates with __restrict__), the compiler can vectorize the loop.
With AVX2, a vectorized version processes eight 32-bit integers per iteration using vpaddd ymm0, ymm0, [mem]. That is an eightfold increase in throughput on the core computation. On a modern core with 256-bit vector units, this is the difference between saturating the ALU and leaving most of it idle.
If add lives in a separate translation unit without link-time optimization, the compiler sees only the call site. It cannot vectorize because it cannot prove add is pure. It cannot determine that successive iterations are independent. The generated code is a scalar loop with one function call per iteration.
What the assembly actually looks like
To see this concretely, compile the loop above with GCC at -O2 while preventing inlining of add, either by putting it in a separate file or by marking it __attribute__((noinline)). The inner loop looks roughly like:
.L3:
        mov     edi, DWORD PTR [rbx+rax*4]
        mov     esi, DWORD PTR [r12+rax*4]
        call    add
        mov     DWORD PTR [r13+rax*4], eax
        add     rax, 1
        cmp     rax, rcx
        jne     .L3
One call per iteration. Arguments loaded into edi and esi, result read from eax, memory written to result[i]. Now allow the compiler to inline add and see the full loop body:
.L3:
        vmovdqu ymm0, YMMWORD PTR [rbx+rax]
        vpaddd  ymm0, ymm0, YMMWORD PTR [r12+rax]
        vmovdqu YMMWORD PTR [r13+rax], ymm0
        add     rax, 32
        cmp     rax, rdx
        jne     .L3
Eight additions per iteration, no calls, no per-element loads. The loop advances by 32 bytes (eight 4-byte integers) per iteration rather than four bytes. You can verify this kind of transformation yourself with Compiler Explorer, which makes it easy to flip between inlined and non-inlined versions and watch the assembly change in real time.
The function call overhead in the first version is a detail. The call prevented the compiler from ever arriving at the second version.
The -O3 and LTO answers
GCC and Clang use different inlining heuristics at -O2 versus -O3. At -O2, the compiler inlines functions below a certain size threshold, measured in an internal model of instruction cost. At -O3, it enables -finline-functions, allowing inlining of functions that do not carry the inline keyword, at higher size thresholds, using a cost/benefit model that accounts for estimated code growth and call-site hotness.
For cross-translation-unit inlining, the answer is link-time optimization. Passing -flto to GCC or Clang causes the compiler to emit intermediate representation into the object files instead of final machine code. The linker then performs a global optimization pass across all the IR, with the same visibility it would have over a single translation unit.
Clang’s ThinLTO takes a practical approach: it builds per-module summaries during compilation, then imports only the function bodies worth inlining at link time. This allows distributed build systems to parallelize the compilation step while still enabling meaningful cross-module inlining. Full LTO is more thorough but significantly slower at link time since it processes everything globally.
The tradeoffs are build time and binary size. LTO can increase link time substantially and produce larger binaries as function bodies get duplicated at call sites. For a release build of performance-sensitive code, the cost is usually worth paying. For incremental development builds, it typically is not.
Why header-only libraries exist
This analysis explains something about C++ library design that might otherwise seem like mere convenience: the header-only library. Eigen, range-v3, and many other performance-oriented libraries put their implementations in headers. For template code, the language requires this (template definitions must be visible at the point of instantiation). For non-template hot-path code, the same logic applies: if a library wants its functions to be inlining candidates at call sites, those functions must be visible at the call site.
Putting implementations in headers is an architectural consequence of how inlining works. A performance-critical library that ships only compiled .a or .so files cannot be inlined into callers unless those callers use LTO and the library was compiled with LTO support enabled. Header-only design eliminates that dependency entirely.
This matters for library authors more than it might seem. A math library that ships a static archive compiled without LTO will, for most callers, produce scalar loops even for trivial operations. The same library shipped as headers lets the caller’s compiler see the full implementation and vectorize freely.
The inline keyword in modern C++ does not mean “please inline this function.” It is a linkage specifier that allows a function definition to appear in multiple translation units without violating the One Definition Rule. The name is a historical artifact from when the keyword served primarily as an inlining hint. Today, the standard defines it in terms of linkage, and inlining decisions belong entirely to the compiler’s optimizer. Using inline in a header makes the definition available at each call site, which is the prerequisite for inlining, but the decision to inline is the compiler’s.
Explicit inlining control
GCC and Clang both offer attributes for overriding the compiler’s inlining decisions:
__attribute__((always_inline)) forces inlining even at -O0, regardless of function size or heuristics. Useful for genuinely tiny functions in hot paths where predictable overhead matters.
__attribute__((noinline)) prevents inlining. Useful for functions that should appear clearly in profiler output, or for creating deliberate optimization boundaries in benchmarks where the compiler might otherwise eliminate the code being measured.
In C++ attribute syntax: [[gnu::always_inline]] and [[clang::noinline]]. MSVC uses __forceinline and __declspec(noinline) for the same purposes.
always_inline is not unconditionally faster. Inlining increases code size, and larger code increases instruction cache pressure. A function called from many sites, if inlined everywhere, can cause enough I-cache thrashing to outweigh the savings from removing the call overhead. Compilers factor this into their heuristics. GCC exposes inline cost parameters via --param flags such as max-inline-insns-single if you need finer control over where the threshold sits.
The practical picture
The performance gap between a loop that calls a function and a loop that has been inlined and vectorized is often in the range of four to eight times on typical array processing workloads with modern CPUs. This comes from what inlining makes possible: vectorization, loop unrolling, constant propagation across iteration boundaries, and alias analysis that covers the full loop body rather than stopping at a call site.
In tight numerical loops, a function call boundary is an optimization wall. The compiler stops knowing what the code means, defaults to correctness over speed, and generates conservative scalar code. Removing that wall, whether through inlining, LTO, or header-only library design, hands the optimizer the visibility it needs to generate code that actually uses the hardware.