Function Calls Cost the Compiler More Than They Cost the CPU

Daniel Lemire’s recent piece on isocpp.org opens with a two-line example that most C++ programmers would dismiss as trivial:

int add(int x, int y) { return x + y; }
int add3(int x, int y, int z) { return add(add(x, y), z); }

The question is whether the second definition is meaningfully more expensive than writing x + y + z directly. The answer from profiling is usually: not much, especially with optimization enabled, because the compiler then inlines both calls. But framing the question as “how many cycles does a call cost” misses the more important half of the story. The cycles spent on a function call are the least interesting part of the cost.

What x86-64 actually does for a function call

On the System V AMD64 ABI, a function call involves pushing the 8-byte return address onto the stack, saving any caller-saved registers (RAX, RCX, RDX, RSI, RDI, R8–R11) that are live at the call site, routing the first six integer arguments through RDI, RSI, RDX, RCX, R8, R9, and executing CALL. The callee may save additional callee-saved registers in its prologue before doing any real work. On return, it executes RET, which pops the return address and jumps. The complete round trip runs roughly 3–5 cycles for CALL and RET themselves, plus 5–10 cycles for register bookkeeping around a small function. Agner Fog’s instruction tables document these costs in detail. For the add3 example, the non-inlined assembly looks like this:

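add3:
    push   rbx
    mov    ebx, edx       ; keep z in a callee-saved register across the call
    call   add            ; x and y are already in edi and esi
    mov    edi, eax       ; add(x, y) becomes the first argument
    mov    esi, ebx       ; z is the second argument
    pop    rbx
    jmp    add            ; tail call: add's ret returns to add3's caller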

Inlined, it becomes:

add3:
    lea    eax, [rdi + rsi]
    add    eax, edx
    ret

Two instructions instead of a push, three moves, a call, and a pop. In isolation this difference is imperceptible. In tight loops, the mechanism that matters is different from register spilling.

The compiler’s view ends at the call boundary

When a compiler encounters a call to a function whose body it cannot see, it must assume the worst. Any pointer in scope may have been modified by the callee. Global state may have changed. The callee may have observable side effects that make it unsafe to reorder operations around it. This assumption is conservative but correct: the compiler has no visibility into what the external function does.

Several optimization passes stop at call boundaries as a result. Constant propagation cannot carry values through an opaque call. Dead code elimination cannot remove code that depends on results computed through a call the compiler cannot reason about. Loop-invariant code motion becomes conservative when a loop body contains calls. For numerical code, the auto-vectorizer usually gives up on a loop that contains an opaque call.
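A minimal sketch of the loop-invariant case, using hypothetical names that do not come from Lemire's article. note_progress stands in for a function defined in another translation unit; the noinline attribute only approximates that situation inside a single file, since the compiler's interprocedural analysis can still look at the body.

```cpp
#include <cstddef>

// Hypothetical stand-in for a function whose body lives elsewhere.
static std::size_t g_last_index = 0;
__attribute__((noinline)) void note_progress(std::size_t i) { g_last_index = i; }

// *factor is loop-invariant in the source, but if note_progress were truly
// opaque the compiler would have to assume it might write to *factor and
// reload the value on every iteration.
int sum_scaled(const int* data, std::size_t n, const int* factor) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        note_progress(i);
        total += data[i] * *factor;
    }
    return total;
}

// Copying the value into a local makes the invariance explicit: a local whose
// address never escapes cannot be modified by any callee.
int sum_scaled_local(const int* data, std::size_t n, const int* factor) {
    const int f = *factor;
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        note_progress(i);
        total += data[i] * f;
    }
    return total;
}
```

Both functions return the same result; the second simply does by hand what the optimizer may be unable to prove safe across the call.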

Consider a loop that applies a function element-wise over an array:

__attribute__((noinline))
int scale(int x) { return x * 3; }

void apply(int* data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = scale(data[i]);
    }
}

At -O3, GCC keeps the loop scalar, with one call per iteration. The pointer and the end marker live in callee-saved registers so that they survive the call:

.loop:
    mov    edi, DWORD PTR [rbx]
    add    rbx, 4
    call   scale
    mov    DWORD PTR [rbx - 4], eax
    cmp    rbx, rbp
    jne    .loop

Remove the noinline attribute and the compiler can see the function body. With AVX2 available, it emits:

.loop:
    vmovdqu ymm0, YMMWORD PTR [rdi + rax]
    vpslld  ymm1, ymm0, 1
    vpaddd  ymm0, ymm0, ymm1   ; x*3 = x + (x<<1)
    vmovdqu YMMWORD PTR [rdi + rax], ymm0
    add     rax, 32
    cmp     rax, rdx
    jne     .loop

Eight integers processed per iteration instead of one. That is the actual cost of the function call: not the roughly 10-cycle overhead of the call itself, but the up-to-eightfold throughput gap that appears when vectorization is blocked. GCC’s -fopt-info-vec-missed reports which loops failed to vectorize and why; for a loop like this one, the typical complaint is that the loop contains function calls or data references that cannot be analyzed. Clang exposes similar information with -Rpass-missed=loop-vectorize and -Rpass-analysis=loop-vectorize.

Why `inline` does not fix this, and LTO does

The inline keyword in C++ has accumulated a reputation it does not deserve. Programmers reach for it hoping to eliminate function call overhead, but the C++ standard specifies that it carries no obligation: the compiler is free to ignore it entirely. What inline actually guarantees is a relaxation of the ODR (One Definition Rule): the function may be defined in multiple translation units, provided the definitions are identical, without triggering a linker error. This is why non-template free functions defined in a header carry inline (member functions defined inside a class body are implicitly inline, and templates have their own exemption): not as an optimization, but as a prerequisite for including the header in multiple .cpp files without a linker complaint. cppreference makes the same point: compilers are free to inline functions that are not marked inline and free to emit calls to functions that are. In practice GCC and Clang still treat the keyword as a mild hint in their cost models, but it is never a guarantee.
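The ODR role of the keyword can be sketched with a hypothetical header, scale.hpp, that several .cpp files are assumed to include. The file name and the functions in it are illustrative only.

```cpp
// scale.hpp: hypothetical header included from several .cpp files
#ifndef SCALE_HPP
#define SCALE_HPP

// Without `inline`, every .cpp file that includes this header would emit its
// own external definition of scale, and the link would fail with a
// duplicate-symbol error. With it, the duplicates are merged.
inline int scale(int x) { return x * 3; }

// Function templates have their own ODR exemption, so no keyword is needed.
template <typename T>
T scale_generic(T x) { return x * 3; }

// Member functions defined inside the class body are implicitly inline.
struct Scaler {
    int factor;
    int apply(int x) const { return x * factor; }
};

#endif  // SCALE_HPP
```

None of this forces the compiler to inline anything. It only makes the bodies visible in every translation unit that includes the header, which is the precondition for inlining.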

The real mechanism behind inlining is visibility. A compiler can only inline a function if it can see the function’s body at the call site. For functions defined in the same translation unit, this happens naturally at -O2 or -O3 when the compiler’s cost model judges the function worth inlining. GCC’s inline cost model works from a weighted pseudo-instruction budget (--param max-inline-insns-single; the documented default was 400 in older releases, and newer releases use smaller values that depend on the optimization level, so check the documentation for your version); Clang’s threshold sits around 225 at -O2, tunable via -mllvm -inline-threshold. For functions defined in separate .cpp files, no amount of inline annotations helps, because the body is never visible to the compiler while it compiles the calling translation unit.

This is where link-time optimization (LTO) becomes load-bearing. With LTO, the compiler emits intermediate representation into object files rather than machine code. The linker feeds all IR to a final optimization pass with visibility into the entire program at once:

# GCC full LTO
g++ -O2 -flto -c math.cpp -o math.o
g++ -O2 -flto -c main.cpp -o main.o
g++ -O2 -flto math.o main.o -o program

# Clang ThinLTO -- more practical for large codebases
clang++ -O2 -flto=thin src/*.cpp -o program

Clang’s ThinLTO builds per-module summaries during compilation and imports only the function bodies needed at link time, processing modules in parallel. It is commonly reported to capture 80–90% of full LTO’s runtime benefits at a fraction of the link-time cost, though the exact share depends on the workload. Chrome and Firefox ship builds that use LTO, and the Linux kernel can be built with Clang’s LTO. Reported gains on hot cross-module call paths run 5–15% end-to-end, higher when inlining unlocks downstream vectorization. In CMake (3.9 or later for GCC and Clang), one line enables it: set_property(TARGET my_target PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE).

The architectural consequence for library design

The visibility constraint is why a disproportionate share of high-performance C++ libraries are header-only. When a library ships only compiled .a or .so files without LTO, the compiler at the call site has no function bodies to inline, and trivial operations stay scalar.

Eigen achieves BLAS-competitive performance through expression templates. An expression like c = a + b * 2.5f generates a single fused loop, not three passes, because the scaling, the addition, and the assignment are all visible simultaneously. That fusion requires the entire expression tree to be in scope at once, which requires all of Eigen’s implementation to live in headers. C++20 ranges use the same strategy: std::views::transform takes callable types as template parameters rather than std::function specifically to keep the body visible to the optimizer. A std::function callback inside a loop is usually opaque and forces a virtual-dispatch-like indirect call per iteration; a lambda passed as a template parameter can be inlined and, where the loop body allows, vectorized. For small callables the indirect call can dominate the per-element cost, so the gap shows up directly in nanoseconds per element.
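The difference between the two callback styles can be sketched as follows. The function names apply_erased and apply_generic are hypothetical, and the sketch sticks to C++17 so it does not depend on a ranges implementation.

```cpp
#include <functional>
#include <vector>

// Type-erased callback: the loop only knows it holds "something callable",
// so each element normally goes through an indirect call.
void apply_erased(std::vector<int>& data, const std::function<int(int)>& f) {
    for (int& x : data) {
        x = f(x);
    }
}

// Template parameter: the callable's concrete type is part of the
// instantiation, so its body is visible at the call site and the optimizer
// is free to inline it.
template <typename F>
void apply_generic(std::vector<int>& data, F f) {
    for (int& x : data) {
        x = f(x);
    }
}
```

Both produce the same results when passed the same lambda, for example [](int x) { return x * 3; }. Whether the generic version actually vectorizes depends on the compiler, the flags, and the body of the callable.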

When you need to keep the call

Sometimes a function should not be inlined. Aggressive inlining can grow the instruction cache footprint beyond the L1i limit, which sits at 32 KB on Skylake-class hardware. GCC’s __attribute__((noinline, cold)) marks a function, such as an infrequently called error handler, as unlikely to execute: GCC optimizes it for size and places it in a separate text subsection (.text.unlikely on ELF targets), improving L1i density for the hot path without manual reorganization.
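A minimal sketch of that pattern, assuming GCC or Clang attribute syntax. The names report_negative and sum_nonnegative are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>

// Rarely executed error path: kept out of line and marked cold so the
// compiler treats calls to it as unlikely and keeps its code away from the
// hot loop.
__attribute__((noinline, cold))
int report_negative(std::size_t index, int value) {
    std::fprintf(stderr, "negative value %d at index %zu\n", value, index);
    return -1;
}

// Hot path: sums the input and writes the total to *out.
// Returns 0 on success, or -1 (leaving *out untouched) on the first
// negative element.
int sum_nonnegative(const int* data, std::size_t n, long long* out) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < 0) {
            return report_negative(i, data[i]);
        }
        total += data[i];
    }
    *out = total;
    return 0;
}
```

Here the call is kept on purpose: inlining the handler would only add fprintf setup code to a loop that almost never needs it.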

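For functions that cannot be inlined but are called in performance-sensitive loops, GCC and Clang provide __attribute__((const)) and __attribute__((pure)). A const function reads nothing beyond its arguments; a pure function may read memory but does not write it. Both annotations let the optimizer hoist a loop-invariant call out of the loop, merge repeated calls with the same arguments, and delete calls whose results are unused, recovering optimization opportunity without paying the code-size cost of full inlining. They do not by themselves let the vectorizer replicate a call across SIMD lanes; that needs a vector variant of the function, for example one declared with OpenMP’s #pragma omp declare simd. For the opposite direction, C++23 has no standard [[noinline]] attribute; the spellings remain vendor-specific, such as [[gnu::noinline]], [[clang::noinline]], and [[msvc::noinline]].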
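A minimal sketch of the two attributes, assuming GCC or Clang. All function names are hypothetical, and noinline is added only to keep the calls out of line for the sake of the illustration.

```cpp
#include <cstddef>

// const: the result depends only on the argument values; no memory is read.
__attribute__((const, noinline)) int triple(int x) { return x * 3; }

// pure: may read memory (here, through the pointer) but has no side effects.
__attribute__((pure, noinline)) int sum_of(const int* values, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += values[i];
    }
    return total;
}

// triple(seed) is loop-invariant. Because triple is declared const, the
// compiler is allowed to evaluate it once before the loop.
void fill_offsets(int* out, std::size_t n, int seed) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = triple(seed) + static_cast<int>(i);
    }
}

// The two sum_of calls take the same arguments with no store in between, so
// the compiler is allowed to merge them into a single call.
int weighted_pair(const int* weights, std::size_t m, int a, int b) {
    return a * sum_of(weights, m) + b * sum_of(weights, m);
}
```

The attributes are promises rather than checked properties: marking a function const or pure when it actually has side effects gives the optimizer licence to drop or reorder calls incorrectly.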

For virtual dispatch, the compiler can often eliminate the vtable lookup entirely when it can prove that an object’s dynamic type is known. Marking a class (or an individual virtual function) final provides that proof for calls made through that type. When final is inappropriate, CRTP delivers static polymorphism with no vtable overhead, at the cost of giving up runtime polymorphism through a common base. Spectre mitigations are also relevant here: where retpolines are enabled, each indirect call costs roughly 10–25 additional stall cycles, which makes the devirtualization question concrete rather than theoretical in tight loops. Newer hardware mitigations such as enhanced IBRS reduce the need for retpolines; Intel CET-IBT, sometimes mentioned alongside them, is a control-flow integrity feature rather than a Spectre mitigation.
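Both techniques can be sketched side by side. The class names (Shape, Square, AreaBase, Rect) are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>

// Runtime polymorphism with a final leaf class.
struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};

struct Square final : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};

// Because Square is final, a const Square* cannot point to a more-derived
// type, so the compiler may call Square::area directly and inline it.
int total_area(const Square* squares, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += squares[i].area();
    }
    return total;
}

// Static polymorphism with CRTP: no virtual functions, no vtable.
template <typename Derived>
struct AreaBase {
    int area() const {
        return static_cast<const Derived&>(*this).area_impl();
    }
};

struct Rect : AreaBase<Rect> {
    int w;
    int h;
    Rect(int w_, int h_) : w(w_), h(h_) {}
    int area_impl() const { return w * h; }
};

// Generic code is written against AreaBase<T>; the call resolves at compile
// time for each concrete T.
template <typename T>
int doubled_area(const AreaBase<T>& shape) {
    return 2 * shape.area();
}
```

The trade-off is visible in the types: Square objects can still be handled through a Shape pointer, while Rect and any other CRTP type share no common base and cannot be stored together without extra machinery.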

Lemire’s article frames the question as how expensive a call is, and the cycles-based answer is: not very. The more complete answer is that the call itself is cheap; what it costs is the compiler’s ability to reason across it, and on hot paths that cost compounds into throughput gaps that no amount of micro-optimization on the surrounding code can recover.