The cost of a function call is often framed as a fixed overhead: the stack frame setup, the register saves, the branch prediction machinery doing its job. Daniel Lemire’s recent post puts concrete numbers on it: somewhere around 10-15 cycles for a direct call on modern x86-64 hardware, or 3-5 nanoseconds at 3 GHz. For most code, that’s noise. For a tight loop running a billion iterations, it’s the entire runtime.
But the nanosecond number understates the problem. The real cost of a non-inlined function call in a tight loop isn’t the call itself; it’s what the compiler cannot do once that call is present. Function calls are optimization barriers, and the optimization they block most often is auto-vectorization, which can be worth a factor of 8 or more in throughput.
What the call actually costs
The x86-64 System V ABI (used on Linux and macOS) has a well-defined calling convention. Integer arguments go in rdi, rsi, rdx, rcx, r8, r9. Floating-point arguments go in xmm0 through xmm7. The callee must preserve rbx, rbp, r12 through r15. The call instruction pushes the return address onto the stack and jumps; ret pops it back. Agner Fog’s calling conventions manual covers the full details.
For a hot call site with a predicted branch, the round-trip cost on a Skylake-class CPU is roughly:
call instruction: ~3 cycles of latency
Function prologue and epilogue (push/pop of callee-saved registers): 1-2 cycles each
ret instruction, predicted via the Return Stack Buffer: ~3-4 cycles
Total: somewhere in the 10-15 cycle range for a trivial function that does almost nothing. That matches what Lemire measured when benchmarking a function that returns x + 1 in a tight loop: roughly 10x throughput difference between the non-inlined and inlined versions.
Virtual dispatch and function pointer calls add another 10-15 cycles when the branch predictor misses. Indirect branches are harder for the CPU to predict, and the Branch Target Buffer has limited capacity. In polymorphic call sites where the target changes frequently, you pay the full misprediction penalty every time. Agner Fog’s instruction tables put the misprediction penalty at 15-20 cycles on Skylake-class hardware.
The vectorization barrier
Here’s what the cycle count doesn’t capture. When the compiler sees a loop, it looks for opportunities to auto-vectorize: to process multiple elements per iteration using SIMD instructions. On modern x86-64 with AVX2, that means 8 single-precision floats or 4 doubles at a time. The speedup is real and substantial.
For auto-vectorization to work, the compiler needs to see the entire loop body. It needs to know there are no aliasing violations, no hidden side effects, and crucially, that the per-element operation maps onto SIMD instructions. A non-inlined function call violates all of those assumptions at once. The compiler cannot see inside a function defined in a separate translation unit, so it treats the call as an opaque operation with unknown side effects.
Consider this pattern:
// In utils.c (separate translation unit)
float square(float x) {
    return x * x;
}

// In main.c
float sum_squares(float *a, int n) {
    float s = 0;
    for (int i = 0; i < n; i++)
        s += square(a[i]);
    return s;
}
Without link-time optimization, square is not visible to the compiler when it compiles main.c. The loop cannot be vectorized. The generated assembly calls square once per iteration, scalar.
Now make square visible:
static inline float square(float x) {
    return x * x;
}

float sum_squares(float *a, int n) {
    float s = 0;
    for (int i = 0; i < n; i++)
        s += square(a[i]);
    return s;
}
With -O3 -mavx2, GCC collapses the inlined call and vectorizes the loop. The resulting assembly looks like:
.loop:
    vmovups ymm0, [rdi + rax*4]   ; load 8 floats
    vmulps  ymm0, ymm0, ymm0      ; square 8 floats
    vaddps  ymm1, ymm1, ymm0      ; accumulate
    add     rax, 8
    cmp     rax, rdx
    jl      .loop
Eight elements per iteration instead of one. GCC’s vectorization infrastructure handles more complex cases too, including reductions and loops with conditionals, as long as the body is visible to the optimizer.
The arithmetic on what this means: at ~5 ns per non-inlined call, a loop over 10 million elements takes 50 ms. Inlined and vectorized, processing 8 elements per ~0.5 ns iteration, the same loop takes roughly 625 microseconds. The factor is 80, not 10. The 10x number from the direct call overhead benchmark is the floor, not the ceiling.
Cross-translation unit inlining: LTO
The static inline approach works when the definition lives in the same file or a shared header. When a function is defined in a separate .c file, the compiler sees a hard boundary. The standard solution is Link-Time Optimization.
With GCC’s -flto, the compiler emits GIMPLE IR rather than native object code. The linker then runs a final optimization pass across all translation units, enabling inlining across those boundaries:
gcc -O2 -flto -o program main.c utils.c
For large codebases, Clang’s ThinLTO (-flto=thin) achieves similar results at dramatically lower link time by using a summary-based approach rather than processing all IR at once. The ThinLTO paper from CGO 2017 describes the architecture. Both approaches let the optimizer see across module boundaries and make inlining decisions based on the full call graph, not just what’s visible in one translation unit.
std::sort vs qsort: the canonical example
C’s qsort takes a comparator as an int (*cmp)(const void*, const void*) function pointer. C++’s std::sort takes a comparator as a template parameter. The functional difference is that the call through the function pointer cannot be inlined; the call to the template-instantiated comparator can.
On GCC 13 with -O3, sorting 10 million random integers:
qsort: ~1.8 seconds
std::sort: ~0.7 seconds
The roughly 2.5x difference comes almost entirely from inlining. With qsort, every comparison is a call through a pointer, preventing both inlining and any optimization the compiler could apply around it. With std::sort and a lambda comparator, the comparison is inlined, and the compiler can optimize the comparison together with the surrounding swap logic.
This is why the C++ standard library lives in headers. Templates require visible definitions for instantiation, and making everything inlineable is a core part of the zero-overhead abstractions design that Stroustrup described in The Design and Evolution of C++. Every method defined inside a class body is implicitly inline per the standard ([dcl.fct.spec]). std::vector::operator[] compiles to a single mov instruction because the optimizer sees its body and the entire operation collapses to an indexed memory read.
std::function and type erasure
std::function is the counter-example. It stores any callable through type erasure: a heap allocation when the callable outgrows the small-buffer optimization, and an indirect call on every invocation. The compiler cannot inline through it.
#include <functional>
#include <vector>

// Template: inlined and vectorizable
template<typename F>
void apply(F f, std::vector<int>& v) {
    for (auto& x : v) f(x);
}

// std::function: indirect dispatch, not vectorizable
void apply(std::function<void(int&)> f, std::vector<int>& v) {
    for (auto& x : v) f(x);
}
Benchmarks routinely show std::function callbacks running 6-8x slower than equivalent template callbacks in tight loops, entirely because of the inlining barrier. The type erasure is convenient for heterogeneous storage but costly in loops. This is why performance-sensitive code in game engines, parsers, and data processing pipelines avoids std::function in hot paths, preferring templates and lambdas that the optimizer can see through.
Forcing the compiler’s hand
GCC and Clang use cost models for inlining decisions. A function that exceeds the instruction threshold at a given optimization level may not be inlined even when it would be beneficial. Two escape hatches exist.
__attribute__((always_inline)) forces inlining regardless of size or optimization level:
__attribute__((always_inline)) static inline int add(int a, int b) {
    return a + b;
}
MSVC’s equivalent is __forceinline. There is no standard C++ attribute for this yet, though [[gnu::always_inline]] works on GCC and Clang. The compiler will warn if it genuinely cannot honor the request, for instance because the function is recursive.
The inverse, __attribute__((noinline)), prevents inlining. This is useful for benchmarking the non-inlined baseline, for keeping profiler call graphs readable, or for functions that must not be duplicated for code size reasons.
Profile-guided optimization takes a different approach entirely. It collects actual call frequency data from a real workload and feeds it back to the compiler, which then makes inlining decisions based on measured hotness rather than static heuristics:
gcc -O2 -fprofile-generate -o program_inst main.c
./program_inst < representative_workload.txt
gcc -O2 -fprofile-use -o program_opt main.c
PGO also enables speculative devirtualization: if a virtual call site is monomorphic 99% of the time in the profile, the compiler emits a direct, inlineable call guarded by a type check, with the vtable dispatch as a fallback. Google’s documentation on Clang PGO reports 5-15% performance improvements from PGO across real C++ applications, with a significant fraction coming from better inlining decisions.
Putting it together
The lesson from Lemire’s post is that function calls have real, measurable overhead. The more consequential lesson is about what the presence of the call prevents the compiler from doing. A call in a tight loop is not just 10-15 cycles; it’s a signal to the optimizer that it cannot see what happens next, which forecloses vectorization, constant propagation across the boundary, and a range of other transforms that compound together.
The practical guidance follows directly. Functions that appear in tight loops should be visible to the compiler at the point of use, through headers, static inline, or LTO. Avoid std::function in hot paths; prefer templates and lambdas. Use __attribute__((always_inline)) when the cost model is getting in the way of a known-beneficial inline. And if the workload is stable enough to profile, PGO will make better inlining decisions than any static heuristic.
The 10-15 cycle number is real. The 8x vectorization gain that disappears when the function call is present is more real. Both are worth understanding, and the second one is the reason tight-loop performance in C and C++ rewards paying close attention to what the compiler can see.