What High-Performance C++ Libraries Teach Us About Function Call Overhead

Daniel Lemire’s recent piece on the cost of a function call uses a clean pair of examples to make the point: a function that calls another to add two numbers, versus the same result computed inline. The observation is correct and the examples are clear. But they are necessarily minimal, and they leave a practical question open: what does this look like in a real codebase that cares about performance, and what do teams actually do about it?

The Optimization Boundary

The fundamental issue is visibility. When the compiler compiles a translation unit and encounters a call to a function whose definition lives elsewhere, it must treat that call conservatively. It does not know what the callee reads or writes, whether it modifies any globally reachable memory, or whether reordering iterations of a loop around it is safe. The optimizer assumes the worst on all counts.

For a single call in non-hot code, that assumption is acceptable. For a loop that calls a function on every iteration, the consequences compound. Consider a tight loop over two integer arrays:

void sum_arrays(const int* a, const int* b, int* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = add(a[i], b[i]);
    }
}

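If add is defined in another translation unit, GCC with -O3 -mavx2 generates a scalar loop: one element per iteration, one call per iteration. If add is visible at the call site, the compiler inlines it and generates SIMD code that processes eight 32-bit integers per instruction, guarded by a runtime overlap check because the pointers are not restrict-qualified. The vectorized main loop looks like this: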

.L3:
    vmovdqu ymm0, YMMWORD PTR [rbx+rax]
    vpaddd  ymm0, ymm0, YMMWORD PTR [r12+rax]
    vmovdqu YMMWORD PTR [r13+rax], ymm0
    add     rax, 32
    cmp     rax, rdx
    jne     .L3

Agner Fog’s instruction tables put vpaddd throughput at one or more per clock on Skylake, with each instruction processing eight 32-bit integers. The scalar version handles one integer per iteration and pays for a call each time, regardless of optimization level. The gap in elements per instruction is 8x, and it stems from a single architectural constraint: the System V AMD64 ABI passes integer arguments in individual registers, RDI and RSI for the first two. Vectorizing the loop would require passing eight integers simultaneously in a 256-bit YMM register, which implies a completely different calling convention. Without the function body, the vectorizer cannot rewrite the loop. The ABI boundary is, mechanically, where auto-vectorization stops.
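The two cases can be approximated in a single file. This is a minimal sketch assuming GCC or Clang: the names add_visible, add_opaque, and both loop functions are hypothetical, and the noinline attribute is only a stand-in for a definition that lives in another translation unit. It approximates that boundary rather than reproducing it exactly.

```cpp
#include <cassert>
#include <cstddef>

// Body visible at the call site: the optimizer can inline it and
// is then free to vectorize the surrounding loop.
static inline int add_visible(int a, int b) { return a + b; }

// Stand-in for an out-of-line definition (assumption: GCC/Clang
// attribute syntax). A real cross-TU call is more opaque still,
// because the body is not available to the optimizer at all.
__attribute__((noinline)) int add_opaque(int a, int b) { return a + b; }

void sum_arrays_visible(const int* a, const int* b, int* c, std::size_t n) {
    assert(n == 0 || (a && b && c));
    for (std::size_t i = 0; i < n; i++) {
        c[i] = add_visible(a[i], b[i]);
    }
}

void sum_arrays_opaque(const int* a, const int* b, int* c, std::size_t n) {
    assert(n == 0 || (a && b && c));
    for (std::size_t i = 0; i < n; i++) {
        c[i] = add_opaque(a[i], b[i]);  // one call per element
    }
}
```

Both functions compute the same result. Comparing their assembly at -O3 -mavx2 is one way to see whether the vector instructions appear only in the first.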

On AVX-512 hardware the gap extends to 16x. Benchmarks from simdjson show the effect at scale: conventional JSON parsers with opaque call boundaries sustain roughly 0.5 GB/s; simdjson, which annotates its entire hot path for forced inlining, reaches 2.5-3.5 GB/s on the same workloads.

What the `inline` Keyword Does Not Do

A reasonable instinct is to mark functions with inline and expect the compiler to substitute them at call sites. This has not been a reliable approach since roughly GCC 4.x. The cppreference documentation states plainly that the keyword does not guarantee inline substitution, and GCC’s own documentation says the same. Compilers maintain inlining cost budgets and apply their own heuristics. The keyword is not an instruction; it is a request the compiler is free to ignore.

The keyword’s actual job in modern C++ is the One Definition Rule exemption. Defining a non-inline function in a header included by ten .cpp files produces ten conflicting linker symbols, causing a link error. The inline specifier tells the linker that all copies are identical and should be folded without error. The C++ standard specifies this. What enables inlining is visibility of the function body at the call site; inline in a header enables that visibility without triggering the ODR linker error. Template instantiations get an equivalent exemption from language rules, which is why Eigen, range-v3, and {fmt} can be entirely header-only without explicitly marking every function inline.
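As a sketch of what such a header might contain (the file name math_utils.hpp, the include guard, and every function in it are hypothetical, chosen only to illustrate the rule):

```cpp
// math_utils.hpp (hypothetical header, meant to be included from many .cpp files)
#ifndef MATH_UTILS_HPP
#define MATH_UTILS_HPP

#include <cassert>

// Without `inline`, each including .cpp file would emit its own external
// definition of add, and the link would fail with a duplicate-symbol error.
inline int add(int a, int b) { return a + b; }

// Function templates may be defined in headers without the keyword;
// their instantiations are merged across translation units.
template <typename T>
T add_generic(T a, T b) { return a + b; }

// Member functions defined inside the class body are implicitly inline.
struct Accumulator {
    long total = 0;
    int count = 0;

    void push(int x) {
        total = add_generic(total, static_cast<long>(x));
        ++count;
    }

    long mean() const {
        assert(count > 0);  // mean of an empty accumulator is undefined here
        return total / count;
    }
};

#endif  // MATH_UTILS_HPP
```

In all three cases the body is visible wherever the header is included, which is the precondition for inlining. Whether the compiler actually inlines a given call is still up to its heuristics.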

How Production Libraries Handle It

always_inline. simdjson defines really_inline as __attribute__((always_inline)) inline and applies it to every function on the hot JSON parsing path. This overrides the compiler’s cost model rather than making a suggestion. The MSVC equivalent is __forceinline. A portable wrapper covers both toolchains:

#if defined(_MSC_VER)
    #define FORCE_INLINE __forceinline
#else
    #define FORCE_INLINE __attribute__((always_inline)) inline
#endif

This is appropriate for functions where inlining is measurably critical and the code size cost per call site is acceptable. It is not appropriate for large functions called in many places, where it can inflate binary size past L1 instruction cache limits and produce net slowdowns.
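A small sketch of the macro in use follows. The helper is_json_whitespace and the loop count_whitespace are hypothetical illustrations and are not taken from simdjson; they only show the kind of tiny, hot function where forcing the decision is plausible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
    #define FORCE_INLINE __forceinline
#else
    #define FORCE_INLINE __attribute__((always_inline)) inline
#endif

// A few instructions long and called once per byte: a reasonable
// candidate for forced inlining (hypothetical helper).
FORCE_INLINE bool is_json_whitespace(std::uint8_t c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Hypothetical hot loop: counts whitespace bytes in a buffer.
std::size_t count_whitespace(const std::uint8_t* data, std::size_t len) {
    assert(len == 0 || data != nullptr);
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; i++) {
        count += is_json_whitespace(data[i]) ? 1 : 0;
    }
    return count;
}
```

The annotation guarantees that the call disappears, not that the result is faster. Measuring before and after is still the only way to confirm the benefit.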

Header-only layout. Providing function definitions in headers gives the compiler the body everywhere the header is included. This is the inlining prerequisite expressed as a project structure decision, and it is the constraint behind most high-performance template library designs. The performance intent is embedded in where the code lives, not in any annotation.

Link-time optimization. For functions that must live in separate translation units, LTO recovers cross-module inlining at link time. Chromium and Firefox both use ThinLTO in production builds. The LLVM ThinLTO documentation describes the mechanism: compact module summaries are generated at compile time, and at link time those summaries decide which functions are imported across modules, so that each module can then be optimized, and its imported callees inlined, in parallel. Google has reported 10-20% improvements over -O2 on production workloads when combining ThinLTO with profile-guided optimization. With Clang, enabling it requires only -flto=thin at both compile and link time (GCC offers its own -flto, which has no thin mode), or one CMake property that selects the LTO flags for the active toolchain:

set_property(TARGET my_target PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)

final and devirtualization. Virtual calls add a separate dimension. Pre-Spectre, an indirect branch through a vtable cost 1-3 cycles with a correctly predicted branch target buffer entry. Post-Spectre, the retpoline mitigation replaced indirect branches with a trampoline that exploits the return stack buffer to trap speculative execution, adding 10-25 cycles per indirect call on hardware without eIBRS. On those older microarchitectures, a well-predicted virtual call can therefore cost several times what it did before 2018. Marking a concrete class final (available since C++11) lets the compiler resolve virtual calls statically whenever the object's static type is that class, inline the method body, and eliminate the indirect branch entirely. Where final is inappropriate, profile-guided optimization can insert a speculative type check: a fast inlined path for the statistically dominant type at runtime, with a vtable fallback for everything else.
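A minimal sketch of the final case, using a hypothetical Shape and Square hierarchy. The function names are illustrative, and whether a given compiler actually devirtualizes each call should be confirmed in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>

struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};

// `final`: nothing can derive from Square, so Square::area is
// guaranteed to be the last override.
struct Square final : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};

// Static type is Shape: the call goes through the vtable unless the
// optimizer can prove the dynamic type some other way.
int total_area_dynamic(const Shape* const* shapes, std::size_t n) {
    assert(n == 0 || shapes != nullptr);
    int total = 0;
    for (std::size_t i = 0; i < n; i++) total += shapes[i]->area();
    return total;
}

// Static type is the final class Square: the compiler may call
// Square::area directly and inline it, with no indirect branch.
int total_area_static(const Square* squares, std::size_t n) {
    assert(n == 0 || squares != nullptr);
    int total = 0;
    for (std::size_t i = 0; i < n; i++) total += squares[i].area();
    return total;
}
```

The benefit only applies where the code already holds a Square. Calls made through a Shape pointer remain virtual, which is where the profile-guided speculative check becomes relevant.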

Reading the Compiler’s Feedback

Before reaching for always_inline, it is worth confirming what the compiler is doing. Both major compilers can report missed vectorization and missed inlining with standard flags:

# Clang
clang++ -O3 -Rpass-missed=loop-vectorize -Rpass-missed=inline foo.cpp

# GCC
g++ -O3 -fopt-info-vec-missed -fopt-info-inline-missed foo.cpp

GCC’s report for an opaque call in a vectorizable loop is along the lines of "statement clobbers memory", followed by the offending call; Clang produces "call instruction cannot be vectorized." The exact wording varies between compiler versions, but both identify the same root cause. Compiler Explorer makes verification immediate: put the loop with an external function declaration, observe the scalar assembly, then add the definition (or, in a multi-file build, -flto at compile and link) and watch the SIMD instructions appear. The assembly is the ground truth; profiler samples show the effect on runtime, not the cause in the IR.

The Practical Upshot

The direct overhead of a call/ret pair on x86-64 is roughly 10-15 cycles once argument setup and the saving and restoring of registers that the ABI lets the callee clobber are included. That is real, especially in tight loops running millions of iterations per second. The vectorization loss is a larger effect, and it is not labeled as call overhead in any profiler output. An 8x or 16x throughput gap from blocked SIMD shows up as the loop being slower than it should be, with no obvious attribution.

High-performance libraries have navigated this constraint for years. The patterns are standard toolchain features: header visibility, always_inline annotations, ThinLTO, and final declarations. The relevant skill is building the habit of checking whether vectorization happened in the loops that matter, using the diagnostic flags above, before assuming the performance budget is fully spent.