
What the Calling Convention Forces Your Compiler to Forget

Source: isocpp

The conventional way to frame function call overhead focuses on the wrong unit. You find Daniel Lemire’s analysis on isocpp.org, count 3 to 10 cycles for a round-trip call on modern x86-64, and conclude the overhead is the problem. It is a problem, but not the main one.

The main one is the calling convention.

The ABI as a Contract for Forgetting

A calling convention is a contract between caller and callee about who owns what. On Linux and macOS, the System V AMD64 ABI says the caller places integer arguments in RDI, RSI, RDX, RCX, R8, R9; floating-point arguments in XMM0 through XMM7; and then surrenders its right to expect those register values back. The callee may destroy the caller-saved registers. The stack pointer must be 16-byte aligned immediately before the call instruction; the pushed return address then sits at the top of the stack on entry to the callee.

This is a correctness contract. It exists so that code compiled from different translation units, different compilers, different languages, can link together without corruption. It achieves this by defining a narrow set of things the compiler is allowed to assume across a call boundary.

The compiler cannot assume:

  • That a pointer argument does not alias memory it is tracking in registers
  • That the function has no side effects
  • That the function reads no global state
  • That the same inputs produce the same outputs on every call

Without those guarantees, the optimizer must treat the call as a black box. It must save and restore anything it cares about. It must serialize iterations that could otherwise run in parallel. It cannot apply any transformation that requires seeing the full computation across that boundary.

Windows Makes This Concrete

The Microsoft x64 ABI, used on Windows, illustrates how these choices are not inevitable. It is stricter than System V in ways that have direct throughput consequences. The caller must reserve 32 bytes of shadow space on the stack for every call, regardless of how many arguments the function takes. Compilers typically fold that reservation into the caller's frame allocation once per function rather than adjusting the stack pointer around each call, but the 32 bytes are dead weight in every frame that makes calls: even a zero-argument callee requires them. Linux pays none of this.

More expensive still: the Windows convention only passes four integer arguments in registers (RCX, RDX, R8, R9) before spilling to the stack, versus six on Linux. A function with five integer parameters reads its fifth argument from memory on every Windows call. In a hot computation loop, that is a stack load per iteration the System V ABI does not require.

This is part of why the same C++ code, compiled with the same compiler at the same optimization level, often shows better throughput on Linux than Windows for compute-intensive loops. The calling convention overhead is paid per iteration, invisibly, by design. You will not find it in any profiler that samples at the function level.

Why SIMD Cannot Cross the Boundary

The vectorization problem follows directly from the ABI contract. A vmulps ymm0, ymm1, ymm2 instruction multiplies eight single-precision floats simultaneously. For the compiler to use it across N loop iterations, it needs to see N iterations as a single unit of work. It cannot do that when a non-inlined function sits inside the loop.

The reason is not just that the compiler cannot see what the function does. There is no ABI representation for passing eight floats to an arbitrary function and getting eight floats back. The convention defines one float in XMM0, one float returned in XMM0. A vectorized version of the function would need a different signature entirely. Since the compiler cannot rewrite the callee’s signature at the call site, it emits scalar code regardless of how obvious the parallelism is.

// The compiler cannot vectorize this loop.
// One element per call, one call per iteration.
float square(float x) { return x * x; }

void square_array(float* out, const float* in, int n) {
    for (int i = 0; i < n; i++)
        out[i] = square(in[i]);
}

Inline square and the loop body becomes out[i] = in[i] * in[i]. With -O3 -mavx2, GCC and Clang vectorize this into a vmulps ymm0, ymm0, ymm0 loop processing eight elements per iteration. The throughput difference reaches 8x on AVX2 and 16x on AVX-512, far exceeding what you would get from eliminating the call overhead alone. Agner Fog’s processor optimization manuals document this pattern consistently across microarchitectures; the SIMD throughput gap between inlined and non-inlined scalar loops is the dominant factor, not the cycle cost of the call instruction itself.

The Flip Side: Instruction Cache Pressure

The standard advice stops at “inline more.” That advice is incomplete, because aggressive inlining has a cost that profilers almost never surface until it has already caused damage.

Every inlined call site duplicates the function body. A 40-instruction function called from 60 places becomes 2400 additional instructions in the binary. The L1 instruction cache on a modern x86-64 core is 32KB with 64-byte lines, giving 512 cache lines. When the combined hot instruction footprint of an aggressively inlined loop exceeds what fits in L1i, the loop begins evicting its own code. An L1 instruction cache miss served from L2 costs roughly 10-15 cycles; one that falls through to L3 costs 40 or more, and a fetch from DRAM costs hundreds. These penalties are paid per loop iteration just like call overhead, but they show up in different performance counters and are easy to misdiagnose as compute bottlenecks.

You can distinguish the two with perf stat. Intel’s Top-Down Microarchitecture Analysis methodology separates cycles into frontend_bound (the processor could not fetch or decode instructions fast enough) and backend_bound (execution units were the constraint). Instruction cache pressure shows up in frontend_bound:

perf stat -e cycles,instructions,L1-icache-load-misses,\
  frontend_retired.latency_ge_8 ./program

High L1-icache-load-misses relative to total cycles, combined with high frontend_bound, points to code bloat. The fix is the opposite of what Lemire’s article suggests for tight loops: mark cold callees noinline to keep them out of the hot instruction stream.

// Moves this to .text.cold in the ELF binary;
// biases branch prediction to treat the call as unlikely
[[gnu::noinline, gnu::cold]]
void throw_out_of_range(size_t idx, size_t max);

PGO Manages the Tension Automatically

Profile-Guided Optimization is the practical resolution to the conflict between inlining aggressively for hot code and restraining code size for cold code. With PGO data, the compiler has actual execution frequency for every call site. It raises inlining thresholds specifically for hot paths and leaves cold functions in their own code regions.
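A sketch of the two-phase workflow, using GCC's instrumentation-based PGO; the filenames and training input are placeholders:

```shell
# 1. Build with instrumentation; counters are written out at exit.
g++ -O2 -fprofile-generate app.cpp -o app

# 2. Run a representative workload to record call-site frequencies.
./app training-input.json

# 3. Rebuild using the profile; inlining decisions now track real heat.
g++ -O2 -fprofile-use app.cpp -o app
```

Clang's equivalent uses -fprofile-instr-generate and -fprofile-instr-use, with an llvm-profdata merge step in between.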

GCC also provides __attribute__((flatten)), which instructs the compiler to inline all callees of the marked function recursively. This is a less surgical version of always_inline that works well for entry points into a computation-heavy call tree:

__attribute__((flatten))
void process_chunk(uint8_t* data, size_t len) {
    // Every function called here will be inlined
    parse_header(data);
    validate_fields(data, len);
    emit_result(data, len);
}

The simdjson JSON parser uses a similar philosophy across its entire hot parsing path: a macro wrapping __attribute__((always_inline)) applied to every small function in the core loop. The result is a single inlined function body the auto-vectorizer can analyze in full, achieving 2.5-3.5 GB/s throughput on real JSON documents. Conventional parsers organized as clean call hierarchies without that kind of forced inlining typically measure around 0.5 GB/s on the same hardware. The algorithmic work is identical; the difference is entirely optimizer visibility.

ThinLTO for Cross-File Visibility

None of the above matters if the function lives in a different translation unit. A function defined in helpers.cpp and called from main.cpp is completely opaque to the compiler without LTO, regardless of how small it is or what attributes it carries. The compiled object contains only a symbol reference; the body is not available for inlining.

ThinLTO solves this with minimal build overhead:

clang++ -O2 -flto=thin helpers.cpp main.cpp -o app

# With an incremental cache for faster rebuilds; the cache flag
# is an lld option, hence -fuse-ld=lld
clang++ -O2 -flto=thin -fuse-ld=lld \
  -Wl,--thinlto-cache-dir=/tmp/thinlto-cache \
  helpers.cpp main.cpp -o app

Each file compiles to LLVM IR with a lightweight per-module summary. At link time, the linker identifies profitable cross-module inlining candidates, imports the needed IR, and recompiles each module in parallel. Link time increases by roughly 20-50% over a non-LTO build. Reported throughput gains on large C++ codebases are typically 5-15%, almost entirely from cross-TU inlining and the downstream dead code elimination it enables.

The raw call overhead that Lemire measures is real and worth understanding. But the cycle cost of CALL and RET is the most visible and least actionable part of the picture. The calling convention’s requirements for what the compiler must forget, the ABI’s inability to represent SIMD-width function arguments, the code bloat from over-inlining that evicts itself from the instruction cache, and the cross-TU opacity that only LTO resolves: these are where the actual performance work happens.
