What a Function Call Actually Costs in a Tight Loop

Source: isocpp

Function calls are cheap. That is what every introductory systems course teaches, and it is roughly true. On a modern x86-64 core with a warm instruction cache and a correctly predicted branch, a direct call plus return costs around 3–5 cycles. Add a standard frame setup and teardown and you land somewhere between 5 and 10 cycles total. At 3 GHz, 10 cycles is about 3 nanoseconds.

That sounds fine until you put it in a loop. If your loop body does 2 nanoseconds of actual work and you pay 3 nanoseconds of call overhead per iteration, the function call is your runtime. Daniel Lemire quantifies this clearly: a trivial function forced non-inline with __attribute__((noinline)) runs 4–8 times slower than the equivalent inlined code, not because the body changed but because the call mechanics dominate at that scale.

What the CPU Actually Does

The call instruction on x86-64 pushes the 8-byte return address onto the stack and transfers control. ret pops it and jumps back. The hardware helps: the Return Stack Buffer (RSB) is a small predictor that shadows the software call stack, typically 16–32 entries deep depending on the microarchitecture. Intel Golden Cove chips have 24 entries; AMD Zen 3 and Zen 4 have 32. Each call writes a predicted return address into the RSB; each ret reads from it. When calls and returns are balanced and nesting depth stays within RSB capacity, return prediction is essentially perfect and the misprediction penalty is zero.

What the RSB cannot fix is the instruction cache. If the callee’s code sits in a cold cache line, the front end pays the latency of whichever level still holds it: roughly 12 cycles from L2, 40 or more from L3, and well over 100 from DRAM, all before the first instruction of the function even executes. This is why call overhead measurements vary so widely between benchmarks. A function called inside a tight inner loop stays warm. The same function called less frequently, in a codebase large enough to create i-cache pressure, can cost an order of magnitude more per invocation.

// Force the compiler to keep the call instead of inlining
__attribute__((noinline))
uint64_t increment(uint64_t x) { return x + 1; }

// Tight loop benchmark
uint64_t loop_noinline(uint64_t n) {
    uint64_t x = 0;
    for (uint64_t i = 0; i < n; i++)
        x = increment(x);
    return x;
}

// Inlined equivalent — the compiler reduces this to a counter increment
uint64_t loop_inline(uint64_t n) {
    uint64_t x = 0;
    for (uint64_t i = 0; i < n; i++)
        x = x + 1;
    return x;
}

The inlined version benchmarks at roughly 0.3 ns per iteration. The noinline version sits between 1.5 and 2.5 ns. The function body is identical.

How the Compiler Decides to Inline

GCC and Clang both maintain a cost model that estimates whether inlining a given callee is worth the code size increase at the call site. GCC’s threshold is controlled by -finline-limit, defaulting to 600 pseudoinstructions. Clang/LLVM uses an inline threshold of 225 at -O2 and 275 at -O3. Functions marked inline or static receive preferential treatment, but the compiler can still decline if the body exceeds the threshold.
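Both thresholds are tunable from the command line. A sketch, assuming a recent GCC and Clang; the filename is a placeholder and flag spellings have shifted across compiler versions:

```shell
# GCC: raise the pseudo-instruction budget for inlinable functions
g++ -O2 -finline-limit=1200 hot_loop.cpp -c

# Clang: the LLVM inliner threshold is exposed through -mllvm
clang++ -O2 -mllvm -inline-threshold=500 hot_loop.cpp -c
```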

Two attributes override the model entirely:

// Always inline regardless of size
__attribute__((always_inline)) inline int add(int a, int b) { return a + b; }

// Never inline, even if tiny
__attribute__((noinline)) int add_visible(int a, int b) { return a + b; }

Profile-guided optimization improves this substantially. With PGO data, the compiler knows which call sites are actually hot and raises the inlining threshold for those sites specifically, controlling binary size while concentrating the optimization budget where execution time is actually spent. Without PGO, the compiler infers hotness from loop nesting depth, which works reasonably well but misses real-world patterns that only profiling can reveal.
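The PGO loop with Clang looks roughly like this; `foo.cpp`, `app`, and the workload flag are placeholders:

```shell
# 1. Build with instrumentation
clang++ -O2 -fprofile-generate foo.cpp -o app

# 2. Run a representative workload; emits default_*.profraw
./app --typical-input

# 3. Merge raw profiles into an indexed profile
llvm-profdata merge -output=app.profdata default_*.profraw

# 4. Rebuild; hot call sites now get a larger inlining budget
clang++ -O2 -fprofile-use=app.profdata foo.cpp -o app
```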

Inlining Enables Vectorization

Removing the call overhead is the obvious benefit. The less obvious benefit is what the optimizer can do once the call boundary is gone.

When a function call is opaque to the compiler, the auto-vectorizer cannot look through it. The calling convention requires the compiler to treat the callee as potentially clobbering all caller-saved registers, including every XMM and YMM register used for SIMD. This means the vectorizer cannot hoist SIMD loads and stores across call boundaries, cannot maintain a vector accumulator across loop iterations, and must serialize what would otherwise be a parallel reduction.

// With noinline, this loop cannot be auto-vectorized
__attribute__((noinline))
float scale(float x, float factor) { return x * factor; }

float sum_scaled(const float* arr, int n, float factor) {
    float acc = 0;
    for (int i = 0; i < n; i++)
        acc += scale(arr[i], factor);  // opaque call, SIMD impossible
    return acc;
}

Remove the noinline attribute and the compiler sees acc += arr[i] * factor repeated N times. That is a fused multiply-add reduction, which GCC and Clang will auto-vectorize into 8-wide AVX2 vfmadd231ps or 16-wide AVX-512 instructions, provided the floating-point additions may be reassociated (-ffast-math or -fassociative-math; strict IEEE ordering otherwise keeps the accumulation serial). The transformation is not available through any means short of inlining, because the vectorizer requires visibility into the entire loop body.

On Intel chips through Haswell, there is an additional penalty: calling a function compiled without AVX from an AVX context triggers an AVX-SSE transition that costs 70–100 cycles. The processor must clear the upper half of all YMM registers before the callee can safely use legacy SSE instructions. Inlining eliminates the boundary where the transition would occur.

The ABI Is Not Free Either

The calling convention itself has a cost that varies by platform. The System V AMD64 ABI used on Linux and macOS passes the first six integer arguments in registers (RDI, RSI, RDX, RCX, R8, R9) and the first eight floating-point arguments in XMM0–XMM7. The Windows x64 ABI is more conservative: only four integer registers (RCX, RDX, R8, R9) and four XMM registers (XMM0–XMM3).

Windows x64 also requires the caller to allocate a mandatory 32-byte shadow space on the stack before every call, even for functions that take no arguments at all. There is no equivalent in the System V ABI. For an inner loop calling a function with five integer arguments, Windows code spills the fifth argument to the stack on every iteration; System V keeps it in a register. That extra store-and-load pair compounds at scale.

This is one reason that compute-intensive C++ code compiled for Linux often outperforms the same code compiled for Windows on equivalent hardware, particularly for workloads with many arguments crossing call boundaries.

Virtual Dispatch and the Spectre Tax

Virtual function calls replace a direct call to a known address with an indirect call through the vtable. The sequence is roughly: load the vptr from the object, load the function pointer from the vtable slot, call the pointer. For a warm cache and a monomorphic call site (one concrete type), this costs about 5–10 cycles over a direct call.

The situation changed after Spectre was disclosed in January 2018. Retpoline mitigations, now standard in the Linux kernel and common in user-space hardened builds, replace indirect branches with a serializing sequence that prevents speculative execution through the indirect target. Under retpoline, each virtual call costs 30–80 cycles regardless of cache state or prediction accuracy. A tight loop over a polymorphic interface in a retpoline-patched binary can lose the majority of its throughput to mitigation overhead.

Devirtualization avoids the problem entirely when the compiler can prove the concrete type at the call site. The final specifier is the most direct signal:

class Base {
public:
    virtual void method() = 0;
    virtual ~Base() = default;
};

class Derived final : public Base {
public:
    void method() override { /* ... */ }
};

void caller(Derived* d) {
    d->method();  // devirtualized: compiler sees 'final', emits a direct call
}

Without final, devirtualization can still occur if the object is allocated in the same scope and the type is visible. Link-time optimization extends this analysis across translation unit boundaries.

Cross-TU Inlining with LTO

Without LTO, a function defined in a different .cpp file is completely opaque at the call site. The compiler cannot inline it, cannot propagate its constants, and cannot see its aliasing behavior. Header-only libraries work around this by putting implementations where the compiler can see them; LTO provides the same visibility without requiring that trade-off.

ThinLTO is the practical choice for most large codebases. Full LTO runs a monolithic whole-program optimization pass at link time, which produces excellent results but scales poorly. ThinLTO emits per-module bitcode with a global function summary, allowing the linker to inline across TU boundaries while keeping each module’s optimization parallel and incremental. The quality difference from full LTO is small in practice:

# Full LTO
clang++ -O2 -flto foo.cpp bar.cpp -o app

# ThinLTO: faster link, similar optimization quality
clang++ -O2 -flto=thin foo.cpp bar.cpp -o app

Typical speedups from LTO on call-heavy code range from 5 to 20 percent. Codebases with many small helper functions distributed across TUs can see larger gains. The noinline attribute takes precedence even under LTO, which is useful when you need clean symbol boundaries for profilers like perf.

When Not to Inline

Inlining everything is not the right goal. Inlining a large function at many call sites copies the function body N times, increasing binary size and i-cache pressure. At some threshold, the code size increase from inlining starts costing more in cache misses than the call overhead it eliminated. The compiler’s default thresholds are conservative for this reason.

The practical workflow for performance-sensitive code: measure first, use noinline in benchmarks to isolate the hypothesis, then let the compiler or LTO handle the inlining decision. Annotating hot paths with always_inline is occasionally justified but should follow profiler evidence, not intuition. The compiler’s cost model is well-calibrated for typical code; the cases where it makes a clearly wrong decision are less common than they appear.

The core point from Lemire’s analysis is that in tight loops, the function call overhead can exceed the work being done, and compilers know how to fix this when given visibility into the callee. Whether that visibility comes from function definitions in headers, from inline or static marking, or from LTO, the outcome is the same: the optimizer sees the full loop body and the overhead disappears, often taking along with it the call boundary that was preventing vectorization.
