Beyond Call Overhead: How Inlining Enables the Optimizations That Matter

The conventional understanding of function call overhead is correct but incomplete. A call instruction pushes the return address and transfers control; a ret pops it and returns. On modern Intel silicon, that round trip costs 3 to 8 cycles under warm cache conditions, roughly 1 to 3 nanoseconds at 3 GHz. Daniel Lemire’s benchmarks confirm this range for trivially small functions. The framing of his recent article, which uses a simple add function to show compilers typically inline away the overhead entirely, is correct as far as it goes.

But the cycles-per-call number misses the larger story. The call instruction is rarely the bottleneck. What matters is what a function call forces the optimizer to stop doing.

The ABI Contract

Every function call in C++ is governed by a calling convention, an implicit contract between the caller and callee about how arguments and return values are exchanged and who is responsible for preserving which registers. On Linux and macOS, the System V AMD64 ABI passes the first six integer or pointer arguments in rdi, rsi, rdx, rcx, r8, and r9. On Windows x64, only four are passed in registers (rcx, rdx, r8, r9), and the caller is additionally required to allocate 32 bytes of “shadow space” on the stack before every call, regardless of whether the callee uses it.

The critical detail is the register classification. Under System V, the general-purpose registers split into caller-saved (volatile: rax, rcx, rdx, rsi, rdi, r8 through r11) and callee-saved (non-volatile: rbx, rbp, r12 through r15); Windows x64 draws the line differently, treating rsi and rdi as callee-saved. If the caller needs the value in rdi after the call returns, it must save it first, either on the stack or in a callee-saved register. If the callee needs rbx, it must push it at entry and pop it before returning. These spills and reloads are real loads and stores. They occupy store buffer entries, consume memory bandwidth, and add latency to dependent computations.

A tight inner loop calling even a small external function may, in the worst case, have to spill every live value held in a caller-saved register, execute the call, and reload them afterwards. With six such values, that is twelve memory operations the program would not have needed if the function were inlined.
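A minimal sketch of that situation, assuming GCC or Clang. The names opaque_step and accumulate are hypothetical, and the noinline attribute only approximates a function defined in another translation unit. The loop keeps more values live across the call than System V has callee-saved registers, so at -O2 the generated assembly is likely to show some of them travelling through the stack.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a function in another translation unit.
// noinline only approximates that: the compiler can still analyse the body.
__attribute__((noinline)) std::int64_t opaque_step(std::int64_t x) {
    return x * 3 + 1;
}

std::int64_t accumulate(const std::int64_t* data, int n) {
    assert(n >= 0);
    // Six accumulators, plus data, n and i, are live across every call.
    std::int64_t a = 0, b = 1, c = 2, d = 3, e = 4, f = 5;
    for (int i = 0; i < n; ++i) {
        std::int64_t v = opaque_step(data[i]);  // call boundary
        a += v; b ^= v; c += a; d ^= b; e += c; f ^= d;
    }
    return a + b + c + d + e + f;
}
```

Removing the noinline attribute and comparing the two assembly listings is one way to see how much of the loop body was register shuffling.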

The Optimization Barrier

Register spills are measurable but still not the primary cost in most real programs. The deeper problem is that an opaque function call is an optimization barrier, a point past which the compiler cannot see.

Consider this loop:

// scale.cpp: a separate translation unit, built without LTO
int scale(int x, int factor) { return x * factor; }

// apply.cpp: only the declaration is visible here
int scale(int x, int factor);

void apply(int* arr, int n, int factor) {
    for (int i = 0; i < n; i++) {
        arr[i] = scale(arr[i], factor);
    }
}

The compiler knows nothing about what scale does. It must assume scale could read or write any memory it can reach through global state, which may include the very array that arr points into. It cannot reorder iterations, cannot reason about independence across calls, and cannot prove that processing multiple elements simultaneously would be safe. The auto-vectorizer, which works by proving that multiple iterations can execute in parallel, gives up immediately.

Now make the definition visible to the caller, either by putting scale in the same translation unit or by defining it inline in a shared header:

inline int scale(int x, int factor) { return x * factor; }

void apply(int* arr, int n, int factor) {
    for (int i = 0; i < n; i++) {
        arr[i] = scale(arr[i], factor);
    }
}

After inlining, the compiler sees arr[i] = arr[i] * factor. With -O3 -march=native on a machine supporting AVX2, GCC emits vpmulld, a packed 32-bit integer multiply operating on eight values simultaneously. The loop processes eight elements per iteration rather than one. This is not a marginal improvement. It is a transformation that cannot happen at all without inlining, regardless of how cleverly the surrounding code is written.
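One way to observe the difference, sketched under the assumption of GCC or Clang. The names scale_opaque, scale_visible, and variants_agree are hypothetical. The noinline attribute imitates an out-of-line call, so the first loop should stay scalar at -O3 -march=native while the second is free to vectorize. Both produce the same values.

```cpp
#include <cassert>
#include <vector>

// noinline imitates a call the optimizer cannot fold into the loop.
__attribute__((noinline)) int scale_opaque(int x, int factor) { return x * factor; }

// Visible and inlinable: the loop body becomes a plain multiply.
inline int scale_visible(int x, int factor) { return x * factor; }

void apply_opaque(int* arr, int n, int factor) {
    for (int i = 0; i < n; i++) arr[i] = scale_opaque(arr[i], factor);
}

void apply_visible(int* arr, int n, int factor) {
    for (int i = 0; i < n; i++) arr[i] = scale_visible(arr[i], factor);
}

// Returns true when both variants produce identical output for inputs 0..n-1.
bool variants_agree(int n, int factor) {
    assert(n >= 0);
    std::vector<int> a(n), b(n);
    for (int i = 0; i < n; i++) a[i] = b[i] = i;
    apply_opaque(a.data(), n, factor);
    apply_visible(b.data(), n, factor);
    return a == b;
}
```

Pasting the two apply functions into Compiler Explorer and looking for a call instruction in one loop and packed multiplies in the other is a quick check that needs no benchmark.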

What Inlining Actually Gives the Optimizer

The SIMD case is the most dramatic, but inlining enables a whole family of optimizations across what was previously a call boundary.

Constant propagation. If factor is known at compile time (say, the caller always passes 2), the compiler can replace the multiply with an add or shift after inlining. Without inlining, scale is a black box and the constant cannot propagate into it.

Common subexpression elimination. If the same computation appears in both caller and callee after inlining, the compiler can eliminate the duplicate. This is impossible across an opaque call boundary.

Dead code elimination. If a branch inside scale is provably never taken given the argument the caller always passes, inlining lets the compiler see this and remove the dead path entirely.

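Exception handling overhead. Every call to a function that might throw requires the compiler to maintain enough bookkeeping to support stack unwinding. With table-based unwinding this costs little on the non-throwing path, but it adds unwind tables and cleanup code and restricts how freely the compiler can move code across the call. Marking functions noexcept eliminates this for individual calls. Inlining eliminates it by removing the call.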

These transformations compose. Inlining scale might enable constant propagation, which might fold a branch, which might make the loop body small enough to unroll, which might reveal further CSE opportunities. The optimizer reasons over a region of code; a function call artificially shrinks that region.
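A small sketch of how constant propagation and dead code elimination chain together after inlining. The Mode enum and the functions transform, doubled, and check_transform are hypothetical names introduced for illustration.

```cpp
#include <cassert>

enum class Mode { Multiply, Clamp };

inline int transform(int x, int factor, Mode mode) {
    if (mode == Mode::Clamp) {
        // Dead once inlined into a caller that always passes Mode::Multiply.
        return x > factor ? factor : x;
    }
    return x * factor;
}

int doubled(int x) {
    // After inlining, the optimizer sees mode == Multiply and factor == 2:
    // the Clamp branch can be removed and the multiply can become x + x
    // or a shift.
    return transform(x, 2, Mode::Multiply);
}

// Quick self-check of the behaviour described above.
void check_transform() {
    assert(doubled(21) == 42);
    assert(transform(10, 4, Mode::Clamp) == 4);
    assert(transform(3, 4, Mode::Clamp) == 3);
}
```

If transform were declared in a header and defined elsewhere without LTO, doubled would have to pass all three arguments and make a real call, and neither simplification would be available.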

Practical Tools

The inline keyword in modern C++ does not, strictly speaking, request inlining. Its purpose, per the standard, is to permit the same function definition to appear in multiple translation units, typically via a header, without violating the one-definition rule. Compilers decide what to inline from their own cost models at -O2 and -O3 and treat the keyword as a weak hint at most. You can verify this by examining assembly output with Compiler Explorer.

When you genuinely need to force inlining, __attribute__((always_inline)) on GCC and Clang, or __forceinline on MSVC, will do it. These override the compiler’s size heuristics and produce larger binaries if overused, but they are appropriate for genuinely hot paths where you know the overhead matters and the function is small.

The inverse is __attribute__((noinline)), which prevents inlining entirely. This is useful for benchmarking, to ensure you are measuring what you intend, and for error paths where code size matters more than speed.
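These attributes are often wrapped in macros so one source builds on all three compilers. The sketch below is one possible arrangement. The macro names FORCE_INLINE and NO_INLINE and the functions are hypothetical; only the underlying compiler attributes are real.

```cpp
#include <cassert>

// Hypothetical macro names wrapping the compiler-specific spellings.
#if defined(_MSC_VER)
  #define FORCE_INLINE __forceinline
  #define NO_INLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
  #define FORCE_INLINE inline __attribute__((always_inline))
  #define NO_INLINE __attribute__((noinline))
#else
  #define FORCE_INLINE inline
  #define NO_INLINE
#endif

// Hot, tiny helper: inlined even where the cost model would decline.
FORCE_INLINE int clamp_to_byte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// Cold error path: kept out of line so it does not bloat its callers.
NO_INLINE void record_out_of_range(int* errors) {
    assert(errors != nullptr);
    ++*errors;  // placeholder for logging or error handling
}

int saturate(int v, int* errors) {
    if (v < 0 || v > 255) record_out_of_range(errors);
    return clamp_to_byte(v);
}
```

Keeping the error path out of line is the design choice here: the common case stays small enough to inline into its own callers, and the rare case pays for a real call.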

For functions with no side effects whose output depends only on their inputs, __attribute__((const)) tells the compiler it can treat calls as mathematical expressions: hoist them out of loops, CSE them, eliminate redundant calls. __attribute__((pure)) is slightly weaker, permitting the function to read but not write global memory. Neither of these requires inlining to take effect, making them valuable for larger functions where inlining would bloat the binary.
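A sketch of both attributes on GCC or Clang. The function names are hypothetical, and the definitions are shown in the same file only to keep the example self-contained. The attributes matter most when the definitions live in another source file, because the declaration alone then carries the guarantee.

```cpp
#include <cassert>
#include <cstddef>

// const: the result depends only on the argument values; no memory is read.
__attribute__((const)) int square(int x);

// pure: may read memory (here, the array) but has no observable side effects.
__attribute__((pure)) int sum(const int* data, std::size_t n);

int sum_scaled(const int* data, std::size_t n, int k) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // square(k) is loop-invariant; the const attribute permits hoisting
        // it out of the loop even when the body of square is not visible.
        total += data[i] * square(k);
    }
    return total;
}

int twice_sum(const int* data, std::size_t n) {
    // No store happens between the two calls, so the pure attribute lets
    // the compiler evaluate sum once and reuse the result.
    return sum(data, n) + sum(data, n);
}

// Definitions, which in a real project might live in a separate source file.
int square(int x) { return x * x; }

int sum(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}
```

The compiler trusts these annotations without checking them, so attaching const or pure to a function that does have side effects can produce wrong code.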

Finally, link-time optimization extends the optimizer’s visibility across translation unit boundaries. With -flto (GCC and Clang) or -flto=thin (Clang’s ThinLTO), the compiler emits intermediate representation into object files, and at link time the optimizer runs over the whole program. Cross-translation-unit inlining, devirtualization, and interprocedural constant propagation all become possible. ThinLTO in particular compiles modules in parallel using summaries, making it practical for large projects where full LTO is too slow. Chrome, Firefox, and LLVM itself ship with ThinLTO enabled in their release builds.

When Inlining Hurts

Inlining is not always the right choice. A function called from many sites, inlined everywhere, bloats the binary and puts pressure on the instruction cache. A large inlined loop body may no longer fit in L1 instruction cache, causing fetch stalls that cost more than the call overhead would have. The compiler’s default heuristics balance inlining benefit against binary size, and for most code they get the trade-off right. GCC refuses to inline a function beyond a size limit counted in “pseudo-instructions”; the figure of roughly 600 is approximate, since the limit is set by tunable parameters such as max-inline-insns-single whose defaults vary between releases. Clang uses a cost model with a default threshold of 225 units.

Profile-guided optimization shifts this balance intelligently. With PGO (-fprofile-use), the compiler has actual call frequency data and inlines hot call sites even past normal size limits, while leaving cold paths as real calls to keep the binary lean.

The Real Price

The Lemire article makes the right observation that small utility functions like add are effectively free because the compiler handles them transparently. The deeper lesson is that this transparency depends entirely on whether the compiler can see the function body. Where it cannot, whether because the function lives in another translation unit without LTO, because it is virtual and the dynamic target is unknown, or because it was explicitly marked noinline, the optimizer must treat the call as a wall.

The call instruction costs a few nanoseconds under warm conditions. The optimization barrier is harder to quantify because its cost scales with what the optimizer would have done otherwise. In a vectorizable loop, the barrier represents the difference between scalar and 8-wide SIMD execution, a throughput reduction that appears nowhere in a profiler, because profilers measure what the program did, not what it was prevented from doing.