A function call on modern x86-64 hardware costs roughly 4 to 10 cycles in isolation. That sounds trivial, and in most application code it is. But Daniel Lemire’s recent writeup on function call costs is a useful entry point for a deeper question: why do compilers work so hard to inline functions, and what are they actually getting out of it?
The call overhead is real, but it is the least interesting part of the answer.
What Actually Happens When You Call a Function
On x86-64 with the System V ABI (used on Linux and macOS), a simple function call follows a fixed protocol. Integer and pointer arguments go into registers RDI, RSI, RDX, RCX, R8, and R9. The CALL instruction pushes the return address onto the stack and jumps to the callee. The callee sets up a stack frame with PUSH RBP and MOV RBP, RSP, does its work, then tears down the frame with POP RBP before executing RET. The RET instruction pops the return address from the stack and jumps back.
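This protocol is visible directly in compiler output. For a trivial two-argument integer function, a compilation with frame pointers enabled looks roughly like the following (Intel syntax; a sketch only, exact output varies by compiler version and flags):

```asm
add:
        push  rbp          ; save the caller's frame pointer
        mov   rbp, rsp     ; establish this function's frame
        mov   eax, edi     ; first integer argument arrives in EDI
        add   eax, esi     ; second argument in ESI; result is returned in EAX
        pop   rbp          ; tear down the frame
        ret                ; pop the return address and jump back
```

At higher optimization levels the frame setup disappears entirely, leaving just the arithmetic and RET, but the CALL/RET round trip at every call site remains.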
For something as simple as int add(int x, int y) { return x + y; }, Agner Fog’s processor optimization manuals put the round-trip overhead at around 4-8 cycles on modern out-of-order cores. The CALL instruction has a throughput of roughly 1 per cycle on recent Intel and AMD microarchitectures. RET is more expensive because it depends on the Return Stack Buffer, a small hardware predictor that stores predicted return addresses; deep call chains can overflow it, leading to mispredictions that cost 10-20 cycles each.
So yes, there is overhead. But consider what Lemire’s example is really illustrating:
int add(int x, int y) {
return x + y;
}
int add3(int x, int y, int z) {
return add(add(x, y), z);
}
Compared directly to:
int add3(int x, int y, int z) {
return x + y + z;
}
The savings from eliminating two CALL/RET pairs are measurable but modest. What changes the picture entirely is what else the compiler can do once add is inlined and the full expression x + y + z is visible.
The Call Boundary as an Optimization Barrier
A function call, from the compiler’s perspective, is an opacity boundary. Without inlining, the compiler must assume the callee can read and write arbitrary memory, modify global state, and has side effects that prevent reordering. Even with __attribute__((pure)) or __attribute__((const)) annotations in GCC and Clang, the compiler’s ability to optimize across the call is limited.
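The limits of those annotations are worth seeing concretely. In the sketch below, marking a function `const` lets the compiler merge repeated calls, but it still cannot look inside the callee to simplify further (the function names here are illustrative; in the cross-TU scenario the paragraph describes, `f` would be an external declaration rather than a local definition):

```c
/* __attribute__((const)) promises: no side effects, reads no memory,
   result depends only on the arguments. That is enough to merge the
   two calls in g into one, but not enough to fold f's body into g. */
__attribute__((const)) static int f(int x) {
    return x * x + 1;
}

int g(int x) {
    /* With the attribute, this can compile to 2 * f(x): one CALL, not two.
       Without inlining, the compiler still cannot simplify past the call. */
    return f(x) + f(x);
}
```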
Inlining collapses that boundary. Once the callee’s body is substituted at the call site, the optimizer sees the combined computation as a single unit. This enables:
- Constant propagation: if any argument is a compile-time constant, the compiler can fold it through the inlined body, potentially eliminating entire branches
- Dead code elimination: branches in the inlined function that are provably unreachable with the known argument values get removed entirely
- Alias analysis: the compiler can reason about whether pointers in the inlined body alias with those in the caller, enabling load/store reordering
- Loop optimizations: if the call site is inside a loop, the compiler can apply loop-invariant code motion, unrolling, and vectorization
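The first two effects can be sketched in a few lines (the helper and its `mode` parameter are illustrative, not from the original):

```c
/* A helper whose behavior branches on a mode flag. */
static inline int scale_mode(int x, int mode) {
    if (mode == 0) return x;      /* identity */
    if (mode == 1) return x * 2;  /* double */
    return x * 4;                 /* quadruple */
}

int fast_path(int x) {
    /* mode is the compile-time constant 0 here. After inlining,
       constant propagation resolves both comparisons, and dead code
       elimination deletes the multiply branches: at -O2 this function
       compiles to the same code as `return x;`. */
    return scale_mode(x, 0);
}
```

Without inlining, the branch on `mode` would be evaluated at runtime on every call even though its outcome never changes.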
The last point deserves a concrete example.
Inlining as a Gate to SIMD Vectorization
Consider a loop that applies a per-element transformation:
float scale(float x) {
return x * 2.5f;
}
void transform(float *arr, int n) {
for (int i = 0; i < n; i++) {
arr[i] = scale(arr[i]);
}
}
If scale is defined in a separate translation unit and link-time optimization is not enabled, the compiler compiling transform has no idea what scale does. It cannot vectorize the loop; every iteration makes a function call, hands over one float, gets one float back. You get n calls per invocation.
With scale inlined, the compiler sees a simple multiply-by-constant inside the loop. It emits AVX2 code using VMULPS, processing 8 floats per instruction in a 256-bit register. With AVX-512 on supported CPUs, that jumps to 16 floats per instruction. The difference between the scalar call-per-element version and the vectorized version can easily be 6-10x on arrays that fit in cache; on larger arrays, memory bandwidth caps the gain.
You can verify this interactively on Godbolt Compiler Explorer. Mark scale with __attribute__((noinline)) and compile with -O3 -mavx2. The loop body contains a CALL instruction; auto-vectorization cannot proceed. Remove the attribute, and the compiler emits VMULPS instructions with a vectorized loop preamble handling alignment and remainder. The generated code for the two versions looks nothing alike.
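A minimal reproduction of that experiment (the file name in the comment is illustrative):

```c
/* Build with: gcc -O3 -mavx2 -S vec_demo.c
   With the attribute present, the loop in transform contains a CALL and
   stays scalar; delete the attribute line and the loop vectorizes. */
__attribute__((noinline))
static float scale(float x) {
    return x * 2.5f;
}

void transform(float *arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] = scale(arr[i]);
}
```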
The function call overhead you saved was maybe 4 cycles per element. The vectorization that inlining unlocked cuts the per-element cost by nearly an order of magnitude, and that gain compounds across the whole array. That is where the real win is.

How Compilers Decide What to Inline
Neither GCC nor Clang inlines blindly. Both maintain cost models that estimate whether inlining a given call site will produce net benefit.
GCC controls its inliner with -finline-limit (default: 600), which sets an approximate upper bound on the number of internal pseudo-instructions a function can have and still be a candidate. The actual decision also weighs call frequency, estimated speedup, and growth in code size. Individual functions can be forced with __attribute__((always_inline)) or excluded with __attribute__((noinline)).
Clang’s inliner is LLVM’s, and it operates on IR instruction count with a default threshold of around 225 for most call sites. Hot call sites detected via profile data get a higher threshold. The flag -mllvm -inline-threshold=N overrides this. Like GCC, Clang also nets out expected simplifications from the inlining decision: a function that will be reduced to near-nothing after constant folding costs less to inline than its raw size suggests.
C++20 does not standardize a [[always_inline]] attribute. The inline keyword in modern C++ is primarily an ODR (One Definition Rule) annotation telling the linker that multiple identical definitions are allowed; compilers largely ignore it as an inlining hint. For portable forced inlining, __attribute__((always_inline)) on GCC and Clang and __forceinline on MSVC are the practical options. If you need this in a header meant for multiple compilers, a portability macro is the usual approach.
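A typical version of that macro looks like this (the name FORCE_INLINE is a common convention, not anything standard):

```c
/* Portable forced-inlining macro: picks the right spelling per compiler. */
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline   /* fall back to a plain hint */
#endif

/* Usage in a header: internal linkage plus forced inlining. */
static FORCE_INLINE int add(int x, int y) {
    return x + y;
}
```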
Cross-Translation-Unit Inlining with LTO
The fundamental limitation of inlining is translation unit scope. A function in math.cpp is a black box to main.cpp at compile time, unless you put it in a header. Link-Time Optimization removes this restriction by preserving compiler IR through the object file and deferring final optimization to link time, when all translation units are visible.
GCC’s -flto and Clang’s -flto=thin (ThinLTO) both enable this. ThinLTO is particularly production-friendly: it uses per-module summaries to drive cross-module inlining decisions without loading the entire program’s IR at once, keeping link times manageable. Most major build systems support LTO as a first-class option, and CMake exposes it via CMAKE_INTERPROCEDURAL_OPTIMIZATION=ON.
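In practice the build looks like this with GCC (file names are illustrative; Clang is analogous with -flto=thin):

```
# Each object file carries compiler IR in addition to (or instead of) machine code
gcc -O2 -flto -c math.c -o math.o
gcc -O2 -flto -c main.c -o main.o

# Final optimization, including cross-TU inlining, happens at this link step
gcc -O2 -flto math.o main.o -o app
```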
Static libraries compiled with LTO bitcode (using -flto at compile time) let consumers who also compile with LTO see through the library’s function boundaries. This is how you can ship a library that behaves like a header-only library for optimization purposes without requiring users to compile your source directly.
How Other Languages Handle This
Rust’s approach is explicit. The #[inline] attribute is a hint that the function should be considered for cross-crate inlining, and #[inline(always)] and #[inline(never)] are stronger directives. Without #[inline], functions in one crate are typically opaque to callers in another crate unless LTO is enabled. The Rust standard library aggressively annotates hot functions with #[inline] precisely because library consumers cannot be assumed to have LTO active.
Rust also runs an inlining pass at the MIR (Mid-level Intermediate Representation) level before lowering to LLVM IR. This allows simplifications that interact with Rust’s type system, such as removing dead branches in generic code where type information is known.
Go takes the opposite design position. The Go compiler inlines aggressively by default, with an inlining budget measured in abstract syntax tree complexity (approximately 80 nodes). Users have no attribute equivalent; the //go:noinline compiler directive exists but is used almost exclusively in the Go runtime itself. The tradeoff is that Go users get consistent optimization without thinking about it, but they also cannot force inlining when the heuristic fails.
The Downside: I-Cache Pressure and Code Bloat
Inlining every call site of every function would be counterproductive. Inlining multiplies the instruction footprint of a function by the number of its call sites. For large functions or functions called in many cold paths, the expanded code competes with hotter code for space in the L1 instruction cache.
The canonical sign of over-inlining is a performance regression on workloads where the newly bloated callers are cold: displaced hot code causes more L1 I-cache misses globally, and the cost of those misses outweighs the saved call overhead. This effect is hard to catch without profiling, because synthetic microbenchmarks typically do not exercise the I-cache pressure that realistic workloads produce.
Profile-guided optimization (PGO) addresses this by conditioning inlining decisions on measured call frequency rather than static heuristics. GCC supports PGO via -fprofile-generate and -fprofile-use. Clang uses -fprofile-instr-generate and -fprofile-instr-use (or -fprofile-generate for GCC-compatible instrumentation). When PGO data is present, the inliner promotes hot call sites aggressively and holds back on cold ones, giving you the benefits of inlining where it matters without the code bloat where it does not.
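The GCC workflow, sketched (the workload input here is a placeholder for whatever representative run you have):

```
# 1. Build with instrumentation
gcc -O2 -fprofile-generate app.c -o app

# 2. Run a representative workload; .gcda profile files are written on exit
./app representative_workload

# 3. Rebuild; the inliner now sees measured call frequencies
gcc -O2 -fprofile-use app.c -o app
```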
Practical Guidance
For most application code, compiler defaults with -O2 or -O3 handle inlining well enough. Where you should pay attention:
- If profiling shows a function dominating runtime, and that function contains an inner loop that calls small helpers, check whether those helpers were inlined. Inspect the assembly or use a compiler report flag like GCC's -fopt-info-inline.
- If the hot function and its callees are in separate translation units, enable LTO or move them together.
- If you see CALL instructions inside a tight loop in the assembly output, consider marking the callee always_inline or restructuring so the compiler can see the full computation.
- If you are adding a noinline attribute to benchmark a function in isolation, remember that the results represent a pessimistic lower bound; the production path with inlining enabled may look very different.
The central insight from Lemire’s piece scales well beyond the toy example: the call overhead itself is small enough that saving it rarely matters. What matters is whether the compiler has the visibility to apply the transformations that actually move performance numbers. Inlining is how you give it that visibility.