Daniel Lemire’s recent isocpp post on function call cost uses a tidy example to make the point that calling add(x, y) in a loop is not the same as writing x + y inline. The compiler can eliminate the overhead by inlining, and modern compilers do this well. But the article leaves the most interesting part on the table: the cycles spent on CALL and RET are almost never what you should worry about. The real story is about what a non-inlined function call prevents the compiler from doing.
What a Function Call Actually Costs
On a modern x86-64 CPU like Intel Skylake, a predicted direct call and its matching return cost only a few cycles by themselves. The CALL instruction pushes the return address and jumps; the RET pops and jumps back. With the CPU’s Return Stack Buffer (RSB) warm, RET is nearly free for prediction purposes. Add in the callee’s prologue and epilogue and the callee-saved register spills and restores mandated by the System V AMD64 ABI, and a trivial function call costs somewhere between 6 and 12 cycles round-trip under ideal conditions.
That is real overhead. At 3 GHz, 10 cycles is about 3.3 nanoseconds. Run a loop 100 million times and you’ve spent 330 milliseconds just on call mechanics. Agner Fog’s instruction tables make this concrete: a predicted CALL/RET pair is cheap on its own, but RET turns into a mispredicted branch costing 15+ cycles when the RSB is cold or the call depth exceeds its capacity (typically 16 to 32 entries on modern Intel).
But here is the thing: a scalar loop without any function calls runs at roughly 1 cycle per element for simple arithmetic. The call overhead adds maybe 4 to 8 cycles on top of that, making it 5 to 9 cycles per iteration. That is a slowdown, but it is not catastrophic.
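A minimal sketch of the two loops under comparison (the function names here are mine, not Lemire’s). The noinline attribute forces GCC and Clang to keep a real CALL/RET in the first loop even at -O3, so the overhead survives for benchmarking:

```cpp
#include <cstddef>

// Kept out of line: forces a genuine CALL/RET per element on GCC/Clang.
__attribute__((noinline)) float add(float x, float y) { return x + y; }

// Pays call overhead on every element; the call also blocks vectorization.
void add_called(float* out, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) out[i] = add(a[i], b[i]);
}

// Same arithmetic written inline: no call mechanics, free to vectorize.
void add_inlined(float* out, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
}
```

Timing the two under perf makes the 5-to-9-cycles-per-iteration figure above directly observable.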
The real catastrophe is what you lost before you even started counting cycles.
The Vectorization Barrier
Modern CPUs have SIMD units that can process multiple data elements in a single instruction. AVX2, available on Intel processors since Haswell (2013) and AMD since Zen 1 (2017), operates on 256-bit vectors, which is eight single-precision floats or four doubles at a time. A loop that adds arrays of floats can run at 0.1 to 0.25 cycles per element when the compiler generates VMULPS or VADDPS instructions.
That 0.1 to 0.25 cycles/element versus 6+ cycles/element is roughly a 25 to 60x difference in throughput.
The compiler’s auto-vectorizer needs to see the entire loop body to apply this transformation. It needs to confirm that loop iterations are independent, that there are no hidden side effects, and that the operation maps cleanly onto a SIMD instruction. When a function call appears in the loop body, and the compiler cannot see the callee’s definition, it has to assume the worst: the function might read and write arbitrary memory, might have side effects, might create loop-carried dependencies through global state. The conservative assumption blocks vectorization entirely.
This is the ABI boundary problem. The System V AMD64 calling convention passes float arguments in scalar XMM registers, one value per register. There is no standard mechanism for an auto-vectorizer to rewrite a loop’s function calls into equivalent SIMD vector operations. The function signature is a contract: you pass one float, you get one float back. Vectorization would require passing eight floats at once, which is a different signature with different semantics.
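To make the signature mismatch concrete, here is a sketch using GCC/Clang vector extensions (the v8sf typedef and both function names are mine). The second signature is what a vectorizer would need the contract to be, and it is simply a different function:

```cpp
// Scalar contract: one float in an XMM register, one float back.
float square(float x) { return x * x; }

// The contract a vectorizer would need instead: eight floats at once,
// packed into one 256-bit YMM register. GCC/Clang vector extensions
// let you write this signature directly.
typedef float v8sf __attribute__((vector_size(32)));  // 8 x float = 256 bits

v8sf square8(v8sf x) { return x * x; }  // a single vmulps when built with -mavx
```

No standard mechanism rewrites calls to the first into calls to the second; only inlining dissolves the boundary.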
Inlining eliminates the ABI boundary. Once the compiler substitutes the function body at the call site, it sees x * x instead of square(x) and can treat all the loop iterations as independent SIMD-eligible operations. The assembly changes from:
.loop:
vmovss xmm0, [rsi + rax*4]
call square ; scalar, one element per call
vmovss [rdi + rax*4], xmm0
inc rax
cmp rax, rcx ; rcx holds the element count
jl .loop
to:
.loop:
vmovups ymm0, [rsi + rax] ; load 8 floats
vmulps ymm0, ymm0, ymm0 ; multiply 8 in parallel
vmovups [rdi + rax], ymm0 ; store 8 floats
add rax, 32
cmp rax, rcx ; rcx holds the byte count
jl .loop
Eight elements per iteration instead of one, with no call overhead at all. The inlining decision triggered the entire optimization cascade.
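A sketch of C++ source that produces both versions side by side (the noinline variant and the function names are mine, added so one translation unit shows both outcomes):

```cpp
#include <cstddef>

// Out of line: the loop below must issue a scalar CALL per element.
__attribute__((noinline)) float square_opaque(float x) { return x * x; }

// Inlinable: the loop below collapses to vmulps at -O3 -mavx2.
inline float square(float x) { return x * x; }

void map_scalar(float* out, const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) out[i] = square_opaque(in[i]);
}

void map_simd(float* out, const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) out[i] = square(in[i]);
}

// Inspect the difference with:  g++ -O3 -mavx2 -S file.cpp
// or ask the vectorizer why:    g++ -O3 -fopt-info-vec-missed -c file.cpp
```

GCC’s -fopt-info-vec-missed report will name the call in map_scalar as the reason vectorization was abandoned.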
When Compilers Inline and When They Don’t
GCC and Clang inline aggressively at -O2 and above, using cost models based on estimated instruction counts. GCC’s max-inline-insns-auto parameter caps the automatic heuristic at a few dozen internal (GIMPLE) instructions, with the exact default varying by release; functions below the threshold inline freely. Clang uses a similar model with a default inlining threshold of 225 cost units.
The inline keyword in C++ is mostly a linkage and ODR matter at this point. Modern compilers ignore it as an inlining hint and make their own decisions. If you need a guarantee, __attribute__((always_inline)) on GCC and Clang (MSVC’s equivalent is __forceinline) is the only reliable option:
__attribute__((always_inline))
inline float square(float x) { return x * x; }
With this attribute, the compiler inlines at every call site regardless of the cost model. If inlining is structurally impossible (recursive functions being the main exception), the compiler emits a diagnostic, an error on GCC, instead of silently falling back to a call.
For functions that call many small helpers, GCC provides __attribute__((flatten)), which tells the compiler to inline everything the marked function calls, recursively:
__attribute__((flatten))
void process_pixels(Pixel* out, const Pixel* in, int n) {
for (int i = 0; i < n; i++) {
out[i] = clamp(gamma_correct(to_linear(in[i]))); // all inlined
}
}
This saves you from annotating every helper individually and is more reliable than hoping -O3 will do it. Clang has partial support for flatten as of LLVM 16; MSVC has no equivalent.
Cross-Translation-Unit Inlining with LTO
All of the above assumes the compiler can see the function definition. If square() lives in math.cpp and your loop is in compute.cpp, the compiler compiling compute.cpp sees only a declaration. Even with aggressive settings, it cannot inline across translation unit boundaries without Link-Time Optimization.
LTO (-flto on GCC and Clang) solves this by embedding intermediate representation into object files and running the optimizer at link time, after all translation units have been combined. At that point the compiler sees the entire program and can inline freely across file boundaries.
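A sketch of the two-file setup, concatenated into one listing for brevity (file and function names follow the article’s example; the build commands in the comments are the standard invocations):

```cpp
#include <cstddef>

// --- math.cpp ---
float square(float x) { return x * x; }

// --- compute.cpp ---
// Without LTO, this TU would see only the declaration: float square(float);
float sum_squares(const float* in, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; i++) s += square(in[i]);
    return s;
}

// Separate compilation, no cross-TU inlining:
//   g++ -O3 -c math.cpp compute.cpp && g++ math.o compute.o
// Full LTO, inlining deferred to link time:
//   g++ -O3 -flto math.cpp compute.cpp
// ThinLTO on Clang (commonly paired with lld):
//   clang++ -O3 -flto=thin -fuse-ld=lld math.cpp compute.cpp
```

With -flto the linker-driven optimizer sees both definitions at once, inlines square() into the loop, and the vectorization cascade from earlier applies across the file boundary.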
Clang’s ThinLTO (-flto=thin) is the production-ready version of this. Rather than combining all IR into one monolithic module, ThinLTO shares lightweight function summaries across TUs and makes inlining decisions based on those summaries, then compiles each TU mostly independently. It scales to millions of lines of code with roughly 20 to 50% slower link times versus non-LTO builds. Google, Meta, and Apple use ThinLTO in production; reported speedups range from 5 to 15% on large C++ codebases, almost entirely from cross-TU inlining and the subsequent dead code elimination it enables.
The Flip Side
None of this means you should mark everything always_inline. Inlining has real costs.
Every inlined function body is replicated at its call site. A function called from 50 places that was 40 instructions becomes 2000 instructions of additional code. That code has to fit in the instruction cache. A 32KB L1i cache with 64-byte cache lines holds 512 lines. Thrashing the icache costs 10 to 15 cycles per miss. An over-inlined hot loop that evicts its own code from the icache can end up slower than a clean non-inlined version.
Profiling is the only reliable guide. perf stat’s top-down metrics will show you frontend-bound stalls if icache pressure is an issue. The pattern to watch for is a loop that looks like it should be fast but shows high L1-icache-load-misses counts. In that case, pulling some helpers out of line and accepting the call overhead is the right trade.
Lemire’s broader point across his work on low-level performance is that measuring in isolation is dangerous. Measuring the cost of a function call without measuring what the compiler stopped doing is like measuring the time it takes to open a car door without noticing the car is facing a wall.
Practical Takeaways
If you have tight loops that process arrays of numbers, the function call question matters. The decision tree is roughly:
- If the loop body is all in one translation unit and the helpers are small, -O2 inlining will usually handle it automatically. Check the assembly if you’re unsure.
- If you need a guarantee that inlining happens, use __attribute__((always_inline)) on the callee. Do not rely on the inline keyword for this.
- If helpers are spread across translation units, enable LTO. ThinLTO is almost always the right choice for non-trivial codebases.
- If you have a single hot entry point that calls many helpers you want fully inlined, __attribute__((flatten)) on GCC is a clean way to express that intent.
- If your hotspot has already been inlined and is still slow, look at icache pressure before reaching for more inlining.
The original Lemire article makes the case clearly with a minimal example: add3 calling add versus add3 doing it inline. The difference the compiler produces from those two versions is the concrete demonstration of the principle. What the assembly doesn’t show directly is the SIMD potential waiting behind that inlining decision. That part is where the real cycles live.