
Function Call Overhead Is Mostly About What the Optimizer Can't See

Source: isocpp

Daniel Lemire’s recent post on IsoC++ makes a point that sounds simple: function calls are not free, and in tight loops their overhead can dominate. He forces a trivial add function to stay non-inline with __attribute__((noinline)) and shows it runs 4–8x slower than the equivalent inlined code. The example is clean and the point lands.

But the cycle count from call and ret is not the main story. The raw overhead of a direct call on x86-64 is 3–5 cycles, assuming a warm instruction cache, perfect branch prediction, and no frame setup. With a standard prologue it reaches maybe 10–12 cycles. At 3 GHz, that is around 4 nanoseconds. In a loop doing 2 nanoseconds of actual work per iteration, that hurts. It is not, however, what causes the 4–8x gaps Lemire observes. What causes those gaps is that when the compiler cannot see the body of the callee, it loses its ability to optimize the caller.

The compiler’s optimizer does not operate on the call instruction in isolation. It operates on regions of code it can reason about holistically. A function call boundary is an opacity wall. Everything beyond it is unknown.

Four Optimizations That Break at a Call Boundary

Auto-vectorization. The SIMD auto-vectorizer needs to know, for each iteration of a loop, that operations are independent and map to vector instructions. An opaque call makes that impossible. The x86-64 ABI passes the first float argument in the low lane of XMM0; a vectorized version would need to push 8 floats at a time through a completely different interface. The compiler cannot transform the loop.

Remove the call boundary and the situation changes entirely. With -O3 -mavx2, a loop that squares an array of floats goes from this:

.loop:
  vmovss xmm0, [rsi + rax*4]
  call   square                 ; one element per call
  vmovss [rdi + rax*4], xmm0
  inc    rax
  cmp    rax, rcx               ; rcx holds the element count
  jl     .loop

To this:

.loop:
  vmovups ymm0, [rsi + rax]    ; load 8 floats at once
  vmulps  ymm0, ymm0, ymm0     ; multiply 8 in parallel
  vmovups [rdi + rax], ymm0    ; store 8 floats
  add     rax, 32              ; advance 32 bytes
  cmp     rax, rcx             ; rcx holds the byte count
  jl      .loop

The throughput difference on AVX2 is 4–8x. On AVX-512, up to 16x. This dwarfs the cost of the call and ret instructions.

Alias analysis. Every pointer write is potentially a write to memory that you’re reading. The compiler needs to prove writes and reads don’t overlap to reorder them safely. An opaque function call destroys that proof. The compiler must assume the callee writes to any global or any pointer it might have access to. This causes values to be reloaded from memory after every call, even if you know the callee never touches them.

Two GCC attributes address this for functions where the contract genuinely holds:

// Reads no memory, writes no memory
__attribute__((const))
float fast_reciprocal(float x);

// Reads memory, writes none
__attribute__((pure))
float compute_norm(const float* arr, int n);

With const or pure, the compiler can hoist calls out of loops and eliminate redundant reloads without needing to see the body.

Constant propagation. If you call a function with a literal zero, the compiler can sometimes determine at the call site that the return value is also zero and eliminate downstream computation. Without seeing the body, that chain breaks. The optimization disappears even when the math would make it obvious to a human reader.

Loop invariant code motion. An expression inside a loop whose operands do not change across iterations can be hoisted so it is evaluated once, before the loop begins. If that expression involves an opaque call, the compiler cannot prove the result is invariant. It recomputes it on every iteration.

These four effects combine. A hot loop with an opaque call in the body pays for the call/ret overhead once; it pays for the loss of vectorization, alias analysis, constant folding, and hoisting every time.

How the Compiler Decides to Inline

Clang/LLVM maintains a cost model with a default threshold of 225 IR cost units at -O2 and 275 at -O3. GCC’s threshold is roughly 30–40 GIMPLE instructions for auto-inlining, with the overall limit controlled by -finline-limit, defaulting to 600. The inline keyword is primarily an ODR annotation; both compilers treat it as a weak hint at best.

To override the model and force inlining for a specific function:

// GCC and Clang — requires the inline keyword on GCC for cross-TU use
__attribute__((always_inline)) inline float scale(float x, float factor) {
    return x * factor;
}

// C++11 attribute syntax
[[gnu::always_inline]] inline float scale(float x, float factor) {
    return x * factor;
}

// MSVC
__forceinline float scale(float x, float factor) {
    return x * factor;
}

GCC has a less-discussed attribute that goes further: __attribute__((flatten)) recursively inlines all callees of the marked function, turning a complex call graph into a single compiled unit for the optimizer to work on. Clang has partial support as of LLVM 16.

To see what the compiler is actually doing, both GCC and Clang have diagnostic flags:

# Clang: report every inlining decision
clang++ -O2 -Rpass=inline -Rpass-missed=inline file.cpp

# GCC
g++ -O2 -fopt-info-inline -fopt-info-inline-missed file.cpp

# Find missed vectorizations
clang++ -O3 -Rpass-missed=loop-vectorize file.cpp

The missed vectorization output will often tell you exactly why the vectorizer gave up: "call instruction cannot be vectorized" is a direct consequence of a non-inlined call in the loop body.

The simdjson Case

The most well-known production use of forced inlining for vectorization is simdjson, the JSON parser Lemire co-authored. The library defines:

#define really_inline __attribute__((always_inline)) inline

And applies it pervasively through the hot parsing path. The result is a single inlined body for the auto-vectorizer to work on. simdjson achieves 2.5–3.5 GB/s of JSON parsing throughput; conventionally structured parsers sit around 0.5 GB/s. The difference is not algorithmic. The core operations are similar. The difference is optimizer visibility.

std::function and the Type Erasure Problem

The same visibility problem appears in idiomatic C++ through std::function. A concrete lambda passed to a template function preserves its type and is inlinable:

// Vectorizable: the lambda's type is visible
auto transform = [](float x) { return x * 2.0f; };
for (int i = 0; i < N; ++i) data[i] = transform(data[i]);

// Also vectorizable: template parameter preserves type
template<typename Fn>
void apply(std::vector<float>& v, Fn fn) {
    for (float& x : v) x = fn(x);
}

Wrap it in std::function and the type is erased behind an internal function pointer:

// NOT vectorizable: opaque indirect call
std::function<float(float)> transform = [](float x) { return x * 2.0f; };
for (int i = 0; i < N; ++i) data[i] = transform(data[i]);

Under Spectre mitigations with retpoline, indirect calls cost 30–80 cycles per invocation on Intel hardware prior to Cascade Lake. The eIBRS mitigation available on newer Intel cores brings that down to 4–6 cycles. For server-side C++ code processing high-throughput streams with callbacks, std::function in a hot path is a meaningful cost even before considering vectorization.

When Not to Force Everything Inline

Inlining increases code size. Larger code means more instruction cache pressure. A function called from many call sites, if inlined everywhere, duplicates the code at each site. That duplication can evict other code from L1i. The resulting cache misses cost 10–40 cycles each, potentially costing more than the call overhead would have.

The practical rule is: force inline small, frequently called functions that sit inside hot loops. For error handling paths, less-frequent utility code, or anything that grows large when inlined, let the compiler’s cost model do its job.

GCC’s __attribute__((cold)) annotates functions that should be placed in a cold section of the binary, out of the way of the hot instruction stream:

[[gnu::noinline, gnu::cold]]
void handle_parse_error(const char* msg, size_t offset);

This preserves the hot function’s instruction cache footprint while keeping the error path callable. Combined with C++20’s [[unlikely]] at the call site, the compiler also lays out the branch so the error path sits off the fall-through path.

Across Translation Units

The visibility problem scales to multi-file projects. A function defined in another .cpp file is completely opaque at the call site without link-time optimization. The compiler cannot inline it, propagate constants through it, or analyze its aliasing behavior.

LTO, specifically Clang’s ThinLTO, fixes this. It compiles each source file to LLVM IR with a summary, then at link time identifies profitable cross-module inlining candidates and imports the needed IR:

clang++ -O2 -flto=thin file1.cpp file2.cpp -o app

Typical speedups on call-heavy code are 5–20%. Google reports 10–20% gains in production workloads using ThinLTO plus PGO versus plain -O2.

Rust has an interesting wrinkle here. Unlike C++, a non-generic Rust function without #[inline] is not inlined across crate boundaries in an ordinary build, because its IR is not exported into crate metadata without the attribute (full link-time optimization can recover the inlining, at the cost of slower builds). A hot utility function in a library crate silently blocks vectorization in all downstream callers unless explicitly marked #[inline] or #[inline(always)].

The Frame to Take Away

Lemire’s add/add3 example is correct, and the measurement is real. The surface-level lesson is that small functions called in tight loops should be inlined. The deeper lesson is why: not primarily because call and ret cost cycles, but because every opaque function call boundary reduces the optimizer’s world to a smaller, more constrained space. Vectorization requires a visible body. Alias analysis requires visible writes. Constant propagation requires visible arithmetic. The moment a boundary appears, those guarantees disappear, and the optimizer falls back to conservative assumptions.

Knowing this changes how you read performance-critical code. The really_inline macro in simdjson is not a micro-optimization. It is the thing that lets the compiler treat an entire parsing stage as one unit of analysis.
