
What Your Profiler Won't Tell You About Function Call Costs

Source: isocpp

The direct cost of a function call on modern x86-64 hardware is about 10 to 15 cycles for a well-predicted direct call to a cached function. At 3 GHz, that is roughly 3 to 5 nanoseconds. It is real, attributable overhead, and a profiler will show it to you if you look. Daniel Lemire’s article on isocpp.org measures this directly, and the numbers match what the hardware documentation predicts.

The more interesting cost is the one that does not appear in your profiler as “function call.” It appears as a loop that runs at scalar speed instead of vector speed, with no column in your perf output labeled “missed vectorization opportunity.” Understanding the difference between these two costs changes how you reason about performance-sensitive code.

What the CPU actually does

When a C program calls a function, the processor executes call, which pushes the 8-byte return address onto the stack and jumps to the callee. The callee saves whatever callee-saved registers it needs (RBX, RBP, R12-R15 on the System V AMD64 ABI), does its work, restores those registers, and executes ret, which pops the return address and jumps back. The first six integer arguments go in RDI, RSI, RDX, RCX, R8, and R9; floating-point arguments go in XMM0 through XMM7. Caller-saved registers may be clobbered by the callee, so the caller must spill anything it wants to keep alive across the call. That spill-and-reload cost accumulates on every iteration of a loop.
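As a rough illustration of the register assignment described above, the sketch below annotates a function with where each argument would travel when the call is not inlined. The function name and body are hypothetical; the mapping assumes the System V AMD64 ABI (Linux, macOS) and does not apply to Windows x64.

```cpp
#include <cstdint>

// Hypothetical function, annotated with the register each argument uses
// under the System V AMD64 ABI when a real call is emitted.
double weighted_sum(std::int64_t a,  // RDI
                    std::int64_t b,  // RSI
                    std::int64_t c,  // RDX
                    std::int64_t d,  // RCX
                    std::int64_t e,  // R8
                    std::int64_t f,  // R9
                    double scale,    // XMM0
                    double bias)     // XMM1
{
    // A double result is returned in XMM0; an integer result would use RAX.
    return static_cast<double>(a + b + c + d + e + f) * scale + bias;
}
```

Integer and floating-point arguments are counted separately, which is why scale lands in XMM0 even though it is the seventh parameter.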

Agner Fog’s instruction tables put the call/ret pair at roughly 3 to 4 cycles each on Skylake-class hardware, with the return address prediction handled by the Return Stack Buffer. An indirect call or virtual dispatch is predicted by the indirect branch predictor instead, since the RSB only handles returns; when the target is mispredicted, the pipeline flush adds another 15 to 20 cycles.

So yes, function calls cost something. But a loop processing millions of elements at 15 cycles per call is fast enough to be invisible on a quick benchmark run. The trouble starts when you compare it to what the loop could have been.
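One way to see the direct cost for yourself is a small microbenchmark. The following is a minimal sketch, not a rigorous measurement: the function names are hypothetical, the noinline attribute uses GCC/Clang syntax, and the loop is a dependent chain, so the numbers include multiply latency as well as call overhead.

```cpp
#include <chrono>
#include <cstdint>

// noinline keeps a real call in the loop (GCC/Clang attribute syntax).
__attribute__((noinline)) std::uint64_t step_call(std::uint64_t x) {
    return x * 6364136223846793005ULL + 1442695040888963407ULL;
}

// Same arithmetic, left visible so the compiler is free to inline it.
inline std::uint64_t step_inline(std::uint64_t x) {
    return x * 6364136223846793005ULL + 1442695040888963407ULL;
}

// Runs `iters` dependent iterations of `f` and returns nanoseconds per
// iteration. `sink` receives the final value so the loop is not dead code.
template <typename F>
double ns_per_iter(F f, std::uint64_t iters, std::uint64_t& sink) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t x = 1;
    for (std::uint64_t i = 0; i < iters; ++i) x = f(x);
    auto stop = std::chrono::steady_clock::now();
    sink = x;
    return std::chrono::duration<double, std::nano>(stop - start).count() /
           static_cast<double>(iters);
}
```

Calling ns_per_iter once with a lambda that wraps step_call and once with a lambda that wraps step_inline, over a few hundred million iterations, gives two per-iteration times whose difference approximates the call overhead on the machine at hand.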

The invisible cost

Consider a simple loop:

float square(float x) { return x * x; }

void square_array(float* out, const float* in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = square(in[i]);
}

With square compiled in a separate translation unit, the compiler sees an opaque call. It cannot look inside square, so it cannot prove anything about the function’s behavior: does it read from out? Does it write to in? Does it have side effects that must be sequenced? Unable to answer these questions, the compiler produces a scalar loop:

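.loop:
    vmovss  xmm0, [r12 + rbx*4]    ; load in[i]; pointers and index live in callee-saved registers
    call    square                 ; argument and result both travel in xmm0
    vmovss  [r13 + rbx*4], xmm0    ; store out[i]
    inc     rbx
    cmp     rbx, r14               ; r14 holds n
    jl      .loop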

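Now make square visible at the call site by defining it in the same translation unit or in a header, optionally marked __attribute__((always_inline)). The attribute alone is not enough: the body still has to be visible where the call is compiled. With -O3 -mavx2, the compiler produces a main loop along these lines: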

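.loop:
    vmovups ymm0, [rsi + rax]      ; load eight floats from in
    vmulps  ymm0, ymm0, ymm0       ; square all eight lanes at once
    vmovups [rdi + rax], ymm0      ; store eight floats to out
    add     rax, 32                ; advance by 32 bytes (eight floats)
    cmp     rax, rcx               ; rcx holds the byte length of the vectorized part
    jl      .loop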

The vmulps instruction on an AVX2-capable processor squares eight floats simultaneously. The loop body that processed one element per call now processes eight elements per instruction. That is not a 10-to-15-cycle improvement from eliminating a call; it is a qualitative change in what the hardware is doing.

A profiler sampling the non-inlined version will show hot time in square and hot time in the calling loop. It will not label anything “vectorization was available and unused.” It will not tell you that the loop is running up to eight times slower than the hardware allows. You will see a slow loop and no obvious clue as to why.

You can verify this on Compiler Explorer by toggling __attribute__((noinline)) on a small function and watching the vectorized loop appear and disappear in the assembly output.
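A minimal sketch of that experiment, assuming GCC or Clang: the OPAQUE_SQUARE macro and SQUARE_ATTR name are hypothetical, chosen here only to switch the attribute on and off from the command line.

```cpp
// Build with -DOPAQUE_SQUARE to force a real call inside the loop;
// build without it to let the compiler inline and vectorize.
#ifdef OPAQUE_SQUARE
#define SQUARE_ATTR __attribute__((noinline))
#else
#define SQUARE_ATTR
#endif

SQUARE_ATTR float square(float x) { return x * x; }

void square_array(float* out, const float* in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = square(in[i]);
}
```

Compiling both ways with -O3 -mavx2 and comparing the assembly for square_array should show the call instruction in one build and packed ymm instructions in the other.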

What stops working at a call boundary

Vectorization is the most dramatic consequence, but the optimizer loses several other capabilities at any opaque call site.

Alias analysis degrades to worst-case assumptions. The compiler must assume that any callee could have modified any pointer-accessible memory, which forces spilling live values before the call and reloading them afterward, and prevents combining or reordering memory operations across calls.
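The sketch below shows why the worst-case assumption is not paranoia. The names are hypothetical, and noinline (GCC/Clang syntax) stands in for a callee whose body lives in another translation unit.

```cpp
int g_total = 0;  // global state the callee is allowed to touch

// Stand-in for an opaque callee.
__attribute__((noinline)) void record_event() { ++g_total; }

// The value at *p must be loaded twice: the call in between may have
// changed it, and when p points at g_total it really does.
int read_around_call(const int* p) {
    int before = *p;   // first load
    record_event();    // may write to any reachable memory, including *p
    int after = *p;    // reloaded; the compiler cannot reuse `before`
    return before + after;
}
```

With g_total at 5, read_around_call(&g_total) returns 11 rather than 10, which is exactly the case the compiler has to keep correct for every opaque call.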

Constant propagation stops at call boundaries. If you compute a value at compile time and pass it to an opaque function, the compiler cannot propagate that compile-time-constant into the callee or specialize the caller for that specific value.
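A small sketch of the visible case, with hypothetical function names: because the body of ipow is available where cube calls it, an optimizing compiler can typically substitute the constant 3, unroll the loop, and reduce the call to two multiplies.

```cpp
// Visible body: the exponent is a loop bound the optimizer can reason about.
inline int ipow(int base, unsigned exp) {
    int result = 1;
    for (unsigned i = 0; i < exp; ++i) result *= base;
    return result;
}

// The exponent 3 is a compile-time constant at this call site. If ipow
// lived in another translation unit, the 3 would simply be passed in a
// register and the generic loop would run at run time.
int cube(int x) { return ipow(x, 3); }
```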

Loop Invariant Code Motion becomes conservative in the presence of calls. Moving a computation out of a loop requires proving the result is the same each iteration and that moving it does not affect observable behavior. An opaque call can potentially change global state, which the compiler must assume it does.
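As a sketch of what that conservatism costs, the hypothetical example below keeps a call inside the loop in one version and hoists it by hand in the other. The manual hoist is only valid because the programmer knows that data never aliases g_scale and that nothing else changes it during the loop; the compiler cannot assume either.

```cpp
int g_scale = 3;  // global configuration value

// Stand-in for an opaque callee (noinline is GCC/Clang syntax).
__attribute__((noinline)) int current_scale() { return g_scale; }

// The call stays inside the loop: one call per element.
void scale_all_naive(int* data, int n) {
    for (int i = 0; i < n; ++i)
        data[i] *= current_scale();
}

// Hoisted by hand: one call in total, under the assumptions stated above.
void scale_all_hoisted(int* data, int n) {
    const int scale = current_scale();
    for (int i = 0; i < n; ++i)
        data[i] *= scale;
}
```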

You can recover some of these optimizations without inlining by using GCC attributes that constrain what a function may do. __attribute__((const)) declares that a function is a pure computation: the result depends only on the argument values, and the function reads no memory at all, not even through pointer arguments. __attribute__((pure)) is slightly weaker, permitting memory reads but no observable side effects. These let the optimizer hoist, merge, or delete calls to an external function without requiring the body to be visible. They do not by themselves vectorize the call: for that the compiler also needs a vector variant of the callee, such as the one GCC can generate from #pragma omp declare simd.
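A brief sketch of the two attributes in use, assuming GCC or Clang; all function names here are hypothetical. In a real project the attributes would sit on the declarations in a header while the definitions live elsewhere; they are combined here to keep the example self-contained.

```cpp
// const: the result depends only on the argument value; no memory is read.
__attribute__((const)) int clamp_to_byte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// pure: may read memory (through `data`) but has no side effects.
__attribute__((pure)) int sum_of(const int* data, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) total += data[i];
    return total;
}

// The loop performs no stores to memory, so a pure call with unchanged
// arguments may be evaluated once instead of once per iteration.
int count_above_mean(const int* data, int n) {
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] * n > sum_of(data, n)) ++count;
    return count;
}
```

The attributes are promises, not checks: marking a function const or pure when it does not meet the requirements leads to silently wrong code.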

What inline actually means now

Most C++ programmers learn that the inline keyword is a hint to the compiler to substitute the function body at call sites. That interpretation was accurate in the early 1990s, when compilers lacked cost models and the programmer’s judgment mattered. It has not been accurate for a long time.

The cppreference documentation is unambiguous: the inline keyword’s function since C++98 is granting an ODR exemption, allowing a function definition to appear in multiple translation units via #include without triggering a linker error. The compiler is under no obligation to inline a function marked inline, and it may generate a real call for it whenever its cost model concludes that inlining is not worthwhile. Conversely, the compiler may inline any function it can see, regardless of whether it carries the inline keyword.
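A minimal sketch of the ODR role, written as the contents of a hypothetical header named geometry.hpp:

```cpp
// --- contents of a hypothetical header, geometry.hpp ---
#ifndef GEOMETRY_HPP
#define GEOMETRY_HPP

// `inline` allows this definition to be included in many translation units
// without a multiple-definition error at link time. Whether any given call
// is actually inlined remains the optimizer's decision.
inline float squared_length(float x, float y) { return x * x + y * y; }

// Without `inline`, including this header from two .cpp files would define
// the function twice and the link would fail:
// float squared_length_bad(float x, float y) { return x * x + y * y; }

#endif  // GEOMETRY_HPP
```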

What actually controls inlining is the visibility of the function body at the call site, combined with the compiler’s internal cost model. GCC exposes its size limit as --param max-inline-insns-single; the default has varied across releases, from roughly 400 weighted pseudo-instructions in older versions to lower values in recent ones. Clang’s threshold is around 225 cost units at -O2. Both compilers give substantial bonuses to candidates where inlining would enable constant folding or vectorization. Raising these thresholds with -O3 is one reason that flag produces better results on tight loops without changing anything about the source code.

When you need to override the cost model, __attribute__((always_inline)) forces inlining at every call site regardless of size, and __forceinline is the MSVC equivalent. The symmetric __attribute__((noinline)) prevents inlining entirely, which is useful for isolating cold error paths or for benchmarking the actual call overhead you are paying in production.
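Projects that target several compilers often hide these spellings behind macros. The following is a sketch under that assumption; FORCE_INLINE, NO_INLINE, and the functions are hypothetical names, not part of any standard or library.

```cpp
#include <cstdio>

#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#define NO_INLINE __declspec(noinline)
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#define NO_INLINE __attribute__((noinline))
#endif

// Hot path: tiny, called everywhere, worth forcing into the caller.
// Assumes n > 0.
FORCE_INLINE int clamp_index(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// Cold error path: kept out of line so it does not bloat the hot loop.
NO_INLINE void report_out_of_range(int i, int n) {
    std::fprintf(stderr, "index %d outside [0, %d)\n", i, n);
}

int checked_get(const int* data, int n, int i) {
    if (i < 0 || i >= n) report_out_of_range(i, n);
    return data[clamp_index(i, n)];
}
```

Keeping the reporting function out of line means the common in-range case pays only for a compare and a branch.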

Inlining across translation units

The visibility problem compounds when code lives in separate .cpp files. Functions defined in different translation units are opaque to each other at compile time, which is why header-only libraries are structurally necessary for performance-critical C++ rather than a style choice. Eigen’s matrix arithmetic, the hot paths in {fmt}, and range-v3’s view adapters all live in headers because the compiler must see the full definition at every instantiation point. This is the same inlining constraint, expressed as a library architecture decision.

For non-template code, Link-Time Optimization can close the translation unit gap. With -flto, GCC emits GIMPLE IR and Clang emits LLVM IR into object files instead of machine code; the linker then runs a whole-program optimization pass with full inlining capability across all translation unit boundaries. Clang’s -flto=thin variant builds lightweight cross-module summaries and parallelizes the optimization, achieving 80 to 90 percent of full LTO’s benefit with much lower link time. Chrome and Firefox both use ThinLTO in production and report 10 to 15 percent runtime gains over plain -O2.

# Full LTO with GCC
g++ -O2 -flto -o program main.cpp math.cpp

# ThinLTO with Clang
clang++ -O2 -flto=thin foo.cpp bar.cpp -o app

Finding the problem

The diagnostic flags that surface missed vectorization turn an invisible problem into a concrete starting point:

# GCC: report every loop the vectorizer attempted and failed on
g++ -O3 -fopt-info-vec-missed hot_loop.cpp

# Clang equivalent
clang++ -O3 -Rpass-missed=loop-vectorize hot_loop.cpp

A GCC note such as “Function call may clobber memory” on a hot loop (the exact wording varies across versions) is specific and actionable. It means the vectorizer examined the right loop and gave up at a call it could not reason across. The fix is usually visibility: move the function definition to a header, mark it always_inline there, or enable LTO. The const and pure attributes help the optimizer hoist or merge calls, but on their own they do not give the vectorizer a vector version of the callee.

The 10-to-15-cycle call overhead is real and worth eliminating in loops that run millions of iterations. But the missed vectorization does not cost cycles in any way your profiler will label. It simply leaves a factor of 4 to 16 on the table, producing a loop that looks like it is working correctly and just happens to run slower than it should. That is the harder cost to find, because nothing in the tooling tells you it is there until you go looking.
