Why Your Loop Goes Scalar When You Call a Function

Source: isocpp

Daniel Lemire’s recent article on isocpp.org opens with a clean example: an add function called twice from add3, which the compiler collapses to two instructions when optimization is enabled. The lesson is that function calls cost something and inlining removes that cost. The example is correct, but the interesting part is what happens in loops, and why the performance gap there is usually much larger than the call overhead suggests.
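For reference, the example described above can be sketched as follows. The exact signatures are an assumption based on the description (an add function called twice from add3), not a copy of the article's listing.

```cpp
// Minimal sketch of the add/add3 example, assuming plain int arguments.
int add(int x, int y) {
    return x + y;
}

int add3(int x, int y, int z) {
    // Without optimization this is two real calls to add.
    // With optimization enabled, the compiler typically inlines both
    // and emits a couple of arithmetic instructions instead.
    return add(add(x, y), z);
}
```

Compiling this with and without optimization and comparing the generated assembly shows the calls disappearing.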

The Mechanical Cost

On x86-64 with the System V ABI, a well-predicted direct call to a warm function costs on the order of 10-15 cycles end to end. The CALL instruction itself accounts for roughly 3-4 of those. The caller must spill any live values held in the caller-saved (volatile) registers (RAX, RCX, RDX, RSI, RDI, R8-R11) before the call and reload them after. The callee may set up and tear down a stack frame. The return consumes another 3-4 cycles and is predicted from the Return Stack Buffer. Exact figures vary by microarchitecture.

For a loop running a million iterations, that adds up. A loop body doing one addition and one function call spends most of its time on the call, not the addition. Lemire measures non-inlined scalar code running 4-10x slower than its inlined equivalent in tight loops. That is real overhead. But it is the smaller part of the problem.

The Visibility Boundary

When the compiler emits a call add instruction, it is not just accounting for 10-15 cycles. It is recording that it cannot see what add does. The function, from the optimizer’s perspective, is opaque: it might read globals, it might write through pointers, it might have side effects that constrain reordering. The compiler must assume the worst on all of these.

This assumption breaks auto-vectorization entirely.

GCC’s and Clang’s loop vectorizers operate by analyzing the full body of a loop and determining whether its operations can be expressed as SIMD instructions. For this to work, every load, store, and computation in the loop must be visible simultaneously. When the vectorizer encounters an opaque function call, it cannot determine whether the call aliases with the loop’s memory accesses or whether it has side effects that prevent reordering. The vectorizer stops.

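The GCC diagnostic flag -fopt-info-vec-missed and the Clang equivalent -Rpass-missed=loop-vectorize both report loops blocked by this condition. The wording differs by compiler and version: Clang typically says the call instruction cannot be vectorized, while GCC complains that the loop contains function calls or data references it cannot analyze, or that a statement clobbers memory.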
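A source-level sketch of the two cases, under some assumptions: the function names are hypothetical, and the noinline attribute stands in for a function whose body lives in another translation unit. Compiling this at -O3 with the diagnostic flags above should flag the first loop and not the second, though the exact result depends on compiler version and target.

```cpp
#include <cstddef>

#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Opaque to the loop: the attribute forces a real call per element.
SKETCH_NOINLINE int add_opaque(int a, int b) { return a + b; }

// Visible to the loop: eligible for inlining, so the vectorizer sees the add.
inline int add_visible(int a, int b) { return a + b; }

// One call per element; the loop body cannot become SIMD.
void sum_opaque(const int* a, const int* b, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = add_opaque(a[i], b[i]);
    }
}

// After inlining, the body is a plain load/add/store that can be vectorized.
void sum_visible(const int* a, const int* b, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = add_visible(a[i], b[i]);
    }
}
```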

Here is what the difference looks like in assembly. Non-inlined, the loop is scalar:

.L3:
    mov     edi, DWORD PTR [rbx+rax*4]
    mov     esi, DWORD PTR [r12+rax*4]
    call    add
    mov     DWORD PTR [r13+rax*4], eax
    add     rax, 1
    cmp     rax, rcx
    jne     .L3

One element per iteration. With the function body inlined and AVX2 available:

.L3:
    vmovdqu ymm0, YMMWORD PTR [rbx+rax]
    vpaddd  ymm0, ymm0, YMMWORD PTR [r12+rax]
    vmovdqu YMMWORD PTR [r13+rax], ymm0
    add     rax, 32
    cmp     rax, rdx
    jne     .L3

Eight 32-bit integers per iteration. The vpaddd instruction processes 256 bits at once. On Skylake-class hardware, the vectorized version runs at roughly 0.1 ns per element; the scalar version at roughly 1.0 ns. That is a 10x difference, but only part of it comes from eliminating the call. The rest comes from 8x SIMD width that was simply unavailable before.

With AVX-512, the loop processes 16 integers per iteration and the gap widens toward 16x, provided the data stays in cache and memory bandwidth does not become the limit first.

Other Passes That Stop

Vectorization is the most dramatic casualty, but not the only one.

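Loop Invariant Code Motion (LICM): if a computation inside a loop does not depend on the loop variable, the compiler normally hoists it above the loop. An opaque call prevents this in two ways. Work inside the callee cannot be hoisted at all: a loop calling normalize(x, min, max) without inlining recomputes max - min on every iteration, because the subtraction lives in a function body the optimizer cannot see. And any value the call might modify, such as a global or anything reachable through an escaped pointer, must be reloaded after each call, because the compiler cannot prove normalize leaves it alone.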
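A sketch of the normalize case. The body of normalize is not given in the text, so the linear rescaling here is an assumption, and lo/hi stand in for min/max. The second loop is written by hand to show roughly what inlining plus LICM can produce; it is not compiler output.

```cpp
#include <cstddef>

// Hypothetical helper: the range computation lives inside the callee.
float normalize(float x, float lo, float hi) {
    return (x - lo) / (hi - lo);
}

// If normalize is not inlined, (hi - lo) is recomputed on every call.
void normalize_all(float* data, std::size_t n, float lo, float hi) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = normalize(data[i], lo, hi);
    }
}

// Hand-written equivalent of the inlined and hoisted form:
// the loop-invariant range is computed once, above the loop.
void normalize_all_hoisted(float* data, std::size_t n, float lo, float hi) {
    const float range = hi - lo;
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = (data[i] - lo) / range;
    }
}
```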

Alias analysis: after an opaque call, any memory the callee could reach (globals, heap objects, locals whose address has escaped) must be treated as potentially modified. Values previously loaded from that memory into registers are stale, so the compiler issues additional loads where it could otherwise have kept them in registers.

Constant propagation: compile-time-known values passed into an opaque function cannot be folded. The function’s return value is unknown. Folding chains that would span the call boundary stop at it.
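A small sketch of where folding stops. The function names are hypothetical, and noinline again stands in for a body the optimizer cannot see. The constexpr version makes the visible case checkable at compile time; for the opaque version, the compiler generally has to emit the call and add 2 to whatever comes back.

```cpp
#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Opaque: the constant argument cannot be folded through the call.
SKETCH_NOINLINE int scale_opaque(int x) { return x * 4; }

// Visible: the whole chain can be evaluated by the compiler.
constexpr int scale_visible(int x) { return x * 4; }

// Typically compiles to: load 10, call scale_opaque, add 2, return.
int answer_opaque() { return scale_opaque(10) + 2; }

// Typically compiles to: return 42.
int answer_visible() { return scale_visible(10) + 2; }

// The visible chain folds at compile time; the opaque one cannot be used here.
static_assert(scale_visible(10) + 2 == 42, "visible chain folds to a constant");
```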

What the inline Keyword Actually Does

The confusing part is that C++’s inline keyword was originally intended as a hint for exactly this problem. In the early 1990s, compilers had no real cost model for inlining decisions, and inline told the compiler to substitute a function body at the call site. By the late 1990s, compiler heuristics had outpaced the hint. The inlining decision moved into the optimizer entirely.

What remained in the standard is the keyword’s secondary role: an ODR (One Definition Rule) exemption. A function declared inline may appear in multiple translation units without a linker error. This is what allows function definitions to live in headers. It is also what makes the function body visible at every call site that includes the header, which is the prerequisite for inlining, but the optimizer makes the actual decision independently.

Marking a function inline does not guarantee the optimizer will inline it. The optimizer will inline functions not marked inline if it has sufficient reason. The name is a historical artifact. The keyword is a linkage mechanism.

To actually direct the optimizer: __attribute__((always_inline)) on GCC/Clang, [[gnu::always_inline]] as a C++ attribute, or __forceinline on MSVC. To prevent inlining for benchmark baselines or readable profiler call graphs: __attribute__((noinline)) on GCC/Clang, or __declspec(noinline) on MSVC.
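These attributes are usually wrapped in a portability macro. The sketch below shows one way to do that; the macro and function names are hypothetical, and the fallback branch simply degrades to the plain inline keyword on compilers not covered here.

```cpp
#if defined(_MSC_VER)
#define SKETCH_FORCE_INLINE __forceinline
#define SKETCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_FORCE_INLINE inline
#define SKETCH_NOINLINE
#endif

// Forced into every call site, regardless of the optimizer's cost model.
SKETCH_FORCE_INLINE int add_hot(int a, int b) { return a + b; }

// Kept as a real call, e.g. as a benchmark baseline.
SKETCH_NOINLINE int add_baseline(int a, int b) { return a + b; }
```

Both functions compute the same result; only the code generated at the call site differs.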

How Libraries Handle Visibility

simdjson, which Lemire co-authored, defines its own really_inline macro as __attribute__((always_inline)) inline and applies it throughout the hot parsing path. The library achieves 2.5-3.5 GB/s JSON parsing throughput on modern hardware. Parsers structured around opaque function call boundaries achieve around 0.5 GB/s on the same inputs.

Eigen achieves zero-overhead linear algebra through expression templates. Every operator in an expression like a + b * 2.5f is visible to the compiler simultaneously, enabling fusion into a single vectorized pass with no temporaries. A conventional matrix library with opaque operators returning intermediate results loses this. Header-only, template-heavy design is not stylistic preference; it is the visibility model these libraries require.
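To make the visibility argument concrete, here is a deliberately tiny expression-template sketch. It is not Eigen's implementation: the sketch namespace and the Vec, AddExpr, and ScaleExpr types are all hypothetical. It only shows how a + b * 2.5f can become a single loop with no temporary vectors once every operator body is visible.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {

// Lazy node for lhs + rhs; nothing is computed until operator[] is called.
template <typename L, typename R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Lazy node for expr * factor.
template <typename E>
struct ScaleExpr {
    const E& expr;
    float factor;
    float operator[](std::size_t i) const { return expr[i] * factor; }
    std::size_t size() const { return expr.size(); }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Evaluates the whole expression tree in one pass, element by element.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Operators build expression nodes instead of computing results.
// The nodes hold references, so they must be consumed within the same
// full expression (as in r = a + b * 2.5f) and not stored.
template <typename L, typename R>
AddExpr<L, R> operator+(const L& l, const R& r) { return AddExpr<L, R>{l, r}; }

template <typename E>
ScaleExpr<E> operator*(const E& e, float f) { return ScaleExpr<E>{e, f}; }

}  // namespace sketch
```

With sketch::Vec a, b, r of equal size, the statement r = a + b * 2.5f; runs one loop whose body, after inlining, is a multiply and an add per element, which is the shape a vectorizer can work with.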

The std::sort versus qsort gap is the same mechanism in a familiar form. std::sort takes the comparator's type as a template parameter, making the comparator body visible at the call site and eligible for inlining into the sort’s inner comparison. qsort takes a function pointer, an indirect call to an opaque function. Measured on 10 million random integers: around 1.8 seconds for qsort, around 0.7 seconds for std::sort. The implementations differ in detail (std::sort is typically an introsort, and C libraries vary in how they implement qsort), but the dominant difference is comparator visibility.
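The two call shapes side by side, as a minimal sketch. The wrapper function names are hypothetical; std::qsort and std::sort are the standard library calls. No timing is included here, since results depend heavily on the library, compiler, and data.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort comparator: reached through a function pointer on every comparison.
int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids overflow from x - y
}

void sort_with_qsort(std::vector<int>& v) {
    if (v.empty()) return;
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
}

// std::sort comparator: the lambda's type is a template argument,
// so its body is visible to the optimizer inside the sort.
void sort_with_std_sort(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), [](int x, int y) { return x < y; });
}
```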

Cross-Translation-Unit: LTO

When the function body cannot be moved to a header, Link-Time Optimization addresses the visibility problem at the linker stage. -flto on GCC or Clang writes the compiler’s intermediate representation into the object files (GCC can also keep machine code alongside it with -ffat-lto-objects), and the link step runs a whole-program optimization pass with full cross-module visibility. The measured runtime improvement on call-heavy code is 5-20%.

Clang’s ThinLTO (-flto=thin) is the production-viable version: it emits per-module summaries and performs cross-module inlining in parallel, recovering 80-90% of full LTO benefit at substantially lower link time. Chrome and Firefox combine LTO with PGO in release builds, and the Linux kernel can be built with Clang ThinLTO as a configuration option. Both Chrome and Firefox teams have reported 10-15% improvements over plain -O2 from the combination.

Profile-Guided Optimization adds another dimension here. After collecting execution profiles from a representative workload, the compiler has call frequency data and can apply speculative devirtualization to virtual calls that are polymorphic in the source but monomorphic in practice. Google reports 5-15% improvements from PGO on production C++ code.
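Speculative devirtualization can be approximated by hand to show the idea. The sketch below assumes a hypothetical Shape hierarchy in which profiling found Circle to be the dominant dynamic type. A real compiler compares vtable or function pointers, not typeid, so treat this as an illustration of the guard-plus-direct-call shape only.

```cpp
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle final : Shape {
    double r;
    explicit Circle(double radius) : r(radius) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

struct Square final : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

double area_speculative(const Shape& s) {
    // Guard: if the dynamic type is the profiled hot type, make a direct,
    // non-virtual call whose body the optimizer can inline.
    if (typeid(s) == typeid(Circle)) {
        return static_cast<const Circle&>(s).Circle::area();
    }
    // Fallback: ordinary virtual dispatch for every other type.
    return s.area();
}
```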

The Icache Tradeoff

Inlining copies function bodies into every call site. Aggressive inlining grows the binary and can push the hot path out of the L1 instruction cache, typically 32-64 KB per core on modern hardware. An L1i miss costs 10-40 cycles, which exceeds the call overhead being avoided. Agner Fog’s instruction tables and optimization manuals document cases where inlining across a cache line boundary caused a net slowdown.

The practical heuristic: prefer __attribute__((always_inline)) for functions under roughly 10 instructions in confirmed hot loops. Profile with perf stat -e instructions,L1-icache-load-misses before forcing anything larger. The compiler’s default heuristics at -O2 and -O3 are calibrated for this tradeoff and are usually right; the manual overrides are for cases where profiling proves otherwise.

The GCC and Clang diagnostics make the decision visible:

# Clang: which sites were inlined, which were not
clang++ -O2 -Rpass=inline -Rpass-missed=inline file.cpp

# GCC equivalent
g++ -O2 -fopt-info-inline -fopt-info-inline-missed file.cpp

# Missed vectorizations
clang++ -O3 -Rpass-missed=loop-vectorize file.cpp
g++ -O3 -fopt-info-vec-missed file.cpp

The deeper point in Lemire’s article is that function calls are not just overhead, they are scope boundaries. The optimizer’s reach ends at every call it cannot see into. In a tight loop, the question is not only how much the call costs but how much the compiler could have done if the call were not there.
