A function call in a tight loop carries two costs. The first is what Daniel Lemire’s recent writeup on function call overhead quantifies directly: the instruction-level mechanics of the call and return, the register saves and stack frame management, the branch predictor state. On a modern x86-64 core, a well-predicted direct call costs somewhere between 4 and 6 cycles. In a loop with a body that does comparable work, that overhead matters.
The second cost is harder to see in a profiler. An opaque function call tells the compiler it cannot see what the function does, and therefore cannot apply the class of optimizations that requires full loop body visibility. The most important of these is auto-vectorization. A loop with an inline-visible body might process 8 or 16 elements per iteration using AVX2 or AVX-512; the same loop with an opaque call processes one. The difference is not a constant additive penalty; it is a throughput multiplier.
What the compiler cannot do across a call boundary
Consider a loop that applies a transformation to each element of an array:
float sum_scaled(const float* arr, int n, float factor) {
    float acc = 0;
    for (int i = 0; i < n; i++)
        acc += apply(arr[i], factor);
    return acc;
}
If apply lives in another translation unit with no LTO, the compiler has no information about it. It cannot prove apply is pure. It cannot prove apply does not read or modify arr. It cannot determine whether two calls with the same inputs return the same output. Given that uncertainty, the compiler must preserve call order, process one element at a time, and emit a scalar loop.
Inline apply at the call site, and the compiler now sees the full expression on every iteration. If apply(x, f) is x * f, the loop becomes a fused multiply-add reduction over n floats, and on a machine with AVX2 support the compiler emits something like:
.loop:
    vmovups     ymm1, [rdi + rax*4]   ; load 8 floats
    vfmadd231ps ymm0, ymm1, ymm2      ; acc += arr[i] * factor, 8-wide
    add         rax, 8
    cmp         rax, rcx
    jl          .loop
The non-inlined version processes one element per call. The inlined version processes eight per iteration with no call overhead. The gain from vectorization is 8x on AVX2 and 16x on AVX-512; the gain from eliminating the call overhead is 4 to 6 cycles per element. Vectorization is the story; the call overhead is a footnote.
Compiler Explorer at godbolt.org makes this concrete without any benchmarking. Compile the same loop with __attribute__((always_inline)) versus __attribute__((noinline)) at -O3 -mavx2 and compare the outputs. The inlined version produces a vectorized outer loop plus a scalar cleanup for the remainder. The noinline version contains a plain call instruction in the loop body. The compiler is not making a judgment call here; it has no choice. It cannot vectorize what it cannot see.
Lemire has benchmarked the popcount variant of this pattern and documented 3 to 7x throughput differences depending on available SIMD width. The relationship between inlining and vectorization is a consistent theme in his work precisely because it is so easy to lose vectorization without noticing.
Why the vectorizer requires full loop body visibility
Auto-vectorization transforms a scalar loop into one that processes multiple elements per iteration using SIMD instructions. For this to be legal, the compiler needs to prove several things about the loop body.
First, the operations must be independent across iterations, or have a dependency structure that vectorizes (reductions are handled). An opaque call might have hidden state; two calls with the same input might return different values if the function reads a global. The compiler cannot assume otherwise.
Second, the memory accesses must not alias in ways that would change the result if reordered or batched. A function that takes a pointer argument might write through it, making subsequent array reads see modified values. Without seeing the function body, the compiler assumes the worst.
Third, the compiler needs to assign SIMD instructions to the operations. It cannot construct a vector version of an opaque call. Intrinsics like __builtin_popcountll are a special case: the compiler knows they are pure and has vector equivalents (vpopcntq under AVX-512 VPOPCNTDQ), which is exactly why the popcount loop vectorizes when the builtin is used directly but not when wrapped in a separate translation unit.
Inlining solves all of these by eliminating the boundary. The compiler analyzes the full loop body under the standard rules for the source operations, not the conservative rules it must apply to opaque calls.
The inline keyword does not help here
The C++ inline keyword is a linkage directive. It tells the linker that multiple definitions of a function across translation units are identical and should be merged rather than flagged as duplicates. It does not instruct the compiler to inline the function at call sites. The compiler may do so, or may not, based on its cost model.
GCC’s default limit for functions declared inline is 600 pseudo-instructions (-finline-limit); Clang’s default inlining threshold at -O2 is a cost of 225 in its internal model, roughly proportional to IR instruction count. Functions exceeding these thresholds, and functions that are recursive, use alloca or setjmp, or take variable arguments, are generally not inlined regardless of the inline keyword.
The reliable ways to force inlining are __attribute__((always_inline)) on GCC and Clang, and __forceinline on MSVC. There is no standard C++ attribute for this, and none was adopted into C++23. constexpr functions are a partial workaround: the compiler is required to have the body visible, and they are strong inlining candidates as a result, but it is not obligated to inline them in all cases.
LTO extends inlining visibility across translation units
When the function you need inlined is in a separate .cpp file, local annotations are not enough. Link Time Optimization is the mechanism for cross-module inlining.
LLVM’s ThinLTO is the practical choice for large codebases. Full LTO performs a monolithic whole-program optimization pass at link time, which is effective but scales poorly with program size. ThinLTO stores per-function summaries in object files and optimizes each module in parallel using imported summaries, enabling cross-module inlining while keeping link times manageable. The quality difference from full LTO is small for most codebases:
# ThinLTO
clang++ -O2 -flto=thin foo.cpp bar.cpp -o app
# Full LTO
clang++ -O2 -flto foo.cpp bar.cpp -o app
# With profile-guided optimization
clang++ -O2 -fprofile-generate=./pgo-data main.cpp -o instrumented
./instrumented typical_input
llvm-profdata merge -output=default.profdata ./pgo-data/*.profraw
clang++ -O2 -fprofile-use=default.profdata -flto=thin main.cpp -o optimized
PGO combined with ThinLTO is the configuration that large-scale production systems actually use. Chromium reports 10 to 15 percent improvement over plain -O2. The inlining budget is allocated to call sites that profiling identifies as hot, which concentrates optimization where it affects runtime rather than distributing it uniformly.
The same problem in Rust and Go
Rust uses LLVM as its backend, so the mechanics are identical to Clang. The #[inline] attribute serves the same cross-crate visibility role as C++'s inline keyword; #[inline(always)] maps to always_inline. A function in a different crate without one of these annotations cannot be inlined unless you enable cross-crate LTO (e.g. -C lto=thin). Generic functions are monomorphized per concrete type, which makes each instantiation independently vectorizable; this is why performance-critical Rust code uses static dispatch generics rather than dyn Trait in hot paths. Trait objects use vtable dispatch and cannot be inlined, with the same consequences for vectorization.
Go’s situation is more constrained. The gc compiler’s inliner is more conservative, and functions containing for loops were not inlining candidates at all until Go 1.17. Functions using defer, closures, or channel operations still cannot be inlined. The gc auto-vectorizer is also weaker than GCC or Clang; SIMD in Go typically requires hand-written assembly. Tight numeric loops in Go commonly run 2 to 5x slower than equivalent Rust or C++ code, and limited inlining is a significant contributor.
Java’s HotSpot JIT takes the opposite approach: it profiles at runtime and speculatively devirtualizes and inlines virtual calls after observing that a call site dispatches to only one concrete implementation. The inline depth limit is configurable (-XX:MaxInlineLevel defaults to 9). Hot methods up to 325 bytecodes can be inlined. This is more aggressive than static compilers because the JIT has actual frequency data, though it comes with JIT compilation overhead and deoptimization paths when assumptions are violated.
Diagnosing missed vectorization
When a loop should vectorize but does not, the compiler can explain why. GCC reports missed vectorizations with -fopt-info-vec-missed; Clang uses -Rpass-missed=loop-vectorize. An opaque function call in the loop body is among the most common causes, and the fix is to make the function body visible through any of the mechanisms above.
For verifying that a specific function is being inlined where expected, GCC’s -Winline warns when a function marked inline was not actually inlined. Clang’s -Rpass=inline reports each inlining decision. These are verbose, but targeted use during a performance investigation can surface unexpected failures quickly.
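A typical invocation during such an investigation (hot_loop.cpp is a placeholder filename):

# Ask Clang why a loop was not vectorized, and which calls it inlined
clang++ -O3 -mavx2 -Rpass-missed=loop-vectorize -Rpass=inline -c hot_loop.cpp
# GCC equivalents
g++ -O3 -mavx2 -fopt-info-vec-missed -Winline -c hot_loop.cpp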
The principle that falls out of Lemire’s analysis is that inlining and vectorization are coupled. Writing a small helper function does not just add call overhead; it potentially serializes a loop that the compiler would otherwise have widened to SIMD throughput. The overhead in cycles is measurable and modest. The vectorization loss, when it occurs, is a multiplier on the loop’s runtime, and profilers will attribute the cost to the loop itself rather than to the invisible call boundary that blocked the optimization.