
Beyond Call Overhead: How Inlining Enables Vectorization in C++

Source: isocpp

Daniel Lemire’s recent piece on isocpp.org opens with a clean example: a two-argument add function called twice from add3. With optimization enabled, the compiler collapses the whole thing to two instructions. This is correct, and it makes an important point, but the cycles saved by removing two call/ret pairs are not the main story. The main story is what those call boundaries were preventing the compiler from doing.
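For reference, the example has roughly the following shape. This is a sketch reconstructed from the description above, so the parameter names and exact signatures are assumptions rather than a quote from the original piece.

```cpp
// Two-argument helper, small enough that an optimizing compiler
// will normally inline it wherever its body is visible.
int add(int x, int y) {
    return x + y;
}

// Calls add twice. With optimization enabled and add's body visible,
// both calls disappear and only the arithmetic remains.
int add3(int x, int y, int z) {
    return add(add(x, y), z);
}
```

Compiling this with and without optimization on Compiler Explorer shows the two call instructions in add3 being replaced by plain arithmetic.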

The Actual Cost of a Call

A function call in x86-64 is not just call and ret. Before the call, the compiler must ensure the stack is 16-byte aligned, spill any caller-saved registers that hold live values (RAX, RCX, RDX, RSI, RDI, R8 through R11 under the System V AMD64 ABI), and shuffle arguments into the correct registers. After the return, it restores those registers and picks up the return value from RAX.

For a well-predicted direct call to a function in the L1 instruction cache, this sequence costs roughly 10 to 15 cycles on a Skylake-class CPU, about 3 to 5 nanoseconds at 3 GHz. On an individual call outside a hot loop, this does not matter. Inside a loop processing millions of elements, the overhead is paid on every iteration.

The more serious problem is that when the compiler encounters a call to a function whose body is not visible, it must treat that call as an opaque side-effecting operation. That constraint disables a large class of optimizations.

The Vectorization Barrier

Modern CPUs can execute SIMD instructions that process multiple data elements simultaneously. AVX2 can add eight 32-bit integers in a single vpaddd ymm instruction; AVX-512 doubles that to sixteen. The auto-vectorizer in GCC or Clang transforms scalar loops into SIMD code when it can prove the transformation is safe.

A call to an opaque function blocks this in two ways. The compiler cannot prove the callee does not write to any pointer in scope, making it unsafe to reorder or merge memory accesses. Vectorization also requires scheduling multiple iterations in parallel, which requires knowing the function has no order-dependent observable effects.

Consider the assembly the compiler produces for a loop that calls a non-visible add function:

.L3:
    mov     esi, DWORD PTR [r13+rbx*4]
    mov     edi, DWORD PTR [r12+rbx*4]
    call    add
    mov     DWORD PTR [r14+rbx*4], eax
    add     rbx, 1
    cmp     rbp, rbx
    jne     .L3

The loop processes one element per iteration and issues one call per element. When the body is visible and the compiler inlines it, the same loop becomes:

.L3:
    vmovdqu ymm0, YMMWORD PTR [rbx+rax]
    vpaddd  ymm0, ymm0, YMMWORD PTR [r12+rax]
    vmovdqu YMMWORD PTR [r13+rax], ymm0
    add     rax, 32
    cmp     rax, rdx
    jne     .L3

Eight elements per iteration, no call. Real-world measurements on this pattern sit around 1.0 ns/element scalar versus 0.08 to 0.12 ns/element with AVX2. The 10-to-15-cycle call overhead is almost a rounding error compared to the 8 to 12x difference unlocked by vectorization.
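A source-level sketch of the two loops makes the comparison easy to reproduce. The function names here are hypothetical, and the noinline attribute (GCC/Clang syntax) is only a single-file stand-in for a function whose body lives in another translation unit.

```cpp
#include <cstddef>

// Stand-in for a function defined in another translation unit.
// noinline keeps the call in place so the effect is visible in one file.
__attribute__((noinline)) int add_opaque(int x, int y) { return x + y; }

// Same operation, with the body visible and eligible for inlining.
inline int add_visible(int x, int y) { return x + y; }

// One call per element: the call instruction stays inside the loop body.
void sum_opaque(const int* a, const int* b, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = add_opaque(a[i], b[i]);
    }
}

// After inlining, this is a plain element-wise add that the
// auto-vectorizer can turn into vpaddd (for example with -O3 -mavx2).
// Without restrict-style qualifiers the compiler typically guards the
// vector loop with a runtime overlap check on the pointers.
void sum_visible(const int* a, const int* b, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = add_visible(a[i], b[i]);
    }
}
```

Both functions compute the same result; the difference only shows up in the generated assembly and in timing.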

GCC and Clang both report when this optimization is blocked. Compile with -fopt-info-vec-missed (GCC) or -Rpass-missed=loop-vectorize together with -Rpass-analysis=loop-vectorize (Clang), and the remarks will point at the specific call that prevented vectorization, with wording along the lines of “Function call may clobber memory” (the exact text varies by compiler and version). That is the compiler telling you exactly what it cannot determine across the call boundary.

Inlining also enables other optimizations that are easy to overlook: Loop Invariant Code Motion (LICM), constant propagation through call arguments, dead code elimination on unreachable branches, and better register allocation across what was formerly a call boundary. All of these compound in hot loops.
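As one illustration of constant propagation and dead code elimination through a call argument, consider the sketch below. The helper and its flag parameter are hypothetical and were chosen only to make the effect easy to see.

```cpp
#include <cstddef>

// Helper with a mode flag. Seen in isolation, both branches must be kept.
inline int clamp_or_double(int x, bool clamp) {
    if (clamp) {
        return x > 255 ? 255 : x;
    }
    return x * 2;
}

// The flag is a compile-time constant at this call site. Once the helper
// is inlined, the compiler can propagate clamp == false, discard the
// clamping branch as dead code, and is left with a simple multiply loop.
void double_all(int* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = clamp_or_double(data[i], false);
    }
}
```

If the helper were opaque, the flag would have to be passed and tested on every iteration.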

What the inline Keyword Does

Most C++ developers believe the inline keyword tells the compiler to inline a function. It does not, and it never reliably did, even when it was introduced in the earliest versions of C++ in the 1980s as a hint. By the mid-1990s, compilers had developed cost models sophisticated enough to make independent inlining decisions, and the hint became largely advisory.

What inline does today is grant an ODR (One Definition Rule) exemption. It allows a function to have multiple identical definitions across translation units without a linker error. This is the mechanism that makes header-defined functions legal: every .cpp that includes the header gets its own copy of the definition, and the inline marker tells the linker all copies are identical.

The practical effect is that putting a definition in a header makes it visible at every call site, which is the actual prerequisite for inlining. A function defined in math.cpp and declared in math.h cannot be inlined into main.cpp at compile time regardless of any keyword, because the body is in a separate translation unit that was already compiled independently.
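A minimal sketch of the header-defined pattern follows. It is shown as one block for brevity; the file names in the comments and the function names are hypothetical.

```cpp
// ---- squares.h (hypothetical header) ----
#ifndef EXAMPLE_SQUARES_H
#define EXAMPLE_SQUARES_H

// Defined in the header, so the body is visible in every translation unit
// that includes it. The inline keyword is what permits those identical
// definitions to coexist under the ODR; it does not force inlining.
inline int square(int x) {
    return x * x;
}

#endif  // EXAMPLE_SQUARES_H

// ---- user.cpp (hypothetical source file, would #include "squares.h") ----
// Because square's body is visible here, the optimizer can inline it
// without any help from LTO.
int sum_of_squares(int a, int b) {
    return square(a) + square(b);
}
```

Dropping the inline keyword from the header definition would typically produce a duplicate-symbol error at link time once two source files include it.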

The cppreference page on inline states this plainly: the compiler is under no obligation to inline a function marked inline, and may generate a real call whenever its cost model concludes inlining is not worthwhile. GCC’s default threshold for inlining is around 400 weighted pseudo-instructions; Clang uses approximately 225 cost units at -O2. Both compilers apply bonuses when inlining would unlock vectorization or enable constant folding, but these are heuristics, not guarantees.

LTO: Cross-Translation-Unit Inlining

Link-time optimization solves the visibility problem at the linker level. With -flto (GCC) or -flto=thin (Clang), the compiler emits intermediate representation into object files instead of machine code, GIMPLE for GCC and LLVM bitcode for Clang. At link time, all of that IR is fed back to the optimizer, which now has visibility across every translation unit and can inline across .cpp file boundaries as if the entire program were compiled as one unit.

ThinLTO is the practical choice for large codebases. Rather than running a full whole-program optimization pass, it builds lightweight per-module summaries during compilation and imports only the function bodies worth inlining at link time, with modules processed in parallel. It captures roughly 80 to 90 percent of full LTO’s runtime benefit at a fraction of the link-time cost. Chrome and Firefox build their release configurations with ThinLTO combined with profile-guided optimization, and the Linux kernel can be built with Clang’s ThinLTO; reported gains are in the range of 10 to 15 percent over plain -O2 on hot paths.

For CMake-based projects, enabling this is a single property:

set_property(TARGET my_target PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)

From the command line, with Clang or GCC:

# ThinLTO
clang++ -O2 -flto=thin foo.cpp bar.cpp -o app

# Full LTO with GCC
g++ -O2 -flto add.cpp main.cpp -o app

Forcing Inlining

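Sometimes the compiler declines to inline a function that profiling confirms is on the critical path. GCC and Clang provide __attribute__((always_inline)); MSVC provides __forceinline. Both bypass the cost model. With GCC and Clang, always_inline applies even at -O0. MSVC’s __forceinline is a strong request rather than a guarantee, and it is not honored when inlining is disabled, as in default debug builds (/Ob0).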

__attribute__((always_inline)) inline float scale(float x, float factor) {
    return x * factor;
}

This is the approach simdjson takes on its hot parsing path. The library defines really_inline as __attribute__((always_inline)) inline and applies it throughout its critical sections, achieving 2.5 to 3.5 GB/s JSON parsing throughput versus around 0.5 GB/s on the same inputs when the hot path remains opaque to the optimizer.
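A portable wrapper in the same spirit might look like the sketch below. HOT_INLINE is a hypothetical name used for illustration and is not simdjson's actual macro.

```cpp
#include <cstddef>

// Pick the strongest inlining request the compiler offers.
#if defined(_MSC_VER)
#  define HOT_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define HOT_INLINE __attribute__((always_inline)) inline
#else
#  define HOT_INLINE inline  // fallback: ordinary inline, a hint at most
#endif

// Small helper on the hot path, forced into every call site.
HOT_INLINE float scale(float x, float factor) {
    return x * factor;
}

// With scale inlined, the loop body is a single multiply per element,
// which leaves the loop open to auto-vectorization.
void scale_all(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = scale(data[i], factor);
    }
}
```

Keeping the attribute behind one macro also makes it easy to switch off globally when measuring whether forced inlining actually helps.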

The risk is real. If the function is large and called from many sites, forced inlining inflates the binary and can thrash the L1 instruction cache. Agner Fog’s optimization manuals document cases where inlining across cache line boundaries caused a 1.4x slowdown. Benchmark the actual loop before applying the attribute, and verify the compiler was not declining for a reason.

GCC also provides __attribute__((flatten)), which takes a different approach: annotate the outer function and the compiler recursively inlines all calls within it. Useful for hot kernels that compose multiple small helpers, such as a pipeline of the form normalize(quantize(filter(x))) where each helper is defined in its own header.
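A minimal sketch of the flatten pattern, using the pipeline shape mentioned above, is shown here. The helper bodies are invented for illustration, and the attribute syntax is the GCC one (Clang accepts it as well).

```cpp
#include <cstddef>

// Three small helpers; in a real project each might live in its own header.
inline float filter(float x) { return x < 0.0f ? 0.0f : x; }

inline float quantize(float x) {
    // Round toward zero to the nearest quarter.
    return static_cast<float>(static_cast<int>(x * 4.0f)) / 4.0f;
}

inline float normalize(float x) { return x / 8.0f; }

// flatten asks the compiler to inline every call made inside this
// function, recursively, where that is possible, so the helpers need
// no annotation of their own.
__attribute__((flatten))
void process(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = normalize(quantize(filter(data[i])));
    }
}
```

The annotation sits on the kernel rather than on each helper, so the helpers keep their normal inlining behavior everywhere else.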

Virtual Functions and the Spectre Tax

Virtual function calls carry additional overhead beyond register setup and stack alignment. Each virtual call loads the vtable pointer from the object, loads the function pointer from the vtable, and executes an indirect branch. That indirect branch is what made virtual calls significantly more expensive after the Spectre v2 vulnerability was disclosed in 2018.

Spectre v2 exploits the branch target buffer, which CPUs use to speculatively predict indirect branch targets. The retpoline mitigation, enabled with -mindirect-branch=thunk in GCC and -mretpoline in Clang, routes indirect branches through a trampoline that prevents speculative forward progress, stalling the pipeline until the actual target is resolved. A virtual call that cost 1 to 3 cycles with a correctly predicted BTB entry pre-2018 now costs 10 to 25 cycles under retpoline on CPUs without hardware countermeasures. At 3 GHz, that is roughly 3 to 8 nanoseconds per virtual call, often more than the method body itself.

The C++11 final specifier addresses this directly. Marking a class or method final tells the compiler no further overrides are possible, so when the static type at the call site is that class, it can resolve the call statically, inline the concrete method, and vectorize the loop. Profile-guided optimization achieves similar results through speculative devirtualization: the compiler emits a type check and an inlined direct call for the statistically dominant type, with a vtable fallback for the rest. On newer CPUs, hardware Spectre v2 mitigations such as Intel’s Enhanced IBRS make retpolines unnecessary and remove most of this penalty. Intel’s CET-IBT extension, available on Tiger Lake (11th gen) and later, is a separate feature: it enforces control flow integrity on indirect branch targets and is not itself a speculation mitigation.
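The effect of final can be sketched with a small class hierarchy. The Shape and Square types below are hypothetical and exist only to show where the compiler is allowed to devirtualize.

```cpp
#include <cstddef>

struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};

// final: no class can derive from Square, so Square::area is the only
// possible target whenever the static type is Square.
struct Square final : Shape {
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
    int side;
};

// Static type is Square: the compiler may call Square::area directly,
// inline it, and then optimize the loop as ordinary arithmetic.
int total_area(const Square* squares, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += squares[i].area();
    }
    return total;
}

// Static type is Shape: unless the compiler can prove the dynamic type
// some other way, each call goes through the vtable as an indirect branch.
int total_area_dynamic(const Shape* const* shapes, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += shapes[i]->area();
    }
    return total;
}
```

Both functions return the same value; inspecting the assembly shows whether the indirect call survived in each.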

Finding the Problem

The most direct diagnostic is inspecting the assembly of hot loops on Compiler Explorer. A call instruction inside a vectorizable loop body is a signal worth investigating. Compiler Explorer also lets you compare GCC, Clang, and MSVC output side by side with different -O flags, which makes it easy to see whether a different compiler or optimization level would inline what yours refuses to.

The missed-optimization diagnostics from GCC and Clang will usually name the reason. “Function call may clobber memory” is specific and actionable. The fix is almost always one of three things: put the called function’s body in scope by moving it to a header, enable LTO on the build, or apply always_inline after confirming the function is small and the call is genuinely on the critical path.

The call overhead itself is a secondary concern. What the call boundary prevents the optimizer from doing is the primary one.
