
More Than Vectorization: The Optimizations a Function Call Silently Prevents

Source: isocpp

The conversation around function call overhead usually converges quickly on auto-vectorization. Daniel Lemire’s article on isocpp.org illustrates the effect cleanly: move a multiply into a separate function, and the auto-vectorizer backs off, leaving a scalar loop where a SIMD loop could have been. The 4-10x slowdown in tight loops is real, well-documented, and reproducible with any modern compiler.

But vectorization is one optimization among many, and the function call boundary blocks several others that are less dramatic to measure but equally consequential in the aggregate.

The Compiler’s Memory Model

A C++ compiler maintains a model of what values are live in registers, what memory addresses alias each other, and which computations remain valid across operations. This model enables several classes of optimization:

Common Subexpression Elimination (CSE): If compute(x) is called twice with the same argument and no writes happen between the calls, the compiler can reuse the first result instead of executing the computation again.

Loop Invariant Code Motion (LICM): If config.threshold is read inside a loop and nothing in the loop writes to it, the compiler can hoist the read outside the loop, loading it once into a register rather than re-reading from memory on every iteration.

Dead Store Elimination (DSE): If a value is written to memory and then overwritten before any read, the first write is dead and can be eliminated.

When a compiler encounters an opaque function call, it must assume the worst about all three. The callee might hold a pointer to any globally reachable memory. It might modify global state. It might call other functions that do the same. The compiler has no information about what it does or does not touch, and the consequence is a forced memory model reset.

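After an opaque call, the compiler cannot trust any value it had cached from memory the callee could reach: globals, heap objects, and locals whose address has escaped. It cannot assume that config.threshold still holds what it read before the call. It cannot merge or eliminate stores to such memory made before the call. Alias analysis conclusions built up over the preceding statements are invalidated. Only locals whose address never leaves the function are exempt.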
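A minimal sketch of why the conservative assumption is required, using a function pointer to stand in for an opaque callee. All names here (Callback, g_status, update_status, sum_twice, observe_and_count) are hypothetical and chosen for illustration only.

```cpp
// Hypothetical names; Callback stands in for any call the optimizer cannot see through.
using Callback = int (*)(int);

int g_status = 0;     // globally reachable state
int g_observed = -1;  // what the callee saw in g_status
int g_calls = 0;      // how many times the callee ran

// DSE: the store of 1 looks dead because it is overwritten two lines later,
// but the callee may read g_status, so the store has to stay.
void update_status(Callback cb) {
    g_status = 1;
    cb(0);
    g_status = 2;
}

// CSE: cb(x) appears twice with the same argument, but the callee may have
// side effects or depend on mutable state, so both calls must execute.
int sum_twice(Callback cb, int x) {
    return cb(x) + cb(x);
}

// A callee that does exactly what the compiler has to fear:
// it reads global state and returns a different value on every call.
int observe_and_count(int x) {
    g_observed = g_status;
    return x + g_calls++;
}
```

If the compiler removed the first store or merged the two calls, a caller passing observe_and_count would see different results, which is why neither transformation is legal without more information about the callee.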

This is visible in practice. Consider:

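// config is reachable from outside this function (a global or a reference parameter)
for (int i = 0; i < n; i++) {
    if (expensive_check(data[i], config.threshold)) {
        process(data[i]);
    }
}

If expensive_check and process are opaque, the compiler can keep config.threshold in a register across the calls only if it can prove that neither callee modifies config or any memory reachable through it. Without that proof, it must be conservative and reload the field on every iteration. If the callees are inlined, the compiler can see directly that they do not touch config, and the load is hoisted. Copying the field into a local before the loop (float threshold = config.threshold;) sidesteps the problem by hand, because a local whose address never escapes cannot be modified by any callee, but it also commits the programmer to the claim the compiler could not prove. In a loop over millions of elements, the difference between a register value and a reloaded memory read is not enormous in isolation, but it compounds with every other restriction the opaque boundary imposes.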
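The following sketch shows a case where the conservative reload is the only correct behavior. It assumes a hypothetical global g_config and a handler passed as a function pointer (HitHandler), which plays the role of the opaque callee. None of these names come from the original example.

```cpp
#include <cstddef>

struct Config { float threshold; };
Config g_config{0.5f};               // hypothetical global configuration

using HitHandler = void (*)(float);  // stands in for an opaque callee

// Reads g_config.threshold inside the loop. Because on_hit might write
// g_config, the compiler has to reload the field after every call.
int count_hits_reload(const float* data, std::size_t n, HitHandler on_hit) {
    int hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > g_config.threshold) {
            on_hit(data[i]);
            ++hits;
        }
    }
    return hits;
}

// The programmer hoists the read by hand, which silently asserts that no
// callee changes the threshold while the loop is running.
int count_hits_hoisted(const float* data, std::size_t n, HitHandler on_hit) {
    const float threshold = g_config.threshold;
    int hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > threshold) {
            on_hit(data[i]);
            ++hits;
        }
    }
    return hits;
}

// A handler that breaks that assertion: it raises the threshold on each hit.
void raise_threshold(float) { g_config.threshold = 2.0f; }
```

With the input {1.0, 1.0, 1.0} and raise_threshold as the handler, the two versions return different counts. An automatic hoist would therefore change program behavior, so the compiler may perform it only when it can rule such a handler out.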

The [[gnu::const]] and [[gnu::pure]] Attributes

GCC and Clang support two function attributes that give the compiler partial visibility into a function’s side-effect profile without requiring inlining.

[[gnu::const]] declares that the function is a pure mathematical function: its result depends only on the values of its arguments, it reads no global or heap memory (which includes memory behind pointer arguments if that memory can change between calls), and it has no side effects. For the same inputs it always returns the same output. The compiler can eliminate duplicate calls to such a function within the same scope, reorder it freely, and hoist it out of loops without any concern about memory state.

// Pure mathematical computation: no memory reads beyond arguments
[[gnu::const]] double fast_rsqrt(double x);

for (int i = 0; i < n; i++) {
    result[i] = data[i] * fast_rsqrt(scale_factor);  // loop-invariant call: can be hoisted out of the loop
}

[[gnu::pure]] is slightly weaker: the function may read global memory but must not write it and must have no other side effects. The compiler can cache the result and treat it as invariant within a region where no writes to reachable memory occur, but it must re-evaluate after any write that might affect the function’s visible inputs.

// May read memory but produces no side effects
[[gnu::pure]] int count_valid(const Record* records, int n);

These attributes unlock optimization without inlining. They are a precision tool: you are asserting a contract about the function’s behavior and the compiler accepts that assertion as ground truth. Agner Fog’s optimization manuals discuss function attributes in the context of building high-performance numerical libraries, where you often want functions to remain separately compiled for debuggability and binary size while preserving as much optimization potential as possible.
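As a rough, self-contained sketch of both attributes in use: the functions below are hypothetical, and [[gnu::noinline]] is added only to mimic separate compilation within a single file. Whether a given compiler actually hoists or merges the calls depends on its version and flags.

```cpp
#include <cstddef>

// gnu::const: the result depends only on the argument value; no memory is read.
[[gnu::const, gnu::noinline]] double scale_for(double x) {
    return 1.0 / (1.0 + x * x);
}

// gnu::pure: reads memory through the pointer but writes nothing.
[[gnu::pure, gnu::noinline]] int count_positive(const int* v, std::size_t n) {
    int count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] > 0) ++count;
    }
    return count;
}

// scale_for(k) is loop-invariant, so the compiler is allowed to call it once
// before the loop even though the loop writes to out[].
void apply_scale(double* out, const double* in, std::size_t n, double k) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * scale_for(k);
    }
}

// No writes occur between the two calls, so the compiler is allowed to
// evaluate count_positive once and reuse the result.
int positives_twice(const int* v, std::size_t n) {
    return count_positive(v, n) + count_positive(v, n);
}
```

Comparing the generated assembly with and without the two attributes is a quick way to check what a particular toolchain does with them.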

If the assertion is wrong, the miscompilation will not produce a crash or an obvious wrong answer in most cases. It will produce a subtly stale result whenever the compiler reuses a call result it should have re-evaluated. Neither AddressSanitizer nor UndefinedBehaviorSanitizer will catch the violation. The contract is enforced only by the programmer.

LLVM IR Encodes the Same Information

At the LLVM IR level, these attributes map to recognized function attributes:

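  • readnone: the function accesses no memory at all, not even through pointer arguments (the counterpart of [[gnu::const]]; spelled memory(none) since LLVM 16)
  • readonly: the function may read but not write memory (the counterpart of [[gnu::pure]]; spelled memory(read) since LLVM 16)
  • nosync: the function does not communicate (synchronize) with other threads through memory or other well-defined means
  • willreturn: every call either returns to its caller or has undefined behavior, which lets the optimizer delete calls whose results are unused
  • speculatable: the function has no effects besides computing its result and no undefined behavior, so it may be executed speculatively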

When Clang compiles a [[gnu::const]] function, it emits readnone in the LLVM IR. Because the attribute sits on the declaration, mid-level passes can act on it at every call site without the callee’s body being available: a call to a readonly function, for example, does not force the caller to reload values it already holds in registers. With ThinLTO enabled, these per-function flags are also recorded in the module summary that ThinLTO builds during the compilation phase, so they remain available to cross-module analysis.

The LLVM attributes are also part of what lets the auto-vectorizer legally widen a function call inside a loop. If a function is readnone, the calls made for different SIMD lanes have no ordering constraint relative to each other, so the vectorizer can emit one call per lane or, where a vector variant of the function has been declared, a single call to the vectorized entry point. This is how scalar math calls in hot loops get replaced with SIMD equivalents from libraries such as glibc’s libmvec or SLEEF, the SIMD Library for Evaluating Elementary Functions: the library’s declarations, or a compiler option such as Clang’s -fveclib, tell the compiler that a vector variant exists, and the absence of memory side effects makes the substitution legal without requiring the full function body to be visible.
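One common way to declare a vector variant in C++ is the OpenMP declare simd pragma. The sketch below assumes a hypothetical function soft_clip and takes effect only when compiled with -fopenmp-simd (or -fopenmp); without that flag the pragma is ignored and the code remains a plain scalar loop. It illustrates the general mechanism and is not a description of how SLEEF or libmvec are built.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical function. The pragma tells the compiler that a SIMD variant
// exists (and asks it to generate one where the definition is compiled).
// gnu::const adds the promise that the call touches no memory.
#pragma omp declare simd notinbranch
[[gnu::const]] float soft_clip(float x);

// The caller sees only the declaration above, yet the loop can still be
// vectorized by calling the SIMD variant once per vector of inputs.
void soft_clip_all(float* out, const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = soft_clip(in[i]);
    }
}

// Definition. In a real library this would live in another translation
// unit, compiled with the same pragma so the vector variant really exists.
#pragma omp declare simd notinbranch
float soft_clip(float x) {
    return x / (1.0f + std::fabs(x));
}
```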

What This Means for Library Design

The choice between inlining and opacity is not binary. A library can expose separately compiled functions that are still optimization-friendly by annotating their side-effect profile accurately.

For a function that truly reads no memory beyond its arguments, [[gnu::const]] is accurate and enables CSE, call elimination, and loop-invariant hoisting. For a function that reads configuration or stable state but never writes, [[gnu::pure]] allows LICM in loops where no writes to reachable memory occur between iterations.

The alternative, which Eigen, {fmt}, and simdjson all use, is to put hot paths in headers entirely, giving the compiler complete information at every call site. Header-only design is the unconditional option: the compiler gets full visibility, which enables vectorization and all other optimizations, at the cost of compiling those functions once per translation unit that includes the header.

ThinLTO is the systemic solution: it builds per-module summaries during compilation and performs cross-module inlining in parallel during linking, recovering 80-90% of full LTO’s benefit without the serial link time that full LTO imposes. Chrome and Firefox both ship with ThinLTO enabled and report 10-15% runtime improvements over plain -O2 for call-heavy code.

For code that cannot be header-only and where LTO is not available or appropriate, accurate function attributes are the middle path. They require no change to the caller, no change to the ABI, and no increase in binary size from duplicated function bodies. They require only that you understand what your function actually does and commit that understanding to an annotation.

Diagnosing What You Are Losing

The diagnostic flags that reveal missed optimization are the starting point. On GCC:

g++ -O3 -fopt-info-vec-missed -fopt-info-loop-missed hot_loop.cpp

On Clang:

clang++ -O3 -Rpass-missed=loop-vectorize -Rpass-missed=licm hot_loop.cpp

If LICM is blocked by a function call, Clang reports a missed-optimization remark along the lines of “failed to move load with loop-invariant address because the loop may invalidate its value”. If the culprit is a call with no readonly or readnone attribute, adding [[gnu::pure]] to the callee, provided the annotation is accurate, should let the load be hoisted and the remark disappear.

Assembly output via Compiler Explorer is the most direct confirmation. A hot loop over floating-point data that produces scalar instructions at -O3 -mavx2 is a signal that something is blocking the vectorizer. If the scalar instructions surround a call instruction, you have found the boundary. The fix is one of inlining, LTO, or an accurate side-effect annotation, depending on what the function actually does.
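A small test case to paste into Compiler Explorer or feed to the diagnostic flags might look like the sketch below. The function names are hypothetical, and [[gnu::noinline]] is used only to imitate a callee defined in another translation unit. The exact remarks and instructions produced vary by compiler and version.

```cpp
#include <cstddef>

// Kept out of line on purpose to mimic a separately compiled function.
[[gnu::noinline]] float scale_opaque(float x, float k) { return x * k; }

// Same computation, fully visible to the optimizer at the call site.
inline float scale_inline(float x, float k) { return x * k; }

// Expect a call instruction inside a scalar loop here.
void scale_all_opaque(float* out, const float* in, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = scale_opaque(in[i], k);
    }
}

// Expect packed multiplies here at -O3 on a SIMD-capable target.
void scale_all_inline(float* out, const float* in, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = scale_inline(in[i], k);
    }
}
```

Both loops compute the same values; only the visibility of the callee differs, which makes the pair a convenient baseline for comparing assembly and remarks.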

Lemire’s observation, that function calls are not free and that their cost compounds in loops, is correct and worth internalizing. The full practical lesson that follows from it extends beyond vectorization: every opaque boundary is a point where the compiler’s memory model collapses and must be rebuilt from conservative assumptions. The tooling to recover from that collapse is more varied than “inline everything,” and knowing which tool fits which situation is what lets separately compiled library code perform as well as code the compiler can see in full.
