
The Memory Reload You Never Wrote: Alias Analysis and the Hidden Cost of Opaque Calls

Source: isocpp

Daniel Lemire’s analysis on isocpp.org frames the cost of a function call in cycles: a few for the call and ret pair, a few more for register saves, nothing catastrophic except in tight loops where the arithmetic body is trivial. The broader discussion in C++ performance circles has expanded on this to emphasize vectorization: a non-inlined call in a loop prevents auto-vectorization, which matters far more than the call’s direct cycle cost.

Both framings are accurate. Neither fully accounts for a third category of overhead that shows up in scalar code without any SIMD involved, in loops that are not particularly tight, and for functions that are called at reasonable frequency rather than hundreds of millions of times per second. The mechanism is alias analysis, and every opaque call inside a loop potentially defeats it.

What Compilers Prove About Memory

A significant part of what an optimizing compiler does is prove things about memory: that a value read from a pointer does not change between two points in the program, that two pointer arguments cannot possibly refer to the same location, that a computation inside a loop produces the same result on every iteration. These proofs are the foundation for loop-invariant code motion (LICM), which is the transformation that moves invariant computations out of loops, keeps frequently-read values in registers rather than reloading them from memory, and defers memory writes until after a loop completes.

Consider a loop that scales an array:

void scale_array(float* dst, const float* src, float factor, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * factor;
    }
}

The compiler knows factor and n are passed by value and cannot change. With -O2, it keeps both in registers throughout the loop. If it can also prove dst and src do not overlap, it can vectorize and process eight floats per 256-bit vector instruction with AVX2. The contribution of LICM here is quiet: there is no extra loop-invariant load to hoist, but the compiler is tracking the fact that n in a register is valid and does not need refreshing.

LICM’s impact is larger when pointer-derived values are involved. A loop that reads a configuration struct through a pointer, computes something from it on each iteration, and writes results to a separate output is a good candidate for hoisting the struct reads. Those reads happen once, the values live in registers throughout the loop, and the memory bandwidth is the struct read rather than n struct reads.
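A minimal sketch of that shape, with hypothetical names (the __restrict__ on the output pointer grants the no-overlap guarantee the compiler would otherwise have to prove, since a plain float store could legally alias the struct's float members):

```cpp
#include <cstddef>

// Hypothetical configuration struct for illustration.
struct Params { float gain; float bias; };

// __restrict__ on out guarantees the stores cannot alias *p, so at -O2
// the reads of p->gain and p->bias can be hoisted out of the loop and
// both values live in registers for its entire duration.
void apply(float* __restrict__ out, const float* in,
           const Params* p, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        out[i] = in[i] * p->gain + p->bias;
}
```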

Why a Function Call Resets the Analysis

When a non-inlined function call appears inside a loop, the compiler loses the proofs it had accumulated. It cannot know what the callee does to memory. The callee might write to a global variable, modify a static local, accept a pointer parameter that happens to alias something in the caller’s scope, or call another function that does any of these things.

This is not a compiler limitation in the sense of a flaw; it is correct reasoning under incomplete information. The calling convention establishes a contract about which registers are preserved, but it says nothing about memory. Any location accessible from the callee’s execution context is potentially modified by the call.

The consequence is that the compiler must treat any value derived from a pointer as potentially stale after the call:

struct Config {
    float threshold;
    int max_iters;
};

void process(float* data, const Config* cfg, int n) {
    for (int i = 0; i < n; i++) {
        // cfg->threshold and cfg->max_iters were read into registers
        if (data[i] > cfg->threshold) {
            // After this call, cfg->threshold and cfg->max_iters
            // must be reloaded -- the compiler cannot prove
            // record_hit() does not modify *cfg
            record_hit(i);
        }
    }
}

With record_hit non-inlined, the compiler must reload cfg->threshold and cfg->max_iters on each iteration where the branch is taken. If the branch fires frequently and the struct is in L1 cache, the cost might be a few cycles per iteration. If the struct has been evicted to L2 or L3, each reload costs roughly 12 to 40 cycles. The original source of the reload is the call, but profilers that sample at function granularity will attribute the time to cache misses, not to call overhead.
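One caller-side fix that requires no change to record_hit is copying the field into a local before the loop: a local whose address never escapes cannot be modified by any call. A sketch, with record_hit stubbed out so the snippet is self-contained:

```cpp
#include <vector>

struct Config { float threshold; int max_iters; };

// Hypothetical stand-in for the opaque callee.
std::vector<int> hits;
void record_hit(int i) { hits.push_back(i); }

void process_hoisted(float* data, const Config* cfg, int n) {
    // threshold is a local whose address never escapes, so no call can
    // modify it: the compiler keeps it in a register across record_hit()
    // instead of reloading cfg->threshold on each taken branch.
    const float threshold = cfg->threshold;
    for (int i = 0; i < n; i++) {
        if (data[i] > threshold) record_hit(i);
    }
}
```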

GCC’s pure and const Attributes

For functions where you control the source, GCC and Clang provide two attributes that restore some of the compiler’s analytical capability without requiring inlining:

// pure: reads from memory but never writes to it.
// The compiler can cache memory values across calls to this function.
__attribute__((pure))
float compute_weight(const float* arr, int n);

// const: does not read from or write to any memory.
// The compiler can treat this like a deterministic math function:
// hoist it out of loops, fold duplicate calls, etc.
__attribute__((const))
float fast_reciprocal(float x);

A function marked const is treated like an expression: given identical arguments it always returns the same value and has no observable side effects on memory. The compiler can hoist a const call out of a loop if its arguments are loop-invariant, fold multiple calls with identical arguments into one, and reorder it freely around memory operations. A function marked pure can be reordered relative to other non-writing operations and its results can be cached, but it cannot be moved past a write that might affect what it reads.
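The hoisting case can be made concrete with a sketch (decay is a hypothetical helper):

```cpp
#include <cmath>

// Hypothetical helper: touches no memory, so const is a valid promise.
__attribute__((const))
float decay(float rate) { return std::exp(-rate); }

void damp(float* v, int n, float rate) {
    // decay(rate) is a const call with a loop-invariant argument, so
    // the compiler may evaluate it once before the loop even though
    // the source calls it on every iteration.
    for (int i = 0; i < n; i++)
        v[i] *= decay(rate);
}
```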

These annotations are only correct if the function genuinely satisfies the contract. A function incorrectly marked const that reads from a global will produce wrong results after the optimizer hoists or deduplicates calls. The compiler trusts the annotation unconditionally.

// Correct: pure float math, no memory access
__attribute__((const))
float sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}

// Correct: reads arr[] but never writes anything
__attribute__((pure))
float dot_product(const float* a, const float* b, int n) {
    float s = 0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

With these annotations, a call to sigmoid or dot_product inside a loop does not force the compiler to reload pointer-derived values after the call. The function call still executes on each iteration where the arguments differ, but the aliasing pessimism is removed.
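For instance, in a hypothetical caller whose loop writes only a local accumulator, the pure annotation lets the compiler fold the invariant call rather than re-executing it (dot_product repeated here so the snippet stands alone):

```cpp
__attribute__((pure))
float dot_product(const float* a, const float* b, int n) {
    float s = 0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

// The loop writes nothing but a local accumulator, and dot_product is
// pure with loop-invariant arguments, so the compiler may compute it
// once rather than n times.
float scaled_sum(const float* a, const float* b,
                 const float* weights, int n) {
    float total = 0;
    for (int i = 0; i < n; i++)
        total += weights[i] * dot_product(a, b, n);
    return total;
}
```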

What Inlining Actually Restores

Inlining gives the compiler the callee’s body rather than a contract about it. With the body visible, the compiler performs escape analysis: tracing which pointers inside the function could be observed outside it, which local modifications are confined to the stack frame, and which writes affect memory the caller cares about. For a function like record_hit(i) that increments a counter in a global array, escape analysis would show that the writes go to a specific location unrelated to the cfg pointer, and the compiler would not need to reload from cfg after the call.

This is why inlining improvements are not always explained by vectorization. A loop that processes struct data through small helper functions can see 2x to 3x speedup from inlining even on scalar code without SIMD, purely from eliminating redundant memory loads. The loads are not in the source; they were inserted by the compiler because of what it could not prove about the calls. Inlining gives it the proof.
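A sketch of the record_hit pattern with the body visible (the global counter is a hypothetical detail):

```cpp
struct Config { float threshold; int max_iters; };

long hit_count = 0;

// Defined in the same translation unit: escape analysis shows the only
// write goes to hit_count, which cannot alias *cfg, so cfg->threshold
// stays in a register across the (inlined) call.
inline void record_hit_inline(int) { ++hit_count; }

long process_inline(const float* data, const Config* cfg, int n) {
    for (int i = 0; i < n; i++)
        if (data[i] > cfg->threshold) record_hit_inline(i);
    return hit_count;
}
```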

Agner Fog’s software optimization manuals document the compiler’s behavior under these conditions in detail. The relevant section covers memory operand dependencies and how the compiler conservatively generates memory reads when it cannot prove memory has not been modified. The pattern shows up clearly in annotated assembly.

The restrict Keyword and Its Limits

For the aliasing case between pointer parameters, C99 provides restrict, supported as a compiler extension in C++ by GCC, Clang, and MSVC:

void scale(float* __restrict__ dst,
           const float* __restrict__ src,
           int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * 2.0f;
    }
}

restrict is a programmer promise that the pointer is the sole access path to its memory region during the function’s execution. With this, the compiler can vectorize the loop and avoid generating a runtime overlap check. The System V AMD64 ABI has no built-in way to express this; restrict is extra-ABI information.

But restrict only addresses aliasing between pointer parameters. It does nothing about function calls inside the loop body. Even a call to a function that takes no pointer parameters forces conservative treatment of every pointer-derived value in the caller’s scope, because the compiler cannot see the implementation and cannot rule out writes to globals or to memory reachable through escaped pointers. Inlining resolves this regardless of whether restrict is present, because alias analysis then operates on the combined inlined body.

Diagnosing the Problem

The reloads inserted by conservative alias analysis do not appear in source code and are not directly labeled in most profilers. The approach that surfaces them is examining the compiled output directly.

With GCC, -fopt-info-optimized -fopt-info-missed reports which loop transformations succeeded and which were blocked. Clang provides -Rpass=licm -Rpass-missed=licm specifically for LICM decisions, emitting per-loop notes like:

note: loop not vectorized: call instruction cannot be vectorized
note: LICM: not hoisting load because of call in loop

Inspecting the assembly manually helps for loops that look memory-bound: if a value that was computed before the loop reappears as a memory load (mov rax, [rbp-8] rather than a register reference) following a call instruction, alias analysis was defeated at the call. The fix is inlining the callee, marking it pure or const if its behavior warrants it, or restructuring the loop to move the call outside the iteration.
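The restructuring option can be sketched as batching: buffer the indices locally and make one opaque call after the loop (record_hits is a hypothetical batch interface, stubbed out here so the snippet is runnable):

```cpp
#include <vector>

struct Config { float threshold; int max_iters; };

// Stub for the once-per-batch opaque call.
int recorded = 0;
void record_hits(const int* /*idx*/, int count) { recorded = count; }

void process_batched(const float* data, const Config* cfg, int n) {
    std::vector<int> hits;
    // Buffer hits locally and defer the opaque call until after the
    // loop. (push_back can itself allocate, which is also an opaque
    // call; a fixed-capacity buffer would remove even that.)
    const float threshold = cfg->threshold;
    for (int i = 0; i < n; i++)
        if (data[i] > threshold) hits.push_back(i);
    record_hits(hits.data(), static_cast<int>(hits.size()));
}
```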

Lemire’s example is minimal by design: add3 calling add, pure arithmetic with no pointers, no globals, no memory at stake. The call overhead story is cleanest there. In practice, the code that benefits most from understanding function call costs is the code that works through pointers, reads configuration from structs, updates counters and accumulators across iterations, and calls small helpers that could plausibly be reading or writing anything. For that code, the cycle cost of call and ret is not the limiting factor. The limiting factor is how many times the compiler decided it had to go back to memory because it could not prove otherwise.
