The established case for inlining is well-documented. Daniel Lemire’s recent writeup shows it cleanly: the compiler can eliminate the call/return pair entirely, merge the callee’s body into the surrounding code, and then apply optimizations across what were previously separate pieces. On x86-64, a direct function call that the CPU’s return address stack predictor handles correctly costs around 4 to 8 cycles in isolation, but that cycle count understates the real benefit of inlining. A call boundary also blocks auto-vectorization, constant propagation, and alias analysis. Removing the call is the price of admission for those deeper optimizations.
That framing is correct, and it describes one direction of the tradeoff. What gets less attention is the other direction: inlining copies function bodies into call sites, and doing that at scale produces binaries that are larger and harder for the instruction cache to hold.
What Happens to the Binary
When the compiler inlines a function, it does not share the function body across call sites. It duplicates it. A 30-instruction function called from 20 places produces 600 instructions in the binary if inlined everywhere, versus 30 instructions plus 20 call/return pairs if kept out of line. The multiplier is the number of call sites.
For trivial functions, this rarely matters. add(x, y) expands to a handful of instructions inline, and the growth is negligible. The problem emerges with functions that are moderately large, called from many sites, and not on the most critical path.
Consider a validation or bounds-checking function called at the start of many methods:
#include &lt;stdexcept&gt;

[[gnu::always_inline]] inline
void check_range(int val, int min, int max) {
    if (val < min || val > max)
        throw std::out_of_range("value out of range");
}
If this is force-inlined into 50 call sites, each caller contains the comparison, the branch, the exception materialization path, and the string work that follows. The out-of-range path will almost never execute, but it still occupies instruction memory at every call site.
I-Cache Pressure and Why It Compounds
A modern x86 core has a 32KB L1 instruction cache (on most desktop and server designs) and a dedicated decoded instruction cache, the uop cache on Intel designs, that holds around 1,500 micro-ops. The uop cache is particularly relevant: code that fits in it executes without re-decoding each iteration. Code that overflows it must be re-fetched and decoded from L1 or beyond.
When aggressively inlined code grows the hot code footprint beyond what fits in these caches, the CPU spends cycles on instruction fetches and decodes it would otherwise skip. An L1 instruction cache miss costs roughly 10 to 15 cycles. On a loop running millions of iterations, even one cache miss per few hundred iterations can dominate everything else in the profile.
The Linux kernel makes deliberate use of noinline for exactly this reason. WARN_ONCE, might_sleep, and similar diagnostic helpers are marked __attribute__((noinline)) so the warning infrastructure stays out of the hot path instruction footprint even when called from hot functions. The kernel coding style explicitly discourages the inline keyword in .c files for anything other than trivially small accessors.
Firefox and Chromium have both published post-mortems on binary size reduction where cutting inlining improved performance on memory-constrained devices, not because call overhead was a problem, but because more code fit in cache.
The Compiler’s Blind Spot
The static inlining heuristic in GCC and Clang weighs estimated code size growth against estimated speed gain, reasoning locally at each call site. GCC's default inline limit at -O2 is roughly 30 to 40 GIMPLE instructions. Clang's is around 225 abstract cost units. Both will inline past those thresholds when forced with __attribute__((always_inline)); MSVC's equivalent is __forceinline.
Neither threshold accounts for what the rest of the compilation unit is doing to the instruction footprint. The compiler does not reason about whether inlining this function at ten call sites, combined with fifteen other inlining decisions nearby, collectively evicts the core loop from the uop cache. That analysis requires profiling, not static analysis.
This is why profile-guided optimization helps with inlining decisions but does not fully solve the cache footprint problem. PGO tells the compiler which call sites are hot and raises their inlining threshold; it does not tell the compiler that inlining ten things into the same hot function is causing uop cache evictions.
Hot/Cold Splitting as the Alternative
The technique that addresses this directly is hot/cold splitting: instead of inlining the rarely-taken branch, move it out of line and leave only a call at the call site. GCC and Clang both support __builtin_expect as a hint, and with -O2 or higher they will attempt to lay out cold paths in a separate code region.
void handle_error(int);  // out-of-line cold handler, declared for completeness

void validate(int x) {
    // either hint works here: [[unlikely]] (C++20) or __builtin_expect(x < 0, 0)
    if (x < 0) [[unlikely]] {
        handle_error(x);  // cold path, moved out of line
    }
    // hot path continues inline
}
With [[unlikely]] (C++20) or __builtin_expect(..., 0), the compiler moves the cold branch out of the straight-line hot path. The hot path stays compact and cache-friendly; the cold path lives elsewhere and only occupies I-cache when it runs.
For functions where the cold path is non-trivial, marking the handler with [[gnu::cold]] or __attribute__((cold)) is more direct: it tells the compiler to treat the entire function as cold, lay it out late in the binary, and optimize it for size rather than speed. The error-handling tier of many large programs, the assertion handlers, the fallback allocators, the diagnostic reporters, benefits from this treatment uniformly.
Measuring Which Side You’re On
The practical question is whether a given performance problem is call overhead or I-cache pressure. On Linux, the measurement is:
perf stat -e cycles,instructions,L1-icache-misses,LLC-load-misses ./binary
A high L1-icache-misses count relative to instructions retired points toward instruction cache pressure. Reducing inlining on nearby hot functions and re-running will show whether the miss rate drops. A low icache miss rate alongside a high cycle count points the other way, toward call overhead or the optimizations that call boundaries block, which is the direction where inlining helps.
For finer-grained attribution, perf record with call graph capture and perf annotate will show where cycles accumulate. Compiler Explorer makes it easy to compare the code size of inlined versus non-inlined versions: toggling between -O2, -O3, and -O2 -fno-inline-functions immediately shows how much the compiler expands a given function and what assembly it produces.
Agner Fog’s optimization manuals remain the reference for instruction throughput and latency numbers on specific microarchitectures, and his instruction tables give exact cycle costs for the call and return instructions across CPU generations.
Reaching for [[noinline]]
[[gnu::noinline]] on GCC and Clang, or __declspec(noinline) on MSVC, suppresses inlining at a specific function. The C++ standard has no portable attribute for this, which is a gap given how useful the hint is in practice. These attributes are appropriate in a few clear cases: functions on the cold path that would bloat hot callers, functions called from many sites where a shared out-of-line copy is better for cache, and diagnostic or instrumentation functions that should never pollute the critical path.
The counterpart, [[gnu::always_inline]] combined with inline, forces inlining even when the compiler’s heuristic says no. This is appropriate for genuinely trivial functions where the call overhead and the optimization barrier both matter, and where measurement confirms the compiler’s default decision is wrong. Using it broadly as a performance hint without measurement is how you end up with a binary that is 20% larger and no faster.
Lemire’s article makes the right foundational point: compilers rely on inlining to remove overhead and unlock deeper optimizations, and that mechanism is worth understanding. The discipline is in recognizing that inlining has its own cost, that the compiler’s heuristics are local and do not account for the global cache picture, and that the right tool for cold paths is often explicit out-of-lining rather than hoping the compiler guesses correctly. The profiler tells you which problem you actually have.