Daniel Lemire’s writeup on function call cost opens with a clean example: add3 calling add versus add3 doing the arithmetic inline. With inlining disabled, the per-iteration overhead of the call mechanics dominates a trivial loop body. With inlining, the compiler reduces the whole thing to a counter increment. The point is well-made.
But the example involves one function and one call site in the same translation unit. That is the easy case. What makes this principle worth thinking about seriously is that it applies at four distinct scales in a real codebase, each with different tools and different trade-offs, and the scales that affect most programs most of the time are not the one the article focuses on.
The Same Problem at Different Scales
At the call-site level, the story is as Lemire describes. The compiler compiling a tight loop cannot vectorize across a call boundary because it cannot prove the callee is pure, cannot prove it does not alias the loop’s arrays, and cannot map the ABI’s scalar, one-argument-at-a-time calling convention onto eight-wide AVX2 instructions. Inlining eliminates the boundary, and the vectorizer can proceed. The gain from vectorization on AVX2 is roughly 8x throughput on float arrays; the call overhead itself is 4 to 8 cycles, a secondary concern.
The tools here are __attribute__((always_inline)) on GCC and Clang, __forceinline on MSVC, and for hot entry points that call many small helpers, GCC’s __attribute__((flatten)), which inlines everything the marked function calls:
__attribute__((flatten))
void process_pixels(Pixel* dst, const Pixel* src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = tone_map(gamma_correct(to_linear(src[i])));
}
This saves annotating tone_map, gamma_correct, and to_linear individually and presents the vectorizer with a single loop body. Clang gained partial support for flatten in LLVM 16; MSVC has no equivalent.
At the translation-unit level, the barrier is the C++ compilation model. A function defined in math.cpp is entirely invisible to the compiler processing render.cpp. The optimizer sees a declaration and nothing else; it must assume the worst about what the function does. This is the default state of any project that organizes code across multiple source files, which is every non-trivial C++ project.
The fix is Link-Time Optimization. Clang’s ThinLTO is the practical choice: it stores lightweight per-function summaries in object files and at link time identifies cross-module inlining opportunities, imports only the necessary IR, and compiles modules in parallel. The incremental build story is better than full LTO:
# ThinLTO build
clang++ -O2 -flto=thin -Wl,--thinlto-cache-dir=/tmp/thinlto-cache -o app *.cpp
# GCC full LTO with parallel compilation
g++ -O2 -flto=auto -o app *.cpp
Google, Meta, and Apple deploy ThinLTO in production and report 5 to 15 percent performance gains on large C++ codebases, nearly all from cross-TU inlining and the dead code elimination it makes possible. Link time increases by 20 to 50 percent, which is a real cost but generally acceptable for release builds.
At the library boundary, the problem becomes structural. A compiled static library is a set of object files. Without LTO, an application linking against it has no more visibility into library internals than it does into a separately-compiled translation unit. Adding -flto=thin to your application build does not help if the library was not also compiled with ThinLTO; the summaries must exist in both.
The workaround the C++ ecosystem settled on long before LTO was practical is header-only libraries: put the implementation in the header, force every consumer’s translation unit to compile it directly. Eigen, the linear algebra library used throughout scientific computing and machine learning, is entirely header-only. The matrix operation kernels are templates, which means every consumer compiles its own instantiation with full visibility into the implementation. The compiler can specialize, inline, and vectorize for the specific matrix dimensions and element types each user needs. A pre-compiled library could not provide this; the optimizer would see only an extern call.
Abseil, Google’s foundational C++ library, is predominantly header-only for the same reason. The core containers, string utilities, and synchronization primitives are all available with full source visibility so consumers’ compilers can inline and optimize across their use.
Templates are a natural mechanism for this. A template function is, by definition, defined in a header and instantiated per concrete type, giving every translation unit a visible body to work with. This is why performance-critical generic C++ code uses static dispatch through templates rather than virtual interfaces, not primarily to avoid the vtable indirection (which adds roughly 5 to 10 cycles over a direct call) but to guarantee the implementation is visible and inlineable at every call site.
At the shared library boundary, the situation is the most constrained. A .so or .dll is an ABI contract: function signatures are fixed, the internals are compiled separately, and cross-boundary inlining is impossible regardless of LTO settings. Every call across a shared library boundary is an opaque call as far as the optimizer is concerned. Profile-guided optimization stops at this boundary. Constant propagation stops here. If a hot inner loop calls into a shared library, it will be scalar.
This is why high-performance systems components almost universally use static linking for the critical path. It is not about dynamic loading overhead; it is about keeping the optimizer’s visibility intact.
How This Shapes Real Projects
simdjson, the high-performance JSON parser, takes the call-site approach to its logical conclusion. The project defines:
#define really_inline __attribute__((always_inline)) inline
This macro appears on essentially every function in the hot parsing path. The design intent, documented in the simdjson paper, is to present the compiler with a single contiguous body for the entire parsing loop, so SIMD can be applied throughout. simdjson achieves 2.5 to 3.5 GB/s JSON parsing throughput; implementations built on conventional function decomposition typically land around 0.5 GB/s. The algorithm is the same; the difference is compiler visibility.
LLVM’s own internals reflect a different version of the same concern. The StringRef and small container utilities in LLVM’s ADT are header-only, making them inlineable wherever they are used in LLVM’s enormous codebase. The codebase also uses ThinLTO for its release builds, which gives the optimizer cross-TU inlining across the parts that are not header-only.
Rust handles this with a more explicit model. Cross-crate inlining requires #[inline] on the function (roughly analogous to C++’s inline keyword, but treated more seriously as an optimization hint), #[inline(always)] for forced inlining, or -C lto=thin for link-time optimization. Generic functions are automatically monomorphized per type, giving each instantiation full compiler visibility, which is why idiomatic performance-critical Rust uses static dispatch through generics rather than dyn Trait in hot paths. dyn Trait is vtable dispatch, the Rust equivalent of a virtual call, and the optimizer cannot inline through it.
The Binary Size Counter-Argument
Aggressive inlining increases binary size. A function inlined at 50 call sites appears 50 times in the instruction stream. For a function of any real size, this expands the instruction footprint of the hot path and can evict other hot code from the L1 instruction cache. At 32KB, the L1i cache holds 512 64-byte cache lines; L1i misses cost 10 to 15 cycles each. A loop that looks correct but shows high frontend_bound stalls in perf stat and elevated L1-icache-load-misses counts may have been over-inlined.
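A quick way to check for this, assuming Linux perf is available (exact event names vary by kernel version and CPU):

```shell
# Count cycles, instructions, and L1 instruction-cache misses for the workload.
perf stat -e cycles,instructions,L1-icache-load-misses ./app

# Attribute icache misses to functions to find the over-inlined region.
perf record -e L1-icache-load-misses ./app && perf report
```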
The useful complement to forced inlining is [[gnu::noinline, gnu::cold]] on error and cold paths:
[[gnu::noinline, gnu::cold]]
void throw_out_of_range(size_t requested, size_t limit);
The cold attribute places the function in the .text.cold ELF section, physically separated from the hot code region, and biases branch prediction to treat calls to it as unlikely. The hot path’s instruction footprint shrinks; the cold path stays callable. This is the direct counterpart to always_inline on hot functions: the combination of pushing hot code together and cold code away is what keeps the instruction cache effective.
What to Actually Do
The Lemire example is a clear illustration with the simplest possible case. In a real project, the decision tree is broader.
For individual hot functions in the same translation unit, check whether the compiler inlines automatically at -O2 before reaching for always_inline. Clang’s -Rpass=inline and -Rpass-missed=inline report every inlining decision and every missed opportunity, with a reason. GCC’s -fopt-info-inline-missed does the same. These reports are verbose, but targeted use during a performance investigation will surface unexpected inlining failures.
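A typical invocation during such an investigation (the file name is a placeholder):

```shell
# Clang: report both successful and missed inlining decisions for one hot file.
clang++ -O2 -c render.cpp -Rpass=inline -Rpass-missed=inline

# GCC equivalent, missed opportunities only.
g++ -O2 -c render.cpp -fopt-info-inline-missed
```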
For cross-file boundaries, enable ThinLTO on release builds. The performance gains are consistent and the build time cost is manageable with caching. Make sure libraries in the critical path are also compiled with ThinLTO, or make their hot implementations available as header code.
For library design, prefer header-only for components that are likely to appear in tight loops. Templates are naturally header-only and give you per-instantiation optimization for free. Virtual dispatch is the right tool for extensibility at runtime; it is the wrong tool for tight inner loops because it is an optimization wall by construction.
For shared libraries on the critical path, static link them if you control the build. If you do not control it, the performance ceiling for that path is scalar, single-element-per-call throughput, and the only way past it is to restructure the code to not cross that boundary in the hot loop.
The overhead of a function call is a real number, and Lemire measures it clearly. But the more important question is whether the optimizer can see past the call into the operations behind it. Where the answer is no, the ceiling on performance is set before the first benchmark runs.