
Where the Compiler's Inlining Heuristics Break Down and How PGO Fixes Them

Source: isocpp

Why the Static Heuristic Is a Guess

Daniel Lemire’s analysis of function call overhead on isocpp.org makes the case that inlining eliminates call overhead and unlocks downstream optimizations like auto-vectorization. The mechanism is correct. What the framing leaves implicit is the quality of the decisions compilers make about when to inline, and how badly those decisions can misfire without profile data.

GCC’s auto-inliner at -O2 applies a threshold of roughly 30 to 40 GIMPLE instructions. Clang targets 225 abstract cost units at the same optimization level. Any function exceeding these limits does not get inlined at call sites, regardless of how frequently those call sites execute. A function called once during startup and a function called ten million times per second in a server’s inner request loop look identical to the compiler if their instruction counts are similar. Both fail to inline at the same threshold.

The asymmetry has real consequences. Failing to inline a cold path costs a handful of cycles at initialization. Failing to inline a hot inner loop can block auto-vectorization entirely, turning an 8-wide AVX2 loop into a scalar loop and cutting throughput by 5 to 30x. Agner Fog’s instruction tables document the raw mechanics of a call; what a per-call measurement misses is precisely this vectorization loss. And the compiler has no way to distinguish the two scenarios from code structure alone.

What Profile-Guided Optimization Provides

Profile-Guided Optimization changes the inputs to inlining decisions. PGO instruments the binary, runs it against a representative workload, and feeds call frequency data back into the inlining cost model for the final build. Hot call sites get elevated thresholds; cold ones get normal or lower priority. The compiler is no longer guessing at which loops are inner loops.

The three-phase workflow with Clang:

# Phase 1: instrument
clang++ -O2 -fprofile-instr-generate -o program_instrumented *.cpp

# Phase 2: collect profile data
./program_instrumented representative_input
llvm-profdata merge -output=default.profdata default.profraw

# Phase 3: optimized build using profile
clang++ -O2 -fprofile-instr-use=default.profdata -o program_optimized *.cpp

With GCC:

g++ -O2 -fprofile-generate -o program_instrumented *.cpp
./program_instrumented representative_input
g++ -O2 -fprofile-use -fprofile-correction -o program_optimized *.cpp

The instrumented binary runs 10 to 30 percent slower due to profiling overhead. The final optimized binary carries no instrumentation cost; the profile data only affects compilation decisions.

More Than Inlining

PGO data informs several optimizations beyond inlining decisions.

Branch layout changes based on measured branch frequencies. The compiler arranges the more probable path as the fall-through branch, avoiding branch prediction penalties and keeping the hot path in a contiguous instruction stream. Cold branches take a jump.

Function ordering places frequently called functions near their callers in the binary layout. If two functions frequently call each other but sit far apart in the text segment, each crossing between them risks an I-cache or iTLB miss. PGO-driven layout brings them together.

Indirect call promotion matters for any code using virtual dispatch or function pointers. At call sites where profiling shows 95 percent of dispatches going to the same concrete target, the compiler inserts a check and a direct call for that common case. The direct call is then eligible for inlining; the virtual dispatch path handles the remainder. This is speculative devirtualization backed by profile data rather than type analysis.

Cold/hot splitting moves rarely-executed basic blocks to cold sections of the binary (.text.cold on ELF targets). Error paths, rarely-triggered branches, and initialization code that runs once get separated from the hot path. The instruction cache footprint of the hot path decreases accordingly.

Composing PGO with ThinLTO

ThinLTO and PGO address different visibility problems. ThinLTO extends the compiler’s optimization horizon across translation unit boundaries by embedding LLVM IR in object files and deferring cross-module inlining to link time. PGO extends the compiler’s knowledge from code structure to measured runtime behavior.

They compose:

# Two-phase PGO + ThinLTO build
clang++ -O2 -flto=thin -fprofile-instr-generate -o instrumented *.cpp
./instrumented representative_workload
llvm-profdata merge -output=default.profdata default.profraw
clang++ -O2 -flto=thin -fprofile-instr-use=default.profdata \
  -Wl,--thinlto-cache-dir=/tmp/thinlto-cache -o optimized *.cpp

The ThinLTO cache directory preserves per-module IR across rebuilds, making incremental builds viable. Without the cache, each ThinLTO link reruns the full cross-module analysis.

Chromium uses PGO plus ThinLTO and reports 10 to 15 percent improvement over plain -O2. The marginal gain from adding PGO to a ThinLTO build is typically 5 to 8 percent on large codebases, concentrated on the call sites that static heuristics guessed incorrectly. CMake surfaces both:

set(CMAKE_INTERPROCEDURAL_OPTIMIZATION ON)  # enables LTO (ThinLTO with Clang)
target_compile_options(mylib PRIVATE -fprofile-instr-use=${PGO_PROFILE_PATH})

Manual PGO: What simdjson Is Doing

The simdjson JSON parser, from Lemire and colleagues, defines:

#define really_inline __attribute__((always_inline)) inline

This macro appears on essentially every function in the hot parsing path, forcing the compiler to inline without any profile data. The authors identified the hot path by profiling and then annotated it directly. simdjson achieves 2.5 to 3.5 GB/s parsing throughput; conventional parsers run at roughly 0.5 GB/s.

This is the manual version of what PGO automates. The developer profiled externally, identified hot paths, and applied always_inline to force the compiler’s decisions. Automated PGO handles the identification step without requiring developers to read profiles and annotate source code. For codebases with stable hot paths and known workloads, manual annotation can be more surgical. For larger codebases, or code where hot paths shift across workloads, automated PGO scales better and requires no ongoing maintenance as the code evolves.

Sampling-Based Alternatives

Instrumentation-based PGO requires an instrumented build and offline representative workloads. When neither is practical, sampling-based profiling via perf record -b on Linux, on hardware with last-branch-record (LBR) support, can feed profile data into BOLT, a post-link optimizer.

BOLT operates on an already-compiled binary without a recompile. It applies function and block ordering based on perf data, achieving 5 to 10 percent improvements on layout-sensitive workloads. It does not revisit most inlining decisions, which are fixed at compile time, so it covers less optimization surface than full PGO. For continuously deployed services where rebuilding for profile data is operationally difficult, it is a practical option.

Google’s Propeller project integrates sampling profile data directly into the LLVM compilation pipeline, allowing inlining decisions and other compile-time optimizations to benefit from production profile data. The integration is more involved than BOLT but covers the full optimization surface that PGO covers.

When PGO Adds Little

PGO requires representative workloads. A profile collected from an initialization-heavy startup sequence tells the compiler nothing useful about steady-state request handling. Programs where the hot path shifts substantially between workloads may see inconsistent gains or regressions if the profile data does not match production traffic.

For programs where default inlining heuristics already make correct decisions, typically small programs with hot paths fully within a single translation unit and functions comfortably under the inlining threshold, PGO adds marginal value. The infrastructure cost of maintaining an instrumented build and a profile collection pipeline is not zero.

The payoff is proportional to how badly static heuristics misallocate optimization budget. Lemire’s add/add3 example is small enough that any compiler inlines it at -O2 without any profile data. The scenarios where PGO earns its complexity are larger functions near the inlining threshold, hot paths spanning translation unit boundaries, or virtual dispatch at high-frequency call sites.

The Full Picture

The cost of a function call is not fixed by instruction counts alone. The compiler’s inlining decision determines whether call overhead materializes, and that decision quality depends on what the compiler knows about call frequency. LTO extends visibility across translation unit boundaries; PGO extends visibility across the compile/run boundary.

Starting from Lemire’s point that function calls are cheap but not free, the measurement-driven version of that insight asks which call sites are executed often enough to warrant the compiler’s full optimization attention. PGO answers that question with data rather than structural heuristics, which is why production-quality C++ builds at scale almost always include it.
