Speculative Inlining and the Information C++ Doesn't Have at Compile Time
Source: isocpp
Daniel Lemire’s piece on the cost of a function call makes a clean point: function call overhead in C++ is less about the raw CALL/RET cycles and more about the optimization barrier the call boundary creates. Once the compiler cannot see inside a callee, it cannot vectorize, cannot constant-fold, cannot eliminate dead branches. Inlining removes the barrier.
That framing is correct for C++. But C++ is one answer to the problem, and it is constrained by what the compiler knows at build time. The JVM has a different answer, and it operates on information C++ compilers simply do not have.
The Static Inlining Ceiling
C++ inlines at compile time. The compiler sees the source, estimates whether inlining a callee will pay off based on function body size and loop nesting depth at the call site, and makes a permanent decision baked into the binary. Link-Time Optimization extends this to cross-translation-unit inlining, but the decision still happens before any real workload runs.
The fundamental constraint is that the compiler does not know which code paths will run hot at runtime, how many distinct types will appear at a virtual call site, or whether a branch guarded by a runtime condition will be taken 99.9% of the time. Profile-guided optimization improves those estimates, but the binary is fixed once compilation ends.
Rust has the same constraint, but its #[inline] attribute serves a different purpose than C++’s inline keyword. In Rust, #[inline] means: embed this function’s body (as MIR, the compiler’s mid-level IR) in the crate metadata so downstream crates can inline it without LTO. Without it, functions in one crate are opaque to callers in another at the LLVM level, because each crate compiles to its own object file. The Rust standard library aggressively annotates hot functions with #[inline] precisely because library consumers cannot be assumed to have LTO enabled.
// Without #[inline], callers in other crates cannot inline this without LTO
pub fn square(x: f32) -> f32 { x * x }
// With #[inline], IR is embedded in the .rlib for cross-crate inlining
#[inline]
pub fn square(x: f32) -> f32 { x * x }
Rust generics add something structurally distinct from inlining: monomorphization. When you write a generic function, the compiler generates a separate, fully specialized copy for each concrete type it is called with. max::<i32> and max::<String> become separate functions in the binary, each optimizable for its specific type. This is not inlining; the function call boundaries still exist unless LLVM subsequently inlines them. But it gives the optimizer type-specific visibility that dynamic dispatch eliminates. The cost is binary size and compile time, two things anyone who has waited on a Rust build knows well.
Go sits at the pragmatic end of the spectrum. Its inliner uses a budget measured in weighted AST node cost, roughly 80 units, and makes purely static decisions based on function structure. Functions containing defer, goroutine spawns, or select statements are not inlineable regardless of size. Go 1.12 introduced mid-stack inlining, which allowed functions that call other functions to be inlined; before that, only leaf functions qualified, which excluded most real Go code. Go 1.21 added PGO support, letting the inliner exceed its normal budget for call sites that appear hot in a pprof profile. The -gcflags="-m" build flag prints exactly which calls were inlined and why others were not.
What the JVM Does Differently
HotSpot’s C2 compiler inlines at JIT compilation time, after the interpreter and the C1 tier have already collected runtime profiling data. Before C2 compiles a method, it has a method data object (MDO) recording invocation counts, branch taken/not-taken frequencies, and at virtual call sites, a histogram of the concrete receiver types that have actually appeared in production.
The default inlining threshold for small methods is -XX:MaxInlineSize=35 (35 bytecode bytes). For frequently-called methods, that rises to -XX:FreqInlineSize=325. But the more consequential feature is what C2 does with type information.
At a virtual call site where only one concrete type has ever been observed, C2 inlines the body of that specific method and emits a type-check guard:
if (receiver.klass == ExpectedType) {
    // inlined body of that concrete method
} else {
    // uncommon trap → deoptimize
}
If a second receiver type later appears and the guard fails, the current compiled frame is deoptimized back to interpreter mode. The MDO is updated, C2 eventually recompiles with both types known, and emits two guards with two inlined bodies. This is a bimorphic inline cache.
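The transition from monomorphic to bimorphic can be sketched with a small example (the Shape hierarchy below is hypothetical; the guard insertion itself happens inside C2 and is only observable through diagnostics like -XX:+PrintInlining, not in the source):

```java
// Hypothetical hierarchy: a virtual call site C2 can profile.
interface Shape { double area(); }

final class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

final class Square implements Shape {
    private final double s;
    Square(double s) { this.s = s; }
    public double area() { return s * s; }
}

public class Bimorphic {
    // While only Circle instances flow through here, the call site profiles
    // as monomorphic and C2 inlines Circle.area() behind one type guard.
    // The first Square to arrive fails the guard and deoptimizes the frame;
    // a later recompile emits two guards with two inlined bodies.
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area(); // the profiled virtual call site
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        System.out.println(totalArea(shapes)); // Math.PI + 4.0
    }
}
```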
The deoptimization mechanism is what enables this approach. C2 can make aggressive inlining assumptions based on observed behavior, knowing it can roll back if those assumptions prove wrong. A C++ compiler has no such fallback; its decisions are irrevocable once the binary ships.
HotSpot also performs Class Hierarchy Analysis (CHA): if a virtual method has only one concrete implementation in the currently loaded class set, C2 inlines it unconditionally with no guard at all. If a new subclass is later loaded that overrides the method, all compiled code relying on that assumption is invalidated and deoptimized. The JVM’s ability to unload and recompile code mid-execution is the capability C++ simply cannot replicate with static compilation.
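A minimal sketch of the CHA case, with hypothetical class names (the devirtualization is invisible in source; the point is the shape of the hierarchy):

```java
// Handler is open for extension, but as long as LoggingHandler is the only
// subclass the classloader has seen, CHA proves handle() has exactly one
// possible target, and C2 can inline it with no type guard at all.
abstract class Handler { abstract int handle(int x); }

class LoggingHandler extends Handler {
    int handle(int x) { return x + 1; }
}

public class ChaDemo {
    static int dispatch(Handler h, int x) {
        return h.handle(x); // devirtualized by CHA while one target exists
    }

    public static void main(String[] args) {
        System.out.println(dispatch(new LoggingHandler(), 41)); // 42
        // If a second Handler subclass were loaded later (even dynamically,
        // via a classloader), the JVM would invalidate this compiled code
        // and recompile with a guard or a true virtual dispatch.
    }
}
```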
Escape analysis is the downstream example of what this enables. After inlining, if C2 determines that an object allocated inside an inlined callee never escapes the caller’s scope, it can allocate the object on the stack (scalar replacement), eliminating heap allocation and GC pressure entirely. That optimization is only visible once the call boundary is gone, the same pattern as C++ inlining enabling auto-vectorization.
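A sketch of the pattern (hypothetical Point helper; whether scalar replacement actually fires depends on the JIT tier and can be checked with -XX:+PrintEscapeAnalysis diagnostics):

```java
public class EscapeDemo {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // The Point never leaves this frame. Once C2 has inlined everything at
    // the call site, escape analysis can prove that and scalar-replace the
    // object: the heap allocation becomes two local doubles, no GC pressure.
    static double distanceSquared(double x, double y) {
        Point p = new Point(x, y);    // candidate for scalar replacement
        return p.x * p.x + p.y * p.y; // p is never stored or returned
    }

    public static void main(String[] args) {
        System.out.println(distanceSquared(3.0, 4.0)); // 25.0
    }
}
```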
V8 and the Megamorphic Cliff
V8’s TurboFan JIT works along similar lines. Ignition and Sparkplug collect feedback through inline caches at every call site and property access. Each IC transitions through states: uninitialized, monomorphic (one shape seen), polymorphic (two to four shapes), or megamorphic (more than four shapes).
TurboFan reads this feedback when making inlining decisions: a monomorphic call site gets the full treatment, property accesses become fixed-offset loads, and the callee’s body gets inlined after a map check verifying the object’s hidden class. A megamorphic site is effectively abandoned by the optimizer. TurboFan marks it as “do not attempt type specialization,” and you lose not just inlining but all downstream type-based optimization.
The megamorphic cliff is steep. JavaScript code that mixes object shapes in a hot path falls off it, and the performance gap between monomorphic and megamorphic code in a tight loop can exceed an order of magnitude. Java’s static type system makes it much harder to accidentally create polymorphic hot paths; JavaScript’s dynamic nature makes it easy.
What This Reveals About the Trade-offs
These four approaches reflect different bets about when optimization should happen and what information will be available.
C++ and Rust bet on compile-time information. The optimizer sees all the source, makes permanent decisions, and ships a fixed binary. This produces predictable performance with no runtime overhead, but the optimizer is blind to actual execution patterns unless PGO data is fed back.
The JVM bets on runtime information. It accepts the overhead of an interpreter and profiling infrastructure in exchange for optimization decisions made on observed behavior. Speculative inlining with deoptimization is the key mechanism: inline aggressively based on what you have seen, and back out if the assumption fails. For long-running server processes with stable type profiles, this can outperform static compilation, because the JIT eventually reaches a highly optimized steady state.
Go bets on simplicity. A budget-based static inliner with good defaults covers most cases without requiring programmers to annotate anything. The //go:noinline directive exists but is almost exclusively used in the runtime and in benchmarks. Go’s PGO extension preserves the simple mental model while adding a feedback channel for programs where the static heuristic misses.
For the performance-critical C++ work that Lemire’s article addresses, the practical consequence is: inlining is a compile-time decision with no fallback, so you need to give the compiler visibility. That means keeping hot inner functions visible to the optimizer (same TU or LTO), using __attribute__((always_inline)) or #[inline(always)] for functions that must be inlined to unlock SIMD or constant-folding, and annotating cold error paths with [[gnu::cold]] to protect the instruction cache footprint of the hot paths.
For JVM work, the levers are different: keep virtual call sites monomorphic in hot loops, understand that CHA often devirtualizes virtual dispatch in well-structured code without any annotation, and trust that a long-running JVM process will eventually reach a better optimization state than a cold start, because it has observed more actual behavior.
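One concrete way to keep a hot call site monomorphic is to partition mixed data by type so each loop observes a single receiver (hypothetical Shape types; a sketch of the refactor, not a prescription):

```java
import java.util.List;

public class Monomorphic {
    interface Shape { double area(); }
    record Circle(double r) implements Shape {
        public double area() { return Math.PI * r * r; }
    }
    record Rect(double w, double h) implements Shape {
        public double area() { return w * h; }
    }

    // Polymorphic version: the single area() call site sees both types,
    // so the JIT needs (at best) a bimorphic dispatch in the hot loop.
    static double mixed(List<Shape> shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area();
        return sum;
    }

    // Split version: each loop's call site observes exactly one receiver
    // type, so each can be inlined behind a single cheap guard (or, since
    // records are final, devirtualized outright).
    static double split(List<Circle> circles, List<Rect> rects) {
        double sum = 0;
        for (Circle c : circles) sum += c.area();
        for (Rect r : rects) sum += r.area();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(split(List.of(new Circle(1)), List.of(new Rect(2, 3))));
    }
}
```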
The underlying problem is the same everywhere. Call boundaries hide information from the optimizer. Each of these runtimes has a different opinion about when that information becomes available, and they build their optimization strategies around that opinion.