
Inlining Across Boundaries: How C++, Rust, and the JVM Solve the Same Problem Differently

Source: isocpp

Daniel Lemire’s post, linked through ISOcpp, demonstrates the basic mechanics well. A function call is not free. On x86-64, the CALL instruction pushes a return address, the callee saves any non-volatile registers it uses, executes, and then RET pops the return address and jumps back to it. Agner Fog’s instruction tables put CALL and RET at roughly 3 cycles each on Skylake; add prologue and epilogue work and the round trip runs 6 to 10 cycles for a trivial function. In a loop over a billion elements, that is 6 to 10 billion cycles, or roughly 2 to 3 seconds of pure call overhead on a 3 GHz core.

But the cycle count is not the main issue. The more consequential cost is that a call to a function the compiler cannot see into is an optimization barrier. The compiler must assume the callee can read or modify any globally reachable memory and has arbitrary side effects, which invalidates much of the alias analysis the optimizer has built up. The loop body becomes opaque. Auto-vectorizers, which convert scalar loops into SIMD instructions processing 8 or 16 elements per instruction, cannot work across that boundary: they need to see all memory accesses in the loop to prove vectorization is safe.

Remove the barrier by inlining, and the vectorizer can do its job. A loop that calls a noinline float transform generates a scalar loop with a CALL per iteration on GCC at -O3. Inline the same transform, and GCC generates packed AVX code (vfmadd231ps when the target supports FMA, for example with -march=haswell): 8 floats per instruction instead of one. The throughput difference is typically 6 to 10x for this class of loop.
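A minimal sketch of that comparison, assuming GCC or Clang. The function names and the constants in the transform are hypothetical, chosen only for illustration. Compiling with something like -O3 -march=haswell and reading the assembly should show a call in one loop and packed instructions in the other, though the exact output depends on compiler version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical transform: body visible to the optimizer at every call site.
inline float scale_and_offset(float x) { return x * 1.5f + 2.0f; }

// Same body, but inlining is explicitly blocked (GCC/Clang attribute).
__attribute__((noinline)) float scale_and_offset_opaque(float x) {
    return x * 1.5f + 2.0f;
}

// Candidate for auto-vectorization once the transform is inlined.
void transform_inlined(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = scale_and_offset(in[i]);
    }
}

// Expected to stay scalar, with one call per element.
void transform_opaque(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = scale_and_offset_opaque(in[i]);
    }
}
```

Both loops compute the same values; only the generated code should differ.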

That is the well-known story. The less-discussed part is what happens when the call you want to inline lives in a different translation unit, a different crate, or a different class hierarchy.

What the inline Keyword Actually Does

The C++ inline keyword began as a compiler hint. The idea was that the programmer, knowing the function is small and called in hot paths, could suggest inlining to the compiler. That interpretation is effectively dead. Both GCC and Clang treat inline as a hint at most and are free to ignore it entirely. The keyword nudges the cost model slightly (Clang attaches an inlinehint attribute that raises the threshold, and GCC applies separate size limits to functions declared inline), but it does not decide the outcome.

What the keyword does do is exempt a function from the One Definition Rule. The ODR says a non-inline function may have only one definition in the entire program. Define a function in a header and include that header from ten source files, and you have ten definitions; the linker will typically reject the program with a multiple-definition error. Mark the function inline, and the multiple definitions are permitted, provided they are identical. The linker collapses them.

This is why header-only C++ libraries mark their functions inline. It is why static inline is idiomatic in C headers. The inline keyword is a linkage and ODR mechanism, not a performance directive.

C++17 extended this to variables. Before C++17, putting a non-constexpr global variable in a header required awkward workarounds. With C++17, inline variables work exactly like inline functions:

// Safe in a header included from multiple translation units
inline constexpr double TAU = 6.28318530717958;

All definitions must be identical; the linker produces a single object. C++20 modules address the underlying problem differently by eliminating textual inclusion, but inline variables remain the standard tool for header-only libraries.
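As a sketch of what such a header might look like (the file name units.hpp and every identifier except the TAU value are hypothetical), the fragment below combines an inline constant, a mutable inline variable, and inline functions. It assumes a C++17 compiler.

```cpp
// units.hpp: hypothetical header-only fragment, included from many .cpp files
#ifndef UNITS_HPP
#define UNITS_HPP

// C++17 inline variable: every including translation unit sees this
// definition, yet the program ends up with a single object.
inline constexpr double kTau = 6.28318530717958;

// A mutable global works the same way; before C++17 this needed an extern
// declaration here plus a definition in exactly one .cpp file.
inline int g_conversion_count = 0;

// Inline function: identical definitions in multiple translation units
// are permitted, and the linker keeps one.
inline constexpr double turns_to_radians(double turns) { return turns * kTau; }

// Non-constexpr inline function that touches the shared counter.
inline double counted_turns_to_radians(double turns) {
    ++g_conversion_count;
    return turns_to_radians(turns);
}

#endif  // UNITS_HPP
```

Every translation unit that includes this header observes the same g_conversion_count object, which is the property the inline specifier guarantees.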

For actually forcing inlining, the relevant attribute is __attribute__((always_inline)) on GCC and Clang, spelled [[gnu::always_inline]] or [[clang::always_inline]] in standard attribute syntax. These bypass the cost thresholds, though they cannot help where inlining is impossible, such as a recursive call or a call through a function pointer the compiler cannot resolve. For preventing inlining, use __attribute__((noinline)). The thresholds the compilers actually use are controlled by parameters like GCC’s --param max-inline-insns-single (450 pseudo-instructions in older releases; the default has changed since, so check the documentation for your version) and LLVM’s inline cost model (default threshold of 225 cost units at -O2, rising to 250 at -O3). These are the numbers that matter, not the presence or absence of the inline keyword.
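One common way to use these attributes portably is to hide them behind macros. The sketch below is an assumption about how a project might do that: FORCE_INLINE, NO_INLINE, and both function names are hypothetical, while __forceinline and __declspec(noinline) are the MSVC spellings.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portability macros; the names are not from any library.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#  define NO_INLINE __declspec(noinline)
#elif defined(__GNUC__)  // also defined by Clang
#  define FORCE_INLINE inline __attribute__((always_inline))
#  define NO_INLINE __attribute__((noinline))
#else
#  define FORCE_INLINE inline
#  define NO_INLINE
#endif

// Small helper that should be folded into every caller.
FORCE_INLINE int clamp_to_byte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// Kept out of line so it stays visible as one symbol in profiles;
// the helper call inside the loop is still expected to be inlined.
NO_INLINE long sum_clamped(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += clamp_to_byte(data[i]);
    }
    return total;
}
```

Keeping the outer function out of line while forcing the helper in is a typical pairing when measuring a hot loop, since the loop remains easy to find in a disassembly.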

Rust’s Cross-Crate Problem

Rust compiles crates independently. When building a binary that depends on a library crate, the compiler has access to the library’s public interface (types and function signatures) but, for ordinary non-generic functions, not their bodies. By default, such a function defined in crate A cannot be inlined into crate B’s code, regardless of how small or simple it is. The crate boundary is opaque. Generic functions are the exception: they are monomorphized in the crate that instantiates them, so their bodies are always available downstream.

The #[inline] attribute changes this by serializing the function’s MIR (Mid-level Intermediate Representation) into the compiled crate’s metadata. When a downstream crate is compiled, the inliner finds the MIR and can inline the function as if it were defined locally. The Rust reference describes this as the primary effect: cross-crate availability of the function body.

This is why non-generic methods such as String::push_str carry #[inline] in the standard library. Without the attribute, a call to s.push_str(x) in application code would be a non-inlineable cross-crate call unless LTO is enabled. Generic code such as Vec::push and most of the iterator adapters is instantiated in the calling crate regardless, and there the attribute serves mainly as a hint to the inliner. Either way, once the body is visible, the compiler can see the capacity check and the write, and fold them into the surrounding context.

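#[inline(always)] asks the compiler to inline at every call site, including across crate boundaries; #[inline(never)] asks it not to. The Rust reference describes every form of the attribute as a hint, so these correspond closely, but not exactly, to always_inline and noinline in GCC and Clang.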

Link-Time Optimization removes the dependency on annotations. With Rust’s fat LTO (-C lto=fat) or thin LTO (-C lto=thin), all crate IR is available at link time and the inliner can cross crate boundaries freely. For release builds of performance-sensitive binaries, LTO is the practical solution; #[inline] exists for cases where you need inlining without committing to full LTO, and for library authors who cannot control the LTO settings of their consumers.

Java’s JIT Does It at Runtime

Java approaches the same problem from the opposite direction. The JVM has no notion of a compilation unit boundary at the inlining level. The HotSpot C2 JIT compiler, operating on bytecode after the class loader has resolved the classes involved, inlines freely across class and package boundaries. Methods of up to 35 bytes of bytecode are treated as trivially small and inlined even when they are not hot (-XX:MaxInlineSize); for hot call sites, C2 inlines methods of up to about 325 bytes of bytecode (-XX:FreqInlineSize).

The key advantage Java has here is that inlining decisions are always profile-guided. The JVM only JIT-compiles hot methods; the classic C2 threshold is around 10,000 invocations, and tiered compilation uses its own counters. By the time the JIT runs, it has observed the actual call frequency and the concrete types that appeared at each call site. This enables something C++ compilers rarely achieve without explicit PGO: speculative inlining through virtual call sites, guided by the receiver types actually observed there.

When a virtual call site in Java always dispatches to the same concrete type, the JIT generates an inlined fast path with a type guard. If the type matches (the common case), execution proceeds through the inlined body without a call. If it does not, execution falls back to the virtual dispatch. For monomorphic call sites — one concrete type — this eliminates the virtual call overhead entirely. For bimorphic call sites — two concrete types — the JIT generates two inline guards and two inlined bodies.
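The shape of that fast path can be imitated by hand in C++. This is only an analogue for illustration, not what HotSpot emits: the class names and area_guarded are hypothetical, the guard uses typeid where the JIT compares an internal class pointer, and a real JIT may deoptimize on a guard miss instead of dispatching virtually.

```cpp
#include <cassert>
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square final : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

struct Rect final : Shape {
    double w, h;
    Rect(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

// Hand-written analogue of a monomorphic guarded inline, assuming the
// profile said "this call site almost always sees Square".
double area_guarded(const Shape& s) {
    if (typeid(s) == typeid(Square)) {
        // Guard hit: the qualified call is non-virtual, so the compiler
        // is free to inline Square::area here.
        return static_cast<const Square&>(s).Square::area();
    }
    // Guard miss: ordinary virtual dispatch.
    return s.area();
}

// Usage sketch: the guarded path must agree with plain virtual dispatch.
void check_guarded_dispatch() {
    const Square sq{3.0};
    const Rect rc{2.0, 5.0};
    assert(area_guarded(sq) == sq.area());  // takes the guard-hit branch
    assert(area_guarded(rc) == rc.area());  // takes the fallback branch
}
```

A bimorphic site would add a second guard and a second inlined body before the fallback.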

This is why Java’s virtual dispatch overhead is often lower in practice than C++’s, even though a single C++ virtual call is cheap when measured in isolation. C++ cannot devirtualize unless the compiler can prove the concrete type through static analysis or LTO, or is given a profile to speculate on; Java does it through runtime observation.

Profile-Guided Optimization Closes the Gap

C++ compilers can approximate Java’s profile-guided inlining through PGO. The workflow: compile with -fprofile-generate (GCC) or -fprofile-instr-generate (Clang), run representative workloads, then recompile with the collected profile data using -fprofile-use (GCC) or -fprofile-instr-use (Clang, after merging the raw profiles with llvm-profdata). The compiler replaces heuristic call frequency estimates with actual observed counts, and hot call sites get a large inlining bonus that overrides size thresholds.
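To make the workflow concrete, here is a small sketch with the build commands written as comments. The file names pgo_demo.cpp and main.cpp, the function names, and the workload itself are all hypothetical; main.cpp stands for a driver that calls run_workload with representative arguments. The flags are the ones named above.

```cpp
// pgo_demo.cpp (hypothetical); main.cpp is an assumed driver that calls
// run_workload() with representative arguments.
//
// GCC:
//   g++ -O2 -fprofile-generate pgo_demo.cpp main.cpp -o pgo_demo
//   ./pgo_demo                  # writes .gcda profile files
//   g++ -O2 -fprofile-use pgo_demo.cpp main.cpp -o pgo_demo
//
// Clang:
//   clang++ -O2 -fprofile-instr-generate pgo_demo.cpp main.cpp -o pgo_demo
//   ./pgo_demo                  # writes default.profraw
//   llvm-profdata merge -output=default.profdata default.profraw
//   clang++ -O2 -fprofile-instr-use=default.profdata pgo_demo.cpp main.cpp -o pgo_demo
#include <cassert>
#include <cstdint>

// Rarely taken path; with a profile the compiler can keep it out of line.
std::uint64_t slow_path(std::uint64_t x) {
    return x * 2654435761u % 1000003u;
}

// Hot path: cheap, and a good inlining candidate once the profile
// shows how often it runs.
std::uint64_t fast_path(std::uint64_t x) { return x + 1; }

// Takes the slow path once every cold_period iterations.
std::uint64_t run_workload(std::uint64_t n, std::uint64_t cold_period) {
    assert(cold_period != 0);  // avoid modulo by zero
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        if (i % cold_period == 0) {
            acc += slow_path(i);
        } else {
            acc += fast_path(i);
        }
    }
    return acc;
}
```

Without a profile the compiler has to guess how often each branch is taken; with one, it knows the fast path dominates.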

The performance gains are consistent. Clang’s documentation cites 10 to 25% improvement on CPU-intensive workloads. Google’s internal measurements show 5 to 15% on large production binaries. The gains come partly from better inlining decisions and partly from improved branch prediction (PGO also informs branch layout) and better register allocation on hot paths.

Clang’s ThinLTO, combined with PGO, can inline across translation unit boundaries for hot call sites, approaching what Java’s JIT does at runtime. This is the production configuration for Chrome, Firefox, and many other large-scale C++ deployments. The combination of ThinLTO and PGO is as close as a static compiler gets to a JIT’s ability to observe and adapt.

The Spectre Tax on Virtual Calls

Virtual calls in C++ carry additional overhead after Spectre. A virtual call is an indirect call: load the vptr from the object, load the function pointer from the vtable at the appropriate offset, then call indirectly. The indirect branch predictor (IBP) predicts the target. Spectre variant 2 exploits the IBP to cause speculative execution at an attacker-controlled address, leaking memory through a side channel.

The software mitigation, retpoline, replaces indirect calls with a sequence that prevents speculative execution from using the IBP. A retpolined indirect call on Skylake costs roughly 30 to 50 cycles, versus 5 to 7 for an unmitigated indirect call. Vtable-heavy C++ code on pre-eIBRS hardware pays this tax on every virtual dispatch. Later Intel microarchitectures (Cascade Lake and newer) support eIBRS, which mitigates Spectre v2 in hardware and makes retpolines unnecessary. AMD added Automatic IBRS with Zen 4. But Skylake-era hardware remains common in server deployments, and the overhead is real enough to matter for vtable-heavy workloads.

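Devirtualization, where the compiler converts a virtual call to a direct call and then potentially inlines it, becomes more valuable post-Spectre precisely because it avoids the indirect call entirely. GCC enables devirtualization, including its speculative form (-fdevirtualize-speculatively), at -O2 and higher. Clang devirtualizes when it can prove the dynamic type, and can go further with whole-program vtable optimization (-fwhole-program-vtables, which requires LTO). In both compilers, LTO substantially increases the reach of devirtualization by making concrete type information available across translation units.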
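The simplest way to hand the compiler that proof in source code is the final specifier. In the sketch below the Codec and XorCodec types and both function names are hypothetical. Whether the second function is also devirtualized depends on what the optimizer can see at each call site.

```cpp
#include <cassert>
#include <vector>

struct Codec {
    virtual ~Codec() = default;
    virtual int encode(int v) const = 0;
};

// `final` guarantees that no further override of encode() exists.
struct XorCodec final : Codec {
    int encode(int v) const override { return v ^ 0x5A; }
};

// Static type is the final class, so the call can be resolved at
// compile time and becomes a direct, inlinable call.
int encode_sum_direct(const XorCodec& c, const std::vector<int>& in) {
    int sum = 0;
    for (int v : in) sum += c.encode(v);
    return sum;
}

// Static type is the base class: this stays an indirect call unless the
// optimizer can prove or speculate on the dynamic type.
int encode_sum_virtual(const Codec* c, const std::vector<int>& in) {
    assert(c != nullptr);
    int sum = 0;
    for (int v : in) sum += c->encode(v);
    return sum;
}
```

Both functions return the same result for the same codec; the difference is only in whether an indirect branch survives in the loop.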

Practical Guidance

The lesson from all three ecosystems is similar. An annotation alone, whether inline in C++ or #[inline] in Rust, does not reliably produce inlining, and Java manages without one. What produces inlining is visibility: the compiler or JIT needs to see the callee’s body at the call site. In C++, that means either defining the function in a header or using LTO. In Rust, it means #[inline] for cross-crate calls or LTO. In Java, it happens automatically because the JIT operates after class loading.

For genuinely hot inner loops where inlining is load-bearing, the correct C++ tool is __attribute__((always_inline)). Pair it with verification: Compiler Explorer makes it straightforward to confirm that no call instruction remains where you expect the callee to have been inlined. For anything more complex than a micro-benchmark, PGO is worth the workflow cost. The compiler’s heuristics are good, but they are working with estimates; observed profiles are better data.
