Daniel Lemire’s recent post on function call overhead uses the simplest possible example: calling add(x, y) inside a loop versus writing x + y directly. The point is sound. Direct calls in C++ cost 3-10 cycles, and in a tight loop that overhead compounds. What I find more interesting is treating this as a starting point rather than a destination, because the mechanics behind the analysis are not universal. Every language runtime has its own relationship with the function call abstraction, and the solutions they reach for reveal a lot about their design priorities.
C and C++: Ahead-of-Time Visibility
In C and C++, the compiler decides at compile time whether to inline a given call site. GCC uses a cost model measured in “insns” (a normalized instruction count) against a threshold that varies by context: around 400 for functions explicitly marked inline, around 30 for automatic inlining candidates. Clang/LLVM uses a default IR-instruction threshold of 225, adjustable with -mllvm -inline-threshold=N. Both compilers boost these thresholds for call sites with constant arguments, single callers, and hot paths identified by Profile-Guided Optimization.
The real constraint is translation unit visibility. A function defined in math.cpp cannot be inlined into a call in process.cpp unless the callee’s definition is visible at the call site. This is why the standard library puts algorithm bodies in headers, and why projects that care about performance either use header-only components or enable Link-Time Optimization.
// Defined in a header: compiler can inline at every call site
inline float scale(float x) { return x * 2.0f + 1.0f; }
// Defined in math.cpp: compiler sees only a declaration elsewhere
float scale(float x); // opaque; call not inlinable without LTO
With GCC’s -flto or Clang’s -flto=thin, the linker defers optimization to a stage where all translation units are visible together. ThinLTO parallelizes this by doing most analysis per-module and performing a thin cross-module pass at link time; full LTO is more thorough but makes the link step significantly slower, since whole-program optimization serializes there. Either option gives the optimizer the visibility it needs to inline across boundaries that separate compilation enforces.
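A sketch of that boundary and the flags involved — the second file is shown as comments so the snippet stays one translation unit, and the flag spellings assume Clang:

```cpp
// --- math.cpp ---------------------------------------------------
// Definition lives here; other translation units normally see only
// a declaration, so the call below is opaque without LTO.
float scale(float x) { return x * 2.0f + 1.0f; }

// --- main.cpp ---------------------------------------------------
// float scale(float x);              // declaration from a header
// int main() { return (int)scale(2.0f); }
//
// Without LTO, the call to scale is a real call:
//   clang++ -O2 -c math.cpp main.cpp && clang++ math.o main.o
// With ThinLTO, the flag must appear at compile AND link time,
// and the call can then be inlined during the link:
//   clang++ -O2 -flto=thin -c math.cpp main.cpp
//   clang++ -flto=thin math.o main.o
```

The objects produced under -flto=thin are LLVM bitcode rather than machine code, which is why the linker (or its LTO plugin) must also be told about the flag.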
You can observe inline decisions directly. GCC emits them with -fopt-info-inline; Clang uses -Rpass=inline -Rpass-missed=inline. Both print per-call-site reasons:
note: inlined 'scale' into 'apply_array' (size: 4 insns, threshold: 400)
note: not inlining 'compute': too many insns (421 > 400)
Rust: The Same Backend, With Aliasing Proofs
Rust uses LLVM as its backend, so the inlining machinery is nearly identical to Clang. The #[inline], #[inline(always)], and #[inline(never)] attributes map directly to the same LLVM inline hints that Clang generates from __attribute__((always_inline)) and friends.
The meaningful difference is in what LLVM knows about the code’s memory semantics. Rust’s borrow checker guarantees that mutable references are exclusive: if a &mut T exists, no other reference to the same memory can exist simultaneously. LLVM encodes this as noalias on function parameters, which is the same annotation C programmers apply manually with restrict.
// The &mut [f32] argument gets LLVM's noalias attribute automatically.
// The loop vectorizer can exploit this without needing to inline.
fn scale(arr: &mut [f32]) {
    for v in arr.iter_mut() {
        *v = *v * 2.0 + 1.0;
    }
}
In C, the equivalent requires float* __restrict__ arr as a manual programmer promise. Without it, the compiler must assume that a pointer argument might alias another, which can block vectorization even when the loop body is fully visible. Rust provides the aliasing proof structurally, which means the optimizer sometimes succeeds where a C translation of the same logic would need an explicit hint.
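The manual C/C++ promise looks like this — __restrict__ is the GCC/Clang spelling (ISO C uses restrict), and the two-pointer form is where aliasing actually bites:

```cpp
#include <cstddef>

// __restrict__ promises that 'dst' and 'src' never overlap, so the
// vectorizer can load and store in wide chunks without emitting
// runtime overlap checks. Breaking the promise is undefined behavior.
void scale_into(float* __restrict__ dst,
                const float* __restrict__ src, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * 2.0f + 1.0f;
    }
}
```

Delete the annotations and the compiler must assume a store through dst might change a later src[i], which is exactly the pessimism Rust's &mut exclusivity rules out structurally.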
Cross-crate inlining in Rust requires either #[inline] on the function or LTO. Without one of these, a function in a separate crate is opaque at the call site, and you lose the same optimization opportunities as with separate C++ translation units. Library crates that care about performance typically annotate hot functions with #[inline] precisely for this reason.
Go: Calling Conventions and a Conservative Inliner
Go has a more interesting history. Before Go 1.17 (released August 2021), Go used a pure stack-based calling convention: all arguments and return values were passed on the stack. This was simple and made goroutine stack scanning easy, but it meant every function call required stack writes for arguments and stack reads to receive them, even when the hardware’s general-purpose registers were idle.
Go 1.17 introduced register-based calling. The first nine integer arguments now pass in registers (AX, BX, CX, DI, SI, R8, R9, R10, R11); the first fifteen floating-point arguments pass in X0 through X14. The change brought Go’s convention much closer in spirit to the register-passing System V AMD64 ABI used by C/C++ and Rust on Linux and macOS, though Go’s register assignments differ. The Go team measured roughly a 5% average performance improvement across standard library benchmarks, with call-heavy code gaining more.
Go’s inliner is more conservative than GCC or Clang. It computes an inlining cost for each function and declines to inline anything that exceeds 80 cost units. You can inspect these decisions at build time:
$ go build -gcflags="-m" ./...
./main.go:5:6: can inline add
./main.go:9:13: inlining call to add
./main.go:14:6: cannot inline compute: function too complex: cost 83 exceeds budget 80
There is no equivalent to __attribute__((always_inline)) in Go. The only annotation available is //go:noinline, which suppresses inlining. If you want a function inlined, the only approach is to simplify it until it fits within the budget. This is a deliberate design choice: the Go team trades optimization control for predictable compilation behavior and toolchain simplicity.
V8 and JavaScript: Speculative Everything
JavaScript’s V8 engine inverts the entire model. Rather than deciding at compile time based on static size estimates, V8’s TurboFan JIT observes call sites at runtime and makes speculative inline decisions based on what actually happens.
V8 starts by interpreting JavaScript through Ignition, collecting type feedback at each call site. A call site that consistently calls the same function is monomorphic. One that sees a handful of different targets (or argument types) is polymorphic, and past a small limit it is marked megamorphic. TurboFan inlines monomorphic call sites speculatively: it emits a type guard, then the inlined function body, with a deoptimization branch if the guard fails.
function add(x, y) { return x + y; }
// After observing many calls with integers, TurboFan inlines
// this and emits an integer fast path plus a type check guard.
let result = 0;
for (let i = 0; i < 1e7; i++) {
    result += add(i, 1);
}
This lets V8 inline through dynamic dispatch that static compilers cannot touch. A polymorphic method call in JavaScript can be devirtualized and inlined if runtime observation shows it’s always the same target. The cost is the deoptimization path: when the observed assumption breaks, V8 falls back to interpreted mode and eventually recompiles with updated type information. Hot paths that repeatedly deoptimize can lose significant performance.
The JIT approach also means that function call overhead in JavaScript is profile-dependent. The same code can run at near-native speed after warmup if it stays monomorphic, or at interpreter speed if call-site polymorphism prevents TurboFan from optimizing it.
The Common Thread
Every one of these runtimes needs visibility into what a function does in order to optimize the call away. The specific mechanism varies: ahead-of-time static analysis in C/C++ and Rust, a conservative budget-based inliner in Go, runtime speculation in V8. The aliasing information that Rust provides structurally, C requires as a manual annotation, Go doesn’t need in the same way due to its type system, and JavaScript’s JIT infers from observation.
Lemire’s add/add3 example is explicitly about direct calls in C++. But the underlying insight generalizes: the optimizer’s need to see through the call boundary is constant, and the mechanisms different runtimes use to provide or deny that visibility explain a great deal about their respective performance characteristics. A Rust programmer reaching for #[inline], a Go programmer simplifying a function to fit within 80 cost units, and a JavaScript programmer avoiding megamorphic call sites are all solving the same fundamental problem with different tools.
Where the runtimes diverge is in what happens when visibility is unavailable. C/C++ and Rust default to pessimistic assumptions and leave the programmer with explicit overrides. Go defaults to conservative inlining with no override path. V8 defaults to optimistic speculation with a deoptimization safety net. Each default reflects a judgment about what matters more: predictability, control, or throughput at the cost of occasional cliffs.