Daniel Lemire’s article on function call cost makes the case that function call overhead in tight loops is real, and that compilers eliminate it through inlining. That mechanism is worth understanding. What the article does not address is a persistent misconception sitting upstream of it: writing the inline keyword in front of a function definition does not make the compiler inline it.
The inline keyword in C++ carries two distinct jobs that have almost nothing to do with each other. The first is historical: a hint to the compiler that inlining would be beneficial. Compilers have been ignoring that hint since roughly the late 1990s. The second job is active and consequential: it tells the linker that this function’s definition may appear in multiple translation units without violating the One Definition Rule. The performance hint is dead. The linkage exemption is not.
This matters every time a developer annotates a function with inline expecting it to run faster.
What inline Was Originally For
The keyword predates standardization, and C++98 codified it as a hint. The wording is carefully noncommittal: the inline specifier indicates that inline substitution of the function body at the point of call “is to be preferred to the usual function call mechanism,” while an implementation “is not required to perform this inline substitution.” Compilers were under no obligation to honor the suggestion, and in the early 1990s they largely tried to anyway. Optimizers were limited enough that a user hint was genuinely useful guidance.
By the time C++11 codified modern practices, the relationship had inverted. GCC’s optimizer had developed cost models sophisticated enough that inline was treated as a no-op for optimization decisions. The GCC 4.x series documentation states plainly that the keyword “does not affect whether a function is considered for inlining; the decision depends on the optimization level and function size.” LLVM/Clang, designed from scratch around an internal cost model, reached the same state even earlier in its development.
MSVC documents the same reality: “The inline and __inline specifiers instruct the compiler to insert a copy of the function body. However, the compiler may not inline the function in all cases.”
What inline Actually Does Today
The inline specifier does one thing reliably in GCC, Clang, and MSVC: it exempts the marked function from the One Definition Rule. Without it, defining a function in a header and including that header from two different .cpp files produces a linker error:
error: multiple definition of 'int add(int, int)'
The linker sees two definitions of the same symbol and refuses. With inline, the linker is permitted to fold all copies into one. The standard still requires every definition to be identical; it is only the duplication that becomes acceptable:
// header.h
inline int add(int a, int b) {
    return a + b;
}
This compiles and links correctly regardless of how many translation units include the header. Template function definitions have an equivalent ODR exemption built into the language rules, which is why template implementations can live in headers without inline. Non-template, non-inline functions defined in headers will produce ODR violations the moment two .cpp files include them.
The ODR exemption is the reason inline still appears in modern codebases constantly. It is functional and necessary for header-only library design. It just has nothing to do with optimization.
Who Actually Decides Whether to Inline
The compiler’s optimizer decides whether to inline a given call site based on a cost model weighing expected speedup against code size increase. GCC’s threshold is governed by -finline-limit (default: 600 pseudoinstructions). Clang uses an LLVM IR cost threshold of roughly 225 units at -O2 and 275 at -O3. Neither threshold is influenced by the presence or absence of inline on the callee.
The relevant diagnostic flags confirm this directly:
# Clang: see what was inlined and what was not (and why)
clang++ -O2 -Rpass=inline -Rpass-missed=inline foo.cpp
# GCC equivalent
g++ -O2 -fopt-info-inline-missed foo.cpp
These produce per-call-site remarks:
foo.cpp:12:10: remark: add inlined into main [-Rpass=inline]
foo.cpp:19:10: remark: heavy_compute not inlined: cost=312 > threshold=225 [-Rpass-missed=inline]
Running these on a hot path before reaching for any annotation usually reveals that small helpers were already inlined. The calls the optimizer declined are typically the ones it was right to decline: large enough that copying the body at every call site would hurt instruction cache behavior more than it helps.
Forcing Inlining When It Actually Matters
When profiling confirms that the optimizer’s decision is wrong and a specific function needs to be inlined, the correct tools are __attribute__((always_inline)) on GCC and Clang, and __forceinline on MSVC:
// GCC/Clang: bypass the cost model entirely
__attribute__((always_inline))
inline float scale(float x, float factor) { return x * factor; }
// MSVC
__forceinline float scale(float x, float factor) { return x * factor; }
// Cross-platform pattern common in performance-sensitive libraries
#if defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE inline  // fallback: ODR exemption only
#endif
Note that inline still appears alongside always_inline on GCC and Clang. It serves the ODR purpose, allowing the definition in a header. The always_inline attribute handles the optimization directive. The two concepts require separate mechanisms because they are genuinely separate concerns.
simdjson, the SIMD-accelerated JSON parser from Lemire and colleagues, takes this to its logical conclusion. The library defines:
#define really_inline __attribute__((always_inline)) inline
And applies it to essentially every function on the hot parse path. The goal is to present the compiler with a single inlined body covering the entire parsing loop, enabling AVX2 vectorization throughout. The result is JSON parsing throughput between 2.5 and 3.5 GB/s, compared to roughly 0.5 GB/s for parsers built with conventional call decomposition and the same underlying algorithm. That difference is not algorithmic. It is optimizer visibility, and really_inline is the mechanism that creates it.
Rust Gets the Semantics Right
Rust’s #[inline] attribute avoids the C++ confusion by being honest about what it does. In Rust, functions do not cross crate boundaries for inlining purposes by default. A non-generic function in one crate called from another is opaque to the LLVM backend of the calling crate unless its IR is available there: either because link-time optimization is enabled, or because the function definition’s IR is embedded in the crate’s .rlib metadata. #[inline] causes that embedding:
// Without #[inline]: callers in other crates cannot inline this (absent LTO)
pub fn scale(x: f32, factor: f32) -> f32 { x * factor }
// With #[inline]: LLVM IR embedded in .rlib; cross-crate inlining works
#[inline]
pub fn scale(x: f32, factor: f32) -> f32 { x * factor }
This is a real, observable behavioral difference. A hot utility function in a library crate without #[inline] silently prevents the LLVM vectorizer from seeing inside it in every downstream crate. The Rust standard library annotates its hot-path functions accordingly. #[inline(always)] and #[inline(never)] map directly to LLVM’s alwaysinline and noinline attributes, with the same semantics as the C++ equivalents.
In C++, the inline keyword has no effect on cross-translation-unit inlining because each translation unit compiles independently. The mechanism for cross-TU inlining in C++ is LTO (-flto or -flto=thin), which operates at the IR level after all units are compiled, independent of any source-level annotations.
A Practical Map
| Goal | C++ | Rust |
|---|---|---|
| Allow definition in headers (ODR exemption) | inline | Not needed |
| Force inlining at a call site | __attribute__((always_inline)) | #[inline(always)] |
| Prevent inlining | __attribute__((noinline)) | #[inline(never)] |
| Enable cross-TU inlining | LTO (-flto=thin) | #[inline] on the callee |
| Diagnose inlining decisions | -Rpass-missed=inline (Clang) | RUSTFLAGS="-Cremark=all" |
The confusion around inline in C++ persists because the keyword does work, just not the work people expect. It compiles without error, it links correctly, and the annotated function probably gets inlined anyway because modern compilers are aggressive about small callees. The illusion holds until someone instruments a hot path, finds the call instruction still present in the output, and wonders why annotating it with inline changed nothing.
The compiler was ignoring the hint and making its own decision, which may or may not have been the right one. If it was wrong, the correct fix is __attribute__((always_inline)) backed by profiler evidence. The inline keyword, as a performance tool, stopped being relevant before many current C++ developers wrote their first line of code.