Daniel Lemire’s analysis of function call cost is a clean argument for inlining: a call/return pair costs 4 to 10 cycles on modern x86-64, and in a tight loop that cost accumulates. Inlining eliminates it and, more importantly, gives the compiler full visibility into the loop body so it can auto-vectorize, fold constants, and eliminate dead branches. The argument is correct. It is also incomplete.
The side of the story that gets less attention is what inlining costs on the fetch side of the pipeline. Every call site you inline duplicates the function’s machine code at that location. Duplicate enough code across enough hot call sites and you start losing instruction cache lines. When the L1 instruction cache cannot hold your hot path, the processor fetches from L2 on every miss, paying 10 to 15 cycles each time, and more if the line has been evicted to L3. That is the same order of magnitude as the call overhead you eliminated by inlining in the first place.
The L1 Instruction Cache Is Smaller Than You Think
The L1 instruction cache on most x86-64 processors holds 32 kilobytes. A cache line is 64 bytes. That gives you 512 usable lines for your hot code. On Intel Skylake and later, the instruction fetch unit can issue up to 16 bytes per cycle from L1, giving it an effective bandwidth ceiling of about 4 to 6 instructions per cycle. Step outside L1 and the pipeline stalls waiting for fills from L2, which has higher capacity but roughly 10 to 15 cycles of latency. Step out to L3 and you are looking at 30 to 50 cycles.
A small utility function body might be 20 to 40 bytes of machine code. Inline it at 50 call sites and you have added 1,000 to 2,000 bytes of code to the text segment that the instruction fetch unit must track. If those call sites are distributed across a large hot function, you have increased the working set that must fit in L1i. Whether that matters depends on what else is competing for those 512 lines.
The practical effect is visible with perf stat. The relevant counters are L1-icache-load-misses and the Frontend_Bound metric from Intel’s Top-Down Microarchitecture Analysis (available as perf stat --topdown on recent kernels and CPUs). If the frontend is stalling and instruction cache misses are elevated, you have a code size problem, not a call overhead problem. Chasing more inlining at that point will make it worse.
perf stat -e L1-icache-load-misses,L1-dcache-load-misses,cycles,instructions ./your_binary
How Compiler Inlining Heuristics Fail Here
GCC’s inlining threshold is controlled by -finline-limit, which defaults to 600 pseudo-instructions. Clang measures differently: its inliner works from a cost budget over LLVM IR, roughly 225 at -O2 and 275 at -O3. Both thresholds exist to prevent unbounded code size growth, but they evaluate each function in isolation, not in terms of how many times it will be inlined or how those copies will interact in the L1i.
A function with a 50-instruction body passes GCC’s threshold comfortably. If it appears in a hot loop that calls it from 30 distinct sites, inlining it produces 1,500 instructions of code spread across those sites. The inliner has no model for this multiplication effect, because it does not track how many distinct call sites a given function has relative to cache capacity. The heuristic optimizes locally and creates a global problem.
You can observe the inliner’s decisions with Clang’s reporting flags:
clang++ -O2 -Rpass=inline -Rpass-missed=inline -c hot.cpp 2>&1 | head -40
GCC provides -fopt-info-inline for similar output. Neither report tells you about cache pressure; that requires measuring the binary.
Marking Cold Paths Explicitly
The practical response is to help the compiler distinguish hot code from cold code. The [[gnu::cold]] attribute, available in GCC and Clang, marks a function as unlikely to be called. The compiler places cold functions in a separate .text.cold section, far from hot code in the address space, so they do not compete for L1i lines with the code that runs on every iteration.
[[gnu::noinline, gnu::cold]]
void handle_parse_error(const char* msg, int line) {
    // error handling, logging, etc.
    log_error(msg, line);
    throw ParseException(msg, line);
}
Pairing [[gnu::cold]] with [[gnu::noinline]] keeps the function body in the cold section and prevents inlining from pulling it back into the hot path. Clang also respects __attribute__((cold)) and uses profile data to identify cold functions automatically when PGO is active.
For functions at the other extreme, where inlining is genuinely beneficial and the call site is isolated, __attribute__((always_inline)) on GCC and Clang, or __forceinline on MSVC, tells the compiler to inline regardless of its size estimate. This is appropriate for small arithmetic helpers called inside a tight numerical loop, where the body is 5 to 10 instructions and the site multiplier is one.
Profile-Guided Optimization as the Real Answer
Manual annotation scales poorly. A large codebase has thousands of functions and you cannot audit every one. Profile-guided optimization solves the problem systematically by giving the compiler actual frequency data instead of static estimates.
The workflow is straightforward:
# Instrument phase
clang++ -O2 -fprofile-instr-generate -o mybin_instrumented main.cpp
# Collection phase -- run with representative workload
./mybin_instrumented --workload representative.dat
llvm-profdata merge -output=merged.profdata default.profraw
# Optimization phase
clang++ -O2 -fprofile-instr-use=merged.profdata -o mybin_optimized main.cpp
With profile data, Clang elevates the inlining budget for hot call sites and depresses it for cold ones. A function called once during initialization and once per parsed element will be treated differently at each site. The compiler also moves cold code to .text.cold sections automatically and reorders basic blocks so hot paths fall through without branch instructions.
The gains from combining PGO with ThinLTO are measurable. Chromium’s build infrastructure uses both, and their documented performance numbers show roughly 10 to 15 percent improvement over plain -O2, with most of the gain coming from cross-translation-unit inlining decisions informed by profile data. Firefox uses a similar approach. The Linux kernel has been more conservative: Clang PGO support for the kernel has been proposed and carried in downstream trees, but it is not a standard part of the build.
What Simdjson Teaches About Manual Control
Not every project has a PGO workflow. The simdjson library takes a different approach: it annotates nearly every function in the hot parse path with a macro that expands to __attribute__((always_inline)) inline:
#define really_inline __attribute__((always_inline)) inline
This keeps the entire parse loop as a single inlined body visible to the compiler, enabling auto-vectorization throughout. Simdjson achieves 2.5 to 3.5 GB/s throughput on common JSON workloads against 0.5 GB/s for conventional parsers. The inlining is the prerequisite that makes vectorization possible, and the library is designed so the inlined body remains compact enough that code size is not the binding constraint.
The key observation is that simdjson’s authors made an explicit architectural choice: keep the hot path functions small enough that aggressive inlining does not cause bloat. A 15-instruction helper inlined at 8 parse-loop call sites adds 120 instructions to the hot path. That is workable. A 100-instruction function inlined at 30 sites adds 3,000 instructions and starts competing seriously with L1i capacity.
The Measurement Imperative
Lemire’s original analysis ends with a measurement argument: function call cost is real but you should profile before concluding it matters. The same principle applies here. Inlining is not uniformly beneficial. The right question is not “should I inline this function” but rather “what is the bottleneck in this loop, and does inlining help or hurt it.”
A loop that is frontend-bound with high L1i miss rates needs smaller hot code, not more of it. A loop that is backend-bound on execution unit throughput may benefit from inlining that exposes more work to the vectorizer. A loop that profiles as compute-bound with low miss rates is not worth touching at all.
The instinct to inline everything in a hot path is usually right for short arithmetic functions. It becomes wrong for larger functions with high call-site multipliers, for functions containing code that runs rarely, and for any situation where the entire working set of a hot loop no longer fits in 32 kilobytes. Reading the profiler output instead of following the instinct is the difference between a real optimization and a slower binary with fewer function calls.