Every C programmer learns early that function calls are cheap. The x86-64 calling convention passes the first six integer arguments in registers, the call instruction itself costs one cycle, and modern CPUs execute the whole sequence with impressive efficiency. So it is easy to conclude that splitting code into small functions costs almost nothing.
It costs more than that. Not always, and not in ways that are obvious from reading the source. Daniel Lemire’s recent writeup on the cost of a function call uses a deliberately minimal example to show where the expense actually lives, and the lesson is less about cycles-per-call and more about what the compiler can no longer do once a call boundary exists.
What the Call Actually Costs
At the machine level, a function call involves pushing a return address, jumping to the callee, executing the function’s prologue to set up a stack frame, executing the function, tearing down the frame, and returning. On a modern superscalar CPU running at 3-4 GHz, the mechanical overhead for a simple function is somewhere between 1 and 5 nanoseconds under favorable conditions. That is not free, but for functions doing any real work, it is usually not the bottleneck either.
The problem shows up in tight loops. Consider the canonical example:
    int add(int x, int y) {
        return x + y;
    }

    int add3(int x, int y, int z) {
        return add(add(x, y), z);
    }
A compiler without inlining sees two function calls inside add3. Each call crosses a boundary the optimizer cannot look through. With inlining, add3 becomes:
    int add3(int x, int y, int z) {
        return x + y + z;
    }
The mechanical call overhead is gone. But that is the least interesting part of what changed.
The Optimizer’s Information Horizon
Compilers work by analyzing regions of code and applying transformations when they can prove those transformations are safe. A function call is, by default, an opaque wall. The optimizer does not know what add does unless it can see the definition. It does not know whether add reads or modifies global state, whether it can throw, whether it has side effects that constrain reordering. Without that knowledge, the optimizer must be conservative.
Constant propagation is a clear example. If add3 is called with compile-time-known values like add3(1, 2, 3), a compiler with full visibility can fold this to the constant 6 and emit no code at all. With an opaque call boundary around add, that fold stops at the first call. The optimizer knows add(1, 2) produces some integer, but not that the integer is 3.
The same logic applies to dead code elimination, loop-invariant code motion, and alias analysis. Every optimization that reasons about what values can be where depends on being able to trace the code. Function boundaries interrupt that trace.
Where the Real Cost Shows Up: Vectorization
The most dramatic example of this is SIMD vectorization. Modern CPUs ship with vector units that can process 4, 8, or 16 integers in a single instruction using SSE, AVX, or AVX-512. Compilers will automatically generate these instructions when they can prove it is safe to do so, turning a scalar loop into one that processes multiple elements per cycle.
Consider a simple summation loop:
    long sum(const int* data, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; ++i)
            total += data[i];
        return total;
    }
A modern compiler at -O3 (and, in recent GCC and Clang releases, at -O2) will vectorize this into something that accumulates four or eight elements per iteration using vpaddd or equivalent instructions. The resulting code is several times faster than the scalar version on any array of meaningful size.
Now move the accumulation behind a function call:
    long accumulate(long total, int x) {
        return total + x;
    }

    long sum(const int* data, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; ++i)
            total = accumulate(total, data[i]);
        return total;
    }
If accumulate does not get inlined, the loop cannot be vectorized. The compiler cannot legally reorder the calls, cannot accumulate multiple elements in parallel, and cannot use vector registers because the function’s calling convention uses scalar registers. The loop remains fully scalar. On a dataset that fits in L1 cache, this can easily mean a 4-8x performance difference, not from call overhead but from the loss of vectorization entirely.
Agner Fog’s optimization manuals document in detail how the calling convention constrains register use. Callee-saved registers must be preserved, the stack must remain aligned, and the optimizer must assume that called functions may use the floating-point and vector state in ways that interfere with its own register allocation. These constraints are not about the cost of the call instruction. They are about what the compiler can build around the call.
The Complicated History of inline
C++ has had the inline keyword since the beginning, and its meaning has shifted enough that it is worth being precise about what it does today. Originally, inline was a hint to the compiler that a function was a good candidate for inlining at call sites. Compilers honored this, sometimes aggressively.
Over time, compiler heuristics for inlining improved enough that the hint became largely redundant. GCC and Clang maintain their own cost models based on instruction count, call frequency, code size thresholds, and profile-guided data from PGO builds. The inline keyword still has a weak influence, but declaring a function inline will not force the compiler to inline it, and omitting it will not prevent inlining.
What inline retained is its effect on the one-definition rule. A function marked inline may be defined in multiple translation units without causing a linker error, which is why functions defined inside a class body are implicitly inline and why header-only functions are marked inline explicitly. This is the primary reason the keyword still appears in modern C++ code: not to control inlining, but to make header-only definitions legal.
If you actually need to force inlining, the standard offers no portable directive; compilers provide their own extensions: __attribute__((always_inline)) in GCC and Clang, __forceinline in MSVC. These are stronger directives that override the compiler's cost model, though they come with the risk of generating bloated code if overused.
Link-Time Optimization as a Modern Answer
The fundamental problem with inlining is that it traditionally works within a single translation unit. If add is defined in math.cpp and called from main.cpp, the compiler processing main.cpp has no visibility into add’s definition and cannot inline it.
Link-time optimization (LTO) addresses this by deferring final code generation to the link step. With -flto in GCC or Clang, the compiler emits intermediate representation rather than machine code. The linker then has the full program IR and can perform inlining and optimization across translation unit boundaries.
The practical effect is significant for codebases that are split into many small compilation units. Functions that the compiler would otherwise treat as opaque become transparent, and the optimizer can apply the same transformations it would apply to a single-file program. The trade-off is compilation time: LTO typically doubles or triples link times on large projects, which is why it is often reserved for release builds.
Profile-guided optimization compounds the benefit. With PGO, the compiler knows which call sites are hot and which functions are called frequently, allowing it to prioritize inlining at the places that actually matter for runtime performance.
When Not to Inline
Inlining is not universally beneficial. A function that is called from many sites will be duplicated at each site when inlined, increasing binary size. Large binaries stress the instruction cache; if the inlined code pushes hot paths out of L1i, the result is more cache misses and worse performance than the original function call would have caused.
Compilers track this through their cost models, but the models are imperfect. Functions that are called in only one or two hot loops are usually good candidates. Functions with complex control flow that are called from dozens of sites may be better left out-of-line, even if they are individually small.
The standard advice holds: measure before and after, look at the generated assembly with -S or a tool like Compiler Explorer, and use __attribute__((noinline)) in GCC and Clang (or __declspec(noinline) in MSVC) when you need to deliberately prevent inlining to study the effect.
The Practical Takeaway
Function calls are cheap as long as the compiler can see through them. The call instruction costs a few cycles. The optimization barrier can cost an order of magnitude more, depending on what the optimizer could have done with the combined code.
For tight inner loops, the question is not whether you can afford the call overhead. It is whether the compiler can afford to stop optimizing at that boundary. When vectorization is on the table, the answer is usually no. Keep the critical path visible, let the compiler do its job, and reach for LTO or explicit inlining directives when the critical path crosses translation unit lines.