Daniel Lemire’s analysis of function call overhead focuses on inlining as the primary mitigation: pull the callee’s body into the caller and the overhead disappears along with the call boundary. That advice is correct and important. But there is another optimization that eliminates call overhead for a specific structural pattern without requiring the callee’s body to be visible at all. Tail call optimization replaces the call instruction with a jump, and under the right conditions, the compiler applies it automatically.
What a Tail Call Is
A tail call is a function call in tail position: the calling function does nothing with the result except return it. The textbook example is a tail-recursive accumulator:
int sum(int* arr, int n, int acc) {
    if (n == 0) return acc;
    return sum(arr + 1, n - 1, acc + arr[0]); // tail call
}
The recursive call to sum is in tail position. After it returns, the current invocation does nothing. That means the current stack frame serves no further purpose once the recursive call begins. The compiler can pop the frame before jumping to the callee, turning the recursion into a loop that uses one stack frame for the entire computation. At assembly level, the transformation replaces:
; Without TCO
call sum
ret
; With TCO
jmp sum
The frame pointer adjustment and stack cleanup happen before the jump. Regardless of how large n is, this version does not overflow the stack.
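At the source level, the transformation amounts to a loop. A sketch of what the compiler effectively produces (the name sum_loop is illustrative, not anything a compiler emits):

```cpp
// Hand-written equivalent of the TCO'd tail-recursive sum: each
// recursive step becomes one trip around the loop, reusing the same
// stack frame throughout.
int sum_loop(int* arr, int n, int acc) {
    while (n != 0) {
        acc += arr[0]; // same work as one recursive invocation
        ++arr;
        --n;
    }
    return acc;
}
```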
GCC and Clang both apply this optimization at -O2 for the version above. You can verify it by compiling with -O2 -S and checking whether the assembly shows a backward branch or jmp rather than call; ret at the recursive site.
Why C++ Makes This Harder Than Most Languages
Functional languages with pervasive recursion guarantee or depend on tail call elimination: Scheme’s specification requires proper tail calls, and OCaml and Haskell implementations rely on them in practice. C++ does not, and the reason is structural: destructors.
Consider a version of the same function with a vector:
int sum(std::vector<int> arr, int acc) {
    if (arr.empty()) return acc;
    int front = arr[0];
    std::vector<int> rest(arr.begin() + 1, arr.end());
    return sum(rest, acc + front); // looks like a tail call
}
This is not a tail call in C++. arr and rest are objects with destructors. After sum(rest, acc + front) returns, the destructors for arr and rest must run. The current stack frame cannot be discarded before the recursive call because those destructors are bound to it. The abstract machine requires them to execute in the correct order before the function returns.
This is why C++ cannot standardize guaranteed tail call optimization. The language’s lifetime model ties destruction to scope exit, which ties scope exit to the stack frame, which prevents reusing that frame before the scope exits. Any local variable with a nontrivial destructor in scope at the call site blocks TCO, even if the variable is otherwise unused after that point.
The fix is to use plain data at the call site:
int sum(int* arr, int n, int acc) {
    if (n == 0) return acc;
    return sum(arr + 1, n - 1, acc + arr[0]);
}
With no local objects in scope and no destructor obligations, both GCC and Clang optimize this at -O2. The assembly emits a loop rather than recursive calls. But the compiler makes this decision silently, and it disappears silently too the moment any local object with a destructor enters scope.
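One way to keep a destructor-bearing local without losing tail position is to confine it to an inner scope, so its destructor runs before the call rather than after it. A hedged sketch (countdown is an illustrative name, not from the examples above):

```cpp
#include <string>

// The std::string lives in an inner scope, so its destructor has
// already run by the time the recursive call executes. With no
// destructors pending at the return, the call is back in tail
// position and GCC and Clang can apply TCO at -O2.
int countdown(int n) {
    if (n == 0) return 0;
    {
        std::string label = "step " + std::to_string(n);
        (void)label; // destroyed when this scope closes
    }
    return countdown(n - 1);
}
```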
[[clang::musttail]]
Clang 13 introduced the [[clang::musttail]] attribute to make tail call optimization explicit and verifiable:
int sum(int* arr, int n, int acc) {
    if (n == 0) return acc;
    [[clang::musttail]] return sum(arr + 1, n - 1, acc + arr[0]);
}
With [[clang::musttail]], Clang either emits a tail call or fails to compile with a diagnostic explaining why the optimization is structurally impossible. This converts a silent optimization decision into an explicit, verifiable guarantee. The attribute must appear on a return statement whose operand is the call itself, with no destructor obligations pending at the call site.
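A quick check of the guarantee, as a sketch (sum_tail and demo are illustrative names; the depth is kept modest so the code also runs on compilers that ignore the attribute with a warning):

```cpp
#include <vector>

int sum_tail(int* arr, int n, int acc) {
    if (n == 0) return acc;
    [[clang::musttail]] return sum_tail(arr + 1, n - 1, acc + arr[0]);
}

int demo() {
    std::vector<int> data(10000, 1); // 10,000 recursive steps
    // Under musttail this runs in one reused frame regardless of size.
    return sum_tail(data.data(), static_cast<int>(data.size()), 0);
}
```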
The primary motivation for the attribute, described in the Clang attribute documentation, was dispatch loops and jump tables in interpreters and protocol parsers, where a central dispatch function calls handlers that themselves need to call back into the dispatcher:
using Handler = int(*)(State* s);

int dispatch(State* s) {
    Handler next = s->table[s->current_state];
    [[clang::musttail]] return next(s);
}
Without guaranteed tail calls, each layer of dispatch adds a stack frame. A state machine with 10,000 transitions would consume 10,000 frames. With musttail, the dispatch loop runs in constant stack space regardless of transition depth. The attribute also applies to indirect calls through function pointers, as shown above, not just direct recursive calls.
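A runnable sketch of the pattern, reduced to a two-handler counter machine (State, step, halt, and run are all illustrative names; on compilers without musttail the attribute is ignored with a warning, so the step count is kept modest):

```cpp
struct State;
using Handler = int (*)(State*);

struct State {
    int pc;          // index into the handler table
    int counter;     // work accumulated so far
    Handler table[2];
};

// Central dispatcher: tail-calls whichever handler pc selects.
int dispatch(State* s) {
    [[clang::musttail]] return s->table[s->pc](s);
}

int halt(State* s) { return s->counter; }

int step(State* s) {
    ++s->counter;
    s->pc = (s->counter < 5000) ? 0 : 1; // keep stepping or halt
    [[clang::musttail]] return dispatch(s);
}

int run() {
    State s{0, 0, {step, halt}};
    return dispatch(&s); // constant stack under guaranteed tail calls
}
```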
GCC as of version 14 has no equivalent attribute: it applies TCO when it can prove the optimization is safe, but provides no way to express a mandatory requirement (a musttail statement attribute landed in GCC 15). MSVC has no tail call attribute either.
Mutual Recursion and Deep Call Chains
TCO matters most when call chains are deep. A default stack on Linux is 8MB; on Windows it is 1MB. At roughly 64 to 128 bytes per frame, a 1MB stack handles 8,000 to 16,000 recursive calls before overflowing. For algorithms that recurse proportionally to input size, stack overflow is a real failure mode rather than a theoretical concern.
With TCO, mutually recursive functions can call each other in tail position without accumulating frames:
int even(int n);
int odd(int n);

int even(int n) {
    if (n == 0) return 1;
    [[clang::musttail]] return odd(n - 1);
}

int odd(int n) {
    if (n == 0) return 0;
    [[clang::musttail]] return even(n - 1);
}
Without TCO, even(1000000) overflows the stack. With musttail, Clang emits a jump at each call site and the computation runs in constant stack space: each jump replaces the current frame rather than stacking a new one. The assembly for even ends with a conditional return for the base case and a jmp odd for the recursive case; odd mirrors it.
The Interaction with Inlining
Tail call optimization and inlining attack the same problem from different directions. Inlining substitutes the callee’s body at the call site, eliminating both the call overhead and the optimization barrier. TCO eliminates the stack frame growth and the explicit call instruction, but the callee still executes as a separate function. The compiler does not need to see the callee’s definition to apply TCO at the call site.
This distinction matters across translation unit boundaries. Without link-time optimization, a call to a function defined in another .cpp file cannot be inlined at compile time. But TCO can still apply to that cross-TU call in tail position: the compiler pops the current frame and jumps to the external function’s address without needing to see its body. Cross-TU recursive patterns that cannot benefit from inlining can still benefit from TCO.
When both optimizations apply, they compose: a tail-called function that is also small enough to inline will have its body substituted and then folded into the loop, collapsing the entire recursive chain into a single loop with no calls.
Practical Considerations
Agner Fog’s processor optimization manuals document the concrete cost of function calls on modern x86-64: a predicted direct call and return costs 4 to 8 cycles, plus register saves and stack frame setup. TCO eliminates that cost for tail calls, reducing each iteration to the cost of a taken branch, which is roughly 1 cycle on a correctly predicted path.
For algorithms expressed naturally as tail recursion, the performance difference compounds across iterations. A tail-recursive function called 10 million times in a loop saves 40 to 80 million cycles by avoiding the call mechanics, which at 3 GHz is 13 to 27 milliseconds. For functions doing trivial work per iteration, that is the dominant runtime cost.
The diagnostic value of [[clang::musttail]] is worth noting separately from the optimization value. Applying it to a call that cannot be TCO-optimized due to a destructor in scope produces an error at compile time, surfacing a structural problem that would otherwise appear as a stack overflow at runtime under large inputs. Using it as documentation of intent, even when the optimization would apply anyway, makes the requirement explicit in the source code for future readers.
Lemire’s original piece treats inlining as the answer to function call overhead. For loops with small helper functions, it is. For recursive algorithms and dispatch patterns where stack depth is proportional to input, TCO is the more precise tool. The two are not alternatives; they cover different structural cases, and knowing both gives you accurate options when profiling surfaces call overhead in non-obvious places.