Daniel Lemire’s analysis of function call overhead focuses on direct function calls: the stack setup, register marshaling, and the call/ret instruction pair. For those, modern compilers have an effective answer. Inlining eliminates the overhead and unlocks deeper optimizations. But there is a category of function call that compilers cannot simply inline away, and that category got significantly more expensive in early 2018.
Virtual function calls, function pointers, and any dispatch through an indirect branch are now subject to mitigation overhead that has nothing to do with the call instruction itself. The cause is Spectre variant 2. Understanding what changed explains a performance dynamic that no amount of inlining advice covers by itself.
Why Indirect Branches Were the Target
Spectre v2 exploits the Branch Target Buffer, the hardware structure modern CPUs use to predict indirect branch destinations. When a CPU executes an indirect call through a register (call rax in Intel syntax), the target is only known at runtime, so the CPU consults the BTB, which keeps a history of targets per branch address. If a particular call site almost always jumps to the same target, the BTB learns that pattern and the CPU begins fetching and executing instructions from the predicted target before the actual value of rax is even computed.
The vulnerability allows code in another process, or in some configurations a sibling hyperthread on the same physical core, to poison the BTB so that it predicts an attacker-chosen address. Speculative execution then fetches and runs instructions from that address before the misprediction is caught, leaking data through timing side channels. The critical constraint is that only indirect branches are exploitable this way: direct calls, where the branch target is encoded in the instruction and fixed at compile time, are not affected.
In C++, virtual function dispatch is exactly the indirect branch pattern. A call through a base class pointer resolves the function address at runtime by loading it from the object’s vtable. Function pointers, std::function, and calls to functions in dynamically linked libraries all follow the same model.
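As a minimal sketch of the forms listed above, the following contrasts three runtime-resolved calls with one direct call. The types and function names (Shape, Triangle, four, and the via_* helpers) are hypothetical and chosen only for illustration. Whether a given compiler actually emits an indirect branch for each depends on optimization level and on how much of the target it can see.

```cpp
#include <cassert>
#include <functional>

// Hypothetical types for illustration only.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    int sides() const override { return 3; }
};

int four() { return 4; }

// Virtual dispatch: the target address is loaded from the vtable at runtime.
int via_virtual(const Shape& s) { return s.sides(); }

// Function pointer: the target is whatever address fn holds at runtime.
int via_pointer(int (*fn)()) {
    assert(fn != nullptr);
    return fn();
}

// std::function: a type-erased callable, typically invoked through a stored pointer.
int via_std_function(const std::function<int()>& f) { return f(); }

// Direct call: the target is fixed at compile time and can be inlined.
int via_direct() { return four(); }
```

Only via_direct has a target the compiler can always see. The other three are candidates for an indirect branch unless the optimizer can prove the target at the call site.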
What Retpoline Does to a Call
The software mitigation developed by Google engineers is called retpoline, short for return trampoline. Rather than executing an indirect branch through the BTB, every indirect branch is converted into a form that causes the CPU to speculate using the Return Stack Buffer instead.
; Original virtual dispatch:
mov rax, [rdi]          ; load vtable pointer from object
mov rax, [rax + 16]     ; load function pointer from vtable
call rax                ; indirect call -- BTB exploitable
; With retpoline:
mov rax, [rdi]
mov rax, [rax + 16]
call retpoline_rax      ; direct call to the retpoline thunk; pushes the real return address
; ... execution resumes here when the callee returns
retpoline_rax:          ; thunk (label name is illustrative)
call capture_spec       ; pushes .spec_trap and records it in the RSB
.spec_trap:
pause
lfence
jmp .spec_trap          ; traps speculative execution here
capture_spec:
mov [rsp], rax          ; overwrite return address with actual target
ret                     ; normal: jumps to actual target via stack
                        ; speculative: RSB says return to .spec_trap
The mechanism depends on how the CPU’s Return Stack Buffer works. After call capture_spec, the RSB records that a ret should return to .spec_trap (the instruction after the call). Inside the thunk, the actual stack return address is overwritten with the real call target. When ret executes, normal execution jumps to the real target because that is what the stack contains. Speculative execution, however, goes to .spec_trap because the RSB points there, where the pause/lfence/jmp loop absorbs all speculative work without allowing access to any useful data.
The performance cost is direct: the CPU cannot make forward progress speculatively after the retpoline ret, because the speculation trap prevents any useful instructions from executing. The CPU must wait for the actual branch target to be resolved before the pipeline can continue. On modern out-of-order cores, this adds roughly 10 to 25 cycles per indirect call, compared to the 1 to 3 cycles a correctly-predicted direct call costs.
GCC enables retpoline via -mindirect-branch=thunk, -mindirect-branch=thunk-inline, or -mindirect-branch=thunk-extern; Clang's equivalent is -mretpoline. The Linux kernel has shipped retpoline since early 2018, and some security-hardened builds apply the same flags to user-space binaries.
The Gap Between Direct and Indirect Call Overhead
Lemire’s examples concern direct calls, where the 3 to 10 nanosecond overhead of an uneliminated call/ret is the subject. That overhead is real and worth reducing. Compared with the few cycles a correctly predicted direct call costs, the retpoline penalty on an indirect call is 3 to 10 times larger, which puts it in a different category entirely.
For a loop iterating over a polymorphic collection and calling a virtual method on each element, each iteration now pays:
- Two memory loads for vtable lookup (same as before Spectre)
- The retpoline stall: roughly 15 to 20 additional cycles waiting for branch target resolution
At 3 GHz, 20 cycles is approximately 7 nanoseconds per iteration. For a loop over one million elements, that is about 7 milliseconds of overhead attributable to the indirect branch alone. If the method body does a few arithmetic operations taking 2 to 4 nanoseconds in total, the retpoline overhead dominates the runtime.
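The arithmetic above can be reproduced with a small sketch. The function names are hypothetical, and the inputs (20 cycles, 3 GHz, one million calls) are the figures used in the text rather than measurements.

```cpp
#include <cassert>

// Converts a cycle count to nanoseconds at a clock frequency given in GHz.
// One cycle at f GHz lasts 1/f nanoseconds.
inline double cycles_to_ns(double cycles, double ghz) {
    assert(ghz > 0.0);
    return cycles / ghz;
}

// Total stall time in milliseconds for `calls` indirect calls,
// each paying `cycles_per_call` cycles of retpoline penalty.
inline double total_overhead_ms(double cycles_per_call, double ghz, double calls) {
    const double ns_per_call = cycles_to_ns(cycles_per_call, ghz);
    return ns_per_call * calls / 1e6;  // 1e6 ns per ms
}
```

With 20 cycles at 3 GHz this gives roughly 6.7 ns per call, and roughly 6.7 ms for one million calls, which the text rounds to 7.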
The figures from Agner Fog’s optimization manuals document direct call latency as around 1 to 4 cycles on modern Intel microarchitectures. The same sources show indirect call latency at 1 to 3 cycles pre-Spectre. Post-retpoline, indirect call latency is not a hardware constant you can look up in a table: it is whatever the pipeline penalty turns out to be for blocking speculative execution, which varies but is consistently in the 10 to 25 cycle range.
Devirtualization as the Mitigation
When the compiler can prove the concrete type at a call site and emit a direct call or inline the method, no indirect branch is generated and retpoline does not apply. Several mechanisms make this possible.
The final specifier tells the compiler that no class will derive from a type, or that no derived class will override a method. At a call site whose static type is a final class, or whose target is a final method, that gives the compiler the static guarantee it needs to devirtualize:
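// Assumed interface of the base class, included so the example compiles.
class AudioEffect {
public:
    virtual ~AudioEffect() = default;
    virtual float process(float x) const = 0;
};
class Gain final : public AudioEffect {
public:
    float process(float x) const override { return x * factor; }
    float factor = 1.0f;
};
void apply(Gain* g, float* buf, int n) {
    for (int i = 0; i < n; i++)
        buf[i] = g->process(buf[i]); // devirtualized: direct call or inlined
}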
Without final, the compiler must treat g->process() as a virtual dispatch because a derived class might override it. With final, the compiler knows Gain::process is the concrete implementation and can inline the body into the loop, at which point the vectorizer also gains visibility and can emit SIMD instructions.
Escape analysis provides devirtualization without annotations. When an object is created locally and its address never escapes to code the compiler cannot analyze, the concrete type is fixed for the function’s scope. The compiler devirtualizes freely in this case without any final annotation.
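A minimal sketch of the local-object case, assuming hypothetical Counter and DoubleStep types. Nothing here is marked final. The point is that the object is constructed in the same function that uses it and its address is never passed to code the compiler cannot see. Whether a particular compiler devirtualizes this depends on its version and optimization level.

```cpp
#include <cassert>

// Hypothetical hierarchy for illustration only; neither class is final.
struct Counter {
    virtual ~Counter() = default;
    virtual int step(int x) const { return x + 1; }
};

struct DoubleStep : Counter {
    int step(int x) const override { return x + 2; }
};

int run_local(int n) {
    assert(n >= 0);
    DoubleStep d;            // dynamic type is visibly DoubleStep
    const Counter& c = d;    // base reference, but bound to a known local object
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc = c.step(acc);   // candidate for devirtualization and inlining
    return acc;
}
```

If d were instead received as a Counter& parameter from an unknown caller, the same loop would have to keep the virtual dispatch.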
Profile-guided optimization extends this further with speculative devirtualization. After a PGO profile collection run, Clang and GCC can identify virtual call sites where one concrete type handles the vast majority of dispatches. The generated code emits an inline type check, a devirtualized direct call on the predicted-hot path, and a fallback to normal virtual dispatch for other types. For call sites where one type accounts for 90 percent or more of invocations, this eliminates the retpoline overhead on nearly every iteration.
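The shape of that generated code can be approximated by hand. The sketch below is an illustration, not what any compiler literally emits: it uses typeid for the guard, whereas compilers typically compare a vtable or function address. Effect, Scale, Offset, and apply_guarded are hypothetical names, and Scale is assumed to be the type the profile found hot.

```cpp
#include <cassert>
#include <typeinfo>

// Hypothetical hierarchy for illustration only.
struct Effect {
    virtual ~Effect() = default;
    virtual float process(float x) const = 0;
};

struct Scale final : Effect {
    float k = 2.0f;
    float process(float x) const override { return x * k; }
};

struct Offset final : Effect {
    float d = 1.0f;
    float process(float x) const override { return x + d; }
};

float apply_guarded(const Effect& e, float x) {
    // Inline type check against the type assumed to dominate the profile.
    if (typeid(e) == typeid(Scale)) {
        // Exact dynamic type verified, so the cast is safe. The qualified
        // name forces a non-virtual (direct, inlinable) call.
        return static_cast<const Scale&>(e).Scale::process(x);
    }
    // Fallback: ordinary virtual dispatch for every other type.
    return e.process(x);
}
```

The guard costs a compare and a conditional branch, which is a direct branch and therefore not subject to retpoline.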
Template polymorphism eliminates the vtable entirely. CRTP and other static polymorphism patterns make the call target known at compile time, producing a direct call or an inlined body, with no indirect branch anywhere in the generated code.
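A minimal CRTP sketch, with hypothetical names (EffectBase, GainCrtp, process_impl, apply_all). The base class forwards to the derived class through a static_cast, so the call target is resolved at compile time and no vtable exists.

```cpp
#include <cassert>
#include <cstddef>

// CRTP base: Derived is known at compile time, so no virtual functions are needed.
template <typename Derived>
struct EffectBase {
    float process(float x) const {
        // Static dispatch: resolved by the compiler, not through a vtable.
        return static_cast<const Derived*>(this)->process_impl(x);
    }
};

struct GainCrtp : EffectBase<GainCrtp> {
    float factor = 0.5f;
    float process_impl(float x) const { return x * factor; }
};

// The loop is instantiated per concrete effect type, so the body can be inlined.
template <typename E>
void apply_all(const EffectBase<E>& e, float* buf, std::size_t n) {
    assert(buf != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = e.process(buf[i]);
}
```

The trade-off is that a heterogeneous collection of effects can no longer share one container of base pointers. Each concrete type gets its own instantiation of the loop.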
Hardware Alternatives to Retpoline
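Intel’s Control-flow Enforcement Technology provides a hardware mechanism that constrains indirect branches without retpoline's speculation stall. The branch-tracking component of CET (IBT) requires that valid indirect call targets begin with an ENDBR64 instruction, and the hardware faults on any indirect branch that lands anywhere else. IBT was designed as a control-flow integrity feature, so it narrows the set of targets a poisoned BTB entry can usefully reach rather than preventing the poisoning itself, and it does so without blocking speculative execution at legitimate targets.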
The performance model differs substantially from retpoline. ENDBR64 is decoded as a NOP and imposes no latency penalty on the executing instruction stream. Indirect calls proceed at near-normal speed. The cost is code size: every valid indirect call target grows by four bytes to accommodate the marker.
CET-IBT is available on Intel 11th-generation CPUs and later (Tiger Lake onwards) with appropriate kernel and toolchain support. GCC and Clang both emit CET instrumentation with -fcf-protection=branch. For workloads with frequent indirect branches, its per-call overhead is lower and more predictable than retpoline's, though whether it is sufficient on its own depends on the threat model and on which other mitigations are active. AMD supports CET shadow stacks in recent microarchitectures; branch-tracking support should be checked for the specific processor.
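One way to confirm that a translation unit was built with branch-tracking instrumentation is the __CET__ macro, which GCC documents as defined under -fcf-protection, with bit 0 set for branch and bit 1 set for return. The helper names below are hypothetical. The check reports only how the code was compiled, not whether the CPU or kernel enforces IBT at runtime.

```cpp
#include <cassert>

// Returns the compile-time CET bitmask: bit 0 = branch tracking (IBT),
// bit 1 = return protection (shadow stack). Returns 0 when the macro is absent.
constexpr int cet_build_flags() {
#ifdef __CET__
    return __CET__;
#else
    return 0;
#endif
}

// True when this translation unit was compiled with IBT instrumentation.
constexpr bool built_with_ibt() {
    return (cet_build_flags() & 1) != 0;
}
```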
On processors that implement it, Intel’s eIBRS (enhanced Indirect Branch Restricted Speculation) provides a hardware barrier at a different overhead point, typically with less impact on user-space code than software retpoline.
What This Changes About the Advice
The standard guidance for performance-sensitive C++ has long favored templates over virtual dispatch in hot loops, based on the compiler’s ability to see through the call boundary and vectorize. That advice gains new weight in a retpoline environment. The question is no longer only whether the compiler can vectorize; it is whether the code is generating indirect branches at all.
For a tight loop over a polymorphic collection, the design choices that matter are:
- Whether the concrete type can be proven statically (final, local allocation, template instantiation)
- Whether the polymorphism needs to live inside the loop, or whether dispatching once on type outside the loop and running a monomorphic loop inside achieves the same goal
- Whether PGO is enabled in the build, which enables speculative devirtualization for call sites that the compiler cannot prove statically
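The second option in the list, dispatching once outside the loop, can be sketched as follows. Filter, Halve, Negate, run_loop, and apply are hypothetical names, and dynamic_cast is just one way to do the one-time type test. An alternative design makes the whole-buffer loop itself a virtual method.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical hierarchy for illustration only.
struct Filter {
    virtual ~Filter() = default;
    virtual float process(float x) const = 0;
};

struct Halve final : Filter {
    float process(float x) const override { return x * 0.5f; }
};

struct Negate final : Filter {
    float process(float x) const override { return -x; }
};

// Monomorphic loop: F is a concrete final type, so f.process can be
// called directly or inlined.
template <typename F>
void run_loop(const F& f, float* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = f.process(buf[i]);
}

// Dispatch once per buffer instead of once per element.
void apply(const Filter& f, float* buf, std::size_t n) {
    assert(buf != nullptr || n == 0);
    if (auto* h = dynamic_cast<const Halve*>(&f)) {
        run_loop(*h, buf, n);
    } else if (auto* ng = dynamic_cast<const Negate*>(&f)) {
        run_loop(*ng, buf, n);
    } else {
        // Unknown type: fall back to per-element virtual dispatch.
        for (std::size_t i = 0; i < n; ++i)
            buf[i] = f.process(buf[i]);
    }
}
```

The type test is paid once per call to apply, so its cost is amortized over n elements, and only the fallback path keeps an indirect branch inside the loop.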
Lemire’s article correctly identifies that the call/ret overhead for a direct function call is worth understanding and eliminating. The retpoline overhead on indirect branches is the same problem operating at a higher multiplier, with no compiler inlining heuristic to save you. The only reliable solution is to not emit indirect branches in hot paths, which is exactly the direction final, templates, CRTP, and PGO-guided devirtualization all point.