Virtual Dispatch After Spectre: How Security Mitigations Reshaped the Indirect Call Cost Profile
Source: isocpp
Daniel Lemire’s analysis of function call costs focuses on direct calls: the caller knows the address of the callee at compile time, the CPU predicts the return with its Return Stack Buffer, and the overhead is modest enough that it rarely matters except when it blocks SIMD vectorization. That analysis is sound as far as it goes. But there is a second category of function call with a substantially different cost profile, and it changed in ways most performance guides have not caught up with: the indirect call.
Indirect calls cover virtual dispatch in C++, function pointer invocation, and all the type-erased callable wrappers built on those mechanisms. Their cost profile was already higher than direct calls before 2018. After January 2018, it became much higher on most production hardware, and the explanation is Spectre.
The Indirect Branch Problem
A direct call instruction (CALL 0x401234) embeds the target address in the instruction stream. The CPU's branch predictor handles this with a Branch Target Buffer that caches predicted jump destinations; a repeated direct call to the same address is almost always predicted correctly, at a cost of roughly one cycle. The Return Stack Buffer handles the matching RET, also near-free as long as the call depth stays within the buffer's capacity.
An indirect call (CALL [rbx]) loads the target from a register or memory location determined at runtime. The CPU’s Indirect Branch Target Buffer tries to predict which address will appear in rbx based on historical data at that instruction address. For monomorphic call sites where the same concrete function is always called, prediction succeeds and the cost resembles a direct call. For polymorphic sites where multiple different callees appear, the predictor mispredicts and the CPU pays the full branch misprediction penalty: 15 to 20 cycles on Intel Skylake and AMD Zen 2 according to Agner Fog’s microarchitecture tables.
Virtual dispatch generates indirect calls. A virtual method invocation loads the vptr from the object, loads the method pointer from the vtable at the appropriate slot, and branches to that address. With a warm cache and a monomorphic call site, this adds roughly 3 to 5 cycles over a direct call. With a cold cache or a megamorphic call site, the misprediction penalty applies. That was the complete cost model through 2017.
What Spectre Variant 2 Found
The Spectre and Meltdown disclosures of January 2018 described a class of vulnerabilities in which an attacker exploits speculative execution to leak data across privilege boundaries. Spectre variant 2 targets the indirect branch predictor specifically. By poisoning the Indirect Branch Target Buffer from an attacker-controlled process, an adversary could cause a victim process's indirect branches to speculatively jump to an attacker-chosen gadget address, leaking data through cache side channels before the CPU's pipeline detected the misprediction.
The fix required preventing the CPU from making useful speculative progress through indirect branches. Several mitigations emerged, each with a different cost profile.
Retpoline
Google’s retpoline (return trampoline) is a software mitigation. It replaces every indirect branch with a code pattern that redirects speculative execution through a harmless infinite loop, starving the speculative pipeline of useful work while the real branch target resolves:
; Original: CALL [rbx]
; Retpoline replacement:
        call set_up_target
infinite_loop:
        pause
        lfence
        jmp infinite_loop
set_up_target:
        mov [rsp], rbx   ; overwrite return address with real target
        ret              ; speculative: returns to infinite_loop; real: jumps to rbx
The CPU speculatively executes the infinite_loop path, where pause and lfence ensure no useful speculative state accumulates, while actual execution resolves the return to the real target. The retpoline sequence costs approximately 30 to 80 cycles per indirect call on Intel Skylake-era hardware. On AMD processors a cheaper microcode-level IBRS mitigation is available, but software retpoline remains standard in portable builds.
GCC enables retpoline on x86-64 with -mindirect-branch=thunk and -mfunction-return=thunk. Clang uses -mretpoline. The Linux kernel compiles with retpoline when CONFIG_RETPOLINE=y, which is the default on distributions shipping a hardened kernel. Any user-space process on a retpoline kernel pays the mitigation cost for indirect calls in kernel code regardless of whether the user-space binary itself was compiled with the flag.
eIBRS and the Hardware Fix
Intel’s Enhanced IBRS (eIBRS), available on processors starting with Cascade Lake (late 2019) and all Sunny Cove and later microarchitectures, implements the restriction in hardware. The CPU maintains separate Indirect Branch Target Buffer state for different privilege levels, preventing cross-privilege poisoning without software trampolines. Performance measurements from the Linux kernel team put the eIBRS overhead at roughly 4 to 6 cycles per indirect branch, compared to retpoline’s 30 to 80. AMD’s Zen 2 and later processors implement a similarly efficient microcode-level mitigation; Zen 3 and later are considered unaffected by variant 2 in the classic sense.
A system running on a recent Intel Xeon or Core processor with a current kernel reports its mitigation status in /sys/devices/system/cpu/vulnerabilities/spectre_v2. The string Enhanced IBRS there indicates the cheaper hardware path is in use; Retpolines indicates the software mitigation.
The practical consequence is that Spectre overhead for indirect calls is hardware-generation-dependent. A 2019-era Intel server with a hardened kernel pays the full retpoline cost on every virtual call. A 2022-era server pays the eIBRS overhead, which is measurable but not catastrophic. A developer benchmarking on a modern laptop may see indirect call overhead 5 to 10 times lower than what production systems running older cloud instances experience. Pre-2020 benchmarks that report virtual dispatch costs without specifying the hardware and mitigation configuration are now incomplete data.
What This Means for Virtual Dispatch in Practice
Consider a loop calling a virtual method:
struct Processor {
    virtual float transform(float x) = 0;
};

struct FastProcessor : Processor {
    float transform(float x) override { return x * 2.5f; }
};

void apply(Processor* p, float* arr, int n) {
    for (int i = 0; i < n; i++) {
        arr[i] = p->transform(arr[i]); // indirect call per iteration
    }
}
Before Spectre mitigations on a warm monomorphic call site, this loop costs roughly 5 to 10 cycles per iteration. With retpoline on a hardened pre-eIBRS server, that climbs to 35 to 90 cycles per iteration, almost entirely from the mitigation overhead. Adding final to FastProcessor changes the picture completely:
struct FastProcessor final : Processor {
    float transform(float x) override { return x * 2.5f; }
};

void apply(FastProcessor* p, float* arr, int n) {
    for (int i = 0; i < n; i++) {
        arr[i] = p->transform(arr[i]); // devirtualized: FastProcessor is final
    }
}
With the concrete type known and final confirming no further subclassing is possible, the compiler resolves the virtual call to a direct call and inlines the body. The indirect branch disappears entirely, along with any retpoline cost. With AVX2, the loop then vectorizes to VMULPS processing eight floats per instruction. The path from virtual dispatch with retpoline at 80 cycles per element to vectorized direct computation at under one cycle per element is entirely a consequence of whether the compiler can see through the call boundary.
Link-time optimization extends devirtualization across translation unit boundaries. If LTO’s whole-program analysis determines that only one concrete implementation of a virtual method is reachable in the binary, it will speculatively devirtualize the call with a runtime type guard. The retpoline thunk remains in the binary but does not execute on the hot path, and the speculative devirtualization path proceeds through an inlined direct call.
Detecting Whether Your Binary Is Affected
If you have a compiled binary and want to know whether it contains retpoline trampolines, objdump -d shows the pattern. GCC names them __x86_indirect_thunk_*; Clang names them __llvm_retpoline_*. The disassembly pattern, a call to a local label followed by a pause/lfence/jmp loop, is identifiable regardless of compiler naming.
For profiling, perf stat -e branches,branch-misses shows whether a hot loop is paying branch costs; counting indirect branches specifically requires the CPU's own PMU event, whose name varies by microarchitecture (perf list shows what is available). A loop with millions of indirect branch events and a high misprediction rate is a candidate for devirtualization via type information or final. On a retpoline-patched system, perf records the mitigation overhead as a high cycle-per-call ratio even for correctly predicted branches, distinct from the misprediction spike you see on systems without retpoline.
Practical Adjustments
The function call cost conversation changed in 2018. The specific adjustments for code that uses virtual dispatch or function pointers in hot paths:
Use final on classes where the concrete type is fixed and inheritance is complete; the compiler can then devirtualize calls through pointers to that type and proceed to inline and vectorize.
For abstract interfaces that must remain polymorphic at the dispatch point, separate the hot numerical computation from the dispatch by accepting callable types as template parameters rather than pointers to a virtual base class.
Enable LTO so the devirtualization pass can cross translation unit boundaries.
Profile on the hardware that actually runs your production workload: the overhead on a developer machine with eIBRS can be an order of magnitude lower than on a retpoline-patched older server in a cloud data center.
Lemire’s framing, that the direct call’s raw cycle cost is rarely the interesting number, extends to the indirect case. Before Spectre, the interesting number for indirect calls was branch misprediction at 15 to 20 cycles, and monomorphic call sites largely escaped it. After Spectre, on hardened systems without eIBRS, it is retpoline at 30 to 80 cycles for every indirect call regardless of prediction accuracy. The ceiling moved, and it moved in a direction that makes templates, final, and LTO devirtualization substantially more valuable as design choices.