Virtual Dispatch Costs More Than the Pointer Load

Source: isocpp

David Álvarez Rosa’s recent piece on isocpp.org covers the mechanics of devirtualization and static polymorphism in C++. It’s a solid introduction to the topic. What it leaves room for is a deeper look at where the overhead actually lives, what modern compilers can and cannot do about it automatically, and how to pick among the available tools once you decide the compiler needs help.

The Overhead Is Not the Call

Most explanations of virtual dispatch focus on the two-load indirect branch: load the vptr from the object, load the function pointer from the vtable slot, then branch to an address the compiler does not know. On a modern x86-64 core with the object and vtable warm in L1, each of those dependent loads costs on the order of 4 to 5 cycles, and a correctly predicted indirect branch adds little on top. That is not nothing, but it is also not the reason virtual dispatch causes 10x slowdowns in tight loops.

The real cost is that the compiler cannot see past the indirect call. With a direct call, the optimizer can inline the callee, hoist invariants out of the surrounding loop, vectorize with SSE or AVX, propagate constants across the call boundary, and eliminate dead code. With a virtual call it cannot resolve, it can do none of that. The call site is an opaque wall: the compiler must assume that any memory reachable from the callee could change, that the return value has unknown properties, and that loads and stores cannot be reordered across the call.
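To make the "opaque wall" concrete, here is a minimal sketch of the same loop written two ways. The `Scaler` and `Doubler` types and both function names are hypothetical, invented for this illustration. Whether the second loop actually vectorizes depends on the compiler and flags.

```cpp
#include <vector>

// Hypothetical interface, used only for this illustration.
struct Scaler {
    virtual ~Scaler() = default;
    virtual float scale(float x) const = 0;
};

struct Doubler final : Scaler {
    float scale(float x) const override { return 2.0f * x; }
};

// One indirect call per element. Unless the compiler can prove the dynamic
// type of `s`, it has to treat scale() as opaque and keep the loop scalar.
void scale_all_virtual(const Scaler& s, std::vector<float>& v) {
    for (float& x : v) x = s.scale(x);
}

// The concrete type is a template parameter. With S = Doubler (a final
// class), scale() resolves to a direct call that can be inlined, leaving a
// plain multiply loop the optimizer is free to vectorize.
template <typename S>
void scale_all_static(const S& s, std::vector<float>& v) {
    for (float& x : v) x = s.scale(x);
}
```

Both functions compute the same result. The difference is only in what the optimizer is allowed to know at the call site.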

Chandler Carruth demonstrated this effect at CppCon 2014 with a benchmark over heterogeneous polymorphic objects stored as pointers. A flat array of values with a switch dispatch ran 10 to 50 times faster than the equivalent virtual dispatch loop. Most of that gap came from vectorization and cache locality, not from the pointer indirection itself. The indirect call was the wall that kept the auto-vectorizer out.

There is also the object layout cost: each polymorphic object carries a hidden vptr, typically 8 bytes on 64-bit platforms. For a class with a single float field, the actual object size with alignment can be 16 bytes instead of 4. Multiply that across a million particle structs in a game engine and you have forced your hot data to spill out of cache for no benefit on the hot path.
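The layout cost is easy to check directly. The sketch below uses two hypothetical particle types. The exact size of the polymorphic one is implementation-defined, so the code only asserts that it is larger, and the 16-byte figure in the comment is an assumption about typical 64-bit ABIs.

```cpp
#include <cstddef>

// Hypothetical particle types, used only for this illustration.
struct PlainParticle {
    float mass;
};

struct PolyParticle {
    virtual ~PolyParticle() = default;
    virtual void update(float /*dt*/) {}
    float mass = 0.0f;
};

static_assert(sizeof(PlainParticle) == sizeof(float), "no hidden members");
// Typically 16 bytes on 64-bit targets: 8 for the vptr, 4 for the float,
// 4 of padding. The standard does not fix the number.
static_assert(sizeof(PolyParticle) > sizeof(PlainParticle), "vptr adds to the footprint");

// Bytes occupied by n contiguous objects of type T.
template <typename T>
constexpr std::size_t footprint(std::size_t n) {
    return n * sizeof(T);
}
```

On such a target, `footprint<PolyParticle>(1'000'000)` would come to roughly 16 MB against 4 MB for the plain struct.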

What the Compiler Can Already Recover

Compilers are not passive about this. GCC and Clang both implement multiple devirtualization strategies, active by default at -O2 and above.

The simplest case is local type inference. If an object is stack-allocated and its address has not escaped to an external function or been stored in a global, the compiler knows the exact dynamic type at every call site:

Derived d;
d.foo(); // trivially devirtualized: dynamic type is Derived

Base* p = new Derived();
p->foo(); // devirtualizable if p hasn't escaped

The final keyword is the most reliable tool for programmer-assisted devirtualization. Marking a class or a virtual override final tells the compiler that no further overrides exist, removing all ambiguity:

class Derived final : public Base {
public:
    void foo() override; // calls through a Derived* or Derived& can be direct
};

This costs nothing at runtime and nothing in binary size. If a class legitimately has no subclasses in your codebase, final is almost always the right annotation.
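A slightly fuller sketch shows where final helps and where it does not. The function names are hypothetical. The comments describe what the compiler is permitted to do, not what any particular compiler is guaranteed to emit.

```cpp
struct Base {
    virtual ~Base() = default;
    virtual int foo() const { return 1; }
};

struct Derived final : Base {
    int foo() const override { return 2; }
};

// Static type is Derived, and Derived is final, so the dynamic type cannot
// be anything else: the compiler may call Derived::foo directly and inline it.
int call_through_derived(const Derived& d) { return d.foo(); }

// Static type is Base: final on Derived says nothing about what `b` refers
// to, so this stays a virtual call unless the type is proven some other way.
int call_through_base(const Base& b) { return b.foo(); }
```

The practical consequence is that final pays off most when hot code holds the concrete type, not a base pointer.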

GCC also implements speculative devirtualization (-fdevirtualize-speculatively, on by default at -O2). When the compiler suspects but cannot prove that a call site targets a single concrete type, it emits a type guard:

// conceptual expansion of a speculative devirtualization:
if (vptr == &Derived::vtable_address) {
    Derived::foo(this); // direct call, inlinable
} else {
    (*vptr_slot)(this); // slow path, full virtual dispatch
}

For monomorphic and bimorphic call sites, where one type dominates, the branch is nearly always predicted correctly and the fast path runs as a direct inlinable call.
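The same guard can be written by hand at the source level, which may help in seeing what the compiler is doing. This is only an analogy, under the assumption that `Derived` is the expected common case. A compiler compares vtable pointers, not `typeid`, and the function name `guarded_call` is hypothetical.

```cpp
#include <typeinfo>

struct Base {
    virtual ~Base() = default;
    virtual int foo() const { return 1; }
};
struct Derived : Base {
    int foo() const override { return 2; }
};
struct Other : Base {
    int foo() const override { return 3; }
};

int guarded_call(const Base& b) {
    // Fast path: the dynamic type is exactly Derived, so the qualified call
    // below bypasses virtual dispatch and is a candidate for inlining.
    if (typeid(b) == typeid(Derived)) {
        return static_cast<const Derived&>(b).Derived::foo();
    }
    // Slow path: ordinary virtual dispatch for every other type.
    return b.foo();
}
```

Hand-writing this is rarely worth it. The point is that the guard keeps the transformation correct even when the guess is wrong.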

With link-time optimization (-flto), both GCC and Clang gain access to the whole-program class hierarchy. If a virtual function has only one implementation anywhere in the binary, calls to it can be devirtualized unconditionally. Clang’s -fwhole-program-vtables flag, which itself requires -flto, enables whole-program devirtualization and extends it to virtual constant propagation: values read through vtable pointers can be replaced with compile-time constants when the entire program is visible.

For a significant class of production builds, these mechanisms are sufficient. Mark your leaves final, enable LTO in release builds, and the compiler handles the rest.

When the Compiler Cannot Help

Several situations genuinely defeat automatic devirtualization:

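Shared libraries and plugins. When types can be added at runtime by loading a .so or .dll, the compiler cannot see the full hierarchy. Neither LTO nor whole-program devirtualization can assume the hierarchy is closed, so calls through an exported base class cannot be devirtualized unconditionally unless the target is final. Speculative devirtualization still applies, but only behind a type guard.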

Cross-translation-unit calls without LTO. In a typical debug build, or any build without LTO, the compiler sees only one .cpp at a time. A virtual call through a base pointer passed as a function argument cannot be devirtualized because the concrete type is unknown.

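Megamorphic call sites. A call site that dispatches to many distinct concrete types (eight or more is a common rule of thumb) strains the CPU’s indirect branch predictor. Modern x86-64 processors have a limited indirect branch prediction budget; beyond a handful of targets with no exploitable pattern, mispredictions dominate. Spectre mitigations make this worse: a retpoline deliberately bypasses indirect branch prediction, so every indirect call through one pays roughly a misprediction’s worth of latency, on the order of 30 or 40 cycles.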

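These are the cases where the compiler needs help. For the last two, and for the hot code on your own side of a plugin boundary, that help is static polymorphism applied by hand.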

CRTP: Compile-Time Dispatch Without Overhead

The Curiously Recurring Template Pattern (CRTP) is the classical C++ solution. A base class template is parameterized by its own derived type:

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    double r;
    double area_impl() const { return 3.14159265 * r * r; }
};

The static_cast is a compile-time operation, a no-op at runtime. The call to area_impl is fully visible to the optimizer, so area() inlines down to two multiplications. No vptr, no indirect call, no barrier to vectorization.

This is not academic. Eigen, one of the most widely deployed C++ linear algebra libraries, builds its entire expression template system on CRTP. MatrixBase<Derived> is the common base of its dense matrix expressions, and coefficient-wise operations like A + B - C construct lazy expression trees that evaluate in a single SIMD-vectorized pass with no temporaries. That design would be incoherent with virtual dispatch. LLVM and Clang use CRTP for InstVisitor and RecursiveASTVisitor respectively, where traversal over millions of IR or AST nodes needs to be fast.

The limitations are real. Each Shape<T> instantiation is a separate class, so the compiler generates separate code for every concrete type. In a codebase with 30 shape types and a complex base class, this inflates binary size and instruction-cache pressure. You also lose runtime heterogeneity: there is no vector<Shape*> in CRTP. If you need to store mixed types in a single collection, you need a wrapper.
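The following sketch shows both sides of that trade-off, reusing the Shape<Derived> base from above with two hypothetical shapes, `Square` and `Rect`. The `total_area` helper is also an assumption made for illustration. It works per concrete type, which is exactly the point: each container is homogeneous, and each instantiation is separate code.

```cpp
#include <vector>

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

// Hypothetical concrete shapes.
struct Square : Shape<Square> {
    explicit Square(double s) : side(s) {}
    double area_impl() const { return side * side; }
    double side;
};

struct Rect : Shape<Rect> {
    Rect(double width, double height) : w(width), h(height) {}
    double area_impl() const { return w * h; }
    double w, h;
};

// One instantiation per concrete type D. Shape<Square> and Shape<Rect> are
// unrelated classes, so a single container cannot hold both.
template <typename D>
double total_area(const std::vector<D>& shapes) {
    double sum = 0.0;
    for (const Shape<D>& s : shapes) sum += s.area();
    return sum;
}
```

Calling `total_area` on a `std::vector<Square>` and on a `std::vector<Rect>` produces two independent functions in the binary, each with its area computation inlined.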

C++23’s deducing this offers a cleaner syntax for some CRTP patterns without the static_cast:

struct Shape {
    double area(this const auto& self) { // no trailing const: cv-qualifiers are not allowed with an explicit object parameter
        return self.area_impl();
    }
};

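This removes the need to explicitly parameterize the base class, because the type of self is deduced from the expression the call is made on. That is also the catch: area_impl must be defined on the concrete type, and the call must go through the concrete type. Called through a Shape&, self deduces as Shape and the lookup of area_impl fails to compile.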

std::variant: Closed-Set Polymorphism With Cache Locality

When the set of types is fixed at compile time, std::variant with std::visit is often the better tool:

using Shape = std::variant<Circle, Rectangle, Triangle>;

double area(const Shape& s) {
    return std::visit([](auto&& shape) { return shape.area_impl(); }, s);
}

The performance difference from virtual dispatch comes primarily from storage layout, not from dispatch speed. A vector<unique_ptr<Base>> stores pointers; following each pointer is a potential cache miss. A vector<Shape> stores values inline, adjacently in memory. For a cold cache with a million-element array, the variant version can be 3 to 10 times faster than the virtual version, with most of the difference coming from cache misses on pointer dereferences, not from the dispatch mechanism itself.
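The two storage layouts look like this side by side. This is a sketch, not a benchmark: the `Base` and `BoxedRectangle` types are hypothetical stand-ins for a virtual hierarchy, and no timing claims are made for this code.

```cpp
#include <memory>
#include <variant>
#include <vector>

// Value types for the variant version: no base class, no vptr.
struct Circle    { double r;    double area_impl() const { return 3.14159265 * r * r; } };
struct Rectangle { double w, h; double area_impl() const { return w * h; } };
using Shape = std::variant<Circle, Rectangle>;

// Hypothetical virtual counterpart, for contrast.
struct Base {
    virtual ~Base() = default;
    virtual double area() const = 0;
};
struct BoxedRectangle final : Base {
    BoxedRectangle(double width, double height) : w(width), h(height) {}
    double area() const override { return w * h; }
    double w, h;
};

// Elements live inline in the vector's buffer: one contiguous sweep.
double total_area(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const Shape& s : shapes)
        sum += std::visit([](const auto& shape) { return shape.area_impl(); }, s);
    return sum;
}

// Each element is a pointer to a separately allocated object, so every
// iteration may touch a different cache line.
double total_area(const std::vector<std::unique_ptr<Base>>& shapes) {
    double sum = 0.0;
    for (const auto& p : shapes) sum += p->area();
    return sum;
}
```

The loops are nearly identical in source. What differs is where the objects live, which is why the gap shows up mainly when the data does not fit in cache.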

For small variant sets (two to four types), the compiler can often lower the visit to a branch chain or switch rather than an indirect call through a table, which is friendlier to the branch predictor. The exact lowering depends on the standard library implementation and optimization level. Beyond about eight types, the generated dispatch starts to resemble a vtable again.

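The trade-off is extensibility. Adding a new type to a variant means revisiting every visit site: a visitor that handles alternatives one by one stops compiling until it covers the new type, and a generic visitor compiles only if the new type supports what its body uses. This is the closed-world assumption, which makes variants appropriate for protocol message types, AST nodes in a compiler, and ECS component types, but not for plugin-style architectures.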
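A visitor that handles each alternative explicitly shows the closed-world check in action. The `overloaded` helper below is a widely used idiom, not part of the standard library, and `corner_count` is a hypothetical function for illustration.

```cpp
#include <variant>

struct Circle    { double r; };
struct Rectangle { double w, h; };
struct Triangle  { double base, height; };
using Shape = std::variant<Circle, Rectangle, Triangle>;

// Common helper idiom (not in std): merge several lambdas into one visitor.
template <class... Ts>
struct overloaded : Ts... {
    using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

int corner_count(const Shape& s) {
    // One handler per alternative. Removing a handler, or adding a fourth
    // type to Shape without adding one here, is a compile error.
    return std::visit(overloaded{
        [](const Circle&)    { return 0; },
        [](const Rectangle&) { return 4; },
        [](const Triangle&)  { return 3; },
    }, s);
}
```

That compile error is the cost of a closed type set and also its main safety benefit: no visit site can silently ignore a new alternative.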

C++20 Concepts: Structural Typing Without the Base Class

Concepts add a third option: constraints without coupling. Any type satisfying the structural requirements can satisfy the constraint, with no inheritance:

template <typename T>
concept HasArea = requires(const T& t) {
    { t.area() } -> std::convertible_to<double>;
};

template <HasArea S>
double total_area(const S* shapes, std::size_t n) {
    double sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += shapes[i].area();
    return sum;
}

This is zero-cost at runtime, generates optimizable code for each instantiation, and requires no shared base class. The <ranges> library in C++20 leans heavily on this model. The limitation is that concepts provide constraints only, not shared implementation. For interfaces that need default behavior or shared state, CRTP or a non-virtual base class is still necessary.

Choosing the Right Tool

The decision tree is simpler than the options list suggests.

If your classes genuinely need runtime extensibility (plugins, separate compilation, user-supplied types), virtual dispatch is correct. Mark leaves final, enable LTO in release builds, and let the compiler do what it can. The overhead that remains is the cost of the flexibility you need.

If the type set is closed and cache locality matters, std::variant is likely the right choice. It composes naturally with standard containers, avoids heap allocation, and the exhaustiveness checking at std::visit sites catches missing cases at compile time.

If you are writing a library or a performance-critical kernel where template instantiation cost is acceptable and the type set is determined by the caller, CRTP gives you zero-cost abstraction with full optimizer visibility. Eigen’s 20 years of production use in HPC and robotics is a reasonable proof point.

For pure interface constraints with no shared implementation, C++20 concepts are the cleanest option and avoid the syntactic weight of CRTP entirely.

The common thread is that the compiler is not an adversary here. Speculative devirtualization, final-assisted devirtualization, and LTO-based whole-program analysis recover a substantial fraction of virtual call overhead automatically. The cases where static polymorphism pays off most clearly are the ones where the compiler’s hands are genuinely tied: megamorphic hot loops, hot code that sits next to a plugin boundary, and tight numerical kernels where vectorization is the difference between acceptable and excellent throughput.
