Counting the Cost of Virtual Dispatch, and What C++23 Changes About It

The Three-Instruction Sequence

Virtual dispatch compiles to three operations: two memory loads and an indirect branch. On x86-64 the second load and the branch fold into a single instruction, so for a call obj->foo() the compiler emits roughly:

mov  rax, [rdi]      ; load vptr from object
call [rax + offset]  ; load function pointer from vtable, then branch

The first instruction loads the vptr embedded at the beginning of every polymorphic object. The second loads the function pointer from the vtable at the appropriate slot, then branches to it. Two memory reads plus an indirect branch, every time, regardless of what foo() does.
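As a rough illustration, a call site of the following shape is the kind that produces this sequence. The types Base and Derived and the function call_foo are hypothetical names chosen for this sketch. The comment about generated code assumes x86-64 with the Itanium C++ ABI and a caller that cannot see the dynamic type.

```cpp
// Hypothetical hierarchy for illustration only.
struct Base {
    virtual ~Base() = default;
    virtual int foo() const { return 1; }
};

struct Derived : Base {
    int foo() const override { return 2; }
};

// The compiler sees only a Base*, so unless this function is inlined into a
// caller that knows the dynamic type, the call goes through the vtable:
// load the vptr from *obj, load the slot for foo, branch to it.
int call_foo(const Base* obj) {
    return obj->foo();
}
```

Passing a Derived through call_foo returns 2 and passing a Base returns 1. The target is chosen at run time by the vtable slot, not by the static type of obj.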

On a warm cache, the vtable lookup itself is cheap. Vtables live in .rodata and stay resident once touched. The real cost is the indirect branch. Modern CPUs maintain an Indirect Branch Predictor (IBP) that tracks the history of indirect call targets. For a monomorphic call site, one that always dispatches to the same concrete type, the IBP learns quickly and the misprediction rate drops near zero. Overhead shrinks to perhaps 1-3 cycles above a direct call.

The picture changes for megamorphic sites, where four or more concrete types flow through the same call point in irregular order. The IBP saturates, predicts wrong most of the time, and the CPU pays the full misprediction penalty: 15-20 cycles on Skylake, roughly 15 on Zen 3. At those sites, virtual dispatch in a tight loop costs 5-20x more than a direct or inlined call.
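A megamorphic site is easy to construct for experimentation. The sketch below is an assumption-laden test harness, not a benchmark from the original article: the Op hierarchy, make_mixed_ops, and sum_weights are hypothetical names, and the pseudo-random ordering comes from a small linear congruential generator so that runs are reproducible.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical four-type hierarchy; names are illustrative only.
struct Op {
    virtual ~Op() = default;
    virtual int weight() const = 0;
};
struct OpA : Op { int weight() const override { return 1; } };
struct OpB : Op { int weight() const override { return 2; } };
struct OpC : Op { int weight() const override { return 3; } };
struct OpD : Op { int weight() const override { return 4; } };

// Fill a vector with the four types in an irregular but reproducible order.
std::vector<std::unique_ptr<Op>> make_mixed_ops(std::size_t n) {
    std::vector<std::unique_ptr<Op>> ops;
    ops.reserve(n);
    std::uint32_t state = 12345u;
    for (std::size_t i = 0; i < n; ++i) {
        state = state * 1664525u + 1013904223u;  // LCG step
        switch ((state >> 16) % 4u) {
            case 0:  ops.push_back(std::make_unique<OpA>()); break;
            case 1:  ops.push_back(std::make_unique<OpB>()); break;
            case 2:  ops.push_back(std::make_unique<OpC>()); break;
            default: ops.push_back(std::make_unique<OpD>()); break;
        }
    }
    return ops;
}

// One call site, four possible targets arriving in irregular order.
long sum_weights(const std::vector<std::unique_ptr<Op>>& ops) {
    long total = 0;
    for (const auto& op : ops) {
        total += op->weight();
    }
    return total;
}
```

Timing sum_weights over a mixed vector against a vector holding a single type, ideally alongside a hardware branch-miss counter, is one way to see how much of the gap on a given machine comes from misprediction.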

Post-Spectre, the situation worsened further. Retpoline, the standard mitigation for Spectre variant 2, replaces indirect branches with a speculation-safe trampoline that deliberately defeats branch prediction. In code compiled with retpolines, which includes the Linux kernel on affected hardware and any userspace built with GCC’s -mindirect-branch=thunk or Clang’s -mretpoline, every indirect call adds 10-30 cycles of overhead regardless of whether the IBP would have predicted correctly. David Álvarez Rosa’s writeup on isocpp.org covers the baseline overhead clearly; the Spectre dimension is what makes the calculus urgent for latency-critical code written in the last seven years.

What the Compiler Can Do

Compilers are not helpless here. GCC enables -fdevirtualize at -O2, attempting to prove at compile time what the concrete type is and replacing the indirect call with a direct one. The cases where this succeeds reliably are narrower than most developers expect.

Stack-allocated objects with known type devirtualize consistently:

Derived d;
Base& b = d;
b.f();  // GCC -O2: direct call to Derived::f, often inlined entirely

The final keyword is the single most effective hint you can give the compiler. Marking a class final tells the compiler no subclass can exist, so any call through a pointer or reference whose static type is that class can be bound directly, with no vtable lookup:

class ConcreteShape final : public Shape {
public:  // draw() must be accessible for the call below to compile
    void draw() override { /* ... */ }
};

ConcreteShape* p = get_shape();
p->draw();  // With final: direct call or inlined. Without: indirect.

You can verify this on Compiler Explorer. A final class with one virtual method and a trivial body will, at -O2, compile the call to a single jmp or inline the body entirely. Without final, the indirect call remains unless LTO can see the whole program.

GCC’s -fdevirtualize-speculatively, also enabled at -O2, goes further by inserting a type check with a direct call in the fast path and a fallback indirect call:

// Conceptually what the compiler generates:
if (vptr == &ConcreteShape::vtable) {
    ConcreteShape::draw(p);  // fast path: direct, inlinable
} else {
    p->draw();               // fallback: indirect
}

This is profitable at monomorphic sites where one type dominates. With LTO, cross-translation-unit devirtualization becomes possible. GCC’s -fdevirtualize-at-ltrans and Clang’s -fwhole-program-vtables can devirtualize single-implementation classes even without final.
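The guarded-call shape can also be written by hand, which makes the structure easier to see. The following is a minimal sketch under stated assumptions: Shape, Rect, Tri, and area_guarded are hypothetical names, and the typeid comparison stands in for the vptr comparison a compiler would emit. The typeid check itself reads through the vptr and is usually more expensive than the compiler's version, so this illustrates the structure rather than a recommended optimization.

```cpp
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Rect final : Shape {
    Rect(double w, double h) : w_(w), h_(h) {}
    double area() const override { return w_ * h_; }
    double w_, h_;
};

struct Tri final : Shape {
    Tri(double b, double h) : b_(b), h_(h) {}
    double area() const override { return 0.5 * b_ * h_; }
    double b_, h_;
};

double area_guarded(const Shape& s) {
    if (typeid(s) == typeid(Rect)) {
        // Fast path: the qualified name Rect::area suppresses virtual
        // dispatch, so this call is bound at compile time and can be inlined.
        return static_cast<const Rect&>(s).Rect::area();
    }
    // Fallback: ordinary virtual call for every other dynamic type.
    return s.area();
}
```

Both paths return the same result as a plain s.area(); only the way the target is reached differs.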

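The failure modes are instructive. A call through a Base* parameter cannot be devirtualized unless the function is inlined into a caller where the type is known or the optimizer has whole-program information. Calls across shared library boundaries generally stay indirect: the compiler cannot see classes defined in other libraries, and with default ELF symbol visibility the dynamic linker may interpose definitions at load time. Building with -fPIC and without LTO therefore leaves the compiler no choice but to trust the vtable unconditionally.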

CRTP: The Traditional Solution

When the compiler cannot devirtualize and profiling confirms virtual overhead is the bottleneck, the established solution is static polymorphism via the Curiously Recurring Template Pattern. The base class takes the derived class as a template parameter, allowing it to call derived methods through static_cast without any vtable:

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Circle : public Shape<Circle> {
public:
    double area_impl() const { return 3.14159265 * r_ * r_; }
private:
    double r_;
};

The static_cast is safe because Shape<Circle> is always a base of Circle by the time the template is instantiated. The call resolves at compile time; with inlining enabled, area_impl() folds into area() and the dispatch disappears entirely. No vptr, no vtable load, no indirect branch.
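To show how generic code consumes such a hierarchy, here is a small self-contained sketch. The Square class, the constructors, and the scaled_area function are additions for illustration and do not come from the original example.

```cpp
template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(double r) : r_(r) {}
    double area_impl() const { return 3.14159265 * r_ * r_; }
private:
    double r_;
};

class Square : public Shape<Square> {
public:
    explicit Square(double side) : side_(side) {}
    double area_impl() const { return side_ * side_; }
private:
    double side_;
};

// Accepts any Shape<D>. The concrete type D is a template parameter, so
// shape.area() resolves to D::area_impl() at compile time.
template <typename D>
double scaled_area(const Shape<D>& shape, double factor) {
    return factor * shape.area();
}
```

A call such as scaled_area(Square(3.0), 2.0) instantiates a separate function per shape type. That is the trade: no run-time dispatch, but also no single function that handles a mixed collection.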

Eli Bendersky’s 2013 benchmark measured CRTP roughly 3.9x faster than virtual dispatch in a tight loop on a Core i7, calling a trivial area method 500 million times. That gap reflects a best-case scenario for CRTP and a worst-case scenario for virtual; real workloads narrow it considerably once function bodies do meaningful work.

CRTP has been a load-bearing pattern in the C++ ecosystem for decades. std::enable_shared_from_this<T> is CRTP. boost::iterator_facade is CRTP. The policy-based design in Alexandrescu’s Modern C++ Design is a close relative, built on the same idea of composing behavior through template parameters and inheritance. It works, but it carries visible costs: the static_cast<Derived*>(this) boilerplate is error-prone, the base class must be a template, and heterogeneous containers require separate type erasure on top.

C++23 Makes the Pattern Cleaner

C++23’s “deducing this” (P0847) removes the CRTP boilerplate by introducing explicit object parameters. The base class no longer needs to be a template; the derived type is deduced from the call site:

class Shape {
public:
    template <typename Self>
    double area(this Self&& self) {
        return std::forward<Self>(self).area_impl();
    }
};

class Circle : public Shape {
public:
    double area_impl() const { return 3.14159265 * r_ * r_; }
private:
    double r_;
};

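Shape::area is instantiated once per deduced Self type. When the call is made on an object whose static type is the derived class, the generated code is identical to CRTP: a direct call or a fully inlined body, zero dispatch overhead. The difference is that Shape is now a plain class, Circle inherits normally without the recursive template parameter, and no static_cast appears anywhere. One caveat: Self is deduced from the static type of the object expression, so calling area() through a Shape& deduces Shape and fails to compile, because Shape itself has no area_impl().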

Deducing this handles const, volatile, and rvalue qualifications via forwarding, which CRTP requires separate overloads to express. It also cleans up the builder pattern, where CRTP was previously needed to return the correct derived type from chained method calls:

class Builder {
public:
    template <typename Self>
    Self& set_name(this Self& self, std::string name) {
        self.name_ = std::move(name);
        return self;  // Returns derived type automatically
    }
private:
    std::string name_;  // the member set_name writes to
};

GCC 14 and Clang 18 both support deducing this fully. MSVC has supported it since 19.32 (Visual Studio 2022 version 17.2). For new code targeting C++23, the static_cast<Derived*>(this) idiom has no remaining advantage over the explicit object parameter form.

Closed-World Polymorphism: std::variant

When the set of types is finite and known at compile time, std::variant with std::visit provides a third path. The variant stores one of several types inline with value semantics; std::visit dispatches via a jump table indexed by the stored type index:

using Shape = std::variant<Circle, Square, Triangle>;

double total_area(const std::vector<Shape>& shapes) {
    double total = 0;
    for (const auto& s : shapes) {
        total += std::visit([](const auto& shape) {
            return shape.area();
        }, s);
    }
    return total;
}

The jump table resembles a vtable, but the compiler sees all branches and can optimize each independently. More importantly, the objects are stored inline rather than on the heap, preserving cache locality. A vector<Shape> lays out objects contiguously; a vector<Shape*> does not, and the pointer chasing that follows is often a larger contributor to slowdown than virtual dispatch itself.

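The limitation is real: variants are closed. Adding a new type means changing the variant definition, recompiling every std::visit call, and updating every visitor that handles the alternatives by name. For extensible hierarchies, virtual functions or CRTP remain the appropriate tools.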
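The closed-world constraint is most visible in visitors that name each alternative. In the sketch below, the overloaded helper is a common C++17 idiom rather than a standard library facility, and name_of is a hypothetical function. The shape types are minimal stand-ins.

```cpp
#include <string>
#include <variant>

struct Circle   { double r; };
struct Square   { double side; };
struct Triangle { double base, height; };

using Shape = std::variant<Circle, Square, Triangle>;

// Helper that merges several lambdas into one callable with an overload set.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

std::string name_of(const Shape& s) {
    // One handler per alternative. If a fourth type is added to Shape and no
    // handler is added here, this std::visit call no longer compiles.
    return std::visit(overloaded{
        [](const Circle&)   { return std::string("circle"); },
        [](const Square&)   { return std::string("square"); },
        [](const Triangle&) { return std::string("triangle"); },
    }, s);
}
```

The compile error on a missing handler is the useful side of the restriction: the compiler enumerates every place that needs attention. A generic lambda visitor, by contrast, keeps compiling as long as the new type provides the members it calls.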

When the Overhead Actually Matters

Virtual dispatch overhead only exceeds the noise floor when the function body is cheap relative to dispatch cost. A useful threshold, assuming a core clocked at roughly 3 GHz: if the function body executes in under 20 nanoseconds (about 60 cycles), a 3-25 cycle dispatch overhead represents roughly 5-30% of total call cost, more for shorter bodies, and is worth addressing. If the body takes 100 nanoseconds or more, dispatch overhead is under 10% even in the worst case and the design clarity of virtual functions usually outweighs any optimization.
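To check a particular function against a threshold like this, the share can be computed from a measured body time, an assumed dispatch cost in cycles, and the clock frequency. The function name and parameters below are illustrative, and the model is deliberately simple: it assumes dispatch and body time add up with no overlap, which out-of-order execution does not guarantee.

```cpp
// Fraction of one call's total time attributable to dispatch, in [0, 1].
// body_ns:         measured time of the function body, in nanoseconds
// dispatch_cycles: assumed dispatch overhead, in CPU cycles
// clock_ghz:       core clock in GHz, i.e. cycles per nanosecond
double dispatch_share(double body_ns, double dispatch_cycles, double clock_ghz) {
    const double dispatch_ns = dispatch_cycles / clock_ghz;
    return dispatch_ns / (body_ns + dispatch_ns);
}
```

Feeding in the best-case and worst-case dispatch costs for the target CPU gives a range rather than a single number, which is usually the more honest way to decide whether a call site deserves attention.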

Profile before restructuring. The final keyword costs nothing to add and should be applied to any class not intended to be subclassed, both as a devirtualization hint and as documentation of intent. Mark leaf classes final by default and let the compiler do the rest. For call sites where profiling confirms virtual dispatch is the measured bottleneck, CRTP or C++23 deducing this are direct replacements. For closed-type sets with cache sensitivity, std::variant with its inline storage often proves faster than either virtual dispatch or pointer-chasing CRTP wrappers.

The underlying principle from the source article holds: abstraction is not free by default, but it can be made free by design. The tools to achieve zero-cost polymorphism have existed in C++ for years; C++23 simply removes the last excuse for writing ugly boilerplate to get there.