The CPU Cost of Virtual Dispatch and What Modern C++ Offers Instead

Source: isocpp

Virtual functions in C++ trade runtime flexibility for measurable overhead at the instruction level. David Álvarez Rosa’s piece on devirtualization and static polymorphism covers the mechanics clearly, but it leaves three questions worth pursuing: what that overhead looks like in assembly, what causes it to compound, and what modern C++ offers as an alternative when you cannot afford it.

What a virtual call generates at the machine level

Every class with virtual functions gets a vtable: a read-only array of function pointers, typically placed in .rodata (or .data.rel.ro in position-independent ELF builds). Under the common ABIs, each instance carries a hidden vptr, usually at offset 0, pointing into this table. When you call a virtual function through a base pointer, the compiler emits roughly this sequence on x86-64:

call(Base*, int):
    mov    rax, [rdi]        ; load vptr from object
    call   [rax]             ; indirect call through vtable entry

Three memory-level operations are implied: load the vptr from the object, load the function pointer from the vtable, and execute an indirect branch. In the best case, all of this lives in L1 cache and the CPU’s Indirect Branch Target Buffer has a valid prediction; the total cost is roughly 3 to 5 nanoseconds. In the worst case, none of those hold: the vtable is cold, the branch target has not been seen before, and you are looking at 20 to 35 nanoseconds for a single dispatch.
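For reference, a minimal source sketch that tends to produce a sequence of this shape at -O2 is shown below. The names Base, Derived, f, and demo_call are illustrative and not from the original piece, and the exact instructions depend on compiler, ABI, and flags (a compiler may, for instance, emit a tail jmp instead of a call).

```cpp
#include <cassert>

struct Base {
    virtual int f(int x) const = 0;   // declared first, so it occupies the first vtable slot
    virtual ~Base() = default;
};

struct Derived : Base {
    int f(int x) const override { return x + 1; }
};

// Through a base pointer the compiler generally cannot know the dynamic type,
// so it loads the vptr from *b and calls through the vtable slot for f.
int call(Base* b, int x) { return b->f(x); }

// Hypothetical usage: the dynamic type of the object decides which f runs.
int demo_call() {
    Derived d;
    const int result = call(&d, 41);
    assert(result == 42);
    return result;
}
```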

The branch prediction case is the one that tends to surprise people. A call site that receives the same concrete type on every invocation, a monomorphic call site, will be predicted reliably. A megamorphic call site, one that receives four or more concrete types with no dominant type, essentially defeats the predictor. Agner Fog’s microarchitecture manuals document the Skylake Indirect Branch Target Buffer as having around 1024 entries; if your program’s indirect branch history overflows that capacity, you see systematic mispredictions at roughly 15 to 20 cycles of penalty per miss. For a tight inner loop with a 10-instruction body, one misprediction per iteration drops effective IPC from around 4.0 to something close to 0.5.
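The IPC figure follows from simple arithmetic, sketched here under the stated assumptions: a 10-instruction body, a baseline IPC of 4.0, and one misprediction per iteration costing 15 to 20 cycles. The function name effective_ipc is hypothetical, and the model ignores overlap between the penalty and useful work.

```cpp
// Effective IPC when every loop iteration pays one branch misprediction.
// instructions:   instructions per iteration
// base_ipc:       IPC when nothing stalls
// penalty_cycles: cycles lost per misprediction
constexpr double effective_ipc(double instructions, double base_ipc,
                               double penalty_cycles) {
    return instructions / (instructions / base_ipc + penalty_cycles);
}

// 15-cycle penalty: 10 / (2.5 + 15) = 10 / 17.5, roughly 0.57
static_assert(effective_ipc(10, 4.0, 15) > 0.57 &&
              effective_ipc(10, 4.0, 15) < 0.58, "about 0.57");

// 20-cycle penalty: 10 / (2.5 + 20) = 10 / 22.5, roughly 0.44
static_assert(effective_ipc(10, 4.0, 20) > 0.44 &&
              effective_ipc(10, 4.0, 20) < 0.45, "about 0.44");
```

Both ends of the penalty range land near the 0.5 quoted above.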

The performance delta between a virtual call and a direct call often comes less from the dispatch cost itself than from the surrounding optimizations the indirect call prevents. An indirect call is an opaque barrier to the optimizer. The inliner cannot cross it, which means it cannot constant-fold, cannot eliminate dead stores around the call, and cannot auto-vectorize a loop that contains one. Chandler Carruth demonstrated this concretely at CppCon 2014: a tight polymorphic loop with two concrete types produced a branch misprediction rate above 40%, reducing IPC from roughly 3.8 to 1.1 on Haswell, with the devirtualized equivalent running 3.5x faster on the same data.
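The barrier effect can be sketched with two loops that differ only in what the compiler knows about the callee. The types Op and Twice and both function names are hypothetical; whether a given compiler inlines or vectorizes either loop depends on version and flags, so treat this as an illustration rather than a benchmark.

```cpp
#include <cassert>
#include <vector>

struct Op {
    virtual int apply(int x) const = 0;
    virtual ~Op() = default;
};

struct Twice final : Op {
    int apply(int x) const override { return 2 * x; }
};

// The dynamic type of op is unknown here, so unless the optimizer can prove
// it (for example after inlining this function), each iteration is an
// indirect call that blocks inlining and vectorization of the loop body.
int sum_virtual(const Op& op, const std::vector<int>& v) {
    int sum = 0;
    for (int x : v) sum += op.apply(x);
    return sum;
}

// Twice is final, so apply() binds statically; the body can be inlined and
// the loop becomes a plain multiply-accumulate the vectorizer can handle.
int sum_direct(const Twice& op, const std::vector<int>& v) {
    int sum = 0;
    for (int x : v) sum += op.apply(x);
    return sum;
}

void demo_barrier() {
    const std::vector<int> v{1, 2, 3, 4};
    Twice t;
    assert(sum_virtual(t, v) == 20);   // same result, different codegen
    assert(sum_direct(t, v) == 20);
}
```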

When the compiler fixes it for you

GCC and Clang both attempt devirtualization automatically and succeed under specific conditions. A call made directly on an object of concrete type needs no dynamic dispatch at all: given Derived d;, the call d.f() is bound statically and never generates an indirect call. The final specifier on a class or method is probably the most underused tool available here. Marking a class final tells the compiler that no further override can exist, enabling devirtualization even through a pointer or reference to that class:

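struct Base {
    virtual int f() = 0;
    virtual ~Base() = default;
};

struct Derived final : Base {
    int f() override { return x * 2; }
    int x;
};

// Derived is final, so *p cannot be a more-derived type with another override.
int test(Derived* p) { return p->f(); }   // devirtualized at -O2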

You can verify this on Compiler Explorer: a named symbol in the call instruction means the compiler devirtualized it; call [rax] means it did not.

GCC’s -fdevirtualize-speculatively, enabled at -O2 and above, wraps uncertain calls in a guard: if the target loaded from the vtable matches the expected function, take the direct path; otherwise fall back to the virtual call. Clang’s -fwhole-program-vtables, which requires -flto, builds precise reachability sets for each call site and devirtualizes whenever only one concrete type can reach it. Link-time optimization extends all of these analyses across translation unit boundaries.
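A hand-written analogue of such a speculative guard is sketched below. It is not what the compiler literally emits: the generated code compares a pointer directly, and typeid stands in for that comparison here. The types Likely and Rare and the function call_speculative are hypothetical.

```cpp
#include <cassert>
#include <typeinfo>

struct Base {
    virtual int f() const { return 1; }
    virtual ~Base() = default;
};

struct Likely : Base {
    int f() const override { return 2; }
};

struct Rare : Base {
    int f() const override { return 3; }
};

// Guarded call: test for the expected dynamic type, take a statically bound
// (and therefore inlinable) path on a match, otherwise dispatch virtually.
int call_speculative(const Base& b) {
    if (typeid(b) == typeid(Likely)) {
        // The qualified name suppresses virtual dispatch for this call.
        return static_cast<const Likely&>(b).Likely::f();
    }
    return b.f();   // fallback: ordinary virtual call
}

void demo_speculative() {
    assert(call_speculative(Likely{}) == 2);   // fast path
    assert(call_speculative(Rare{}) == 3);     // fallback path
    assert(call_speculative(Base{}) == 1);     // fallback path
}
```

The guard only pays off when one type dominates at the call site; otherwise the extra compare and branch are pure overhead.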

The situations that defeat these mechanisms are worth knowing: pointers crossing shared library boundaries, virtual bases (which make the this-pointer adjustment runtime-dependent), and call sites where no single type dominates enough for the speculation heuristic to fire. In those cases, you either live with the cost or reach for a manual approach.

Three manual approaches

The oldest technique is CRTP, the Curiously Recurring Template Pattern. The base class is parameterized on the derived type, and virtual calls become direct calls resolvable at template instantiation:

template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    double area_impl() const { return 3.14159 * r * r; }
    double r;
};

There is no vptr in Circle, no vtable, no indirect branch. The compiler sees area_impl() directly and can inline it entirely. The cost is that Shape<Circle> and Shape<Square> are distinct types with no usable common base at the template level, so storing them in the same container requires additional type erasure. Eigen uses this pattern throughout its computation core via expression templates, which is why the compiler can auto-vectorize Eigen’s matrix operations as though they were handwritten loops. CRTP works well for mixins and policy-based design, but the casting syntax becomes unwieldy in complex hierarchies and the pattern fails when runtime heterogeneity is a requirement.
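A short usage sketch makes both claims concrete: the object carries no vptr, and generic code is written against Shape<D> rather than a common base. Square, scaled_area, and demo_crtp are hypothetical additions, and the size check assumes a mainstream ABI where the empty base adds no storage.

```cpp
#include <cassert>

template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    double area_impl() const { return 3.14159 * r * r; }
    double r;
};

struct Square : Shape<Square> {
    double area_impl() const { return side * side; }
    double side;
};

// The empty CRTP base contributes no storage, so Circle is just its double.
static_assert(sizeof(Circle) == sizeof(double), "Circle carries no vptr");

// Instantiated once per concrete type D; each area() call is direct.
template <typename D>
double scaled_area(const Shape<D>& s, double factor) {
    return s.area() * factor;
}

void demo_crtp() {
    Square sq;
    sq.side = 3.0;
    assert(scaled_area(sq, 2.0) == 18.0);   // 3 * 3 * 2
}
```

Note that scaled_area is itself a template: the static dispatch propagates upward into every caller that wants to stay generic.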

C++23’s deducing this feature (P0847) addresses the syntactic friction with CRTP directly. It replaces the static_cast boilerplate with an explicit object parameter:

struct Shape {
    template <typename Self>
    double area(this Self&& self) {
        return self.area_impl();   // resolved per instantiation, not through vtable
    }
};

struct Circle : Shape {
    double area_impl() const { return 3.14159 * r * r; }
    double r;
};

The compiler generates a separate instantiation per concrete Self type, giving identical performance to CRTP with far cleaner syntax. C++23 is acknowledging, in effect, that CRTP was always a workaround for a missing language feature; the this Self&& form is that feature. GCC 14 and Clang 18 both ship support for it. Beyond CRTP replacement, deducing this also enables recursive lambdas, clean builder-pattern chains, and explicit mixin composition without the template inheritance ceremony.

The third approach is std::variant combined with std::visit, suited to cases where you need a closed, finite set of types stored in the same container but still want dispatch that the compiler can see through and inline:

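#include <variant>
#include <vector>

// Illustrative alternatives; each exposes a non-virtual area() const.
struct Circle   { double r;            double area() const { return 3.14159 * r * r; } };
struct Square   { double side;         double area() const { return side * side; } };
struct Triangle { double base, height; double area() const { return 0.5 * base * height; } };

using Shape = std::variant<Circle, Square, Triangle>;

double total_area(const std::vector<Shape>& shapes) {
    double sum = 0;
    for (const auto& s : shapes) {
        // The generic lambda is instantiated once per alternative.
        sum += std::visit([](const auto& shape) {
            return shape.area();
        }, s);
    }
    return sum;
}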

Modern compilers can often lower std::visit over small variants to a switch with all arms inlined, or to a compact jump table, although the exact code depends on the standard library implementation. When that happens, the dispatch overhead drops to roughly 0.5 to 2 nanoseconds, and the inlined arms are visible to the vectorizer. Beyond the dispatch mechanics, there is a layout benefit that tends to matter more in practice: std::vector<Shape> stores shapes contiguously in memory, each with a discriminant tag, while std::vector<Base*> stores pointers that lead to scattered heap allocations. For a loop over thousands of shapes, the cache behavior difference between those layouts can dominate entirely. This is the core insight behind data-oriented design in game engines: a flat array of concrete objects gives the CPU’s prefetcher a trivial job, regardless of what dispatch mechanism you use. Abseil’s team reported 1.8 to 2.4x throughput improvements in proto parsing benchmarks when converting from virtual dispatch to variant-based dispatch, with the gain attributed primarily to inlining and eliminated branch mispredictions.
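The layout difference can be made visible with a small sketch. The helper stride_bytes and the BoxedSquare type are hypothetical; the code assumes C++17 and only demonstrates where the objects live, not how fast either loop runs.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <variant>
#include <vector>

struct Circle { double r;    double area() const { return 3.14159 * r * r; } };
struct Square { double side; double area() const { return side * side; } };
using Shape = std::variant<Circle, Square>;

struct Base {
    virtual double area() const = 0;
    virtual ~Base() = default;
};
struct BoxedSquare : Base {
    explicit BoxedSquare(double s) : side(s) {}
    double area() const override { return side * side; }
    double side;
};

// Distance in bytes between the first two elements of a vector.
template <typename T>
std::ptrdiff_t stride_bytes(const std::vector<T>& v) {
    return reinterpret_cast<const char*>(&v[1]) -
           reinterpret_cast<const char*>(&v[0]);
}

void demo_layout() {
    // Variant elements sit back to back: payload and discriminant stored inline.
    std::vector<Shape> flat{Circle{1.0}, Square{2.0}};
    assert(stride_bytes(flat) == static_cast<std::ptrdiff_t>(sizeof(Shape)));

    // Only the pointers are contiguous here; each object is a separate heap
    // allocation placed wherever the allocator chose.
    std::vector<std::unique_ptr<Base>> boxed;
    boxed.push_back(std::make_unique<BoxedSquare>(2.0));
    boxed.push_back(std::make_unique<BoxedSquare>(3.0));
    assert(boxed[0]->area() == 4.0 && boxed[1]->area() == 9.0);
}
```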

What Rust gets right about this

Rust makes the static/dynamic distinction explicit at the syntax level, which forces a conscious decision at every polymorphic boundary. impl Drawable in a function signature generates a monomorphized instantiation per concrete type; dyn Drawable uses a fat pointer carrying both the data pointer and a vtable pointer, with the same overhead profile as C++ virtual calls. There is no implicit default:

trait Drawable {
    fn draw(&self);
}

// Static dispatch: monomorphized per type, zero virtual overhead
fn render_static(shape: &impl Drawable) { shape.draw(); }

// Dynamic dispatch: fat pointer, vtable call
fn render_dyn(shape: &dyn Drawable) { shape.draw(); }

This means Rust developers routinely confront the devirtualization decision at the type system boundary, whereas C++ developers often inherit virtual hierarchies without examining the dispatch cost. The fat pointer representation also has a structural property worth noting: the vtable pointer lives alongside the data pointer in registers, avoiding the load-from-object step that C++ vptr indirection requires. The vtable is part of the reference, not part of the object. For megamorphic dispatch the practical difference is small, but the design is cleaner.
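To make the fat-pointer layout concrete for C++ readers, here is a hand-rolled analogue written in C++. It is a sketch of the idea only, not how rustc lays out trait objects in detail; DrawableVTable, DynDrawable, draw_square, and the integer return value are all hypothetical.

```cpp
#include <cassert>

// A plain struct: no virtual functions, so no vptr inside the object.
struct Square { int side; };

// One table per (interface, type) pair; this interface has a single method.
struct DrawableVTable {
    int (*draw)(const void* self);   // hypothetical: returns pixels drawn
};

int draw_square(const void* self) {
    const Square* s = static_cast<const Square*>(self);
    return s->side * s->side;
}

static const DrawableVTable square_vtable{&draw_square};

// Analogue of &dyn Drawable: data pointer and vtable pointer travel together,
// so the call needs no load from the object to find the table.
struct DynDrawable {
    const void* data;
    const DrawableVTable* vtable;
    int draw() const { return vtable->draw(data); }
};

static_assert(sizeof(DynDrawable) == 2 * sizeof(void*), "two-word reference");
static_assert(sizeof(Square) == sizeof(int), "object has no hidden vptr");

void demo_fat_pointer() {
    Square sq{3};
    DynDrawable d{&sq, &square_vtable};
    assert(d.draw() == 9);
}
```

The trade-off is visible in the two static_asserts: the object shrinks by one pointer, and every dynamic reference to it grows by one.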

Java’s HotSpot JIT takes a third approach: it performs profile-guided devirtualization at runtime, converting monomorphic call sites to guarded direct calls after observing enough invocations. The JVM can even invalidate these optimizations and recompile when new classes are loaded. This produces near-zero dispatch cost for long-running processes with stable type distributions, at the cost of a warm-up period and occasional deoptimization spikes. C++ PGO-guided devirtualization (-fprofile-use on GCC, -fprofile-instr-use on Clang) offers a static version of the same idea: profile a representative workload, recompile with the profile data, and let the compiler insert type guards at call sites where one type dominates. The difference is that the JIT can adapt at runtime while C++ PGO profiles go stale as workloads change.

Choosing the right tool

Virtual functions remain the right choice when the concrete type is genuinely unknown at compile time, when you are crossing a shared library boundary, or when runtime extensibility is a first-class requirement. The overhead matters in latency-sensitive inner loops where the concrete type is effectively static at every call site, and the dispatch cost or the blocked inlining shows up in a profiler.

The practical sequence is: write idiomatic code with virtual functions first, profile to locate the hot paths, apply final wherever the class hierarchy is complete, and reach for CRTP, deducing this, or std::variant when profiling confirms the cost matters. Safety-critical automotive projects sometimes go further: the AUTOSAR C++14 guidelines constrain how virtual functions are declared and overridden, and some teams avoid virtual dispatch in hard real-time modules altogether, preferring CRTP and other static techniques, because indirect calls complicate the worst-case execution time analysis that ISO 26262 work relies on. That context is extreme, but it illustrates how thoroughly the overhead can matter when the stakes are high enough.

The C++23 deducing this feature is the most practically important development in this space in years. It removes the primary syntactic objection to CRTP and makes static polymorphism a first-class design option rather than a workaround you tolerate for performance reasons. Combined with std::variant for closed type sets and compiler devirtualization via final for everything else, modern C++ now has a coherent answer to virtual dispatch overhead that does not require sacrificing abstraction.
