
The Optimization Cost of Virtual Dispatch, and How to Recover It

Source: isocpp

Virtual dispatch has been part of C++ since Bjarne Stroustrup formalized it in the early 1980s, and the core mechanism has not changed meaningfully since. A polymorphic class gets a vtable, each object gets a hidden pointer to that table, and calls through base-class pointers resolve at runtime by loading a function address from the table and jumping to it.

David Álvarez Rosa’s recent article on isocpp.org catalogs the costs: pointer indirection, per-object layout overhead, and reduced inlining opportunities. The third item deserves more attention than a quick list entry gives it, because inlining is not just about removing call overhead. It is the prerequisite for nearly everything else the optimizer wants to do.

What the Vtable Call Does to the Optimizer

When the compiler sees a virtual call it cannot devirtualize, it emits an indirect call through a pointer whose target it cannot determine statically. That is more than call overhead; it is an optimization barrier. Because the callee is opaque, the compiler cannot eliminate redundant work on either side of the call, must assume the call may read or modify any memory it can reach (which blocks hoisting loop-invariant loads out of a loop that contains the call), and generally cannot auto-vectorize loops whose bodies make indirect calls.

The machine-level sequence is roughly: load the vtable pointer from the object (at offset zero in the common single-inheritance layout), load the method pointer from the vtable at the method’s fixed index, and call through that pointer. On a warm cache, this adds at most a few nanoseconds. The more significant cost appears when the objects being processed are cold in cache: each load depends on the one before it, so the cache misses are serialized rather than overlapped. A direct call has none of that data dependency.
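The sequence described above can be written out by hand. The following is only an illustrative sketch: the names `ShapeVTable`, `Shape`, `Square`, and `call_area` are hypothetical, and real compilers generate this layout themselves according to the platform ABI rather than from code like this.

```cpp
// Hand-rolled equivalent of a polymorphic class with one virtual method.
struct ShapeVTable {
    double (*area)(const void* self);  // slot 0: the only "virtual" method
};

struct Shape {
    const ShapeVTable* vptr;  // the hidden pointer a compiler would insert
};

struct Square {
    Shape base;    // first member, so the vptr sits at offset zero
    double side;
};

double square_area(const void* self) {
    // Square is standard-layout, so a pointer to its first member
    // can be converted back to a pointer to the whole object.
    const Square* sq = static_cast<const Square*>(self);
    return sq->side * sq->side;
}

const ShapeVTable square_vtable{&square_area};

Square make_square(double side) {
    return Square{Shape{&square_vtable}, side};
}

double call_area(const Shape* s) {
    const ShapeVTable* table = s->vptr;          // load 1: vtable pointer
    double (*fn)(const void*) = table->area;     // load 2: method pointer
    return fn(s);                                // indirect call
}
```

The second load cannot start until the first completes, and the call cannot be resolved until the second completes, which is the dependency chain the paragraph above refers to.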

But even with a warm cache, the indirect call prevents the compiler from treating the callee’s body as part of the caller. A tight inner loop that makes a virtual call on every iteration will not be vectorized unless the compiler first manages to devirtualize and inline that call. Move the same logic to a non-virtual call that the compiler can inline, and the loop becomes a candidate for SIMD processing. That is where the measurable performance difference lives in high-throughput paths, not in the nanoseconds spent dereferencing the vtable.
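As a rough illustration of the two loop shapes, here is a minimal sketch. The types `Transform`, `Scale`, and `PlainScale` and both functions are hypothetical names chosen for this example. Whether the second loop is actually vectorized depends on the compiler, flags, and target, so treat it as a candidate rather than a guarantee.

```cpp
#include <vector>

struct Transform {
    virtual float apply(float x) const = 0;
    virtual ~Transform() = default;
};

struct Scale : Transform {
    float factor;
    explicit Scale(float f) : factor(f) {}
    float apply(float x) const override { return x * factor; }
};

// Same arithmetic, but no virtual functions involved.
struct PlainScale {
    float factor;
    float apply(float x) const { return x * factor; }
};

// One indirect call per element: the loop body is opaque to the optimizer
// unless it can prove the dynamic type of t.
void apply_dynamic(const Transform& t, std::vector<float>& values) {
    for (float& x : values) {
        x = t.apply(x);
    }
}

// The concrete type is a template parameter, so apply() can be inlined
// and the loop reduces to a plain multiply over contiguous floats.
template <typename T>
void apply_static(const T& t, std::vector<float>& values) {
    for (float& x : values) {
        x = t.apply(x);
    }
}
```

Both functions compute the same result; the difference is only in how much of the loop body the compiler can see.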

What the Compiler Already Does

Compilers do not accept virtual calls passively. GCC and Clang both implement devirtualization under a range of conditions, and understanding when it fires is useful before reaching for manual alternatives.

The most reliable hint you can give the compiler is the final specifier, introduced in C++11. Marking a class final tells the compiler that no further subclasses exist, so any call through a reference or pointer to that type can be resolved statically:

#include <cstddef>  // std::byte
#include <span>     // std::span (C++20)

class Serializer {
public:
    virtual void write(std::span<const std::byte> data) = 0;
    virtual ~Serializer() = default;
};

class BinarySerializer final : public Serializer {
public:
    void write(std::span<const std::byte> data) override {
        // write raw bytes
    }
};

void flush(BinarySerializer& s, std::span<const std::byte> data) {
    s.write(data); // devirtualized: compiler sees final, resolves directly
}

Without final, a BinarySerializer& could theoretically refer to a further-derived type that overrides write. With final, that possibility is closed, and the call resolves at compile time. The vtable still exists in the binary, but this specific call site does not go through it.

GCC and Clang also perform speculative devirtualization in some cases: when the compiler believes a particular concrete type is likely but cannot prove it, it may emit a type guard followed by two call paths, one direct (which it can then inline) and one indirect as a fallback. This is opportunistic and not guaranteed.
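The shape of that transformation can be approximated in source code. This is a hand-written sketch, not what the compilers emit: the guard here uses `typeid` for readability, whereas a compiler would more typically compare the loaded function or vtable address. `Codec`, `FastCodec`, `SlowCodec`, and `encode_guarded` are hypothetical names.

```cpp
#include <typeinfo>

struct Codec {
    virtual int encode(int x) const = 0;
    virtual ~Codec() = default;
};

struct FastCodec : Codec {
    int encode(int x) const override { return x + 1; }
};

struct SlowCodec : Codec {
    int encode(int x) const override { return x * 2; }
};

int encode_guarded(const Codec& c, int x) {
    // Guard: is the dynamic type exactly the one we guessed?
    if (typeid(c) == typeid(FastCodec)) {
        // The qualified name suppresses virtual dispatch, so this is a
        // direct call that the compiler is free to inline.
        return static_cast<const FastCodec&>(c).FastCodec::encode(x);
    }
    // Fallback: ordinary virtual call for every other type.
    return c.encode(x);
}
```

If the guess is right most of the time, the common path gets an inlinable direct call and only the rare path pays for the indirect one.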

Link-time optimization extends devirtualization further by giving the compiler visibility across translation units. With LTO enabled, calls that cross file boundaries become devirtualizable when the full class hierarchy is visible at link time. Profile-guided optimization adds another layer: profiling data can indicate which concrete types appear most frequently at a given call site, biasing speculative devirtualization toward those types.

These mechanisms are useful but best-effort. When the concrete type genuinely varies across calls, when the hierarchy spans shared library boundaries, or when the compiler’s heuristics do not fire, the virtual call stays virtual.

CRTP: Static Dispatch by Design

When you control both base and derived types and the concrete type is always known at compile time, you can eliminate dynamic dispatch at the source level using the Curiously Recurring Template Pattern. The derived class passes itself as a template argument to the base, giving the base a way to call derived-class methods at compile time:

template <typename Derived>
class SerializerBase {
public:
    void write(std::span<const std::byte> data) {
        static_cast<Derived*>(this)->write_impl(data);
    }
};

class BinarySerializer : public SerializerBase<BinarySerializer> {
public:
    void write_impl(std::span<const std::byte> data) {
        // write raw bytes
    }
};

The call to write_impl is a direct call to a statically known type. The compiler can inline it, see through it, and apply optimizations that a virtual call would have blocked. There is no vtable lookup and no pointer dereference at the call site.
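Generic code can accept any CRTP-derived type by templating on the derived parameter. The sketch below is self-contained and simplified: it uses `std::vector<unsigned char>` instead of `std::span` to stay compilable before C++20, and `CountingSerializer` and `write_twice` are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <vector>

template <typename Derived>
class SerializerBase {
public:
    void write(const std::vector<unsigned char>& data) {
        // Resolved at compile time: Derived is known when this is instantiated.
        static_cast<Derived*>(this)->write_impl(data);
    }
};

class CountingSerializer : public SerializerBase<CountingSerializer> {
public:
    void write_impl(const std::vector<unsigned char>& data) {
        total_ += data.size();
    }
    std::size_t total() const { return total_; }

private:
    std::size_t total_ = 0;
};

// Accepts any serializer built on SerializerBase; D is deduced per call site,
// so each instantiation contains direct calls only.
template <typename D>
void write_twice(SerializerBase<D>& s, const std::vector<unsigned char>& data) {
    s.write(data);
    s.write(data);
}
```

Each distinct `D` produces its own instantiation of `write_twice`, which is the usual trade of code size for inlining.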

The cost is ergonomic. The static_cast inside the base is the kind of code that generates questions during review. Template error messages involving CRTP hierarchies are famously dense. Because the base class is itself a template, SerializerBase<BinarySerializer> and SerializerBase<JsonSerializer> are unrelated types with no common base class, which makes storing heterogeneous collections awkward. The typical solutions are std::variant for closed sets of types or a thin virtual interface layer if you genuinely need runtime polymorphism at the collection level.
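For the closed-set case, a `std::variant` container might look like the sketch below. `BinaryWriter`, `JsonWriter`, `AnyWriter`, and the two helper functions are hypothetical, and the byte counting stands in for real serialization work. Note that `std::visit` still performs a runtime selection on the active alternative; what changes is that each branch contains a direct, inlinable call.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

struct BinaryWriter {
    std::size_t bytes = 0;
    void write_impl(const std::string& s) { bytes += s.size(); }
};

struct JsonWriter {
    std::size_t bytes = 0;
    // Pretend JSON output adds two quote characters around the payload.
    void write_impl(const std::string& s) { bytes += s.size() + 2; }
};

// Closed set of alternatives, stored by value with no common base class.
using AnyWriter = std::variant<BinaryWriter, JsonWriter>;

void write_all(std::vector<AnyWriter>& writers, const std::string& payload) {
    for (AnyWriter& w : writers) {
        // The generic lambda is instantiated once per alternative.
        std::visit([&](auto& concrete) { concrete.write_impl(payload); }, w);
    }
}

std::size_t bytes_written(const AnyWriter& w) {
    return std::visit([](const auto& concrete) { return concrete.bytes; }, w);
}
```

Adding a new writer type means editing the `AnyWriter` alias, which is the sense in which the set is closed.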

C++23 Explicit Object Parameters

C++23 introduced explicit object parameters, often called “deducing this,” which substantially reduces the boilerplate of CRTP-style code. The syntax places the object as an explicit function parameter, and the compiler deduces its concrete type at each call site:

class SerializerBase {
public:
    void write(this auto&& self, std::span<const std::byte> data) {
        self.write_impl(data);
    }
};

class BinarySerializer : public SerializerBase {
public:
    void write_impl(std::span<const std::byte> data) {
        // write raw bytes
    }
};

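This is covered in the C++ standard under explicit object member functions. The dispatch to write_impl still resolves at compile time through the deduced type of self, which is the static type of the object expression at the call site. Calling write on a BinarySerializer deduces self as BinarySerializer&; calling it through a SerializerBase& deduces SerializerBase&, and the call fails to compile because the base has no write_impl. The base class no longer needs to be templated on the derived type, which removes the recursive template parameter and makes the inheritance hierarchy much easier to read.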

Explicit object parameters also handle patterns that CRTP manages awkwardly. They unify const and non-const overloads into a single function, removing the need for two nearly identical implementations. Fluent builder interfaces that need to return *this as the derived type become straightforward without an additional layer of templates. The static polymorphism semantics are close to CRTP, with one difference: the derived type comes from the expression at the call site rather than from a template argument of the base. The ergonomics are considerably better.

Note that neither CRTP nor deducing-this helps with heterogeneous storage. Both approaches produce static dispatch, so keeping a mix of BinarySerializer and JsonSerializer objects in one container still requires either virtual methods behind a common base pointer or a std::variant of the concrete types. The optimization benefit is specific to homogeneous collections and monomorphic call sites.

Concepts as Interface Specifications

With template-based static polymorphism, the contract between base and derived is implicit: whatever methods the base template calls, the derived type must provide. C++20 concepts give you a way to make that contract explicit and get comprehensible error messages when it is violated:

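#include <concepts>  // std::same_as
#include <cstddef>   // std::byte
#include <span>      // std::span

template <typename T>
concept Serializable = requires(T s, std::span<const std::byte> data) {
    { s.write_impl(data) } -> std::same_as<void>;
};

template <typename Derived>
class SerializerBase {
public:
    void write(std::span<const std::byte> data) {
        // Checked here rather than on the template parameter: Derived is
        // still incomplete when SerializerBase<Derived> is named as a base.
        static_assert(Serializable<Derived>, "Derived must provide write_impl");
        static_cast<Derived*>(this)->write_impl(data);
    }
};

The constraint cannot go directly on the class template parameter in a CRTP base. At the point where a class names SerializerBase<Derived> as its base, Derived is still an incomplete type, so a constraint written as template <Serializable Derived> would be evaluated before write_impl is visible and would reject every derived class. Member function bodies are instantiated only when used, by which time Derived is complete, so the check belongs there. When Derived fails to satisfy Serializable, the static_assert fires and the compiler reports the unmet requirement directly rather than emitting a cascade of template instantiation errors. Concepts do not change dispatch mechanics; they recover some of the documentation and diagnostic value that virtual function declarations provide in classic OOP designs, which is one of the genuine ergonomic advantages that dynamic polymorphism has over template-based alternatives.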

Choosing Between the Approaches

Runtime polymorphism through virtual dispatch is the right choice when the set of types is genuinely open at runtime: plugin systems that load code across shared library boundaries, scripting engine integrations, GUI framework event handlers, anything where types are determined after the program starts. In these situations, the vtable is doing real work that static polymorphism cannot replace.

Static polymorphism is worth the complexity when all concrete types are known at compile time, when the code sits in a hot inner loop, and when the overhead of indirect calls is measurably blocking vectorization or other optimizations. The final keyword occupies a useful middle ground: one keyword per class, a meaningful hint to the compiler, and no change to the external interface. For classes that will never be subclassed, adding final is low-cost and routinely beneficial.

The underlying principle is that C++ gives you enough control to match the dispatch mechanism to the actual requirements of the code. Virtual dispatch is not inherently expensive; it is expensive in specific contexts where the optimizer needs to see through the call to do its job. Knowing which context you are in is most of the work, and then the language provides the tools to act on that knowledge.
