The Part of Virtual Dispatch Overhead That Benchmarks Don't Show You

David Álvarez Rosa’s piece on devirtualization and static polymorphism covers the mechanics well, but there’s one angle worth pulling apart further: the cost most people benchmark is not the cost that actually matters. The vtable lookup is cheap. The performance you lose is in the optimizations that virtual dispatch prevents the compiler from making.

The Mechanics, Briefly

When a class declares a virtual function, the compiler generates a vtable for that class: a static array of function pointers, one slot per virtual method. In the common ABIs, each object carries a hidden vptr, typically at offset zero, pointing to its class’s vtable. A virtual call loads the vptr, indexes into the vtable, and performs an indirect branch through the resulting pointer.

In x86-64 assembly (System V calling convention, Intel syntax), calling animal->speak() on a non-final base pointer looks like this:

mov rax, [rdi]        ; load vptr from object
mov rax, [rax + 8]    ; load function pointer from vtable slot
call rax              ; indirect call

Three instructions. On a warm cache with a predictable call site, this costs a few nanoseconds at most, often not measurably more than a regular function call. Benchmarks that measure virtual dispatch in isolation often conclude that virtual functions are fine.

That conclusion misses the point.

The Inlining Barrier

The compiler cannot inline through a virtual call whose target it cannot determine. The function pointer sits in a vtable slot, and when the concrete type is unknown at compile time the compiler has no way to know which vtable will be consulted. This is not a limitation that better hardware resolves; it is a fundamental property of the dispatch model.

Inlining is not just a performance micro-optimization. It is the prerequisite for many of the optimizations a compiler performs on hot code: constant propagation, dead code elimination, common subexpression elimination, loop vectorization. When you block inlining, you block all of those downstream passes too.

Chandler Carruth has made this point explicitly in several CppCon talks: the inability to inline is often an order of magnitude more costly than the raw dispatch overhead, because inlining is what enables the optimizations that actually matter in tight loops. A 3 ns dispatch overhead is negligible. A loop that cannot vectorize because it contains an opaque function call can lose 4x to 10x throughput, depending on the operation.
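To make the inlining barrier concrete, here is a minimal sketch of the same element-wise loop written two ways. The names Op, Scale, apply_virtual, apply_static, and inlining_demo are hypothetical, chosen only for illustration. Whether a given compiler actually vectorizes the second version depends on flags and target, so treat the comments as expectations to check in a disassembly rather than guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interface: one virtual call per element.
struct Op {
    virtual ~Op() = default;
    virtual float apply(float x) const = 0;
};

struct Scale : Op {
    float factor;
    explicit Scale(float f) : factor(f) {}
    float apply(float x) const override { return x * factor; }
};

// Opaque version: the loop body is an indirect call through the vtable.
// Unless the compiler can prove the dynamic type of op, it cannot inline
// apply(), and a loop containing an opaque call will not vectorize.
void apply_virtual(const Op& op, const std::vector<float>& in,
                   std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = op.apply(in[i]);
}

// Transparent version: F is a concrete type at each instantiation, so the
// call target is known, the body can be inlined, and the loop reduces to a
// plain multiply that the vectorizer can work with.
template <typename F>
void apply_static(F f, const std::vector<float>& in,
                  std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = f(in[i]);
}

// Usage: both versions compute the same values; only what the optimizer
// can see differs.
void inlining_demo() {
    const std::vector<float> in{1.0f, 2.0f, 3.0f};
    std::vector<float> a(in.size()), b(in.size());
    apply_virtual(Scale{2.0f}, in, a);
    apply_static([](float x) { return x * 2.0f; }, in, b);
    assert(a == b);
}
```

Calling inlining_demo() from main is enough to run it. The interesting comparison is in the generated code for the two loops at -O2 or higher, not in the results.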

What Compilers Already Do For You

Compilers are not passive about this. Both GCC and Clang perform devirtualization automatically under -O2 when they can prove the concrete type at a call site. The simplest case is a stack-allocated object:

void foo() {
    Dog d;
    Animal& a = d;
    a.speak();  // GCC and Clang both devirtualize this
}

The compiler knows d is a Dog, so a.speak() can only call Dog::speak(), and it emits a direct call. If the body of Dog::speak() is visible and inlining is enabled, that call can disappear entirely.

The final keyword extends this. Marking a class or a virtual function final tells the compiler that no further overrides exist, which allows devirtualization even through pointer and reference parameters of that type:

class Dog final : public Animal {
public:  // without this, speak() would be private and process() below would not compile
    void speak() const override;
};

void process(Dog* d) {
    d->speak();  // devirtualized: compiler knows Dog::speak() is the only possibility
}

Without final, process could receive a subclass of Dog whose speak() overrides this one, so the compiler must emit an indirect call. With final, no such subclass can exist, and the call becomes direct.

This is a free optimization that a lot of codebases leave on the table. If a class is never subclassed in your codebase, marking it final costs nothing in terms of design flexibility and can enable meaningful devirtualization without LTO or profile-guided optimization. The assembly difference is stark:

; Without final:
mov rax, [rdi]      ; load vptr
jmp [rax + 8]       ; indirect jump through the vtable slot

; With final:
jmp Dog::speak()    ; direct jump, eligible for inlining

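With link-time optimization (-flto on GCC/Clang), the compiler can extend this analysis across translation unit boundaries. If whole-program class hierarchy analysis proves that a virtual function has only one implementation in the entire binary, the call can be devirtualized even without final. That proof requires knowing that no other override can appear later, for instance from a shared library, so in practice it depends on more than -flto alone: Clang needs -fwhole-program-vtables and hidden visibility for the classes involved, and GCC is most effective with -fwhole-program or with types in anonymous namespaces. Where the proof fails, GCC may still devirtualize speculatively, guarding the direct call with a check on the target.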

When You Need to Go Further: Static Polymorphism

Devirtualization is reactive: you hope the compiler can prove the type, and you annotate with final where you can. Static polymorphism is proactive: you move the dispatch decision to compile time by construction.

The canonical C++ approach is CRTP, the Curiously Recurring Template Pattern:

template <typename Derived>
class Animal {
public:
    void speak() const {
        static_cast<const Derived*>(this)->speak_impl();
    }
};

class Dog : public Animal<Dog> {
    friend class Animal<Dog>;
private:
    void speak_impl() const { /* woof */ }
};

template <typename T>
void make_noise(const Animal<T>& a) {
    a.speak();  // resolved at compile time, inlined
}

The static_cast is a compile-time downcast, not a reinterpretation: with single inheritance like this it generates no code at all. The call to speak_impl() is a direct call that the compiler can inline. There is no vtable, no pointer indirection, no branch prediction exposure. Eigen, the linear algebra library, uses this pattern throughout its expression template system to produce zero-overhead matrix computations that fully inline across complex expression trees.

CRTP’s limitation is that each instantiation is a distinct type. You cannot store a vector<Animal<T>> with mixed T values. The interface is viral through templates, which inflates compile times and can grow binary size.
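A small sketch of that limitation, assuming a variant of the CRTP classes above in which speak() returns a string so the result can be checked. Cat, chorus, and crtp_demo are hypothetical names added for illustration. The two instantiations share no common base, so a mixed group has to be spelled out at compile time, here as a std::tuple, rather than stored in a single vector.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <type_traits>

template <typename Derived>
class Animal {
public:
    std::string speak() const {
        return static_cast<const Derived*>(this)->speak_impl();
    }
};

class Dog : public Animal<Dog> {
    friend class Animal<Dog>;
    std::string speak_impl() const { return "woof"; }
};

class Cat : public Animal<Cat> {
    friend class Animal<Cat>;
    std::string speak_impl() const { return "meow"; }
};

// Animal<Dog> and Animal<Cat> are unrelated types, so no single element
// type exists that a std::vector could use to hold both a Dog and a Cat.
static_assert(!std::is_same_v<Animal<Dog>, Animal<Cat>>);
static_assert(!std::is_base_of_v<Animal<Dog>, Cat>);

// A mixed group must therefore be fixed at compile time. Every call to
// speak() below is resolved statically, one instantiation per element type.
template <typename... Ts>
std::string chorus(const std::tuple<Ts...>& animals) {
    return std::apply(
        [](const auto&... a) { return (std::string{} + ... + a.speak()); },
        animals);
}

// Usage: the tuple's element types are part of its own type.
void crtp_demo() {
    assert(chorus(std::make_tuple(Dog{}, Cat{})) == "woofmeow");
}
```

This is also where the template virality shows up: chorus has to be a template, and so does anything that calls it with a different mix of animals.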

C++20 concepts clean up the constraint side of this. Instead of requiring inheritance from a CRTP base, you express the interface as a concept and let the compiler verify conformance at instantiation:

#include <concepts>  // std::same_as, std::convertible_to

template <typename T>
concept AnimalInterface = requires(const T& a) {
    { a.speak() } -> std::same_as<void>;
    { a.legs()  } -> std::convertible_to<int>;
};

template <AnimalInterface T>
void describe(const T& a) {
    a.speak();
}

This gives you duck typing with compile-time verification and better error messages than raw SFINAE. No base class required. C++23 goes further with deducing this (P0847), which lets you write CRTP-style mixin code without the static_cast boilerplate, and without making the base a template parameterized on the derived class.

For closed type sets, std::variant combined with std::visit is a common idiom. State machines, AST node types, and message discriminants fit this model well:

using AnyAnimal = std::variant<Dog, Cat, Snake>;

std::vector<AnyAnimal> zoo;
zoo.emplace_back(Dog{});
zoo.emplace_back(Cat{});

for (const auto& a : zoo) {
    std::visit([](const auto& animal) { animal.speak(); }, a);
}

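The variant stores objects by value: no heap allocation for the variant itself, no pointer indirection, cache-friendly iteration, at the cost of every element being as large as the largest alternative. The dispatch uses a small integer discriminant rather than a vtable. For two to four types with a predictable distribution, this is often as fast as or faster than virtual dispatch, and the compiler can inline each visitor branch, though the result depends on how the standard library implements std::visit.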
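For readers who want something they can compile, here is a self-contained sketch of the variant approach. The Dog, Cat, and Snake definitions are hypothetical stand-ins, since the snippet above does not show them, and leg_count, zoo_chorus, and variant_demo are illustrative names. It also shows the widely used overloaded helper for cases where each alternative needs its own handling.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical value types; no common base class is needed.
struct Dog   { std::string speak() const { return "woof"; } };
struct Cat   { std::string speak() const { return "meow"; } };
struct Snake { std::string speak() const { return "hiss"; } };

using AnyAnimal = std::variant<Dog, Cat, Snake>;

// The usual "overloaded" helper: one callable that inherits operator()
// from each lambda passed to it.
template <typename... Fs>
struct overloaded : Fs... { using Fs::operator()...; };
template <typename... Fs>
overloaded(Fs...) -> overloaded<Fs...>;

// Per-type handling. Removing any one lambda makes std::visit fail to
// compile, which is how the closed set of alternatives is enforced.
int leg_count(const AnyAnimal& a) {
    return std::visit(overloaded{
        [](const Dog&)   { return 4; },
        [](const Cat&)   { return 4; },
        [](const Snake&) { return 0; }
    }, a);
}

// Uniform handling: a generic lambda is instantiated once per alternative.
std::string zoo_chorus(const std::vector<AnyAnimal>& zoo) {
    std::string out;
    for (const auto& a : zoo) {
        out += std::visit([](const auto& animal) { return animal.speak(); }, a);
    }
    return out;
}

// Usage: elements live contiguously inside the vector's buffer.
void variant_demo() {
    const std::vector<AnyAnimal> zoo{Dog{}, Cat{}, Snake{}};
    assert(zoo_chorus(zoo) == "woofmeowhiss");
    assert(leg_count(zoo[2]) == 0);
}
```

The generic lambda suits calls that look the same for every alternative, while overloaded suits calls that differ per type. Both are resolved at compile time for each alternative.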

The Data Layout Multiplier

One dimension that dispatch mechanism benchmarks systematically understate is data layout. A vector<Base*> is a contiguous array of pointers to objects scattered across the heap, so each element can cost a cache miss on the object itself before its vptr, and then its vtable, can even be read. A vector<Derived> or vector<AnyAnimal> is a contiguous array of values.

Björn Fahller’s benchmarks from CppCon 2021 illustrate this concretely: a polymorphic pointer array with four types in random order achieved about 1 GB/s throughput for a simple per-element computation. The same computation over a std::variant array achieved 4 GB/s. A homogeneous array with no polymorphism at all achieved 12 GB/s. The difference between the pointer-based and variant approaches is mostly layout, not dispatch mechanism.

This is why data-oriented design converges with static polymorphism in performance-critical code. Separating by type before processing, using value semantics, and avoiding pointer indirection through heterogeneous hierarchies compound to produce improvements that no dispatch optimization alone can match.
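As a rough sketch of what separating by type looks like in code, the example below computes the same total over a pointer-based collection and over per-type batches. Shape, Square, Rect, ShapeBatches, total_area, and layout_demo are all hypothetical names invented for this illustration, and the sketch only shows the two layouts; it makes no claim about the size of the speedup on any particular machine.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square final : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

struct Rect final : Shape {
    double w, h;
    Rect(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

// Pointer-based layout: the vector holds pointers, and each object is a
// separate heap allocation reached through an indirect call.
double total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double total = 0.0;
    for (const auto& s : shapes) total += s->area();
    return total;
}

// Separated by type: one contiguous array of values per concrete type.
struct ShapeBatches {
    std::vector<Square> squares;
    std::vector<Rect> rects;
};

// Each loop is homogeneous. Because Square and Rect are final, area() is a
// direct call the compiler can inline, and the data is read sequentially.
double total_area(const ShapeBatches& batches) {
    double total = 0.0;
    for (const Square& s : batches.squares) total += s.area();
    for (const Rect& r : batches.rects) total += r.area();
    return total;
}

// Usage: same shapes, two layouts, same answer.
void layout_demo() {
    std::vector<std::unique_ptr<Shape>> scattered;
    scattered.push_back(std::make_unique<Square>(2.0));
    scattered.push_back(std::make_unique<Rect>(3.0, 4.0));

    ShapeBatches batched;
    batched.squares.emplace_back(2.0);
    batched.rects.emplace_back(3.0, 4.0);

    assert(total_area(scattered) == total_area(batched));
}
```

The batched types here still carry a vptr because they inherit from Shape. In a fully data-oriented design they would more likely be plain structs with no virtual functions at all.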

Choosing the Right Tool

Virtual dispatch remains the right choice for genuinely open extension points: plugin systems, cross-library interfaces, anything where the set of concrete types is not known at compile time. Marking leaf classes final is a low-effort improvement with no design cost.

CRTP and concepts suit library-level abstractions where you own all the types and want zero overhead with strong interface contracts. The C++23 deducing-this feature reduces the boilerplate cost of CRTP.

std::variant fits closed algebraic type sets where you want value semantics and exhaustive dispatch. It performs well, pairs naturally with the overloaded-lambda visitor idiom, and communicates the closed-world assumption clearly to the reader.

The decision is rarely about virtual versus non-virtual in isolation. It is about what the compiler can see, what the data layout looks like, and whether the dispatch happens inside a loop that needs to vectorize. Devirtualization is what you get for free when the answer is clear to the compiler; static polymorphism is what you build when you want to guarantee it.