When the vtable Gets in the Way: Virtual Dispatch, Devirtualization, and Static Polymorphism in C++

The Actual Cost of a Virtual Call

Virtual dispatch is one of C++’s most-used features and one of its most misunderstood performance characteristics. The overhead is real, but the common framing, that it’s just “an extra pointer dereference,” understates where the actual pain comes from.

When you call a virtual function, the compiler emits roughly three operations: load the vtable pointer from the object, load the function pointer from the vtable at the correct offset, then perform an indirect call through that pointer. On a warm L1 cache with a predictable call target, this adds maybe one to three nanoseconds per call. That number sounds small, and in many programs it genuinely is.
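To make those three steps concrete, here is a minimal hand-written sketch of roughly what the compiler generates. The names ManualVTable, ManualShape, and call_area are hypothetical, and real ABIs differ in layout details, so treat this as an approximation rather than a description of any particular compiler.

```cpp
// Hand-written approximation of virtual dispatch. All names are illustrative.
struct ManualShape;

// One table per concrete type, shared by every object of that type.
struct ManualVTable {
    float (*area)(const ManualShape*);
};

struct ManualShape {
    const ManualVTable* vptr;  // stands in for the hidden vtable pointer
    float size;
};

float circle_area(const ManualShape* s) { return 3.14159f * s->size * s->size; }
float square_area(const ManualShape* s) { return s->size * s->size; }

const ManualVTable circle_vtable{&circle_area};
const ManualVTable square_vtable{&square_area};

float call_area(const ManualShape* s) {
    const ManualVTable* table = s->vptr;            // 1. load the vtable pointer
    float (*fn)(const ManualShape*) = table->area;  // 2. load the function pointer
    return fn(s);                                   // 3. indirect call
}
```

Steps 1 and 2 are dependent loads, which is why a cold object or a cold table hurts more than the instruction count alone suggests.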

The real cost emerges in two scenarios. First, when the call target varies across many different types, the CPU’s indirect branch predictor mispredicts the destination more often, and each misprediction costs a pipeline flush. Second, when your objects are cold in cache, each vtable dereference can trigger a cache miss. A single L3 miss on a modern processor costs roughly 40-80 nanoseconds, which is well over a hundred cycles at typical clock speeds. The vtable and the function body it points to may both sit in different cache lines from your object data, so one polymorphic call in a cache-cold loop can cost as much as hundreds of arithmetic operations would on hot data.

This is what David Álvarez Rosa’s article on isocpp.org is driving at: the overhead isn’t just the indirection itself, it’s everything the indirection prevents. The compiler cannot inline through an indirect call it can’t resolve statically. No inlining means no constant folding across call boundaries, no dead code elimination of unused branches inside the callee, no vectorization of loops that call virtual methods.
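A small sketch of the inlining point, using hypothetical names (Op, AddOne, sum_virtual, sum_direct) that do not come from the article. Both functions compute the same result; the difference is how much the optimizer is allowed to know about the call in the loop body.

```cpp
#include <cstddef>

struct Op {
    virtual ~Op() = default;
    virtual int apply(int x) const = 0;
};

struct AddOne final : Op {
    int apply(int x) const override { return x + 1; }
};

// Dynamic type unknown: unless the optimizer can prove what `op` really is,
// it has to emit an indirect call per iteration and cannot see the body.
int sum_virtual(const Op& op, const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) total += op.apply(data[i]);
    return total;
}

// Static type is a final class: apply() can be called directly and inlined,
// leaving a plain reduction that the optimizer may unroll or vectorize.
int sum_direct(const AddOne& op, const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) total += op.apply(data[i]);
    return total;
}
```

Whether a given compiler actually vectorizes sum_direct depends on flags and target, so it is worth checking the generated assembly rather than assuming.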

What the Compiler Already Does For You

Before reaching for CRTP or other manual techniques, it’s worth understanding how aggressively modern compilers devirtualize on their own.

GCC and Clang both devirtualize calls at -O2 and higher. When the compiler can prove the dynamic type of an object at a particular call site, it replaces the indirect call with a direct one. The most obvious trigger is a locally declared object of known type:

void process() {
    Circle c{5.0f};
    Shape* s = &c;  // compiler knows this is always a Circle
    s->draw();      // devirtualized to Circle::draw() directly
}

The final keyword is the clearest signal you can give the compiler. Marking a class final tells it that no further derivation is possible, so any pointer or reference of that type carries an exact type guarantee:

class Circle final : public Shape {
public:
    void draw() override { /* ... */ }
};

void render(Circle& c) {
    c.draw();  // compiler knows this can only be Circle::draw
}

LTO (Link Time Optimization), enabled with -flto in GCC and Clang, extends devirtualization across translation unit boundaries. Without LTO, the compiler processing a .cpp file has no way to know whether a type visible in that TU has subclasses defined elsewhere. With LTO, the whole-program view makes many more devirtualizations possible, especially for internal types not exposed in headers.

Profile-Guided Optimization (-fprofile-use) takes this further with speculative devirtualization: if profiling shows that 98% of calls to Shape::draw at a particular site dispatch to Circle::draw, the compiler emits a type check and a direct fast path, falling back to the vtable only for the rare other cases.
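The guarded fast path can be written out by hand to show its shape. This is only an illustration: compilers typically compare the loaded vtable or function pointer against the expected target rather than calling typeid, and draw_speculated is a hypothetical name. draw() returns an int here purely so the sketch is easy to check.

```cpp
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual int draw() = 0;  // returns an id so the result is observable
};

struct Circle : Shape {
    int draw() override { return 1; }
};

struct Square : Shape {
    int draw() override { return 2; }
};

// Hand-written stand-in for a speculatively devirtualized call site.
int draw_speculated(Shape& s) {
    if (typeid(s) == typeid(Circle)) {
        // Hot path: a qualified call bypasses the vtable and can be inlined.
        return static_cast<Circle&>(s).Circle::draw();
    }
    // Cold path: ordinary virtual dispatch for every other type.
    return s.draw();
}
```

The type check is a cheap, well-predicted branch when the profile is accurate, which is what makes the trade worthwhile.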

For many codebases, these automatic mechanisms are sufficient. The question is when they aren’t.

CRTP: Static Polymorphism the Classic Way

The Curiously Recurring Template Pattern has been in widespread use since the mid-1990s. The idea is that a base class template takes its derived class as a template parameter, allowing the base to call derived methods without virtual dispatch:

template <typename Derived>
class Shape {
public:
    void draw() {
        static_cast<Derived*>(this)->draw_impl();
    }
    
    float area() {
        return static_cast<Derived*>(this)->area_impl();
    }
};

class Circle : public Shape<Circle> {
public:
    void draw_impl() { /* render circle */ }
    float area_impl() { return 3.14159f * radius_ * radius_; }
private:
    float radius_ = 1.0f;  // default value so area_impl() never reads an uninitialized member
};

At the call site, draw() resolves entirely at compile time. The static_cast is a no-op at runtime (same pointer, different type), and because the compiler knows the exact type of draw_impl, it can inline freely. The resulting assembly for a tight loop over shapes is often identical to what you’d get with no abstraction at all.
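A self-contained usage sketch of the same pattern. The base is named ShapeBase here, and twice_area is a hypothetical helper, both chosen for illustration. The helper accepts any CRTP-derived shape and is instantiated once per concrete type.

```cpp
template <typename Derived>
class ShapeBase {
public:
    float area() const {
        // Same address, narrower type: no runtime cost.
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Circle : public ShapeBase<Circle> {
public:
    explicit Circle(float r) : radius_(r) {}
    float area_impl() const { return 3.14159f * radius_ * radius_; }
private:
    float radius_;
};

class Square : public ShapeBase<Square> {
public:
    explicit Square(float s) : side_(s) {}
    float area_impl() const { return side_ * side_; }
private:
    float side_;
};

// Derived is deduced from the argument, so the call to area() is direct.
template <typename Derived>
float twice_area(const ShapeBase<Derived>& shape) {
    return 2.0f * shape.area();
}
```

Calling twice_area(Square{3.0f}) and twice_area(Circle{1.0f}) produces two separate instantiations, each with the concrete area_impl() visible for inlining.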

The trade-off is well-known: you lose runtime polymorphism. You cannot store Shape<Circle> and Shape<Square> in the same container, because they’re different types with no common base. CRTP gives you compile-time polymorphism, not runtime polymorphism. If you need to dispatch on a type that’s only known at runtime, CRTP doesn’t apply.

There are also ergonomic costs. CRTP code is harder to read, error messages for incorrect usage are famously terrible, and the pattern requires you to forward-declare or friend carefully when the relationship between base and derived gets complex.

std::variant as an Alternative

For closed sets of types, std::variant combined with std::visit offers a compelling middle ground between runtime virtual dispatch and the inflexibility of CRTP:

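#include <variant>
#include <vector>

// Circle, Square, and Triangle are assumed to be plain structs with a draw() member.
using Shape = std::variant<Circle, Square, Triangle>;

void draw(Shape& s) {
    std::visit([](auto& shape) {
        shape.draw();
    }, s);
}

void render(std::vector<Shape>& scene) {
    for (auto& s : scene) {
        draw(s);
    }
}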

Standard library implementations typically lower std::visit to a switch or a table of function pointers indexed by the variant’s active index, so dispatch is driven by a small integer stored inside the object rather than by a vtable pointer that leads elsewhere in memory. Because the visitor lambda is instantiated for each type in the variant, the compiler can inline each concrete implementation. The result often outperforms virtual dispatch in microbenchmarks, particularly because all the implementations may be visible in one translation unit.
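A runnable sketch with concrete types filled in. The struct definitions and total_area are assumptions made for illustration; here each shape exposes area() so the result can be checked.

```cpp
#include <variant>
#include <vector>

struct Circle   { float radius;       float area() const { return 3.14159f * radius * radius; } };
struct Square   { float side;         float area() const { return side * side; } };
struct Triangle { float base, height; float area() const { return 0.5f * base * height; } };

using Shape = std::variant<Circle, Square, Triangle>;

// The generic lambda is instantiated once per alternative, so each
// shape.area() below is a direct call to a known function.
float total_area(const std::vector<Shape>& scene) {
    float total = 0.0f;
    for (const auto& s : scene) {
        total += std::visit([](const auto& shape) { return shape.area(); }, s);
    }
    return total;
}
```

Note that the shapes are stored by value inside the vector, so iterating the scene walks contiguous memory instead of chasing pointers to separately allocated objects.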

The cost is flexibility: the set of types in the variant must be known at compile time. Adding a new shape type means changing the variant definition and recompiling everything that touches it. This is the open-closed principle inverted: virtual dispatch is open to extension (new types without recompilation), while std::variant is closed.

C++23: Deducing this

C++23’s explicit object parameter (P0847, commonly called “deducing this”) addresses some of CRTP’s ergonomic pain without changing the fundamental approach. Instead of a base class templated on its derived type, you write methods that deduce their object type automatically:

struct Shape {
    template <typename Self>
    void draw(this Self& self) {
        self.draw_impl();  // resolves at compile time based on actual type of self
    }
};

struct Circle : Shape {
    void draw_impl() { /* ... */ }
};

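Calling circle.draw() deduces Self = Circle and dispatches to Circle::draw_impl() without virtual dispatch and without the static_cast noise. For the common case this is equivalent to CRTP but reads considerably cleaner, and the compiler’s ability to inline the resolved call is the same. As with CRTP, the deduction follows the static type of the expression: calling draw() through a Shape& deduces Self = Shape, which fails to compile here because Shape has no draw_impl().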

Deducing this also solves the long-standing problem of code duplication between const and non-const overloads, a separate benefit that makes it worth adopting regardless of the polymorphism angle.

Choosing the Right Tool

The decision tree here isn’t complicated once the costs are clear.

If your types are local and the compiler can see their concrete type at the call site, do nothing. The compiler will devirtualize. If a type will never be subclassed, mark it final and trust the compiler.

If you need runtime polymorphism (different types behind the same pointer, determined at runtime), virtual dispatch is the right model. The overhead is modest on cache-warm paths, and the architectural clarity of a clean interface hierarchy is worth something.

If you’re writing a performance-sensitive component where the set of participating types is known at compile time and you’re willing to accept template complexity, CRTP or std::variant are appropriate. Game engine component systems, numerical libraries that operate on different numeric types, and embedded system drivers that abstract over hardware with no dynamic allocation are all classic fits.

The place where developers most often leave performance on the table isn’t the choice between virtual and CRTP. It’s forgetting to mark types final when they will never be subclassed, or failing to enable LTO in production builds. These improvements are nearly free: they cost a keyword or some extra link time, and they don’t require restructuring any code.

The Broader Pattern

C++’s zero-overhead abstraction principle, often attributed to Bjarne Stroustrup, says that features you don’t use don’t cost you anything, and that features you do use couldn’t be implemented more efficiently by hand. Virtual dispatch is a case where the principle is conditional: it applies if the compiler can devirtualize, and it doesn’t apply if it can’t.

Static polymorphism, whether through CRTP, concepts, std::variant, or the newer deducing-this syntax, shifts the resolution from runtime to compile time and makes the zero-overhead principle unconditional. The trade-off is always the same: you give up runtime flexibility in exchange for compile-time performance guarantees.

Rust handles this cleanly by making the distinction explicit in the language syntax: &dyn Trait for dynamic dispatch with explicit vtable semantics, impl Trait or T: Trait for static dispatch. C++ achieves the same split through conventions and patterns rather than syntax, which is why it requires more deliberate knowledge to navigate. Articles like the one on isocpp.org help make those conventions explicit, which is where most of the practical value lies.