
The Real Cost of Virtual Dispatch in C++: Devirtualization and the Static Polymorphism Payoff

Source: isocpp

The problem is not the pointer dereference

When developers benchmark virtual dispatch against direct calls, the numbers look bad but not catastrophic. On a warm cache, a virtual call adds roughly 2-5 nanoseconds over a direct call: one load for the vptr, one load for the function pointer slot, and an indirect jump. A cold access, where neither the object’s vptr nor the vtable is resident in L1 cache, can push that to 15-50 nanoseconds. A mispredicted indirect branch at a polymorphic call site adds another 15-20 cycles, which is a few nanoseconds at typical clock rates.

These costs are real, but they are not the reason David Álvarez Rosa’s piece on devirtualization and static polymorphism matters. The more significant cost is what the compiler cannot do once it encounters an indirect call.

A virtual dispatch is opaque to the optimizer. The compiler sees a function pointer loaded from memory and called through a register; unless it can prove which function will execute, it cannot inline the callee. And inlining is not primarily about removing call overhead. It is the prerequisite for constant propagation through the call boundary, dead code elimination on the result, loop vectorization, and better register allocation. When an unresolved virtual call sits in a tight inner loop, the compiler generally cannot vectorize that loop or hoist invariant subexpressions out of it, and the SIMD units sit idle. This is the actual performance problem: not the two-pointer dereference, but the optimization firewall the indirect call creates.
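A minimal sketch of that firewall follows. The names (Op, VirtualScale, PlainScale, transform_virtual, transform_static) are hypothetical and chosen only for this illustration. Both functions do the same arithmetic; the difference is what the optimizer can see in the loop body. Whether a given compiler actually vectorizes the second loop depends on flags and target, so the comments describe what to look for in the generated assembly rather than a guarantee.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical interface, for illustration only.
struct Op {
    virtual ~Op() = default;
    virtual float apply(float x) const = 0;
};

struct VirtualScale : Op {
    float k;
    explicit VirtualScale(float factor) : k(factor) {}
    float apply(float x) const override { return k * x; }
};

// Same arithmetic, no virtual functions.
struct PlainScale {
    float k;
    float apply(float x) const { return k * x; }
};

// The loop body is an indirect call. Unless the compiler can prove the
// dynamic type of `op`, it cannot inline apply() into the loop.
void transform_virtual(const Op& op, std::vector<float>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = op.apply(v[i]);
}

// F is fixed at compile time, so f.apply() is a direct call that the
// optimizer can inline, leaving a plain multiply loop it may vectorize.
template <typename F>
void transform_static(const F& f, std::vector<float>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = f.apply(v[i]);
}
```

Comparing the assembly of the two functions at -O2 or -O3 is usually more informative than timing them, because it shows directly whether the call was inlined.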

How compilers devirtualize

Before changing source code, it is worth knowing how much the compiler can recover automatically. Devirtualization is the transformation that turns an indirect virtual call into a direct call, restoring full optimizer visibility.

The most reliable trigger is final. When a class or method is marked final, the compiler proves at the call site that no further overrides exist, and devirtualizes at -O2 and above on GCC and Clang:

class ConcreteRenderer final : public Renderer {
public:
    void draw() override;
};

void process(ConcreteRenderer* r) {
    r->draw(); // Devirtualized: compiler knows the exact type
}

final was added in C++11 precisely because there was no way to communicate “this hierarchy ends here” to the compiler. It is free documentation and a reliable devirtualization signal, and it is underused in most codebases.

Stack-allocated objects give the compiler the same certainty even without final, since the concrete type is visible directly:

void render() {
    CircleRenderer r;
    Renderer* p = &r;
    p->draw(); // Devirtualized: type is provable from context
}

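For cross-translation-unit scenarios, Link-Time Optimization (-flto in GCC and Clang) extends type hierarchy analysis across compilation boundaries. With whole-program visibility, the compiler can identify virtual functions that have only one implementation in the entire program. On Clang this is whole-program devirtualization, enabled with -fwhole-program-vtables under LTO or ThinLTO; on GCC, -fdevirtualize-at-ltrans streams the extra information needed for more aggressive devirtualization during the local transformation phase of LTO. Speculative devirtualization (GCC’s -fdevirtualize-speculatively, on by default at -O2) handles the common case where one type dominates a call site by emitting a guarded fast path: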

// Compiler-generated (conceptual)
if (vptr == &CircleRenderer::vtable) {
    static_cast<CircleRenderer*>(obj)->draw(); // Fast, inlinable
} else {
    obj->draw(); // Fallback virtual call
}
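The same shape can be written by hand, which helps when reading compiler output. This is only a sketch under stated assumptions: the Renderer hierarchy here is a stand-in, draw() returns int purely so the result can be checked, and draw_guarded is a hypothetical name. The typeid comparison is also more expensive than the pointer comparison a compiler would emit, so this illustrates the control flow and is not a recommended optimization.

```cpp
#include <typeinfo>

// Stand-in hierarchy; draw() returns an int only to make results observable.
struct Renderer {
    virtual ~Renderer() = default;
    virtual int draw() { return 0; }
};

struct CircleRenderer : Renderer {
    int draw() override { return 1; }
};

struct SquareRenderer : Renderer {
    int draw() override { return 2; }
};

// Hand-written analogue of a guarded fast path (hypothetical helper).
int draw_guarded(Renderer& r) {
    if (typeid(r) == typeid(CircleRenderer)) {
        // The dynamic type is exactly CircleRenderer here. The qualified
        // name makes this a non-virtual, direct call that can be inlined.
        return static_cast<CircleRenderer&>(r).CircleRenderer::draw();
    }
    return r.draw(); // Fallback: ordinary virtual dispatch
}
```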

Profile-guided optimization makes speculative devirtualization substantially better by recording which concrete types actually appear at each call site. The combination of PGO and LTO eliminates most virtual call overhead from hot paths without any source changes.

GCC also provides [[gnu::flatten]], a function attribute that instructs the compiler to aggressively inline all calls inside the annotated function, including through virtual dispatch where the type is knowable:

[[gnu::flatten]]
void process_batch(std::vector<Renderer*>& renderers) {
    for (auto* r : renderers)
        r->draw(); // GCC attempts to inline through vtable when possible
}

This is blunt-force devirtualization for a specific hot caller. It does not work when the type genuinely cannot be determined.

Static polymorphism: CRTP

When the compiler cannot devirtualize, the alternative is to eliminate the indirection in source. Static polymorphism resolves dispatch at compile time, producing direct calls the optimizer can inline.

The Curiously Recurring Template Pattern, named by Jim Coplien in 1995, is the classical mechanism. A base class template takes the derived type as its parameter and casts this to call derived methods directly:

template <typename Derived>
class Shape {
public:
    void draw() {
        static_cast<Derived*>(this)->draw_impl();
    }
    double area() {
        return static_cast<Derived*>(this)->area_impl();
    }
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(double r) : r_(r) {}
    void draw_impl() { /* circle rendering */ }
    double area_impl() { return 3.14159 * r_ * r_; }
private:
    double r_; // radius, set by the constructor
};

Shape<Circle> and Shape<Square> are distinct classes instantiated from the same template. The static_cast<Derived*>(this)->draw_impl() call resolves to a direct function call at instantiation time. No vtable, no vptr, no indirect jump. The compiler can inline draw_impl() into draw(), and draw() into its callers, propagating constants and enabling vectorization across the entire chain.
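To show how calling code consumes such a hierarchy, here is a small self-contained sketch. Square and total_area are hypothetical names added for illustration, and the methods are const-qualified here so they can be called through a const reference. Generic code takes the derived type as a template parameter, so each instantiation contains a direct call.

```cpp
#include <vector>

template <typename Derived>
class Shape {
public:
    double area() const {
        // Resolved at instantiation time: no vtable lookup.
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Square : public Shape<Square> {
public:
    explicit Square(double side) : side_(side) {}
    double area_impl() const { return side_ * side_; }
private:
    double side_;
};

// Works for any CRTP shape. The element type is known statically, so
// s.area() can be inlined down to the arithmetic in area_impl().
template <typename Derived>
double total_area(const std::vector<Derived>& shapes) {
    double total = 0.0;
    for (const Shape<Derived>& s : shapes) total += s.area();
    return total;
}
```

Note that the container is homogeneous: a std::vector<Square> works, but one container cannot mix different derived shapes.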

This is a large part of why Eigen is fast. The library builds every operation on MatrixBase<Derived> and expression templates: when you write a coefficient-wise expression such as A + B - C, no temporaries are allocated and no virtual dispatch occurs. The expression is inlined into a single loop that the compiler can vectorize with SIMD instructions. (Matrix products are a deliberate exception: a product nested inside a larger expression is evaluated into a temporary.) Compared with a design that makes a virtual call per coefficient, this approach can be on the order of 10-50x faster on dense linear algebra operations, primarily because it restores the compiler’s ability to vectorize inner loops.
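The mechanism can be sketched in a few dozen lines. This is not Eigen’s code or API: Expr, Vec, and Sum are hypothetical names, and the sketch handles only element-wise addition of same-length vectors. It assumes the whole expression is consumed in the statement that builds it, since the Sum nodes hold references to their operands.

```cpp
#include <cstddef>
#include <vector>

// CRTP base: every expression exposes operator[] and size() statically.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
    double operator[](std::size_t i) const { return self().at(i); }
    std::size_t size() const { return self().len(); }
};

struct Vec : Expr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double at(std::size_t i) const { return data[i]; }
    std::size_t len() const { return data.size(); }

    // Assigning from any expression evaluates it in a single pass.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Node representing lhs + rhs; nothing is computed until at() is called.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    double at(std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t len() const { return lhs.size(); }
};

template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}
```

With this in place, out = a + b + c builds a Sum<Sum<Vec, Vec>, Vec> and evaluates it in one loop inside Vec::operator=, without allocating an intermediate vector. Storing the expression with auto and using it later would dangle, which is one reason real libraries are more elaborate.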

LLVM uses CRTP mixins throughout its codebase. Boost.Iterator’s iterator_facade generates complete iterator boilerplate from a handful of primitive operations via CRTP, with zero overhead in release builds. These are production-grade validations of the pattern at scale.

The structural cost of CRTP is significant. Shape<Circle> and Shape<Square> are unrelated types, so there is no common base pointer to put in a single std::vector. If heterogeneous storage is required, you add a non-template virtual base class, partially reintroducing the overhead you eliminated. The static_cast<Derived*>(this) idiom is verbose, and a copy-paste mistake such as class Square : public Shape<Circle> compiles silently and then casts this to the wrong type, which is undefined behavior. Compilation times grow with the number of derived types, because each one instantiates the base template separately.
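One common guard against the wrong-type-name mistake is to make the base constructor private and befriend Derived, so only the class named in the template argument can construct the base. The sketch below assumes that idiom; Mislabeled is a hypothetical class standing in for the copy-paste error, and the methods are const-qualified for convenience.

```cpp
#include <type_traits>

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
private:
    // User-provided rather than "= default", so that C++17 aggregate
    // initialization cannot sidestep the access check.
    Shape() {}
    friend Derived; // only Derived may construct Shape<Derived>
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(double r) : r_(r) {}
    double area_impl() const { return 3.14159 * r_ * r_; }
private:
    double r_;
};

// Copy-paste mistake: inherits from Shape<Circle>, not Shape<Mislabeled>.
class Mislabeled : public Shape<Circle> {
public:
    double area_impl() const { return 1.0; }
};

// Shape<Circle>'s constructor is inaccessible from Mislabeled, so its
// implicit default constructor is deleted and the mistake surfaces at
// compile time as soon as anyone tries to create one.
static_assert(!std::is_default_constructible<Mislabeled>::value,
              "a mislabeled CRTP class should not be constructible");
```

The guard catches the mistake where an object is created rather than at the class definition, and it does not cover every misuse, but it removes the most common silent failure.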

C++23’s deducing this changes the ergonomics

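C++23’s explicit object parameter (P0847, “deducing this”) addresses the core ergonomic problem. A member function can deduce the type of the object it is called on, without a template base class. The deduction uses the static type of the object expression at the call site, not its dynamic type: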

class Shape {
public:
    void draw(this auto& self) {
        self.draw_impl(); // self has the derived type at the call site
    }
};

class Circle : public Shape {
public:
    void draw_impl() { /* circle rendering */ }
};

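When Circle c; c.draw() executes, self is deduced as Circle&. The call to self.draw_impl() is a direct call, resolved at compile time, fully inlinable. There is no static_cast, no template parameter, and no separate Shape<Circle> type. Circle and Square both inherit from the same Shape class. The flip side is that a call through a Shape& deduces self as Shape& and fails to compile, because Shape itself has no draw_impl(); the pattern does not provide runtime dispatch.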

This simplifies the pattern substantially. Classic CRTP required a template base for each interface, proliferating types and making error messages unreadable. Deducing this achieves the same dispatch semantics within a single non-template hierarchy. It also enables clean fluent builders where each method returns the derived type, and recursive lambda expressions that reference themselves.

C++20 concepts are complementary but address a different problem. A concept constrains template parameters: it says “this type must have a draw() method returning void.” It does not inject behavior. CRTP and deducing this inject shared implementations built on derived-class primitives. The modern combination is to use concepts for readable, compiler-enforced constraints and deducing this for the dispatch mechanism:

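class Shape {
public:
    // Constrain the deduced type: it must provide draw_impl().
    // The check runs at the call site, where the derived type is complete.
    template <typename Self>
        requires requires(Self& s) { s.draw_impl(); }
    void draw(this Self& self) {
        self.draw_impl();
    }
};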

C++26’s static reflection (P2996) is expected to reduce the remaining CRTP use cases further by enabling direct introspection of member metadata, replacing patterns that currently require CRTP to inject compile-time type information.

Choosing between them

Virtual dispatch retains genuine advantages. Open hierarchies, ABI stability across shared library boundaries, smaller binary size (one vtable per class rather than N template instantiations), and better debugger support all favor it. Plugin systems and any code where the concrete type comes from runtime configuration require virtual dispatch or an equivalent runtime mechanism.

Static polymorphism requires knowing the complete type set at compile time. It pays off when the type set is closed, the code is in a hot inner loop, and the goal is to give the compiler maximum optimization latitude.

A practical approach: use virtual dispatch, with final on any class not intended to be subclassed. Profile the hot paths. Examine the generated assembly in Compiler Explorer at -O2, or ask the compiler for optimization remarks (-Rpass=devirt on Clang, -fopt-info-optimized on GCC) to see which virtual calls it is actually resolving. GCC’s -Wsuggest-final-types and -Wsuggest-final-methods warnings can also point out classes and methods that would benefit from final. If a specific call site is on a measured hot path and the compiler cannot devirtualize it, replace that section with CRTP or, on a C++23 codebase, deducing this.

The final audit alone is often enough to close the gap without restructuring anything. If a class in your hierarchy has no intended subclasses, marking it final is free, self-documenting, and gives the compiler reliable devirtualization information at the call site. Most codebases have more final candidates than they have marked, and that single keyword is worth checking before reaching for CRTP.
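One way to keep the result of such an audit from regressing is a static_assert on std::is_final next to the code that relies on it. The sketch below uses a stand-in hierarchy; draw() returns int only so the example can be checked, and draw_circle is a hypothetical hot-path function.

```cpp
#include <type_traits>

struct Renderer {
    virtual ~Renderer() = default;
    virtual int draw() = 0;
};

struct CircleRenderer final : Renderer {
    int draw() override { return 1; }
};

// Fails to compile if someone later removes `final` from the leaf class.
static_assert(std::is_final<CircleRenderer>::value,
              "CircleRenderer is expected to stay final");

// Because the static type is final, r.draw() cannot be overridden
// further, which lets the compiler bind it as a direct call.
int draw_circle(CircleRenderer& r) {
    return r.draw();
}
```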
