
The Devirtualization Ladder: Five Ways to Remove Virtual Dispatch Overhead in C++

Source: isocpp

Virtual dispatch is one of those topics where the advice tends to polarize. Either you are told to embrace object-oriented design and not worry about performance, or you are told to rip out every virtual keyword and replace it with templates. Neither framing is very useful. The real question is more specific: on which call sites does virtual dispatch matter, what exactly is the compiler doing about it, and what tools do you have when the compiler cannot help?

A recent article by David Álvarez Rosa on isocpp.org lays out the core problem well: virtual dispatch enables polymorphism but comes with hidden overhead in pointer indirection, larger object layouts, and lost inlining opportunities. What I want to do here is go further than the overview and walk through the specific mechanics, then trace the full spectrum of tools from “let the compiler fix it” to “restructure the type hierarchy.”

What a Virtual Call Actually Costs

When you call a virtual function through a base class pointer, the generated assembly looks roughly like this on x86-64:

mov rax, [rdi]          ; load vptr from object
mov rax, [rax + offset] ; load function pointer from vtable
call rax                ; indirect call through the loaded pointer

That is two dependent loads followed by an indirect call before the function body even begins. The first load is from the object itself (the vptr is at offset zero for single inheritance), so it is likely in cache if you just accessed the object. The second load hits the vtable, which is a static table shared across all instances of a given class, so it stays warm once accessed. The indirect call itself is the harder problem.
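To make those steps concrete, here is a minimal sketch that hand-rolls roughly what the compiler emits for a virtual call. Every name in it (ManualShape, ManualVTable, call_area and so on) is made up for illustration, and real ABIs differ in the details; the point is only to show the two loads and the indirect call as ordinary C++.

```cpp
// Hand-rolled approximation of virtual dispatch. All names are hypothetical;
// real compilers follow the platform ABI and differ in the details.

struct ManualShape;

// One table per concrete type, shared by every instance of that type.
struct ManualVTable {
    double (*area)(const ManualShape* self);
};

struct ManualShape {
    const ManualVTable* vptr;  // first member, mirroring a vptr at offset zero
};

struct ManualCircle {
    ManualShape base;  // base subobject first, so the cast below is valid
    double radius;
};

inline double manual_circle_area(const ManualShape* self) {
    // ManualCircle is standard-layout and `base` is its first member,
    // so a pointer to `base` can be converted back to the whole object.
    const ManualCircle* c = reinterpret_cast<const ManualCircle*>(self);
    return 3.14159265 * c->radius * c->radius;
}

static const ManualVTable kCircleVTable = {&manual_circle_area};

inline ManualCircle make_circle(double radius) {
    return ManualCircle{ManualShape{&kCircleVTable}, radius};
}

inline double call_area(const ManualShape* s) {
    const ManualVTable* table = s->vptr;             // load 1: vptr from the object
    double (*fn)(const ManualShape*) = table->area;  // load 2: slot from the table
    return fn(s);                                    // indirect call
}
```

Because call_area only ever sees a function pointer, nothing in it tells the optimizer which body will run, which is exactly why inlining is off the table at such a call site.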

Modern CPUs have indirect branch predictors that handle hot virtual calls reasonably well. At a call site that always dispatches to the same concrete type (a monomorphic call site), the predictor will nearly always get it right. The larger cost is usually not the occasional misprediction; it is that the compiler cannot inline across an indirect call. No inlining means no constant propagation into the callee, little benefit from loop unrolling, and no vectorization of loops that contain the call. In a tight inner loop over a container of objects with a small virtual function, that can make a severalfold difference, though the exact factor depends on what the function does and should be measured rather than assumed.

Every polymorphic class also carries a vptr, which adds 8 bytes (before any alignment padding) to every object on a typical 64-bit system. In a container of a million objects, that is 8 MB of memory that exists only to enable dispatch. When those objects are traversed sequentially, the extra size means more cache lines touched per iteration.
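The size difference is easy to check directly. The two structs below are hypothetical and identical apart from the virtual keyword; the exact numbers are up to the platform ABI, so treat the comment about 8 bytes as the common case on mainstream 64-bit targets rather than a guarantee.

```cpp
#include <cstddef>

// Same data, no virtual functions: just the payload.
struct PlainSample {
    double value;
    double area() const { return value; }
};

// Same data, but polymorphic: the object also stores a vptr.
struct PolySample {
    double value;
    virtual ~PolySample() = default;
    virtual double area() const { return value; }
};

// Extra bytes each polymorphic object carries on this platform
// (commonly 8 on 64-bit targets, but the standard does not fix it).
constexpr std::size_t vptr_overhead() {
    return sizeof(PolySample) - sizeof(PlainSample);
}

static_assert(sizeof(PolySample) > sizeof(PlainSample),
              "a polymorphic object is larger than its plain equivalent");
```

Multiplying vptr_overhead() by the element count gives the memory a container spends purely on dispatch.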

Step One: Let the Compiler Devirtualize

Before restructuring anything, it is worth understanding what the compiler will devirtualize on its own. GCC and Clang both perform devirtualization in several circumstances:

A call where the compiler can prove the dynamic type of the object, with no possibility of a more-derived type, will be devirtualized. If you construct a Circle on the stack and call a virtual method on it directly (not through a Shape*), the compiler resolves the call statically. The same can happen through a pointer or reference when the object's creation is visible in the same function after inlining.
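A small sketch of the two situations, using a hypothetical Shape and Circle. Whether a given compiler actually devirtualizes either call depends on the optimization level and what it can see, so the comments describe what is provable, not what is guaranteed.

```cpp
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle : Shape {
    double radius;
    explicit Circle(double r) : radius(r) {}
    double area() const override { return 3.14159265 * radius * radius; }
};

// The object is created right here, so its dynamic type is provably Circle
// and the call can be bound to Circle::area without consulting the vtable.
inline double area_known_type() {
    Circle c{2.0};
    return c.area();
}

// `s` may refer to any class derived from Shape. Without more information
// (inlining into a caller, LTO, final), this stays an indirect call.
inline double area_unknown_type(const Shape& s) {
    return s.area();
}
```

Comparing the optimized assembly of the two functions is a quick way to see what your compiler managed to prove.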

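Link-time optimization (LTO) enables whole-program devirtualization. With -flto, the optimizer can see across translation unit boundaries and, where it can prove that a virtual call has a single possible target in the entire program, replace it with a direct call. In Clang this additionally requires -fwhole-program-vtables, which applies to classes with hidden LTO visibility. This is the lowest-effort path: enable LTO in your release builds and get devirtualization at the sites where it is statically safe. It is not quite free, because the optimizer must be able to rule out derived classes it cannot see, such as ones loaded from shared libraries.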

Profile-guided optimization (PGO) adds speculative devirtualization. The compiler instruments the binary, you run it on representative workloads, then recompile using the profile data. Call sites that are overwhelmingly monomorphic in practice get an inlined fast path guarded by a type check, with a fallback to the virtual call. This is particularly effective in application code where the type distribution is stable.
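What the PGO-driven transformation amounts to can be written out by hand. The sketch below is my own illustration, not compiler output: Circle, Square, and area_speculative are hypothetical names, and the guard uses typeid only because that is expressible in portable C++ (it requires RTTI to be enabled). A compiler would normally compare a vtable or function pointer instead.

```cpp
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle final : Shape {
    double radius;
    explicit Circle(double r) : radius(r) {}
    double area() const override { return 3.14159265 * radius * radius; }
};

struct Square final : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

// Assumes the profile showed that almost every object seen here is a Circle.
inline double area_speculative(const Shape& s) {
    if (typeid(s) == typeid(Circle)) {
        // Fast path: exact type confirmed, so the qualified call below is a
        // direct call to Circle::area and can be inlined.
        return static_cast<const Circle&>(s).Circle::area();
    }
    // Slow path: anything else falls back to ordinary virtual dispatch.
    return s.area();
}
```

The guard is a cheap, well-predicted branch when the profile was representative; if the type mix shifts, the code stays correct and simply takes the fallback more often.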

Step Two: The final Keyword

The simplest manual intervention is final, introduced in C++11. Marking a class final tells the compiler that no further derivation is possible:

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle final : public Shape {
    double radius;
    double area() const override { return 3.14159265 * radius * radius; }
};

double process(const Circle& c) {
    // Circle is final, so Circle::area() is known to be the final override.
    // The compiler can bind this call directly and inline it.
    return c.area();
}

You can also mark individual overrides final without sealing the class. The compiler uses this information to eliminate the vtable lookup on any call site where it can see the concrete type is Circle. The key limitation is that final only helps at call sites where the static type is the final class itself. Calls through a Shape* that happens to point to a Circle are still virtual unless the compiler can prove the dynamic type some other way, for example because the object's construction is visible at the call site.

For internal implementation classes that are never meant to be extended, final is almost always the right default. It documents intent, enables devirtualization, and costs nothing.
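Marking a single override final, rather than the whole class, looks like this. Rectangle, Square, and rectangle_area are hypothetical names chosen for the sketch.

```cpp
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
    virtual const char* name() const = 0;
};

struct Rectangle : Shape {
    double width;
    double height;
    Rectangle(double w, double h) : width(w), height(h) {}

    // No class derived from Rectangle may override area() again...
    double area() const final { return width * height; }
    // ...but name() stays open for further overriding.
    const char* name() const override { return "rectangle"; }
};

// Rectangle itself is not final, so it can still be extended.
struct Square : Rectangle {
    explicit Square(double side) : Rectangle(side, side) {}
    const char* name() const override { return "square"; }
};

// The static type is Rectangle, and area() is final there, so this call can
// be bound directly to Rectangle::area even if `r` is really a Square.
inline double rectangle_area(const Rectangle& r) {
    return r.area();
}
```

A call to r.name() through the same reference would still need virtual dispatch, because Square is free to override it.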

Step Three: CRTP for Zero-Overhead Abstraction

The Curiously Recurring Template Pattern (CRTP) is the classical approach to static polymorphism. The base class takes the derived class as a template parameter and casts this to the derived type for dispatch:

#include <vector>

template<typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    double radius;
    double area_impl() const { return 3.14159265 * radius * radius; }
};

template<typename S>
double total_area(const std::vector<S>& shapes) {
    double sum = 0;
    for (const auto& s : shapes) sum += s.area();
    return sum;
}

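Every dispatch is resolved at compile time. In this single-inheritance layout the static_cast generates no instructions; it only changes the type the compiler sees. The call to area_impl() is inlinable, so the compiler can fold constants if radius is known, eliminate the function call overhead entirely, and optimize the accumulation loop like any other arithmetic loop. Vectorizing a floating-point sum typically also needs relaxed floating-point flags such as -ffast-math, since it reorders the additions.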

The trade-offs are real and worth naming. Template code must live in headers, which increases compilation times. Each template instantiation produces a separate copy of the base class methods in the binary, so binary size grows with the number of concrete types. Error messages from template misuse are notoriously difficult to read, though C++20 concepts mitigate this somewhat. Most critically, CRTP cannot support heterogeneous collections: you cannot store Shape<Circle> and Shape<Rectangle> in the same std::vector. If you need runtime-open polymorphism (plugin architectures, user-extensible type systems), virtual dispatch is still the right tool.
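The heterogeneous-collection limitation can be shown in a few lines. In this sketch the Scene struct and its area_sum member are hypothetical; they illustrate one common workaround, which is to keep one homogeneous container per concrete type.

```cpp
#include <type_traits>
#include <vector>

template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    double radius;
    explicit Circle(double r) : radius(r) {}
    double area_impl() const { return 3.14159265 * radius * radius; }
};

struct Rectangle : Shape<Rectangle> {
    double width;
    double height;
    Rectangle(double w, double h) : width(w), height(h) {}
    double area_impl() const { return width * height; }
};

// Each instantiation of Shape is its own unrelated type, so there is no
// common base that a single container could hold.
static_assert(!std::is_same<Shape<Circle>, Shape<Rectangle>>::value,
              "distinct instantiations");
static_assert(!std::is_base_of<Shape<Circle>, Rectangle>::value,
              "no shared base class");

template <typename S>
double total_area(const std::vector<S>& shapes) {
    double sum = 0;
    for (const auto& s : shapes) sum += s.area();
    return sum;
}

// Workaround: one homogeneous vector per concrete type.
struct Scene {
    std::vector<Circle> circles;
    std::vector<Rectangle> rectangles;

    double area_sum() const {
        return total_area(circles) + total_area(rectangles);
    }
};
```

This keeps every loop statically dispatched, at the price of touching Scene whenever a new shape type is introduced.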

Step Four: C++20 Concepts for Cleaner Static Interfaces

Concepts provide a way to write statically dispatched generic code with explicit interface documentation and readable error messages:

#include <concepts>

template<typename T>
concept HasArea = requires(const T& t) {
    { t.area() } -> std::convertible_to<double>;
};

template<HasArea T>
double compute_area(const T& shape) {
    return shape.area(); // resolved statically, inlineable
}

This is not strictly equivalent to CRTP. There is no inheritance relationship between types satisfying a concept; any type with an area() method returning something convertible to double satisfies HasArea. This is duck typing with compile-time checking. It is well-suited to generic algorithms where you want to constrain what types are accepted without imposing an inheritance hierarchy on the caller.

Concepts do not solve the heterogeneous collection problem either, but they compose cleanly with std::ranges, work naturally with standard library algorithms, and produce dramatically better error messages than unconstrained templates.

Step Five: std::variant and std::visit

For closed sets of types known at compile time, std::variant offers a different trade-off than any of the above:

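#include <variant>

// Circle, Rectangle, and Triangle are plain structs here, each with a
// non-virtual area() const member; no common base class is required.
using Shape = std::variant<Circle, Rectangle, Triangle>;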

double area(const Shape& s) {
    return std::visit([](const auto& shape) -> double {
        return shape.area();
    }, s);
}

std::visit dispatches on the variant's stored index, typically through a table of function pointers or a switch, depending on the standard library implementation. Like a vtable, this is a form of indirect dispatch, but each alternative's branch can be inlined independently because the type is statically known within it. The variant stores its value inline rather than on the heap, eliminating a pointer dereference and improving cache locality when objects are stored in arrays. There is no vptr; each variant object instead carries a small index, and its size is that of the largest alternative plus the index and any padding.
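Here is a self-contained version of the variant approach applied to a mixed container. The three shape structs are minimal stand-ins I have assumed for the example; only std::variant and std::visit come from the standard library.

```cpp
#include <variant>
#include <vector>

// Plain value types: no base class and no virtual functions.
struct Circle {
    double radius;
    double area() const { return 3.14159265 * radius * radius; }
};

struct Rectangle {
    double width;
    double height;
    double area() const { return width * height; }
};

struct Triangle {
    double base;
    double height;
    double area() const { return 0.5 * base * height; }
};

using Shape = std::variant<Circle, Rectangle, Triangle>;

// Every variant object stores its own discriminant next to the value.
static_assert(sizeof(Shape) > sizeof(Rectangle),
              "largest alternative plus the stored index");

inline double area(const Shape& s) {
    // The generic lambda is instantiated once per alternative, and each
    // instantiation calls a concrete, inlinable area().
    return std::visit([](const auto& shape) -> double { return shape.area(); }, s);
}

// A heterogeneous collection stored by value in contiguous memory.
inline double total_area(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) sum += area(s);
    return sum;
}
```

Unlike a std::vector of base-class pointers, the elements here sit next to each other in memory, so the traversal does not chase a pointer per shape.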

The cost is rigidity. Adding a new type to the variant requires recompiling everything that touches it. This is appropriate for domain types you control but unsuitable for plugin architectures.
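The rigidity has an upside: the compiler tells you which visitors need updating. The sketch below uses the widely used "overloaded" helper idiom, which is not part of the standard library and is defined here by hand; the shape types and the name function are hypothetical.

```cpp
#include <variant>

struct Circle { double radius; };
struct Rectangle { double width; double height; };

using Shape = std::variant<Circle, Rectangle>;

// Helper idiom (not in the standard library): merges several lambdas into
// one visitor with an overloaded call operator.
template <class... Ts>
struct overloaded : Ts... {
    using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

inline const char* name(const Shape& s) {
    return std::visit(overloaded{
        [](const Circle&) { return "circle"; },
        [](const Rectangle&) { return "rectangle"; },
        // If a Triangle alternative is added to Shape without a matching
        // lambda here, this std::visit call stops compiling.
    }, s);
}
```

With one lambda per alternative and no catch-all, forgetting a case is a compile error instead of a silent fallthrough, which is the flip side of having to recompile everything that touches the variant.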

Choosing the Right Tool

The decision tree is relatively clean in practice:

If you need runtime-open polymorphism (callers can add new types without recompiling the library), virtual dispatch with final where possible and LTO in release builds is the right baseline.

If the type hierarchy is closed and performance on a specific hot path matters, choose between CRTP and std::variant: std::variant when you need value semantics or a heterogeneous container, CRTP when each container is homogeneous and you prefer the inheritance-based design.

If you are writing generic algorithms that should work across many types without coupling them through inheritance, C++20 concepts express the interface contract clearly and generate better errors than raw templates.

The key insight from Álvarez Rosa’s piece is worth repeating: latency-sensitive paths are where this matters. Profile first. The overhead of a virtual call in a hot inner loop over a homogeneous container is very different from the overhead at an occasional dispatch into a plugin handler. Reach for the simplest tool that solves the actual problem, because CRTP and concepts come with their own costs in compilation time, binary size, and code readability that are easy to underestimate until you have a large codebase full of them.
