The Virtual Dispatch Tax: When the Compiler Handles It and When You Should

Virtual dispatch in C++ has a reputation that sometimes outpaces the actual problem. The overhead is real, but the conditions under which it matters are narrower than most discussions suggest, and compilers already handle a surprising share of the elimination automatically. David Álvarez Rosa’s breakdown on isocpp.org does a solid job explaining the mechanics and the manual alternatives. What it leaves room for is the fuller picture: what the compiler does on its own, where cache behavior dominates the indirect call itself, and how C++20 and C++23 change the ergonomics of static polymorphism significantly.

The Mechanism and Its Actual Costs

Every polymorphic C++ object carries a hidden pointer, the vptr, to its class’s vtable. A virtual call compiles down to three operations that would not exist in a direct call: load the vptr from the object, index into the vtable to find the function pointer, then call through that pointer. On x86-64:

mov rax, [rcx]          ; load vptr from object
call qword ptr [rax+8]  ; indirect call through vtable entry

This costs two memory accesses before execution even reaches the function body. More importantly, it costs the optimizer access to the function body entirely: since the compiler cannot know at compile time what the pointer resolves to, it cannot inline the callee, cannot eliminate redundant loads across the call boundary, and must conservatively assume the callee modifies any reachable memory.

The raw call overhead on a warm cache, with a correctly-predicted indirect branch, is roughly one to two nanoseconds on modern x86. The CPU’s indirect branch target buffer handles monomorphic and dimorphic call sites reasonably well. The overhead grows sharply when four or more distinct concrete types appear at the same call site in a tight loop: the predictor cycles through its entries, mispredictions mount, and per-call cost can reach eight to twenty nanoseconds.

That said, a point Chandler Carruth has made repeatedly at CppCon bears emphasis: the dominant cost in many real polymorphic systems is not the indirect call itself. It is cache misses caused by heterogeneous object layout. When a container holds pointers to objects of mixed types allocated across the heap, accessing them in iteration order means random memory traversal. Sorting the container by concrete type, so all Lions come before all Zebras, often recovers most of the throughput without touching the dispatch mechanism at all.
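As a rough sketch of the sort-by-type idea, the snippet below groups a pointer container by dynamic type using std::type_index. The Animal, Lion, and Zebra definitions and both helper function names are hypothetical, chosen only to match the example in the text. Note the assumption: this reorders the pointers, so the call target becomes predictable, but the objects stay wherever the allocator put them.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <vector>

// Hypothetical hierarchy for illustration only.
struct Animal {
    virtual ~Animal() = default;
    virtual int legs() const = 0;
};
struct Lion : Animal {
    int legs() const override { return 4; }
};
struct Zebra : Animal {
    int legs() const override { return 4; }
};

// Group elements of the same dynamic type together. std::type_index makes
// type_info orderable; the order between types is unspecified but stable
// within one run. Every pointer is assumed to be non-null.
void sort_by_dynamic_type(std::vector<std::unique_ptr<Animal>>& herd) {
    std::stable_sort(herd.begin(), herd.end(),
        [](const std::unique_ptr<Animal>& a, const std::unique_ptr<Animal>& b) {
            return std::type_index(typeid(*a)) < std::type_index(typeid(*b));
        });
}

// Count how often the dynamic type changes between neighbouring elements.
// Fewer transitions means fewer changes of indirect-call target in a loop.
std::size_t type_transitions(const std::vector<std::unique_ptr<Animal>>& herd) {
    std::size_t changes = 0;
    for (std::size_t i = 1; i < herd.size(); ++i) {
        if (typeid(*herd[i]) != typeid(*herd[i - 1])) ++changes;
    }
    return changes;
}
```

For an alternating Lion, Zebra, Lion, Zebra container, the transition count drops from three to one after sorting. Whether that translates into a measurable speedup depends on the workload, which is why the profile comes first.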

Profile before rewriting. The indirect call overhead is concrete but frequently not the actual bottleneck.

What Compilers Already Handle

GCC and Clang perform devirtualization automatically in several cases that cover a substantial portion of real-world virtual call sites.

Local type certainty. When the object is stack-allocated or freshly constructed and has not escaped the optimizer’s view, the compiler knows the dynamic type and converts the virtual call to a direct call, which it can then inline:

void process() {
    Derived d;
    Base* p = &d;
    p->foo();  // GCC -O2: devirtualized to Derived::foo(), likely inlined
}

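The final specifier. Marking a class final tells the compiler that nothing can derive from it; marking a virtual method final tells it that no further override exists. Either way, a call through a pointer or reference whose static type is the final class (or a class in which the method is final) has exactly one possible target, so the compiler can devirtualize it without knowing where the object came from: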

class Derived final : public Base { ... };

void process(Derived* p) {
    p->foo();  // Devirtualized: Derived is final, no subclass can override
}

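This is one of the cheapest improvements available. If a class is not intended to be a base, mark it final. The compiler gets devirtualization; the reader gets documentation. The benefit applies only where the static type is the final class: a call through Base* remains virtual unless the compiler can establish the dynamic type some other way.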
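The following sketch contrasts the two placements of final. The class names Leaf and Middle and the three call_ helpers are made up for illustration; the comments describe which calls a compiler is permitted to turn into direct calls, not what any particular compiler version is guaranteed to emit.

```cpp
struct Base {
    virtual ~Base() = default;
    virtual int foo() const { return 1; }
    virtual int bar() const { return 10; }
};

// final on the class: nothing can derive from Leaf.
struct Leaf final : Base {
    int foo() const override { return 2; }
};

// final on a single method: Middle may still be derived from,
// but bar() cannot be overridden below this point.
struct Middle : Base {
    int bar() const final { return 20; }
};

// Static type is a final class: only Leaf::foo can be the target.
int call_leaf(const Leaf& l) { return l.foo(); }

// bar is final in Middle: only Middle::bar can be the target.
int call_middle(const Middle& m) { return m.bar(); }

// Static type is Base: the dynamic type is unknown here,
// so this stays a virtual call unless the optimizer learns more.
int call_base(const Base& b) { return b.foo(); }
```

Marking individual methods final is the finer-grained tool when a class must stay open for extension but a specific hot method should not be overridden again.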

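Link-Time Optimization. With -flto, GCC and Clang can analyze the class hierarchy across translation units. If the compiler can prove that a call site has exactly one possible target across the whole program, it converts the call to a direct one. When it can only see that one target is likely, for example because a shared library might still extend the hierarchy, it falls back to a guarded speculative call. This is GCC's IPA-devirt pass at work. Clang's whole-program devirtualization is enabled separately with -fwhole-program-vtables and relies on hidden visibility to know the hierarchy is closed. None of this requires source changes beyond the build flags.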

Speculative devirtualization with PGO. Profile-Guided Optimization adds a runtime type guard at call sites where one concrete type dominates the execution profile. The generated code looks roughly like:

// Compiler-generated at -fprofile-use:
if (__builtin_expect(p->vptr == &Derived::vtable, true)) {
    Derived::foo(p);  // direct call, inlineable
} else {
    p->foo();         // fallback: true virtual dispatch
}

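GCC's -fdevirtualize-speculatively, which is enabled at -O2 and above, inserts similar guards based on the type hierarchy alone; with -fprofile-use, the profile supplies the likely target from observed call counts. LLVM does the profile-driven version through Indirect Call Promotion in the PGO pipeline. In server workloads where one concrete type accounts for over 80 percent of calls, this transformation is close to free devirtualization, because the guard is a single well-predicted compare and branch.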
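For readers who want to see the shape of that guard in compilable form, here is a hand-written analogue. It is only an illustration of the control flow: the compiler compares vtable addresses directly, whereas this sketch uses typeid, which has its own cost. The type Rare and the function call_foo_guarded are hypothetical names.

```cpp
#include <typeinfo>

struct Base {
    virtual ~Base() = default;
    virtual int foo() const { return 0; }
};
struct Derived : Base {
    int foo() const override { return 42; }
};
struct Rare : Base {
    int foo() const override { return 7; }
};

// Manual version of a speculative guard. typeid compares the exact dynamic
// type, so the static_cast inside the branch is safe. The qualified call
// Derived::foo() bypasses virtual dispatch and is a candidate for inlining.
int call_foo_guarded(const Base& b) {
    if (typeid(b) == typeid(Derived)) {
        return static_cast<const Derived&>(b).Derived::foo();
    }
    return b.foo();  // fallback: ordinary virtual dispatch
}
```

In practice it is usually better to let PGO place such guards, since the profile knows which type dominates each call site and the hand-written check can go stale as the code evolves.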

The upshot: if you have a codebase with clean hierarchies, a build process that uses LTO, and you mark terminal classes final, the compiler eliminates a significant fraction of virtual call overhead without any manual intervention in the source code.

CRTP: Manual Static Dispatch

When the compiler cannot help, or when you need the optimizer to see through the call unconditionally rather than speculatively, the Curiously Recurring Template Pattern is the classic C++ solution. The base class is parameterized on the derived type, and dispatch becomes a compile-time cast:

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(double radius) : r(radius) {}
    double area_impl() const { return 3.14159 * r * r; }
private:
    double r;
};

The call to area_impl is a direct call to a statically known type. The compiler inlines it. No vptr, no vtable, no indirect branch.

This pattern powers Eigen’s entire matrix expression system, where MatrixBase<Derived> is the root of the hierarchy and expression templates composed through CRTP fuse multi-step matrix operations into single loops with no temporaries. It appears in Boost.Iterator’s iterator_facade, in LLVM’s internal casting infrastructure, and in embedded libraries like the Embedded Template Library where virtual dispatch is often prohibited entirely by hard real-time timing requirements.

CRTP’s core limitation is significant: Shape<Circle> and Shape<Rectangle> are unrelated types. You cannot store them in a single container without reintroducing a common base class, which brings back either virtual dispatch or some other runtime mechanism. CRTP handles compile-time-homogeneous usage; it does not replace runtime polymorphism.
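A small sketch makes both halves of that limitation concrete. It repeats the Shape and Circle definitions so it compiles on its own; Rectangle and the doubled_area helper are hypothetical additions for illustration.

```cpp
#include <type_traits>

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(double radius) : r(radius) {}
    double area_impl() const { return 3.14159 * r * r; }
private:
    double r;
};

class Rectangle : public Shape<Rectangle> {
public:
    Rectangle(double width, double height) : w(width), h(height) {}
    double area_impl() const { return w * h; }
private:
    double w;
    double h;
};

// Code that works with any CRTP shape must itself be a template:
// the compiler stamps out one instantiation per concrete type.
template <typename D>
double doubled_area(const Shape<D>& s) {
    return 2.0 * s.area();
}

// The two bases are distinct, unrelated types, so no single
// element type exists for a container holding both.
static_assert(!std::is_same_v<Shape<Circle>, Shape<Rectangle>>);
static_assert(!std::is_base_of_v<Shape<Circle>, Rectangle>);
```

Calling doubled_area with a Circle and with a Rectangle produces two separate functions, each with the area computation visible to the inliner. Storing both in one vector would need a common element type, and that is exactly what the pattern does not provide.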

The other limitation is ergonomics. The static_cast<Derived*>(this)->method() pattern is repetitive, and template error messages when the hierarchy is constructed incorrectly are famously difficult to read.

C++20 Concepts and C++23 Deducing This

C++20 concepts offer a cleaner approach to static interfaces for generic code. Instead of inheriting from a CRTP base, any type satisfying the concept works, with no inheritance required:

template <typename T>
concept Shape = requires(const T s) {
    { s.area() } -> std::convertible_to<double>;
    { s.draw() } -> std::same_as<void>;
};

template <Shape T>
void render(T& s) {
    s.draw();
}

Each instantiation of render is fully monomorphic and optimizable. Concepts have three advantages over CRTP: T does not need to inherit from anything, existing types can satisfy a concept retroactively without modification, and a type that fails the constraint produces a readable error rather than a wall of nested template instantiation failures.

C++23’s deducing-this feature (P0847) goes further and eliminates one of CRTP’s worst ergonomic problems by making the object parameter explicit and deducible by the compiler:

struct Shape {
    template <typename Self>
    double area(this const Self& self) {
        return self.area_impl();  // Self is the concrete derived type; no cast needed
    }
};

struct Circle : Shape {
    double area_impl() const { return 3.14159 * r * r; }
    double r;
};

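Self is deduced from the static type of the object expression at the call site, so calling area() on a Circle makes the call direct and inlineable. Calling it through a Shape& deduces Self as Shape, which has no area_impl, and fails to compile; the pattern therefore suits code that holds the concrete type. The static_cast boilerplate disappears. Compiler support arrived in GCC 14, Clang 18, and MSVC 19.32 (Visual Studio 2022 17.2, initially partial). For new code targeting C++23, deducing-this largely supersedes CRTP for mixin and interface injection patterns.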

std::variant for Closed Type Sets

When the set of concrete types is fixed at compile time but you need a single container to hold any of them, std::variant with std::visit provides static dispatch without inheritance:

using Shape = std::variant<Circle, Rectangle, Triangle>;

double total_area(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        sum += std::visit([](const auto& shape) {
            return shape.area();
        }, s);
    }
    return sum;
}

The dispatch is through an integer tag rather than a pointer. Implementations typically lower std::visit to a switch or a small table of function pointers indexed by that tag, and because the set of targets is statically bounded, the optimizer can analyze and often inline each branch. Because all types share the same fixed-size storage buffer, objects in a std::vector<Shape> sit in contiguous memory with no heap pointer chasing, which directly addresses the cache problem that hurts pointer-to-base polymorphic containers.

Benchmarks on quick-bench.com from Vittorio Romeo and others show variant dispatch at roughly half the cost of virtual dispatch in two-to-four-type scenarios on typical x86 hardware. The gap shrinks when the discriminator tag is predictably constant across iterations and grows when types are mixed randomly.

The constraint is the closed set requirement. std::variant cannot model a plugin architecture or any system where new types appear after compilation. It also pads all objects to the size of the largest type in the variant, which can waste memory if the type sizes diverge significantly.
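The padding cost can be checked at compile time. The sketch below supplies minimal definitions for the three shapes (the member names are assumptions, since the text does not define them) and measures how much larger a variant element is than a bare Circle. Exact sizes depend on the standard library implementation, so the assertions only state lower bounds.

```cpp
#include <cstddef>
#include <variant>
#include <vector>

// Minimal hypothetical definitions; member names are assumptions.
struct Circle {
    double r;
    double area() const { return 3.14159 * r * r; }
};
struct Rectangle {
    double w, h;
    double area() const { return w * h; }
};
struct Triangle {
    double base, height;
    double area() const { return 0.5 * base * height; }
};

using Shape = std::variant<Circle, Rectangle, Triangle>;

// A variant stores its alternatives in place, so it is at least as
// large as the largest one, and here larger than the smallest.
static_assert(sizeof(Shape) >= sizeof(Rectangle));
static_assert(sizeof(Shape) > sizeof(Circle));

// Bytes each stored Circle occupies beyond what a bare Circle needs.
constexpr std::size_t circle_overhead = sizeof(Shape) - sizeof(Circle);

double total_area(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        // The generic lambda is instantiated once per alternative.
        sum += std::visit([](const auto& shape) { return shape.area(); }, s);
    }
    return sum;
}
```

If one alternative is much larger than the rest, a common workaround is to store that alternative indirectly (for example behind a std::unique_ptr), at the price of bringing back one pointer hop for that type only.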

Making the Decision

The question is not whether static polymorphism is faster than virtual dispatch in principle. It is whether the dispatch is on your hot path, whether the compiler is already eliminating it, and what the actual constraint is.

Mark classes final where it applies. Build with LTO in release configurations. Profile before assuming virtual dispatch is the bottleneck; check whether sorting objects by type in a container resolves the throughput problem before touching any dispatch logic.

If profiling confirms dispatch overhead and the type set is known at compile time, use CRTP or concepts for fully generic code where the concrete type is always known at the call site, and std::variant for heterogeneous containers with a closed type set. If you are targeting C++23, deducing-this replaces most CRTP mixin patterns with substantially cleaner code.

Virtual dispatch remains correct for open type sets, plugin architectures, ABI-stable library interfaces, and any context where the concrete type is genuinely unknown until runtime. The abstraction cost there is real, but it reflects the problem’s requirements rather than a design flaw.

Static polymorphism carries its own costs: longer compilation times, larger binaries from template instantiation, and code that is harder to follow for developers unfamiliar with CRTP or deducing-this patterns. Reach for it when you have measured the problem, confirmed the type set is fixed at compile time, and need the optimizer to see through the abstraction unconditionally rather than speculatively.