The Full Cost of Virtual Dispatch and Three Ways to Eliminate It

Virtual dispatch in C++ has a reputation for being expensive. The mental model is roughly right: every polymorphic object carries a hidden pointer to a vtable, every virtual call loads that pointer and then loads a function pointer from the table, and the resulting indirect call goes wherever the pointer leads. David Álvarez Rosa’s breakdown on the isocpp blog puts the practical consequence plainly: on latency-sensitive paths, this overhead is measurable and avoidable. What the article gestures at but does not fully develop is where the overhead actually accumulates.

The Inlining Problem Is Bigger Than the Dispatch Itself

Agner Fog’s microarchitecture manual gives the raw numbers on the dispatch mechanism. An indirect call through a register on modern Intel, when the branch is well-predicted, costs roughly one to three cycles more than a direct call. For a monomorphic call site where the same concrete type appears every time, the CPU’s indirect branch predictor learns the target quickly. In isolation, the vtable round-trip is not catastrophic.

The compounding cost comes from blocked inlining. A virtual call is an opaque barrier for the compiler. It cannot see the call target at compile time, so it cannot inline the callee. This matters because inlining is what enables the optimizations downstream: constant folding, loop vectorization, register allocation across the call site, common subexpression elimination. A loop that calls a virtual method on each iteration cannot be auto-vectorized, even if the method does nothing more complex than adding two numbers, because the compiler cannot inspect the method body to prove the vectorization is safe. These are the transformations that produce real speedups on modern hardware, and a virtual call boundary quietly prevents all of them.
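A minimal sketch of that contrast follows. The names (Op, AddOneVirtual, AddOneDirect, sum_virtual, sum_static) are hypothetical and not from the original article. Both functions compute the same result; the only difference is what the optimizer can see. Whether a given compiler actually vectorizes the second loop depends on flags and target, so treat it as something to confirm on Compiler Explorer rather than a guarantee.

```cpp
#include <vector>

// Hypothetical interface: the loop below only sees Op&, not the concrete type.
struct Op {
    virtual ~Op() = default;
    virtual int apply(int x) const = 0;
};

struct AddOneVirtual : Op {
    int apply(int x) const override { return x + 1; }
};

// Same operation as a plain, non-virtual type.
struct AddOneDirect {
    int apply(int x) const { return x + 1; }
};

// Virtual version: each iteration is an indirect call unless the compiler
// can prove the dynamic type of `op`.
int sum_virtual(const Op& op, const std::vector<int>& data) {
    int total = 0;
    for (int x : data) total += op.apply(x);
    return total;
}

// Static version: F is a concrete type, so apply() is visible at the call
// site and the loop body can be inlined and optimized as a unit.
template <typename F>
int sum_static(const F& op, const std::vector<int>& data) {
    int total = 0;
    for (int x : data) total += op.apply(x);
    return total;
}
```

Comparing the generated assembly for the two functions at -O2 or -O3 is the quickest way to see how much the opaque call costs on a particular toolchain.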

Chandler Carruth’s CppCon 2014 talk on efficiency and data structures established this point clearly: a virtual call on a hot, cache-resident object is cheap in isolation. The cost compounds through the optimizer’s inability to look through it.

Misprediction is the other major factor. Post-Spectre, indirect branches on x86 are often retpoline-wrapped by the kernel or compiler, adding roughly 20-30 cycles of fixed overhead per call regardless of prediction accuracy. Even without retpolines, a call site where two or more types mix randomly pays a 15-20 cycle misprediction penalty each time the branch predictor is wrong. In workloads with eight or more concrete types cycling through a single call site, published benchmarks have shown slowdowns of four to six times versus a direct call, and that is before accounting for instruction cache effects.

The icache picture is underappreciated. Virtual dispatch scatters execution to many different function bodies. In server workloads with deeply polymorphic call graphs, profiling at Facebook and Google has found that icache misses from virtual dispatch can account for five to ten percent of total CPU cycles in production. The data layout of objects in memory, not just the dispatch mechanism, drives much of this.

What the Compiler Handles Automatically

Before writing a CRTP hierarchy, it is worth knowing how much the compiler already recovers.

Marking a class final is the simplest intervention. Both GCC and Clang at -O2 will devirtualize any virtual call through a reference or pointer of a final type:

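struct IRenderer {
    virtual ~IRenderer() = default;
    virtual void draw() = 0;
};

class Renderer final : public IRenderer {
public:
    void draw() override { /* ... */ }
};

void render(Renderer& r) {
    r.draw();  // static type is final: direct call, fully inlinable at -O2
}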

The final keyword costs nothing at runtime. Applied to a class, it signals that nothing can derive from Renderer, so the dynamic type behind any Renderer& is exactly Renderer and every virtual call through it has a known target. Adding it to leaf classes is accurate documentation of intent, independent of the performance benefit. On Compiler Explorer, the effect is immediate and visible: the call [rax + offset] instruction disappears and the callee is inlined.

Stack-allocated objects of concrete type are also devirtualized routinely:

Derived obj;
obj.virtualMethod();  // devirtualized at -O1 and above

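GCC’s -fdevirtualize-speculatively (on by default at -O2) extends this to call sites where the type cannot be proven. Working from the type inheritance graph, the compiler guesses a likely target, inserts a check for it, and falls through to virtual dispatch when the guess is wrong. Profile-guided optimization sharpens the guess with measured data: under PGO, both GCC and Clang promote hot indirect calls to guarded direct calls, which LLVM calls indirect call promotion.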
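The guarded call a compiler emits can be pictured as the hand-written analogue below. This is an illustration only: the Shape, Triangle, and Square types and the sides_guarded function are hypothetical, and real compilers typically compare the function address loaded from the vtable rather than using typeid.

```cpp
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    int sides() const override { return 3; }
};

struct Square : Shape {
    int sides() const override { return 4; }
};

// Hand-written analogue of a speculative guard, assuming Triangle is the
// type the compiler (or the profile) considers most likely.
int sides_guarded(const Shape& s) {
    if (typeid(s) == typeid(Triangle)) {
        // Qualified call: non-virtual, so the body is visible and inlinable.
        return static_cast<const Triangle&>(s).Triangle::sides();
    }
    return s.sides();  // fallback: ordinary virtual dispatch
}
```

The fast path pays one well-predicted compare and branch, and in exchange the optimizer gets a direct call it can inline.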

Link-time optimization takes things further. LLVM’s WholeProgramDevirtPass, enabled by ThinLTO with -fwhole-program-vtables, analyzes vtable type metadata across the entire program. If only one override of a virtual function exists in the whole binary, the dispatch is eliminated unconditionally. Meta and Google both use this pipeline in production. GCC’s -flto -fdevirtualize-at-ltrans achieves similar results.

The practical takeaway: final on leaf classes, PGO, and LTO collectively recover most of the virtual dispatch overhead without any design changes. On most code, that is the right starting point, and manual CRTP rewrites are unnecessary.

CRTP: Static Dispatch the Established Way

When the type set is genuinely fixed at compile time and a hot loop requires full inlining through the polymorphic boundary, the Curiously Recurring Template Pattern delivers zero-overhead dispatch:

template <typename Derived>
struct Processor {
    void process(int value) {
        static_cast<Derived*>(this)->processImpl(value);
    }
};

struct FastProcessor : Processor<FastProcessor> {
    void processImpl(int value) {
        // direct call, fully inlinable, no vtable involved
    }
};

There is no vtable and no vptr. The object is typically eight bytes smaller on 64-bit systems than an equivalent virtual class, and sometimes more when the vptr would have forced extra padding. The static_cast resolves the call at compile time, and the compiler sees the full method body at every call site.
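The size difference is easy to check. The sketch below uses hypothetical VirtualCounter and CrtpCounter types with a single 32-bit member; the exact byte counts depend on the ABI and on padding, so the static_assert states only the relative claim, which holds on mainstream compilers but is not guaranteed by the standard.

```cpp
#include <cstdint>

// Virtual version: carries a hidden vptr next to the data member.
struct VirtualCounter {
    virtual ~VirtualCounter() = default;
    virtual void bump() { ++count; }
    std::int32_t count = 0;
};

// CRTP version: the empty base adds no storage on mainstream ABIs.
template <typename Derived>
struct CounterBase {
    void bump() { static_cast<Derived*>(this)->bumpImpl(); }
};

struct CrtpCounter : CounterBase<CrtpCounter> {
    void bumpImpl() { ++count; }
    std::int32_t count = 0;
};

// Relative claim only; the absolute sizes are implementation-defined.
static_assert(sizeof(CrtpCounter) < sizeof(VirtualCounter),
              "expected the CRTP object to be smaller than the virtual one");
```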

The costs are structural. You cannot store Processor<FastProcessor> and Processor<SlowProcessor> in the same std::vector, because they are different types with no common non-templated base. Each template instantiation produces a separate copy of every method in the base class template, inflating binary size and icache footprint in large hierarchies. Template error messages, while improved by C++20 Concepts, remain dense.
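In practice, code that consumes CRTP types has to be templated as well, and a heterogeneous collection becomes a compile-time structure. The sketch below assumes a variant of the Processor template whose process returns int (a change made here only so the result can be observed), plus hypothetical Doubler and Negator types and run and run_all helpers.

```cpp
#include <tuple>

template <typename Derived>
struct Processor {
    int process(int value) {
        return static_cast<Derived*>(this)->processImpl(value);
    }
};

struct Doubler : Processor<Doubler> {
    int processImpl(int value) { return value * 2; }
};

struct Negator : Processor<Negator> {
    int processImpl(int value) { return -value; }
};

// Generic code takes the base as a template: one instantiation per Derived.
template <typename Derived>
int run(Processor<Derived>& p, int value) {
    return p.process(value);
}

// std::vector needs a single element type, so a mixed set of CRTP objects
// has to live in something like std::tuple, fixed at compile time.
inline int run_all(std::tuple<Doubler, Negator>& ps, int value) {
    return std::apply(
        [value](auto&... p) { return (run(p, value) + ...); }, ps);
}
```

Every function that touches these types becomes a template in turn, which is the mechanism by which the binary size cost spreads through a codebase.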

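A classic pitfall is accidental recursion. It arises when the base interface and the derived implementation share a name, say both are called process: if a derived class forgets to define its own, the static_cast call finds the inherited base method, compiles without errors, and recurses without end at runtime. Giving the implementation a distinct name such as processImpl, as above, already turns the omission into a compile error when process is instantiated. A requires clause on the member states the requirement explicitly and produces a more readable diagnostic: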

template <typename Derived>
struct Processor {
    void process(int value)
        requires requires(Derived& d) { d.processImpl(value); }
    {
        static_cast<Derived*>(this)->processImpl(value);
    }
};

This reports a missing processImpl as an unsatisfied constraint at the call site rather than leaving it to a runtime failure or a deep template error, which is the right trade.
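For codebases that cannot use C++20 yet, a similar check can be approximated in C++17 with the detection idiom and a static_assert. This is a different technique from the requires clause, not an equivalent of it: the trait name has_process_impl, the last member, and BrokenProcessor are all hypothetical and exist only for illustration.

```cpp
#include <type_traits>
#include <utility>

// Detection idiom: true when T has a member callable as processImpl(int).
template <typename T, typename = void>
struct has_process_impl : std::false_type {};

template <typename T>
struct has_process_impl<
    T, std::void_t<decltype(std::declval<T&>().processImpl(0))>>
    : std::true_type {};

template <typename Derived>
struct Processor {
    void process(int value) {
        // Evaluated when process() is instantiated, at which point Derived
        // is a complete type.
        static_assert(has_process_impl<Derived>::value,
                      "Derived must implement processImpl(int)");
        static_cast<Derived*>(this)->processImpl(value);
    }
};

struct FastProcessor : Processor<FastProcessor> {
    int last = 0;  // records the most recent value, for demonstration only
    void processImpl(int value) { last = value; }
};

// Hypothetical derived class that forgot processImpl. Defining it is fine;
// calling process() on it would trip the static_assert.
struct BrokenProcessor : Processor<BrokenProcessor> {};
```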

C++23 Deducing This: A Cleaner Spelling

C++23’s explicit object parameter, standardized in P0847, gives you CRTP-style static dispatch without the template base class:

struct Processor {
    template <typename Self>
    void process(this Self& self, int value) {
        self.processImpl(value);
    }
};

struct FastProcessor : Processor {
    void processImpl(int value) { /* ... */ }
};

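Self is deduced from the static type of the object expression at the call site, so calling process on a FastProcessor deduces Self as FastProcessor, and the call to processImpl is direct and inlinable. Calling it through a Processor& deduces Self as Processor, which has no processImpl and fails to compile. There is one Processor base class instead of one per derived type, which trims the duplicated class templates of CRTP, although process itself is still instantiated once per deduced Self. The const/non-const overload duplication that CRTP requires is handled naturally, because Self deduces the cv-qualifier of the object being called on.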

Compiler support arrived in GCC 14, Clang 18, and Visual Studio 2022 version 17.2. For new code targeting C++23, this is the cleaner spelling of most CRTP patterns.

Neither approach provides runtime polymorphism. Both require the concrete type to be known at compile time. Anything that genuinely dispatches based on information available only at runtime still needs virtual dispatch, manual type erasure, or a closed-set alternative.

std::variant and the Data Layout Argument

std::variant with std::visit is often framed as a dispatch optimization, but the more significant benefit is data layout:

using Shape = std::variant<Circle, Square, Triangle>;

void draw_all(std::vector<Shape>& shapes) {
    for (auto& s : shapes) {
        std::visit([](auto& shape) { shape.draw(); }, s);
    }
}

The shapes are stored inline in the variant, contiguously in the vector. There is no heap allocation per element, no pointer chasing, and no vptr. The dispatch uses a small integer index into a jump table rather than a vtable pointer. In cache-cold benchmarks where virtual dispatch requires following pointers to heap-scattered objects, contiguous std::vector<Shape> consistently wins by two to five times. That gap comes almost entirely from memory access patterns, not from the difference between a vtable pointer and a discriminant integer.
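A self-contained version of the same idea is sketched below. The shape members and the area methods are assumptions added for illustration (the original snippet only calls draw()), and total_area is a hypothetical helper.

```cpp
#include <variant>
#include <vector>

// Plain value types: no base class, no virtual functions, no vptr.
struct Circle {
    double r;
    double area() const { return 3.141592653589793 * r * r; }
};

struct Square {
    double side;
    double area() const { return side * side; }
};

struct Triangle {
    double base, height;
    double area() const { return 0.5 * base * height; }
};

using Shape = std::variant<Circle, Square, Triangle>;

// Elements sit contiguously inside the vector's buffer; std::visit selects
// the alternative from the variant's stored index.
double total_area(const std::vector<Shape>& shapes) {
    double total = 0.0;
    for (const auto& s : shapes) {
        total += std::visit([](const auto& shape) { return shape.area(); }, s);
    }
    return total;
}
```

Each element occupies at least the size of the largest alternative plus the index, so a variant whose alternatives differ widely in size trades some memory for the contiguous layout.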

The constraint is a closed type set. Every type must be known when the variant is defined. Adding a new shape type requires modifying the variant definition and recompiling every std::visit site. This makes std::variant unsuitable for plugin architectures, runtime-extensible type registries, or deserialization into user-defined types.

With a large number of variant members, roughly ten or more, the template machinery std::visit generates inflates compile times noticeably. For small, stable type sets (AST nodes, rendering primitives, expression trees in a compiler), the approach is often the right one.

A Practical Summary

These techniques address different constraints, and the right choice depends on what you are actually optimizing for.

Mark leaf classes final, enable LTO and PGO on production builds, and measure before rewriting anything. The compiler recovers more overhead than most manual rewrites do, and final in particular is free and accurate documentation.

Reach for CRTP or deducing-this when you have confirmed through profiling that a virtual call in a hot loop is blocking inlining opportunities that matter, and when the set of types is genuinely closed at compile time. The isocpp article focuses on exactly this case: latency-sensitive paths where the abstraction must have zero runtime cost.

Choose std::variant when you have a small, stable type set and want contiguous storage in collections. Profile the memory access pattern first; if the data is already cache-hot, the dispatch mechanism is rarely the bottleneck.

Keep virtual dispatch everywhere the type set is open, the hierarchy is designed for extension, or the code is not on a measured hot path. Agner Fog’s optimization manual and Fabian Giesen’s analyses both converge on the same conclusion: a well-predicted monomorphic virtual call is a handful of cycles. The blocked inlining is a real cost, but compilers with final, PGO, and LTO find and fix much of it. Measure, then decide.