
What the Compiler Can't Do Across a vtable

Source: isocpp

The most common framing of virtual dispatch overhead focuses on the wrong thing. An indirect call through a vtable pointer costs a few cycles, and branch mispredictions hurt when a call site dispatches to many concrete types. Neither of those, however, explains why tight loops over polymorphic objects can run three to ten times slower than their statically-typed equivalents. The real cost is what the compiler can no longer do once it cannot see the callee’s body.

When a function is inlined, the compiler gains access to the called code and can apply constant propagation, dead-code elimination, register allocation across the boundary, and auto-vectorization. A simple dot(Vec3) through a vtable pointer becomes a full function call with saved registers, a stack frame setup, and a return sequence. The same operation inlined reduces to three multiplications and two additions, which the compiler may further fold into a few SIMD instructions. Chandler Carruth made this point explicitly in his CppCon 2014 talk on data structures and efficiency: the bottleneck is the optimization boundary, not the call instruction.
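To make the boundary concrete, here is a minimal sketch of the same reduction written both ways. The names Vec3, Kernel, DotKernel, sum_static, and sum_virtual are hypothetical and chosen only for illustration. In sum_static the body of dot is visible inside the loop; in sum_virtual the compiler sees only an indirect call unless it manages to devirtualize it.

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

// Visible body: the optimizer may inline this into any loop that calls it.
inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical polymorphic interface wrapping the same arithmetic.
struct Kernel {
    virtual float apply(const Vec3& a, const Vec3& b) const = 0;
    virtual ~Kernel() = default;
};

struct DotKernel : Kernel {
    float apply(const Vec3& a, const Vec3& b) const override { return dot(a, b); }
};

// Static call: the loop body is just arithmetic once dot is inlined.
float sum_static(const std::vector<Vec3>& v, const Vec3& axis) {
    float total = 0.0f;
    for (const Vec3& p : v) total += dot(p, axis);
    return total;
}

// Virtual call: unless devirtualized, each iteration is an opaque indirect call.
float sum_virtual(const std::vector<Vec3>& v, const Vec3& axis, const Kernel& k) {
    float total = 0.0f;
    for (const Vec3& p : v) total += k.apply(p, axis);
    return total;
}
```

Both functions compute the same value; comparing their generated assembly at -O2 is a quick way to see which optimizations survive the call.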

The Four Costs, Properly Ordered

Understanding the actual cost structure changes how you approach solutions. In descending order of impact:

Inhibited inlining is the largest problem. The compiler emits a call-through-pointer and cannot optimize across it. For trivial operations in inner loops, this alone accounts for the bulk of observed slowdowns.

Indirect branch misprediction follows. Modern CPUs predict indirect branches through a Branch Target Buffer. Monomorphic call sites (always dispatching to the same concrete type) warm the predictor quickly; polymorphic sites with several types thrash it, costing 15 to 20 cycles per miss on current x86 hardware.

Pointer indirection and cache pressure rank third. Loading the vptr from a hot object in L1 is cheap. When objects are heap-allocated and scattered, every access becomes a pointer chase. A vector of unique_ptr<Animal> can incur a cache miss on every element; a contiguous array of value types is walked sequentially and prefetches well.

Finally, the hidden vptr costs 8 bytes per object on 64-bit targets, reducing object density per cache line and inflating the working set for the same number of objects.
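The size cost is easy to check directly. The sketch below uses two hypothetical types, PlainVec3 and PolyVec3, that differ only in whether they have virtual functions. Exact sizes depend on the ABI; on a typical 64-bit target I would expect 12 bytes for the plain struct and 24 for the polymorphic one once the vptr and alignment padding are counted, but the code asserts only the portable part of that claim.

```cpp
#include <cstddef>

// Three floats, no virtual functions, no hidden pointer.
struct PlainVec3 {
    float x, y, z;
};

// Same data members, but the virtual functions make the compiler add a vptr.
struct PolyVec3 {
    float x, y, z;
    virtual float length_squared() const { return x * x + y * y + z * z; }
    virtual ~PolyVec3() = default;
};

constexpr std::size_t plain_size = sizeof(PlainVec3);
constexpr std::size_t poly_size  = sizeof(PolyVec3);

// Holds on mainstream ABIs; the exact difference is implementation-defined.
static_assert(poly_size > plain_size, "polymorphic type carries extra per-object state");
```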

Eli Bendersky’s benchmark measured roughly a 3x speed difference between virtual dispatch and CRTP in a tight loop over a homogeneous array. Nearly all of it came from the compiler’s inability to inline and vectorize the loop body, not from the indirect call itself.

What Compilers Already Do For You

Before reaching for manual techniques, it is worth knowing how far compilers already go.

Even at low optimization levels, GCC and Clang both devirtualize calls where the concrete type is locally provable: stack-allocated objects, a pointer initialized from new Derived without intermediate reassignment, or a unique_ptr<Derived> whose pointee type is final. GCC’s fuller -fdevirtualize analysis is enabled at -O2. The final keyword is the cheapest and most underused lever here. Marking a class or method final gives the compiler an unconditional guarantee:

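#include <cstdio>

struct Animal {
    virtual void speak() = 0;
    virtual ~Animal() = default;
};

struct Dog final : Animal {
    void speak() override { std::puts("Woof"); }
};

void bench(Dog* d) {
    d->speak(); // Dog is final, so this is a direct call with no guard
}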

At -O2, GCC enables -fdevirtualize-speculatively, which emits a type-guard around a direct call for likely-monomorphic sites, allowing the inlined path to dominate at runtime without changing the program’s semantics.
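The shape of a speculative guard can be approximated by hand. The sketch below is an illustration only, not what GCC emits: as I understand it, the compiler's guard checks the vtable entry rather than typeid, and the typeid comparison here is a readable stand-in. Animal, Dog, Bird, and legs_guarded are hypothetical names.

```cpp
#include <typeinfo>

struct Animal {
    virtual int legs() const = 0;
    virtual ~Animal() = default;
};
struct Dog  : Animal { int legs() const override { return 4; } };
struct Bird : Animal { int legs() const override { return 2; } };

// Hand-written analogue of a speculative devirtualization guard.
int legs_guarded(const Animal& a) {
    if (typeid(a) == typeid(Dog)) {
        // Fast path: a qualified call is non-virtual, so it can be inlined.
        return static_cast<const Dog&>(a).Dog::legs();
    }
    // Slow path: ordinary virtual dispatch keeps the semantics intact.
    return a.legs();
}
```

The guess only affects speed: when the object is not a Dog, the fallback still produces the correct result.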

For whole-program analysis, GCC’s -fdevirtualize-at-ltrans (used together with -flto; it is off by default because it enlarges the streamed LTO data) and Clang’s -fwhole-program-vtables (which requires -flto and is normally combined with -fvisibility=hidden) let the link-time optimizer examine the entire class hierarchy and prove that no derived class exists for a given type within the linked binary. GCC exposes success through -fdump-tree-optimized; the absence of OBJ_TYPE_REF nodes means the call was devirtualized.

If your call sites are monomorphic and your types are marked final, the compiler often handles this without any manual intervention. You need alternatives only when the type genuinely varies at runtime and the compiler cannot prove otherwise.

CRTP: Static Polymorphism Without the Vtable

When runtime polymorphism is not the requirement, the Curiously Recurring Template Pattern is the canonical solution. The derived type is passed as a template argument to its own base class, making dispatch resolvable at instantiation time:

template<typename Derived>
struct AnimalBase {
    void speak() {
        static_cast<Derived*>(this)->speak_impl();
    }
};

struct Dog : AnimalBase<Dog> {
    void speak_impl() { puts("Woof"); }
};

template<typename T>
void make_speak(AnimalBase<T>& a) {
    a.speak(); // resolved at instantiation, fully inlineable
}

The resulting code contains no vtable lookup and no indirect call; the static_cast collapses entirely at instantiation time. The Eigen linear algebra library uses this pattern pervasively through MatrixBase<Derived>, enabling expression templates that compose matrix operations lazily with zero runtime overhead. Embedded systems code uses it for hardware abstraction layers where a virtual call’s vptr overhead is significant relative to available RAM.

The trade-offs deserve clear-eyed treatment. CRTP instantiates every base-class member function that is used separately for each concrete type. A base class with 20 methods and 10 derived types can produce up to 200 function bodies, where a non-template base class would compile its 20 methods once; binary size grows accordingly unless the bodies are inlined away. Compile times rise with each new instantiation. There is also no heterogeneous container: AnimalBase<Dog> and AnimalBase<Cat> are unrelated types, and you cannot store them in the same std::vector without additional type erasure.

CRTP is the right choice for library infrastructure running in tight inner loops over a known type set, for embedded targets where object layout overhead matters, and wherever you need the compiler to auto-vectorize across the polymorphic boundary.
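As a sketch of the tight-loop case, the following applies CRTP to a homogeneous container. ShapeBase, Square, and total_area are hypothetical names introduced for this example. Because the element type is known at instantiation time, the call to area() resolves statically and the loop body is plain arithmetic the optimizer can see through.

```cpp
#include <vector>

// CRTP base: forwards to the derived implementation without a vtable.
template <typename Derived>
struct ShapeBase {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Square : ShapeBase<Square> {
    double side;
    explicit Square(double s) : side(s) {}
    double area_impl() const { return side * side; }
};

// One instantiation per element type; no indirect call inside the loop.
template <typename T>
double total_area(const std::vector<T>& shapes) {
    double total = 0.0;
    for (const T& s : shapes) total += s.area();
    return total;
}
```

The container here holds a single concrete type, which is exactly the restriction described above: mixing Square with another shape in one vector would need a variant or type erasure.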

std::variant and the Value-Semantics Alternative

When the set of types is closed, std::variant combined with std::visit offers a different trade: you accept a fixed list of alternatives in exchange for value semantics and contiguous object storage.

using Animal = std::variant<Dog, Cat, Bird>;

void make_speak(Animal& a) {
    std::visit([](auto& animal) { animal.speak(); }, a);
}

std::visit typically dispatches through a table of function pointers indexed by the variant’s active alternative (the exact mechanism is implementation-defined), which is similar in raw call cost to a vtable lookup. The storage model is where the meaningful difference lies. A std::vector<Animal> stores all objects contiguously with no pointer indirection, whereas a vector of unique_ptr to a polymorphic base class can miss the cache on every element when the pointees are scattered across the heap. The variant vector behaves like a flat array, at the price of every element being as large as the largest alternative. In benchmarks with thousands of game entities, this layout difference alone has produced large frame-time reductions, as Louis Dionne demonstrated in his CppCon 2017 presentation on runtime polymorphism. The visitor is instantiated once per alternative at compile time; with an inlineable lambda, the handler for each type can be optimized independently.
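A minimal sketch of the contiguous layout in use, assuming C++17. The Dog, Cat, and Bird types here return strings instead of printing so the result can be checked, and chorus is a hypothetical helper name.

```cpp
#include <string>
#include <variant>
#include <vector>

struct Dog  { std::string speak() const { return "Woof"; } };
struct Cat  { std::string speak() const { return "Meow"; } };
struct Bird { std::string speak() const { return "Tweet"; } };

using Animal = std::variant<Dog, Cat, Bird>;

// Walks a flat vector of variants; no element lives behind a pointer.
std::string chorus(const std::vector<Animal>& zoo) {
    std::string out;
    for (const Animal& a : zoo) {
        // The generic lambda is instantiated once per alternative.
        out += std::visit([](const auto& animal) { return animal.speak(); }, a);
        out += ' ';
    }
    return out;
}
```

Calling chorus on a vector holding a Dog, a Cat, and a Bird concatenates the three sounds in order, each followed by a space.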

The constraint is that the type set is frozen at the point of the variant definition. Adding a new animal type means changing that definition, recompiling every visitor, and extending any visitor that enumerates the alternatives explicitly; a generic lambda keeps working only if the new type provides the members it calls. Virtual dispatch handles this without touching existing call sites. The choice between the two approaches follows from whether the type hierarchy or the set of operations on it is more likely to grow.

For ergonomic multi-type dispatch, the overload pattern from C++17 composes well with std::visit:

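template<typename... Ts>
struct overload : Ts... { using Ts::operator()...; };

// Deduction guide: required in C++17, redundant from C++20 onward.
template<typename... Ts>
overload(Ts...) -> overload<Ts...>;

std::visit(overload{
    [](Dog&)  { puts("Woof"); },
    [](Cat&)  { puts("Meow"); },
    [](Bird&) { puts("Tweet"); },
}, animal);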

C++23 Deducing This: The CRTP Replacement

C++23’s P0847R7, “Deducing this”, allows the implicit this parameter to be made explicit and deduced, eliminating the CRTP boilerplate for mixin and method-forwarding patterns:

// Before C++23: CRTP mixin
template<typename Derived>
struct Loggable {
    void log() { static_cast<Derived*>(this)->log_impl(); }
};
struct Service : Loggable<Service> {
    void log_impl() { puts("Service log"); }
};

// C++23: deducing this
struct Loggable {
    void log(this auto& self) { self.log_impl(); }
};
struct Service : Loggable {
    void log_impl() { puts("Service log"); }
};

Loggable is now a single non-template class. The method log is implicitly a member function template: it deduces the type of self from the static type of the object expression at each call site and dispatches statically, with full inlining available. There is no Loggable<Service> class template instantiation; Service inherits from Loggable as it would from any base class. One consequence is that calling log through a Loggable& deduces Loggable itself, so the call fails to compile because Loggable has no log_impl. Error messages improve substantially because the recursive template structure that makes CRTP pathologies hard to diagnose is gone. The feature also enables clean fluent-builder patterns and recursive lambdas that previously required awkward workarounds.

Deducing this does not replace type erasure for heterogeneous runtime-polymorphic containers. For that use case, libraries like dyno provide an external vtable model that decouples the dispatch table from the object, improving cache behavior for small objects. But for the common CRTP pattern of injecting compile-time-resolved behavior into a class hierarchy, deducing this is cleaner in nearly every respect. GCC 14, Clang 18, and MSVC (partially, from Visual Studio 2022 17.2) support the feature.

Making the Choice

The decision follows from what the code needs. Runtime open-set polymorphism, where new types can be added without recompiling call sites, fits virtual dispatch. Use final wherever possible, apply LTO, and let the compiler devirtualize what it can prove statically.

A closed type set with heterogeneous value storage and cache-sensitive access patterns fits std::variant plus std::visit. The contiguous layout and the absence of pointer chasing frequently matter more than the dispatch mechanism itself.

Zero-overhead inner loops over a known type set, or library abstractions that need to inject behavior into a class hierarchy at compile time, fit CRTP or C++23 deducing this. The latter handles the forwarding and mixin cases with far less template machinery.

David Álvarez Rosa’s article on isocpp.org correctly identifies that latency-sensitive paths benefit from static polymorphism. The broader point is that the choice becomes considerably clearer once you stop thinking of virtual overhead as “the cost of an indirect call” and start thinking of it as “the cost of an optimization boundary.” From that angle, each alternative is a different way of giving the compiler back the information it needs, and the right tool depends on exactly which part of that information you are willing to fix at compile time.
