
What the Optimizer Loses When You Call Through a Vtable

Source: isocpp

Virtual dispatch has a reputation for being slow, and the costs are real, though the explanations most commonly given are incomplete. The overhead includes the extra instructions for loading the vtable pointer and branching indirectly, but the more significant impact is on the optimizer, which loses visibility into the call target and cannot inline, vectorize, or constant-fold across the call boundary.

David Álvarez Rosa’s recent piece on isocpp.org names this precisely: virtual dispatch carries hidden overhead beyond pointer indirection. The hidden part is what the compiler cannot do once it encounters an indirect call.

What a Virtual Call Compiles To

When a class declares virtual methods, the compiler emits a vtable, a static array of function pointers, and writes a hidden vptr field into every object of that type. On x86-64 with the Itanium ABI, a call to obj->method() compiles to roughly:

mov rax, [rdi]     ; load vptr from object
call [rax + 16]    ; call through slot 2 in the vtable

Two memory loads and an indirect branch account for the full dispatch mechanism. On a cache-warm path with a predictable call target, the cost is somewhere between 1 and 5 nanoseconds on a modern processor. Branch target buffers handle repeated calls to the same address efficiently, so the overhead in the monomorphic case is modest.

The picture changes with a heterogeneous collection. A loop over an array of Base* pointing to several derived types in unpredictable order defeats the branch predictor. With four or more derived types appearing in random order, you regularly pay 10 to 20 nanoseconds per call for mispredictions alone. Agner Fog’s microarchitecture guides note that the branch target buffer holds roughly 4,000 entries; workloads that exceed that capacity with scattered vtable dispatch pay for it consistently.

The Inliner Is the Real Cost

The optimizer inlines aggressively because inlining is not primarily about removing function call overhead. It is about making the callee’s code visible at the call site so the optimizer can act on both together. Once a function body is inlined, the compiler can fold constants from the caller’s context, eliminate dead branches, merge adjacent loops, and autovectorize a loop whose body was previously an opaque call.

A virtual call is an optimization barrier. The compiler sees call [rax + 16] and has no information about what code lives at that address while it is building and optimizing IR. It cannot inline through the call, propagate types across it, or fold constants. Everything downstream is either conservatively modeled or ignored by the optimizer entirely.

This is why benchmarks comparing virtual dispatch to direct calls in tight loops often show 4 to 10x differences. Chandler Carruth’s 2015 CppCon talk on performance tuning demonstrated a polymorphic dispatch loop achieving around 40 million calls per second, compared to several hundred million for the equivalent devirtualized code. The speedup comes from the loop being vectorizable once the call target is visible to the optimizer; the raw indirect-call overhead accounts for a small fraction of the measured difference.
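The contrast can be sketched with two loops that compute the same sum. This is a minimal illustration, not the benchmark from the talk: Op, AddOne, AddOneStatic, sum_virtual, and sum_static are hypothetical names, and whether a given compiler actually vectorizes the second loop depends on the compiler and flags, so it is worth checking the generated assembly.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical interface, used only for illustration.
struct Op {
    virtual ~Op() = default;
    virtual std::int64_t apply(std::int64_t x) const = 0;
};

struct AddOne : Op {
    std::int64_t apply(std::int64_t x) const override { return x + 1; }
};

// Same operation without a vtable.
struct AddOneStatic {
    std::int64_t apply(std::int64_t x) const { return x + 1; }
};

// The callee is selected through the vtable. Unless the compiler can prove
// the dynamic type of `op`, the loop body stays an opaque indirect call.
std::int64_t sum_virtual(const Op& op, const std::vector<std::int64_t>& v) {
    std::int64_t total = 0;
    for (std::int64_t x : v) total += op.apply(x);
    return total;
}

// The callee type is a template parameter, so the body of apply() is visible
// at the call site and the loop is a candidate for inlining and vectorization.
template <typename F>
std::int64_t sum_static(const F& f, const std::vector<std::int64_t>& v) {
    std::int64_t total = 0;
    for (std::int64_t x : v) total += f.apply(x);
    return total;
}

// Both versions return the same value; only the optimizer's view differs.
void check_same_result() {
    const std::vector<std::int64_t> v{1, 2, 3, 4};
    assert(sum_virtual(AddOne{}, v) == 14);
    assert(sum_static(AddOneStatic{}, v) == 14);
}
```

Compiling both functions at -O2 and comparing the loop bodies is a quick way to see whether the indirect call survived.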

When the Compiler Fixes It for You

GCC and Clang both implement devirtualization strategies that eliminate virtual calls without any source change. The simplest case involves an object with automatic storage and no observable escape:

void process() {
    Derived d;
    Base* p = &d;
    p->method(); // GCC and Clang devirtualize this at -O2
}

The compiler knows the dynamic type of d exactly, since it was constructed on the stack and its address has not escaped, and rewrites the virtual call to a direct call. In GCC this is controlled by -fdevirtualize, which is enabled at -O2; cases that require reasoning across function boundaries are handled by an interprocedural devirtualization pass (ipa-devirt).

The final specifier is the most practical low-effort mechanism:

class Concrete final : public Base {
public:
    void method() override;
};

void call(Concrete* c) {
    c->method(); // devirtualized: no class can derive from Concrete
}

Marking a class final tells the compiler that no subclass of Concrete can exist, so any virtual call through a Concrete* or Concrete& is unconditionally devirtualized. The annotation costs nothing beyond one keyword on a leaf class and applies at the use site without requiring link-time optimization. Both Chrome and V8 have documented performance gains from systematic final annotation on leaf classes, typically in the 5 to 15 percent range on virtual-call-heavy workloads.

The third mechanism is whole-program devirtualization under link-time optimization. Clang’s -fwhole-program-vtables flag, combined with LTO, performs a cross-module analysis: if only one class in the entire program overrides a given virtual slot, every call site for that slot is devirtualized. This is powerful in large codebases but requires LTO enabled across the link step, which not every project can adopt.

Where devirtualization fails is the common case: a heap-allocated object accessed through a Base* where the compiler cannot determine the dynamic type at compile time. That covers most production use of runtime polymorphism. For those paths, if the hotness of the code justifies the effort, the solutions are structural.
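A minimal sketch of that common case, assuming a hypothetical Codec hierarchy and a make_codec factory (none of these names come from the original article): the concrete type depends on a run-time value, so a function that only receives a Codec& has nothing to devirtualize against.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical hierarchy, used only for illustration.
struct Codec {
    virtual ~Codec() = default;
    virtual std::string name() const = 0;
};

struct Gzip : Codec {
    std::string name() const override { return "gzip"; }
};

struct Zstd : Codec {
    std::string name() const override { return "zstd"; }
};

// The concrete type is chosen from a run-time value and allocated on the heap.
std::unique_ptr<Codec> make_codec(bool prefer_zstd) {
    if (prefer_zstd) return std::make_unique<Zstd>();
    return std::make_unique<Gzip>();
}

// Seen in isolation, this function knows only the static type Codec, so the
// call below stays an indirect call through the vtable.
std::string describe(const Codec& c) {
    return c.name();
}

void check_codec() {
    assert(describe(*make_codec(true)) == "zstd");
    assert(describe(*make_codec(false)) == "gzip");
}
```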

CRTP: Dispatch Resolved at Compile Time

The Curiously Recurring Template Pattern encodes the derived type as a template parameter of the base class, letting the base call derived methods through a static cast rather than a vtable:

template <typename Derived>
class Serializable {
public:
    std::string serialize() const {
        return static_cast<const Derived*>(this)->serialize_impl();
    }
};

class Record : public Serializable<Record> {
public:
    std::string serialize_impl() const {
        return "{id: 1}";
    }
};

Serializable<Record>::serialize() is a concrete instantiation where serialize_impl() resolves to a direct, fully visible call. The compiler can inline it, constant-fold through it, and vectorize any loop containing it. There is no vtable, no vptr field, no indirect branch.
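As a usage sketch, generic code can accept any Serializable<Derived> and let the compiler deduce Derived at each call. The Point type and the to_wire helper below are hypothetical additions for illustration; Serializable and Record are repeated from above so the block compiles on its own.

```cpp
#include <cassert>
#include <string>

template <typename Derived>
class Serializable {
public:
    std::string serialize() const {
        // Static downcast: resolved at compile time, no vtable lookup.
        return static_cast<const Derived*>(this)->serialize_impl();
    }
};

class Record : public Serializable<Record> {
public:
    std::string serialize_impl() const { return "{id: 1}"; }
};

// Hypothetical second type, to show one helper serving several derived types.
class Point : public Serializable<Point> {
public:
    std::string serialize_impl() const { return "{x: 0, y: 0}"; }
};

// Hypothetical helper: Derived is deduced from the argument, so each call
// instantiates a version whose serialize_impl() target is statically known.
template <typename Derived>
std::string to_wire(const Serializable<Derived>& s) {
    return "<" + s.serialize() + ">";
}

void check_crtp() {
    assert(to_wire(Record{}) == "<{id: 1}>");
    assert(to_wire(Point{}) == "<{x: 0, y: 0}>");
}
```

Note that to_wire is itself a template: Serializable<Record> and Serializable<Point> are unrelated types, which is the container limitation described next.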

The Eigen linear algebra library is the canonical production example. MatrixBase<Derived> provides all arithmetic operations in terms of derived(), and each concrete matrix type carries its shape and storage layout as compile-time template parameters. A coefficient-wise expression like A + B - C produces a nested template type encoding the full computation; evaluating it collapses to a single fused loop with no temporaries and, where the scalar type and layout allow, SIMD vectorization. (Matrix products are a deliberate exception: Eigen evaluates a product nested inside a larger expression into a temporary by default.) This is only possible because the compiler can see through the entire expression hierarchy at compile time.

The trade-offs are straightforward. CRTP requires the derived type at template instantiation time, so a base parameterized on Derived cannot be stored in a homogeneous container for runtime dispatch over mixed types. Each instantiation is a separate class, so the binary grows per instantiation. Template error messages in CRTP hierarchies remain genuinely difficult to read regardless of the compiler vendor.

std::variant as Closed-Set Polymorphism

When the set of possible types is fixed and known at compile time, std::variant with std::visit provides value-semantic polymorphism without heap allocation:

using Shape = std::variant<Circle, Rectangle, Triangle>;

double area(const Shape& s) {
    return std::visit([](const auto& shape) {
        return shape.area();
    }, s);
}

Most standard library implementations compile std::visit to a jump table indexed by the variant’s discriminant. There is one indirect branch, but the discriminant is stored inline in the object rather than loaded through a pointer, and the jump table is small, static, and cache-friendly. Performance lands between CRTP and virtual dispatch: closer to the cost of a direct call when the type set is small and the branch predictor can learn the active type, closer to virtual dispatch latency under random access patterns with many distinct types.
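A self-contained version of the same idea, with the three shape types filled in, might look like the sketch below. The member names (radius, width, height, base) and the total_area helper are assumptions made for illustration; the article only names the types.

```cpp
#include <cassert>
#include <variant>
#include <vector>

// Hypothetical definitions for the alternatives.
struct Circle {
    double radius;
    double area() const { return 3.141592653589793 * radius * radius; }
};

struct Rectangle {
    double width, height;
    double area() const { return width * height; }
};

struct Triangle {
    double base, height;
    double area() const { return 0.5 * base * height; }
};

using Shape = std::variant<Circle, Rectangle, Triangle>;

// The generic lambda is instantiated once per alternative; each instantiation
// calls a statically known area() that the compiler can inline.
double area(const Shape& s) {
    return std::visit([](const auto& shape) { return shape.area(); }, s);
}

// Shapes are stored by value in contiguous memory: no heap allocation per
// element and no vptr.
double total_area(const std::vector<Shape>& shapes) {
    double total = 0.0;
    for (const Shape& s : shapes) total += area(s);
    return total;
}

void check_variant() {
    const std::vector<Shape> shapes{Rectangle{2.0, 3.0}, Triangle{4.0, 5.0}};
    assert(total_area(shapes) == 16.0);
}
```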

The constraint is closure. Adding a new type to Shape forces recompilation of every visitor. Virtual dispatch is open by design; new derived classes can be added, or even loaded from a shared library at runtime, without recompiling the code that calls through the base interface. For plugin systems, dynamic libraries, or any case where the type set grows after compilation, virtual dispatch remains the right choice.

C++23 Deducing this: CRTP Without the Machinery

C++23 introduced explicit object parameters (P0847), which let member functions deduce the concrete type of the object at the call site:

struct Serializable {
    std::string serialize(this const auto& self) {
        return self.serialize_impl();
    }
};

struct Record : Serializable {
    std::string serialize_impl() const { return "{id: 1}"; }
};

The this const auto& self parameter causes the type of self to be deduced as const Record& when record.serialize() is called, making self.serialize_impl() a direct, inlinable call. Unlike CRTP, the base class carries no template parameter; multiple deducing-this bases coexist without the template ambiguity problems that appear in multi-inheritance CRTP hierarchies. GCC 14, Clang 18, and MSVC 19.36 all support this feature in C++23 mode (-std=c++23 on GCC and Clang, /std:c++latest on MSVC).

It does not replace CRTP in all situations. If the derived type needs to appear as a return type or a data member in the base class at class-definition time, CRTP remains necessary. For the common case of injecting shared implementation into derived classes, deducing this is cleaner syntax with identical performance characteristics.

Choosing the Right Tool

The decision follows from what is known at compile time and how open the type hierarchy needs to be.

If the derived type is known at a specific call site and performance of that path matters, try final first. It requires minimal change and works without LTO. If the type is always determined at instantiation time and shared implementation in a base class is needed, CRTP or deducing this provide zero overhead with full compiler visibility into the call chain. If the type set is fixed and value semantics are preferable, std::variant fits well. If the type set is open, types come from external code at runtime, or stable binary interfaces across library boundaries are required, virtual dispatch is the correct choice and should be used without apology.

The optimizer can often recover from virtual dispatch when given enough structural information through final annotations, stack-allocated objects, or LTO. That is worth verifying with a profiler before investing in CRTP. When the compiler cannot help and profiling confirms the cost, static polymorphism offers zero-overhead alternatives with trade-offs that have been validated over decades of production use in Eigen, LLVM, Boost, and similar projects.
