The Optimization Fence: What Virtual Dispatch Actually Costs in C++

Source: isocpp

David Álvarez Rosa’s piece on devirtualization and static polymorphism covers the mechanics well, but there is a framing issue that trips people up when they first encounter this topic: the vtable lookup is not really the problem. The indirection cost of a virtual call — loading the vptr from the object, indexing the vtable, then jumping through a function pointer — is real but modest on warm cache. The deeper issue is what that indirect call tells the compiler it is not allowed to do. Call it an optimization fence.

What Virtual Dispatch Actually Does

Every class with virtual functions carries a hidden pointer: the vptr, typically at offset zero in the object layout. It points to a per-class vtable, a static array of function pointers. A call through a base pointer or reference becomes:

// Roughly what the compiler generates for p->draw()
mov rax, [p]           // load vptr from the object
mov rax, [rax + 8]     // load the function pointer from the vtable slot (offset depends on slot)
call rax               // indirect call through that function pointer

The vptr load and the vtable slot load are memory reads, which matters when the object is cold in cache. But when typical OOP workloads are profiled, these loads are rarely the bottleneck. What matters more is the indirect call instruction itself, because the compiler cannot see through it.
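To make the mechanism concrete, here is a minimal hand-rolled sketch of the same layout in plain C++, with no virtual keyword involved. Every name in it (ShapeVTable, Shape, Square, call_area) is hypothetical and chosen for illustration; real compilers follow their platform ABI, which differs in detail.

```cpp
struct Shape;

// One table of function pointers per class: the "vtable".
struct ShapeVTable {
    double (*area)(const Shape*);
    int (*sides)(const Shape*);
};

// The "base class" holds only the hidden pointer the compiler would add.
struct Shape {
    const ShapeVTable* vptr;
};

struct Square {
    Shape base;   // first member, so a Shape* to it can be viewed as a Square*
    double side;
};

double square_area(const Shape* self) {
    const Square* sq = reinterpret_cast<const Square*>(self);
    return sq->side * sq->side;
}

int square_sides(const Shape*) { return 4; }

// A single static table shared by every Square instance.
const ShapeVTable square_vtable = { square_area, square_sides };

Square make_square(double side) {
    return Square{ Shape{ &square_vtable }, side };
}

// What p->area() lowers to: load vptr, load slot, indirect call.
double call_area(const Shape* p) {
    const ShapeVTable* table = p->vptr;         // load vptr
    double (*fn)(const Shape*) = table->area;   // load the slot
    return fn(p);                               // indirect call
}
```

Inside call_area the compiler knows nothing about fn beyond its signature, which is exactly the position it is in at a virtual call site.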

The Real Cost: The Optimizer Stops Here

When the compiler encounters an indirect call, it must assume the worst: the callee can read and write any memory visible to it, alias any pointer, and produce any value. That assumption severs several optimization chains simultaneously.

The most painful is auto-vectorization. Consider a tight loop:

void transform_all(Shape** shapes, int n) {
    for (int i = 0; i < n; i++)
        shapes[i]->process();  // virtual call
}

Even if process() is a trivial scalar multiply, the vectorizer cannot fuse iterations into SIMD because it cannot see inside the call. On AVX2 hardware a 256-bit register holds 8 floats, so a vectorized loop can handle 8 elements per instruction where a scalar loop through virtual dispatch handles one. The overhead is not the 5-cycle call cost; it is the throughput loss, up to 8x in this case, from a vectorization gate the compiler cannot open.

Inlining is the prerequisite for most of what optimizers do downstream: constant propagation, dead code elimination, loop-invariant code motion, common subexpression elimination. Virtual dispatch blocks inlining at the call site, and everything that depends on it goes with it. The call overhead in isolation is a distraction.
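The contrast can be sketched with two versions of the same loop over a contiguous float array. The names Scaler, Doubler, scale_dynamic, and scale_static are hypothetical, and whether the second loop is actually vectorized depends on the compiler and flags; the point is only that the callee body is visible in one case and not in the other.

```cpp
#include <cstddef>

struct Scaler {
    virtual ~Scaler() = default;
    virtual float apply(float x) const = 0;
};

struct Doubler final : Scaler {
    float apply(float x) const override { return x * 2.0f; }
};

// Opaque: the dynamic type of s is unknown here, so each iteration
// is an indirect call the optimizer has to treat as a black box.
void scale_dynamic(const Scaler& s, float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = s.apply(data[i]);
}

// Transparent: Op is a concrete type at instantiation time. With a
// final class such as Doubler, the call can be bound directly and
// inlined, leaving a plain multiply loop for the vectorizer.
template <typename Op>
void scale_static(const Op& op, float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = op.apply(data[i]);
}
```

Both functions compute the same result; comparing their generated assembly at -O2 or -O3 is a quick way to see what the fence costs on a given toolchain.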

This situation got materially worse after 2018. Spectre v2 exploits the CPU's branch target buffer, which predicts the targets of indirect branches, to speculatively redirect indirect calls. The mitigation, retpoline, replaces indirect jumps with a convoluted sequence that redirects speculative execution through the return stack buffer. On pre-Cascade-Lake Intel hardware still common in cloud fleets, retpoline adds 30 to 80 cycles per indirect call. On newer hardware with eIBRS the penalty drops to 4 to 6 cycles, but that is still a permanent tax on every virtual dispatch, and the optimization fence remains in place regardless.

When the Compiler Does the Work for You

Before reaching for CRTP or concepts, it is worth knowing what the compiler can already devirtualize automatically.

The final keyword is the clearest signal you can give:

class FastTransform final : public Transform {
public:
    void process() override { /* ... */ }
};

void run(FastTransform* t) {
    t->process();  // devirtualized: compiler knows exact type
}

With final, the compiler sees that no further subclass can exist and converts the indirect call to a direct call. That restores inlining eligibility, which restores vectorization and everything that depends on it. Marking leaf classes final is often the cheapest intervention, and it communicates design intent beyond its optimization value.

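For locally constructed objects, the compiler can often prove the dynamic type even without final, as long as the object's address does not escape:

void run() {
    FastTransform t;     // concrete type known, address doesn't escape
    Transform* p = &t;   // base pointer, but the dynamic type is provable
    p->process();        // devirtualized automatically
}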

Profile-guided optimization adds speculative devirtualization: the compiler inserts a type guard and inlines the hot-path type, falling back to virtual dispatch only for rare alternatives. Link-time optimization extends all of this across translation unit boundaries. Neither LTO nor PGO should be considered optional for performance-sensitive builds.
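Speculative devirtualization can be approximated by hand, which makes the transformation easier to picture. The sketch below is an assumption-laden illustration: compilers typically compare the vtable entry rather than calling typeid, and the classes here (with an int-returning process and a hypothetical RareTransform) are invented so the behavior is observable.

```cpp
#include <typeinfo>

struct Transform {
    virtual ~Transform() = default;
    virtual int process(int x) = 0;
};

struct FastTransform : Transform {
    int process(int x) override { return x + 1; }
};

struct RareTransform : Transform {
    int process(int x) override { return x * 10; }
};

int run_guarded(Transform* t, int x) {
    // Type guard for the type the profile says is hot.
    if (typeid(*t) == typeid(FastTransform)) {
        // A qualified call is bound statically: direct and inlinable.
        return static_cast<FastTransform*>(t)->FastTransform::process(x);
    }
    // Everything else falls back to ordinary virtual dispatch.
    return t->process(x);
}
```

The guard costs one well-predicted compare on the hot path, and in exchange the hot callee becomes visible to the inliner.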

These techniques have limits: megamorphic call sites with many concrete types, function pointers stored in containers and called later, and any scenario where the concrete type is genuinely unknown at compile time. For those, you need a different strategy.

CRTP: Static Dispatch Through Templates

The Curiously Recurring Template Pattern has been the standard C++ answer to compile-time polymorphism for decades. The idea is that the base class is itself a template parameterized on the derived class, which gives it access to the concrete type:

template <typename Derived>
class TransformBase {
public:
    void process() {
        // Cast to derived, call implementation
        static_cast<Derived*>(this)->process_impl();
    }
};

class ScaleTransform : public TransformBase<ScaleTransform> {
public:
    void process_impl() {
        // Concrete implementation — no virtual dispatch
    }
};

The static cast is resolved at compile time. There is no vtable, no vptr, and no indirect call. The compiler sees the concrete process_impl function and can inline it, vectorize across it, and optimize through it freely. The abstraction costs nothing at runtime.
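A slightly fuller sketch shows how the pattern looks once the transform carries state and is applied to data. The float-returning signatures, the factor_ member, and the apply_all helper are assumptions added for illustration and are not part of the original example.

```cpp
#include <cstddef>

template <typename Derived>
class TransformBase {
public:
    // Public interface: forwards to the derived implementation.
    float process(float x) const {
        return static_cast<const Derived*>(this)->process_impl(x);
    }
};

class ScaleTransform : public TransformBase<ScaleTransform> {
public:
    explicit ScaleTransform(float factor) : factor_(factor) {}
    float process_impl(float x) const { return x * factor_; }
private:
    float factor_;
};

// Code that uses the interface is itself a template; Derived is
// deduced from the argument, so every call is statically bound.
template <typename Derived>
void apply_all(const TransformBase<Derived>& t, float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = t.process(data[i]);
}
```

Calling apply_all(ScaleTransform(2.0f), data, n) instantiates a loop in which process and process_impl are both ordinary, inlinable member functions.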

The tradeoffs are real. CRTP pushes complexity into the template hierarchy, makes error messages from misuse difficult to parse, and means that code using the interface must also be templated. You lose the ability to store mixed concrete types behind a single base pointer in a non-template container — that requires type erasure or virtual dispatch. CRTP suits cases where the concrete type is always known at the call site, which in practice covers most latency-sensitive inner loops.

A common extension of the pattern adds a free function template as the interface:

template <typename T>
void process_all(T* items, int n) {
    for (int i = 0; i < n; i++)
        items[i].process_impl();  // monomorphic per instantiation
}

Each instantiation of process_all is monomorphic and fully optimizable. The compiler generates a separate version for each concrete type, which costs binary size but recovers all optimizations.

C++20 Concepts: Cleaner Constraints

C++20 concepts give static polymorphism better syntax and far better error messages than raw CRTP:

#include <concepts>  // std::same_as

template <typename T>
concept Transformable = requires(T t) {
    { t.process_impl() } -> std::same_as<void>;
};

template <Transformable T>
void process_all(T* items, int n) {
    for (int i = 0; i < n; i++)
        items[i].process_impl();
}

The requires expression documents exactly what operations the type must support. When a type fails to satisfy the concept, the compiler error names the unsatisfied constraint rather than dumping pages of template instantiation trace. The dispatch mechanism is the same: fully static, no vtable, full inlining eligibility.

Concepts also enable subsumption: a more constrained overload automatically takes priority over a less constrained one. This lets you write fast paths for types with richer interfaces without explicit specialization:

template <Transformable T>
void run(T& t) { t.process_impl(); }  // generic path

template <BatchTransformable T>  // BatchTransformable subsumes Transformable
void run(T& t) { t.batch_process_impl(); }  // selected automatically

The compiler resolves which overload to call based on constraint satisfaction, entirely at compile time.

Making the Call

The decision between virtual dispatch and static polymorphism is not primarily about which is faster in isolation. It is about the usage pattern.

Virtual dispatch is the right choice when concrete types are assembled at runtime (plugin systems, configuration-driven factories), when mixed-type collections behind a common interface are necessary, or when binary size and compile times matter more than raw throughput. The final specifier and LTO will recover most of the cost on latency-sensitive paths if the set of concrete types is small.

Static polymorphism through templates or CRTP is appropriate when the concrete type is known at every call site, the operation is on a hot inner loop, or the abstraction is internal to a module rather than a public API boundary. C++20 concepts are the current best option for expressing those constraints with readable error output and clean overload resolution.

The practical approach is to write the clean virtual design first, profile, and then apply targeted interventions: final on leaf classes, LTO and PGO in the build pipeline, and CRTP or concept-constrained templates at the specific hot paths the profiler identifies. The abstract design and the optimized implementation are not mutually exclusive; they just operate at different layers.