What the Compiler Loses When You Use Virtual Dispatch

The moment you write virtual on a method, the compiler makes a note: this call site cannot be resolved until runtime. That note propagates through the optimizer and closes doors that would otherwise be open. The pointer indirection gets all the attention in benchmarks, but the optimization barriers it creates are the larger story.

David Álvarez Rosa’s piece on isocpp.org frames this well: virtual dispatch enables polymorphism at a cost of pointer indirection, larger object layouts, and reduced inlining opportunities. The inlining side deserves closer examination, because that is where the real performance story lives.

What the vtable actually does

Every object of a class with virtual methods carries a hidden pointer, the vptr, that points to that class’s vtable: an array of function pointers, one per virtual method. When you call a virtual method through a base pointer or reference, the generated assembly performs two memory loads before reaching the actual function:

mov  rax, [rdi]        ; load vptr from object
call [rax + offset]    ; load function pointer, then call

On a warm L1 cache with the branch predictor trained, this costs roughly 5-10 cycles more than a direct call. On a cold cache, where the vtable line has been evicted, the cost climbs to 100-300 cycles. That range matters in cache-pressure scenarios, but the indirection itself understates the full cost.

There is also the Spectre v2 factor. After the 2018 disclosure, operating systems and compilers deployed retpoline to prevent branch-target-buffer poisoning through indirect calls. Retpoline replaces every indirect branch with a return-based sequence that keeps the CPU from speculating to an attacker-chosen target, adding 10-25 cycles per virtual call on systems without hardware mitigations. Newer Intel microarchitectures, starting with Cascade Lake and its enhanced IBRS (eIBRS), move the mitigation into hardware and recover most of that cost, but cloud instances frequently run on older silicon where the retpoline penalty still applies.

The inlining firewall

Inlining is not primarily an optimization in itself; it is what makes other optimizations possible. When the compiler inlines a call, it can see inside the callee and apply constant propagation, dead code elimination, loop unrolling, and vectorization. When a virtual call blocks inlining, those downstream passes cannot fire.

The vectorization case is the most concrete. Consider a loop over a homogeneous collection:

for (int i = 0; i < n; i++) {
    sum += shapes[i]->area();
}

If area() is virtual, the compiler emits one indirect call per iteration. If area() can be inlined and the objects are laid out contiguously, the compiler can pack multiple elements into SIMD registers and process them together. On an AVX2-capable machine, a 256-bit register holds 8 floats or 4 doubles, so each vector instruction does the work of 8 or 4 scalar ones. The gap between scalar and vectorized execution on a tight arithmetic loop can reach 6-8x throughput for single-precision data, which dwarfs the pointer indirection overhead.
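For contrast, here is a sketch of the shape of loop a vectorizer can work with: values stored contiguously and a non-virtual area() whose body is visible at the call site. The Square type and total_area function are hypothetical. Whether a given compiler actually vectorizes this depends on flags; floating-point reductions in particular are usually vectorized only when reassociation is permitted, for example under -ffast-math.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical value type with a non-virtual area(); not from the article.
struct Square {
    float side;
    float area() const { return side * side; }
};

// The callee is visible and the data is contiguous, so after inlining the
// loop body reduces to one multiply and one add per element.
inline float total_area(const std::vector<Square>& shapes) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < shapes.size(); ++i) {
        sum += shapes[i].area();
    }
    return sum;
}
```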

Alias analysis degrades at virtual call sites too. The compiler must assume that a call through an unknown function pointer can read or write any globally reachable memory. This forces it to reload values from memory after the call instead of keeping them in registers, and can block hoisting of loop-invariant loads out of inner loops.
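A small sketch shows why the compiler has to be this conservative. It uses a plain function pointer rather than a virtual method, since the aliasing problem is the same; g_scale, bump_scale, and scaled_sum are hypothetical names.

```cpp
// Hypothetical names throughout; a sketch of why an opaque call blocks hoisting.
using Callback = void (*)();

int g_scale = 1;  // globally reachable state

void bump_scale() { g_scale += 1; }

// g_scale looks loop-invariant, but the compiler cannot keep it in a register
// across cb(): the callee is unknown here and may write to it, so the load
// has to be repeated on every iteration.
int scaled_sum(const int* data, int n, Callback cb) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        cb();
        sum += data[i] * g_scale;
    }
    return sum;
}
```

Passing bump_scale as the callback makes the hazard real: each iteration sees a different g_scale, so hoisting the load would change the result.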

When the compiler devirtualizes on its own

Compilers are not passive about this. GCC, Clang, and MSVC all implement devirtualization under specific conditions, and knowing those conditions helps you write code the compiler can actually optimize.

The most reliable trigger is final. Marking a class final tells the compiler that no further derivation is possible, so any call on that concrete type can be resolved statically:

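class Circle final : public Shape {
public:
    explicit Circle(double r) : radius(r) {}
    // public, so callers can reach it through a Circle as well as a Shape
    double area() const override { return 3.14159265 * radius * radius; }
private:
    double radius;
};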

A reference or pointer known statically to be Circle will have its virtual calls devirtualized and inlined. This is low-effort and communicates design intent alongside the optimization benefit. Leaf classes in most object hierarchies were never intended to be further derived; final makes that explicit.

Local escape analysis is another path. When an object is constructed on the stack and its address demonstrably does not escape the code the compiler can see, the compiler can prove which vtable will be loaded at every call site. This works reliably within a single file but not across translation unit boundaries without link-time optimization.
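A minimal sketch of the situation, with a hypothetical Square class that is deliberately not marked final so that the proof has to come from the local context:

```cpp
// Assumed base class; the article does not define Shape.
class Shape {
public:
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// Hypothetical derived class, intentionally not final.
class Square : public Shape {
public:
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
private:
    double side;
};

inline double local_area(double side) {
    Square sq(side);      // dynamic type is known: constructed right here
    const Shape& s = sq;  // the address never leaves this function
    return s.area();      // provably Square::area(), despite the base reference
}
```

If sq were instead passed to a function defined in another translation unit before the call, the compiler would generally lose that proof.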

Profile-guided optimization takes a probabilistic approach. After a profiling run, if the compiler observes that a large fraction of dispatches at a given call site go to one concrete type, it generates a type-check-plus-fast-path:

cmp rax, offset vtable_for_Circle
jne .fallback
; inlined Circle::area()
jmp .done
.fallback:
call [rax + offset]
.done:

The guard adds one compare-and-branch to every call; in exchange, the dominant type gets an inlined fast path and only the rare mismatch pays for a full dispatch. When one type dominates a call site in production profiling, the generated code is nearly as good as a statically resolved call.

Link-time optimization extends devirtualization across translation unit boundaries. Clang’s ThinLTO is commonly reported to capture 80-90% of the benefit of full LTO with much shorter build times, and vtable-heavy codebases can see end-to-end runtime improvements on the order of 10-15% from enabling it, though the gain depends heavily on the workload. The limitation is scope: shared libraries loaded at runtime are outside the linker’s reach, so calls that cross a plugin boundary cannot benefit.

Static polymorphism: making the type visible at compile time

When you need to eliminate virtual dispatch on a latency-sensitive path and compiler inference is insufficient, the solution is to restructure so the concrete type is known at compile time.

The Curiously Recurring Template Pattern (CRTP) is the established idiom for this:

template<typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    double area_impl() const { return 3.14159265 * radius * radius; }
    double radius;
};

The static_cast here generates no machine instructions. The compiler resolves area_impl() as a direct call, inlines it, and proceeds with full optimization. The Eigen linear algebra library uses CRTP throughout its expression template design to achieve near-BLAS throughput on generic matrix operations without any runtime dispatch.

The cost is the loss of heterogeneous collections through a single base pointer, since Shape<Circle> and Shape<Rectangle> are unrelated types. If you need to store them together, you need a separate type-erasure layer, which re-introduces some of the overhead you were trying to avoid.

C++20 concepts offer a different spelling suited to generic algorithms that operate on a single type at a time:

#include <concepts>  // std::convertible_to
#include <vector>

template<typename T>
concept HasArea = requires(const T& t) {
    { t.area() } -> std::convertible_to<double>;
};

// Instantiated once per concrete shape type S
template<HasArea S>
double total_area(const std::vector<S>& shapes) {
    double sum = 0;
    for (const auto& s : shapes) sum += s.area();
    return sum;
}

This expresses “any type satisfying this interface” without requiring inheritance. The compiler sees the concrete type at each instantiation, enabling full inlining and vectorization. Error messages are also substantially better than with unconstrained templates when callers supply the wrong type.

C++23’s deducing this cleans up some of the CRTP boilerplate. The explicit object parameter lets a base class method receive the concrete derived type without making the base class itself a template:

struct Shape {
    double area(this auto const& self) {  // no trailing const: not allowed with an explicit object parameter
        return self.area_impl();
    }
};

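GCC 14, Clang 18, and Visual Studio 2022 from version 17.2 support this. The call has to be made on an expression whose static type is the derived class, since that is the type self deduces to. The generated code is typically identical to CRTP, but without the repetitive template parameter threading through the hierarchy or the risk of accidentally instantiating through the wrong specialization.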

For closed type sets, std::variant with std::visit gives value semantics and contiguous storage:

using AnyShape = std::variant<Circle, Rectangle, Triangle>;

double area(const AnyShape& s) {
    return std::visit([](const auto& shape) {
        return shape.area();
    }, s);
}

The dispatch mechanism is typically a jump table or switch on the variant’s index rather than a vtable, and because every alternative is visible at compile time, compilers can frequently inline all arms completely. Beyond the dispatch cost, std::vector<AnyShape> stores objects contiguously in memory, which is categorically better for cache behavior than std::vector<Shape*> where each pointer points to a separate heap allocation. On cache-cold data, the pointer-chasing cost can exceed the virtual dispatch overhead several times over.

The constraint is closure: adding a fourth shape variant requires recompiling every translation unit that uses AnyShape. Where extensibility matters more than locality, the open world of virtual dispatch remains the right model.

Choosing deliberately

The case for virtual dispatch is still real. Open type sets, plugin architectures, and application-level code where performance headroom is comfortable all favor it. The ergonomics of a clean abstract interface class are genuinely valuable, and the overhead is invisible in most contexts.

The final keyword is the lowest-effort intervention. Marking concrete leaf classes final often gives the compiler what it needs without restructuring the design at all. It is also a meaningful design statement: this class is complete as written.

On hot paths where the type set is fixed, CRTP and std::variant with std::visit are the two most useful alternatives. CRTP fits when each call site works with one concrete type known at compile time; std::variant fits when a closed set of different types must be stored together with value semantics. Concepts suit generic algorithms where the caller supplies the type. Deducing this covers the same ground as CRTP with less boilerplate, on compilers that support C++23.

The underlying principle across all of them is giving the compiler full information about which function will be called. When the compiler has that information, it can inline, vectorize, eliminate dead code, and optimize across call boundaries. The choice of polymorphism mechanism is, in large part, a choice about how much of that information to make available and at which stage of compilation.