What Virtual Dispatch Actually Costs, and When to Stop Paying It

David Álvarez Rosa’s recent piece on isocpp.org covers the basics well: virtual dispatch adds pointer indirection, bloats object layouts, and limits inlining. All true. But the framing of “hidden overhead” undersells the real problem. The overhead is not primarily the two pointer dereferences. It is what those dereferences prevent the compiler from doing to everything around them.

The Optimization Fence Nobody Mentions

An indirect call through a vtable is, from the optimizer’s perspective, a call to an unknown function. The compiler cannot see through it. It cannot prove the callee has no side effects, cannot reorder loads or stores across it, cannot eliminate redundant computations, and most critically cannot inline the body. That last part is what kills performance in tight loops.

Consider a loop calling a virtual transform(float) on each element of a buffer. Even if the implementation is a single multiply, the vectorizer cannot fuse it into a VMULPS instruction. The loop runs scalar, one element at a time, at roughly 1.0 ns per element. Once the call is devirtualized and the compiler can inline and vectorize, the same loop runs at 0.08 to 0.12 ns per element on a machine with AVX2. That is an 8 to 12x gap, and almost none of it is the vtable lookup itself. With a warm cache and a well-predicted branch, the lookup costs only a few cycles, which is why the scalar loop still manages about 1 ns per element. The vectorization opportunity it blocks is worth far more.
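A minimal sketch of the two loop shapes being compared. The names IProcessor, Doubler, apply_virtual, and apply_static are illustrative assumptions, not taken from the original article, and whether the second loop actually vectorizes depends on the compiler, flags, and target.

```cpp
#include <cstddef>

// Hypothetical interface, used only for illustration.
struct IProcessor {
    virtual ~IProcessor() = default;
    virtual float transform(float x) const = 0;
};

struct Doubler : IProcessor {
    float transform(float x) const override { return x * 2.0f; }
};

// One indirect call per element: the optimizer sees an opaque callee.
void apply_virtual(const IProcessor& p, float* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = p.transform(buf[i]);
}

// Concrete type known at compile time: the body can be inlined,
// which makes the loop a candidate for vectorization.
template <typename P>
void apply_static(const P& p, float* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = p.P::transform(buf[i]);  // qualified call bypasses virtual dispatch
}
```

Both functions compute the same result; the difference only shows up in the generated assembly and in timing. Note that the qualified call in apply_static assumes P is a concrete class with its own transform implementation.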

Post-Spectre, there is an additional tax. The retpoline mitigation for Spectre v2 (-mindirect-branch=thunk on GCC, -mretpoline on Clang) replaces every indirect call with a return-based thunk that deliberately defeats indirect branch prediction, so the CPU cannot usefully speculate into the callee. On pre-Cascade Lake Intel hardware, this adds 30 to 80 cycles per indirect call, which at 3 GHz is roughly a 10 to 27 ns surcharge per dispatch. Agner Fog’s microarchitecture manuals document the underlying CPU mechanics; the indirect branch target buffer on Skylake holds around 1024 entries, but capacity is not the main issue for a megamorphic call site: with more than four concrete types arriving in no predictable order, the predictor will miss often.

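The assembly signature to check is straightforward. On Compiler Explorer, an indirect call through a register or memory operand, such as call [rax] or call [rax+16], means the dispatch was not eliminated. A call to a named symbol means it was devirtualized and the body is inlining-eligible; no call at all means the body has already been inlined.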

What the Compiler Can Do Without Your Help

Before reaching for manual solutions, it is worth knowing when the compiler will handle this for you.

The final keyword is the cheapest intervention. Mark a class or method final and the compiler has a static guarantee that no further override exists. At -O2, GCC devirtualizes calls to a final method, or to any virtual method of a final class, when the call goes through a pointer or reference whose static type is that class:

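// Minimal interface so the snippet is self-contained.
struct IProcessor {
    virtual ~IProcessor() = default;
    virtual float transform(float x) = 0;
};

struct FastProcessor final : IProcessor {
    float transform(float x) override { return x * 2.0f; }
};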

void process(FastProcessor* p, float* buf, int n) {
    for (int i = 0; i < n; ++i)
        buf[i] = p->transform(buf[i]);  // devirtualized at -O2; VMULPS at -O3
}

No vtable lookup is generated. The loop vectorizes. The final annotation also communicates design intent, which is worth something independent of the optimization.
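The static type of the pointer or reference is what matters here. As a sketch, with hypothetical names (Scaler, TracedScaler, via_scaler, via_interface), marking only the method final still allows subclasses while keeping calls through the Scaler type direct-call eligible:

```cpp
struct IProcessor {
    virtual ~IProcessor() = default;
    virtual float transform(float x) = 0;
};

struct Scaler : IProcessor {
    // final on the method only: subclasses may exist, but none can override this.
    float transform(float x) final { return x * 2.0f; }
};

struct TracedScaler : Scaler {
    int calls = 0;  // may add state, cannot replace transform
};

// Static type Scaler: the target is known, so a direct call is possible.
float via_scaler(Scaler& s, float x) { return s.transform(x); }

// Static type IProcessor: other overrides may exist, so the call stays
// virtual unless the compiler can prove the dynamic type some other way.
float via_interface(IProcessor& p, float x) { return p.transform(x); }
```

Both functions return the same value for the same object; only the second one is expected to show an indirect call in the assembly when compiled in isolation.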

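Objects of known concrete type, typically locals whose construction the compiler can see, are usually devirtualized regardless of final, because the dynamic type is known at the call site; this can stop working once the object’s address escapes to code the optimizer cannot see. LTO (-flto, or -flto=thin on Clang) extends this cross-TU: if whole-program analysis shows that a virtual function has exactly one override across the entire linked program, its call sites can be devirtualized (Clang additionally needs -fwhole-program-vtables for this). GCC’s -fdevirtualize-speculatively (enabled by default at -O2 and above) and Clang’s PGO-driven indirect call promotion go further, inserting a guard and making a direct, inlinable call to the most likely target: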

// Conceptual shape of the compiler-generated speculation (pseudocode, not valid C++):
if (p->vptr == &DerivedClass::vtable) {
    DerivedClass::foo(p);  // direct call, inlineable
} else {
    p->foo();              // fallback indirect call
}

When one concrete type handles 80% or more of calls, this is nearly free devirtualization with no source changes. PGO-guided speculative devirtualization pairs well with microservice architectures where one implementation dominates at runtime.
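The same guard can be written by hand when PGO is not available. The sketch below is an assumption-laden illustration, not what the compiler emits: the Codec types are hypothetical, and typeid requires RTTI and is usually more expensive than the pointer comparison a compiler would generate.

```cpp
#include <typeinfo>

struct Codec {
    virtual ~Codec() = default;
    virtual int encode(int v) const = 0;
};

struct FastCodec final : Codec {
    int encode(int v) const override { return v + 1; }
};

struct SlowCodec : Codec {
    int encode(int v) const override { return v * 10; }
};

int encode_guarded(const Codec& c, int v) {
    if (typeid(c) == typeid(FastCodec)) {
        // Exact dynamic-type match: the cast is safe, and because
        // FastCodec is final the call can be direct and inlined.
        return static_cast<const FastCodec&>(c).encode(v);
    }
    return c.encode(v);  // fallback: ordinary virtual call
}
```

This only pays off when the guarded type really dominates the call site; otherwise the extra comparison is pure overhead.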

CRTP: Static Polymorphism via Template Inheritance

For cases where the compiler cannot help, the Curiously Recurring Template Pattern is the established solution. The concrete type is encoded in the base class template parameter, making every dispatch purely compile-time:

template<typename Derived>
class Shape {
public:
    float area() const {
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(float r) : radius(r) {}  // radius was otherwise never initialized
    float areaImpl() const { return 3.14159f * radius * radius; }
private:
    float radius;
};

The static_cast is resolved at compile time. No vtable, no vptr, no indirect call, no object layout overhead. Eigen’s matrix library is the canonical large-scale example: MatrixBase<Derived> is the CRTP base of its dense matrix expressions, and expression templates built on it fuse multi-step matrix operations into single vectorized loops, usually with no temporaries and with no virtual dispatch anywhere in the computation core. Hard real-time environments are another natural fit. In automotive software developed under ISO 26262 or on the AUTOSAR Adaptive Platform, teams often prefer static polymorphism on time-critical paths, because indirect calls complicate the worst-case execution time analysis that safety certification relies on.

The tradeoffs are real. Shape<Circle> and Shape<Triangle> are unrelated types with no common base, so heterogeneous containers require separate type erasure work. Template instantiations increase binary size and compile time. Error messages on misuse can be difficult to parse. The calling code must itself be templated, which propagates compile-time constraints up the call stack.
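A short sketch of what the templated-caller constraint looks like in practice. Square and scaledArea are hypothetical additions for illustration; the Shape base follows the pattern shown earlier.

```cpp
template <typename Derived>
class Shape {
public:
    float area() const {
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(float r) : radius_(r) {}
    float areaImpl() const { return 3.14159f * radius_ * radius_; }
private:
    float radius_;
};

class Square : public Shape<Square> {
public:
    explicit Square(float s) : side_(s) {}
    float areaImpl() const { return side_ * side_; }
private:
    float side_;
};

// The caller has to be a template too: there is no common,
// non-template Shape base that a plain function could accept.
template <typename D>
float scaledArea(const Shape<D>& s, float factor) {
    return s.area() * factor;
}
```

Calling scaledArea(Square(2.0f), 3.0f) instantiates a separate function from scaledArea(Circle(1.0f), 3.0f), which is where both the inlining benefit and the code-size cost come from.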

C++20 Concepts: Static Interfaces Without Inheritance

Concepts offer static dispatch without requiring inheritance at all. A type satisfies a concept structurally; it does not need to inherit from anything:

#include <concepts>  // std::convertible_to, std::same_as

template<typename T>
concept Drawable = requires(const T& t) {
    { t.area() } -> std::convertible_to<float>;
    { t.draw() } -> std::same_as<void>;
};

template<Drawable T>
void render(const T& shape) {
    shape.draw();  // static dispatch, inlining-eligible
}

Each instantiation is monomorphic and fully visible to the optimizer. Existing types can satisfy concepts retroactively without modification. Concept subsumption means the compiler automatically selects the most-constrained overload: a function constrained on std::random_access_iterator is preferred over one constrained on std::input_iterator with no explicit dispatch logic required.

The first concepts proposal was voted out of C++0x, the draft that became C++11, at the 2009 Frankfurt meeting, amid concerns about the complexity of the specification and the compile times of the experimental ConceptGCC implementation. That version relied on explicit concept maps for nominal satisfaction, closer to Rust’s impl Trait for Type blocks. The C++20 version uses structural satisfaction and is substantially simpler, at the cost of no coherence guarantees. The evolution is described in Stroustrup’s concept design papers and is worth tracing if you want to understand why the feature landed the way it did.

C++23 Deducing This: What CRTP Was Waiting For

C++23 added explicit object parameters (P0847, “deducing this”), available in GCC 14+, Clang 18+, and MSVC 19.32+ (Visual Studio 2022 17.2). This is the language feature that CRTP existed to approximate:

struct Shape {
    template<typename Self>
    float area(this Self&& self) {
        return self.areaImpl();  // resolved to the concrete type at each call site
    }
};

struct Circle : Shape {
    float areaImpl() const { return 3.14159f * radius * radius; }
    float radius;
};

Self is deduced from the static type of the object expression at each call site, so calling area() on a Circle resolves areaImpl() against Circle. Calling it through a Shape& deduces Self as Shape and fails to compile, since Shape has no areaImpl(). No static_cast, no template parameter on the base class, no boilerplate. Generated code is equivalent to CRTP: zero runtime overhead, full inlining eligibility. The same feature enables recursive lambdas without std::function allocation (auto fib = [](this auto self, int n) -> int { ... };), eliminates const-overload duplication, and cleans up builder-pattern chains. The WG21 paper lists simplifying CRTP among its motivating use cases.

std::variant for Closed Type Sets

When the set of concrete types is fixed at compile time, std::variant with std::visit is another option worth considering:

using Shape = std::variant<Circle, Rectangle, Triangle>;

float totalArea(const std::vector<Shape>& shapes) {
    float total = 0.0f;
    for (const auto& s : shapes)
        total += std::visit([](const auto& shape) { return shape.area(); }, s);
    return total;
}

Modern compilers can often lower std::visit over a small variant to an inlined switch or jump table with all arms visible to the optimizer, though the result depends on the standard library implementation. Dispatch overhead is roughly 0.5 to 2 ns. The larger benefit is layout: a std::vector<Shape> stores objects contiguously, each slot sized for the largest alternative; a std::vector<Base*> stores pointers to scattered heap allocations. For thousands of objects, cache behavior often dominates dispatch cost. Google’s Abseil team reported 1.8 to 2.4x throughput improvement in protocol buffer parsing after switching from virtual dispatch to variant-based dispatch, with the gain attributed primarily to inlining and eliminated branch mispredictions. The constraint is that the type set must be enumerable at compile time, which rules out plugin architectures and open extension points.
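For completeness, a self-contained version of the variant example. The three shape structs are hypothetical minimal definitions; the original snippet does not show them.

```cpp
#include <variant>
#include <vector>

// Plain value types: no common base class, no virtual functions.
struct Circle    { float r;    float area() const { return 3.14159f * r * r; } };
struct Rectangle { float w, h; float area() const { return w * h; } };
struct Triangle  { float b, h; float area() const { return 0.5f * b * h; } };

using Shape = std::variant<Circle, Rectangle, Triangle>;

float totalArea(const std::vector<Shape>& shapes) {
    float total = 0.0f;
    for (const auto& s : shapes)
        // The generic lambda is instantiated once per alternative.
        total += std::visit([](const auto& shape) { return shape.area(); }, s);
    return total;
}
```

Adding a fourth shape means editing the Shape alias and recompiling every user of it, which is the closed-set constraint in concrete form.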

Choosing the Right Tool

The decision is not complicated once you understand the costs. For code that the compiler can see statically, final is the lowest-effort intervention and should be the first thing you reach for. LTO plus PGO handles a large share of the remaining cases without source changes. Manual static polymorphism is appropriate for hot inner loops with known concrete types at design time, for embedded and real-time systems where virtual dispatch is prohibited or analytically inconvenient, and for library code like Eigen where the performance guarantee is part of the API contract. std::variant fits closed type sets where data locality matters as much as dispatch cost.

C++23 deducing-this means new code targeting GCC 14+ or Clang 18+ no longer needs CRTP boilerplate to get the performance. The language has converged on zero-overhead abstraction being achievable without metaprogramming gymnastics, and the compilers have converged on eliminating most of the remaining overhead automatically for code that uses final, LTO, or PGO. What remains is the judgment call of when your use case falls outside what the compiler can see.