
The Hidden Cost of Virtual Dispatch Isn't the Call, It's the Inlining You Lose

Source: isocpp

Virtual functions are one of the most overused features in C++. Not because polymorphism is wrong, but because the cost model is widely misunderstood. Most developers think the overhead is the two pointer indirections, the vtable load and the function pointer dereference. Those matter, but they are not the real problem. The real problem is what the compiler stops doing once it cannot see through a virtual call.

David Álvarez Rosa’s recent article on isocpp.org lays out the mechanics of virtual dispatch overhead and the case for static polymorphism. It is a solid primer. What I want to do here is go deeper on the specific mechanism that makes virtual dispatch genuinely expensive, trace through what compilers actually attempt to fix, and examine the realistic trade-offs between CRTP, std::variant, and the newer C++23 patterns that have started to make the old CRTP idiom feel less necessary.

What the vtable Actually Costs

When a class declares any virtual function, the compiler inserts a hidden pointer (the vptr) into every object instance, typically at the start. On a 64-bit platform, that is eight bytes per object in a single-inheritance hierarchy, regardless of how many virtual methods you have. All instances of the same concrete type share a single vtable, a static array of function pointers that mainstream ABIs lay out in declaration order.
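A quick way to see the hidden pointer is to compare object sizes. The sketch below uses two hypothetical structs, PlainPoint and VirtualPoint, that differ only in having virtual functions. Exact sizes are ABI-dependent, so the check only asserts that the polymorphic version grows by at least one pointer.

```cpp
#include <cstddef>

// Hypothetical types for illustration; they differ only in the virtual functions.
struct PlainPoint {
    double x;
    double norm() const { return x < 0 ? -x : x; }
};

struct VirtualPoint {
    double x;
    virtual double norm() const { return x < 0 ? -x : x; }
    virtual ~VirtualPoint() = default;  // a second virtual function adds no per-object cost
};

// Extra bytes each object pays for being polymorphic.
constexpr std::size_t vptr_overhead = sizeof(VirtualPoint) - sizeof(PlainPoint);

// Holds on the common 64-bit ABIs; the standard itself does not mandate a layout.
static_assert(vptr_overhead >= sizeof(void*), "expected room for a vptr");
```

On a typical x86-64 toolchain the two sizes come out as 8 and 16 bytes, which doubles the footprint of a small object.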

A call through a base pointer, shape->area(), compiles to roughly two loads and an indirect call:

mov  rax, [rdi]       ; load vptr from object (pointer chase #1)
call [rax + 0]        ; load function pointer from vtable slot, then call (pointer chase #2)

For a hot loop over a homogeneous array of the same type, those vtable entries stay warm in L1. You pay maybe five to ten cycles over a direct call. Not catastrophic.

The real cost appears in two scenarios. First, a heterogeneous array of mixed derived types. The objects and their vtables sit at scattered addresses and the call target changes from one element to the next, so each dispatch can miss the cache or mispredict the indirect branch; a miss costs 50-200 cycles depending on where the data lands. Second, and more significant, the indirect call prevents the compiler from inlining the callee.

Why Inlining Loss Dominates

Inlining is not just about removing call overhead. When the compiler inlines a function into its call site, it gains visibility into both the caller and the callee simultaneously. That visibility enables:

  • Constant propagation: values known at the call site propagate into the callee, eliminating dead branches
  • Loop vectorization: arithmetic in the callee can be merged into a SIMD loop in the caller
  • Register allocation across the call boundary: no spill-and-reload for values that would otherwise be clobbered by the calling convention

Consider a loop summing the area of ten million circles:

double total = 0.0;
for (auto* s : shapes) {
    total += s->area();   // virtual call
}

With virtual dispatch, the compiler sees an indirect call and a scalar add. It cannot vectorize. Each iteration pays for the call, and the additions form a serial dependency chain on total, so the loop handles one element at a time no matter how wide the hardware's vector units are.

With a known concrete type or an inlined implementation:

for (const auto& c : circles) {
    total += M_PI * c.r * c.r;  // inlined
}

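Now the compiler sees plain arithmetic and can auto-vectorize it. With AVX2, a 256-bit register holds four doubles, so each vector instruction does the work of four scalar ones. (Because floating-point addition is not associative, GCC and Clang generally vectorize a reduction like this only when reassociation is permitted, for example with -ffast-math.) That is not a marginal improvement. It is a structural change in what the hardware is doing. The five-to-ten-cycle virtual call overhead was never the bottleneck. The missed vectorization was.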
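To compare the two loops yourself, a self-contained sketch along these lines can be compiled at -O2 and its assembly inspected. The names kPi, PlainCircle, total_area_virtual, and total_area_concrete are illustrative assumptions, not taken from the original article. Whether the second loop actually vectorizes depends on the compiler and flags in use.

```cpp
#include <memory>
#include <vector>

constexpr double kPi = 3.14159265358979323846;  // local constant; M_PI is not standard C++

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle : Shape {
    explicit Circle(double radius) : r(radius) {}
    double area() const override { return kPi * r * r; }
    double r;
};

// Non-polymorphic counterpart: no vptr, and area() is trivially inlinable.
struct PlainCircle {
    double r;
    double area() const { return kPi * r * r; }
};

// Opaque to the optimizer: every iteration goes through an indirect call.
double total_area_virtual(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double total = 0.0;
    for (const auto& s : shapes) total += s->area();
    return total;
}

// Transparent to the optimizer: the loop body reduces to plain arithmetic.
double total_area_concrete(const std::vector<PlainCircle>& circles) {
    double total = 0.0;
    for (const auto& c : circles) total += c.area();
    return total;
}
```

Both functions compute the same sum; the difference is only in how much of the loop body the optimizer can see.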

What Compilers Attempt to Fix

Compilers are not passive about this. Both GCC and Clang run devirtualization passes at -O2 and above.

The simple case works well: a locally declared object with no re-assignment to a more general type.

void foo() {
    Circle c(5.0);
    Shape& s = c;   // compiler knows s is a Circle
    s.area();       // devirtualized to Circle::area(), possibly inlined
}

The more powerful case requires Link-Time Optimization. With -flto, GCC builds a Class Hierarchy Analysis across all translation units. If only one class in the entire program overrides Shape::area(), every call through a Shape* gets devirtualized. GCC also supports -fdevirtualize-speculatively, which emits an inline type check with a direct-call fast path and an indirect fallback:

cmp [rdi], Circle_vtable_ptr   ; is this definitely a Circle?
jne .fallback
call Circle::area              ; direct, potentially inlined
jmp .done
.fallback:
mov rax, [rdi]
call [rax + 0]
.done:

For workloads that are monomorphic in practice but polymorphic in type, this speculation can recover most of the inlining benefit without changing the code.
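The same guard-then-direct-call shape can be written by hand, which makes the transformation easier to picture. This is only an analogue under stated assumptions: area_speculative is a hypothetical helper, and it tests the type with typeid, whereas the compiler's own check compares pointers and is cheaper.

```cpp
#include <typeinfo>

constexpr double kPi = 3.14159265358979323846;  // local constant; M_PI is not standard C++

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle : Shape {
    explicit Circle(double radius) : r(radius) {}
    double area() const override { return kPi * r * r; }
    double r;
};

struct Square : Shape {
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
    double side;
};

// Hand-written analogue of speculative devirtualization (hypothetical helper).
double area_speculative(const Shape& s) {
    if (typeid(s) == typeid(Circle)) {
        // Exact dynamic type confirmed: the qualified call bypasses the vtable
        // and is a candidate for inlining.
        return static_cast<const Circle&>(s).Circle::area();
    }
    return s.area();  // fallback: ordinary virtual dispatch
}
```

In practice it is usually better to let the compiler or profile-guided optimization insert such guards, since a hand-written one hard-codes a guess about which type is hot.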

Marking a class or method final is the cheapest intervention, and the one that depends least on optimizer heuristics:

class Circle final : public Shape { ... };  // compiler can devirtualize all Circle* calls

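final is a promise to the compiler that no further derivation will override the method. It costs nothing at runtime, takes seconds to add, and enables devirtualization without LTO whenever the static type at the call site is the final class (a Circle* or Circle&), even across shared library boundaries where LTO cannot reach. It does not help a call made through a plain Shape*, where the dynamic type is still unknown.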
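A minimal sketch of where final helps and where it does not. The two free functions are hypothetical names chosen for illustration; the only difference between them is the static type of the parameter.

```cpp
constexpr double kPi = 3.14159265358979323846;  // local constant; M_PI is not standard C++

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle final : Shape {
    explicit Circle(double radius) : r(radius) {}
    double area() const override { return kPi * r * r; }
    double r;
};

// Static type is the final class: no further override can exist, so the
// compiler is allowed to call Circle::area() directly and inline it.
double area_of_circle(const Circle& c) { return c.area(); }

// Static type is the base: `final` on Circle does not by itself tell the
// compiler which override runs, so this normally stays a virtual call.
double area_of_shape(const Shape& s) { return s.area(); }
```

Both functions return the same value for a Circle; comparing their generated assembly at -O2 is a straightforward way to confirm what a given compiler does.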

Devirtualization fails for heap-allocated objects with unknown dynamic types, calls across shared library boundaries where vtable pointers resolve at load time, and any case where the compiler cannot prove the concrete type. These are common in real codebases. The compiler helps, but it cannot fix everything.

CRTP: Static Polymorphism Without the Overhead

For cases where the set of types is known at compile time and heterogeneous containers are not required, CRTP (Curiously Recurring Template Pattern) resolves dispatch entirely at compile time.

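template <typename Derived>
class ShapeBase {
public:
    double area() const {
        // Resolved at compile time: Derived is known, so this is a direct call.
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Circle : public ShapeBase<Circle> {
public:
    explicit Circle(double r) : r_(r) {}  // r_ is private, so a constructor is needed
    double area_impl() const { return M_PI * r_ * r_; }
private:
    double r_;
};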

The static_cast is free at runtime. area_impl() is a direct call the compiler inlines. There is no vtable, no pointer indirection, no indirect branch. A loop over std::vector<Circle> using this interface vectorizes as readily as hand-written arithmetic.
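To show the interface in use, here is a sketch with two CRTP shapes and one function template that works with either. Square, total_area, and kPi are assumptions added for illustration. Each instantiation of total_area is resolved statically for its element type.

```cpp
#include <vector>

constexpr double kPi = 3.14159265358979323846;  // local constant; M_PI is not standard C++

template <typename Derived>
class ShapeBase {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Circle : public ShapeBase<Circle> {
public:
    explicit Circle(double r) : r_(r) {}
    double area_impl() const { return kPi * r_ * r_; }
private:
    double r_;
};

class Square : public ShapeBase<Square> {
public:
    explicit Square(double side) : side_(side) {}
    double area_impl() const { return side_ * side_; }
private:
    double side_;
};

// Works for any CRTP shape, but each vector holds exactly one concrete type.
template <typename Derived>
double total_area(const std::vector<Derived>& shapes) {
    double sum = 0.0;
    for (const ShapeBase<Derived>& s : shapes) sum += s.area();
    return sum;
}
```

Calling total_area on a std::vector<Circle> and on a std::vector<Square> produces two separate instantiations, which is the source of both the speed and the code-size cost.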

C++20 concepts complement CRTP by formalizing the interface requirements:

template <typename T>
concept Shape = requires(const T s) {
    { s.area() }       -> std::convertible_to<double>;
    { s.perimeter() }  -> std::convertible_to<double>;
};

template <Shape S>
double total_area(const std::vector<S>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) sum += s.area();
    return sum;
}

Concepts do not change the runtime behavior. They provide machine-checked documentation of what the template requires, and they produce readable error messages at the call site rather than deep inside template instantiation output. A type that satisfies Shape does not need to inherit from anything.

The trade-offs are real. CRTP cannot store mixed types in the same container. ShapeBase<Circle> and ShapeBase<Square> are unrelated types after instantiation. Each instantiation generates separate code, inflating binary size. Compile times grow with deep template hierarchies. Stack traces involve mangled names that are tedious to decode. For tooling-heavy workflows, especially large team codebases, these costs add up.

std::variant for Closed-Set Heterogeneous Polymorphism

When the set of types is fixed but you do need heterogeneous storage, std::variant offers a different approach. It is a discriminated union: one fixed-size value slot, a tag indicating which type is active, and std::visit to dispatch based on that tag.

using ShapeVar = std::variant<Circle, Square, Triangle>;

struct AreaVisitor {
    double operator()(const Circle&   c) const { return M_PI * c.r * c.r; }
    double operator()(const Square&   s) const { return s.side * s.side; }
    double operator()(const Triangle& t) const { return 0.5 * t.b * t.h; }
};

std::vector<ShapeVar> shapes = { Circle{5.0}, Square{3.0}, Triangle{3.0, 4.0} };  // Triangle holds base b and height h
double total = 0.0;
for (const auto& sv : shapes) {
    total += std::visit(AreaVisitor{}, sv);
}

std::visit typically dispatches through a table of function pointers or a switch on the tag, not unlike a vtable. For a single call, the overhead is similar. The advantages come from two other sources.

First, memory layout. A std::vector<ShapeVar> is contiguous. Each element is roughly sizeof(largest_type) + sizeof(discriminant) bytes, plus any alignment padding, stored inline. A std::vector<Shape*> is a contiguous array of pointers, but the objects themselves are scattered on the heap. The variant layout eliminates the pointer indirection to the object entirely, which reduces cache misses in heterogeneous traversals significantly. Published benchmarks have reported roughly 3x improvement in heterogeneous loop throughput for variant versus pointer-to-virtual, primarily from cache behavior, though the exact figure depends on the workload and hardware.

Second, the compiler sees the complete type set. For small N (two or three alternatives), Clang typically compiles std::visit into a direct branch tree with each branch inlined, producing the same quality code as a handwritten if/else if chain. The closed-world assumption lets the optimizer do its job.
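Both points can be made concrete in a short C++17 sketch. The field names and the overloaded helper are assumptions for illustration; overloaded is the common idiom for building a visitor from lambdas and is not part of the standard library.

```cpp
#include <variant>
#include <vector>

constexpr double kPi = 3.14159265358979323846;  // local constant; M_PI is not standard C++

struct Circle   { double r; };
struct Square   { double side; };
struct Triangle { double b, h; };

using ShapeVar = std::variant<Circle, Square, Triangle>;

// Elements are stored inline: room for the largest alternative plus the tag
// (and whatever alignment padding the implementation needs).
static_assert(sizeof(ShapeVar) > sizeof(Triangle), "tag adds to the largest alternative");

// Builds one visitor out of several lambdas.
template <class... Fs> struct overloaded : Fs... { using Fs::operator()...; };
template <class... Fs> overloaded(Fs...) -> overloaded<Fs...>;

double total_area(const std::vector<ShapeVar>& shapes) {
    double total = 0.0;
    for (const auto& sv : shapes) {
        // Every alternative must be handled, or this fails to compile.
        total += std::visit(overloaded{
            [](const Circle& c)   { return kPi * c.r * c.r; },
            [](const Square& s)   { return s.side * s.side; },
            [](const Triangle& t) { return 0.5 * t.b * t.h; }
        }, sv);
    }
    return total;
}
```

Because the visitor bodies are visible at the call site, each one is a candidate for inlining once the tag has been tested.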

The constraint is that the type set is fixed at compile time. Adding a new shape requires updating the variant typedef and every visitor, recompiling everything. For extensible type hierarchies, virtual dispatch remains the right tool.

C++23 Changes the CRTP Idiom

C++23 introduced explicit object parameters (P0847), sometimes called “deducing this”. This enables a pattern that previously required CRTP boilerplate:

struct Shape {
    // C++23: the explicit object parameter `self` captures the derived type
    template <typename Self>
    double area(this Self&& self) {
        return std::forward<Self>(self).area_impl();
    }
};

struct Circle : Shape {
    double area_impl() const { return M_PI * r * r; }
    double r;
};

Circle c{{}, 5.0};  // aggregate init: empty braces for the Shape base, then r
c.area();           // calls Circle::area_impl() directly, no vtable

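The compiler deduces Self from the static type of the object at the call site (Circle& here) and generates a direct call to area_impl(). This achieves the same zero-overhead dispatch as CRTP without the template <typename Derived> class ShapeBase boilerplate or the awkward static_cast<Derived*>(this) in every method. It is cleaner and produces clearer error messages. Because deduction uses the static type, calling area() through a Shape& deduces Shape and fails to compile, so this remains static polymorphism rather than a replacement for virtual dispatch.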

This feature also simplifies recursive CRTP patterns common in expression template libraries. Code that previously required two or three levels of template machinery can often collapse to a single struct with explicit object parameters.

Practical Decision Points

For most application code, virtual dispatch is the right choice. The ergonomics are better, the compiler tooling is better, and the performance impact is negligible unless you are in a hot inner loop.

The cases where switching to static polymorphism is worth the complexity cost are narrow but real:

  • Tight numerical loops where the type is homogeneous and vectorization matters. CRTP or plain templates.
  • Heterogeneous traversals with a fixed type set where cache locality drives performance. std::variant.
  • Embedded targets where vtable pointers per object are unacceptable. CRTP or policy-based design.
  • Library interfaces where compile-time composability matters more than runtime extensibility. Concepts plus templates.

The one change that should be a reflex: mark leaf classes final. It costs nothing at runtime and helps the compiler devirtualize in more contexts, especially when LTO is not in use. Most class hierarchies in practice have far more effectively final classes than they bother to annotate.

Virtual dispatch is a runtime mechanism solving a runtime problem: behavior that varies by type, determined after the program is compiled. Static polymorphism is a compile-time mechanism solving a compile-time problem: behavior that varies by type, determined before the program runs. When you know the answer at compile time, giving it to the compiler produces better code. The discipline is knowing which situation you are actually in.
