The Inlining Gap: Virtual Dispatch Overhead and Static Polymorphism in Modern C++

Every virtual call in C++ performs three operations that a direct call does not: load the vptr from the object (in the common ABIs it sits at the start of the object), load the function pointer from the vtable at a fixed offset, then execute an indirect call. On a modern CPU with hot L1 cache, those two memory loads cost roughly 4 to 8 cycles. The indirect branch predictor adds another 3 to 5 cycles when it sees multiple call targets in rotation. That is the dispatch overhead most performance discussions lead with.

It is not where performance is usually lost.

The larger cost is what the optimizer stops doing once it encounters an indirect call. It cannot inline the callee. Without inlining, constant propagation stops at the call boundary, loop vectorization is blocked by any call inside the loop body, dead code elimination cannot propagate across the site, and calling-convention overhead consumes registers that would otherwise hold live computed values. Chandler Carruth’s well-known CppCon 2015 demonstration showed a tight numerical loop dropping from roughly 18 ns per iteration with virtual dispatch to around 1.2 ns after switching to CRTP. The speedup came from the inlining that followed: once the callee body was visible to the optimizer, the loop was auto-vectorized. The indirect call overhead itself was a small fraction of the story.
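To make the inlining gap concrete, here is a minimal sketch of the same element-wise loop written two ways. The names Op, Scale, transform_virtual, and transform_static are hypothetical and chosen only for illustration; they do not come from the original article or the talk. The point is what the optimizer can see in each loop body, not the exact timings.

```cpp
#include <vector>

// Hypothetical interface, for illustration only.
struct Op {
    virtual ~Op() = default;
    virtual float apply(float x) const = 0;
};

struct Scale final : Op {
    float k;
    explicit Scale(float factor) : k(factor) {}
    float apply(float x) const override { return k * x; }
};

// Opaque callee: inside this function the dynamic type of `op` is unknown,
// so every iteration goes through an indirect call. The loop body is a
// black box to the vectorizer unless the function is first inlined into a
// caller where the type is provable.
void transform_virtual(const Op& op, std::vector<float>& xs) {
    for (float& x : xs) x = op.apply(x);
}

// Visible callee: F is a concrete type at instantiation time, so f(x) can be
// inlined and the loop reduces to a plain multiply that can be vectorized.
template <typename F>
void transform_static(F f, std::vector<float>& xs) {
    for (float& x : xs) x = f(x);
}
```

Both functions compute the same result. Comparing their generated assembly at -O2 or -O3 is a quick way to see whether the call survived in the loop.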

This article by David Álvarez Rosa frames virtual dispatch as a performance problem with two components: pointer indirection and missed inlining. That framing is correct. The rest of this post maps out the concrete techniques available, from free compiler-level fixes to explicit design changes, and where C++23 has simplified the trade-offs.

What the Compiler Can Do On Its Own

Compilers have been attacking this problem for decades through devirtualization, converting an indirect virtual call to a direct call when the dynamic type can be proven or estimated.

The simplest form requires no code changes. GCC and Clang both perform intra-procedural type propagation. If you allocate a concrete type on the stack and pass its address as a base pointer, the compiler often proves the dynamic type and emits a direct call at -O2:

void foo() {
    Dog d;
    Animal* p = &d;
    p->speak(); // GCC -O2: emits "call Dog::speak" directly
}

Stack-allocated objects and locally-scoped unique_ptr<Derived> are common cases where this works. The limit is pointer escape: once the pointer is stored in a global, passed to another translation unit, or returned from a function, the compiler loses the type information and falls back to the indirect call.
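A self-contained version of that contrast might look like the sketch below. The class definitions are assumptions filled in for illustration (the snippet above leaves Animal and Dog undefined), and whether a direct call is actually emitted depends on the compiler and optimization level.

```cpp
#include <string>

// Assumed definitions, for illustration only.
struct Animal {
    virtual ~Animal() = default;
    virtual std::string speak() const { return "..."; }
};

struct Dog : Animal {
    std::string speak() const override { return "woof"; }
};

// The object is created here and its address never leaves the function,
// so the compiler can usually prove the dynamic type is Dog.
std::string local_call() {
    Dog d;
    const Animal* p = &d;
    return p->speak();
}

// The reference arrives from outside: it may refer to an Animal, a Dog, or
// any other subclass, so an indirect call is the expected result unless the
// function is inlined into a caller that knows more.
std::string escaped_call(const Animal& a) {
    return a.speak();
}
```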

The final specifier, added in C++11, is the cheapest annotation you can add. Marking a class final tells both the compiler and the reader that no subclass will ever exist:

class Dog final : public Animal {
public:
    void speak() override; // public, so bark() below can call it
};

void bark(Dog* d) {
    d->speak(); // Always devirtualized: no subclass of Dog is possible
}

GCC and Clang both exploit final aggressively at -O2. If your class is effectively sealed, adding final costs nothing at runtime and frequently unlocks inlining without any other design change. This is the first thing to check before reaching for CRTP.
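The specifier can also be applied to a single virtual function when the class itself has to stay open for inheritance. The sketch below uses hypothetical types (Bird, Parrot, and the helper functions are not from the original article) to show both placements.

```cpp
// Hypothetical hierarchy, for illustration only.
struct Animal {
    virtual ~Animal() = default;
    virtual int legs() const = 0;
    virtual int volume() const = 0;
};

// Whole class sealed: every virtual call through a Dog& has one possible target.
struct Dog final : Animal {
    int legs() const override { return 4; }
    int volume() const override { return 7; }
};

// Class left open, but legs() is sealed: subclasses may still override volume().
struct Bird : Animal {
    int legs() const final { return 2; }
    int volume() const override { return 3; }
};

struct Parrot : Bird {
    int volume() const override { return 9; }
};

int dog_volume(const Dog& d)   { return d.volume(); } // can be a direct call
int bird_legs(const Bird& b)   { return b.legs(); }   // can be a direct call
int bird_volume(const Bird& b) { return b.volume(); } // still virtual: Parrot overrides it
```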

With link-time optimization (-flto), the compiler extends type analysis across translation unit boundaries. GCC’s -fdevirtualize-speculatively (enabled by default at -O2 and above, and given better target predictions by profile-guided optimization) goes further still: it generates an inline type check and falls back to the virtual call only when the prediction is wrong. The emitted code looks roughly like this:

if (p->vptr == &Dog::vtable) Dog::speak(p); // direct, inlinable
else p->speak();                            // fallback path

This works well for monomorphic and bimorphic call sites. As the number of distinct call targets in rotation grows, the chain of type checks becomes unprofitable and the compiler backs off. Megamorphic dispatch, common in interpreted language runtimes and plugin architectures, is where devirtualization reliably fails.
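The same guard can be written by hand when you know which type dominates a call site and the compiler does not. This is a source-level analogy only, under the assumption that RTTI is enabled: the compiler compares vptrs directly, whereas typeid performs its own lookup, so measure before adopting it. The types and the function name speak_guarded are hypothetical.

```cpp
#include <typeinfo>

// Hypothetical hierarchy, for illustration only.
struct Animal {
    virtual ~Animal() = default;
    virtual int speak() const { return 0; }
};
struct Dog final : Animal { int speak() const override { return 1; } };
struct Cat final : Animal { int speak() const override { return 2; } };

// Guarded call: test for the expected dynamic type, take a direct and
// inlinable path on a hit, and fall back to ordinary virtual dispatch on a miss.
int speak_guarded(const Animal& a) {
    if (typeid(a) == typeid(Dog)) {
        // Dog is final, so this call has exactly one possible target.
        return static_cast<const Dog&>(a).speak();
    }
    return a.speak();
}
```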

CRTP: Moving Dispatch into the Type Parameter

When the compiler cannot devirtualize because the concrete type is genuinely unknown at the call site, CRTP encodes the concrete type as a template parameter, moving dispatch entirely to compile time:

template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    double area_impl() const { return 3.14159 * r * r; }
    double r;
};

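The static_cast is valid as long as Derived really does inherit from Shape<Derived>. The language does not enforce this: a mismatched base such as struct Square : Shape<Circle> compiles, and calling area() on it is undefined behavior. When the convention holds, the call to area_impl() resolves at compile time. There is no vtable, no vptr in the object layout, and the callee body is fully inlinable.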

The trade-offs are real. Shape<Circle> and Shape<Square> are entirely unrelated types; you cannot store them in the same container without an additional type-erasure layer. Every function that accepts a shape must itself be a template, pushing the template parameter up the call stack. Each instantiation produces a separate copy of the base class methods, inflating binary size and compile time. Eigen, the canonical CRTP user in production code, is known for both its numerical performance and its ability to exhaust compiler memory on large programs.
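A short sketch shows how the template parameter spreads to consumers. Square, the constructors, and the two helper functions are additions for illustration and are not part of the original example.

```cpp
#include <vector>

template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    explicit Circle(double radius) : r(radius) {}
    double area_impl() const { return 3.14159 * r * r; }
    double r;
};

struct Square : Shape<Square> {
    explicit Square(double s) : side(s) {}
    double area_impl() const { return side * side; }
    double side;
};

// Any function that accepts "a shape" has to be a template itself.
template <typename Derived>
double doubled_area(const Shape<Derived>& s) {
    return 2.0 * s.area(); // resolved statically, inlinable
}

// A container holds exactly one concrete type: there is no common base
// that would let Circle and Square share a std::vector.
template <typename Derived>
double total_area(const std::vector<Derived>& shapes) {
    double sum = 0.0;
    for (const Derived& s : shapes) sum += s.area();
    return sum;
}
```

Each distinct Derived used with doubled_area or total_area produces its own instantiation, which is where the binary-size and compile-time cost comes from.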

CRTP also carries a correctness gap. If a derived class fails to implement area_impl(), the error message points deep into the template instantiation stack rather than at the missing method. C++20 concepts address this partially by letting you express the interface requirement directly:

template <typename T>
concept ShapeLike = requires(const T& s) {
    { s.area() } -> std::convertible_to<double>;
};

A function constrained by ShapeLike still compiles to direct, inlinable calls while producing diagnostic messages that name the missing requirement cleanly.

Deducing `this` in C++23

C++23’s deducing this (P0847R7) solves the CRTP boilerplate at the language level. The explicit object parameter syntax lets a member function deduce the concrete derived type without parameterizing the base class:

struct Shape {
    double area(this const auto& self) { // no trailing const: an explicit object function cannot be cv-qualified
        return self.area_impl();
    }
};

struct Circle : Shape {
    double area_impl() const { return 3.14159 * r * r; }
    double r;
};

Shape is now a plain class, not a class template. The function area is still a function template: its self parameter deduces to the static type of the object expression at each call site, so the call has to be made on a Circle, not through a Shape reference, for area_impl() to be found. The base class itself has no type parameter, which removes the CRTP boilerplate and the per-derived-type instantiation of the whole base class without giving up static dispatch or inlining. Generic code that accepts any shape still has to be a template. GCC 14 and Clang 18 both support the feature. Experiments on Compiler Explorer confirm that both emit direct, inlinable calls for the deducing-this version with no observable binary size penalty compared to CRTP for simple hierarchies.

The limitation remains: deducing this is still static polymorphism. You cannot mix Circle and Square in the same runtime container any more than you can with CRTP.

`std::variant` for Closed Sets

When the set of possible types is fixed at compile time and you need to store values of different types in the same container, std::variant provides closed-set polymorphism with value semantics:

using Shape = std::variant<Circle, Square, Triangle>;

double area(const Shape& s) {
    return std::visit(overloaded{
        [](const Circle& c)   { return 3.14159 * c.r * c.r; },
        [](const Square& q)   { return q.side * q.side; },
        [](const Triangle& t) { return 0.5 * t.base * t.height; }
    }, s);
}

std::visit dispatches on the variant’s small integer discriminant (its exact width is implementation-defined), typically through a jump table, rather than by chasing pointers to heap-allocated objects. For two to four types, compilers typically generate a direct branch or switch; the branch predictor handles it well when one type dominates. More importantly, objects in a std::vector<Shape> are stored contiguously in memory with no pointer indirection, so the CPU prefetcher can work effectively. Benchmarks comparing virtual dispatch over heap-allocated objects against std::variant over a flat array consistently show one to two orders of magnitude difference on cold data, almost entirely from cache behavior rather than dispatch mechanics.

The constraint is the closed set itself: adding a new type requires recompiling all code that touches the variant. That constraint is appropriate for cases where the type set is definitional, such as AST node kinds, protocol message types, or command variants. It is inappropriate for plugin systems or any interface where third-party types must be supported.

The overloaded helper used above is a common C++17 pattern not yet in the standard library. P0051 proposed adding it as std::overload, but it has not been adopted, so as of 2025 it remains a copy-paste utility.
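For reference, a self-contained version of the variant example, including the usual two-line definition of overloaded, might look like this. The struct layouts and the total_area function are assumptions added so the sketch compiles on its own.

```cpp
#include <variant>
#include <vector>

// The common C++17 idiom: inherit from every lambda and expose all call operators.
template <class... Ts>
struct overloaded : Ts... { using Ts::operator()...; };
// Deduction guide, required in C++17.
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

// Assumed layouts, for illustration only.
struct Circle   { double r; };
struct Square   { double side; };
struct Triangle { double base, height; };

using Shape = std::variant<Circle, Square, Triangle>;

double area(const Shape& s) {
    return std::visit(overloaded{
        [](const Circle& c)   { return 3.14159 * c.r * c.r; },
        [](const Square& q)   { return q.side * q.side; },
        [](const Triangle& t) { return 0.5 * t.base * t.height; }
    }, s);
}

// Mixed types live by value in one contiguous buffer: no heap allocation per element.
double total_area(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const Shape& s : shapes) sum += area(s);
    return sum;
}
```

If a new alternative is added to Shape without a matching lambda, std::visit fails to compile, which is the closed-set guarantee doing its job.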

Choosing Between These Approaches

The decision follows from what the compiler can see and what the type set looks like.

If the concrete type is locally provable, or the class is effectively sealed, use final and let the compiler handle it. If you are in a C++23 codebase and need static polymorphism with mixin behavior, deducing this is cleaner than CRTP and avoids the instantiation cost. If you are on C++17 and need static dispatch with policy-based composition, CRTP remains the standard approach. If the type set is closed and you want contiguous storage, std::variant with std::visit is the right model. Virtual dispatch belongs where the type set is genuinely open, the code is not on a hot path, or the flexibility of runtime extensibility outweighs the performance trade-off.

Game engines illustrate this layering clearly. Unreal Engine’s UObject system uses virtual dispatch for component callbacks that fire at 60 Hz per entity, which is acceptable. Its Niagara VFX system and physics solver use template specialization and data-oriented layouts specifically to enable vectorization in the inner loop, where virtual dispatch would block it. A similar split appears in high-frequency trading systems, where a fixed set of order types is handled with std::variant or compile-time dispatch on the critical path, while a broader framework of configurable strategies uses virtual interfaces for non-latency-sensitive management code.

The original article’s core point holds: the overhead is not simply the call itself. Every technique described here is fundamentally about giving the compiler more visibility into the callee, either by proving the type statically, parameterizing on it, or closing the set. The dispatch mechanism matters less than the inlining it does or does not permit.