Polymorphism is one of the foundational tools in C++ design. At a certain scale or in certain contexts, the runtime mechanism that powers it — virtual dispatch — becomes a measurable constraint. David Álvarez Rosa’s piece on devirtualization and static polymorphism covers the fundamentals well. What this post adds is a more precise account of where the cost actually lives, what compilers can recover automatically, and how the static polymorphism landscape looks in 2026 with C++23 in widespread use.
What a Virtual Call Actually Does
Every polymorphic class in C++ carries a hidden pointer, the vptr, which the compiler adds to each object; mainstream ABIs typically place it at offset zero. It points to the vtable, a static array of function pointers specific to the concrete type. On a 64-bit system, this adds 8 bytes to every instance, and sometimes more once alignment padding is counted.
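The size cost is easy to observe. The sketch below uses two hypothetical types, PlainPoint and PolyPoint, that differ only in whether they declare a virtual function. The exact difference is ABI-dependent, so treat it as something to check on your own platform, not as a guarantee.

```cpp
#include <cstddef>

// Two illustrative types with identical data members.
struct PlainPoint {
    int x;
    int y;
};

struct PolyPoint {
    virtual ~PolyPoint() = default;  // makes the class polymorphic
    int x;
    int y;
};

// Extra bytes per object attributable to the hidden vptr (plus any padding).
// On common 64-bit ABIs this is expected to be 8, but that is an assumption
// about the platform, not something the standard promises.
constexpr std::size_t kVptrOverhead = sizeof(PolyPoint) - sizeof(PlainPoint);
```

For a container of millions of small objects, that per-object overhead can matter as much as the dispatch itself.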
The dispatch sequence for obj->virtualMethod() is:
1. Load vptr from *obj (potential cache miss)
2. Load function pointer from vptr[N] (second potential cache miss)
3. Indirect branch to that pointer (branch predictor must guess the target)
4. Execute the callee body (potential instruction cache miss)
That is two memory reads before the call begins, followed by an indirect branch. On a warm, monomorphic call site where the vptr and vtable are in L1 cache and the branch predictor has locked onto a single target, the overhead is modest: roughly 4 cycles for each memory read plus 1–2 cycles for the predicted branch. Tolerable.
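To make the sequence concrete, here is a hand-written model of what the compiler generates. Every name in it (VTable, Shape, callArea, and so on) is made up for illustration; real vtables also carry extra entries such as type information, which this sketch omits.

```cpp
// Hand-written model of virtual dispatch. All names are illustrative only.
struct Shape;

struct VTable {
    double (*area)(const Shape*);  // slot 0 of the table
};

struct Shape {
    const VTable* vptr;            // the hidden pointer a compiler would inject
};

struct Square {
    Shape base;                    // first member, so a Square can be viewed as a Shape
    double side;
};

struct Rect {
    Shape base;
    double w, h;
};

inline double squareArea(const Shape* s) {
    const auto* sq = reinterpret_cast<const Square*>(s);
    return sq->side * sq->side;
}

inline double rectArea(const Shape* s) {
    const auto* r = reinterpret_cast<const Rect*>(s);
    return r->w * r->h;
}

// One static table per concrete type.
inline const VTable kSquareVTable{&squareArea};
inline const VTable kRectVTable{&rectArea};

inline double callArea(const Shape* obj) {
    const VTable* table = obj->vptr;  // step 1: load the vptr from the object
    auto fn = table->area;            // step 2: load the function pointer from its slot
    return fn(obj);                   // steps 3 and 4: indirect branch, then run the callee
}
```

Each of the two loads in callArea depends on the one before it, which is why a cache miss on either one stalls the whole call.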
The situation degrades sharply in two conditions. First, when the object graph is scattered across heap memory and neither the vptr nor the vtable slot is cached, a cold virtual call can cost 50 to 300 cycles, most of that spent waiting on DRAM. Second, when a single call site dispatches to many distinct concrete types in an unpredictable pattern, the indirect branch predictor fails. Modern Intel CPUs use branch history to predict indirect targets, but once a call site sees more than a handful of distinct targets in an irregular order, prediction accuracy can fall substantially, and a mispredicted indirect branch on Skylake costs roughly 20 cycles.
Chandler Carruth’s CppCon 2014 talk made a point that deserves repeating: the instruction cache miss is often worse than the dispatch overhead itself. When a std::vector<Base*> stores pointers to heterogeneous objects scattered across the heap, each call fetches a different function body. Those bodies compete for instruction cache slots. In a tight loop over thousands of objects, this can stall the frontend systematically, and no amount of compiler cleverness fixes a cold icache.
What Compilers Recover Automatically
Compilers are not passive about this. Both GCC and Clang attempt devirtualization at -O2 and above, using what they can see of the class hierarchy in a translation unit to prove that a virtual call has exactly one possible target. When the dynamic type of the receiver is known (for example, a local object whose construction the optimizer can see), or when a class or method is marked final, the compiler replaces the indirect call with a direct one and often inlines it entirely. A reference to a non-final concrete class is not enough on its own, because a further derived class could still override the method.
class Renderer final : public IRenderer {
public:
    void draw(Canvas& c) override { /* ... */ }
};
void render(Renderer& r, Canvas& c) {
    r.draw(c); // devirtualized: Renderer is final, single override
}
With link-time optimization (-flto), the visibility extends across translation units. If the optimizer can prove that a virtual function has only one concrete override in the whole program, both GCC and Clang can turn calls to it into direct calls; in practice this requires that the hierarchy cannot be extended by code outside the link, such as a plugin or shared library. GCC also performs speculative devirtualization without LTO: it emits a guard at the call site that checks for the most likely target, makes a direct (and inlinable) call when the guard passes, and keeps the virtual call as a fallback for the rest. With profile-guided optimization (-fprofile-use), it ranks candidates by observed frequency.
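Speculative devirtualization is easier to picture when written out by hand. The sketch below is only a conceptual model with hypothetical Codec and FastCodec types: it uses typeid for the guard because that is expressible in source code, whereas a compiler emits its own cheaper check.

```cpp
#include <typeinfo>

// Illustrative hierarchy; the names are assumptions for this sketch.
struct Codec {
    virtual ~Codec() = default;
    virtual int decode(int v) const { return v; }
};

struct FastCodec : Codec {
    int decode(int v) const override { return v * 2; }
};

// Manual model of a speculatively devirtualized call site.
int decodeGuarded(const Codec& c, int v) {
    if (typeid(c) == typeid(FastCodec)) {
        // Guard passed: the qualified call bypasses the vtable, so the
        // compiler is free to inline FastCodec::decode here.
        return static_cast<const FastCodec&>(c).FastCodec::decode(v);
    }
    return c.decode(v);  // fallback: ordinary virtual dispatch
}
```

The guard trades an indirect branch for a conditional one, which is usually easier to predict and leaves the hot path open to inlining.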
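Clang’s -fwhole-program-vtables option, used together with LTO (full or ThinLTO), embeds vtable metadata into LLVM IR so that devirtualization can work across module boundaries; with ThinLTO this does not require a full monolithic LTO link. It relies on the classes involved having hidden LTO visibility, which is typically arranged with -fvisibility=hidden.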
The cases where devirtualization fails are straightforward: any call through a Base* or Base& where the concrete type cannot be proven, any call across a library boundary without LTO, and any class hierarchy left open (no final). These are exactly the patterns idiomatic OOP tends to produce, which is why the optimizer cannot always help.
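The following sketch lines up the common cases side by side. The types and function names are hypothetical, and the comments describe what optimizers typically do, not guaranteed behavior; inspecting the generated assembly is the only reliable check.

```cpp
// Illustrative hierarchy: one open class and one sealed with final.
struct IRenderer {
    virtual ~IRenderer() = default;
    virtual int draw(int x) const = 0;
};

struct OpenRenderer : IRenderer {
    int draw(int x) const override { return x + 1; }
};

struct SealedRenderer final : IRenderer {
    int draw(int x) const override { return x + 2; }
};

// Concrete type unknown: generally stays an indirect call.
int viaInterface(const IRenderer& r, int x) { return r.draw(x); }

// OpenRenderer is not final, so a subclass could override draw():
// without whole-program knowledge this usually stays indirect too.
int viaOpenRef(const OpenRenderer& r, int x) { return r.draw(x); }

// SealedRenderer is final: the target is provably SealedRenderer::draw.
int viaSealed(const SealedRenderer& r, int x) { return r.draw(x); }

// The object is constructed in plain view, so its dynamic type is known.
int viaLocal(int x) {
    OpenRenderer r;
    return r.draw(x);
}
```

Only the last two functions give the compiler a proof it can act on without LTO or speculation.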
Static Polymorphism: The CRTP Foundation
When the set of types is known at compile time and runtime extensibility is not required, static polymorphism eliminates the vtable entirely. The classical mechanism is the Curiously Recurring Template Pattern (CRTP):
template <typename Derived>
class Serializable {
public:
    void serialize(Stream& s) const {
        static_cast<const Derived*>(this)->serializeImpl(s);
    }
};
class Record : public Serializable<Record> {
public:
    void serializeImpl(Stream& s) const {
        s.write(id);
        s.write(name);
    }
private:
    int id;
    std::string name;
};
The static_cast to const Derived* is resolved entirely at compile time. The call to serializeImpl is a direct call; the compiler can inline it. No vptr, no vtable, no indirect branch. Eigen’s entire expression template system is built this way: MatrixBase<Derived>, ArrayBase<Derived>, and every lazy expression type use CRTP so that a coefficient-wise statement like result = a + b + c generates a single evaluation loop with no virtual calls and no temporaries, directly eligible for auto-vectorization.
The cost is real: each template instantiation is a distinct class, so three types inheriting from Serializable<T> produce three copies of the serialize logic in the binary. Debug builds bloat significantly. Compile times increase with each instantiation. Before concepts arrived, the error message for a missing interface method could run to pages of template instantiation backtrace.
std::variant combined with std::visit offers a different trade-off for closed type sets:
using Shape = std::variant<Circle, Square, Triangle>;
double totalArea(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        sum += std::visit([](const auto& shape) {
            return shape.area();
        }, s);
    }
    return sum;
}
With two to four variant alternatives, compilers frequently unroll std::visit into a branch chain and inline each path. The objects are stored by value in the vector, eliminating pointer indirection and preserving spatial locality. With many alternatives, the visit degrades toward jump-table dispatch, but even then the objects remain contiguous in memory — which is often the larger win.
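To see what a branch chain means here, the sketch below writes one out by hand with std::get_if and puts it next to the std::visit version. The Circle, Square, and Triangle definitions are assumptions made for the example, and the hand-written chain only illustrates the shape of code an optimizer may produce.

```cpp
#include <variant>

// Illustrative value types; the member names are assumptions.
struct Circle   { double r;            double area() const { return 3.141592653589793 * r * r; } };
struct Square   { double side;         double area() const { return side * side; } };
struct Triangle { double base, height; double area() const { return 0.5 * base * height; } };

using Shape = std::variant<Circle, Square, Triangle>;

// Generic dispatch through std::visit.
double areaViaVisit(const Shape& s) {
    return std::visit([](const auto& shape) { return shape.area(); }, s);
}

// The same dispatch written as an explicit chain of type tests.
// Each branch calls a concrete area(), so each one can be inlined.
double areaViaBranches(const Shape& s) {
    if (const auto* c = std::get_if<Circle>(&s)) return c->area();
    if (const auto* q = std::get_if<Square>(&s)) return q->area();
    // Throws std::bad_variant_access if the variant is valueless.
    return std::get<Triangle>(s).area();
}
```

Both functions test the variant's stored index, which lives inside the object itself, so no pointer needs to be chased to find the right code.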
C++23 Makes This Substantially Cleaner
The main ergonomic complaint against CRTP has always been the template parameter on the base class. Base<Derived> forces the hierarchy to be spelled out at the definition site, prevents storing mixed derived types behind a common non-template base, and produces confusing error messages.
C++23’s deducing this (P0847) addresses most of these complaints by letting a member function declare its object parameter explicitly, so that its type can be deduced:
class Base {
public:
    void interface(this auto& self) {
        self.implementation(); // direct call, statically resolved
    }
};
class Concrete : public Base {
public:
    void implementation() { /* ... */ }
};
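Base is now a plain class, not a class template. Concrete inherits from Base without specifying a template argument. When interface is called on an expression whose static type is Concrete, self is deduced as Concrete&, making self.implementation() a direct call that the compiler can inline freely. The deduction follows the static type at the call site: calling interface through a Base& deduces Base&, and the call to implementation() then fails to compile, so this is not a substitute for runtime dispatch.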
This also fixes the fluent interface problem that CRTP handled with a static_cast dance. A builder method that takes this auto& self and returns self by reference automatically returns the derived type, preserving the correct static type through a chain of method calls without any explicit casting. GCC 14, Clang 18, and MSVC (since Visual Studio 2022 version 17.2) support this feature.
Choosing the Right Tool
The choice between virtual dispatch and static polymorphism is not primarily about pedantry over a few nanoseconds. It follows from the structure of the problem.
Virtual dispatch is appropriate when the set of types is open — when callers outside your control will provide implementations, when plugins or shared libraries extend the hierarchy, or when objects of different types genuinely need to live in the same container and be treated uniformly at runtime. It is also appropriate on paths that are not performance-critical, where the overhead is irrelevant and the clarity of a simple virtual interface outweighs everything else.
Static polymorphism is appropriate when the type set is closed and known at compile time, when the call site is on a hot path (inner loop, per-frame update, per-packet processing), and when contiguous storage of value-typed objects matters for cache behavior. Some embedded and safety-critical coding standards also restrict virtual dispatch by policy, citing RTTI and vtable overhead and the cost of indirect branches on small in-order cores such as the ARM Cortex-M0.
The compile-time cost is real and should not be dismissed. CRTP-heavy codebases tend to have noticeably longer build times than equivalent virtual-dispatch code. C++20 concepts improve error quality dramatically, and C++20 modules are beginning to reduce the cost of repeatedly parsing template-heavy headers, but neither eliminates the work of instantiating the templates themselves. If a team is already struggling with 20-minute clean builds, adding heavy CRTP hierarchies is a meaningful engineering choice, not a free optimization.
A practical approach: use final generously on classes and methods that are not designed for further derivation. This is zero-cost at runtime and enables the compiler to devirtualize without any change to the calling code. Measure before restructuring into CRTP; a monomorphic virtual call site in a warm cache is already nearly free, and the optimizer may have handled it. Reach for static polymorphism where profiling shows real cost, where the type set is genuinely closed, or where value semantics and contiguous storage matter more than runtime flexibility.