David Alvarez Rosa’s recent piece on devirtualization and static polymorphism covers the mechanics of how compilers recover from virtual dispatch. It’s worth reading on its own. What I want to do here is frame the problem from the optimizer’s perspective, because the usual explanation, “virtual calls are slow because of pointer indirection,” undersells how much damage they actually do.
What the vtable actually looks like
When you declare a class with at least one virtual method, the compiler inserts a hidden pointer into every object of that class, at the start of the object on the common ABIs. That pointer, the vptr, points to the vtable, which is a static array of function pointers, one slot per virtual method the class declares or inherits. The layout in memory looks like this:
object in memory:
[vptr] --> vtable:
             [0] -> Base::method_a
             [1] -> Base::method_b
             [2] -> Base::method_c
A derived class that overrides method_b gets its own vtable where slot 1 points to the override. The vptr in derived objects points to the derived vtable.
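The layout described above can be observed indirectly. The following is a minimal sketch, assuming a mainstream ABI; the types Plain, Base, and Derived and the helper functions are hypothetical names chosen to mirror the diagram, and the size difference is an observation about common compilers, not something the C++ standard guarantees.

```cpp
#include <cstddef>

// Hypothetical types for illustration; only the method names echo the diagram.
struct Plain {
    long value = 0;
};

struct Base {
    long value = 0;
    virtual ~Base() = default;
    virtual int method_a() const { return 1; }
    virtual int method_b() const { return 2; }
};

struct Derived : Base {
    // Only this slot differs in Derived's vtable; method_a is inherited.
    int method_b() const override { return 20; }
};

// On common ABIs the polymorphic type is larger by at least one pointer:
// the hidden vptr. This is an assumption about the platform, not a rule.
constexpr std::size_t vptr_overhead() {
    return sizeof(Base) - sizeof(Plain);
}

// Calls through a Base reference pick the function from the object's vtable.
inline int call_a(const Base& b) { return b.method_a(); }
inline int call_b(const Base& b) { return b.method_b(); }
```

Passing a Derived to call_b reaches the override, while call_a still reaches Base::method_a, because only one slot in the derived vtable was replaced.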
A virtual call to obj->method_b() compiles down to something like this on x86-64:
mov  rax, [rdi]     ; load vptr from object (first indirection)
call [rax + 8]      ; load function pointer from vtable slot, call it (second indirection)
The double indirection is real. With both loads hitting L1, virtual dispatch adds only a handful of cycles, roughly 5-10. If either the vptr or the vtable slot has to come from main memory, the call can stall for 100-300 cycles instead, and when the call target varies across many types, the instruction cache pollution from loading different code paths is its own cost on top.
But the cache miss story is not the main event.
The barrier the optimizer cannot cross
The deeper problem is that the compiler sees a virtual call as an opaque transfer to an unknown address. It cannot prove at compile time which concrete function will execute. That single uncertainty cascades into a list of things the optimizer refuses to do.
First, it cannot inline. Inlining is the gateway optimization. When a function is inlined, the compiler folds the callee’s body into the caller, eliminates the call overhead, and then runs the combined body through every subsequent pass: constant propagation, dead code elimination, loop unrolling, and vectorization. A virtual call closes that door entirely, because the target is not known until runtime.
Second, it cannot vectorize across the call. If you have a loop that calls a virtual method on each element of an array, the compiler cannot transform that into a SIMD loop. It does not know whether the method on element N+1 is the same function as on element N, so it cannot batch the work into vector lanes. In hot numeric or data-processing code, this is often the most expensive consequence.
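To make the contrast concrete, here is a small sketch of the same element-wise loop written twice. The names Scaler, Doubler, DoubleIt, scale_virtual, and scale_static are hypothetical. Whether the second loop actually vectorizes depends on the compiler and flags; the point is only that nothing in it hides the loop body from the optimizer.

```cpp
#include <vector>

// Hypothetical interface and implementation, named for illustration only.
struct Scaler {
    virtual ~Scaler() = default;
    virtual float apply(float x) const = 0;
};

struct Doubler : Scaler {
    float apply(float x) const override { return 2.0f * x; }
};

// Every iteration makes an indirect call through the vtable. Unless the
// optimizer can prove the dynamic type of `op`, the loop body is opaque.
void scale_virtual(const Scaler& op, std::vector<float>& xs) {
    for (float& x : xs) x = op.apply(x);
}

// A plain function object with no virtual members.
struct DoubleIt {
    float operator()(float x) const { return 2.0f * x; }
};

// The callee is part of the template argument, so the call is direct and
// can be inlined, leaving a simple multiply loop for the vectorizer.
template <typename Op>
void scale_static(Op op, std::vector<float>& xs) {
    for (float& x : xs) x = op(x);
}
```

Both functions compute the same result; they differ only in how much the compiler knows about the callee while optimizing the loop.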
Third, alias analysis breaks down. The compiler’s alias analyzer tracks which memory locations a piece of code can read or write. Across an opaque function call, it has to assume the callee can read or write any globally reachable memory. That blocks a lot of load-store optimization and can force the compiler to reload values from memory that it would otherwise keep in registers across the call.
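A short sketch of why the reload is forced, assuming a hypothetical Listener interface. The function read_twice and the Incrementer class are illustrative names; the implementation is deliberately one that does write to the memory the caller is reading.

```cpp
// Hypothetical interface, for illustration only.
struct Listener {
    virtual ~Listener() = default;
    virtual void notify() = 0;
};

// `counter` refers to memory that is reachable from outside this function,
// so the compiler has to assume notify() may modify it.
int read_twice(const int& counter, Listener& l) {
    int before = counter;  // first load
    l.notify();            // opaque call: may write to counter
    int after = counter;   // must be loaded again, `before` cannot be reused
    return before + after;
}

// One implementation that really does write to the aliased location.
struct Incrementer : Listener {
    int* target;
    explicit Incrementer(int* t) : target(t) {}
    void notify() override { ++*target; }
};
```

If the compiler had kept the first load in a register and reused it, the result would be wrong for Incrementer, which is why it cannot do so without knowing the callee.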
What you end up with is code that is slower than the two-indirection story suggests, because the surrounding code also degrades. The loop around the virtual call does not vectorize. The values around the call have to round-trip through memory. The branches around the call do not merge. The virtual call is a hole in the optimization graph.
How the compiler tries to recover: devirtualization
The compiler has several strategies to prove, at compile time, which concrete function a virtual call will invoke, and then replace the indirect call with a direct one. Once it is direct, inlining and everything downstream becomes possible again.
The simplest case is the final keyword. Mark a class or a virtual method final, and you tell the compiler that no further overrides exist. The compiler can then devirtualize any call where the static type is that final class or where the receiver was constructed as that type.
class Renderer final : public IRenderer {
public:
    void draw(Scene& s) override { /* ... */ }
};

void render_scene(Renderer& r, Scene& s) {
    r.draw(s); // devirtualized: compiler knows r is Renderer, which is final
}
The compiler does not need to prove anything about the runtime state. The type system tells it that Renderer has no subclasses, so r.draw(s) maps to exactly one function.
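The text mentions that final can also be applied to a single virtual method rather than a whole class. A minimal sketch of that variant follows, with a hypothetical Codec hierarchy; none of these names come from the original example.

```cpp
// Hypothetical hierarchy; the names are illustrative.
struct Codec {
    virtual ~Codec() = default;
    virtual int header_size() const = 0;
    virtual int payload_size() const = 0;
};

struct FramedCodec : Codec {
    // Sealed: no class derived from FramedCodec may override this method.
    int header_size() const final { return 8; }
    // Still open to overriding further down the hierarchy.
    int payload_size() const override { return 64; }
};

struct LargeFramedCodec : FramedCodec {
    int payload_size() const override { return 4096; }
};

int frame_size(const FramedCodec& c) {
    // header_size() has exactly one possible target through a FramedCodec
    // reference, so the compiler is allowed to call it directly.
    // payload_size() may still need to go through the vtable.
    return c.header_size() + c.payload_size();
}
```

This keeps the class open for extension while still giving the compiler a provable target for the sealed method.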
Link-time optimization (LTO) extends the compiler’s view across translation units. Without LTO, the compiler processing one .cpp file cannot see whether a class defined in another .cpp has subclasses. With LTO, the linker runs another optimization pass over the whole program, and that pass can see the full class hierarchy. If a virtual method has no overrides anywhere in the linked binary, and the toolchain is allowed to assume that no code outside it (a shared library or plugin, for example) derives from the class, the LTO pass can devirtualize calls to it even without final.
Profile-guided optimization (PGO) takes a different approach: instead of proving which function will always be called, it observes which function is most often called at a given call site during a profiling run. It then emits a type check followed by a direct call for the common case, with a fallback to the full virtual dispatch for the rare case. This is sometimes called speculative devirtualization or indirect call promotion:
// What the optimizer conceptually generates after PGO
if (vptr == Derived::vtable_ptr) [[likely]] {
    // devirtualized fast path
    static_cast<Derived*>(this)->method();
} else {
    // fallback to virtual dispatch
    this->method();
}
The fast path inlines and optimizes normally. The cold path is the original virtual call. For programs where a small number of concrete types dominate at a given call site, this recovers most of the performance without any source changes.
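The same guarded shape can be written by hand, which may help in seeing what the optimizer is doing. This is only a sketch: the Shape, Square, and Rect types and the function area_speculative are hypothetical, and it uses a typeid comparison where the optimizer would compare a vtable or function pointer, so it is likely more expensive than the compiler-generated check.

```cpp
#include <typeinfo>

// Hypothetical hierarchy for illustration.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

struct Rect : Shape {
    double w, h;
    Rect(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

// Hand-written guard, assuming Square is the type a profile found dominant.
double area_speculative(const Shape& s) {
    if (typeid(s) == typeid(Square)) {
        // Fast path: the exact type matched, so the cast is safe, and the
        // qualified name Square::area forces a direct, non-virtual call.
        return static_cast<const Square&>(s).Square::area();
    }
    // Cold path: ordinary virtual dispatch for every other type.
    return s.area();
}
```

In practice it is usually better to let PGO insert the guard, since the profile decides which type deserves the fast path at each call site.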
Escape analysis is the fourth mechanism. If the compiler can prove that an object does not escape the current function, it knows the complete lifetime of that object. Combined with the knowledge of how the object was constructed, it can determine the concrete type and devirtualize without any runtime check.
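A minimal sketch of the situation this paragraph describes, using hypothetical Shape and Square types. In local_area the object is created and used entirely inside one function, so its dynamic type is visible to the optimizer; in param_area it is not. Whether a given compiler actually devirtualizes the first call depends on optimization level.

```cpp
// Hypothetical hierarchy for illustration.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

// The object never leaves this function, and its construction is visible.
double local_area(double side) {
    Square sq(side);      // dynamic type is fixed right here
    const Shape& s = sq;  // viewed through the base interface
    return s.area();      // resolvable to Square::area at compile time
}

// Here the object comes from elsewhere, so its dynamic type is unknown.
double param_area(const Shape& s) {
    return s.area();      // stays a virtual call unless other facts help
}
```

Both functions return the same value for a Square; only the amount of information available at the call site differs.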
CRTP: make the type relationship structural
When compiler-assisted devirtualization is not enough, or when you want to guarantee zero overhead at the source level without depending on LTO or PGO being active, the Curiously Recurring Template Pattern (CRTP) moves the polymorphism into the type system.
template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};
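class Circle : public Shape<Circle> {
public:
    explicit Circle(double r) : radius(r) {}
    double area_impl() const {
        return 3.14159 * radius * radius;
    }
private:
    double radius;  // set by the constructor; private members need one
};

class Square : public Shape<Square> {
public:
    explicit Square(double s) : side(s) {}
    double area_impl() const {
        return side * side;
    }
private:
    double side;
};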
Shape<Circle> and Shape<Square> are entirely different types. There is no vtable. The call to area_impl() through the base resolves at compile time to the correct derived implementation, and the compiler can inline it freely. The static_cast in area() generates no code; it just changes how the compiler interprets the pointer.
The cost is that you lose the ability to hold heterogeneous collections behind a single pointer type. You cannot put a Shape<Circle> and a Shape<Square> in the same std::vector without another layer of indirection. CRTP trades runtime flexibility for compile-time transparency.
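The tradeoff can be seen in a small usage sketch. It repeats the CRTP classes so that it stands alone, adds constructors as an assumption (the classes need some way to set their members), and introduces a hypothetical helper, total_area, that works on a homogeneous vector.

```cpp
#include <type_traits>
#include <vector>

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(double r) : radius(r) {}  // constructor is an assumption
    double area_impl() const { return 3.14159 * radius * radius; }
private:
    double radius;
};

class Square : public Shape<Square> {
public:
    explicit Square(double s) : side(s) {}    // constructor is an assumption
    double area_impl() const { return side * side; }
private:
    double side;
};

// The two bases are unrelated types, so there is no common pointer type.
static_assert(!std::is_same_v<Shape<Circle>, Shape<Square>>);

// Hypothetical helper: generic over the derived type, one instantiation per
// shape. Each instantiation contains only direct calls.
template <typename D>
double total_area(const std::vector<D>& shapes) {
    double sum = 0.0;
    for (const Shape<D>& s : shapes) sum += s.area();
    return sum;
}
```

A std::vector<Square> and a std::vector<Circle> each work with total_area, but a single vector holding both would need a variant, a virtual base, or some other form of type erasure.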
C++20 concepts: express the interface without inheritance
C++20 concepts let you describe what a type needs to support without requiring inheritance from a base class at all.
template <typename T>
concept ShapeLike = requires(const T& s) {
    { s.area() } -> std::convertible_to<double>;
    { s.perimeter() } -> std::convertible_to<double>;
};

template <ShapeLike S>
void print_stats(const S& shape) {
    std::cout << "Area: " << shape.area() << "\n";
    std::cout << "Perimeter: " << shape.perimeter() << "\n";
}
Any type that satisfies ShapeLike works here. The call to shape.area() is a non-virtual call through the concrete type, so the compiler can inline and vectorize it like any other direct call. The concept gives you documentation and compile-time enforcement of the interface contract without any runtime mechanism. The ergonomic improvement over raw unconstrained templates is significant: error messages name the constraint that was violated rather than a stack of template instantiation failures.
std::variant and std::visit: sum types with closed dispatch
When you have a closed set of types and want value semantics rather than pointer semantics, std::variant combined with std::visit gives you a dispatch mechanism the compiler can often devirtualize entirely.
using Shape = std::variant<Circle, Square, Triangle>;

double total_area(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        sum += std::visit([](const auto& shape) {
            return shape.area();
        }, s);
    }
    return sum;
}
std::visit dispatches on the variant’s type tag, typically through a jump table or a switch, depending on the standard library. Because the set of types is fixed and known at compile time, the compiler can see every possible target. It can often inline all the concrete area() implementations into the dispatch, and with optimization enabled modern compilers frequently reduce this to a tight sequence with no hidden indirection, though the quality of the generated code varies with the compiler and standard library version.
The tradeoff is rigidity. Adding a new type to a std::variant-based design requires modifying the variant type alias and recompiling everything that uses it. That is the right tradeoff for many internal data model problems, but it is wrong for plugin systems or any extension point that needs to be open.
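Both the closed dispatch and the rigidity show up in a self-contained sketch. The three shape structs and the CornerCount visitor are hypothetical definitions supplied so the example compiles; the original snippet leaves Circle, Square, and Triangle undefined.

```cpp
#include <variant>
#include <vector>

// Hypothetical value types with no common base class.
struct Circle {
    double radius;
    double area() const { return 3.14159 * radius * radius; }
};
struct Square {
    double side;
    double area() const { return side * side; }
};
struct Triangle {
    double base, height;
    double area() const { return 0.5 * base * height; }
};

using Shape = std::variant<Circle, Square, Triangle>;

double total_area(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        // The generic lambda is instantiated once per alternative.
        sum += std::visit([](const auto& shape) { return shape.area(); }, s);
    }
    return sum;
}

// A visitor with one overload per alternative. If a new type is added to
// Shape and this visitor is not updated, std::visit fails to compile.
struct CornerCount {
    int operator()(const Circle&) const { return 0; }
    int operator()(const Square&) const { return 4; }
    int operator()(const Triangle&) const { return 3; }
};

int corner_count(const Shape& s) {
    return std::visit(CornerCount{}, s);
}
```

The compile error on a missing overload is the useful side of the rigidity: every place that dispatches on the variant is forced to handle the new alternative.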
When you should not reach for any of this
All of these static polymorphism techniques assume a closed world at compile time. That assumption is false in a significant class of problems.
Plugin architectures need to load code that was not present when the application was compiled. A plugin is a concrete type the compiler has never seen. Virtual dispatch, with a well-designed abstract base class as the plugin interface, is exactly the right tool. There is no static alternative.
Extension points in libraries, where user code provides implementations of an interface the library defines, are the same story. The library author cannot know what types the user will write. The interface must be expressed in terms of base class pointers or something equivalent.
Any situation where the set of types is determined at runtime, by user input, configuration, or dynamic loading, cannot be resolved at compile time. CRTP and concepts require instantiation against specific types; there is nothing to instantiate against an unknown.
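A small sketch of the runtime-determined case, assuming a hypothetical factory, make_shape, driven by a string that might come from configuration or user input. The Shape, Square, and Rect types are illustrative. The caller cannot know the concrete type at compile time, so the interface has to be a base class pointer or an equivalent.

```cpp
#include <memory>
#include <string>

// Hypothetical interface and implementations.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

struct Rect : Shape {
    double w, h;
    Rect(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

// Hypothetical factory: the concrete type depends on a runtime string.
// Returns nullptr for kinds it does not recognize.
std::unique_ptr<Shape> make_shape(const std::string& kind, double size) {
    if (kind == "square") return std::make_unique<Square>(size);
    if (kind == "rect") return std::make_unique<Rect>(size, 2.0 * size);
    return nullptr;
}
```

Code that receives the returned pointer calls area() through the vtable, and that is the appropriate cost here: no template can be instantiated against a type chosen after the program was built.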
The rule of thumb is: use static polymorphism when you control all the types and compile them together. Use dynamic polymorphism when the set of types is open or determined outside your build.
Putting the pieces together
The progression from virtual dispatch to static polymorphism is a progression from runtime information to compile-time information. The optimizer can do much more with compile-time information: inline aggressively, vectorize loops, keep values in registers, and eliminate branches. Virtual calls cost not just the two pointer dereferences but the optimization opportunities the surrounding code loses.
final and LTO are the first resort, available through source annotations or build configuration without restructuring the code. PGO and speculative devirtualization are the second resort, useful when the type hierarchy is genuinely open but a small number of types dominate at runtime. CRTP, concepts, and std::variant are the third resort, requiring more significant design commitment but removing virtual dispatch by construction, regardless of what the optimizer can prove.
Most codebases benefit from applying final liberally, enabling LTO in release builds, and reaching for std::variant or templates in the hot paths where the type set is genuinely closed. The virtual calls that remain will be in the places where they belong.