Devirtualization Has Limits: The Case for Static Polymorphism in Performance-Critical C++

Virtual functions are one of the first abstractions C++ programmers reach for. They model open-ended hierarchies, decoupled interfaces, and runtime variation in a way that reads clearly. The problem is that the machine executing your code does not see an interface; it sees an indirect branch, a pointer dereference, and a blocked inlining opportunity.

David Álvarez Rosa’s piece on isocpp.org covers the mechanics competently. What I want to examine more closely is when compilers rescue you, when they do not, and which static alternatives offer the best tradeoffs in each situation.

The hardware view of a virtual call

When a C++ class has virtual methods, the compiler inserts a hidden pointer, the vptr, into every instance, typically at offset zero under the common ABIs. That pointer targets a vtable: a static array of function pointers, one slot per virtual method. On x86-64, a call through base_ptr->method() generates roughly:

mov rax, [rcx]           ; load vptr from object (1st dereference)
call qword ptr [rax + 8] ; call through vtable slot (2nd dereference)

Two memory loads, one indirect branch. Each can hurt on a hot path. The first load has a reasonable chance of hitting cache if the object was recently accessed; the second load, into the vtable itself, may miss, particularly if the working set spans many different concrete types. More importantly, the target of the indirect branch depends on the runtime type of the object. Modern branch predictors handle small, stable sets of recurring types reasonably well, but a heterogeneous container exercising many different derived types in unpredictable order will stress the indirect branch predictor and produce pipeline stalls.

The inlining problem tends to be the larger issue in practice. A direct call to a known function can be inlined, which lets the compiler propagate constants, eliminate dead code, vectorize across loop iterations, and hoist loop-invariant loads. An indirect virtual call blocks all of this unless the compiler can prove the dynamic type at that point. That proof is the devirtualization problem, and it is harder than it looks.

What compilers can actually devirtualize

GCC and Clang both perform type-based devirtualization using class hierarchy analysis, and they approach it in a few distinct ways.

The simplest case is a final class or a final method. The compiler knows statically that no override exists and emits a direct call:

class Renderer final : public IRenderer {
public:
    void draw(const Mesh& m) override { /* ... */ }
};

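Any call to draw through a pointer or reference whose static type is Renderer compiles to a direct call and is eligible for inlining. Calls through an IRenderer pointer still depend on one of the analyses described next. This is the easiest win and costs nothing at the API level. The GCC documentation on optimization lists -fdevirtualize as enabled at -O2 and above, and final hands that pass a proof it would otherwise have to derive from the rest of the program.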
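A minimal sketch of the final case, assuming stand-in definitions for Mesh and IRenderer (the snippet above does not define them) and two hypothetical helper functions. It shows where the static type gives the compiler its proof and where it does not. Whether a direct call is actually emitted depends on the compiler and optimization level, so the generated assembly is the only real confirmation.

```cpp
#include <cassert>

// Hypothetical stand-ins for the article's types; only the shape matters.
struct Mesh { int triangles; };

class IRenderer {
public:
    virtual ~IRenderer() = default;
    virtual void draw(const Mesh& m) = 0;
};

class Renderer final : public IRenderer {
public:
    void draw(const Mesh& m) override { drawn_ += m.triangles; }
    int drawn() const { return drawn_; }
private:
    int drawn_ = 0;
};

// Static type is Renderer, which is final: no further override can exist,
// so the compiler is allowed to bind draw() directly and inline it.
int drawThroughFinal(Renderer& r, const Mesh& m) {
    r.draw(m);
    return r.drawn();
}

// Static type is IRenderer: the dynamic type is unknown here, so this stays
// an indirect call unless the optimizer can prove the type some other way.
void drawThroughBase(IRenderer& r, const Mesh& m) {
    r.draw(m);
}
```

Both functions behave identically. The difference is only in what the optimizer is permitted to assume at each call site.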

The second case is whole-program type analysis, most effective with link-time optimization enabled. If the compiler can determine that only one concrete type is ever instantiated for a given base pointer across the entire program, it devirtualizes even without final. This depends on escape analysis and closed-world assumptions about the translation unit set. It works well in self-contained programs with no dynamic loading and poorly in codebases that expose factory APIs across DSO boundaries or load plugins at runtime.

The third form is speculative devirtualization. Clang and GCC can emit a type check followed by an inlined fast path for the most likely concrete type, with a slow indirect fallback. The resulting code looks like:

cmp rax, offset _ZTV12DerivedClass+16  ; check vptr against likely type
jne .slow_path
; fast inlined body for DerivedClass
jmp .done
.slow_path:
call qword ptr [rax + 8]              ; general indirect dispatch
.done:

When one derived type dominates the call site, this works well. With many types in rotation, the type check becomes one more mispredicted branch, and the speculation adds instructions without removing any dispatch cost.
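The same guard-then-fallback shape can be written by hand, which makes the tradeoff easier to see. The sketch below is an analogy, not what the compiler emits: it uses typeid where a compiler would compare the vptr directly, and Base, Likely, Rare, and guardedValue are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <typeinfo>

struct Base {
    virtual ~Base() = default;
    virtual int value() const { return 1; }
};

struct Likely final : Base {
    int value() const override { return 2; }
};

struct Rare : Base {
    int value() const override { return 3; }
};

// Hand-written analogue of a guarded (speculative) call:
// test for the expected dynamic type, take a direct call on a hit,
// and fall back to ordinary virtual dispatch on a miss.
int guardedValue(const Base& b) {
    if (typeid(b) == typeid(Likely)) {
        // Qualified call: bound statically to Likely::value, so inlinable.
        return static_cast<const Likely&>(b).Likely::value();
    }
    return b.value();  // general indirect dispatch
}
```

If most objects reaching guardedValue are Likely, the guard is a well-predicted branch in front of an inlinable body. If the types are evenly mixed, the guard is extra work ahead of the same indirect call.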

The gap that matters is everything outside these three cases. Virtual calls in library code that ships as a compiled binary, calls through factory-returned pointers in large codebases, and calls inside tight loops over heterogeneous containers all tend to escape compiler analysis. On latency-sensitive paths, a call the compiler cannot analyze keeps its full dispatch overhead no matter how high the optimization level.

CRTP: the classical answer

The Curiously Recurring Template Pattern encodes the static type directly into the base class template parameter, making dispatch resolution a compile-time operation:

template<typename Derived>
class Shape {
public:
    void draw() {
        static_cast<Derived*>(this)->drawImpl();
    }
    float area() const {
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(float r) : radius(r) {}
    void drawImpl() { /* circle drawing */ }
    float areaImpl() const { return 3.14159f * radius * radius; }
private:
    float radius;
};

The static_cast is purely a compile-time annotation. No indirection occurs at runtime. The call to drawImpl on a Circle resolves at compile time, inlines freely, and the optimizer sees through the abstraction entirely. Object layout contains no vptr, so sizeof(Circle) is not inflated by a hidden pointer.

CRTP is the idiom to reach for in performance-sensitive library code. The Eigen linear algebra library uses expression templates built on CRTP to achieve lazy evaluation and zero-overhead composition of matrix operations. LLVM uses it too, for example in its InstVisitor class template. The drawback is ergonomics: each instantiation of Shape<T> is a distinct base class, so Shape<Circle> and Shape<Triangle> share no common ancestor at the object level. Heterogeneous collections require a type-erasing wrapper, which reintroduces indirect dispatch at the wrapper boundary, the very overhead you were trying to eliminate.
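To make the distinct-base-class point concrete, here is a small sketch using two hypothetical shapes, Square and Rect, and a hypothetical generic function scaledArea. Generic code over a CRTP hierarchy has to be a template itself, and the static_asserts spell out what the compiler knows about the layout. The sizeof check assumes the empty base optimization, which the common ABIs apply here.

```cpp
#include <cassert>
#include <type_traits>

template <typename Derived>
class Shape {
public:
    float area() const {
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

class Square : public Shape<Square> {
public:
    explicit Square(float side) : side_(side) {}
    float areaImpl() const { return side_ * side_; }
private:
    float side_;
};

class Rect : public Shape<Rect> {
public:
    Rect(float w, float h) : w_(w), h_(h) {}
    float areaImpl() const { return w_ * h_; }
private:
    float w_, h_;
};

// Generic code takes Shape<D> by reference; D is deduced per call site,
// so each instantiation contains a direct call to D::areaImpl.
template <typename D>
float scaledArea(const Shape<D>& s, float factor) {
    return s.area() * factor;
}

// The two bases are unrelated types, and neither carries a vptr.
static_assert(!std::is_same<Shape<Square>, Shape<Rect>>::value,
              "each instantiation is a distinct base class");
static_assert(!std::is_polymorphic<Square>::value,
              "no virtual functions, hence no vptr");
static_assert(sizeof(Square) == sizeof(float),
              "empty base adds no storage on common ABIs");
```

Because Shape<Square> and Shape<Rect> are unrelated, there is no single element type a std::vector could hold for both without a wrapper.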

C++20 concepts as a cleaner constraint layer

Concepts do not replace CRTP for shared-implementation cases, but they address one of its persistent rough edges: the absence of documented interface requirements. With CRTP, a derived class that forgets to implement drawImpl produces a cryptic instantiation error buried deep in template expansion. With concepts, the constraint is explicit:

template<typename T>
concept Drawable = requires(T t) {
    { t.drawImpl() } -> std::same_as<void>;
    { t.areaImpl() } -> std::convertible_to<float>;
};

template<Drawable T>
void render(T& shape) {
    shape.drawImpl();
}

The dispatch is still static. The call still inlines. But the concept gate provides clear diagnostics, documents the expected interface at the type system level, and separates constraint checking from implementation inheritance. For pure interface enforcement without shared base-class implementation, concepts are a better fit than CRTP in C++20 code. The cppreference concepts documentation covers the constraint syntax in detail, including the difference between requires clauses and requires expressions, which is a common source of confusion.

C++23 and deducing this

The explicit object parameter, introduced in C++23 as P0847 and commonly called “deducing this,” eliminates a substantial portion of CRTP boilerplate. The implicit this parameter can now be named and typed explicitly, allowing the derived type to be deduced at the call site:

struct ShapeBase {
    void draw(this auto& self) {
        self.drawImpl();
    }
    float area(this auto const& self) {
        return self.areaImpl();
    }
};

struct Circle : ShapeBase {
    void drawImpl() { /* ... */ }
    float areaImpl() const { return 3.14159f * r * r; }
    float r;
};

// Usage: direct call, fully inlined, no vtable
Circle c{{}, 2.0f};  // empty braces initialize the ShapeBase subobject
c.draw();       // resolves to Circle::drawImpl at compile time

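The generated code should be equivalent to the CRTP version: zero runtime overhead, full inlining eligibility. The this auto& parameter deduces the static type of the object expression at the call site, so the base class method body sees the derived type directly without any static_cast. This makes the pattern readable to engineers unfamiliar with CRTP and eliminates the class of bugs where the base class uses the wrong cast type. The flip side is that deduction follows the static type: a call through a ShapeBase& deduces ShapeBase, and self.drawImpl() then fails to compile, so objects must be used through their concrete type.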

The feature generalizes beyond simple dispatch. You can combine it with concepts to enforce interface contracts, use it to implement recursive patterns without template parameters, and apply it to value types that were previously excluded from CRTP because they used the wrong inheritance semantics. MSVC 19.33, Clang 18, and GCC 14 all support it; if you are targeting a modern toolchain, it is a reasonable default for new code.

std::variant and std::visit for enumerable type sets

When runtime variation is genuinely needed but the set of possible types is finite and known at compile time, std::variant combined with std::visit provides an alternative that often outperforms a vtable hierarchy in cache-sensitive workloads:

using Shape = std::variant<Circle, Rectangle, Triangle>;

float totalArea(const std::vector<Shape>& shapes) {
    float total = 0.0f;
    for (const auto& s : shapes) {
        total += std::visit([](const auto& shape) {
            return shape.areaImpl();
        }, s);
    }
    return total;
}

The compiler can inline the lambda for each type, generate a jump table or short comparison chain for dispatch, and avoid the heap pointer chasing that comes from a container of IShape*. The benchmark in Shahar Mike’s comparison of virtual dispatch and std::variant shows variant winning in tight loops by a meaningful margin on workloads where cache locality dominates.
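For completeness, here is a self-contained version of the totalArea example. The three shape definitions are hypothetical fill-ins, since the snippet above leaves them undefined, and they are plain structs with no base class at all.

```cpp
#include <cassert>
#include <variant>
#include <vector>

// Hypothetical value types; none has a vptr or a common base.
struct Circle    { float r;            float areaImpl() const { return 3.14159f * r * r; } };
struct Rectangle { float w, h;         float areaImpl() const { return w * h; } };
struct Triangle  { float base, height; float areaImpl() const { return 0.5f * base * height; } };

using Shape = std::variant<Circle, Rectangle, Triangle>;

// A variant is sized for its largest alternative (plus, in practice,
// a small discriminator), and elements live inline in the vector.
static_assert(sizeof(Shape) >= sizeof(Rectangle),
              "variant must be able to hold its largest alternative");

float totalArea(const std::vector<Shape>& shapes) {
    float total = 0.0f;
    for (const auto& s : shapes) {
        // One lambda instantiation per alternative; std::visit selects
        // the right one from the variant's stored index.
        total += std::visit([](const auto& shape) { return shape.areaImpl(); }, s);
    }
    return total;
}
```

The elements sit inline in the vector's buffer, so the loop walks contiguous memory instead of following one heap pointer per shape.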

The tradeoff is closure: the type set must be enumerable at compile time, which is exactly the constraint you accept when choosing static over dynamic dispatch. If the program is a plugin host, a scripting runtime, or a library with users who derive from your types, variant is not an option. If the program controls all the types involved, variant is frequently the right choice.

Choosing the right model

The decision comes down to three cases, distinguished by how much you know about the set of types at compile time.

If the type set is genuinely open, virtual dispatch is appropriate. Use final wherever you can to help the compiler devirtualize, enable LTO for whole-program analysis, and accept the remaining overhead as the cost of the open design. Profiling will tell you whether the virtual calls on your actual hot paths are costing enough to act on.

If the type set is closed and the concrete type is known at each call site, CRTP or deducing this gives you the same abstract design with direct calls throughout. For new C++23 code, deducing this is the cleaner approach. For C++17 codebases, CRTP remains the standard idiom.

If the type set is enumerable and you need heterogeneous collections at runtime, std::variant with std::visit often gives better throughput than either a vtable hierarchy or a CRTP wrapper with type erasure, because the objects are stored by value in contiguous memory, the vptr is absent, and the dispatch pattern is predictable.

Game engines, rendering pipelines, and numeric libraries have been routing around virtual dispatch on hot paths for years. What has shifted is that C++20 and C++23 have given the language enough expressive power to write the static version without the contortions that CRTP once required. The zero-overhead abstraction that C++ has always advertised as a design goal is, on these paths, now genuinely achievable without writing code that looks like a template metaprogramming exercise.