The Four Conditions: When the Compiler Can Devirtualize, and When It Cannot

Virtual dispatch is not expensive the way bubble sort is expensive. The overhead is real, but what makes it difficult to reason about is that its magnitude depends on context that changes between hardware generations, compiler versions, and security mitigations. The indirect branch at the core of virtual function calls is a point of friction at several levels simultaneously: cache pressure, branch misprediction, and, since 2018, an active security mitigation that can cost ten to twenty-five cycles per call regardless of prediction accuracy.

David Álvarez Rosa’s writeup on isocpp.org correctly identifies devirtualization and static polymorphism as the tools to reach for. What I want to do here is be specific about when the compiler actually manages devirtualization on its own, because that specificity matters when deciding whether to reach for the alternatives.

The four conditions for compiler devirtualization

GCC and Clang both run a type-based devirtualization pass, enabled by default at -O2 and above. The pass covers four distinct cases, and understanding where each one breaks down tells you something about the shape of the problem.

The final specifier. A class or method marked final cannot be overridden. When the static type at the call site is that final class (or the method itself is final), the compiler knows the target statically and emits a direct call. No analysis needed, no guessing. This is also the most reliable path to vectorization: a loop calling a virtual method through a pointer or reference to a final class resolves to a direct call that the compiler can inline, after which it can apply constant propagation, dead code elimination, and, critically, loop vectorization.

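class FastProcessor final : public Processor {
public:  // public so callers can invoke it through a FastProcessor reference
    void process(float* data, int n) override {
        for (int i = 0; i < n; ++i) data[i] *= 2.0f;
    }
};
// Through a FastProcessor* or FastProcessor&, the call is direct and inlinable,
// so the vectorizer sees the loop body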

Without final, the compiler must assume that an override may exist in another translation unit or a shared library, and it emits a vtable call. A loop over one million elements that calls a non-final virtual method on each one will generally not vectorize, because the call is opaque to the optimizer. The same loop with final can produce AVX2 instructions. That gap is not a few cycles per call; it is the difference between scalar and vectorized throughput.
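To make the contrast concrete, here is a minimal sketch. The Processor base and the two run functions are hypothetical names for illustration (the snippet above does not show the base class). Whether a given compiler actually inlines and vectorizes runFinal depends on flags and version, so check the generated assembly before relying on it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical base class; assumed here, not taken from the article.
struct Processor {
    virtual ~Processor() = default;
    virtual void process(float* data, int n) = 0;
};

struct FastProcessor final : Processor {
    void process(float* data, int n) override {
        assert(data != nullptr || n == 0);
        for (int i = 0; i < n; ++i) data[i] *= 2.0f;
    }
};

// Static type is the base: the call normally goes through the vtable unless
// the optimizer can prove the dynamic type some other way.
void runGeneric(Processor& p, std::vector<float>& v) {
    p.process(v.data(), static_cast<int>(v.size()));
}

// Static type is a final class: no override can exist, so the compiler is
// permitted to emit a direct call and inline the loop.
void runFinal(FastProcessor& p, std::vector<float>& v) {
    p.process(v.data(), static_cast<int>(v.size()));
}
```

Both functions compute the same result; the only difference is how much the compiler is allowed to know at the call site.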

Local object escape analysis. If an object is created locally and the compiler can prove that its address never escapes to code outside the current translation unit, the dynamic type is known and the call is devirtualized without any annotation. This works well for objects created on the stack or as temporaries. It fails as soon as the object pointer is passed to a function in another translation unit or stored somewhere the optimizer cannot track.
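A small sketch of the two situations, using hypothetical Encoder types that are not from the original article. In encodeLocally the construction is visible and the object never leaves the function, so optimizers can typically resolve the call; in encodeThrough the dynamic type is whatever the caller passed.

```cpp
#include <cassert>

// Hypothetical interface, for illustration only.
struct Encoder {
    virtual ~Encoder() = default;
    virtual int encode(int x) const = 0;
};

struct OffsetEncoder : Encoder {   // deliberately not marked final
    int encode(int x) const override { return x + 1; }
};

// The object is created here and its address never leaves this function,
// so the dynamic type behind `ref` is provably OffsetEncoder.
int encodeLocally(int x) {
    OffsetEncoder enc;
    const Encoder& ref = enc;
    return ref.encode(x);          // candidate for devirtualization and inlining
}

// The object comes from the caller. Seen in isolation, this function has no
// way to know the dynamic type, so the call stays indirect unless the
// function is inlined into a caller where the type is visible.
int encodeThrough(const Encoder& enc, int x) {
    return enc.encode(x);
}
```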

Speculative devirtualization with PGO. After a profile-guided optimization run, both GCC (-fprofile-use) and Clang (-fprofile-instr-use) can identify call sites where one concrete type dominates. They emit a type guard followed by an inlined fast path, with a fallback indirect call for the other cases:

cmp rax, offset _ZTV14ConcreteWidget+16   ; is this the type we expect?
jne .slow
; inlined ConcreteWidget::update() body here
jmp .done
.slow:
call qword ptr [rax + 16]                 ; general indirect dispatch
.done:

This is effective when one type appears at a call site ninety percent or more of the time. With an unpredictable mix of types, the speculation check becomes another mispredicted branch and the net result is worse than the original.
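The same guard-plus-fast-path shape can be written by hand in source form, which is a useful way to read what the compiler is doing. This is a sketch under assumptions: the Widget hierarchy is hypothetical, and the typeid comparison stands in for the vtable-pointer compare the compiler emits, so it illustrates the structure rather than the cost.

```cpp
#include <cassert>
#include <typeinfo>

// Hypothetical hierarchy mirroring the names in the assembly listing.
struct Widget {
    virtual ~Widget() = default;
    virtual int update() = 0;
};

struct ConcreteWidget : Widget {
    int ticks = 0;
    int update() override { return ++ticks; }
};

struct RareWidget : Widget {
    int update() override { return -1; }
};

int updateGuarded(Widget& w) {
    // Type guard: typeid on a polymorphic reference yields the dynamic type.
    if (typeid(w) == typeid(ConcreteWidget)) {
        // Fast path: the qualified call bypasses the vtable and can be inlined.
        return static_cast<ConcreteWidget&>(w).ConcreteWidget::update();
    }
    // Slow path: ordinary indirect dispatch for every other type.
    return w.update();
}
```

Writing this by hand is rarely worth it; the point is that the profile decides which type goes in the guard, and a wrong guess leaves you with an extra branch in front of the original call.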

Whole-program LTO analysis. With link-time optimization enabled (-flto in GCC; -flto or -flto=thin in Clang, where whole-program devirtualization also needs -fwhole-program-vtables and hidden visibility for the classes involved), the compiler can analyze the entire program and determine that only one concrete implementation of a virtual method is reachable. When that proof succeeds, it devirtualizes without final. The conditions are strict: no dynamic loading, no plugin systems, no factory APIs crossing DSO boundaries. In a self-contained binary, ThinLTO can produce an additional runtime improvement over -O3 on code with heavy virtual dispatch, on the order of 10-15% in favorable cases; measure on your own workload.

The gap that none of these four cases covers is the common situation in library code: virtual calls through pointers returned by factories in compiled binaries, calls in loops over std::vector<Base*> that mixes many derived types at runtime, and any code that has to work across a public API. On latency-sensitive paths in those situations, the overhead stays. The compiler is not coming to save you.

The cost of the overhead it leaves behind

A vtable call costs roughly two memory loads and an indirect branch. On hardware from before 2018, the indirect branch prediction machinery handles monomorphic call sites reasonably well: roughly five to ten cycles if the branch predictor learns the target. The cost rises to fifteen to twenty cycles at a megamorphic call site where many targets appear unpredictably.
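The "two loads and an indirect branch" can be spelled out with a hand-rolled vtable. This is a simplified model with hypothetical names (VTable, RectObject, callArea); real ABI layouts carry more in the vtable, but the dependency chain is the same.

```cpp
#include <cassert>

// Simplified stand-in for a compiler-generated vtable: one function slot.
struct VTable {
    float (*area)(const void* self);
};

// Simplified stand-in for a polymorphic object: hidden vptr, then the fields.
struct RectObject {
    const VTable* vptr;
    float w, h;
};

float rectArea(const void* self) {
    const auto* r = static_cast<const RectObject*>(self);
    return r->w * r->h;
}

const VTable kRectVTable{&rectArea};

float callArea(const RectObject* obj) {
    assert(obj != nullptr && obj->vptr != nullptr);
    const VTable* vt = obj->vptr;          // load 1: vptr from the object
    float (*fn)(const void*) = vt->area;   // load 2: function pointer from the vtable
    return fn(obj);                        // indirect branch: target known only at runtime
}
```

Each step depends on the previous one, which is why a cold object or a cold vtable hurts before the branch predictor even gets involved.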

After Spectre variant 2, disclosed in January 2018, the calculus changed. Indirect branches are the attack vector; the software mitigation, retpoline, replaces each indirect branch with a trampoline that starves the speculative pipeline. The cost of an indirect call on a retpoline-mitigated system is roughly ten to twenty-five cycles on Skylake-class hardware regardless of prediction accuracy. Intel hardware from Cascade Lake onward ships Enhanced IBRS (eIBRS), which lowers the cost to approximately four to six cycles. CET-IBT, available on Tiger Lake and later, is a separate control-flow integrity feature: it requires indirect branches to land on ENDBR-marked targets in hardware. It is not what the kernel reports as its Spectre v2 mitigation, so its presence alone does not tell you what an indirect call costs on a given machine.

If you are profiling on a modern laptop and comparing with numbers from a 2019 cloud server, you may be working from the wrong baseline. The mitigation cost depends on the CPU generation. On Linux, /sys/devices/system/cpu/vulnerabilities/spectre_v2 reports your current mitigation: Enhanced IBRS means the cheaper hardware path; Retpolines means the full software mitigation is active.
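If you want a benchmark harness to record which baseline it ran on, a few lines suffice. This is a sketch, not a complete parser: the function and enum names are hypothetical, the substring matching covers only the two strings mentioned above, and kernels report other variants and combinations that this will classify as Unknown or by whichever substring matches first.

```cpp
#include <cassert>
#include <fstream>
#include <optional>
#include <string>

// Returns the first line of the sysfs file, or nullopt when it cannot be read
// (non-Linux systems, containers without sysfs, very old kernels).
std::optional<std::string> readSpectreV2Status(
    const std::string& path = "/sys/devices/system/cpu/vulnerabilities/spectre_v2") {
    std::ifstream in(path);
    if (!in) return std::nullopt;
    std::string line;
    if (!std::getline(in, line)) return std::nullopt;
    return line;
}

enum class IndirectBranchMitigation { Unknown, Retpoline, EnhancedIbrs };

// Coarse classification by substring. Assumption: a line mentioning
// "Retpoline" is treated as the software path even if it mentions more.
IndirectBranchMitigation classify(const std::string& status) {
    if (status.find("Retpoline") != std::string::npos)
        return IndirectBranchMitigation::Retpoline;
    if (status.find("Enhanced") != std::string::npos &&
        status.find("IBRS") != std::string::npos)
        return IndirectBranchMitigation::EnhancedIbrs;
    return IndirectBranchMitigation::Unknown;
}
```

Logging the raw line next to every benchmark result is cheaper than trying to remember later which machine produced which number.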

The inlining barrier is often a larger concern than the raw call overhead. When the compiler cannot see through a virtual call, it cannot vectorize, cannot propagate constants across the boundary, and cannot hoist loop-invariant computation. AVX2 processes eight floats per instruction, so a scalar float loop that could have been vectorized can run at up to roughly eight times lower throughput than the vectorized version. That throughput loss typically dwarfs the ten to twenty-five cycles spent on the call itself.

CRTP: the classical static answer

The Curiously Recurring Template Pattern was documented by Jim Coplien in C++ Report in 1995. Its premise is that the concrete derived type can appear as a template parameter of its own base class, making every method call in the base resolve statically at compile time:

template<typename Derived>
class Renderer {
public:
    void render(const Scene& s) {
        static_cast<Derived*>(this)->renderImpl(s);
    }
};

class SoftwareRenderer : public Renderer<SoftwareRenderer> {
public:
    void renderImpl(const Scene& s) { /* ... */ }
};

The static_cast carries no runtime cost. The compiler resolves renderImpl to SoftwareRenderer::renderImpl at instantiation time and inlines freely. No vptr is inserted into the object; sizeof(SoftwareRenderer) contains no hidden pointer. The Eigen linear algebra library, which uses expression templates built on CRTP for lazy evaluation of matrix expressions, is the canonical proof at scale: it achieves near-hardware-optimal throughput precisely because no virtual calls interrupt the optimizer.

The limitation is that Renderer<SoftwareRenderer> and Renderer<HardwareRenderer> are distinct base class types with no shared ancestor. Storing a heterogeneous collection requires a type-erasing wrapper, which can reintroduce overhead. CRTP is the right tool when the calling code is itself a template over the concrete type and works against one shared interface, not for storing mixed types in a container at runtime.
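Both halves of that limitation fit in a short sketch. Assumptions to flag: Scene is a placeholder struct, HardwareRenderer and drawFrame are hypothetical, and render returns int here (instead of void as in the snippet above) purely so the result is observable.

```cpp
#include <cassert>
#include <type_traits>

struct Scene { int triangles = 0; };   // placeholder, not the article's type

template <typename Derived>
class Renderer {
public:
    int render(const Scene& s) {
        return static_cast<Derived*>(this)->renderImpl(s);
    }
};

class SoftwareRenderer : public Renderer<SoftwareRenderer> {
public:
    int renderImpl(const Scene& s) { return s.triangles; }
};

class HardwareRenderer : public Renderer<HardwareRenderer> {
public:
    int renderImpl(const Scene& s) { return s.triangles * 2; }
};

// The two bases are unrelated types, so no single container element type
// (such as a common Renderer*) can refer to both.
static_assert(!std::is_same_v<Renderer<SoftwareRenderer>, Renderer<HardwareRenderer>>);

// Generic callers stay templates: one instantiation per concrete renderer,
// each with the call resolved at compile time.
template <typename D>
int drawFrame(Renderer<D>& r, const Scene& s) {
    return r.render(s);
}
```

drawFrame works for either renderer, but only because the compiler stamps out a separate function for each; nothing here can be chosen at runtime.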

C++20 concepts: better diagnostics, same machine output

Concepts do not change what happens at runtime; they change what the compiler checks at the constraint boundary. The practical value for static polymorphism is documentation and error quality. A CRTP base that expects its derived class to implement renderImpl produces a cryptic instantiation error when the method is missing. A concept produces an error at the constraint violation point:

#include <concepts>  // std::same_as

template<typename T>
concept Renderable = requires(T t, const Scene& s) {
    { t.renderImpl(s) } -> std::same_as<void>;
};

template<Renderable T>
void render(T& r, const Scene& s) {
    r.renderImpl(s);  // direct call, full inlining eligibility
}

The generated code is identical to what a CRTP call or an unconstrained template would produce. The concept gate exists entirely for the programmer, not the machine. For pure interface enforcement without shared base-class implementation, this is the cleaner approach in C++20 code. The cppreference constraints documentation covers the distinction between requires clauses and requires expressions, which is a consistent source of confusion in early adoption.

Concepts also enable overload resolution via constraint subsumption. If you have two template overloads where one satisfies a more specific concept than the other, the compiler selects the more constrained version without an explicit priority mechanism. That is the kind of zero-cost abstraction that SFINAE approximated badly for two decades before C++20 arrived.

C++23 deducing this: eliminating the CRTP boilerplate

P0847R7, merged into C++23, introduces the explicit object parameter. The implicit this argument to a member function can now be declared and typed explicitly, allowing the compiler to deduce the concrete type of the receiver at the call site. The CRTP static_cast disappears entirely:

struct RendererBase {
    void render(this auto& self, const Scene& s) {
        self.renderImpl(s);   // concrete type deduced, resolved at compile time
    }
};

struct SoftwareRenderer : RendererBase {
    void renderImpl(const Scene& s) { /* ... */ }
};

SoftwareRenderer r;
r.render(scene);   // calls SoftwareRenderer::renderImpl directly, inlines

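The generated code is identical to CRTP. There is no hidden indirection, no vptr, no polymorphic overhead. The difference is that the implementation is readable to engineers unfamiliar with the CRTP pattern and eliminates the category of bugs that arise from incorrect static_cast types in the base. MSVC 19.32 (Visual Studio 2022 17.2), Clang 18, and GCC 14 all support deducing this. One thing does not change: the deduction uses the static type at the call site, so calling render through a RendererBase& deduces RendererBase and fails to compile, because renderImpl is not found there. For new code targeting a modern toolchain, there is little reason to write the CRTP version.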

The feature generalizes beyond simple dispatch. You can constrain the explicit object parameter with a concept to enforce the interface contract at the same call site, take the object by value (this Self self) where that suits the type better than a reference, and write recursive lambdas that refer to themselves through the explicit object parameter instead of a helper.

std::variant for closed type sets with runtime variation

When the set of possible types is finite and enumerable at compile time, and you genuinely need to dispatch at runtime, std::variant combined with std::visit often outperforms a vtable hierarchy:

using Shape = std::variant<Circle, Rectangle, Triangle>;

float totalArea(const std::vector<Shape>& shapes) {
    float total = 0.0f;
    for (const auto& s : shapes) {
        total += std::visit([](const auto& sh) {
            return sh.area();
        }, s);
    }
    return total;
}

The compiler can inline the lambda for each concrete type and typically generates a jump table or short comparison chain on the discriminant. The objects are stored by value inside the variant: no heap allocation, no pointer chasing, no vptr. In a tight loop over a container, the cache behavior is fundamentally different from a std::vector<IShape*> where every element might point to a separately heap-allocated object with its own cold vptr and vtable. The variant approach trades open extensibility for better data locality and more predictable dispatch.
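For completeness, here is a self-contained version with the shape types filled in. The struct definitions are assumptions (the snippet above does not show them); the static_asserts spell out the "no vptr, stored by value" claims.

```cpp
#include <cassert>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical shape types: plain structs, no base class, no virtual functions.
struct Circle {
    float r;
    float area() const { return 3.14159265f * r * r; }
};
struct Rectangle {
    float w, h;
    float area() const { return w * h; }
};
struct Triangle {
    float base, height;
    float area() const { return 0.5f * base * height; }
};

using Shape = std::variant<Circle, Rectangle, Triangle>;

// No vptr in any alternative, and the variant holds its value inline.
static_assert(!std::is_polymorphic_v<Circle>);
static_assert(sizeof(Shape) >= sizeof(Rectangle));

float totalArea(const std::vector<Shape>& shapes) {
    float total = 0.0f;
    for (const auto& s : shapes) {
        // std::visit selects the lambda instantiation matching the active type.
        total += std::visit([](const auto& sh) { return sh.area(); }, s);
    }
    return total;
}
```

The shapes share no base class at all; the only thing tying them together is that each has an area() member the generic lambda can call.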

The constraint is closure. The type set must be fully enumerable when the code compiles. Plugin systems, scripting runtimes, and libraries that users derive from cannot use this approach. For code that controls all its types, variant is frequently the right choice and the one most often overlooked when engineers reach reflexively for a virtual hierarchy.

A decision framework

Three questions determine which tool fits a given situation.

Is the type set open? Virtual dispatch is appropriate. Use final wherever the design allows, enable LTO for whole-program analysis, and profile before concluding that the remaining overhead is a problem on your specific hot path. The compiler will devirtualize what it can; final tells it what you already know.

Is the type set fixed at compile time with no runtime variation needed? CRTP or deducing this gives you the abstract interface with all calls resolved statically. In new C++23 code, deducing this is the cleaner form. In C++17 codebases, CRTP is the standard idiom and Eigen’s track record across thousands of numerical computing projects demonstrates it scales.

Is the type set closed, but the concrete type only known at runtime? std::variant with std::visit typically provides better cache behavior and avoids the type-erasure overhead that CRTP wrapper approaches reintroduce when you need heterogeneous storage.

The underlying observation in David Álvarez Rosa’s article is that virtual dispatch is the right default for an open-ended design, not a universal default. When you know the type set, C++ has always had the tools to eliminate the overhead. What C++20 and C++23 have added is the ability to write those tools in code that does not look like a metaprogramming puzzle.