The Real Cost of Virtual Dispatch and What Static Polymorphism Buys You

Virtual functions are load-bearing in most C++ codebases, and for good reason. They let you write code that works against an interface without knowing the concrete type at the call site. The mechanism is clean, the syntax is familiar, and the compiler enforces the contract. The cost, though, is not zero, and on latency-sensitive paths the overhead compounds in ways that are worth understanding precisely.

David Álvarez Rosa’s article on isocpp.org lays out the core argument for replacing virtual dispatch with static polymorphism in performance-critical code. The case is solid, but I want to go further: what exactly is the compiler doing when it tries to devirtualize, where does that effort fail, and when you do reach for static polymorphism, which of the available techniques fits your situation best.

What the CPU Does with a Virtual Call

The mechanism is straightforward at the language level: call a method through a base class pointer, the right implementation runs. Under the hood, the hardware does several things in sequence. First, the processor loads the vptr from the beginning of the object. Second, it indexes into the vtable using a fixed offset for the specific method. Third, it loads the function pointer stored in that slot. Fourth, it branches to that address.

That sequence involves two dependent memory loads before the actual function body begins. If the object, the vtable, and the function are all in L1 cache and the branch predictor has seen this pattern before, the overhead is modest, typically a few nanoseconds at most on modern hardware. The picture changes when any of those conditions is not met. A vtable lookup that misses L1 and is served from a lower cache level can cost 5-10 ns; a miss to main memory, 50-100 ns. These are rough figures that vary by microarchitecture, but they are not hypothetical; they follow from the latency characteristics of the cache hierarchy on current x86-64 processors.
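The load, index, load, branch sequence described above can be made explicit by writing the mechanism out by hand. The following is a minimal sketch, not what any particular compiler emits: ShapeVTable, ShapeObject, callArea and the other names are hypothetical, and real vtable layout is defined by the platform ABI.

```cpp
#include <cassert>

// Hand-rolled stand-in for a class with one virtual method.
// Every name here is hypothetical; real vtable layout is ABI-specific.
struct ShapeObject;

struct ShapeVTable {
    double (*area)(const ShapeObject*);  // slot 0: one entry per virtual method
};

struct ShapeObject {
    const ShapeVTable* vptr;  // the hidden pointer stored in the object
    double width;
    double height;
};

inline double rectArea(const ShapeObject* s) { return s->width * s->height; }
inline double triArea(const ShapeObject* s) { return 0.5 * s->width * s->height; }

// One table per "class", shared by all of its objects.
constexpr ShapeVTable rectVTable{&rectArea};
constexpr ShapeVTable triVTable{&triArea};

inline double callArea(const ShapeObject* obj) {
    const ShapeVTable* table = obj->vptr;  // load 1: vptr out of the object
    auto target = table->area;             // load 2: function pointer at a fixed slot
    return target(obj);                    // indirect branch to the loaded address
}

// Usage: the call site is identical; only the vptr decides what runs.
inline void vtableExample() {
    ShapeObject rect{&rectVTable, 2.0, 3.0};
    ShapeObject tri{&triVTable, 2.0, 3.0};
    assert(callArea(&rect) == 6.0);
    assert(callArea(&tri) == 3.0);
}
```

The second load depends on the first, and the branch depends on the second, which is why a cache miss anywhere in that chain delays the call.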

Branch prediction is the subtler cost. Modern CPUs use the indirect branch predictor to guess which function will be called before the vtable load completes. When the concrete type varies across loop iterations because you have a heterogeneous collection of base class pointers, that predictor misfires frequently. Each misfire flushes the pipeline and restarts execution from the correct path. On a processor with a 15-20 stage pipeline, that is 10-20 cycles of stalled execution per misprediction.

The inlining barrier is arguably more significant than the dispatch overhead itself. Inlining allows the compiler to see through function boundaries and apply optimizations that span multiple operations: loop hoisting, constant propagation, auto-vectorization of arithmetic in tight loops. A virtual call that cannot be devirtualized prevents all of those. Even a function that does trivial arithmetic loses its optimization potential if the compiler cannot see through the call boundary.
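To make the inlining barrier concrete, here is a small sketch of the same loop written twice. IOp, VirtualDoubler, StaticDoubler and the two transform functions are names invented for this illustration. Whether the second loop actually gets vectorized depends on the compiler and flags, so treat the comments as what the optimizer is permitted to do rather than a guarantee.

```cpp
#include <cassert>
#include <vector>

struct IOp {
    virtual ~IOp() = default;
    virtual float apply(float x) const = 0;
};

struct VirtualDoubler : IOp {
    float apply(float x) const override { return 2.0f * x; }
};

// No base class and no virtual functions.
struct StaticDoubler {
    float apply(float x) const { return 2.0f * x; }
};

// Unless the compiler can prove the dynamic type of op, each iteration is an
// opaque indirect call, so the multiply cannot be merged into the loop body.
inline void transformVirtual(std::vector<float>& values, const IOp& op) {
    for (float& x : values) x = op.apply(x);
}

// Op is a template parameter: the body of apply is visible at instantiation,
// so the loop can reduce to a plain multiply the optimizer may vectorize.
template <typename Op>
void transformStatic(std::vector<float>& values, const Op& op) {
    for (float& x : values) x = op.apply(x);
}

inline void inliningExample() {
    std::vector<float> a{1.0f, 2.0f, 3.0f};
    std::vector<float> b = a;
    transformVirtual(a, VirtualDoubler{});
    transformStatic(b, StaticDoubler{});
    assert(a == b);  // same result; only the generated code differs
}
```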

What the Compiler Will Try

GCC and Clang both implement devirtualization passes that attempt to replace virtual calls with direct calls when they can prove which function will execute. The most reliable path is the final keyword:

class Renderer final : public IRenderer {
public:
    void draw(const Scene& scene) override;
};

// The compiler knows Renderer has no subclasses, so a call through
// Renderer& or Renderer* can only ever reach Renderer::draw.
void render(Renderer& r, const Scene& scene) {
    r.draw(scene); // becomes a direct call, eligible for inlining
}

Without final, the compiler can still devirtualize when the object was constructed in visible scope and has not escaped to external code. If you create a Circle on the stack, call a method, and the compiler can trace that pointer within the function, it may devirtualize even through a base class pointer. This analysis is conservative and breaks down across function call boundaries without link-time optimization.
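A sketch of the two situations, using hypothetical Shape and Square types: in the first function the object's dynamic type is visible to the optimizer, in the second it is not. Whether a given compiler actually devirtualizes the first call depends on version and optimization level.

```cpp
#include <cassert>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
    double side;
};

// The object is constructed here and never escapes, so the compiler can see
// that p points at a Square even though the static type is Shape.
inline double localArea() {
    Square sq{3.0};
    const Shape* p = &sq;
    return p->area();  // candidate for devirtualization
}

// The dynamic type of s is unknown inside this function. Unless the call is
// inlined into a caller that knows the type, or LTO helps, it stays virtual.
inline double opaqueArea(const Shape& s) {
    return s.area();
}

inline void devirtExample() {
    assert(localArea() == 9.0);
    assert(opaqueArea(Square{2.0}) == 4.0);
}
```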

Link-time optimization changes the picture considerably. With -flto on GCC or Clang, the compiler analyzes the entire program as a unit after individual translation units are compiled. If only one class ever overrides a particular virtual method across the whole program, LTO can safely devirtualize all calls to it. On Clang, this whole-program analysis additionally needs -fwhole-program-vtables. This is effective for programs that do not use dynamic libraries. The moment a shared library enters the picture, the compiler must assume that new types implementing the interface could be loaded at runtime, and it can no longer prove that the override it sees is the only one.

GCC exposes two relevant flags beyond LTO: -fdevirtualize, for cases where the call target can be proven, and -fdevirtualize-speculatively, for cases where the compiler guesses the likely target and guards the direct call with a runtime check. Both are enabled by default at -O2 and above. Speculative devirtualization can help in tight loops where one concrete type dominates call sites, but it adds a conditional branch and does not eliminate the fallback to virtual dispatch when the guess is wrong.
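The shape of a speculatively devirtualized call can be approximated at source level. This is an analogy only: the types and areaSpeculative are hypothetical, and a compiler would typically compare the loaded function address or vptr rather than call typeid. The structure is the point: a guard, a direct call on the fast path, and the original virtual call as the fallback.

```cpp
#include <cassert>
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
    double side;
};

struct Rect : Shape {
    Rect(double w, double h) : width(w), height(h) {}
    double area() const override { return width * height; }
    double width;
    double height;
};

// Guess that the dynamic type is Square and guard the guess at runtime.
inline double areaSpeculative(const Shape& s) {
    if (typeid(s) == typeid(Square)) {
        // Guess was right: the qualified call bypasses the vtable and is inlinable.
        return static_cast<const Square&>(s).Square::area();
    }
    return s.area();  // guess was wrong: ordinary virtual dispatch
}

inline void speculativeExample() {
    assert(areaSpeculative(Square{3.0}) == 9.0);     // fast path
    assert(areaSpeculative(Rect{2.0, 5.0}) == 10.0); // fallback path
}
```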

CRTP: Compile-Time Dispatch Through the Base Class

The Curiously Recurring Template Pattern was the practical answer to static polymorphism before C++20. The idiom works by parameterizing the base class on the derived class itself, then casting to the derived type to call the implementation:

template<typename Derived>
class Shape {
public:
    void draw() {
        static_cast<Derived*>(this)->drawImpl();
    }
    double area() const {
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

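class Circle : public Shape<Circle> {
public:
    explicit Circle(double radius) : radius_(radius) {}
    void drawImpl() { /* circle rendering */ }
    // M_PI comes from <cmath> on POSIX systems; std::numbers::pi is the C++20 standard spelling.
    double areaImpl() const { return M_PI * radius_ * radius_; }
private:
    double radius_;
};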

The cast resolves at compile time. No vtable is generated, and the compiler knows the concrete type at every call site that holds a Circle or a Shape<Circle>, so the implementation can be inlined completely. Object layout does not include a vptr, so Circle is smaller than an equivalent class with virtual functions by the size of that pointer, 8 bytes on typical 64-bit systems.
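A usage sketch may help here. It restates a trimmed version of the CRTP base so that it compiles on its own, and adds a hypothetical generic function, scaledArea, that accepts any CRTP shape. Square and Rect are stand-in types chosen so the arithmetic is exact. The static_assert reflects what mainstream ABIs do with an empty base; it is an observation about those ABIs, not a guarantee from the language.

```cpp
#include <cassert>

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

class Square : public Shape<Square> {
public:
    explicit Square(double side) : side_(side) {}
    double areaImpl() const { return side_ * side_; }
private:
    double side_;
};

class Rect : public Shape<Rect> {
public:
    Rect(double w, double h) : w_(w), h_(h) {}
    double areaImpl() const { return w_ * h_; }
private:
    double w_;
    double h_;
};

// Accepts any CRTP shape. Each instantiation knows its Derived type, so
// area() resolves to a direct, inlinable call.
template <typename Derived>
double scaledArea(const Shape<Derived>& shape, double factor) {
    return factor * shape.area();
}

// No vptr: on mainstream ABIs the empty base adds no storage.
static_assert(sizeof(Square) == sizeof(double));

inline void crtpExample() {
    assert(scaledArea(Square{3.0}, 2.0) == 18.0);
    assert(scaledArea(Rect{2.0, 5.0}, 0.5) == 5.0);
}
```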

The limitation that matters most is the inability to use heterogeneous collections transparently. Shape<Circle> and Shape<Rectangle> are unrelated types with no common base, so there is no single Shape* to store: you cannot put a Circle and a Rectangle into one std::vector and call draw() on each element without going back to virtual dispatch or writing type erasure by hand. CRTP gives you performance at the cost of the runtime polymorphism model in the cases where that model is genuinely needed.

Error messages from CRTP failures were historically severe. A type constraint violation inside a deeply instantiated template could produce pages of output before identifying the actual problem. C++20 concepts address this directly, which is why most new code should prefer concepts over CRTP for the interface-enforcement role.

C++20 Concepts: Cleaner Interface Constraints

Concepts let you express what a type must provide without imposing a shared base class:

template<typename T>
concept Shape = requires(const T& s) {
    { s.draw() } -> std::same_as<void>;
    { s.area() } -> std::convertible_to<double>;
};

template<Shape S>
void renderAll(std::span<S> shapes) {
    for (auto& s : shapes) {
        s.draw(); // direct call, inlined if the body is visible
    }
}

s.draw() is a direct call resolved at instantiation time. No vtable, no indirection, full inlining opportunity. The concept constraint gives the compiler early violation detection and produces a readable error at the call site when a type does not satisfy the requirements, rather than deep inside the template body.

Concepts do not replace CRTP when the base class needs to provide shared implementation logic. If your base is doing meaningful work that all derived types inherit without duplication, CRTP is still the right tool. For generic algorithms that simply need a type to conform to a protocol, concepts are preferable: cleaner to write, cleaner to read, and they produce substantially better diagnostics.

std::variant and std::visit: The Closed-Set Alternative

A path that often goes unmentioned in these comparisons is std::variant with std::visit. If the set of possible types is fixed and known at compile time, variant dispatch is often faster than a virtual call:

using ShapeVariant = std::variant<Circle, Rectangle, Triangle>;

double totalArea(const std::vector<ShapeVariant>& shapes) {
    double sum = 0;
    for (const auto& s : shapes) {
        sum += std::visit([](const auto& shape) {
            return shape.area();
        }, s);
    }
    return sum;
}

std::visit dispatches based on the variant’s internal type index rather than a vtable pointer. Standard library implementations typically lower this to a switch or a table of function pointers indexed by that value. There is no double indirection; the concrete type’s method is called directly inside the generic lambda and can be inlined when the implementation and optimizer allow it. The active alternative is stored inline within the variant, so there is no heap allocation and no pointer chasing.
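Conceptually, the index-based dispatch looks like the hand-written switch below. This is an approximation for illustration, not how any particular standard library implements std::visit. Square, Rect, areaBySwitch and areaByVisit are hypothetical names.

```cpp
#include <cassert>
#include <variant>

struct Square {
    double side;
    double area() const { return side * side; }
};

struct Rect {
    double w;
    double h;
    double area() const { return w * h; }
};

using ShapeVariant = std::variant<Square, Rect>;

// Hand-written approximation of index-based dispatch: one integer load,
// then a branch to a case where the concrete type is statically known.
inline double areaBySwitch(const ShapeVariant& v) {
    switch (v.index()) {
        case 0: return std::get_if<0>(&v)->area();
        case 1: return std::get_if<1>(&v)->area();
        default: return 0.0;  // valueless state; not reachable for these types
    }
}

// The idiomatic form; the generic lambda is instantiated once per alternative.
inline double areaByVisit(const ShapeVariant& v) {
    return std::visit([](const auto& shape) { return shape.area(); }, v);
}

inline void variantExample() {
    ShapeVariant a = Square{3.0};
    ShapeVariant b = Rect{2.0, 5.0};
    assert(areaBySwitch(a) == areaByVisit(a));
    assert(areaBySwitch(b) == 10.0);
}
```

In real code std::visit is the form to write; the switch version has to be updated by hand whenever the variant's alternatives change.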

Benchmarks comparing std::variant to virtual dispatch often show variant to be faster, particularly when the type set is small and call patterns are predictable, though the margin depends on the standard library implementation and compiler version. The trade-off is rigidity: adding a new type requires changing the variant definition and recompiling everything that uses it. For plugin systems or extensible APIs, this is disqualifying. For domain types that represent a closed set of entities, it is a reasonable constraint.

The variant approach also interacts well with value semantics. Objects can live in contiguous arrays without pointer indirection to heap-allocated polymorphic instances, which improves cache utilization in data-heavy workloads like spatial queries or entity component systems.

Choosing the Right Tool

Virtual dispatch belongs where you need genuine runtime extensibility: plugin architectures, interfaces exported across library boundaries, types loaded from external modules at runtime. Mark concrete implementations final wherever appropriate, and build with -flto if your pipeline supports it. Those two steps give the compiler the maximum opportunity to devirtualize without restructuring any code.

CRTP belongs in library code that provides shared base behavior through a configurable interface. If you are building a policy-based design where the base provides real implementation that derived types share or specialize, CRTP is appropriate. In the standard library, std::enable_shared_from_this and C++20's std::ranges::view_interface are CRTP bases, and Boost.Iterator's iterator_facade uses the pattern to build a full iterator interface from a few core operations.

Concepts belong in generic algorithms. If your template function just needs a type to satisfy a set of operations, constrain it with a concept. You get zero-overhead dispatch, readable errors at constraint-violation sites, and no inheritance complexity in your class hierarchy.

std::variant belongs with closed type sets that have value semantics. If you have a finite collection of types that do not need to live behind a common pointer at runtime, variant produces tight, inlinable dispatch code with a simpler object model than either CRTP hierarchies or hierarchies built on virtual functions.

The underlying principle across all of these is the same one the isocpp.org article articulates: when the compiler can resolve a call at compile time, it eliminates not just the dispatch overhead but also the optimization barriers that indirect calls impose. The question is always whether the complexity cost of achieving that resolution manually is justified by the profiling data. Most code is not on a latency-sensitive path where virtual dispatch is the bottleneck. When it is, the options are better than they have ever been.