The Inlining Firewall: What Virtual Functions Block the Compiler From Doing

Virtual functions sit at the center of most object-oriented C++ codebases. They are clean, they express intent well, and they work. The performance cost is often waved away as negligible, or acknowledged as a few nanoseconds of pointer indirection. That framing is incomplete, and the incompleteness is what tends to surprise people when they profile a hot path.

David Álvarez Rosa’s piece on devirtualization gives a solid breakdown of the mechanics. This post goes further into why the vtable lookup itself is rarely the bottleneck, what the compiler loses when it cannot inline across a virtual call, and how static polymorphism has evolved from CRTP boilerplate to the much cleaner C++23 explicit object parameter syntax.

The Vtable Lookup Is Not Your Problem

When a virtual function is called, the CPU performs two dependent memory loads and an indirect branch: it loads the vptr from the object, indexes into the vtable to get the function pointer, and calls through it. On a warm cache with a predictable call target, this costs roughly 2 to 5 nanoseconds. On a cold vtable or at a megamorphic call site (four or more concrete types mixing at a single dispatch point), cache misses and branch mispredictions can together push the cost to 50 to 80 nanoseconds.
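The two loads and the indirect branch can be made visible by writing the dispatch by hand. The following is a minimal sketch, not what any particular compiler emits: ShapeVTable, ManualShape, ManualCircle, and callArea are hypothetical names chosen for illustration, and real ABIs store additional data in the vtable.

```cpp
// Hand-rolled equivalent of a class with one virtual function.
// All names here are illustrative assumptions, not compiler output.
struct ShapeVTable {
    double (*area)(const void* self);  // slot 0: pointer to the implementation
};

struct ManualShape {
    const ShapeVTable* vptr;  // the hidden pointer every polymorphic object carries
};

struct ManualCircle {
    ManualShape base;  // first member, so the circle and its base share an address
    double radius;
};

double circleArea(const void* self) {
    const auto* c = static_cast<const ManualCircle*>(self);
    return 3.14159 * c->radius * c->radius;
}

const ShapeVTable kCircleVTable{&circleArea};

ManualCircle makeCircle(double r) {
    return ManualCircle{{&kCircleVTable}, r};
}

double callArea(const ManualShape* s) {
    const ShapeVTable* vt = s->vptr;  // load 1: vptr from the object
    auto fn = vt->area;               // load 2: function pointer from the vtable
    return fn(s);                     // indirect call through that pointer
}
```

Inside callArea the optimizer sees only a function pointer, which is the position it is in at every virtual call site whose concrete type it cannot prove.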

Those numbers matter in aggregate, but they are not the main story. The main story is what happens to the code around the call.

Compilers optimize aggressively by seeing across function boundaries. When you call a regular function, the compiler can inline it, propagate constants into it, eliminate dead branches inside it, and merge the result back into the calling context. An indirect virtual call is an opaque barrier. Unless the compiler can prove the concrete type, it does not know at compile time what code will execute, so it cannot optimize across the boundary. The callee’s entire body is invisible to the optimizer.

This shows up most clearly in loops. Consider a virtual update() method called on a container of objects. The compiler cannot vectorize that loop with SIMD because the call target is unknown. With AVX2, one 256-bit instruction operates on eight single-precision floats at once; a loop with a virtual call in the body handles one element per call. The dispatch overhead is measured in nanoseconds, but the lost vectorization can cost up to a factor of eight in throughput.
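A simplified sketch of the contrast, assuming a single updater object applied across a float array rather than a container of polymorphic objects. Updater, Scale, applyVirtual, and applyDirect are hypothetical names. Whether the second loop actually vectorizes depends on the compiler and flags, so treat the comments as expectations to check in the generated assembly rather than guarantees.

```cpp
#include <vector>

struct Updater {
    virtual ~Updater() = default;
    virtual float update(float x) const = 0;
};

struct Scale final : Updater {
    float factor;
    explicit Scale(float f) : factor(f) {}
    float update(float x) const override { return x * factor; }
};

// The static type is the base class: unless this function is itself inlined
// into a caller where the concrete type is known, each iteration makes an
// indirect call and the loop body stays opaque to the vectorizer.
void applyVirtual(const Updater& u, std::vector<float>& data) {
    for (float& x : data) x = u.update(x);
}

// The static type is a final class: the call can be resolved directly,
// inlined to a multiply, and the loop becomes a candidate for SIMD.
void applyDirect(const Scale& s, std::vector<float>& data) {
    for (float& x : data) x = s.update(x);
}
```

Both functions compute the same result; the difference is only in how much the optimizer can see.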

The same logic applies to constant folding and dead code elimination. If the concrete implementation branches on a value that the caller already knows, a devirtualized call can eliminate those branches entirely. A virtual call cannot.
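As an illustration of the branch-elimination point, here is a small sketch in which the caller passes a flag it knows to be false. Filter, Gain, viaBase, and viaFinal are hypothetical names, and the comments describe what an optimizer is able to do, not what a specific compiler is guaranteed to emit.

```cpp
struct Filter {
    virtual ~Filter() = default;
    virtual int apply(int v, bool clamp) const = 0;
};

struct Gain final : Filter {
    int apply(int v, bool clamp) const override {
        int r = v * 2;
        if (clamp && r > 255) r = 255;  // branch depends on a caller-supplied flag
        return r;
    }
};

// Virtual call: the constant `false` is passed at run time and the clamp
// branch inside the callee is still evaluated.
int viaBase(const Filter& f, int v) { return f.apply(v, false); }

// Direct call on a final class: after inlining, `clamp` is a known constant,
// so the branch is dead code and the body can reduce to `v * 2`.
int viaFinal(const Gain& g, int v) { return g.apply(v, false); }
```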

Spectre Made Production Costs Worse

For much of C++’s history, virtual dispatch on predictable monomorphic call sites was essentially free on modern out-of-order CPUs. The indirect branch predictor learned the target and the pipeline never stalled. Spectre changed that.

The Spectre v2 mitigations replaced indirect branches with retpoline sequences so that speculative execution could not leak memory through a poisoned branch predictor. On hardware without eIBRS (Enhanced Indirect Branch Restricted Speculation, available from Intel’s Cascade Lake generation in 2019), every indirect call in a retpoline-mitigated binary costs 30 to 80 cycles regardless of whether prediction would have been accurate. This is how production kernels and server binaries were typically built; development machines on newer hardware often skip the mitigation.

The practical effect: codebases that benchmarked virtual dispatch as “fast enough” on development hardware found it costing 10x more in production before eIBRS was widespread. This renewed interest in devirtualization techniques around 2018 to 2020.

What Compilers Can Do Without Help

Before reaching for manual static polymorphism, it is worth understanding what modern compilers already handle. GCC and Clang devirtualize in several situations:

Stack-allocated objects with visible type. If you construct a Circle on the stack and immediately call a virtual method on it, the compiler knows the concrete type and calls directly. This requires no source changes.

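The final keyword. Marking a class or method final tells the compiler that no further override can exist. Any call made through a pointer or reference whose static type is the final class (or any call to the final method) becomes a direct call, fully inlinable. Calls made through a base-class pointer are still virtual unless the compiler can prove the dynamic type some other way. final is underused and should generally be the first tool considered. It also serves as documentation.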

class Circle final : public Shape {
public:
    // Direct call wherever the static type is Circle; a call through Shape* is still virtual
    void draw() override { /* ... */ }
};

Link-time optimization. With -flto, GCC runs an interprocedural devirtualization pass that analyzes the entire program’s class hierarchy. If only one override of a method exists across the whole binary, the call gets devirtualized. Clang’s -fwhole-program-vtables with ThinLTO does the same. Google and Meta use this in production.

Profile-guided optimization. PGO records which concrete type appears most frequently at each call site. The compiler emits a type check and an inlined fast path for that type, with a fallback for others. GCC’s speculative devirtualization (-fdevirtualize-speculatively, enabled at -O2) applies the same guarded transformation without a profile, choosing the likely target heuristically.
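The guarded fast path that PGO and speculative devirtualization produce can be approximated by hand. This is a sketch under stated assumptions: compilers typically compare a vtable or function address rather than calling typeid, and areaSpeculative along with the Shape, Circle, and Square definitions here are hypothetical, written only to show the shape of the transformation.

```cpp
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle final : Shape {
    double r;
    explicit Circle(double radius) : r(radius) {}
    double area() const override { return 3.14159 * r * r; }
};

struct Square final : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

double areaSpeculative(const Shape& s) {
    // Guard: is this the type the profile (or heuristic) says is most likely?
    if (typeid(s) == typeid(Circle)) {
        // The qualified name suppresses virtual dispatch, so this is a
        // direct call that the compiler is free to inline.
        return static_cast<const Circle&>(s).Circle::area();
    }
    // Fallback: ordinary virtual dispatch for every other type.
    return s.area();
}
```

The guard pays off only when one type dominates the call site; with an even mix of types, the extra comparison is pure overhead.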

The limits: heap-allocated objects with factory-returned pointers, shared library boundaries, virtual base classes, and call sites where many types genuinely appear with no dominant one. These are exactly the situations where manual static polymorphism becomes relevant.

CRTP: Static Dispatch Before C++23

The Curiously Recurring Template Pattern has been the canonical approach to static polymorphism for decades. The derived class passes itself as a template argument to the base, and the base casts this to the derived type for dispatch.

template<typename Derived>
class Shape {
public:
    void draw() {
        static_cast<Derived*>(this)->drawImpl();
    }

    double area() {
        return static_cast<Derived*>(this)->areaImpl();
    }
};

class Circle : public Shape<Circle> {
public:
    void drawImpl() { /* circle-specific */ }
    double areaImpl() { return 3.14159 * radius * radius; }
private:
    double radius;
};

The static_cast is resolved entirely at compile time. The generated code is a direct function call, inlinable by the compiler, with no vtable, no vptr stored in the object, and no indirect branch. Eigen uses this pattern throughout its matrix expression templates to achieve lazy evaluation and vectorization that would be impossible through virtual dispatch. LLVM uses it similarly in visitor classes such as InstVisitor.

The drawback is structural. Shape<Circle> and Shape<Square> are unrelated types. Storing a mixed collection of shapes requires either a common non-templated base (which reintroduces virtual dispatch) or type erasure. Template instantiation also inflates compile times and binary size, and the static_cast<Derived*>(this) pattern is error-prone to write correctly.
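Both drawbacks can be seen in a short sketch. Because each Shape<Derived> is its own type, generic code has to be a template; and one common guard against inheriting from the wrong instantiation is a private base constructor with the derived class as a friend. The names doubledArea and Square are hypothetical, and the friend-based guard is one convention among several rather than part of the pattern itself.

```cpp
#include <type_traits>

template <typename Derived>
class Shape {
public:
    double area() const {
        return static_cast<const Derived*>(this)->areaImpl();
    }

private:
    // Only Derived may construct Shape<Derived>, so a mistaken
    // `class Square : public Shape<Circle>` fails to compile when constructed.
    Shape() = default;
    friend Derived;
};

class Circle : public Shape<Circle> {
public:
    explicit Circle(double r) : radius(r) {}
    double areaImpl() const { return 3.14159 * radius * radius; }
private:
    double radius;
};

class Square : public Shape<Square> {
public:
    explicit Square(double s) : side(s) {}
    double areaImpl() const { return side * side; }
private:
    double side;
};

// The two bases share no common type, so they cannot sit in one container.
static_assert(!std::is_same<Shape<Circle>, Shape<Square>>::value,
              "each instantiation is a distinct, unrelated type");

// Generic code must itself be a template, instantiated once per derived type.
template <typename D>
double doubledArea(const Shape<D>& s) {
    return 2.0 * s.area();
}
```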

C++23 Deducing This: CRTP Without the Ceremony

C++23 introduced explicit object parameters, standardized through proposal P0847. With the this auto& syntax, the type of self is deduced from the static type of the object expression at each call site, so a call on a Circle gets the same zero-overhead direct dispatch as CRTP, without the template parameter in the class definition.

class Shape {
public:
    void draw(this auto& self) {
        self.drawImpl();
    }

    double area(this auto const& self) {
        return self.areaImpl();
    }
};

class Circle : public Shape {
public:
    void drawImpl() { /* circle-specific */ }
    double areaImpl() const { return 3.14159 * radius * radius; }
private:
    double radius;
};

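The generated code is equivalent to the CRTP version: a direct call that the compiler can inline. Shape is now a single class rather than a distinct instantiation per derived type, so the base no longer needs a template parameter. The deduction uses the static type, however: a call made through a Shape& or Shape* deduces self as Shape, and self.drawImpl() then fails to compile because Shape has no such member. A mixed collection of shapes therefore still needs virtual dispatch or type erasure, just as with CRTP. Const and non-const overloads collapse into one method. Compiler support landed in GCC 14, Clang 18, and MSVC in Visual Studio 2022 17.2.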

This is the pattern to reach for in new C++23 code when you need guaranteed static dispatch on a shared interface.

std::variant for Closed Type Sets

For cases where the set of concrete types is known and fixed at compile time, std::variant with std::visit offers a different trade-off. The dispatch mechanism is a jump table or comparison chain over a small integer discriminant, not a vtable. All type arms are visible to the optimizer for inlining. But the larger benefit is data layout.

using Shape = std::variant<Circle, Rectangle, Triangle>;

void drawAll(std::vector<Shape>& shapes) {
    for (auto& s : shapes) {
        std::visit([](auto& shape) { shape.draw(); }, s);
    }
}

A std::vector<Shape> stores the objects themselves contiguously in memory, with each slot sized for the largest alternative plus a discriminant. A std::vector<Base*> stores pointers to heap-allocated objects that may be scattered across the address space. For large collections traversed repeatedly, the cache locality difference dominates everything else. The Abseil team measured 1.8 to 2.4x throughput improvements in heterogeneous collection traversal from this layout difference alone, in work on protocol buffer parsing.

The constraint is extensibility. Adding a new type requires modifying the variant definition and every visit site that needs to handle it. Plugin architectures and runtime-extensible type systems require virtual dispatch or an equivalent mechanism. variant is not a replacement for polymorphism in general; it is a better tool when the type set is genuinely closed.
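For reference, a self-contained version of the variant approach might look like the following. The member layouts of Circle, Rectangle, and Triangle and the totalArea function are assumptions made for illustration; the earlier snippet does not define them.

```cpp
#include <variant>
#include <vector>

// Plain structs: no base class, no virtual functions, no vptr.
struct Circle {
    double radius;
    double area() const { return 3.14159 * radius * radius; }
};

struct Rectangle {
    double width, height;
    double area() const { return width * height; }
};

struct Triangle {
    double base, height;
    double area() const { return 0.5 * base * height; }
};

using Shape = std::variant<Circle, Rectangle, Triangle>;

double totalArea(const std::vector<Shape>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        // The generic lambda is instantiated once per alternative, so every
        // area() body is visible to the optimizer at this single call site.
        sum += std::visit([](const auto& shape) { return shape.area(); }, s);
    }
    return sum;
}
```

Adding a fourth shape means editing the Shape alias, and any visitor that does not use a generic lambda has to be extended to handle the new alternative, which is the extensibility cost described above.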

When to Use Which

Profiling comes first. Most code paths are not hot enough to care, and virtual dispatch on a code path that executes thousands rather than millions of times per second is not a performance problem.

For code that profiles as slow and involves virtual dispatch, consider the tools in this order: mark stable leaf classes final, check whether LTO is enabled for the build, and add PGO if the call site is reliably monomorphic in practice. These require minimal source changes or none at all, and they frequently resolve the problem.

When the compiler cannot devirtualize and the path is genuinely critical, reach for explicit object parameters in C++23 codebases. The ergonomics are significantly better than CRTP for new code. For existing CRTP-heavy libraries, the patterns are functionally equivalent and there is no strong reason to migrate.

std::variant fits narrowly where the type set is small, known, and unlikely to expand, and where you are iterating over heterogeneous collections in a tight loop. The cache locality benefit is real and worth measuring in that specific scenario.

The underlying principle across all of these is the same: give the compiler enough information to resolve dispatch at compile time, and it will produce better code than you can write manually, through inlining and the cascade of optimizations that inlining enables.