CFI in C++: What Each Implementation Actually Enforces

The central observation in James McNellis’s keynote at Meeting C++ 2025 is deceptively simple: CFI does not fix memory bugs. It constrains what an attacker can do after a memory bug has been exploited. That framing matters because most discussion of CFI drifts toward activation ceremony (which flags to pass, which sanitizers to enable) without confronting what the protection actually does and where it stops.

The Two Attack Classes

Virtual dispatch in C++ emits code roughly like:

mov rax, [rcx]       ; load vptr from object
call [rax + offset]  ; dispatch through vtable slot

The CPU follows that pointer with no validation. A heap buffer overflow that overwrites the vptr in an adjacent object redirects the next virtual call on that object to an attacker-controlled vtable. The attack manipulates data, not code, so it bypasses W^X entirely. Vtable hijacking has been a staple of browser exploitation for over a decade because heap bugs are common and the exploit primitive is reliable.
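The load-then-call shape can be modeled in portable C++ with a hand-rolled vtable. This is only a sketch: every name in it is hypothetical, real compilers lay out vtables differently, and the "overflow" is simulated with an ordinary assignment so that the behavior stays well defined.

```cpp
#include <cassert>
#include <string>

// Hypothetical hand-rolled model of virtual dispatch. Real vtables are
// compiler-generated, but the load-then-call shape is the same.
struct VTable {
    std::string (*describe)(const void* self);
};

struct Object {
    const VTable* vptr;  // first word of the object, like a real vptr
    int value;
};

std::string legit_describe(const void*) { return "legitimate"; }
std::string attacker_describe(const void*) { return "attacker-chosen"; }

const VTable kLegitVTable{&legit_describe};
const VTable kFakeVTable{&attacker_describe};  // stands in for attacker-built data

// Equivalent of: mov rax, [rcx] ; call [rax + offset].
// Nothing here checks that vptr refers to a vtable the program created.
std::string dispatch(const Object& obj) {
    assert(obj.vptr != nullptr);
    return obj.vptr->describe(&obj);
}

// Simulates the effect of a heap overflow reaching the vptr. A real attack
// would corrupt memory; here it is a plain, well-defined store.
void simulate_vptr_overwrite(Object& victim) {
    victim.vptr = &kFakeVTable;
}
```

Calling dispatch on an object before and after simulate_vptr_overwrite runs a different function each time, even though no code was modified. That is the property CFI checks are meant to interrupt.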

Return-Oriented Programming exploits the unconditional behavior of ret:

pop rip  ; conceptual only (not a real instruction): ret loads the return address from the stack and jumps to it

No check. A stack buffer overflow that reaches the saved return address can chain together short existing instruction sequences called gadgets, each ending in ret, to perform arbitrary operations, including system calls, without injecting a single byte of new code. Stack canaries detect sequential overwrites but not targeted writes that skip over the canary; ASLR raises the bar, but its entropy can be insufficient, especially in 32-bit contexts, and addresses can sometimes be leaked. Neither fully closes the door.

CFI addresses both by constraining where indirect transfers are allowed to land.

Software CFI: Clang’s Type-Based Approach

Clang’s -fsanitize=cfi family is the most precise software implementation available. It is also the most demanding to deploy correctly.

The core mechanism for virtual calls, -fsanitize=cfi-vcall, works at the type level. At compile time, Clang builds the full type hierarchy using LLVM’s type metadata. Each class gets a type identifier; the linker merges these identifiers across translation units to compute which vtable pointers are valid targets at each call site. Before every virtual dispatch, the compiler emits a check:

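// Conceptual inline expansion at a virtual call site (not literal compiler output)
uintptr_t vtable_offset = (uintptr_t)vptr - base_of_valid_range;
if (vtable_offset >= range_size)   // unsigned compare also rejects a vptr below the base
    __builtin_trap();
vptr[slot](this);                  // the original virtual dispatch proceeds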

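The actual implementation uses an unsigned subtraction and comparison against a precomputed range, typically two to four instructions; when the valid vtables do not form one perfectly regular block, an alignment or bit-vector test is folded into the same check. At a call site whose static type is some class B, the vtables of B and of every class derived from B are valid targets, because any of them may legitimately stand in for a B. The valid set therefore expands with the depth and breadth of the hierarchy below B. Shallow hierarchies with few implementations give tighter protection than sprawling ones.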
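The reason a single unsigned comparison covers both bounds is easy to miss, so here is a minimal sketch of the range check as a standalone function. The struct, its fields, and the contiguous layout are assumptions made for illustration; this is not Clang's generated code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the vtables valid at one call site, assumed
// here to be laid out contiguously at a fixed stride.
struct ValidRange {
    std::uintptr_t base;   // address of the first valid vtable
    std::uintptr_t size;   // size in bytes of the valid region
    std::uintptr_t align;  // stride between vtable addresses (power of two)
};

// One unsigned subtraction covers both bounds: if vptr < base, the
// subtraction wraps around to a huge value and fails the same comparison
// that rejects addresses past the end.
inline bool vptr_is_valid(std::uintptr_t vptr, const ValidRange& r) {
    assert(r.align != 0 && (r.align & (r.align - 1)) == 0);
    const std::uintptr_t offset = vptr - r.base;
    return offset < r.size && (offset & (r.align - 1)) == 0;
}
```

With a range of base 0x1000, size 0x100, and stride 0x20, the address 0x10E0 passes, while 0x1100 (past the end), 0x0FE0 (below the base, which wraps) and 0x1008 (misaligned) are all rejected.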

This mechanism requires Link-Time Optimization and typically -fvisibility=hidden. Without whole-program visibility, the compiler cannot compute a sound set of valid targets. Shared libraries compiled without CFI create unchecked call boundaries. Cross-DSO CFI support exists but adds overhead and complexity. This is not a theoretical limitation; it is the primary reason software CFI adoption lags behind its documentation.

For indirect function calls, -fsanitize=cfi-icall validates that the target function’s type signature matches the call site. Real-world C++ code frequently casts function pointers in ways that technically violate strict type rules even when the cast is safe in practice. These casts produce CFI violations, and auditing large codebases for them is a significant onboarding cost. Individual functions can opt out with __attribute__((no_sanitize("cfi"))), but doing that thoughtfully requires understanding every affected call site.
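A typical offender is a typed handler cast to a generic callback type. The sketch below, with hypothetical names throughout, shows the flagged pattern in a comment and one common way to remove it: an adapter whose signature matches the call site exactly, so the cast moves from the function pointer to the data pointer.

```cpp
#include <cassert>

// Hypothetical C-style callback API: the call site's static type is
// void (*)(void*).
using Callback = void (*)(void*);

inline void invoke(Callback cb, void* context) {
    assert(cb != nullptr);
    cb(context);  // under -fsanitize=cfi-icall the target's type is checked here
}

// The function we actually want to run takes a typed pointer.
inline void increment(int* counter) { ++*counter; }

// Pattern that would be flagged, and is undefined behavior in C++ anyway:
//   invoke(reinterpret_cast<Callback>(&increment), &n);
// The target has type void(int*), but the call site expects void(void*).

// Type-correct adapter: the signature matches Callback, and the conversion
// happens on the data pointer instead of the function pointer.
inline void increment_adapter(void* context) {
    increment(static_cast<int*>(context));
}
```

Passing &increment_adapter to invoke produces the same effect as the cast version on mainstream ABIs, but without a type mismatch at the indirect call.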

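Clang exposes two violation behaviors. With -fsanitize-trap=cfi, a violation produces an immediate trap with no diagnostic. Building with -fno-sanitize-trap=cfi instead prints a diagnostic before aborting, and adding -fsanitize-recover=cfi lets the program log the violation and continue. Auditing a codebase in recover mode before switching to trap mode is the standard hardening workflow. The full trap-mode invocation looks like: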

clang++ -fsanitize=cfi -flto -fvisibility=hidden -fsanitize-trap=cfi source.cpp

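Performance overhead for software CFI depends heavily on how call-intensive the workload is. The two to four instruction check at every indirect call site can add on the order of 10-15% on benchmarks dominated by virtual calls, depending on hierarchy depth and call frequency. Whole-application figures are much lower: Clang’s documentation reports under 1% for Chromium on the Dromaeo suite.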

Hardware CFI: Intel CET

Intel’s Control-flow Enforcement Technology moves enforcement into the CPU. It shipped in Tiger Lake and later processors (11th-generation Core and beyond, available since late 2020). CET has two independent primitives with different performance characteristics.

Indirect Branch Tracking (IBT) requires that every legitimate indirect branch target begin with an ENDBR64 instruction:

endbr64          ; F3 0F 1E FA, required at valid indirect targets
push rbp
mov rbp, rsp
; function body

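ENDBR64 decodes as a NOP in normal sequential execution, and on processors that predate CET, so its cost there is negligible. When the CPU executes an indirect call or jump, it enters a WAIT_FOR_ENDBR state. If the next instruction is not ENDBR64, the processor raises a #CP (Control Protection) exception, and the operating system typically terminates the process. An attacker redirecting a virtual dispatch to an arbitrary address almost certainly lands mid-function rather than at an ENDBR64-marked entry point, so the hijacked call faults before the attacker's chosen code runs. IBT is coarse-grained, though: any ENDBR64-marked entry point is an acceptable target for any indirect branch.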
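As a toy model of that rule, the check reduces to "do the bytes at the target decode as ENDBR64?". The function below is a software illustration with hypothetical names, not how the hardware is implemented; it returns false where a real CPU would raise #CP.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ENDBR64 encoding: F3 0F 1E FA.
constexpr std::array<std::uint8_t, 4> kEndbr64{{0xF3, 0x0F, 0x1E, 0xFA}};

// Toy model of the test applied while the CPU is in WAIT_FOR_ENDBR: the
// instruction at the indirect branch target must be ENDBR64.
inline bool indirect_target_allowed(const std::vector<std::uint8_t>& code,
                                    std::size_t target) {
    // Reject targets that leave no room for a full ENDBR64.
    if (target > code.size() || code.size() - target < kEndbr64.size())
        return false;
    for (std::size_t i = 0; i < kEndbr64.size(); ++i) {
        if (code[target + i] != kEndbr64[i])
            return false;
    }
    return true;
}
```

For a buffer holding endbr64, push rbp, mov rbp, rsp, ret, only offset 0 is an acceptable indirect target. Offset 4, the push rbp in the middle of the function, is rejected.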

Shadow Stack (SHSTK) addresses the backward edge. The CPU maintains a parallel stack in hardware-enforced write-protected memory. Normal store instructions cannot write to shadow stack pages; only call and a small set of dedicated shadow-stack instructions can, such as WRSS, which the operating system must explicitly enable. (INCSSP, by contrast, only advances the shadow stack pointer.) Every call pushes the return address to both the normal and shadow stacks; every ret compares the normal stack value against the shadow stack top. A mismatch raises #CP. A stack buffer overflow can corrupt the normal stack’s return address all it wants. The shadow stack remains untouched, and the comparison fails.
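The push-twice, compare-on-return behavior can be sketched as a small class. This is a toy model with hypothetical names: the write protection that hardware enforces on shadow stack pages is represented only by the fact that no member function other than call and ret touches the shadow copy.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy model of CET shadow stack semantics.
class ShadowedCallStack {
public:
    // call: the return address is pushed to both stacks.
    void call(std::uint64_t return_address) {
        normal_.push_back(return_address);
        shadow_.push_back(return_address);
    }

    // Models a stack buffer overflow: ordinary stores can reach the saved
    // return address on the normal stack, but never the shadow copy.
    void overwrite_saved_return(std::uint64_t forged) {
        assert(!normal_.empty());
        normal_.back() = forged;
    }

    // ret: pop both copies and compare. A mismatch models the #CP exception.
    std::uint64_t ret() {
        assert(!normal_.empty() && !shadow_.empty());
        const std::uint64_t from_normal = normal_.back();
        const std::uint64_t from_shadow = shadow_.back();
        normal_.pop_back();
        shadow_.pop_back();
        if (from_normal != from_shadow)
            throw std::runtime_error("#CP: return address mismatch");
        return from_normal;
    }

private:
    std::vector<std::uint64_t> normal_;
    std::vector<std::uint64_t> shadow_;
};
```

An untouched frame returns normally; a frame whose saved return address was overwritten throws at ret, before control would have reached the forged address.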

Performance is significantly better than software CFI because enforcement happens in dedicated silicon rather than through emitted instruction sequences. Intel reports roughly 2-4% overhead for IBT on typical workloads and under 1% for shadow stack. GCC and Clang expose this via:

-fcf-protection=branch   # IBT only
-fcf-protection=return   # shadow stack only
-fcf-protection=full     # both

These flags emit ENDBR64 at appropriate sites and mark the resulting binary as CET-compatible; the CPU and operating system do the rest. Windows 10 20H1 introduced user-mode CET shadow stacks on compatible hardware. Linux kernel 6.6 added user-mode shadow stack support. The coverage still has a gap: libraries compiled without -fcf-protection=branch lack ENDBR64 at their entry points. IBT only fires on indirect transfers, not direct calls, so this matters primarily at indirect call boundaries such as callbacks and calls through function pointers.
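One low-cost audit step is to check which mode a translation unit was actually compiled with. GCC and Clang define the __CET__ macro on x86 when -fcf-protection is active, with bit 0 set for branch and bit 1 set for return. The helper names below are hypothetical, and the result describes only how this file was built, not whether the operating system is enforcing CET at runtime.

```cpp
#include <cassert>
#include <string>

// __CET__ is defined by GCC and Clang on x86 when -fcf-protection is on:
// bit 0 = branch (IBT), bit 1 = return (shadow stack).
#if defined(__CET__)
constexpr unsigned kCetMask = __CET__;
#else
constexpr unsigned kCetMask = 0;
#endif

inline bool built_with_ibt() { return (kCetMask & 1u) != 0; }
inline bool built_with_shadow_stack() { return (kCetMask & 2u) != 0; }

// Maps the mask back to the -fcf-protection value that would produce it.
inline std::string cet_build_summary() {
    if (built_with_ibt() && built_with_shadow_stack()) return "full";
    if (built_with_ibt()) return "branch";
    if (built_with_shadow_stack()) return "return";
    return "none";
}
```

Some distribution toolchains enable -fcf-protection by default, so a build with no explicit flag may still report "full".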

ARM Pointer Authentication

ARM took a cryptographic approach on ARMv8.3-A and later, including all Apple Silicon. Pointer Authentication (PAC) signs pointers before storing them:

; AArch64 function prologue with PAC enabled
paciasp              ; sign return address in LR with instruction key A, using SP as the modifier
stp x29, x30, [sp, #-16]!

; Function epilogue
ldp x29, x30, [sp], #16
autiasp              ; authenticate and strip signature
ret

paciasp computes a pointer authentication code from the return address, a per-process secret key, and the current stack pointer as a modifier, and stores it in the otherwise unused upper bits of the register. autiasp verifies the code and strips it before use. An attacker who overwrites the saved x30 cannot forge the authentication code without knowing the key. Authenticating the corrupted pointer yields an invalid address, and the ret faults. Branch Target Identification (BTI), introduced in ARMv8.5-A, provides the analogous forward-edge protection, requiring bti instructions at valid indirect-branch targets. Both are enabled with:

-mbranch-protection=standard

The overhead is low: authentication takes only a few cycles and appears only at function boundaries. Pointer authentication is used throughout Apple’s arm64e system software, and Android can build platform code with PAC and BTI for devices with qualifying ARM hardware.
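The sign, authenticate, and strip flow can be sketched in ordinary C++ as a toy model. Everything here is an assumption made for illustration: the 48-bit address width, the function names, and above all the mixing function, which is a stand-in with no security value. Real PAC uses a hardware cipher with keys held in system registers that user code cannot read.

```cpp
#include <cassert>
#include <cstdint>

// Assumed virtual address width; the bits above it hold the signature.
constexpr int kAddressBits = 48;
constexpr std::uint64_t kAddressMask = (std::uint64_t{1} << kAddressBits) - 1;

// Placeholder keyed mix producing a 16-bit code. NOT cryptographically
// secure; it only stands in for the hardware's MAC.
inline std::uint64_t toy_mac(std::uint64_t address, std::uint64_t modifier,
                             std::uint64_t key) {
    std::uint64_t x = address ^ (modifier * 0x9E3779B97F4A7C15ull) ^ key;
    x ^= x >> 33;
    x *= 0xFF51AFD7ED558CCDull;
    x ^= x >> 33;
    return x >> kAddressBits;  // keep the top 16 bits
}

// Model of paciasp: place the code in the unused upper bits.
inline std::uint64_t sign(std::uint64_t address, std::uint64_t sp,
                          std::uint64_t key) {
    assert((address & ~kAddressMask) == 0);  // expects an unsigned address
    return address | (toy_mac(address, sp, key) << kAddressBits);
}

// Model of autiasp: recompute the code, compare, and strip on success.
inline bool authenticate(std::uint64_t signed_ptr, std::uint64_t sp,
                         std::uint64_t key, std::uint64_t& stripped) {
    const std::uint64_t address = signed_ptr & kAddressMask;
    if (sign(address, sp, key) != signed_ptr)
        return false;  // real hardware yields a faulting address instead
    stripped = address;
    return true;
}
```

A pointer signed and authenticated with the same key and stack pointer round-trips to the original address, and flipping any signature bit makes authentication fail. In this model a replaced address is accepted only on a 16-bit code collision, roughly one chance in 65,536 per attempt.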

The Deployment Gap

McNellis’s keynote is useful precisely because it comes from someone who has deployed these protections at production scale at Microsoft, where Control Flow Guard (/guard:cf) has been enabled across Windows system binaries for years. CFG validates that indirect call targets are registered valid function entry points, coarser than Clang’s type-based check but effective and widely deployed. The practical lesson from that experience is that the gap between enabling a flag and claiming actual protection is larger than documentation suggests.

Coverage is not binary. Software CFI that covers your application binary does not extend to CFI-unaware dependencies. A virtual dispatch into a shared library compiled without LTO and -fsanitize=cfi is unchecked at that boundary. Third-party libraries, system libraries compiled with different toolchains, and dynamically loaded plugins all create gaps. Hardware CFI narrows this because it requires only that target code emits ENDBR64 at entry points, not full LTO, but the core issue remains: the weakest link determines the coverage.

The overapproximation problem affects software CFI specifically. The valid-target set at any call site is all type-compatible targets in the program. In a codebase with deep inheritance hierarchies, that set can be large. Attacks that stay within the valid-target set, such as COOP (Counterfeit Object-Oriented Programming), remain possible in principle against software CFI. The attack surface is dramatically reduced compared to unchecked vtable dispatch, but it is not zero.

JIT compilation is a structural problem for any CFI scheme. JIT engines generate code at runtime, producing new function entry points that cannot appear in a compile-time valid-target set. Browsers require specialized CFI integration for JIT-compiled code regions: emitting ENDBR64 at generated entry points under IBT, or registering new targets at runtime through APIs such as SetProcessValidCallTargets under CFG. This is nontrivial and is one reason browser security teams maintain bespoke CFI tooling.

Real-World Adoption

Chromium has shipped official builds with -fsanitize=cfi-vcall and -fsanitize=cfi-icall for years, most notably on Linux. It is the most thoroughly documented large-scale software CFI deployment and the reference point for anyone evaluating software CFI overhead on real code. The Android build system uses Clang for system components and has progressively enabled CFI across platform libraries; on Android devices with qualifying ARM hardware, PAC and BTI add hardware enforcement. The Windows kernel uses CFG and hardware CET protections on compatible hardware. Chrome, Windows, and Android together represent a substantial fraction of deployed software, which makes CFI less exotic than it might appear from the tooling documentation.

What CFI Changes About Your Security Posture

CFI is a containment strategy, not a prevention strategy. A program protected by full Clang CFI, Intel CET IBT, and shadow stack is a significantly harder exploitation target. Vtable hijacks fail the type check or the ENDBR64 check. ROP chains fail the shadow stack comparison. That is real, measurable hardening.

What it does not do is eliminate the memory corruption bug that gave the attacker a foothold. Memory-safe languages address the cause; CFI addresses the consequence. For the substantial portion of the software ecosystem that will remain in C++ for years, understanding CFI at the implementation level, not just the flag level, is the difference between checking a security box and actually closing attack paths. The McNellis keynote is a practical entry point. The harder work is understanding what your specific codebase and its dependency graph need before the flag actually does what you think it does.