C's Undefined Behavior Was Never One Thing: The Formal Split Coming in C2Y
Source: lobsters
The C standard has listed undefined behavior in Annex J.2 since C89. Over 200 entries, each with identical formal status: the implementation may do anything. This monolithic treatment was never a description of a homogeneous phenomenon. It was a filing cabinet where the committee placed everything it could not, or would not, decide on behalf of all conforming implementations.
WG14 paper N3861, titled “Ghosts and Demons: Undefined Behavior in C2Y,” makes the formal argument that this conflation has produced serious problems and proposes a taxonomy to address them. C2Y is the working name for the next C standard after C23, targeting publication around 2028-2029.
The Folklore and the Precision Behind It
The “nasal demons” phrase originated in a 1992 post on comp.std.c by John F. Woods, who observed that the C standard technically permits a conforming implementation to cause demons to fly from the programmer’s nose in response to undefined behavior. The joke was technically precise: undefined behavior does not mean “your program crashes” or “you get garbage.” It means the standard imposes no requirements on the implementation whatsoever.
N3861’s title borrows this vocabulary with two distinct meanings operating simultaneously.
In formal semantics, “demonic nondeterminism” is a term from refinement calculus and predicate transformer semantics. A demonic choice is made adversarially; the nondeterministic agent picks the worst possible outcome for program correctness. The usage in N3861 is specific to that definition: it describes exactly what the optimizer does when it encounters undefined behavior. The optimizer has a proof obligation: “this code path can only be reached if this precondition holds.” When that precondition is the absence of UB, the optimizer concludes that code reachable only via UB is dead, and removes it.
“Ghost” variables come from Hoare logic and separation logic. A ghost is a logical quantity present in the proof but absent from runtime representation. Pointer provenance in C is ghost-like in exactly this sense: a formal model of C memory semantics must track which allocation a pointer came from, even when that information is absent from the pointer’s bit representation after an integer round-trip through uintptr_t. The PNVI provenance model developed at Cambridge formalizes this tracking.
So “ghosts and demons” simultaneously references thirty years of C folklore and two precise technical terms from program verification literature. That dual usage is intentional and carries real meaning throughout the paper.
What Makes Two UBs Different
The core insight in N3861 is that Annex J.2 has always contained two fundamentally different kinds of undefined behavior, conflated by identical treatment.
The first kind: UBs left undefined because the 1989 hardware landscape was genuinely diverse. Ones’-complement arithmetic, sign-magnitude integers, trap representations. These required undefined behavior because specifying behavior across all conforming hardware was either impossible or would have mandated a software emulation layer. Call these the ghosts: they persist in the standard, but no current compiler exploits them for optimization. They are harmless remnants of dead hardware.
The second kind: UBs that compilers actively use as optimizer handles. Signed integer overflow, strict aliasing violations, null pointer arithmetic. These work because the optimizer can treat UB as proof that certain execution paths are unreachable, then eliminate code depending on that proof. Call these the demons: active, exploited, and responsible for a substantial fraction of security vulnerabilities in C codebases.
Both kinds carry the same formal status in the current standard. Reclassifying a ghost costs nothing: no compiler changes, no performance regression, just formal acknowledgment of reality. Reclassifying a demon has real costs and requires committee agreement on what the replacement behavior should be.
The Erroneous Behavior Tier
The most concrete C2Y proposal in N3861 is a new formal tier called “erroneous behavior” (EB). It occupies the space between undefined behavior and implementation-defined behavior.
Under EB:
- The program has committed a programming error.
- The implementation must produce some defined response: trap, produce an unspecified value, or substitute an implementation-specific result.
- The implementation cannot use the erroneous operation as justification for eliminating surrounding code through optimizer proof reasoning.
Consider signed integer overflow. Under current C, overflow is UB, so GCC and Clang at -O1 or above will compile bool will_overflow(int x) { return x + 1 > x; } to return true. The overflow makes the false branch unreachable by UB reasoning, so the comparison is eliminated entirely. Chris Lattner documented this class of transformations in 2011. The transformation is correct per the standard and produces a security vulnerability when the original intent was a bounds check.
Under erroneous behavior, an implementation may:
- Trap, as UBSan does.
- Produce two’s-complement wraparound, as -fwrapv does.
- Produce an unspecified but representable int value.
All three are valid EB implementations. What the optimizer loses is the license to treat overflow as proof that surrounding code is dead. The bounds-check elimination disappears as valid behavior.
The same logic applies to uninitialized reads. Under EB, the read produces an unspecified value, whatever bits happen to be in memory or registers, matching physical reality. The compiler cannot globally eliminate surrounding code on the basis of the uninitialized read. GCC’s -ftrivial-auto-var-init and LLVM’s automatic stack initialization become conforming EB implementations rather than unofficial debugging extensions.
What This Changes for Tooling
The EB tier has a non-obvious consequence for sanitizer tooling. Under current C, UBSan operates outside the standard model. A program that traps under UBSan is not more correct per the standard than one that continues running. This creates a practical problem: the sanitizer instrumentation pass must run before optimization, or the optimizer may legitimately eliminate the code paths the sanitizer was meant to check.
Under erroneous behavior, UBSan-style instrumentation becomes a valid conforming implementation choice. The standard requires the implementation to produce a defined response to erroneous operations; trapping is one such response. This formal alignment matters for safety-critical deployments where the difference between “non-standard debugging tool” and “conforming implementation behavior” carries regulatory weight.
The Linux kernel has dealt with the practical consequences of demon UBs for years. The kernel’s build configuration has carried -fno-strict-aliasing, -fwrapv, and -fno-delete-null-pointer-checks as standard flags because the alternative is compilers eliminating security-critical checks. A 2009 null pointer vulnerability in the tun device was caused directly by GCC eliminating a null check that was preceded by a dereference of the same pointer, a transformation the standard permits. These flags are workarounds for a standard model that does not match how systems programmers reason about their code.
The Cross-Language Parallel
Zig took the most direct approach to this problem. It distinguishes build modes explicitly:
// In Debug and ReleaseSafe: traps with a stack trace
// In ReleaseFast: unchecked, i.e. undefined behavior (like C at -O2)
var x: i32 = std.math.maxInt(i32);
x += 1; // overflow
// Explicit operators for all build modes:
x +%= 1; // wrapping
x +|= 1; // saturating
Zig calls these situations “detectable illegal behavior” rather than undefined behavior, which is more accurate about the operational semantics. The distinction between “the compiler may assume this never happens” and “this will panic in debug builds” is one C has never formally drawn. N3861’s erroneous behavior tier is an attempt to draw it.
Rust went further: safe code has no undefined behavior by design. The borrow checker handles memory safety statically; bounds checks handle the rest at runtime. For arithmetic, Rust separates modes explicitly: + panics on overflow in debug, wraps in release. The checked_add, wrapping_add, saturating_add, and overflowing_add methods cover every case explicitly in any build mode. The Unsafe Code Guidelines project is doing formally for unsafe Rust what N3861 is doing for C: enumerating exactly what UB exists and what its semantics are.
The C++ committee is tracking the same territory through P2795R5, “Erroneous Behaviour for Uninitialized Reads,” in the C++26 cycle. WG14 and WG21 share participants through Study Group 12, and the erroneous behavior terminology has cross-pollinated in both directions. C++23 adopted std::unreachable() on the same timeline as C23’s unreachable().
What Actually Has to Change
The ghost UBs are straightforward. They can be reclassified as implementation-defined or unspecified with no compiler changes and no performance cost. The committee needs only to acknowledge that the hardware justification is gone.
The demon UBs are harder. The core contention in WG14 is not what the replacement behavior should be, but what the performance cost of removing optimizer licenses would be. The benchmarks showing vectorization regressions under -fwrapv are real. So are the CVEs from optimized-away overflow checks. The erroneous behavior tier is a structural proposal to let implementations maintain current performance while conforming to a more constrained standard: behavior that still enables optimization, but within bounds that exclude code elimination based on UB reasoning.
C2Y will not arrive before 2028. N3861 is a working paper in the early stages of the cycle. The formal audit of Annex J.2 entries against the ghost/demon taxonomy is not complete. The erroneous behavior tier still needs formal specification text. The pointer provenance work from the Cerberus project at Cambridge still needs integration.
The direction is clear, though. C’s monolithic undefined behavior category is being formally split for the first time since C89, and the split follows a principled taxonomy rather than ad hoc exceptions. Whether the final standard narrows demons enough to satisfy security engineers while preserving enough optimizer latitude to satisfy performance engineers remains an open question. The fact that WG14 is asking it formally, with vocabulary borrowed from refinement calculus and separation logic, represents a significant shift in posture from every previous C standardization cycle.