· 6 min read ·

The Discriminated Union Emacs Had to Build by Hand

Source: lobsters

Emacs Lisp has one universal value type: Lisp_Object. Every integer, every string, every cons cell, every buffer object exists as a value of that type. At the machine level, it is a 64-bit integer. The entire type system is encoded into that integer by hand, using bit manipulation macros in src/lisp.h. This walkthrough of the implementation makes the mechanics concrete: three low-order bits carry a type tag, heap pointers fill the upper bits, and fixnum values live directly in the word without any allocation.

What makes this worth examining beyond the bit arithmetic is the question of why it exists this way. C does not have discriminated unions. Every other approach Emacs could have taken would have been worse given its constraints, and tracing through why reveals something about what dynamic type dispatch actually requires of a runtime.

What the Language Does Not Provide

A discriminated union is a type that holds one of several possible variants, with a tag identifying which variant is currently present. In Rust, an enum is exactly this:

enum LispObject {
    Symbol(NonNull<Symbol>),
    Fixnum(i62),
    Str(NonNull<LispString>),
    Cons(NonNull<Cons>),
    Float(NonNull<f64>),
    Vectorlike(NonNull<VectorlikeHeader>),
}

The compiler chooses a layout, inserts a discriminant, and generates match-arm dispatch. You cannot access the inner value of a Cons variant when the discriminant says Fixnum. The type system enforces this at compile time.

C has union, but a C union is just overlapping storage. There is no discriminant, no enforcement, and no tracking of which variant is live. A discriminated union in C means writing one yourself: a tag integer plus a payload union, or a tagged pointer scheme where the tag lives in the low bits of a pointer-width integer.

Emacs chose the tagged pointer approach rather than a separate tag field for one primary reason: it avoids allocating integers. A separate-tag scheme like { int type; union { long int_val; void *ptr; } data; } requires two words per value. For a text editor that manipulates enormous quantities of small values, cons cells and characters and fixnums, that doubles the memory pressure on the garbage collector. The tag-in-pointer approach fits a fully typed value in a single machine word, and small integers require no heap allocation at all.

The remacs Lesson

Between roughly 2017 and 2020, the remacs project attempted to port Emacs’s C core to Rust incrementally, replacing one C file at a time while keeping the build functional. When they reached Lisp_Object, the representation question was unavoidable.

The natural Rust choice would be an enum. The remacs developers chose a newtype wrapper instead:

#[repr(C)]
pub struct LispObject(pub EmacsInt);

The bit layout is identical to C’s Lisp_Object. Tag extraction is manual bitmask arithmetic. Pointer recovery is explicit subtraction of the tag value. Everything the Emacs C code does is mirrored in Rust by hand.

The reason is ABI. Rust functions in remacs needed to be called from C and to call C functions, passing LispObject values across the boundary. A Rust enum has a compiler-chosen layout. It might place the discriminant at a different offset, choose a different size for the payload union, or apply niche optimizations that compress certain variants. Any of these would break the assumption that a LispObject value means the same thing on both sides of the FFI boundary.

ABI compatibility forced remacs to think in exactly the same terms as the original C code. The hand-rolled tagged pointer scheme is the interface contract, and interface contracts must be stable across every platform and compiler Emacs has ever supported. The cost of adopting a language-native discriminated union turned out to be: you cannot, without giving up C interoperability.

What the Garbage Collector Needs

Emacs uses a precise garbage collector, not a conservative one. A conservative GC treats any word that looks like a valid heap address as a potential pointer, never missing a live object but potentially retaining dead objects that coincidentally resemble addresses. It cannot safely move objects, because an integer that happens to equal a heap address would become a dangling reference after relocation.

A precise GC knows exactly which words are pointers. It can trace only live pointers, and it can support object relocation. Emacs’s GC is precise in this sense: it knows the exact type of every Lisp_Object slot and therefore knows which slots hold pointers and which hold immediate values.

The tagged pointer scheme gives the GC this knowledge cheaply. To decide whether a given Lisp_Object holds a pointer, the GC reads the three low bits and compares against the pointer tag values. Fixnum tags mean no heap reference. Cons, string, float, symbol, and vectorlike tags mean a pointer that must be traced. No secondary table or runtime type descriptor is required.

For vectorlike objects, the GC also needs to know how many Lisp_Object slots each object contains. The vectorlike_header.size field encodes this count in its lower bits, with the pseudovector subtype in the upper bits. The GC reads the header, masks off the upper bits, and scans that many consecutive slots. The scheme is uniform across every vectorlike type, whether it is a hash table, a window, a frame, or a compiled function. The representation was designed around GC requirements; the uniformity is intentional.

The Aliasing Problem in the Background

Emacs has far more than eight built-in object types. There are only eight possible three-bit tags. The solution is a two-level dispatch: all types beyond the core set share the Lisp_Vectorlike tag, and a pvec_type value in the header identifies the specific subtype.

Every pseudovector struct begins with an identical first member:

struct buffer {
    union vectorlike_header header;
    Lisp_Object name_;
    /* ... */
};

struct window {
    union vectorlike_header header;
    Lisp_Object frame;
    /* ... */
};

Any pseudovector pointer can be cast to struct Lisp_Vector * to read the header. C99 guarantees this is safe via the common-initial-sequence rule, which allows accessing common initial members of structs through a containing union. Emacs sometimes performs these casts without going through an explicit union, which pushes past what the standard strictly permits under strict aliasing. The compiler is allowed to assume that a struct buffer * and a struct Lisp_Vector * cannot point to the same memory and to optimize accordingly.

Emacs’s build process disables the optimization by adding -fno-strict-aliasing to the compiler flags. This is the practical engineering choice for a codebase of this age. The alternative, auditing every pseudovector cast and routing them through properly declared union types, would require changes across hundreds of files for a correctness guarantee that convention already enforces.

This is worth knowing if you are using the pattern elsewhere. The cast from a common-header struct to its parent type is valid in the C standard’s intent, but triggering strict-aliasing undefined behavior is easy and the bug it introduces is subtle: the optimizer quietly produces incorrect code that passes most tests. Disabling the optimization is a blunt but reliable fix.

A Design Under Pressure

The tagged pointer scheme and the stop-the-world mark-and-sweep GC that relies on it have served Emacs for decades. But work on improving the GC is ongoing, and the interface between the representation and the collector is where the complexity accumulates.

Recent development work has included experimental integration with the Ravenbrook Memory Pool System, a precise GC library with generational and incremental collection. Integrating MPS with Emacs requires giving it a format descriptor for every managed object type, telling it how to find live pointers inside each one. For Lisp_Object, that means the tag scheme must be legible to MPS’s scanning protocol. The existing representation makes this tractable: the tag rules and the vectorlike header slot-count encoding give MPS everything it needs without additional metadata.

This is the deepest reason the representation exists in its current form. It is not only an optimization for integer storage, and it is not only an ABI requirement for external compatibility. It is the interface between Emacs Lisp’s semantics and every garbage collector the project will ever use. The hand-built discriminated union in lisp.h is load-bearing in a way that spans the entire system, which is why a design from the 1980s still defines the shape of current development work.

Was this interesting?