· 8 min read ·

SSA's Free Lunch: How Cranelift Made Equality Saturation Practical

Source: lobsters

The phase-ordering problem has been a known nuisance in compiler construction for decades. Run constant folding before GVN and you miss opportunities that GVN would have exposed; run GVN first and constant folding might have simplified the inputs before GVN needed to work. LLVM’s answer is to run InstCombine multiple times at different points in the pipeline, surround it with EarlyCSE, GVN, SCCP, and half a dozen other passes, and tune the ordering empirically over years. The result is correct, but it is not principled, and the interaction surface between passes grows every time someone adds a new one.

The theoretical fix has existed since at least 2009, when Tate et al. published “Equality Saturation: A New Approach to Optimization” at POPL. The idea is to apply rewrite rules non-destructively: instead of replacing expressions, you accumulate equivalences in a data structure called an e-graph, where equivalence classes (e-classes) group all known-equivalent forms of a value. Once all rules have been applied to fixpoint, you extract the cheapest program from the saturated graph. Because you never throw information away, phase ordering stops being a concern. Everything is visible at extraction time.

The egg library (POPL 2021 Distinguished Paper, Willsey, Nandi, Wang, Flatt, Tatlock, Panchekha) made equality saturation fast enough to matter, via a “rebuilding” algorithm that deferred e-graph invariant maintenance to a batched phase after each iteration rather than after every individual union. Egg found real applications: Herbie uses it for floating-point accuracy optimization, Tensat uses it for tensor graph rewriting, and a raft of research prototypes followed. What egg did not do was ship inside a production JIT.

Cranelift, the compiler backend behind Wasmtime and Firefox’s WebAssembly compiler, is now the first production compiler to deploy equality saturation in its mid-end optimizer. Chris Fallin’s writeup explains the design, which goes by the name aegraph (acyclic e-graph). The key insight is something the e-graph research community didn’t leverage because research compilers don’t usually start from SSA: in SSA form, the def-use graph is already a DAG.

Why Cycles Are the Hard Part

A standard e-graph can represent cyclic structures. An e-class for a value can contain e-nodes that reference other e-classes that eventually reference the original, through a chain of equalities discovered by rewriting. This is fine for term rewriting systems and theorem provers where you might want to represent x = f(f(f(...))). For a compiler optimizer, it creates three compounding problems.

First, extraction from a cyclic e-graph is NP-hard in general. Choosing the minimum-cost representative from each e-class resolves to a minimum-cost hypertree selection problem, equivalent to weighted tree automaton intersection. Egg provides hooks for custom extractors specifically because this is unsolvable in general and practical extractors need to cut corners. For a JIT serving latency-sensitive requests at Fastly’s edge compute infrastructure, worst-case NP extraction is not acceptable.

Second, standard e-graphs have no notion of dominance. SSA’s fundamental property is that every value is defined exactly once and every use is dominated by its definition. An e-graph rewrite that moves a computation to a new location has no mechanism to verify that all operands are still available there, or that the move doesn’t introduce a use-before-definition violation.

Third, the union-find data structure at the heart of standard e-graphs requires propagating changes transitively through parent pointers whenever two classes are merged. Each merge potentially touches a large fraction of the graph, and the need for congruence closure, where merging two e-classes forces merging of all e-nodes with those classes as children, compounds the work. Egg’s rebuilding defers this maintenance to batch phases, which is why it’s faster than prior implementations, but the fundamental transitive cost remains.

The SSA Observation

Cranelift’s IR is CLIF, a conventional SSA IR with explicit basic blocks and block parameters as its phi-equivalent. In SSA form, if you draw the def-use graph and ignore back-edges from loop bodies to loop headers, the result is a directed acyclic graph. Values are defined once; uses are dominated by definitions; the topological order of definitions is a total order compatible with def-use flow.

The aegraph enforces one invariant: every e-node’s children must reference e-classes defined strictly earlier in topological order. Forward references only. This is exactly what SSA already guarantees. The constraint is not an artificial restriction; it is a reflection of what SSA programs structurally are.

This single change collapses the three problems above. Extraction from an acyclic e-graph is O(n) by recursive descent from roots, with memoization to avoid re-processing shared nodes. No NP-hard tree selection is needed because cycles cannot create ambiguous sharing patterns. Dominance is preserved structurally, because any e-node reachable by following child references is, by the acyclicity constraint, defined before the node that references it. The union-find simplification becomes trivial: when a rewrite fires and adds a new equivalent form for a value, that form is appended to the value’s e-class list. No pointer chasing, no transitive propagation, no congruence closure needed in the cyclic sense, because the acyclicity prevents the problematic patterns that congruence closure exists to handle.

Eager Rewriting and the Completeness Argument

Egg applies rules iteratively to the entire e-graph until global fixpoint. This is necessary in the general case because a rewrite anywhere can expose opportunities elsewhere, including at already-processed nodes, via backward propagation through cycles and equivalences.

With an acyclic graph and a topological construction order, Cranelift can apply rules eagerly: when a new e-node is inserted, fire all matching rules immediately. Because the graph is acyclic and all children of the current node have already been fully processed (the acyclicity guarantee ensures this in topological order), no future child state can retroactively expose new opportunities on the current node. One forward pass is complete for the DAG case.

In practice, Cranelift does not run to true saturation even with this property. It applies a bounded rule set in a single forward pass and then extracts. True saturation would find more equivalences at the cost of more compile time. For Wasm JIT workloads, the productive rules (constant folding, algebraic identities, strength reduction) fire in the first pass, and the trade-off is explicitly in favor of bounded compilation latency.

Scoped Elaboration

Extraction from the aegraph is called elaboration in Cranelift’s source. It is a recursive descent from roots (values actually used: block terminators, effectful instructions) that selects the cheapest e-node from each e-class’s list using an opcode cost model and recursively elaborates operands. Values shared by multiple roots are elaborated once and reused.

The placement of elaborated values is dominance-aware. Each e-class carries the loop depth at which its defining instruction originally appeared. During elaboration, a value is placed at the shallowest loop depth at which all its operands are available. Three optimizations fall out of this placement logic as structural consequences, without separate analysis passes:

Loop-Invariant Code Motion: Any computation whose operands are all defined outside a loop body has a shallow available depth and gets placed before the loop header. LICM requires no additional analysis.

Global Value Numbering: Two e-nodes with the same opcode and the same child e-classes are structurally identical. When the e-graph is constructed, they resolve to the same e-class by congruence. During elaboration, they materialize as the same SSA value. GVN is a consequence of e-class identity.

Dead Code Elimination: Any value not reachable from a root during elaboration is never placed. DCE is a consequence of recursive descent from roots.

These were separate passes in Cranelift’s previous mid-end. The aegraph does not replace them with better versions of themselves; it replaces them with a single data structure construction and extraction that subsumes what they were each doing independently.

ISLE: One Language for Both Levels

Cranelift’s ISLE language (Instruction Selection and Lowering Engine) was designed for backend instruction selection, lowering CLIF to machine instructions via pattern-matching rules compiled at build time into Rust decision trees. The aegraph reuses the same language for mid-end rewrites.

;; Additive identity
(rule (simplify (iadd ty x (iconst ty 0))) x)

;; Strength reduction: multiply by power of two  
(rule (simplify (imul ty x (iconst ty 2)))
      (ishl ty x (iconst ty 1)))

;; Constant folding
(rule (simplify (iadd ty (iconst ty k1) (iconst ty k2)))
      (iconst ty (add_overflow ty k1 k2)))

In instruction selection, a matching rule replaces its input with a machine instruction sequence. In the aegraph, a matching rule adds the result as a new equivalent in the matching e-class. The rule syntax is identical; the framework determines the semantics. One rule language, one compiler for it, one testing infrastructure, one mental model.

Contrast this with LLVM, where InstCombine rewrites are hand-written C++ in lib/Transforms/InstCombine/, instruction selection patterns are in TableGen .td files, and there is no formal connection between the two. Adding a new peephole optimization in LLVM means finding the right C++ function in a tens-of-thousands-of-lines file and reasoning manually about interactions with every other pattern. In Cranelift, it means writing one ISLE clause. ISLE rules are also amenable to automated verification via SMT solving, a capability the C++ approach lacks entirely.

Where This Sits in the Landscape

Other approaches have addressed the same underlying problem from different angles. The Sea of Nodes IR used in V8’s TurboFan and historically in HotSpot C2 unifies data and control dependencies as graph edges, giving natural code motion and GVN. It does not maintain equivalence classes and cannot defer the choice between equivalent forms until extraction time. Sea of Nodes also carries a well-documented reputation for subtle scheduling complexity.

RVSDG (Regionalized Value State Dependence Graph) encodes control flow via hierarchical region nodes, gamma nodes for conditionals and theta nodes for loops, which would in principle allow e-graph rewrites that cross loop boundaries. The Jlm compiler uses this approach. The obstacle for Cranelift is that converting a general CFG to RVSDG requires a full restructuring pass, not all programs have reducible control flow suitable for the transformation, and the migration cost would be a near-complete IR rewrite. The aegraph layers onto the existing CLIF SSA IR without touching the control flow representation.

MLIR’s pattern rewriting infrastructure runs rules to a local fixpoint within a region with topological ordering for termination, but does not maintain equivalence sets and cannot defer extraction.

The aegraph occupies a specific position: more expressive than GVN, more tractable than unrestricted equality saturation, deployable on an existing SSA IR without structural changes. That positioning is precise and intentional, not a compromise born of implementation convenience.

The Production Constraint Drives the Design

Wasmtime compiles Wasm functions before serving requests; some configurations compile on demand. Per-function compile time is user-visible. Wasm code typically arrives pre-optimized by upstream toolchains (clang, wasm-pack), so the rewrites Cranelift needs to do are IR-level and target-specific, not whole-program heroics. The bounded, linear behavior of the aegraph, work proportional to function size rather than an exponent of it, fits these requirements precisely.

The loop-carried values that fall outside the aegraph’s scope, induction variable strength reduction, memory alias analysis, dead store elimination, are handled by separate analysis passes that operate on the control flow graph. The aegraph handles pure value semantics; the effect skeleton handles everything that touches memory or control flow. The boundary is clean and the scope of each piece is explicit.

Equality saturation has been theoretically compelling for fifteen years. The egg library made it fast enough to be practically useful in research settings. What it took to deploy it in production was recognizing that the hard problems of cycles, extraction complexity, and dominance reasoning are not fundamental to the optimization task. They are artifacts of the general formulation. An SSA-based compiler’s def-use graph is already acyclic, and that property, when taken seriously, turns a hard problem into a tractable one.

Was this interesting?