· 7 min read ·

The Acyclic Constraint That Makes E-Graph Optimization Work in Production

Source: lobsters

The Phase-Ordering Problem and Why It Has Resisted Simple Fixes

Compiler writers have lived with phase-ordering constraints for decades. The problem is straightforward to state: optimization passes interact. Applying constant folding before inlining may enable more folding opportunities. Running GVN after LICM may reveal redundancies that were not visible before loop hoisting. There is no canonical ordering that is universally optimal, and getting it wrong means leaving code quality on the table.

LLVM’s answer has been engineering experience accumulated over twenty years: a manually tuned pass pipeline, with some passes running multiple times at different points in the sequence. InstCombine alone handles hundreds of peephole patterns and runs at multiple pipeline positions because the interactions are too complex to resolve once. This works well in a compiler where compile time is measured in seconds and the pipeline is stable. It scales poorly to a JIT compiler like Cranelift, where you need fast compilation and you want optimization behavior to be predictable and auditable.

Chris Fallin’s post on the acyclic e-graph describes Cranelift’s answer to this problem: a mid-end optimizer built on a restricted form of e-graphs, specifically designed to stay tractable in a production compiler.

What E-Graphs Bring to Compiler Optimization

E-graphs were originally a theorem-proving data structure, but their application to compiler optimization was articulated clearly by Tate et al. in their 2009 paper “Equality Saturation: A New Approach to Optimization”. The idea is elegant: rather than applying rewrites destructively, you maintain all equivalent forms of an expression simultaneously. An e-class holds a set of equivalent e-nodes; when a rewrite rule fires, the new expression is added to the same e-class rather than replacing the original. After all rules have been applied to fixpoint (saturation), you extract the cheapest expression from the graph using a cost model.

The phase-ordering problem dissolves because all equivalent expressions coexist. There is no decision about whether to run pass A before pass B, because neither pass destroys the other’s opportunities. The graph accumulates equivalences; extraction finds the globally cheapest representation.

The egg library (POPL 2021) made this approach practical in Rust and demonstrated it across program synthesis, floating-point optimization via Herbie, and tensor expression compilers. egg introduced e-class analyses — a way to attach computed properties to e-classes (such as “this e-class is always the constant 4”) — and a deferred rebuilding algorithm that makes union-find maintenance efficient enough for larger graphs.

Why Full Equality Saturation Does Not Drop Into a JIT

General equality saturation has three properties that make it difficult to use unmodified in a production compiler. First, programs with loops create cycles in their dataflow graphs. SSA phi nodes represent values that depend on earlier iterations, and a naive e-graph representing the full CFG would contain back-edges that complicate traversal, union-find, and extraction. Second, extraction from a fully saturated e-graph is NP-hard in general: finding the globally cheapest tree decomposition of a graph with shared subexpressions is a variant of the minimum-cost hypertree problem. Third, for a JIT compiler where compile time matters, running to full saturation on large functions is prohibitively expensive.

Cranelift’s aegraph (acyclic e-graph) solves all three problems with one structural restriction: the e-graph only represents pure, side-effect-free computations, scoped within the SSA dominance tree.

The Acyclic Restriction

Every SSA value in Cranelift IR has a well-defined definition point in the dominator tree. A value defined in block B is available to any block that B dominates. Pure computations — arithmetic, bitwise operations, comparisons, loads from immutable memory — can be freely moved and duplicated within this dominance scope without changing program semantics.

The aegraph contains exactly these pure computations. Instructions with side effects (stores, calls, traps, branches) remain as a “skeleton” of the original IR, anchored to their original positions in the CFG. The aegraph operates on the value side of the program; the skeleton handles the effect side. Because pure computations over SSA values form a DAG, values only depending on values defined earlier in the dominance order, cycles cannot arise.

This is not a hack around a limitation. SSA’s dominance structure is the natural scope for value equivalences. Two computations of (a + b) in blocks that dominate a common use point are genuinely equivalent in a way that justifies rewriting. The acyclic restriction maps exactly onto the semantics you care about.

What Falls Out for Free

Global value numbering (GVN) is normally a separate compiler pass. In the aegraph, it is emergent. Two e-nodes with the same operator and children in the same e-classes are structurally identical; the e-graph’s congruence closure merges them automatically into one e-class. There is no separate GVN pass, no separate data structure for tracking “this SSA value equals that one.” It is simply a property of e-graph construction.

Constant folding works the same way. E-class analyses can annotate each e-class with a constant value if one can be determined. When a constant is determined for an e-class, an e-node representing that constant is added to the class. Extraction then prefers the constant because it has zero cost at the use site. Again, no separate constant-folding pass; it is a side effect of how e-class analyses propagate information.

Common subexpression elimination (CSE) across the dominator scope falls out as well. Any two computations that the e-graph determines to be in the same e-class will be materialized to the same SSA value during extraction.

ISLE as the Unified Rule Language

Cranelift uses ISLE — a term-rewriting DSL — for both instruction selection and mid-end optimization rules. The same rule syntax that describes how to lower an iadd into an x86 ADD instruction also describes how to simplify x + 0 to x. An aegraph mid-end rule looks like this:

;; Additive identity
(rule (iadd ty x (iconst 0)) x)

;; Strength reduction: multiply by power of two
(rule (imul ty x (iconst 2)) (ishl ty x (iconst 1)))

;; Constant folding for addition
(rule (iadd ty (iconst a) (iconst b))
      (iconst (u64_add a b)))

The semantics differ from instruction selection: in instruction selection, a matching rule replaces its input; in the aegraph, a matching rule adds the result as a new equivalent in the same e-class. The rule itself is identical. The framework determines the interpretation.

This unification has engineering consequences. There is one rule language to learn, one compiler for that language, one testing infrastructure. Adding an optimization is writing a declarative rule, not locating the right place in a C++ pass and reasoning about interactions with other rewrites in the same function.

Bounded Rewriting and Compile-Time Budget

Cranelift does not run to full saturation. Saturation guarantees that no further rewrite can fire, which requires iterating until fixpoint; for a large function this can be slow and the marginal gains from later iterations diminish quickly. Instead, Cranelift applies a bounded set of rewrite rules in a single pass over the e-graph, then extracts. The depth of rewriting is controlled, and the extraction cost model ensures that the cheapest form is always chosen among whatever equivalences were discovered.

This is an explicit trade-off: full saturation would find more equivalences but cost more compile time. For a JIT compiler processing WebAssembly modules, the compile-time budget is tight, and a single structured pass that subsumes GVN, constant folding, and algebraic simplification already represents a meaningful improvement over the prior multi-pass approach.

Comparison to LLVM’s Approach

LLVM’s mid-end is a sequence of passes: InstCombine for peephole rewrites, GVN for value numbering, SCCP for sparse conditional constant propagation, EarlyCSE, and others. Many of these passes run at multiple points in the pipeline because their results enable other passes. InstCombine alone is tens of thousands of lines of C++ encoding hundreds of patterns with implicit ordering dependencies between them.

Cranelift replaces equivalent functionality with ISLE rules that are shorter, compositional, and structurally amenable to automated verification. Work on ISLE verification uses symbolic execution and SMT solving to check that rules are semantics-preserving at compiler-compile time, before the compiler is ever run on real code. This is not possible with hand-written C++ passes, where correctness depends on subtle invariants maintained across thousands of lines of code.

The other relevant comparison is to Sea of Nodes, the IR used in HotSpot and V8’s TurboFan, which unifies control and data flow into a single graph and enables powerful optimizations through that unification. Cranelift deliberately keeps control and data flow separate, with the aegraph handling only the data side. Sea of Nodes is powerful but makes the IR genuinely difficult to reason about; Cranelift’s separation keeps complexity bounded and the semantics of each side clear.

The Engineering Lesson

The acyclic e-graph is worth studying not just for its immediate application to Cranelift, but for what it demonstrates about principled engineering constraints. Full equality saturation is theoretically elegant and practically unwieldy. The acyclic restriction gives up completeness in exchange for tractability, but what it gives up is exactly the part that did not fit the problem domain: cyclic dataflow in loop bodies, where the semantics of rewriting are more complex anyway.

The result is a system where the phase-ordering problem is solved within its scope, where optimization rules are first-class declarative objects rather than imperative code, and where the formal verifiability of those rules is a structural property of how they are written. That combination is unusual in production compiler engineering, and it is a more interesting result than any individual benchmark. The broader lesson is one that comes up repeatedly in systems design: the most useful theoretical tools usually need a precisely chosen restriction to become engineering tools, and finding the right restriction requires understanding both the theory and the problem domain well enough to know what to give up.

Was this interesting?