
Three IRs, One WASM File: Looking Back at Zig's Self-Hosting Transition

Source: zig

The announcement came in December 2022, a few weeks after Zig 0.10.0 shipped: the self-hosted compiler was now the default, and the C++ implementation was on its way out. The original post was understated given what the transition represented. Looking back at those architectural decisions reveals a set of deliberate choices about compiler design, bootstrapping, and long-term maintainability that differ from how comparable language projects handled the same problem.

The original Zig compiler, written in C++, was never meant to be permanent. Andrew Kelley built it as a bootstrap tool: a way to get Zig working well enough to implement the real compiler in Zig itself. This is a standard pattern in language design. The C++ implementation, called “stage1” in Zig’s development vocabulary, served its purpose for six years before being retired. Stage1 had roughly 80,000 to 100,000 lines of C++ code, used LLVM as its only code generation backend, and carried known correctness issues the team had essentially stopped trying to fix. The replacement was coming; patching the C++ was not worth the effort.

The Three-Layer IR Pipeline

The self-hosted compiler introduced a clean three-stage intermediate representation pipeline that stage1 never had.

ZIR (Zig Intermediate Representation) is the first IR, produced directly from parsing. It is untyped and represents the structural content of source code without resolving types or evaluating comptime expressions. Generic functions and comptime blocks live at the ZIR level until instantiation.

AIR (Analyzed Intermediate Representation) is produced by semantic analysis of ZIR. By this stage, all comptime evaluation has been performed, generics have been monomorphized for their specific type arguments, and the representation is in SSA (Static Single Assignment) form. AIR is fully typed and concrete. This is the IR that backends consume.
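The ZIR/AIR split can be made concrete with a small illustrative function (the example is mine, not from the announcement; the comments describe the pipeline as sketched above):

```zig
const std = @import("std");

// The body of `max` is stored once, as untyped ZIR: `T` is just a
// comptime parameter and `a > b` has no concrete type yet.
fn max(comptime T: type, a: T, b: T) T {
    return if (a > b) a else b;
}

pub fn main() void {
    // Each distinct `T` triggers semantic analysis of the ZIR body,
    // producing a separate, fully typed AIR instantiation that a
    // backend can then lower.
    std.debug.print("{d}\n", .{max(i32, 3, 5)}); // AIR for max(i32, ...)
    std.debug.print("{d}\n", .{max(f64, 0.5, 1.5)}); // AIR for max(f64, ...)
}
```

Nothing about `max(i32, ...)` exists as typed code until the call site forces instantiation; that is the sense in which generics "live at the ZIR level".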

MIR (Machine Intermediate Representation) sits below AIR in the non-LLVM backends. When targeting x86-64 or WebAssembly without going through LLVM, the compiler lowers AIR to a backend-specific MIR before emitting machine code or bytecode.

The separation between these layers matters in ways that are not immediately obvious. In stage1, the pipeline was murkier: code went through LLVM IR, which meant LLVM was deeply entangled in the compilation process even for debug builds. Incremental compilation, where only the parts of a program that changed get re-analyzed and re-compiled, was essentially impossible to implement cleanly in stage1’s architecture. The stage2 design, with explicit dependency tracking between declarations at the ZIR and AIR levels, is what makes incremental compilation architecturally feasible. The feature was not fully shipped in 0.10.0, but the foundation was there in a way it never was in C++.

Multiple Backends and LLVM Bypass

Stage1 had one backend: LLVM. Every compilation went through LLVM’s optimization and code generation pipeline, even debug builds that do not need optimization. LLVM is a powerful but heavyweight dependency. Running it for a quick zig build-exe during development adds significant latency that compounds across every save-compile cycle.

The self-hosted compiler ships with multiple backends:

  • The LLVM backend is retained for optimized release builds, where LLVM’s optimization passes justify the overhead.
  • The x86-64 machine code backend generates native x86-64 code directly, bypassing LLVM entirely. This backend is used for debug builds, where most iterative compilation happens during development.
  • The WebAssembly backend generates WASM bytecode directly without LLVM.
  • The aarch64 backend targets ARM64 natively.
  • The C backend generates C source code from AIR. Zig can target platforms where no other backend is available by generating C and relying on whatever C compiler is present.
  • A SPIR-V backend targets GPU shader compilation for Vulkan workloads.

The performance difference from bypassing LLVM for debug builds is substantial. Andrew Kelley’s mid-2022 post, “Self-Hosted Compiler Now Outperforms C++ Compiler”, documented both compilation speed improvements and roughly a 60% reduction in peak memory use in debug build scenarios. The gains come not just from architectural efficiency but from eliminating LLVM’s overhead in the hot path for everyday development. The design is also more honest about what debug builds need: correctness and speed, not the optimizations that matter for release.

The Bootstrap Chain

The bootstrap story is where Zig’s approach diverges most interestingly from prior self-hosting efforts.

When Rust self-hosted, it adopted a model where a pre-compiled binary of a specific rustc version (called “stage0”) is used to compile the current source. That binary is platform-specific and must be downloaded for each target architecture. Building Rust from source means fetching a platform-appropriate binary seed, which you are expected to trust.

Go took a similar approach. Go 1.5 introduced the self-hosted compiler, but building from source required Go 1.4, the last C-implemented version. For years, a Go 1.4 binary was a required dependency for building Go itself. More recent Go releases relaxed this by allowing any recent Go binary to serve as the bootstrap compiler, but the model still requires a platform-specific binary.

Zig’s approach uses a single WebAssembly binary called zig1.wasm. The bootstrap sequence works like this:

  1. A small C program is compiled with any available C compiler: GCC, Clang, MSVC, or anything else.
  2. That C program links in a minimal WASM interpreter (roughly 2,000 lines of C, bundled in the source tree) and uses it to execute zig1.wasm.
  3. zig1.wasm is an older version of the self-hosted compiler, frozen in WASM bytecode. It compiles the current Zig compiler source into a native binary.
  4. That native binary compiles itself once more, producing the final release binary.

The critical advantage is that zig1.wasm is architecture-neutral. One WASM file bootstraps on x86-64, aarch64, RISC-V, or any other platform that can run the bundled C interpreter. Rust and Go require per-platform seeds; Zig requires only a C compiler and a C standard library, both of which are available essentially everywhere. For reproducible builds and supply chain security, this is a meaningful structural difference: the trusted binary in the bootstrap chain is a single auditable WASM file rather than a matrix of platform-specific executables.

The WASM interpreter used in the bootstrap is intentionally minimal. Its small size makes it tractable to inspect for anyone concerned about what the bootstrap chain does. This addresses, at least partially, the “trusting trust” problem that Ken Thompson identified in 1984: some trusted binary must start the chain, but the smaller and more inspectable that binary is, the more confidence you can have in it. A 2,000-line WASM interpreter is more auditable than a multi-megabyte platform-specific compiler binary.
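To give a sense of why a small interpreter is tractable to audit, here is a toy stack-machine interpreter in Zig. It is emphatically not the bootstrap interpreter (which is C and executes real WASM); it only sketches the shape of the thing being trusted, with opcodes I invented for illustration:

```zig
const std = @import("std");

// Toy opcodes, not WASM. A real interpreter adds more instructions
// and bounds checks, but the core loop looks much like this.
const Op = enum(u8) { push, add, mul, halt };

fn run(code: []const u8) i64 {
    var stack: [64]i64 = undefined;
    var sp: usize = 0; // stack pointer
    var pc: usize = 0; // program counter
    while (true) {
        const op: Op = @enumFromInt(code[pc]);
        pc += 1;
        switch (op) {
            .push => {
                // Next byte is an immediate operand.
                stack[sp] = code[pc];
                sp += 1;
                pc += 1;
            },
            .add => {
                sp -= 1;
                stack[sp - 1] += stack[sp];
            },
            .mul => {
                sp -= 1;
                stack[sp - 1] *= stack[sp];
            },
            .halt => return stack[sp - 1],
        }
    }
}

pub fn main() void {
    // Computes (2 + 3) * 4.
    const prog = [_]u8{
        @intFromEnum(Op.push), 2,
        @intFromEnum(Op.push), 3,
        @intFromEnum(Op.add),
        @intFromEnum(Op.push), 4,
        @intFromEnum(Op.mul),
        @intFromEnum(Op.halt),
    };
    std.debug.print("{d}\n", .{run(&prog)});
}
```

An interpreter of this style has no code generation and no platform-specific output; every behavior is a visible case in one dispatch loop, which is what makes line-by-line inspection feasible.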

There was earlier discussion in the Zig community about using the C backend for bootstrapping instead: compile the Zig compiler to C, then compile that C with any C compiler. The WASM approach won out because WASM is a more controlled, sandboxed representation than C, and the WASM interpreter is smaller and more predictable than the full C compilation pipeline.

What the Removal of C++ Actually Changed

Beyond performance and architecture, the retirement of stage1 changed the contributor experience in a concrete way. Anyone working on the Zig compiler previously needed to be comfortable with C++ and the LLVM codebase. With the self-hosted compiler, all compiler work happens in Zig. Contributors can read, modify, and debug the compiler using the same language and tooling they use for everything else in the ecosystem.

Known correctness bugs in stage1 had been left unfixed for years because patching them in C++ was not worth the investment when the replacement was underway. Those bugs do not exist in stage2, which was designed without inheriting stage1’s accumulated compromises. The clean IR pipeline also enabled better error messages: precise source tracking in AIR makes it possible to attach notes to errors that point at specific declarations, something stage1 could not do reliably.

The December 2022 post marking the end of the C++ era was a checkpoint, not a completion. Incremental compilation, concurrent compilation, and further backend improvements were still in progress. But the architectural prerequisites were finally in place, and the C++ implementation that had served as scaffolding for six years was no longer in the critical path. Zig 0.11.0, released in 2023, removed the stage1 code entirely.

The transition illustrates a general principle in compiler engineering: the compiler you write to bootstrap a language and the compiler you want to maintain long-term are almost never the same thing. The gap between them is the work. Zig spent six years closing that gap, and the result is a compiler architecture that treats multiple backends, portable bootstrapping, and incremental compilation as first-class requirements rather than afterthoughts.
