· 5 min read ·

Two Papers That Make Compiler Construction Feel Possible

Source: hackernews

The standard advice for learning to write a compiler is to read the Dragon Book. Aho, Lam, Sethi, and Ullman’s “Compilers: Principles, Techniques, and Tools” is comprehensive, authoritative, and nearly 1000 pages long. Most people who start it don’t build a compiler. They read about lexers, then about parsing theory, then about intermediate representations, and somewhere around chapter six they realize they haven’t written a single line of code that compiles anything.

James Hague’s 2008 blog post points toward a different path: two academic papers that together give you enough to build a real compiler. The post keeps circulating on Hacker News, most recently surfacing again this week with over 500 points and a comment thread full of people either confirming the advice or discovering it for the first time. The reason it keeps coming back is that the advice is still correct.

The central paper is Abdulaziz Ghuloum’s “An Incremental Approach to Compiler Construction”, published at the 2006 Scheme Workshop. It is 26 pages. You can read it in an afternoon. By the end of that afternoon, you will understand how to write a compiler that targets x86 assembly from a subset of Scheme, with working tests at every step.

Starting From Zero and Meaning It

Ghuloum’s approach is to begin with the smallest possible subset of a language that can be compiled to something executable, and then add one feature at a time. Each step produces a working compiler. Each step is tested before moving to the next.

The first step is compiling integer literals. The compiler takes a source expression like 42 and emits this:

.globl scheme_entry
scheme_entry:
    movl $168, %eax    ; 42 shifted left by 2, tagged as fixnum
    ret

That’s it. You call this from a small C driver, check the return value, and verify that your compiler produced correct output. You have a working compiler after step one.

The fixnum encoding is worth pausing on. Ghuloum uses a tagged representation where integers are shifted left by two bits, leaving the low bits for type tags. Booleans use a different tag. Characters use another. This representation is established in the first step so that every subsequent feature integrates cleanly with the ones before it. The design isn’t incidental; it’s what makes the whole incremental approach work without requiring rewrites as complexity grows.

Step two adds booleans: #t and #f compile to constants with a specific tag pattern. Step three adds characters. Step four adds (add1 x) and basic arithmetic. Each step requires understanding exactly one new concept, and you verify correctness with a test suite that grows alongside the compiler. By the end of the paper, you have compiled closures, tail calls, and heap-allocated data structures, without ever staring at a 1000-page textbook wondering where to start.

What the Dragon Book Is Actually For

This is not an argument that the Dragon Book is bad. It’s an argument about what it’s for. The Dragon Book is a reference for compiler engineers who already understand what compilers do. It covers parsing algorithms exhaustively because parsing is genuinely hard and there are decades of research worth knowing. It covers optimization theory because production compilers need to generate fast code and the theory is non-trivial.

None of that is where you should start. The Dragon Book’s structure assumes you want to understand compiling before you want to do it. Ghuloum’s structure assumes the opposite: get something working, then understand why it works. The difference in outcomes is significant. People who follow the incremental approach tend to finish. People who start with comprehensive theory often don’t.

This is a pattern that shows up across systems programming education. The resources that produce the most actual learners tend to be the ones that give you something to run on page one.

The Nanopass Complement

The second paper that rounds out Hague’s recommendation is Dipanwita Sarkar, Oscar Waddell, and R. Kent Dybvig’s “A Nanopass Infrastructure for Compiler Education”. Where Ghuloum’s approach adds language features incrementally, the nanopass framework adds a structural discipline: each compiler pass should do exactly one thing.

A traditional compiler pass for closure conversion might handle both identifying free variables and transforming the program representation in a single step. A nanopass compiler separates these into distinct passes, each small enough to verify in isolation. The framework provides a Scheme-based DSL for defining the input and output language of each pass:

;; Identify free variables as a separate nanopass
(define-pass identify-free-variables : L1 (e) -> L2 (e)
  (Expr : Expr (e) -> Expr ()
    [(lambda (,x* ...) ,[body])
     (let ([free (find-free-vars body x*)])
       `(lambda/free (,x* ...) ,free ,body))]))

The explicit language annotations make it clear exactly what each pass expects to receive and what it promises to produce. This matters both during implementation, when you’re trying to isolate a bug, and during maintenance, when you need to understand what a pass from six months ago was doing.

The two approaches are complementary in the right way. Ghuloum gives you the incremental feature-by-feature methodology that keeps you shipping working code throughout the process. The nanopass framework gives you a structural discipline that makes the resulting compiler maintainable and verifiable once you’ve got it running.

The Living Lineage

Ghuloum’s paper spawned a recognizable genre of compiler education projects. Nora Sandler’s “Writing a C Compiler” applies the same step-by-step methodology to C targeting real platforms, covering the full range from tokenization through register allocation. Each chapter adds one language feature, with a test suite at every stage. The book’s structure is Ghuloum’s methodology applied to a language most working developers know and care about.

There are GitHub repositories implementing Ghuloum’s specific steps in Python, Rust, Haskell, and OCaml. The methodology transfers cleanly because the core insight isn’t Scheme-specific or x86-specific; it’s about the value of a working artifact at every step of construction.

Nada Amin’s research extends the incremental approach into type-theoretic territory, building certified compilers where each step comes with a proof of correctness. The core methodology is still recognizable: start small, add features, verify continuously.

There’s also an obvious connection to how production compilers are actually built today. LLVM’s Kaleidoscope tutorial uses the same structure: chapter one gives you a working lexer, chapter two adds a parser, each chapter builds on what runs before. Cranelift, the code generator used in Wasmtime and Firefox’s WebAssembly engine, was developed with similar staged thinking. The incremental approach isn’t just pedagogy; it’s how you manage complexity in real systems.

Why This Keeps Surfacing

The Hacker News thread for this post has run several times over the years, and it consistently produces the same shape of discussion: experienced compiler engineers confirming the paper is exactly right, and developers who’ve been intimidated by compiler construction saying this is the first framing that made it feel approachable.

That pattern is telling. Compilers are often taught as if the goal is to produce compiler researchers, when most people who want to understand compilation want to understand it well enough to build something or reason about what their toolchain is doing. The papers Hague recommends are calibrated for the second goal. They assume you want to ship a working compiler, not pass a comprehensive exam on parsing theory.

The advice from 2008 doesn’t need updating. The papers are still the right starting point. The incremental method still works. If you’ve been meaning to understand how compilers work, the place to start is a function that takes the integer 42 and emits movl $168, %eax; ret. Everything else follows from there.

Was this interesting?