From sh to ELF: Why a C Compiler Written in Pure Shell Matters More Than It Looks

Source: lobsters

The gist by alganet surfaced on Lobsters with a note that the site has tags for every conceivable language but nothing for shell. The footnote is easy to skip; the project behind it is not.

The description is brief: a standalone C89/ELF64 compiler in pure portable shell. No external compiler required. No intermediate assembly step. Just sh, printf, and the ELF specification.

The ELF64 format is more approachable than its reputation

An ELF64 executable is a structured sequence of bytes. The first 64 bytes are the ELF header: four magic bytes (\x7fELF), a class byte for 64-bit (the value 2), endianness, version, OS/ABI, binary type, machine architecture (0x3e for x86-64), and the virtual address of the entry point. After the header come program headers, each 56 bytes, describing how the kernel maps segments into memory. Then the actual code and data.

The System V ABI, through its generic ELF specification and the AMD64 supplement, specifies every field. Brian Raiter demonstrated years ago, in his work on tiny ELF executables, that you can construct a valid Linux ELF binary by hand, byte by byte. A minimal program that calls exit(0) via a Linux syscall fits under 200 bytes once you strip everything non-essential. Shell’s printf can emit arbitrary bytes using \NNN octal escapes; the \xNN hex form works in many shells but is not specified by POSIX, so portable code sticks to octal. Nothing here requires a compiled tool.
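As a rough illustration of the layout described above, the following sketch writes a complete exit(0) executable using nothing but printf and shell arithmetic. It is not taken from the gist: the helper names (emit_byte, emit_le, write_exit0_elf) and the 0x400000 load address are assumptions chosen for the example, and the result is only meaningful on x86-64 Linux.

```shell
#!/bin/sh
# Sketch only: helper names and the load address are illustrative assumptions.

# Emit one byte, given as a decimal value 0..255, as a 3-digit octal escape.
emit_byte() {
    printf "\\$(( $1 / 64 ))$(( ($1 / 8) % 8 ))$(( $1 % 8 ))"
}

# emit_le VALUE NBYTES: emit VALUE as NBYTES little-endian bytes.
emit_le() {
    le_v=$1 le_n=$2
    while [ "$le_n" -gt 0 ]; do
        emit_byte $(( $le_v & 255 ))
        le_v=$(( $le_v >> 8 ))
        le_n=$(( $le_n - 1 ))
    done
}

write_exit0_elf() {
    base=$(( 0x400000 ))                  # assumed virtual load address
    code_size=9                           # bytes of machine code below
    file_size=$(( 64 + 56 + code_size ))  # header + one program header + code
    entry=$(( base + 64 + 56 ))           # code starts right after the headers

    # ELF header, 64 bytes
    printf '\177ELF'          # magic: 0x7f 'E' 'L' 'F'
    emit_byte 2               # EI_CLASS: 64-bit
    emit_byte 1               # EI_DATA: little-endian
    emit_byte 1               # EI_VERSION
    emit_byte 0               # EI_OSABI: System V
    emit_le 0 8               # ABI version and padding
    emit_le 2 2               # e_type: ET_EXEC
    emit_le 62 2              # e_machine: 0x3e, x86-64
    emit_le 1 4               # e_version
    emit_le "$entry" 8        # e_entry
    emit_le 64 8              # e_phoff: program headers follow the ELF header
    emit_le 0 8               # e_shoff: no section headers
    emit_le 0 4               # e_flags
    emit_le 64 2              # e_ehsize
    emit_le 56 2              # e_phentsize
    emit_le 1 2               # e_phnum
    emit_le 64 2              # e_shentsize
    emit_le 0 2               # e_shnum
    emit_le 0 2               # e_shstrndx

    # Program header, 56 bytes: map the whole file read+execute
    emit_le 1 4               # p_type: PT_LOAD
    emit_le 5 4               # p_flags: PF_R | PF_X
    emit_le 0 8               # p_offset
    emit_le "$base" 8         # p_vaddr
    emit_le "$base" 8         # p_paddr
    emit_le "$file_size" 8    # p_filesz
    emit_le "$file_size" 8    # p_memsz
    emit_le 4096 8            # p_align

    # Machine code, 9 bytes
    printf '\270\074\000\000\000'   # b8 3c 00 00 00  mov eax, 60 (SYS_exit)
    printf '\061\377'               # 31 ff           xor edi, edi
    printf '\017\005'               # 0f 05           syscall
}
```

On an x86-64 Linux machine, something like `write_exit0_elf > exit0 && chmod +x exit0 && ./exit0` should produce a 129-byte file that exits immediately with status 0.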

What POSIX sh can carry

Shell is typically treated as a glue language: run this, pipe it there, check the exit code. The actual capability set is broader. POSIX sh provides arithmetic via $(( )), string manipulation via parameter expansion, and printf for formatted output including raw bytes. It can call POSIX-mandated utilities as subprocesses.
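A small sketch of those three primitives, with no external commands involved. The variable names are invented for the example.

```shell
#!/bin/sh
# Arithmetic expansion: integer maths with hex constants and bitwise operators.
# Here: round a 129-byte file size up to the next 4096-byte page boundary.
aligned=$(( (129 + 4095) & ~4095 ))

# Parameter expansion: take a string apart without sed, cut, or awk.
decl="int counter"
type_name=${decl%% *}    # remove the longest suffix matching " *"
ident=${decl#* }         # remove the shortest prefix matching "* "

# printf: octal escapes in the format string become raw bytes.
greeting=$(printf '\110\151')   # bytes 0x48 0x69

printf '%s %s %s %s\n' "$aligned" "$type_name" "$ident" "$greeting"
```

Run as a script, this prints `4096 int counter Hi`.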

Awk alone adds associative arrays, field splitting, regex matching, and arithmetic. Sed provides streaming regex substitution. A shell program that delegates to these tools is still “pure shell” in the meaningful sense: it requires nothing beyond what every conforming POSIX system already ships. No package manager, no compiler, no build system.

Writing a lexer in shell is repetitive but tractable. You process a C source file character by character or line by line, matching patterns against tokens and accumulating them into strings. A recursive descent parser is harder, because shell functions cannot return structured data directly; you end up using global variables to pass results up the call stack, which is messy but workable for a bounded grammar like C89. Code generation is, at its core, integer arithmetic: each x86-64 instruction has a specific byte-level encoding defined by the Intel Software Developer’s Manual, and emitting that instruction means computing the right byte sequence and writing it with printf.
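To give a feel for the character-by-character approach, here is a minimal tokenizer sketch. It is not the gist's lexer: the function name, the token kinds, and the tiny keyword list are assumptions, and it ignores string literals, comments, and multi-character operators such as == or <=.

```shell
#!/bin/sh
# tokenize SOURCE: print one token per line as "KIND text".
# Sketch only; token kinds and the keyword list are illustrative.
tokenize() {
    rest=$1
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}              # first character of $rest
        case $c in
            [0-9])
                # Integer literal: consume digits one at a time.
                tok=
                while :; do
                    case $rest in
                        [0-9]*) tok=$tok${rest%"${rest#?}"}; rest=${rest#?} ;;
                        *) break ;;
                    esac
                done
                printf 'NUM %s\n' "$tok" ;;
            [A-Za-z_])
                # Identifier or keyword: letters, digits, underscores.
                tok=
                while :; do
                    case $rest in
                        [A-Za-z0-9_]*) tok=$tok${rest%"${rest#?}"}; rest=${rest#?} ;;
                        *) break ;;
                    esac
                done
                case $tok in
                    int|return|if|else|while) printf 'KW %s\n' "$tok" ;;
                    *) printf 'ID %s\n' "$tok" ;;
                esac ;;
            [[:space:]])
                rest=${rest#?} ;;          # skip whitespace
            *)
                # Anything else is treated as single-character punctuation.
                printf 'PUNCT %s\n' "$c"
                rest=${rest#?} ;;
        esac
    done
}
```

Calling `tokenize 'int x = 42;'` prints five lines: `KW int`, `ID x`, `PUNCT =`, `NUM 42`, and `PUNCT ;`. Every character costs several expansions, which is a large part of why this style is slow on big inputs.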

C89 as a bootstrap target

The choice of C89 over a later standard is deliberate in any project like this. C89 is the simplest standardized version of C with broad practical utility. It lacks variadic macros (added in C99), designated initializers, variable-length arrays, _Generic, _Atomic, and the rest of the machinery that modern C carries. The grammar is well-documented, has been formally analyzed for decades, and has a large body of existing implementations to compare against.

C89 is also historically foundational. The Unix kernel, first written in assembly, was rewritten in C in the early 1970s. Dennis Ritchie’s original compiler accepted a pre-standard dialect that C89 later formalized, adding features such as function prototypes along the way. The GNU toolchain, through most of its formative years, targeted C89 or close to it. A compiler that handles C89 can, in principle, compile a meaningful portion of foundational Unix software.

For bootstrapping purposes, the required subset is even narrower: arithmetic, pointers, control flow, functions, structs, and the ability to call OS syscalls. A compiler targeting that subset does not need a full preprocessor or a complete standard library. It needs to produce correct code for the constructs it accepts, and that is enough to bootstrap something more capable.

The bootstrapping lineage

This project belongs to a tradition that starts with Ken Thompson’s Turing Award lecture, “Reflections on Trusting Trust,” published in 1984. Thompson showed that you cannot trust a compiler binary by reading its source code, because the binary that compiled that source could itself have been modified to insert malicious behavior invisibly. The attack is self-reproducing: once the compiler carries the taint, compiling the clean source produces a tainted binary regardless. The most direct defense is to minimize and audit the bootstrap path.

The Reproducible Builds movement takes this problem seriously, and within it, the bootstrappable builds effort focuses specifically on minimizing the trusted binary seed. Projects like stage0, by Jeremiah Orians, start from a few hundred bytes of machine code you can verify by reading hex dumps directly, then build progressively more capable assemblers and compilers on top. GNU Mes provides a Scheme interpreter and MesCC, a C compiler written in Scheme, designed to bootstrap the GNU toolchain from a seed small enough to audit. The live-bootstrap project takes this further, attempting to build a complete Linux system from source with a fully auditable chain.

All of these approaches require you to trust something at the binary level, even if that something is very small. A shell-based C compiler changes the equation: if you trust the POSIX sh interpreter already installed on the machine, whose source is public and shipped on virtually every Unix system, then the path from human-readable text to a working C compiler becomes direct. You do not need to introduce a new binary seed. You need to trust that the sh you are running matches its published source, which is a significantly easier claim to verify.

What the compilation loop looks like concretely

Consider what a shell-based C compiler does when translating a simple function. The lexer tokenizes the source into keywords, identifiers, integer literals, string literals, and punctuation, stored as shell variables. The parser consumes tokens using recursive descent, building a representation of the program. The code generator walks that representation and emits x86-64 machine code.
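The parsing stage can be sketched for a toy expression grammar. Because POSIX sh functions cannot return structured values and have no local variables, this version keeps all state in globals and emits a postfix instruction list as it goes, which avoids having to save partial results across recursive calls. The function names, the PUSH/ADD/SUB/MUL instruction names, and the requirement that tokens arrive separated by single spaces are all assumptions made for the example.

```shell
#!/bin/sh
# Globals shared by every parser function:
#   toks  remaining tokens, separated by single spaces
#   tok   current token (empty at end of input)
#   code  postfix instruction list built so far

advance() {
    case $toks in
        *' '*) tok=${toks%% *}; toks=${toks#* } ;;
        *)     tok=$toks; toks= ;;
    esac
}

emit() { code="$code${code:+ }$1"; }

# factor := NUMBER | '(' expr ')'
parse_factor() {
    case $tok in
        '(')
            advance
            parse_expr || return 1
            [ "$tok" = ')' ] || { echo "expected )" >&2; return 1; }
            advance ;;
        *[!0-9]*|'')
            echo "unexpected token: $tok" >&2
            return 1 ;;
        *)
            emit "PUSH:$tok"
            advance ;;
    esac
}

# term := factor { '*' factor }
parse_term() {
    parse_factor || return 1
    while [ "$tok" = '*' ]; do
        advance
        parse_factor || return 1
        emit MUL
    done
}

# expr := term { ('+' | '-') term }
parse_expr() {
    parse_term || return 1
    while :; do
        # Branch on the operator before recursing: a saved "op" variable
        # would be global and could be overwritten by a nested call.
        case $tok in
            +) advance; parse_term || return 1; emit ADD ;;
            -) advance; parse_term || return 1; emit SUB ;;
            *) return 0 ;;
        esac
    done
}

# parse TOKENS: succeeds if the whole input is one expression; result in $code.
parse() {
    toks=$1 code=
    advance
    parse_expr && [ -z "$tok" ]
}
```

After `parse '1 + 2 * 3'`, the global `code` holds `PUSH:1 PUSH:2 PUSH:3 MUL ADD`; a code generator could walk that list and emit the matching machine instructions.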

A function that adds two integers under the System V AMD64 calling convention receives its arguments in rdi and rsi, places the return value in rax, and must preserve the callee-saved registers (rbx, rbp, and r12 through r15) across the call. The code generator knows these rules and emits the corresponding instruction bytes. For the instruction mov rax, rdi, the encoding is 48 89 f8 in hex: a REX.W prefix, the opcode, and a ModRM byte. The generator computes those values and writes them with printf. Because POSIX printf specifies octal escapes but not \xNN, the portable form is printf '\110\211\370' rather than printf '\x48\x89\xf8'; either way, three bytes land in the output file.
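The byte computation can be sketched as follows. This is an illustration rather than the gist's code generator: the helper names are invented, only the eight low registers are handled (so no REX.R or REX.B bits are needed), and the example function takes 64-bit operands to match the rdi, rsi, and rax registers named above.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, registers limited to rax..rdi.

# Emit one byte, given as a decimal value 0..255, as a 3-digit octal escape.
emit_byte() {
    printf "\\$(( $1 / 64 ))$(( ($1 / 8) % 8 ))$(( $1 % 8 ))"
}

# Hardware register numbers used in the ModRM byte.
RAX=0 RCX=1 RDX=2 RBX=3 RSP=4 RBP=5 RSI=6 RDI=7

# emit_rr OPCODE DST SRC: encode "op dst, src" for instructions of the form
# REX.W + opcode /r, where ModRM.reg holds src and ModRM.rm holds dst.
emit_rr() {
    emit_byte $(( 0x48 ))                     # REX.W: 64-bit operand size
    emit_byte "$1"                            # opcode
    emit_byte $(( 0xC0 | ($3 << 3) | $2 ))    # mod=11 (register direct)
}

emit_mov() { emit_rr $(( 0x89 )) "$1" "$2"; }   # mov r/m64, r64
emit_add() { emit_rr $(( 0x01 )) "$1" "$2"; }   # add r/m64, r64
emit_ret() { emit_byte $(( 0xC3 )); }           # ret

# Body of: long add(long a, long b) { return a + b; }
emit_add_function() {
    emit_mov "$RAX" "$RDI"    # 48 89 f8  mov rax, rdi
    emit_add "$RAX" "$RSI"    # 48 01 f0  add rax, rsi
    emit_ret                  # c3        ret
}
```

Piping `emit_add_function` through `od -An -tx1` shows the seven bytes 48 89 f8 48 01 f0 c3.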

The ELF wrapper provides the header, the program header table pointing at the code segment, and the entry point address. The result is a binary the Linux kernel can load and execute without any further tooling. The shell has performed what an assembler and linker would do separately, without either of them.

Practical scope and honest limits

Compilation speed will not be competitive. Shell processes text slowly at scale. A moderately large C file compiled through a shell-based compiler will take seconds or more, compared to milliseconds in tcc or gcc. For bootstrapping, this is not a problem. You compile once, slowly, using the shell compiler to produce a binary that you then use for all subsequent compilation. The shell compiler is the means to get the first compiler, not the daily driver.

Fabrice Bellard’s tcc serves a similar role in many bootstrapping discussions: a complete C compiler, small enough to audit, capable of compiling itself, and fast enough to use in production. tcc is excellent. It is also a C binary. Getting tcc requires a C compiler. The shell compiler breaks that circularity without introducing a new opaque binary into the trust chain.

The question of how complete c89cc.sh is, whether it covers all of C89 or a well-chosen useful subset, is secondary to the approach. A shell compiler that handles expressions, control flow, functions, structs, and direct syscalls is sufficient to bootstrap something more capable. Completeness can grow from there.

The tag taxonomy problem

The Lobsters comment about missing shell tags is a small data point about how the field categorizes work. Shell is for gluing. Compilers are for real languages. A compiler written in shell disrupts that categorization, and the discomfort is informative.

The mental model that shell cannot be the substrate for serious tools is a habit, not a constraint. POSIX sh has been Turing-complete for as long as it has existed. The real barrier is that writing complex logic in shell is verbose and slow, not that it is impossible. Projects that ignore the habit and work within the constraint tend to produce things that are worth understanding, even when, and especially when, the result is impractical at scale.

A standalone C89 compiler in pure portable shell that emits real ELF64 binaries is one of those projects.
