· 6 min read ·

From Shell to ELF: The Bootstrap Case for a C Compiler in Pure POSIX sh

Source: lobsters

Alexandre Gaigalas’s c89cc.sh is a standalone C89 compiler written entirely in portable POSIX shell. It takes a C source file and emits a working ELF64 Linux executable directly, with no gcc, no assembler, no linker, and no binary build tools of any kind. The only runtime requirement is /bin/sh and printf.

The Lobsters thread where it surfaced asked, half in jest, why the site has no shell tag. The better question is why this kind of project keeps appearing, and what it reveals about what a compiler actually needs to be.

Why ELF64 is simpler than people assume

The key insight that makes shell-based binary emission practical is that the ELF64 format, for a statically-linked executable, is surprisingly sparse. A working Linux binary needs exactly two structures before the code: an ELF header of 64 bytes and at least one PT_LOAD program header of 56 bytes. The kernel reads the program headers to know where to map the file into memory, then jumps to the entry point. Section headers (.text, .data, .bss) are consumed by linkers and debuggers but are not required by the kernel. A compiler can set e_shoff=0 and e_shnum=0 and produce a binary that runs fine.

All of that is writable with POSIX printf:

# ELF magic, class=64-bit, data=little-endian, version=1, OSABI=SystemV
printf '\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
# e_type=ET_EXEC, e_machine=EM_X86_64, e_version=1
printf '\x02\x00\x3e\x00\x01\x00\x00\x00'

Multi-byte little-endian values (addresses, sizes, offsets) are computed with shell arithmetic and emitted byte by byte:

emit_u32_le() {
    v=$1
    printf "\\x$(printf '%02x' $((v & 0xff)))"
    printf "\\x$(printf '%02x' $((v >> 8 & 0xff)))"
    printf "\\x$(printf '%02x' $((v >> 16 & 0xff)))"
    printf "\\x$(printf '%02x' $((v >> 24 & 0xff)))"
}

The entry point is placed at virtual address 0x400000 + 64 + 56 = 0x400078. From there, the generated code runs directly with no libc, no startup wrapper, no dynamic linker. A minimal exit(0) is three x86-64 instructions: mov eax, 60 (the sys_exit syscall number), xor edi, edi, syscall. Nine bytes of machine code constitute a complete program body.

Writing a compiler in a hostile language

The genuine challenge in c89cc.sh is not binary emission. It is implementing a lexer, a recursive-descent parser, and a code generator in a language with no arrays, no structured data types, no way to return values from functions except through global variables, and character-by-character string processing that costs O(n) per step via ${var#?} substring expansion.

POSIX sh functions return exit codes (integers 0-255), not values. Every parser rule that needs to produce a node type, a stack offset, or a type annotation must communicate through a global variable. A grammar rule for additive expressions ends up looking roughly like:

parse_addexpr() {
    parse_mulexpr
    while [ "$TOK" = '+' ] || [ "$TOK" = '-' ]; do
        OP=$TOK
        next_token
        parse_mulexpr
        case $OP in
            '+') emit '\x48\x01\xD8' ;;  # add rax, rbx
            '-') emit '\x48\x29\xD8' ;;  # sub rax, rbx
        esac
    done
}

The global TOK holds the current token. Code is emitted directly during parsing, which means there is no AST and no intermediate representation. This is a single-pass compiler in the tradition of Fabrice Bellard’s TCC and its earlier ancestor otcc, both of which also avoid constructing an AST in order to minimize code size and complexity.

The symbol table, without arrays, is typically implemented as eval-based named variables:

sym_define() { eval "SYM_${1}=${2}"; }
sym_lookup() { eval "echo \"\$SYM_${1}\""; }

This works correctly for well-formed C identifiers. The trickier problem is backpatching: when the compiler emits a conditional jump, the target address is not yet known. A traditional compiler records a pointer to the patch site in an AST node. Without an AST, the shell script must buffer code in a variable and perform string surgery once the target offset is known, or make two passes. Given the cost of variable-held binary strings, two passes over the source is more practical for anything involving forward jumps.

Why C89 specifically

C89 (ANSI C, ratified 1989) is the right choice for a minimal bootstrap compiler for several reasons that compound each other.

Every C compiler that exists accepts C89 as a baseline, which means a program that c89cc.sh can compile can be compiled by anything else in the ecosystem. The grammar is small, roughly 77 production rules in Annex A of the standard, compared to C11’s substantially larger surface. Declarations must appear at the top of blocks, which eliminates the mixed-declarations-and-code parsing complexity introduced in C99. There are no // line comments, no variable-length arrays, no long long (officially), no _Generic, no _Atomic.

For a bootstrap-stage compiler, the practical subset is even smaller: integer types, pointers, arrays, functions, arithmetic, conditionals, loops, and return. Floating point, structs, switch, typedef, variadic functions, and most of the preprocessor can be deferred to later stages. What remains is enough to write a C compiler, which is exactly the bar that matters.

The bootstrapping problem

Ken Thompson made the definitive statement of the underlying issue in his 1984 Turing Award lecture, “Reflections on Trusting Trust”. A C compiler binary can contain a backdoor that persists even after the source code is audited and cleaned, because the backdoor lives in the compiled binary of the compiler itself, not in the source. The compiler reinserts the backdoor every time it compiles itself. The only defense is an independent bootstrapping chain that does not pass through the suspect binary.

The Bootstrappable Builds community has been constructing such chains for years. The current standard approach, developed by Jeremiah Orians and others in the stage0 project, starts from a hex0 binary of roughly 357 bytes. That binary reads ASCII hex digit pairs and emits the corresponding bytes. It is small enough that a determined person can verify it instruction by instruction against the x86-64 reference manual. From hex0, you build hex1, then a macro assembler, then M2-Planet (a C-like language compiler), then GNU Mes, then TCC, then GCC. GNU Guix uses a variant of this chain to bootstrap its entire package set from a sub-kilobyte auditable seed.

A shell-based C compiler changes the geometry of that chain. If c89cc.sh can compile enough of C to build TCC, then the trusted seed for a Unix system could be /bin/sh itself, a binary that is already present on the system, already constrained by a public standard (POSIX.1-2017), and whose behavior has been independently implemented dozens of times. The compiler arrives as a shell script, readable as plain text, requiring no trust in any binary that was not already present before it ran.

This is a different proposition from hex0, which is a purpose-built minimal binary whose only job is to exist at the bottom of the chain. Shell is already there for other reasons. A compiler implemented in it is auditable not because it is small but because the language it is written in is ubiquitous and well-specified.

Where c89cc.sh sits among minimal compilers

The lineage of small C compilers is long. Ron Cain’s Small C (1980) compiled a C subset to 8080 assembly and was published in full in Dr. Dobb’s Journal. Bellard’s otcc (2002) was a 2KB obfuscated C program that compiled a C subset and could compile itself. TCC extended that work into a production single-pass compiler targeting x86, x86-64, ARM, and RISC-V. Rui Ueyama’s chibicc (2020) reconstructed a full C11 compiler one commit at a time as a teaching project. Lars Kirkholt Melhus’s lacc does graph-coloring register allocation and emits ELF64 directly without invoking an assembler.

c89cc.sh occupies a different point in this space. It is not trying to be fast, complete, or pedagogical in the chibicc sense. Its constraint is the implementation language. Writing a compiler in POSIX sh forces every design decision toward minimalism: no AST because structured data is expensive, single-pass because multiple passes require intermediate file I/O, direct binary emission because invoking an assembler adds a dependency.

Those constraints map almost exactly onto what makes a compiler useful for bootstrapping. The most bootstrap-relevant compilers in the ecosystem (TCC, M2-Planet, otcc) share these properties not because their authors were hostile to optimization but because the same reasoning leads to the same architecture.

The real point

A shell script that compiles C to ELF64 is a systems programming artifact as legitimate as any of the compilers it could replace. The limitations of POSIX sh are not incidental to the project; they are what makes the result meaningful. The exercise of writing a compiler under those constraints produces a program whose design is entirely justified by the problem it solves, with no excess borrowed from a richer host language.

The bootstrappable builds problem is not exotic. It is the question of whether you can verify, from first principles, that the software running on your machine was built from the source you can read. A shell script compiler is one credible answer to part of that question, and it runs on hardware you already own, using tools that shipped with the operating system.

Was this interesting?