The Lobsters poster who linked c89cc.sh had a dry observation attached: “We have tags for all sorts of languages, but nothing for shell?” The gist itself is a standalone C89 compiler written entirely in POSIX shell. Feed it C source, get a working ELF64 binary. No assembler, no linker, no C compiler anywhere in the dependency graph. The entire compilation pipeline, including lexing, parsing, code generation, and ELF construction, runs inside /bin/sh.
The immediate temptation is to file this under “clever hacks” and move on, but the project touches a real and unresolved problem in systems software, one that Ken Thompson articulated in 1984 and that the bootstrapping community has spent decades trying to address.
The Trust Problem
Thompson’s Turing Award lecture, published in 1984 as “Reflections on Trusting Trust”, is worth reading in full if you have not encountered it. The argument runs in three stages. First, he demonstrates self-reproducing programs. Second, he describes inserting a Trojan horse into the Unix C compiler that injects a backdoor into login.c whenever it compiles that file. Third, and most memorably, he modifies the compiler to also recognize when it is compiling itself and to reinsert both pieces of malicious logic automatically. Once this poisoned compiler has been built and installed, every compiler it goes on to build carries the Trojans, and every login it builds carries the backdoor, even if you discard the modified source and compile from a clean copy. The malice lives in the binary, not the source.
“The moral is obvious,” Thompson wrote. “You can’t trust code that you did not totally create yourself.”
This is not a theoretical concern. Every time you install a compiler from a package repository, you are trusting a chain of binaries going back to some point where a human physically verified that the binary matched the source. In practice, that verification rarely happens, and the chain is trusted by convention rather than by inspection.
The Bootstrapping Tradition
One community has taken Thompson’s concern seriously enough to build a systematic response. The Bootstrappable Builds project aims to reduce the “seed,” the initial trusted binary from which everything else is derived, to something small enough for a person to audit by hand, every byte.
The most rigorous approach in that tradition starts with hex0, a 357-byte x86 ELF binary that reads a hexadecimal text file and outputs the bytes it represents. From hex0 you can bootstrap hex1, which adds label support, then hex2, which adds longer labels and more flexible address calculation, then the M0 and M1 macro assemblers, then M2-Planet (a C-like compiler targeting multiple architectures), then GNU Mes, then TCC, and eventually full GCC. The GNU Guix project ships this entire auditable chain as part of its bootstrap process.
hex0 is roughly 30 lines written in hexadecimal notation. You read each byte, check it against the comment describing what x86 instruction it encodes, and you are done. The seed is small enough for one person to verify in an afternoon with a copy of the x86 instruction reference. The rest of the chain follows mechanically from there.
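To make the job of hex0 concrete, here is a rough shell sketch of the same idea: read hexadecimal text, skip comments, and write the bytes it describes. The function name hex_decode and the choice of # and ; as comment markers are assumptions made for illustration, and the real hex0 is hand-written machine code, not shell.

```sh
#!/bin/sh
# Hypothetical sketch of what a hex0-style decoder does.
# Assumes '#' and ';' start a comment that runs to the end of the line.
hex_decode() {
    hi=                                  # pending high nibble, if any
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%%[#;]*}              # strip the comment, if present
        while [ -n "$line" ]; do
            c="${line%"${line#?}"}"      # first character
            line="${line#?}"             # remainder of the line
            case "$c" in
                [0-9A-Fa-f])
                    if [ -z "$hi" ]; then
                        hi=$c
                    else
                        # Two nibbles collected: emit one byte as an octal escape.
                        printf "\\$(printf '%03o' "$(( 0x$hi$c ))")"
                        hi=
                    fi ;;
            esac                         # anything else is ignored
        done
    done
}
```

Something like printf '7f 45 4c 46 # magic\n' | hex_decode would then write the four bytes of the ELF magic number to stdout.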
Where c89cc.sh Sits
c89cc.sh approaches the same concern from a different angle. Its seed is /bin/sh rather than a 357-byte binary, which means it has traded one kind of trust for another. A production shell, even a minimal one like dash, is many thousands of lines of C and compiles to a binary on the order of a hundred kilobytes or more. You cannot audit that in an afternoon.
But the project is making a different and more practical point. If you accept /bin/sh as given, as most developers working on any Linux or Unix system already do implicitly, then c89cc.sh demonstrates that you need nothing else to compile C89 to a native ELF64 binary. The toolchain collapses to a single trusted primitive. This matters in environments where you have a shell but cannot or will not install a C compiler: minimal containers, early-stage bootstrap environments, or restricted systems. It also matters conceptually as a demonstration of what POSIX shell can do when taken seriously as a programming environment.
How the Shell Does It
POSIX sh has no arrays and no native character-at-a-time input, but it has $(( )) arithmetic and a rich set of parameter expansion operators. These turn out to be enough to build a working lexer.
To extract a single character from a string $str in pure POSIX sh, without any bash extensions:
first="${str%"${str#?}"}"
rest="${str#?}"
${str#?} removes the shortest prefix matching any single character, yielding everything after the first. Removing that result as a suffix from the original string leaves only the first character. This works in any POSIX-conforming shell, including dash, mksh, and busybox sh, because it relies entirely on standard parameter expansion. No read -n1, no arrays, no subprocesses.
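As a minimal sketch of how that idiom can drive a scanner, the loop below peels one character at a time off its input and groups the characters into tokens. The function name tokenize and its token rules (words made of letters, digits, and underscores; everything else a single-character token; only spaces skipped) are illustrative assumptions, not code taken from c89cc.sh.

```sh
#!/bin/sh
# Hypothetical sketch: split a string into tokens using only POSIX
# parameter expansion. Prints one token per line.
tokenize() {
    src=$1
    tok=
    while [ -n "$src" ]; do
        c="${src%"${src#?}"}"            # first character
        src="${src#?}"                   # everything after it
        case "$c" in
            [A-Za-z0-9_])
                tok="$tok$c" ;;          # extend the current word
            *)
                # A non-word character ends any word in progress.
                [ -n "$tok" ] && printf '%s\n' "$tok"
                tok=
                case "$c" in
                    ' ') ;;              # skip spaces
                    *) printf '%s\n' "$c" ;;
                esac ;;
        esac
    done
    [ -n "$tok" ] && printf '%s\n' "$tok"
    return 0
}
```

Calling tokenize 'x = y1+42;' prints x, =, y1, +, 42, and ; on separate lines. A real lexer would also need multi-character operators, string literals, and comments, which this sketch leaves out.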
For binary output, printf is the critical mechanism. printf '\177ELF' writes the four-byte ELF magic number directly to stdout. The escape is octal because POSIX printf guarantees only octal escapes; the '\x7fELF' spelling familiar from bash is an extension that shells such as dash do not support. Every byte of the ELF64 header is assembled this way: the class byte (0x02 for 64-bit), the data encoding byte (0x01 for little-endian), the machine type field (0x3e 0x00 for x86-64 in little-endian byte order), the entry point virtual address, and the program header table offset. The ELF64 specification defines the ELF header as exactly 64 bytes and each program header entry as 56 bytes. c89cc.sh constructs both from scratch, computing all byte offsets using $(( )) arithmetic during code generation.
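The byte-emission step can be sketched with two small helpers. The names emit_byte and emit_le are hypothetical, and the sketch uses octal escapes because those are the only numeric escapes POSIX requires printf to understand. It spends one command substitution per byte for brevity, which a real compiler might well avoid.

```sh
#!/bin/sh
# Hypothetical helpers for writing binary data from POSIX sh.

# emit_byte N: write one byte with value N (0..255) to stdout.
emit_byte() {
    # The inner printf formats N as three octal digits; the outer printf
    # interprets the resulting \ddd escape as a single byte.
    printf "\\$(printf '%03o' "$1")"
}

# emit_le WIDTH VALUE: write VALUE as WIDTH bytes, least significant first.
emit_le() {
    n=$1 v=$2
    while [ "$n" -gt 0 ]; do
        emit_byte $(( v & 255 ))
        v=$(( v >> 8 ))
        n=$(( n - 1 ))
    done
}
```

With these, emit_le 2 62 writes the two-byte machine type field for x86-64 (0x3e 0x00), and emit_le 8 $(( 0x400078 )) writes an eight-byte little-endian address.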
The generated x86-64 machine code bypasses libc entirely and uses Linux syscalls directly. Writing to stdout becomes a syscall instruction with rax=1 (the write syscall number), the file descriptor in rdi, the buffer address in rsi, and the byte count in rdx. Exit uses rax=60. The shell script calculates all virtual addresses and instruction offsets at generation time. No relocation entries, no dynamic linker, no PLT: the output is a self-contained static binary that the kernel can load and run directly.
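A sketch of what emitting such a syscall sequence might look like follows. The helper names, the 0x400000 load address, and the layout in the usage note are assumptions for illustration. The sketch also uses the 32-bit forms of mov, which zero-extend into the full 64-bit registers; the instruction selection in c89cc.sh may differ.

```sh
#!/bin/sh
# Hypothetical sketch: emit x86-64 machine code for write(1, ADDR, LEN).

emit_byte() { printf "\\$(printf '%03o' "$1")"; }

emit_le() {                       # emit_le WIDTH VALUE, little-endian
    n=$1 v=$2
    while [ "$n" -gt 0 ]; do
        emit_byte $(( v & 255 ))
        v=$(( v >> 8 ))
        n=$(( n - 1 ))
    done
}

# emit_write_stdout ADDR LEN: 22 bytes of code.
emit_write_stdout() {
    addr=$1 len=$2
    emit_byte $(( 0xb8 )); emit_le 4 1         # mov eax, 1    (write)
    emit_byte $(( 0xbf )); emit_le 4 1         # mov edi, 1    (stdout)
    emit_byte $(( 0xbe )); emit_le 4 "$addr"   # mov esi, ADDR (buffer)
    emit_byte $(( 0xba )); emit_le 4 "$len"    # mov edx, LEN  (byte count)
    emit_byte $(( 0x0f )); emit_byte 5         # syscall
}
```

The address is known before any code is written. If, for example, the code starts at file offset 120 and consists of this 22-byte sequence plus a 12-byte exit sequence, the string data begins at offset 154, so the buffer address is $(( 0x400000 + 154 )).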
A minimal ELF64 executable structured this way needs only three things: the 64-byte ELF header, one 56-byte PT_LOAD program header describing the loadable segment, and the machine code itself. Section headers are entirely optional for a runnable executable; they exist for the linker, not the loader. By skipping them, c89cc.sh avoids emitting anything the kernel does not need in order to load and run the program.
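The three-part layout can be sketched end to end. The function below writes a 132-byte executable whose only action is to call exit with a given status. All names, the 0x400000 load address, and the R+X segment flags are assumptions chosen for illustration, not details confirmed from c89cc.sh.

```sh
#!/bin/sh
# Hypothetical sketch: write a minimal ELF64 executable for x86-64 Linux.

emit_byte() { printf "\\$(printf '%03o' "$1")"; }

emit_le() {                       # emit_le WIDTH VALUE, little-endian
    n=$1 v=$2
    while [ "$n" -gt 0 ]; do
        emit_byte $(( v & 255 ))
        v=$(( v >> 8 ))
        n=$(( n - 1 ))
    done
}

write_minimal_elf() {
    status=$1
    base=$(( 0x400000 ))          # assumed load address
    code_off=$(( 64 + 56 ))       # ELF header + one program header
    total=$(( code_off + 12 ))    # plus 12 bytes of machine code

    # ELF header, 64 bytes
    printf '\177ELF'              # magic
    emit_byte 2                   # ELFCLASS64
    emit_byte 1                   # little-endian
    emit_byte 1                   # EV_CURRENT
    emit_le 9 0                   # OS ABI, ABI version, padding
    emit_le 2 2                   # e_type = ET_EXEC
    emit_le 2 62                  # e_machine = EM_X86_64
    emit_le 4 1                   # e_version
    emit_le 8 $(( base + code_off ))   # e_entry
    emit_le 8 64                  # e_phoff: right after this header
    emit_le 8 0                   # e_shoff: no section headers
    emit_le 4 0                   # e_flags
    emit_le 2 64                  # e_ehsize
    emit_le 2 56                  # e_phentsize
    emit_le 2 1                   # e_phnum
    emit_le 2 0                   # e_shentsize
    emit_le 2 0                   # e_shnum
    emit_le 2 0                   # e_shstrndx

    # Program header, 56 bytes
    emit_le 4 1                   # p_type = PT_LOAD
    emit_le 4 5                   # p_flags = read + execute
    emit_le 8 0                   # p_offset: map from start of file
    emit_le 8 "$base"             # p_vaddr
    emit_le 8 "$base"             # p_paddr
    emit_le 8 "$total"            # p_filesz
    emit_le 8 "$total"            # p_memsz
    emit_le 8 4096                # p_align

    # Machine code, 12 bytes
    emit_byte $(( 0xbf )); emit_le 4 "$status"   # mov edi, status
    emit_byte $(( 0xb8 )); emit_le 4 60          # mov eax, 60 (exit)
    emit_byte $(( 0x0f )); emit_byte 5           # syscall
}
```

On an x86-64 Linux machine, running write_minimal_elf 42 > a.out followed by chmod +x a.out and ./a.out should leave 42 in $?. Every offset in the file is computed with $(( )) before the corresponding bytes are written, which is the same discipline the article describes.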
The Supported Subset
The compiler handles the essential core of C89: integer types (int, char, long), basic arithmetic and logical operators, if/else/while/for control flow, functions with typed parameters and return values, local variables on the stack, and basic pointer operations. This is not the full standard, but it is enough to write nontrivial programs and to trace the full pipeline from source characters to loaded binary.
The comparison to other minimal compiler projects is instructive. TCC, Fabrice Bellard’s Tiny C Compiler, targets C99 with GNU extensions, is self-hosting, and runs to roughly 80,000 lines of C. chibicc, written by Rui Ueyama of mold linker fame, targets C11 in about 6,000 lines of C, with each commit introducing exactly one feature in pedagogical sequence. cproc, which uses the QBE backend, is around 7,000 lines and serves as the system compiler for the Oasis Linux distribution. c89cc.sh is fewer than 1,000 lines of shell and requires no build step at all.
The performance cost of the shell approach is real. Scanning C source character by character using ${str#?} is O(n²) in the input length, because in a typical shell each extraction builds a new string holding all of the remaining input. A 10,000-character file therefore costs on the order of 50 million character copies in total. For large files this slows noticeably, but for the size of programs that c89cc.sh is practically useful for, the performance is acceptable.
What It Shows
Shell is a real programming language with a real execution environment. The reason nobody writes production compilers in it is not fundamental capability; it is performance and ergonomics. The same printf escape sequences that emit ELF headers are the ones that format log output. The same $(( )) arithmetic that tracks program counter offsets is the one that counts loop iterations in a build script.
c89cc.sh makes the boundary between “scripting language” and “systems language” less obvious than convention suggests. The substrate is more expressive than its typical use implies, and this project demonstrates that in the most concrete possible terms: it takes text in one well-defined format (C89 source) and produces binary in another (ELF64 executables), using nothing but the operations that POSIX mandates for any conforming shell.
Thompson’s problem does not have a clean resolution. Trust has to start somewhere, and you choose how far back to trace it. c89cc.sh traces it to /bin/sh, which is a reasonable place to stop for most practical purposes. The result is an entire C compilation toolchain collapsed into a single portable file. The script itself runs wherever a POSIX-conforming shell does, which covers the overwhelming majority of places where software gets built, although the binaries it emits target x86-64 Linux specifically.