A Shell Script That Compiles C: What It Takes to Build From the Ground Up
Source: lobsters
There is a gist floating around Lobsters titled c89cc.sh: a standalone C89 compiler targeting ELF64, written entirely in portable shell. No gcc. No as. No ld. Just /bin/sh and the binary it can write.
The instinct is to call this a party trick. A more honest reaction is that it exposes something uncomfortable about how deep the dependency chain actually goes.
What This Actually Does
The script takes C89 source and emits a valid ELF64 executable using nothing but POSIX shell primitives. That means lexing C tokens with read and parameter expansion, doing code generation with integer arithmetic, and writing machine code bytes to stdout using printf. The output is a statically linked, stripped binary that Linux can execute directly.
The key primitive is printf’s octal escape support. POSIX mandates that both the builtin and /usr/bin/printf interpret \NNN as a single byte with the given octal value:
# Emit the ELF magic bytes: 0x7f, 'E', 'L', 'F'
printf '\177ELF'
# Emit a zero byte
printf '\000'
To emit computed values, you compose octal strings with shell arithmetic:
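# Emit a 16-bit value as two little-endian bytes
emit16le() {
    v=$1
    printf "\\$(printf '%03o' "$(( v & 0xFF ))")"
    printf "\\$(printf '%03o' "$(( (v >> 8) & 0xFF ))")"
}
# Emit a 32-bit value as four little-endian bytes
emit32le() {
    v=$1
    # "bits" rather than "shift", which is also the name of a shell builtin
    for bits in 0 8 16 24; do
        printf "\\$(printf '%03o' "$(( (v >> bits) & 0xFF ))")"
    done
}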
This is tedious but correct. POSIX only guarantees that $(( )) arithmetic is signed long, which may be as narrow as 32 bits; bash and dash on current systems evaluate it with 64-bit integers, which is what makes the 64-bit virtual addresses in ELF64 headers easy to handle. The approach has real limits: null bytes cannot be stored in shell variables (they terminate the C string underneath) and do not survive command substitution, so any part of the binary containing embedded zeros must be written directly to the output rather than built up in a variable.
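The same pattern extends to the 64-bit fields an ELF64 header needs. The following is a minimal sketch, not code taken from c89cc.sh: the helper names emit_byte and emit64le are illustrative, and it assumes a shell whose arithmetic is 64 bits wide.

```shell
# Hypothetical helpers, assuming 64-bit shell arithmetic (bash, dash).

# Emit the low 8 bits of $1 as one raw byte.
emit_byte() { printf "\\$(printf '%03o' "$(( $1 & 0xFF ))")"; }

# Emit $1 as eight little-endian bytes.
emit64le() {
    v64=$(( $1 ))                      # normalize hex or decimal input
    for bits in 0 8 16 24 32 40 48 56; do
        emit_byte "$(( v64 >> bits ))"
    done
}

# NUL bytes written straight to stdout survive:
#   printf 'A\000B' | wc -c
# Captured in a variable they do not: POSIX leaves the result
# unspecified, and bash drops the NUL with a warning.
#   s=$(printf 'A\000B')
```

Piping the output through od, as in emit64le 0x400078 | od -An -tx1, is a quick way to check the byte order by eye.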
The Minimal ELF64 Structure
An ELF64 executable needs surprisingly little to be valid. The format is defined in the System V ABI supplement and breaks down as follows:
ELF Header (64 bytes): starts with the four-byte magic \x7fELF, followed by class (64-bit), data encoding (little-endian), version, OS/ABI, then fields specifying the machine type (0x3E for x86-64), the entry point virtual address, and offsets to the program and section header tables.
Program Header Table: for a minimal static executable, you need exactly one entry of type PT_LOAD. This tells the kernel loader where in the file to read from and where in virtual memory to map the content.
Code: raw x86-64 machine code starting at the load segment offset.
Section Header Table: optional for execution. A stripped executable has none. The kernel only looks at program headers.
The practical result is that a minimal ELF64 executable header is 120 bytes: 64 for the ELF header, 56 for one program header entry. Everything after that is code. You can write a working Linux binary in a couple hundred bytes of printf calls, which is exactly what c89cc.sh does for its output.
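The entry point is a virtual address inside the loaded segment. Linker-produced binaries typically use something like 0x401000, with the code sitting at file offset 0x1000. In the minimal layout the code starts at file offset 120, right after the headers, so the single PT_LOAD segment maps the whole file at a page-aligned base such as 0x400000 and the entry point becomes 0x400078. The loader requires a segment's file offset and virtual address to be congruent modulo the page size, so code at offset 120 cannot be mapped directly to 0x401000. When the program starts, execution begins at the entry point, and the code must eventually invoke the exit system call via a syscall instruction; otherwise execution runs off the end of the code and the process crashes.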
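The layout described above can be sketched end to end in shell. This is an illustration under stated assumptions, not the code c89cc.sh uses: the function names are hypothetical, the load address 0x400000 is a conventional choice, and the payload is a hand-assembled exit(42).

```shell
# Sketch: write a minimal static ELF64 executable for x86-64 Linux
# to stdout. All names here are illustrative.

emit_byte() { printf "\\$(printf '%03o' "$(( $1 & 0xFF ))")"; }

# emit_le VALUE NBYTES: emit VALUE as NBYTES little-endian bytes.
emit_le() {
    le_v=$(( $1 ))
    le_n=$2
    while [ "$le_n" -gt 0 ]; do
        emit_byte "$le_v"
        le_v=$(( le_v >> 8 ))
        le_n=$(( le_n - 1 ))
    done
}

emit_minimal_elf() {
    base=$(( 0x400000 ))          # assumed page-aligned load address
    hdr_size=120                  # 64-byte ELF header + 56-byte program header
    code_size=12
    total=$(( hdr_size + code_size ))

    # ELF header (64 bytes)
    printf '\177ELF'              # magic
    printf '\002\001\001\000'     # 64-bit, little-endian, version 1, System V ABI
    emit_le 0 8                   # ABI version and padding
    emit_le 2 2                   # e_type = ET_EXEC
    emit_le 0x3E 2                # e_machine = x86-64
    emit_le 1 4                   # e_version
    emit_le "$(( base + hdr_size ))" 8   # e_entry = 0x400078
    emit_le 64 8                  # e_phoff: program headers follow the ELF header
    emit_le 0 8                   # e_shoff: no section headers
    emit_le 0 4                   # e_flags
    emit_le 64 2                  # e_ehsize
    emit_le 56 2                  # e_phentsize
    emit_le 1 2                   # e_phnum
    emit_le 0 2                   # e_shentsize
    emit_le 0 2                   # e_shnum
    emit_le 0 2                   # e_shstrndx

    # Program header (56 bytes): one PT_LOAD covering the whole file
    emit_le 1 4                   # p_type = PT_LOAD
    emit_le 5 4                   # p_flags = read + execute
    emit_le 0 8                   # p_offset
    emit_le "$base" 8             # p_vaddr
    emit_le "$base" 8             # p_paddr
    emit_le "$total" 8            # p_filesz
    emit_le "$total" 8            # p_memsz
    emit_le 0x1000 8              # p_align

    # Code (12 bytes): exit(42)
    printf '\277\052\000\000\000' # mov edi, 42
    printf '\270\074\000\000\000' # mov eax, 60 (exit)
    printf '\017\005'             # syscall
}
```

On an x86-64 Linux machine, emit_minimal_elf > tiny && chmod +x tiny && ./tiny; echo $? should print 42. The file is 132 bytes: 120 bytes of headers and 12 bytes of code.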
Why C89 Specifically
C89 is the right target for a minimal compiler for several reasons that compound on each other.
The grammar is simpler than C99 in one critical way: all variable declarations must appear before any statements in a block. This means a single-pass compiler can process a function body linearly: collect declarations, then emit code for statements. There is no need to scan forward or maintain complex scope state.
There are no variable-length arrays, so stack frame sizes are fixed at compile time. Code generation for function prologues and epilogues becomes straightforward. There is no long long, no _Bool, no restrict qualifier, no complex integer promotion edge cases from the later standards.
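To illustrate why a fixed frame size helps, here is a sketch of a prologue and epilogue emitter. It assumes the conventional rbp-based frame and a total size for locals that the compiler has already summed up; the function names are hypothetical, and c89cc.sh may well generate different code.

```shell
# Sketch: emit a conventional x86-64 function prologue and epilogue.
# Assumes the byte count of all locals is known before code generation.

emit_byte() { printf "\\$(printf '%03o' "$(( $1 & 0xFF ))")"; }

# emit_prologue LOCALS_BYTES
emit_prologue() {
    locals=$1
    frame=$(( (locals + 15) / 16 * 16 ))   # round up to keep 16-byte alignment
    printf '\125'                 # 55        push rbp
    printf '\110\211\345'         # 48 89 e5  mov rbp, rsp
    printf '\110\201\354'         # 48 81 ec  sub rsp, imm32
    for bits in 0 8 16 24; do     # the imm32 operand, little-endian
        emit_byte "$(( frame >> bits ))"
    done
}

emit_epilogue() {
    printf '\311\303'             # c9 c3     leave; ret
}
```

Because the frame size is a plain constant, the prologue can be written out the moment the declarations at the top of the function have been read, with no later patching.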
There are no // comments. The lexer only needs to handle /* */ block comments, which simplifies the token stream significantly.
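As a rough illustration of how little machinery block comments need, the sketch below strips them using only parameter expansion. The function name is hypothetical, and the sketch deliberately ignores comment markers inside string literals, which a real lexer would have to handle.

```shell
# Sketch: replace each /* ... */ comment in $1 with a single space.
# Simplification: does not recognize string or character literals.
strip_block_comments() {
    src=$1
    out=
    while :; do
        case $src in
            */\**)
                out=$out${src%%/\**}       # keep text before the first /*
                rest=${src#*/\*}           # text after that /*
                case $rest in
                    *\*/*)
                        out="$out "        # a comment counts as whitespace
                        src=${rest#*\*/}   # continue after the closing */
                        ;;
                    *)
                        src=               # unterminated comment: drop the rest
                        ;;
                esac
                ;;
            *)
                out=$out$src
                break
                ;;
        esac
    done
    printf '%s' "$out"
}
```

For example, strip_block_comments 'a/*b*/c' yields the text a c.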
C89 is still a complete, practical language. The Linux kernel was built as gnu89 (C89 plus GNU extensions) until version 5.18 in 2022, when it moved to gnu11. Many classic Unix utilities compile as C89. You can write a real program, including a self-hosting compiler, using only C89.
This last point matters because the goal of a project like c89cc.sh is not to be a production compiler. It is to occupy the lowest rung of a bootstrap ladder: a tool written in something auditable that can produce a binary that does something useful, which can then be used to compile something more capable.
The Bootstrapping Context
Ken Thompson’s Turing Award lecture, “Reflections on Trusting Trust”, published in 1984, is the canonical statement of the problem. Thompson showed that a C compiler can contain a self-replicating Trojan: one that inserts a backdoor into the login binary when compiling login, and also inserts the Trojan-injection code into any new compiler binary it compiles. Inspecting the source reveals nothing. The binary carries the attack forward.
The practical consequence is that the trust in any compiled software depends on the trust in the entire chain of compilers used to build it. Modern Linux distributions require a pre-existing GCC to build GCC. That GCC binary is itself the product of a prior GCC. The chain goes back to binaries no one remembers compiling.
The bootstrappable builds project and stage0-posix are the most serious efforts to address this. stage0-posix starts from hex0: a seed binary of a few hundred bytes (357 bytes in the commonly cited x86 build) that reads pairs of hex digits from its input and writes out the corresponding bytes. This seed is small enough that a determined person can verify each byte manually. From hex0, the chain climbs:
hex0 (357 bytes, hand-auditable)
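-> hex1 (hex assembler with single-character labels)
-> hex2 (hex assembler with full-length labels)
-> M0 (macro assembler)
-> cc_* (C subset compiler written in macro assembly)
-> M2-Planet (C subset compiler written in that C subset)
-> Mes (Scheme interpreter plus MesCC, a C compiler written in Scheme)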
-> TCC
-> GCC
Each step is built using only the tools produced by the steps before it; the real chain contains more intermediate tools than this summary shows. The seed binary is the only required act of faith. c89cc.sh belongs to the same conceptual space: a compiler seed that is “written in” something humans can read and verify without a binary toolchain.
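The job hex0 does is small enough to imitate in a few lines of shell. The sketch below is an approximation for illustration only, not the stage0-posix implementation: the function name is hypothetical, it reads stdin for convenience, and it assumes hex pairs are separated by whitespace with # or ; starting a comment.

```shell
# Sketch: turn whitespace-separated hex pairs on stdin into raw bytes.
# Assumed input format: "7f 45 4c 46  # comment" or "; comment".
hex0_sketch() {
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%%#*}          # drop "#" comments
        line=${line%%;*}          # drop ";" comments
        for pair in $line; do     # unquoted on purpose: split on whitespace
            printf "\\$(printf '%03o' "0x$pair")"
        done
    done
}
```

Feeding it the line 7f 45 4c 46 produces the four ELF magic bytes.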
Fabrice Bellard’s OTCC (Obfuscated Tiny C Compiler, 2001) predates this framing but explores the same territory. OTCC is roughly 2KB of obfuscated C that compiles a real subset of C to x86 machine code in memory and runs it. It is self-hosting. Bellard later developed it into TCC (Tiny C Compiler), a nearly complete C99 compiler of roughly 100KB that is still in use.
The Shell as a Trust Boundary
The interesting claim c89cc.sh makes is that /bin/sh is a more auditable and more portable starting point than any pre-compiled compiler binary. Minimal shell interpreters are small, and POSIX specifies the behavior the script relies on. On Debian and Ubuntu, /bin/sh is dash; on Alpine and many embedded systems it is busybox ash; elsewhere it is often bash or another shell running in POSIX mode. All of these can be inspected, and all implement the same relevant subset of POSIX.
Shell is also genuinely portable in a way that compiled binaries are not. A shell script runs wherever /bin/sh runs: Linux on x86-64, ARM, RISC-V, BSD systems, even on a system with no compiler installed at all. The generated ELF64 is necessarily architecture-specific (x86-64 machine code is hardcoded), but the compiler itself travels without recompilation.
The performance is obviously terrible compared to TCC or even a Python-based compiler. printf is a builtin in most shells, but every $(printf ...) command substitution still forks a subshell, so emitting one byte at a time this way is slow. But that is not the point. The point is that you can read the source, understand every line, and know with confidence what it does.
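The per-byte cost is not inherent to the approach. One possible optimization, sketched here with hypothetical names and with no claim that c89cc.sh does this, is to accumulate the octal escape text in a variable and hand it to printf once. The escape text for a zero byte contains no actual NUL, so it can be stored safely, and the octal digits can be computed with arithmetic expansion alone.

```shell
# Sketch: buffer octal escapes as text, then emit them with one printf.
# No command substitution is needed, so no subshell is forked per byte.

buf=

# Append the low 8 bits of $1 to the buffer as a \NNN escape.
buf_byte() {
    b=$(( $1 & 0xFF ))
    buf="$buf\\$(( b >> 6 ))$(( (b >> 3) & 7 ))$(( b & 7 ))"
}

# Append $1 as four little-endian bytes.
buf_u32le() {
    v=$(( $1 ))
    buf_byte "$v"
    buf_byte "$(( v >> 8 ))"
    buf_byte "$(( v >> 16 ))"
    buf_byte "$(( v >> 24 ))"
}

# Write the buffered bytes to stdout and reset the buffer.
# The buffer is used as the format string on purpose: it holds only
# \NNN escapes, never a literal % character.
flush_buf() {
    printf "$buf"
    buf=
}
```

A typical use would call buf_u32le several times and then flush_buf once per function or per section, trading a little memory for far fewer process forks.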
That kind of confidence is what the software supply chain has been lacking. The SolarWinds attack in 2020 compromised a build system rather than the source repository. The XZ Utils backdoor in 2024 arrived through social engineering of the project's maintainership, with the payload hidden in binary test files and activated by a build script shipped only in the release tarballs. A hand-auditable seed would not have stopped either attack by itself, since neither targeted the compiler, but both show how much can hide in the parts of a build that nobody reads.
A C89 compiler in 500 lines of shell is auditable by hand. A GCC 13 binary is not.
What It Reveals
Projects like c89cc.sh are existence proofs: you do not need a compiler to build a compiler. You need computation and I/O, and POSIX shell provides both. The ELF64 format is documented and fixed. The x86-64 instruction encoding is documented and fixed. Given those specifications and a Turing-complete language with the ability to write bytes to a file, you have everything required.
Most developers never think about this layer. We install a toolchain from a package manager, trust the package signature, and move on. That trust is reasonable for most threat models. But the tools to think differently, and to build differently, have existed for a long time. A shell script writing raw ELF headers is just one small, concrete demonstration of where that thinking leads.