Shell Is a Compiler Substrate: The Technical Depth Behind a C89-to-ELF64 Compiler in Pure sh
Source: lobsters
A gist recently surfaced on Lobsters with a title that reads almost like a dare: a standalone C89/ELF64 compiler implemented in pure, portable shell. The author’s own comment on the submission notes that Lobsters has tags for all sorts of languages but nothing for shell. That detail is more interesting than it sounds.
Shell is not usually considered a systems programming language. It has no type system, no memory management primitives, and no notion of binary data beyond raw file I/O. And yet all of those properties turn out to be either irrelevant or workable when the job is translating a restricted subset of C into a native Linux executable. Working through why reveals something about what compilers are at their core.
What ELF64 Actually Requires
An ELF64 executable is a sequence of bytes with a defined layout. The format is specified by the System V ABI and consists of a fixed 64-byte header, followed by program headers describing memory segments, followed by the actual code and data. Nothing in the format requires that it was produced by a particular language or toolchain.
The ELF header begins with a magic number, four bytes: 0x7F, 0x45, 0x4C, 0x46. Then a class byte (0x02 for 64-bit), endianness (0x01 for little-endian), version (0x01), OS/ABI (0x00 for System V), an ABI version byte, and seven bytes of padding, which together complete the 16-byte identification block. The remaining fields encode the file type (0x02 0x00 for executable), machine type (0x3E 0x00 for x86-64), a four-byte format version, the entry point address, offsets to the program and section header tables, flags, and various size and count fields.
Shell’s printf can emit all of this. The POSIX specification for printf guarantees octal escapes of the form \ddd; the \xNN hex escapes that bash accepts are an extension that not every /bin/sh implements, so portable code sticks to octal. A function that outputs the start of an ELF header is then a series of printf calls:
printf '\177ELF' # magic: 0x7F 'E' 'L' 'F'
printf '\002' # 64-bit class
printf '\001' # little-endian
printf '\001' # ELF version 1
printf '\000' # System V OS/ABI
printf '\000\000\000\000\000\000\000\000' # ABI version byte plus seven bytes of padding
printf '\002\000' # ET_EXEC executable
printf '\076\000' # x86-64 machine (0x3E)
printf '\001\000\000\000' # object file version
For a minimal executable with a single load segment, you need a 64-byte ELF header, one 56-byte program header, and then the machine code itself, which therefore starts at file offset 120. The arithmetic for offsets and sizes can be done entirely with shell’s $(( )) arithmetic expansion. Writing a correct ELF64 file by hand in shell is tedious but mechanically straightforward.
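To make the offset arithmetic concrete, here is a minimal sketch of a complete ELF64 writer in POSIX sh. It is not taken from the gist: the helper names (emit_byte, emit_le, write_elf), the 0x400000 load address, and the 12-byte exit(42) payload are all assumptions chosen for illustration, and it relies on the shell having 64-bit arithmetic.

```shell
#!/bin/sh
# Sketch only: helper names and layout choices are hypothetical, not the gist's.

# emit_byte N: write one byte (0-255) using a POSIX octal escape.
emit_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# emit_le VALUE WIDTH: write VALUE as a WIDTH-byte little-endian integer.
emit_le() {
    le_value=$1
    le_width=$2
    while [ "$le_width" -gt 0 ]; do
        emit_byte $(( le_value & 255 ))
        le_value=$(( le_value >> 8 ))
        le_width=$(( le_width - 1 ))
    done
}

# Layout arithmetic, done entirely with $(( )).
ehdr_size=64
phdr_size=56
code_size=12                            # length of the machine code below
base=$(( 0x400000 ))                    # assumed load address
code_off=$(( ehdr_size + phdr_size ))   # 120
entry=$(( base + code_off ))            # 0x400078
file_size=$(( code_off + code_size ))   # 132

write_elf() {
    # ELF header (64 bytes)
    printf '\177ELF\002\001\001\000'           # magic, class, data, version, OS/ABI
    printf '\000\000\000\000\000\000\000\000'  # ABI version + 7 bytes of padding
    emit_le 2 2               # e_type: ET_EXEC
    emit_le 62 2              # e_machine: EM_X86_64 (0x3E)
    emit_le 1 4               # e_version
    emit_le "$entry" 8        # e_entry
    emit_le "$ehdr_size" 8    # e_phoff: program header follows the ELF header
    emit_le 0 8               # e_shoff: no section headers
    emit_le 0 4               # e_flags
    emit_le "$ehdr_size" 2    # e_ehsize
    emit_le "$phdr_size" 2    # e_phentsize
    emit_le 1 2               # e_phnum
    emit_le 0 2               # e_shentsize
    emit_le 0 2               # e_shnum
    emit_le 0 2               # e_shstrndx

    # Program header (56 bytes): one loadable, read+execute segment
    emit_le 1 4               # p_type: PT_LOAD
    emit_le 5 4               # p_flags: PF_R | PF_X
    emit_le 0 8               # p_offset: map the file from its start
    emit_le "$base" 8         # p_vaddr
    emit_le "$base" 8         # p_paddr
    emit_le "$file_size" 8    # p_filesz
    emit_le "$file_size" 8    # p_memsz
    emit_le 4096 8            # p_align

    # Machine code (12 bytes): exit(42) via the Linux x86-64 syscall ABI
    printf '\277\052\000\000\000'   # bf 2a 00 00 00   mov edi, 42
    printf '\270\074\000\000\000'   # b8 3c 00 00 00   mov eax, 60 (exit)
    printf '\017\005'               # 0f 05            syscall
}
```

On an x86-64 Linux machine, `write_elf > tiny; chmod +x tiny; ./tiny; echo $?` should print 42, since the payload does nothing but call exit. A real compiler would compute code_size from the generated code instead of hard-coding it, but the header arithmetic stays the same.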
Why C89 Specifically
C89, the 1989 ANSI standard ratified as ISO/IEC 9899:1990, is the smallest complete C specification. Compared to C99 and later standards, it omits variable-length arrays, designated initializers, complex number types, inline functions, _Bool, // line comments, and the <stdint.h> and <stdbool.h> headers.
From a compiler implementation standpoint, each omitted feature makes semantic analysis cheaper. C89 has a simpler type system, with no boolean or complex types and none of the conversion rules that come with them, and a narrower set of expression forms. The grammar is smaller. The set of valid programs is smaller. A compiler that only needs to handle C89 can skip work, such as sizing variable-length arrays at run time or resolving designated initializers, that a full C17 compiler has to do.
There is also a deeper reason to target C89 specifically. The bootstrappable builds movement, which has been working for over a decade on the problem of building software from fully auditable source, converges on C89 as a lingua franca. Projects like M2-Planet implement a minimal C compiler that targets multiple architectures and is itself written in a subset of C that M2-Planet can compile. The Stage0 project goes further, establishing a path from hand-written hex bytes all the way up to a high-level language without ever relying on a pre-compiled binary that cannot be inspected.
The logic is that C89 is old enough, stable enough, and minimal enough that a correct implementation of it can be written in a few thousand lines of fairly simple code. If your compiler can compile C89, and you write it in C89, you have a self-hosting compiler with a very small trusted computing base.
Parsing in Shell
The harder part of a shell-based C compiler is not binary emission but parsing. C89’s grammar is not ambiguous in the ways that make it truly difficult to parse, but it is not context-free either: distinguishing between a type cast and a parenthesized expression requires knowing what names have been declared as types, which means the parser needs access to a symbol table.
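The cast-versus-expression decision can be sketched in a few lines of shell. This is an illustration of the symbol-table lookup, not the gist’s code: the function names and the space-delimited typedef_names string are assumptions, and struct, union, and enum specifiers are left out to keep it short.

```shell
#!/bin/sh
# Sketch only: hypothetical names, simplified to single-word type names.

# Typedef names seen so far, stored as a space-delimited string.
typedef_names=" "

# declare_typedef NAME: record NAME as a type name.
declare_typedef() {
    typedef_names="$typedef_names$1 "
}

# is_type_name NAME: succeed if NAME is a built-in type keyword or a typedef.
is_type_name() {
    case $1 in
        void|char|short|int|long|float|double|signed|unsigned) return 0 ;;
    esac
    case $typedef_names in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

# classify_paren NAME: decide how the parser should read "(NAME) ...".
classify_paren() {
    if is_type_name "$1"; then
        echo cast
    else
        echo expression
    fi
}
```

Before `declare_typedef size_t` runs, `classify_paren size_t` prints `expression`; afterwards it prints `cast`. The same token sequence parses two different ways depending on state the parser has accumulated, which is why the lookup has to be available during parsing rather than in a later pass.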
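Shell can handle this with its own pattern matching or by leaning on the POSIX text-processing utilities. Lexical analysis, which is the stage that turns character streams into tokens, maps naturally onto sed or onto case patterns and parameter expansion. AWK handles structured line processing well enough to implement a recursive-descent parser for a restricted grammar. Shell variables can carry state such as the symbol table from one phase to the next, as long as the code that sets them runs in the current shell and not in a pipeline subshell, where assignments are lost when the stage exits.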
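As a rough illustration of lexing without any external utility, the sketch below walks a string one character at a time using only parameter expansion and case patterns. The tokenize function and its "KIND VALUE" output format are hypothetical, and it deliberately ignores multi-character operators, string literals, and comments.

```shell
#!/bin/sh
# Sketch only: a hypothetical single-character-operator lexer in POSIX sh.

# tokenize TEXT: print one token per line as "KIND VALUE".
tokenize() {
    rest=$1
    while [ -n "$rest" ]; do
        ch=${rest%"${rest#?}"}          # first character of $rest
        case $ch in
            [[:space:]])
                rest=${rest#?}          # skip whitespace
                ;;
            [A-Za-z_])
                word=
                while :; do             # consume identifier characters
                    ch=${rest%"${rest#?}"}
                    case $ch in
                        [A-Za-z0-9_]) word=$word$ch; rest=${rest#?} ;;
                        *) break ;;
                    esac
                done
                printf 'ident %s\n' "$word"
                ;;
            [0-9])
                word=
                while :; do             # consume decimal digits
                    ch=${rest%"${rest#?}"}
                    case $ch in
                        [0-9]) word=$word$ch; rest=${rest#?} ;;
                        *) break ;;
                    esac
                done
                printf 'number %s\n' "$word"
                ;;
            *)
                printf 'punct %s\n' "$ch"   # everything else, one char at a time
                rest=${rest#?}
                ;;
        esac
    done
}
```

Calling `tokenize 'x = y1 + 42;'` prints six lines: `ident x`, `punct =`, `ident y1`, `punct +`, `number 42`, and `punct ;`. Peeling one character per iteration is slow, which is one reason a real implementation might prefer sed or AWK for this stage.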
The limitations are real. Shell has no native data structures beyond strings and positional parameters. Trees need to be serialized to text and re-parsed, or simulated through naming conventions where node_42_left and node_42_right are just variable names. Stack-based recursion requires careful management of global state. None of this is elegant, but it works.
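The naming-convention trick looks roughly like the sketch below, which stores expression-tree nodes in variables named node_N_field and evaluates them recursively. All names here are hypothetical. The one subtle point is that eval_node keeps its intermediate result in its own positional parameters, which the shell restores when a function returns, because every ordinary variable is global.

```shell
#!/bin/sh
# Sketch only: a hypothetical expression tree built from variable names.

node_count=0

# new_node KIND VALUE [LEFT RIGHT]: create a node, leave its id in $node_id.
new_node() {
    node_count=$(( node_count + 1 ))
    node_id=$node_count
    eval "node_${node_id}_kind=\$1"
    eval "node_${node_id}_value=\${2-}"
    eval "node_${node_id}_left=\${3-}"
    eval "node_${node_id}_right=\${4-}"
}

# node_get ID FIELD: print one field of a node.
node_get() {
    eval "printf '%s\n' \"\$node_${1}_${2}\""
}

# eval_node ID: compute the value of the subtree, leave it in $result.
eval_node() {
    case $(node_get "$1" kind) in
        num)
            result=$(node_get "$1" value)
            ;;
        add|mul)
            eval_node "$(node_get "$1" left)"
            set -- "$1" "$result"       # stash the left value in this call's $2
            eval_node "$(node_get "$1" right)"
            case $(node_get "$1" kind) in
                add) result=$(( $2 + result )) ;;
                mul) result=$(( $2 * result )) ;;
            esac
            ;;
    esac
}

# Build and evaluate (2 + 3) * 4.
new_node num 2;               a=$node_id
new_node num 3;               b=$node_id
new_node add "" "$a" "$b";    sum=$node_id
new_node num 4;               c=$node_id
new_node mul "" "$sum" "$c";  root=$node_id
eval_node "$root"
echo "$result"                # prints 20
```

If the left value were kept in a plain variable instead of `$2`, the recursive call for the right subtree would overwrite it, which is the kind of global-state hazard the paragraph above describes.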
Fabrice Bellard’s TCC, the Tiny C Compiler, demonstrates that a full C compiler targeting x86-64 can be implemented in a few tens of thousands of lines of C. A shell-based compiler operating on a smaller subset of C89 and generating less optimal code has a correspondingly smaller surface area. The tradeoffs shift: slower execution of the compiler itself, more limited optimization, but radically reduced dependencies.
The Bootstrap Dependency Problem
The point of a shell-based C compiler is not performance. It is dependency minimization. To compile GCC or Clang, you need an existing C++ compiler. To compile that compiler, you need another compiler. At some point, the chain terminates in a binary blob that was compiled decades ago and cannot be directly verified against any source. Ken Thompson’s Turing Award lecture, “Reflections on Trusting Trust,” published in 1984, describes exactly this attack surface: a compiler can be made to insert backdoors into code it compiles, including into itself, in a way that is invisible in the source.
The bootstrappable builds community’s response is to shrink that trusted blob as close to zero as possible. Hex0 is a minimal hex assembler that fits in a few hundred bytes and can assemble itself from hex source. GNU Mes pairs a Scheme interpreter with a C compiler written in Scheme and is used to bootstrap GCC from nearly first principles. A shell compiler fits into this lineage: /bin/sh is available on virtually every Unix-like system, and on many systems it is a small binary like dash that is itself straightforward to audit.
If you can compile C89 with shell, and your shell binary is auditable, then you can bootstrap from shell to C89 to a full compiler without introducing any opaque pre-compiled artifacts. The bootstrappable.org project tracks the state of this effort across multiple distributions.
What This Actually Demonstrates
The Lobsters comment about shell having no language tag reflects a genuine gap in how the programming community categorizes languages. Shell is dismissed as glue, as scripting, as automation. But the operations that a compiler performs, tokenizing text, building a symbol table, emitting structured byte sequences, are all things that POSIX shell and its companion utilities handle directly.
C89 as a target is the right choice not just for minimalism but because the specification is stable, widely documented, and well understood. ELF64 is the right output format because it is the native format on Linux and the specification is public and precise. Shell is the right implementation language because it is the most widely available programming environment: it needs nothing compiled beyond the kernel, the shell binary itself, and whatever standard utilities the script calls.
A compiler that fits in a shell script does not compete with GCC or LLVM on any practical axis. It competes on a different axis: how few prior assumptions it requires to run. That question matters more than it used to, now that the software supply chain is a routine attack target and reproducible builds are considered baseline good practice rather than academic curiosity.