C89 to ELF64 in Pure Shell: printf as Code Generator

A C compiler written in pure portable shell sounds like a contest entry or a weekend dare. The gist by alganet is neither: it is a genuine implementation that parses a restricted subset of C89 and emits valid ELF64 binaries using nothing but POSIX sh. No awk co-processes, no compiled helpers, no external tools; the compiler is the shell script itself.

Understanding why this is possible requires looking at the ELF64 format more carefully than most developers ever do.

The ELF64 Format Is Smaller Than It Looks

The Linux kernel does not require much from an executable binary. To run a program, it needs an ELF file header (64 bytes), at least one PT_LOAD program header describing where to map the file into memory (56 bytes), and the raw machine code to execute. No sections, no symbol tables, no relocation entries, and no dynamic linking information are required for a statically-linked binary that calls no external libraries. The total structural overhead for a minimal runnable ELF64 file is 120 bytes, followed by whatever machine code you want to execute.

The ELF file header starts with four magic bytes (\x7f, E, L, F), followed by fields describing the architecture (64-bit little-endian x86-64), the file type (ET_EXEC for a statically-linked executable), the virtual address of the entry point, and the byte offset of the program header table within the file. The single PT_LOAD program header that follows tells the kernel to load the entire file into memory starting at virtual address 0x400000, the conventional load address for static Linux executables, and to mark that region as readable and executable.

POSIX sh can write every one of those bytes using printf:

printf '\177ELF'   # ELF magic: 0x7f, 'E', 'L', 'F'
printf '\002'      # EI_CLASS: ELFCLASS64
printf '\001'      # EI_DATA: little-endian (ELFDATA2LSB)
printf '\001'      # EI_VERSION: current

POSIX specifies octal escapes for printf: \NNN in the format string and \0NNN in %b arguments. The \xNN hex escapes accepted by bash and some other shells are an extension that dash, for one, does not implement, so a portable emitter sticks to octal. Redirecting stdout to a file turns a shell script into a binary emitter. The machine code for exit(N) on x86-64 Linux is 12 bytes:

b8 3c 00 00 00   # mov eax, 60   (sys_exit syscall number)
bf NN 00 00 00   # mov edi, N    (exit code argument; NN is N as one hex byte)
0f 05            # syscall

A complete, runnable ELF64 binary that exits with a given code is therefore 132 bytes (120 bytes of headers plus 12 bytes of code), all of which a shell script can produce without calling any external program.
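
To make the byte counts concrete, here is a minimal sketch of such an emitter. It is not taken from the gist: the helper names (byte, le16, le32, le64, zeros, emit_exit_elf) are hypothetical, and the field values assume the layout described above (one PT_LOAD segment at 0x400000, entry point immediately after the headers).

```sh
# Sketch only: helper names are hypothetical, not the gist's.

# byte N: write one byte with value N (0..255) via a POSIX octal escape.
byte() { printf "\\$(printf '%03o' "$1")"; }

# le16/le32/le64 N: write N as a little-endian integer of that width.
le16() { byte $(( $1 & 255 )); byte $(( ($1 >> 8) & 255 )); }
le32() { le16 $(( $1 & 65535 )); le16 $(( ($1 >> 16) & 65535 )); }
le64() { le32 $(( $1 & 4294967295 )); le32 $(( ($1 >> 32) & 4294967295 )); }

# zeros N: write N zero bytes.
zeros() { i=0; while [ "$i" -lt "$1" ]; do printf '\000'; i=$((i + 1)); done; }

# emit_exit_elf N: write a 132-byte ELF64 executable that calls exit(N).
emit_exit_elf() {
    code=$(( $1 & 255 ))
    base=$(( 0x400000 ))        # assumed load address
    hdrs=$(( 64 + 56 ))         # ELF header + one program header
    size=$(( hdrs + 12 ))       # plus 12 bytes of machine code

    # ELF file header (64 bytes)
    printf '\177ELF'            # e_ident magic
    byte 2; byte 1; byte 1      # ELFCLASS64, ELFDATA2LSB, EV_CURRENT
    zeros 9                     # OS ABI, ABI version, padding
    le16 2                      # e_type: ET_EXEC
    le16 62                     # e_machine: EM_X86_64
    le32 1                      # e_version
    le64 $(( base + hdrs ))     # e_entry: first byte after the headers
    le64 64                     # e_phoff: program header follows ELF header
    le64 0                      # e_shoff: no section headers
    le32 0                      # e_flags
    le16 64; le16 56; le16 1    # e_ehsize, e_phentsize, e_phnum
    le16 64; le16 0; le16 0     # e_shentsize, e_shnum, e_shstrndx

    # PT_LOAD program header (56 bytes)
    le32 1                      # p_type: PT_LOAD
    le32 5                      # p_flags: PF_R | PF_X
    le64 0                      # p_offset: map from start of file
    le64 "$base"; le64 "$base"  # p_vaddr, p_paddr
    le64 "$size"; le64 "$size"  # p_filesz, p_memsz
    le64 4096                   # p_align

    # Machine code (12 bytes)
    printf '\270\074\000\000\000'   # b8 3c 00 00 00  mov eax, 60
    printf '\277'; le32 "$code"     # bf NN 00 00 00  mov edi, N
    printf '\017\005'               # 0f 05           syscall
}
```

On an x86-64 Linux system, emit_exit_elf 42 > exit42 followed by chmod +x exit42 and ./exit42 should leave 42 in $?. Note that chmod is an external utility; the emitter itself uses only shell built-ins, assuming printf is built in, as it is in dash and bash.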

Why C89 Is the Right Subset

C89, the 1989 ANSI C standard, is a deliberate target for a minimal compiler implementation. Compared to C99 or C11, the feature set is substantially smaller: no variable-length arrays, no // single-line comments, no inline, no _Bool, no designated struct initializers, and no mixed declarations and code. That last restriction is the most parser-friendly property of the standard. In C89, all variable declarations within a block must appear before any statements, so the compiler always knows whether it is in a declaration phase or a statement phase, without lookahead or context tracking.

The C89 grammar is also relatively unambiguous at the expression level. There are no compound literals, no for loop variable declarations, and no complex type inference rules to implement. For a compiler whose primary target is int main() { return N; }, C89 provides enough of a real language to be interesting while keeping the grammar tractable for a restricted implementation.
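
The declarations-before-statements rule can be tracked with a single state variable. The sketch below is an illustration under loud assumptions, not the gist's parser: check_block_order is a hypothetical name, the input is the body of one block with one declaration or statement per line, and a "declaration" is simply any line that starts with one of a few type keywords.

```sh
# check_block_order: read the lines of one block body on stdin and fail
# if a declaration appears after the first statement (not valid C89).
# Hypothetical helper; recognizes only a handful of type keywords.
check_block_order() {
    phase=decl
    lineno=0
    while IFS= read -r line; do
        lineno=$((lineno + 1))
        line=${line#"${line%%[![:space:]]*}"}   # drop leading whitespace
        case $line in
            '') ;;                               # blank line: no effect
            int\ *|char\ *|short\ *|long\ *|unsigned\ *)
                if [ "$phase" = stmt ]; then
                    echo "line $lineno: declaration after statement" >&2
                    return 1
                fi ;;
            *) phase=stmt ;;                     # anything else is a statement
        esac
    done
    return 0
}
```

Feeding it "int a;", "a = 1;", "int b;" on three lines fails at line 3, while moving "int b;" up to line 2 succeeds. The phase variable only ever moves from decl to stmt, which is what makes the check a one-pass scan.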

Parsing in Shell

The POSIX shell language has no built-in regex engine; regular expressions live in external utilities such as sed, grep, and expr, which this compiler avoids. The available string-processing tools are case/esac glob pattern matching, IFS-based word splitting, ${var#prefix} and ${var%suffix} trimming, and $(( )) integer arithmetic. Together these are sufficient for a rudimentary tokenizer.
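
As an illustration of how far those primitives go, the following sketch tokenizes one line using only case patterns and prefix/suffix trimming. The function name tokenize and the token labels (IDENT, NUM, PUNCT) are hypothetical choices for this example, not the gist's.

```sh
# tokenize LINE: print one "KIND text" pair per token.
# Hypothetical sketch: identifiers, decimal integers, and single-character
# punctuation only; no strings, comments, or multi-character operators.
tokenize() {
    rest=$1
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}                 # first character of rest
        case $c in
            [[:space:]])
                rest=${rest#?} ;;             # skip whitespace
            [0-9])
                tok=${rest%%[!0-9]*}          # longest run of digits
                rest=${rest#"$tok"}
                printf 'NUM %s\n' "$tok" ;;
            [A-Za-z_])
                tok=${rest%%[!A-Za-z0-9_]*}   # longest identifier run
                rest=${rest#"$tok"}
                printf 'IDENT %s\n' "$tok" ;;
            *)
                rest=${rest#?}
                printf 'PUNCT %s\n' "$c" ;;
        esac
    done
}
```

Calling tokenize 'return 42;' prints IDENT return, NUM 42, and PUNCT ; on three lines. The ${rest%%[!0-9]*} idiom strips everything from the first non-digit onward, which leaves the leading run of digits; the same trick with a different bracket expression extracts identifiers.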

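The standard pattern is to read input line by line with while IFS= read -r line, then dispatch on token patterns using case. A C source line like return 42; can be split into words by setting IFS to include semicolons and spaces, then working through the resulting word list. IFS splitting discards the separators, so the semicolon never appears as a token of its own, and pathname expansion should be disabled with set -f so that a character like * is not expanded against the filesystem. Keywords and integer literals are matched with case patterns, and compiler state is tracked in shell variables.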
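
A minimal sketch of that read-split-dispatch loop follows. The function name classify_words, the keyword list, and the output labels are assumptions made for illustration; the gist's own tokenizer may be organized differently.

```sh
# classify_words: read C-like source on stdin, split each line on spaces
# and semicolons, and label every resulting word. Hypothetical sketch.
classify_words() {
    while IFS= read -r line; do
        set -f                  # no pathname expansion while splitting
        old_ifs=$IFS
        IFS=' ;'                # separators: space and semicolon
        set -- $line            # unquoted on purpose: triggers IFS splitting
        IFS=$old_ifs
        set +f
        for word do
            case $word in
                '') ;;                                   # empty field, e.g. from ";;"
                return|int|char|if|else|while)
                    echo "keyword $word" ;;
                *[!0-9]*)
                    echo "other $word" ;;                # contains a non-digit
                *)
                    echo "number $word" ;;               # digits only
            esac
        done
    done
}
```

Piping the line return 42; through classify_words prints keyword return and number 42. The semicolon is gone by the time the case statement runs, which is the main limitation of this style compared with a character-level scanner.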

This approach is fragile relative to production compiler requirements, but the shell compiler operates on a carefully restricted input language where the grammar is regular enough for case-based dispatch to work. Once the parser recognizes return followed by an integer expression, the code generator emits the byte sequence for mov edi, N and the syscall epilogue. The output is effectively a lookup table: one printf call sequence per recognized construct.
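
The mapping from a recognized construct to a fixed byte sequence can be sketched as follows. This is an illustration, not the gist's code generator: gen_return and byte are hypothetical names, only a literal decimal integer is accepted as the expression, and the exit code is truncated to one byte.

```sh
# byte N: write one byte with value N (0..255) via a POSIX octal escape.
byte() { printf "\\$(printf '%03o' "$1")"; }

# gen_return STMT: if STMT has the form "return N;", write the 12 bytes of
# x86-64 code for exit(N) to stdout; otherwise fail with a message.
gen_return() {
    stmt=$1
    case $stmt in
        return\ *\;)
            n=${stmt#return }       # strip the keyword
            n=${n%\;}               # strip the trailing semicolon
            case $n in
                ''|*[!0-9]*)
                    echo "unsupported expression: $n" >&2
                    return 1 ;;
            esac
            printf '\270\074\000\000\000'   # b8 3c 00 00 00  mov eax, 60
            printf '\277'                   # bf              mov edi, imm32
            byte $(( n & 255 ))             # low byte of N
            printf '\000\000\000'           # upper three bytes of imm32
            printf '\017\005'               # 0f 05           syscall
            ;;
        *)
            echo "unrecognized statement: $stmt" >&2
            return 1 ;;
    esac
}
```

Running gen_return 'return 42;' | od -An -v -tx1 shows the bytes b8 3c 00 00 00 bf 2a 00 00 00 0f 05. Prepending the 120 bytes of ELF and program headers to this output yields a runnable file.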

Code generation is, perhaps counterintuitively, the part of a compiler that shell handles most naturally. Pattern recognition maps to case, and each recognized pattern maps to a fixed sequence of bytes, emitted via printf.

Minimal C Compilers: Prior Art

Minimal C compilers have a long tradition. Fabrice Bellard’s TCC (Tiny C Compiler) implements most of C99 in a few tens of thousands of lines of C, uses a single-pass design with no AST or intermediate representation, and compiles itself in under a second. Rui Ueyama’s chibicc implements a large portion of C11 in a codebase of several thousand lines, structured so that each Git commit adds one small feature, and was written as the reference implementation for a book on compiler construction. TCC generates x86-64 machine code and ELF output directly; chibicc emits assembly text and relies on an external assembler and linker.

Every compiler in this tradition is written in C, which means every one of them requires an existing C compiler to build. The shell compiler breaks that dependency. Its only requirement is a POSIX shell binary, which is typically smaller and simpler than a C compiler. dash, the default /bin/sh on Debian-based Linux systems, is a binary on the order of 100 KB, far smaller than mainstream compilers such as GCC or Clang.

The Bootstrapping Problem

Ken Thompson’s 1984 Turing Award lecture, Reflections on Trusting Trust, is the definitive treatment of why the bottom of the tool chain matters. Thompson demonstrated that a compiler binary can insert malicious code into every program it compiles, including new versions of itself, in a way that persists even after the source code is audited and cleaned. The only defense is to trace your trust to a binary you know was built cleanly, which requires an auditable chain all the way down to the hardware.

The bootstrappable builds project builds that chain explicitly. The goal is a complete, auditable path from a minimal binary seed to a full software stack. The canonical chain begins from hex0, a program of only a few hundred bytes that reads ASCII hex digits and writes raw bytes; it is simple enough to verify by inspection. From hex0, progressively more capable assemblers and compilers are built until full GCC can be compiled from source.

A shell-based C compiler fits naturally into this kind of chain. Replacing a C compiler binary with a shell script means the trusted seed needs to include only a POSIX shell, not a compiled C compiler. Shell source is also more auditable line by line than compiled binary code. The live-bootstrap project pursues a closely related goal, tracing a complete path from a small binary seed to a full Linux userland.

What the Gist Demonstrates

The alganet gist does not implement all of C89. It does not handle structs, pointer arithmetic, function calls, or the preprocessor. What it demonstrates is that the core loop of compilation, from source token to binary byte, is expressible in the most restricted portable programming environment available on Unix systems, and that such an implementation has genuine relevance to the bootstrapping problem rather than being purely a curiosity.

The ELF64 format is accessible enough to generate valid executables without a linker. C89’s syntax is regular enough that case-based tokenization covers meaningful programs. And printf with octal escapes is a portable binary writer, available on every system with a POSIX shell.

Each of these techniques is individually known; what the gist contributes is assembling them into something that compiles and runs, without reaching for awk, python, or any other runtime. The boundary between scripting environment and systems tool is, in practice, a matter of what you choose to build with the primitives you have.