A Login Shell Without a Runtime: What Assembly Reveals About the Unix Process Contract
Source: lobsters
Writing a login shell in assembly sounds like it belongs in the same category as implementing your own memory allocator or writing a TLS stack from scratch: technically possible, practically unreasonable. Geir Isene’s writeup on exactly that is a useful corrective to that instinct. What the exercise reveals is not that assembly is a sensible tool for this task, but that the Linux kernel’s side of the login contract is far smaller than the toolchain it normally gets buried under.
What the Kernel Actually Requires
Before getting into assembly mechanics, it is worth being precise about what the system actually requires from a login shell, because the requirements are minimal to the point of being surprising.
When login(1) (usually started by getty) or sshd hands control to a user’s shell, it reads the shell path from the seventh field of the user’s /etc/passwd entry and calls execve with argv[0] set to a string beginning with -. That leading hyphen is the long-standing Unix convention by which a shell detects that it is running in a login context and should source /etc/profile and ~/.profile. By the time execve is called, PAM session modules have already run: pam_limits.so has applied resource limits, pam_env.so has set environment variables, and the process has dropped to the user’s uid and gid. The shell inherits a fully prepared environment. It does not need to authenticate anyone, manage PAM, or initialize a session.
What the system requires of the thing that gets exec’d comes down to three points: the kernel must be able to run it (a valid ELF executable or a #! script), the passwd entry must name its path (which tools such as chsh, and pam_shells where it is enabled, expect to find in /etc/shells), and it must not exit immediately. Nothing in the kernel or the login process checks that it implements POSIX shell semantics, supports job control, understands redirection, or sources profile files. A binary that reads a line from stdin, execs it as a command, and loops is a valid login shell in every sense the system enforces.
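The hyphen convention is small enough to sketch in C. Both helpers below are hypothetical names, not taken from login(1) or from any real shell, and using the basename of the shell path after the hyphen is an assumption about common practice rather than something the contract requires.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the argv[0] a login program might pass to
 * execve. Assumed convention: '-' followed by the basename of the shell. */
int make_login_argv0(const char *shell_path, char *out, size_t out_len)
{
    const char *base = strrchr(shell_path, '/');
    base = base ? base + 1 : shell_path;
    int n = snprintf(out, out_len, "-%s", base);
    /* Fail if snprintf errored or the result was truncated. */
    return (n > 0 && (size_t)n < out_len) ? 0 : -1;
}

/* The shell's side: a leading '-' in argv[0] signals login context. */
int is_login_shell(const char *argv0)
{
    return argv0 != NULL && argv0[0] == '-';
}
```

A shell would call is_login_shell(argv[0]) once at startup and decide whether to read profile files; nothing else in the system depends on the answer.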
The Syscall Loop
The minimal shell loop translates directly to a small set of Linux system calls. The x86-64 kernel interface is direct: load the syscall number into rax, load arguments into rdi, rsi, rdx, r10, r8, and r9, execute the syscall instruction, and read the return value from rax, where a value from -4095 to -1 is a negated errno. The instruction itself overwrites rcx and r11, so nothing that must survive the call can live there. For a shell that reads one command at a time and runs it, the full syscall vocabulary is:
sys_read = 0
sys_write = 1
sys_fork = 57
sys_execve = 59
sys_wait4 = 61
sys_exit_group = 231
sys_chdir = 80
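The same numbers are visible from C through the syscall(2) wrapper, which loads the registers and executes the instruction on the caller's behalf. The sketch below is illustrative only: raw_write and raw_read are hypothetical names, and unlike the bare instruction, syscall(2) reports failure as -1 with errno set rather than as a negative value in rax.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* SYS_write and SYS_read come from <sys/syscall.h>; on x86-64 Linux they
 * expand to 1 and 0, matching the table above. */
ssize_t raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}

ssize_t raw_read(int fd, void *buf, size_t len)
{
    return syscall(SYS_read, fd, buf, len);
}
```

Calling raw_write(1, "$ ", 2) performs the same operation as the prompt-writing step of an assembly shell, minus the hand-loaded registers.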
The core loop in NASM notation:
.loop:
; write prompt to stdout
mov rax, 1 ; sys_write
mov rdi, 1 ; fd = stdout
mov rsi, prompt ; "$ "
mov rdx, 2
syscall
; read input line
mov rax, 0 ; sys_read
mov rdi, 0 ; fd = stdin
mov rsi, buf
mov rdx, 255
syscall
test rax, rax
jle .exit ; EOF or error: exit cleanly
; fork
mov rax, 57 ; sys_fork
syscall
test rax, rax
jz .child ; child: rax == 0
js .loop ; rax < 0: fork failed, prompt again
; parent: wait for child to finish
mov rdi, rax ; child pid
mov rax, 61 ; sys_wait4
xor rsi, rsi ; *wstatus = NULL
xor rdx, rdx ; options = 0
xor r10, r10 ; *rusage = NULL
syscall
jmp .loop
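.child:
; tokenize buf into argv (elided here), then exec
; argv and envp are assumed labels: the pointer array the tokenizer fills,
; and a saved copy of the envp pointer found on the stack at startup
mov rdi, [argv] ; pathname = argv[0]
mov rsi, argv ; NULL-terminated argv array
mov rdx, [envp] ; environment inherited from login
mov rax, 59 ; sys_execve
syscall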
; exec only returns on failure
mov rax, 231 ; sys_exit_group
mov rdi, 127 ; 127 = command not found
syscall
.exit:
mov rax, 231
xor rdi, rdi
syscall
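For readers more comfortable in C, the fork, execve, and wait4 path of the loop corresponds roughly to the sketch below. run_command is a hypothetical name; unlike the assembly, it passes a real status pointer to wait4 so that the child's exit code can be reported.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper mirroring the fork / execve / wait4 path.
 * Returns the child's exit status, 128 + signal number if it was killed,
 * or -1 if fork or wait4 failed. argv[0] must be a path; no PATH search. */
int run_command(char *const argv[], char *const envp[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */
    if (pid == 0) {
        execve(argv[0], argv, envp);
        _exit(127);                     /* execve only returns on failure */
    }
    int status = 0;
    if (wait4(pid, &status, 0, NULL) < 0)
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

The 127 in the child is the same "command not found" convention the assembly uses after a failed execve.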
The tokenization step, splitting the input buffer into a null-terminated argv array and locating the executable path, is where most of the actual code lives. It requires pointer arithmetic and string scanning, but all of it is straightforward register manipulation. There is no string library, no heap allocation, and no runtime to initialize. The buffer goes on the stack or in a BSS region; the argv array is a handful of pointers into that same buffer with null bytes written over the whitespace separators.
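The in-place scheme described here can be sketched in C as follows. tokenize is a hypothetical name and this is one possible reading of the technique, not the code from the writeup; it assumes the caller has already written a terminating NUL after the bytes returned by read.

```c
#include <stddef.h>

/* Hypothetical in-place tokenizer: overwrites whitespace separators with
 * NUL bytes and stores pointers to the start of each word in argv.
 * Returns the word count; argv[count] is always NULL.
 * Keeps at most max_args - 1 words; max_args must be at least 1. */
int tokenize(char *buf, char **argv, int max_args)
{
    int argc = 0;
    char *p = buf;

    while (*p != '\0') {
        /* Turn each separator into a terminator for the previous word. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            *p++ = '\0';
        if (*p == '\0' || argc == max_args - 1)
            break;
        argv[argc++] = p;               /* pointer into the same buffer */
        while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
    }
    argv[argc] = NULL;
    return argc;
}
```

No memory is allocated and nothing is copied: every argv entry points back into the input buffer, which is what makes the same logic practical with a handful of registers.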
The ELF Wrapper
The binary itself needs very little structure. As explored previously in the context of a shell-based C compiler, a statically linked executable that makes direct syscalls requires only an ELF file header (64 bytes) and a single PT_LOAD program header (56 bytes) before the machine code. No sections, no symbol table, no dynamic linker, no libc initialization routines. Total structural overhead: 120 bytes.
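The 120-byte figure can be checked against the structure definitions in <elf.h>. The sketch below fills both headers for a file whose code begins immediately after them; fill_min_elf is a hypothetical helper, and the read-and-execute-only segment assumes the shell keeps its buffers on the stack rather than in a writable data region.

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill the ELF header and one PT_LOAD program header
 * for a static x86-64 executable. base is the page-aligned load address;
 * code_size is the number of machine-code bytes following the headers. */
void fill_min_elf(Elf64_Ehdr *eh, Elf64_Phdr *ph,
                  uint64_t base, uint64_t code_size)
{
    const uint64_t hdrs = sizeof *eh + sizeof *ph;  /* 64 + 56 = 120 */

    memset(eh, 0, sizeof *eh);
    memcpy(eh->e_ident, ELFMAG, SELFMAG);
    eh->e_ident[EI_CLASS] = ELFCLASS64;
    eh->e_ident[EI_DATA] = ELFDATA2LSB;
    eh->e_ident[EI_VERSION] = EV_CURRENT;
    eh->e_type = ET_EXEC;
    eh->e_machine = EM_X86_64;
    eh->e_version = EV_CURRENT;
    eh->e_entry = base + hdrs;          /* first instruction after headers */
    eh->e_phoff = sizeof *eh;           /* program header follows directly */
    eh->e_ehsize = sizeof *eh;
    eh->e_phentsize = sizeof *ph;
    eh->e_phnum = 1;

    memset(ph, 0, sizeof *ph);
    ph->p_type = PT_LOAD;
    ph->p_flags = PF_R | PF_X;          /* assumption: no writable segment */
    ph->p_offset = 0;                   /* map the file from its start */
    ph->p_vaddr = base;
    ph->p_paddr = base;
    ph->p_filesz = hdrs + code_size;
    ph->p_memsz = hdrs + code_size;
    ph->p_align = 0x1000;
}
```

Writing the two structures followed by the code bytes to a file and marking it executable is, in principle, all the linking a binary like this needs.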
When the kernel execs this binary, it maps the PT_LOAD segment into memory, sets the instruction pointer to the entry address in the ELF header, and begins execution. There is no ld.so, no libc startup code running ahead of main, and no global constructors to run. The instruction at the entry address is the first thing the CPU executes in user space. For a minimal login shell, the entire binary could plausibly fit under 2KB, compared to dash at roughly 100KB stripped.
The startup time follows from this structure. As rough, machine-dependent figures, dash starts in about a millisecond on modern hardware and bash with a typical configuration takes around five milliseconds. A login shell with no dynamic linker traversal, no C runtime initialization, and no profile sourcing could plausibly start in tens of microseconds, at which point the kernel’s own exec work, not the program, dominates.
The Oldest Lineage
There is something cyclical about this exercise. The original Unix shell was written by Ken Thompson in assembly, first for the PDP-7 and then, by the First Edition of 1971, for the PDP-11. It was minimal by necessity: no variables, no scripting features, just command execution with simple I/O redirection; pipes arrived in 1973. Once C became viable as a systems language, the shell was rewritten in C, and the mainstream shells in that lineage (the Bourne shell, ksh, bash, dash, zsh) have been written in C ever since.
Returning to assembly in 2026 is a deliberate excavation of the layer that was always underneath. The kernel interface has not changed in its essentials: fork, exec, and wait are the same three operations Thompson’s shell performed, expressed through the x86-64 syscall instruction rather than the system call entry mechanisms of the PDP-7 and PDP-11. The register-passing convention is different, the addressing modes are different, but the model is identical.
The Trust Dimension
A login shell in assembly has a secondary property that does not come up often. A login shell compiled from C depends on the C compiler that produced it, and as Ken Thompson showed in his Turing Award lecture, “Reflections on Trusting Trust” (published in 1984), a compiler binary can silently alter the programs it compiles. Thompson’s canonical attack targeted the login program specifically.
An assembly source file, translated by NASM or GAS, maps almost one-to-one from source to binary output. The transformation is local: each instruction mnemonic corresponds to a known byte encoding, and the encoding tables are published and verifiable. The gap between what you read and what runs is narrow enough that a careful person could verify the binary against the source by hand. The bootstrappable builds project has been working down exactly this trust chain, and a login shell in assembly sits naturally at the bottom of it.
The cd Constraint
The one constraint an assembly shell shares with every shell implementation is that cd cannot be a child process. The chdir(2) syscall changes the working directory of the calling process. If you fork a child and call chdir in the child, the parent’s directory is unaffected. The user types cd /tmp and the prompt returns with the same working directory.
This is a correctness requirement from the Unix process model, not a performance consideration, as covered in detail previously. The same logic applies to any operation that mutates inherited process state: signal masks, open file descriptors, environment variables. In an assembly shell, handling cd means checking the first token of the input buffer before the fork, comparing it against the three bytes of cd\0, and branching to a sys_chdir call in the shell process rather than the fork path. That requires a few dozen extra instructions but no architectural change to the loop.
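The constraint can be demonstrated in a few lines of C. In this sketch, try_cd is a hypothetical builtin check that runs before any fork, and chdir_in_child exists only to show the version that does not work; neither name comes from a real shell.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical builtin check, run before fork. Returns 1 if argv was a cd
 * command handled in this process, -1 if it was cd but chdir failed, and
 * 0 if the caller should take the normal fork and exec path. */
int try_cd(char *const argv[])
{
    if (argv[0] == NULL || strcmp(argv[0], "cd") != 0)
        return 0;
    if (argv[1] != NULL && chdir(argv[1]) != 0)
        return -1;
    return 1;
}

/* For contrast: chdir inside a forked child. The child's directory changes
 * and is discarded when it exits; the caller's directory is untouched.
 * Returns 0 if the child's chdir succeeded, 1 if it failed, -1 on error. */
int chdir_in_child(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(chdir(path) == 0 ? 0 : 1);
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A shell loop would call try_cd on the tokenized line first and only fork when it returns 0.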
What the Structure Exposes
Production shells like dash (~15,000 lines of C) and bash (~130,000 lines) have accumulated decades of correctness requirements: POSIX expansion order, job control semantics with process group management, signal disposition inheritance, terminal ownership transfers with tcsetpgrp, vfork optimization paths, hash tables for command lookup, and dozens of builtin commands. That complexity is real and serves real users.
Underneath all of it is the same read, fork, execve, wait loop that an assembly shell exposes directly. The assembly implementation does not produce something practical; it produces something legible. Every instruction corresponds to a specific kernel operation with no intermediary between intent and interface. When the shell reads input, the sys_read call is there in the source. When it forks, the process split is explicit. When it waits, the parent suspension is visible.
The login shell in assembly is, in the end, a proof that the contract Linux enforces on a login shell is small enough to satisfy with a few hundred instructions and no runtime at all. The larger shells are not more correct at the system interface level; they are more correct in the behaviors they expose to users on top of that interface. The interface itself has always been this narrow.