
What an Assembler Actually Does When It Reads Your Code

Source: lobsters

Most programmers have a vague mental model of assembly: you write mnemonics, the assembler turns them into bytes, the linker stitches the pieces together. That model is accurate but shallow. Once you actually read through a working assembler, as Brian Callahan’s demystifying assemblers post encourages you to do, you find a set of concrete engineering decisions that are each interesting on their own terms.

This is a walkthrough of those decisions: the two-pass structure that nearly every assembler uses, what symbol tables actually contain, how forward references get resolved, and what the output object file must carry so the linker can finish the job.

The Input Problem

Assembly source is a sequence of statements. Some are instructions. Some are directives. Some are label definitions. The processor only knows about instructions; everything else exists to help the assembler produce the right bytes in the right places.

Consider a simple x86-64 snippet:

section .text
global _start

_start:
    mov rax, 60
    mov rdi, 0
    syscall

message:
    db "hello", 0

The assembler has to figure out the numeric address of message before it can encode any instruction that references it. If message appears after the instruction that references it, the assembler cannot know the address on a first read. This is the forward reference problem, and it is the reason assemblers are almost universally designed as two-pass systems.

Pass One: Building the Symbol Table

The first pass makes no attempt to emit final machine code. It walks the source sequentially and tracks one thing: the location counter, often written as $ or ., which represents the current offset within the current section. Every time the assembler encounters a label definition, it records the label name and the current value of the location counter in a symbol table. Every time it encounters an instruction or data directive, it advances the location counter by the encoded size of that instruction or datum.

For instructions, computing the size during pass one is usually straightforward: once the assembler has chosen an encoding form, the instruction's length is fixed by the opcode and operand types, not by the value of any symbol it references. An x86-64 mov rax, imm64 is always 10 bytes regardless of what the immediate is. A jmp rel32 is always 5 bytes. The assembler can advance the location counter correctly without resolving any symbol. The exceptions are instructions with more than one legal encoding, such as short and near jumps, which get their own section below.

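By the end of pass one, the symbol table maps every label to a known offset. message in the example above has a definite value: offset 16 from the start of .text if each mov is encoded in its 7-byte sign-extended imm32 form and syscall takes 2 bytes (7 + 7 + 2). An assembler that picks a shorter mov encoding arrives at a smaller offset, but the bookkeeping is the same.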
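The pass-one bookkeeping described above can be sketched in a few lines of Python. This is a toy, not how any real assembler is written: the SIZES table, the pass_one and db_size helpers, and the naive comma splitting are all assumptions made for illustration, and the 7-byte mov size is just one of several legal encodings.

```python
# Hypothetical per-mnemonic sizes in bytes. A real assembler derives the
# size from the encoding form it selects for each operand combination.
SIZES = {"mov": 7, "syscall": 2}

# Directives that do not occupy space in the output.
NO_SPACE = {"section", "global"}


def db_size(operands):
    """Size of a db directive: string length plus one byte per number.

    Naive on purpose: strings containing commas are not handled.
    """
    total = 0
    for item in operands.split(","):
        item = item.strip()
        if item.startswith('"') and item.endswith('"'):
            total += len(item) - 2
        else:
            total += 1
    return total


def pass_one(lines):
    """Walk the source once and return {label: offset} for one section."""
    symbols = {}
    loc = 0  # the location counter
    for line in lines:
        text = line.split(";", 1)[0].strip()  # drop comments and whitespace
        if not text:
            continue
        if text.endswith(":"):
            symbols[text[:-1]] = loc  # label: record the current offset
            continue
        mnemonic, _, rest = text.partition(" ")
        if mnemonic in NO_SPACE:
            continue
        loc += db_size(rest) if mnemonic == "db" else SIZES[mnemonic]
    return symbols


SOURCE = [
    "section .text",
    "global _start",
    "",
    "_start:",
    "    mov rax, 60",
    "    mov rdi, 0",
    "    syscall",
    "",
    "message:",
    '    db "hello", 0',
]

SYMBOLS = pass_one(SOURCE)
```

No bytes are emitted here; the only product is the table of label offsets that pass two will consult.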

Pass Two: Emitting Code

Pass two reads the source again and now emits bytes. When it encounters an instruction that references a label, it looks the label up in the symbol table built during pass one and substitutes the numeric value.

For x86-64 specifically, register and memory operands are encoded through the ModRM byte (plus an optional SIB byte), which packs register numbers and addressing modes into a compact format. A 64-bit mov between a register and memory uses a REX.W prefix, a one-byte opcode, and a ModRM byte that identifies the register and the addressing mode. For a RIP-relative load such as mov rax, [rel label] (NASM syntax; Gas writes label(%rip)), the assembler encodes the displacement as a 32-bit signed integer in the last four bytes of the instruction.

Here the pass two calculation is:

displacement = target_address - (instruction_end_address)

RIP-relative addressing in x86-64 computes addresses relative to the instruction pointer after the instruction has been fetched, which is instruction_start + instruction_size. The assembler knows both quantities during pass two and writes the correct 32-bit displacement into the instruction stream.
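As a concrete instance of that calculation, here is a minimal Python sketch that encodes a RIP-relative 64-bit load into rax. The function name encode_mov_rax_riprel is made up for this example; the byte values are the standard encoding (REX.W prefix 0x48, opcode 0x8B, ModRM 0x05 for mod=00, reg=rax, rm=101), followed by the little-endian disp32.

```python
import struct


def encode_mov_rax_riprel(instr_addr, target_addr):
    """Encode a 64-bit load of rax from a RIP-relative address.

    Layout: REX.W (0x48), opcode 0x8B, ModRM 0x05, then a signed
    32-bit little-endian displacement. Total size: 7 bytes.
    """
    size = 7
    # The displacement is measured from the end of the instruction.
    disp = target_addr - (instr_addr + size)
    return b"\x48\x8b\x05" + struct.pack("<i", disp)


FORWARD = encode_mov_rax_riprel(0x00, 0x20)   # target after the instruction
BACKWARD = encode_mov_rax_riprel(0x20, 0x00)  # target before the instruction
```

struct.pack raises an error if the displacement does not fit in 32 bits, which stands in for the range check a real assembler would report as a diagnostic.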

Forward References in Jumps

Jumps are where forward references most visibly complicate things. Consider:

    cmp rax, 0
    jz  .done
    ; ... some work ...
.done:
    ret

During pass one the assembler notes .done at whatever offset it falls. During pass two, when encoding jz .done, it computes the displacement and writes a jz rel8 (2 bytes) if the displacement, measured from the end of the jump, fits in a signed byte (-128 to +127), or a jz rel32 (6 bytes) if not.
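The size choice can be written out directly. The sketch below is in Python, with encode_jz as a hypothetical helper name; the opcodes are the standard ones (0x74 for jz rel8, 0x0F 0x84 for jz rel32). It assumes the jump's own address and the target address are already known.

```python
import struct


def encode_jz(instr_addr, target_addr):
    """Pick the shortest jz encoding that reaches target_addr."""
    # Try the 2-byte form first: opcode 0x74 plus a signed 8-bit displacement.
    disp8 = target_addr - (instr_addr + 2)
    if -128 <= disp8 <= 127:
        return b"\x74" + struct.pack("<b", disp8)
    # Fall back to the 6-byte form: 0x0F 0x84 plus a signed 32-bit displacement.
    disp32 = target_addr - (instr_addr + 6)
    return b"\x0f\x84" + struct.pack("<i", disp32)


SHORT_FWD = encode_jz(0x00, 0x0A)
SHORT_BACK = encode_jz(0x10, 0x00)
NEAR_FWD = encode_jz(0x00, 0x200)
```

The displacement is computed separately for each form because it is relative to the end of the instruction, and the two forms end at different addresses.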

This creates a subtle problem: the size of the jump instruction affects the location counter, which affects the addresses of every subsequent label, which affects the displacements of other jumps. Some assemblers resolve this with a fixpoint iteration: keep re-computing sizes until no instruction size changes. Others default to the larger encoding on a first pass to avoid the possibility of needing to grow instructions after addresses have been committed. NASM's behavior depends on its optimization level: with -O0 it takes the conservative approach and emits jmp rel32 for a forward jump unless you annotate it as jmp short, while the multi-pass optimizer that has been the default since NASM 2.09 (-Ox) picks the short form whenever the target is in range.

The NASM manual's description of the -O option documents this behavior explicitly. Gas, the GNU Assembler, uses a relaxation algorithm that starts each jump in its shortest form and widens the ones whose targets turn out to be out of range, iterating to convergence.
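A fixpoint iteration of this kind is short enough to sketch. The Python below is an illustration under simplifying assumptions, not the algorithm of any particular assembler: the program is a flat list of hypothetical ("label", name), ("jmp", target), and ("raw", nbytes) items, only unconditional jumps are modeled, and jumps start short and are only ever widened, which guarantees termination.

```python
SHORT, NEAR = 2, 5  # sizes in bytes of jmp rel8 and jmp rel32


def relax(items):
    """Return (sizes, labels) once no jump needs to grow any further."""
    sizes = []
    for kind, arg in items:
        if kind == "jmp":
            sizes.append(SHORT)  # optimistic starting guess
        elif kind == "raw":
            sizes.append(arg)    # fixed-size instructions or data
        else:
            sizes.append(0)      # labels occupy no space
    while True:
        # Lay out addresses using the current size guesses.
        addrs, labels, loc = [], {}, 0
        for (kind, arg), size in zip(items, sizes):
            addrs.append(loc)
            if kind == "label":
                labels[arg] = loc
            loc += size
        # Widen every short jump whose displacement no longer fits in rel8.
        changed = False
        for i, (kind, arg) in enumerate(items):
            if kind == "jmp" and sizes[i] == SHORT:
                disp = labels[arg] - (addrs[i] + SHORT)
                if not -128 <= disp <= 127:
                    sizes[i] = NEAR
                    changed = True
        if not changed:
            return sizes, labels


# Widening the second jump pushes label "t" out of reach of the first one.
PROGRAM = [
    ("jmp", "t"),
    ("jmp", "far"),
    ("raw", 123),
    ("label", "t"),
    ("raw", 200),
    ("label", "far"),
]
SIZES_OUT, LABELS_OUT = relax(PROGRAM)
```

The example program needs two rounds of widening: the jump to far grows first, and the three bytes it gains push t out of rel8 range for the jump before it.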

Object Files and Relocations

When you assemble a file that references symbols defined in another file, pass two cannot fill in the final address at all. The symbol simply is not in the symbol table. The assembler’s response is to emit a relocation entry instead of a resolved address.

An ELF object file carries a .rela.text section (for x86-64, which uses RELA format with an explicit addend rather than REL which embeds the addend in the instruction stream). Each entry in this table describes one location in the text that needs patching:

typedef struct {
    Elf64_Addr  r_offset;   /* byte offset in section where patch goes */
    Elf64_Xword r_info;     /* symbol table index | relocation type */
    Elf64_Sxword r_addend;  /* constant to add after resolving symbol */
} Elf64_Rela;
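To make the layout concrete, the following Python sketch packs and unpacks one such entry as it appears in a little-endian ELF64 file. The helper names make_rela and parse_rela are invented for this example; the split of r_info into a symbol index (upper 32 bits) and a relocation type (lower 32 bits) follows the ELF64 definition.

```python
import struct

# Elf64_Rela on a little-endian target: two unsigned 64-bit fields followed
# by one signed 64-bit field, 24 bytes in total.
RELA = struct.Struct("<QQq")


def make_rela(offset, sym_index, rel_type, addend):
    """Pack one relocation entry; r_info combines symbol index and type."""
    r_info = (sym_index << 32) | (rel_type & 0xFFFFFFFF)
    return RELA.pack(offset, r_info, addend)


def parse_rela(raw):
    """Unpack one 24-byte relocation entry into its logical fields."""
    r_offset, r_info, r_addend = RELA.unpack(raw)
    return {
        "offset": r_offset,
        "sym": r_info >> 32,          # index into the symbol table
        "type": r_info & 0xFFFFFFFF,  # e.g. 2 for R_X86_64_PC32
        "addend": r_addend,
    }


ENTRY = parse_rela(make_rela(0x10, 3, 2, -4))
```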

The relocation type field encodes the formula the linker must apply. For R_X86_64_PC32, the formula is S + A - P: the symbol value S plus the addend A minus the patch location P. This is the RIP-relative displacement formula above, generalized: P is the address of the 32-bit field rather than the end of the instruction, so the addend (typically -4 when the field is the last four bytes of the instruction) makes up the difference. For R_X86_64_64, the formula is just S + A, used when the full 64-bit absolute address needs to be embedded, typically in data sections or in position-dependent code.

The assembler writes a placeholder zero (or sometimes a partial value) at r_offset in the text and records the relocation. The linker reads all participating object files, assigns final addresses to all sections, resolves all symbol references against those addresses, and applies each relocation formula to patch the instruction stream.
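The linker's patching step can be sketched for the two relocation types discussed here. This Python is a simplified illustration: apply_rela and its parameters are hypothetical names, the section is modeled as a bytearray with a known final address, and only R_X86_64_64 (type 1) and R_X86_64_PC32 (type 2) are handled.

```python
import struct

R_X86_64_64 = 1    # S + A, written as 64 bits
R_X86_64_PC32 = 2  # S + A - P, written as signed 32 bits


def apply_rela(section, section_addr, r_offset, r_type, sym_value, addend):
    """Patch one relocation in place inside a bytearray."""
    patch_addr = section_addr + r_offset  # P: final address of the field
    if r_type == R_X86_64_PC32:
        value = sym_value + addend - patch_addr
        struct.pack_into("<i", section, r_offset, value)
    elif r_type == R_X86_64_64:
        value = (sym_value + addend) & 0xFFFFFFFFFFFFFFFF
        struct.pack_into("<Q", section, r_offset, value)
    else:
        raise ValueError(f"unsupported relocation type {r_type}")


# A call rel32 (opcode 0xE8) with a zero placeholder, as the assembler left it.
TEXT = bytearray(b"\xe8\x00\x00\x00\x00")
apply_rela(TEXT, 0x401000, 1, R_X86_64_PC32, 0x401020, -4)

# An 8-byte data slot that should hold an absolute address.
DATA = bytearray(8)
apply_rela(DATA, 0x402000, 0, R_X86_64_64, 0x401020, 0)
```

In the call example the field sits at 0x401001, so S + A - P is 0x401020 - 4 - 0x401001 = 0x1B, which is the distance from the end of the call (0x401005) to the target.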

This division of labor is the reason assemblers and linkers are separate programs. The assembler handles encoding; the linker handles address layout. Merging them into a single tool would require knowing the complete set of input files before encoding any instruction, which destroys incremental compilation.

The Symbol Table in the Object File

The ELF symbol table (SHT_SYMTAB) is separate from the relocation table. Each entry gives a symbol’s name, its binding (local, global, or weak), its type (function, object, section, file), the section it belongs to, and its value (offset within that section).

For an undefined symbol, the section index is SHN_UNDEF. The linker uses this to flag symbols that must be resolved by searching other object files and libraries. If the linker cannot find a definition, it emits the familiar "undefined reference to" error.

For a global symbol defined in the file, the entry records the section and offset. The linker computes the final virtual address once it knows where each section lands in the address space, then patches all relocation entries that refer to that symbol.
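The fields listed above map onto a 24-byte Elf64_Sym record. The Python sketch below decodes one little-endian entry; parse_sym and the lookup tables are names invented for this example, while the field order and the split of st_info into binding (upper four bits) and type (lower four bits) follow the ELF64 definition.

```python
import struct

# Elf64_Sym: st_name (u32), st_info (u8), st_other (u8), st_shndx (u16),
# st_value (u64), st_size (u64).
SYM = struct.Struct("<IBBHQQ")

SHN_UNDEF = 0
BINDINGS = {0: "local", 1: "global", 2: "weak"}
TYPES = {0: "notype", 1: "object", 2: "func", 3: "section", 4: "file"}


def parse_sym(raw):
    """Decode one symbol table entry into readable fields."""
    st_name, st_info, _other, st_shndx, st_value, st_size = SYM.unpack(raw)
    binding, sym_type = st_info >> 4, st_info & 0xF
    return {
        "name_offset": st_name,  # offset into the string table, not text
        "binding": BINDINGS.get(binding, binding),
        "type": TYPES.get(sym_type, sym_type),
        "shndx": st_shndx,
        "undefined": st_shndx == SHN_UNDEF,
        "value": st_value,       # offset within the section, in an object file
        "size": st_size,
    }


# A global function defined at offset 0x40 of section 1, 12 bytes long.
DEFINED = parse_sym(SYM.pack(5, (1 << 4) | 2, 0, 1, 0x40, 12))
# A global symbol this file only references.
EXTERNAL = parse_sym(SYM.pack(9, (1 << 4) | 0, 0, SHN_UNDEF, 0, 0))
```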

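Local symbols come in two flavors. Labels prefixed with .L in Gas are assembler-temporary: by default they are never written to the object file's symbol table (pass -L or --keep-locals to keep them), and they exist only to help the assembler resolve section-relative references during its own two-pass algorithm. Symbols with local binding (STB_LOCAL), such as functions declared static in C, are recorded in the object file and survive into the linked binary's symbol table until it is stripped, for example with strip or the linker's -s option.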

Where Assemblers Get Interesting

The two-pass model with relocation is the textbook answer, but real assemblers accumulate interesting complications around the edges.

Gas’s relaxation pass is one example. Another is macro expansion: assembler macros can expand to variable-length instruction sequences depending on their arguments, which means the location counter cannot advance by a fixed amount per macro invocation during pass one. Gas and NASM handle this by fully expanding macros during pass one while tracking sizes.

The handling of .align and .balign directives introduces padding bytes whose count depends on the current location counter value, which in turn depends on instruction sizes that might not be final yet in a relaxation-based assembler. Getting padding right across relaxation iterations requires care.
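The padding count itself is a one-line calculation once the location counter is known. A minimal Python sketch, assuming power-of-two alignments (align_padding is a hypothetical helper name):

```python
def align_padding(loc, alignment):
    """Bytes of padding needed to round loc up to a multiple of alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (-loc) % alignment


PAD_AT_13 = align_padding(13, 16)  # location counter 13, align to 16
PAD_AT_16 = align_padding(16, 16)  # already aligned
```

Because the result depends on loc, a relaxation-based assembler has to recompute it on every iteration: growing one jump by three bytes can change the padding at the next alignment directive by any amount from zero up to alignment minus one.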

Section switching is another wrinkle. A single assembly file can switch between .text, .data, .rodata, and .bss multiple times. Each section has its own independent location counter, and label values carry both a section tag and an offset. The assembler must track multiple counters simultaneously and tag each symbol with its section so that relocation entries can correctly identify which section’s base address to use.
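One way to picture this bookkeeping is a small tracker with one counter per section. The Python below is a sketch under the assumption that sections are identified by name only; the SectionTracker class and its method names are invented for the example.

```python
from collections import defaultdict


class SectionTracker:
    """Keep an independent location counter for each section."""

    def __init__(self, initial=".text"):
        self.counters = defaultdict(int)  # section name -> current offset
        self.current = initial
        self.symbols = {}                 # label -> (section, offset)

    def switch(self, section):
        """Handle a section directive; the old counter keeps its value."""
        self.current = section

    def define(self, label):
        """Record a label tagged with its section and offset."""
        self.symbols[label] = (self.current, self.counters[self.current])

    def emit(self, nbytes):
        """Advance only the current section's counter."""
        self.counters[self.current] += nbytes


tracker = SectionTracker()
tracker.emit(7)           # 7 bytes of code in .text
tracker.switch(".data")
tracker.define("buf")
tracker.emit(8)
tracker.switch(".text")   # resumes at offset 7, not 0
tracker.define("next")
tracker.emit(2)
tracker.switch(".data")   # resumes at offset 8
tracker.define("buf2")
```

Storing the section alongside the offset is what lets a later relocation say "offset 8 of .data" without knowing where .data will finally be placed.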

The Practical Payoff

None of this is purely academic. If you write hand-rolled assembly in a project, understanding the two-pass model tells you why certain forward references work while others require you to annotate the jump size. Understanding relocation entries tells you what the linker map file is actually reporting and how to diagnose address conflicts or unexpected segment sizes.

For systems programmers in Rust or C who occasionally drop into inline assembly, the same model explains what happens after the compiler is done with the block: the compiler substitutes concrete registers, memory operands, and immediates for the operand placeholders, then hands the resulting text to an assembler. Any labels inside the block and any references to external symbols go through the same location counter, symbol table, and relocation machinery described above.

The full mechanics are laid out clearly in the System V ABI for x86-64, which specifies both the relocation types and the object file format expectations. The GNU Binutils source, particularly bfd/ and gas/, is the canonical reference implementation if you want to read working code rather than specification prose.

An assembler is a small program in terms of conceptual weight. The two-pass design, the symbol table, the relocation entries: once you see these three pieces together, the whole translation chain from source to running process becomes straightforward to reason about.
