
Three Layers of Context for LLM Code Generation in Complex Runtimes

Source: hackernews

Retrieval-augmented generation over API documentation has become the standard first response when an LLM pipeline fails at specialized code generation. Add the docs to context, reduce hallucinations, get better output. For high-resource languages in well-documented frameworks, this works reasonably well. The model already understands the language’s structure; the documentation provides current API signatures and helps with version accuracy.

Godogen, a pipeline that generates complete playable Godot 4 games from text prompts, uses a three-layer reference architecture rather than a single documentation store. The structure emerged from a year of development and four major rewrites, each of which eliminated one category of failure independently. Each layer covers knowledge that the other layers structurally cannot provide, and removing any one produces a distinct and reproducible class of failures.

It is worth understanding carefully why the layers are non-collapsible, because the structure generalizes beyond Godot.
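Before walking through the layers individually, it may help to see how they compose. The sketch below is illustrative, not Godogen's actual code: the layer contents and function names are placeholder assumptions, but the shape — one authored specification, a per-class reference, and quirks attached to classes — follows the architecture the article describes.

```python
# Minimal sketch of assembling prompt context from three knowledge layers.
# All names and contents here are illustrative, not Godogen's actual data.

LANGUAGE_SPEC = (
    "GDScript diverges from Python: no list comprehensions, "
    "no tuple returns; use for-loops and typed Arrays."
)

API_REFERENCE = {
    "CharacterBody2D": "move_and_slide() -> bool  (Godot 4: takes no arguments)",
}

QUIRKS = {
    "Node": [
        "During headless scene construction, set node.owner = scene_root "
        "after add_child(), or the node silently disappears on reload."
    ],
}

def build_context(required_classes):
    """Concatenate spec, per-class docs, and attached quirks into one context."""
    parts = [LANGUAGE_SPEC]                          # layer 1: structural priors
    for cls in required_classes:
        if cls in API_REFERENCE:
            parts.append(API_REFERENCE[cls])         # layer 2: accurate signatures
        for quirk in QUIRKS.get(cls, []):
            parts.append(f"NOTE ({cls}): {quirk}")   # layer 3: operational knowledge
    return "\n\n".join(parts)
```

The key property is that the specification is always present, while layers two and three load per class — which is what makes the lazy-loading strategy described later possible.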

Layer One: The Language Specification

The first layer is not retrieved from existing documentation; it is authored. Godogen includes a hand-written GDScript language specification that explicitly describes the language’s type system, annotation syntax, and the points where GDScript diverges from Python in ways that produce syntactically plausible but incorrect code.

This layer is necessary because GDScript’s closest relative in most models’ training data is Python, and the overlap is genuinely confusing. Python idioms that are natural to write, and that a model trained on mixed GDScript and Python data will produce, either fail to compile in GDScript or compile and do the wrong thing:

# Python idiom that fails in GDScript: list comprehension
var positions = [node.position for node in get_children()]  # invalid

# GDScript equivalent
var positions: Array[Vector2] = []
for node in get_children():
    positions.append(node.position)

# Python idiom that fails: multiple return via tuple
func get_bounds() -> (Vector2, Vector2):  # invalid GDScript
    return min_pos, max_pos

# GDScript equivalent: use a typed Array
func get_bounds() -> Array[Vector2]:
    return [min_pos, max_pos]

The GDScript documentation for individual classes does not say “list comprehensions do not exist.” It documents what GDScript has, not what Python has that GDScript omits. A model generating code from class references alone still carries the wrong structural priors; it knows what Array methods are available but not that the Python comprehension syntax it would naturally reach for is absent. The specification layer exists to correct structural priors that documentation cannot contradict, because documentation assumes you already have them right.

The same applies to type annotation syntax, @export and @onready annotation semantics, signal declaration syntax, and several GDScript features that have no Python analogue. These are language-level constraints that belong in a language specification, not in per-class API documentation.
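One way to make this negative knowledge machine-usable is to encode each divergence as an explicit "this does not exist; do this instead" entry. The structure below is a hypothetical encoding — the article does not specify how Godogen formats its specification — but it shows the kind of statement per-class documentation cannot express:

```python
# Hypothetical encoding of the specification layer's negative knowledge:
# constructs GDScript lacks relative to Python. Field names are assumptions.

DIVERGENCES = [
    {
        "python_idiom": "list comprehension",
        "gdscript": "absent",
        "instead": "explicit for-loop with Array.append()",
    },
    {
        "python_idiom": "tuple return type, e.g. -> (A, B)",
        "gdscript": "absent",
        "instead": "typed Array return, e.g. -> Array[Vector2]",
    },
]

def render_spec(divergences):
    """Render divergence entries as explicit 'this does not exist' statements."""
    lines = ["GDScript divergences from Python:"]
    for d in divergences:
        lines.append(
            f"- {d['python_idiom']}: {d['gdscript']} in GDScript; "
            f"use {d['instead']}."
        )
    return "\n".join(lines)
```

Rendering divergences as direct prohibitions, rather than leaving them implied by omission, is exactly the inversion that distinguishes an authored specification from retrieved documentation.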

Layer Two: The Version-Accurate API Reference

The second layer is the Godot 4 API reference, converted from the engine’s XML source files. Godot maintains its class documentation as XML in the engine repository: approximately 850 classes covering methods, properties, signals, constants, and version metadata.

Version accuracy matters because the model’s GDScript training data spans Godot 3 and Godot 4, which changed the API substantially. Signal connection syntax, the primary character movement class (KinematicBody2D became CharacterBody2D), the move_and_slide() signature, and dozens of other commonly used methods changed between versions. A model without the Godot 4 reference interpolates between the two, producing code that satisfies neither:

# Godot 3: move_and_slide took velocity as an argument
velocity = move_and_slide(velocity, Vector2.UP)

# Godot 4: velocity is a property; move_and_slide takes no arguments
velocity.y += gravity * delta
move_and_slide()

No amount of language specification fixes this; the specification describes syntax and semantics, not the current state of specific method signatures. The API reference layer, drawn from the authoritative XML source rather than from scraped documentation that may mix versions, closes this gap.

Godogen lazy-loads this reference rather than injecting the full 850-class corpus into every context. The agent first determines which classes the requested game type will require, then loads documentation for that working set. A 2D platformer typically resolves to CharacterBody2D, CollisionShape2D, Sprite2D, AnimationPlayer, Camera2D, Timer, Area2D, and the Input singleton. This targeted loading stays within context budget while providing authoritative signatures for every API the generated code will actually use.

The lazy-loading decision also makes the boundary between layers two and three operationally concrete. When the agent requests documentation for CharacterBody2D, it receives not only the class’s method signatures but the operational knowledge entries associated specifically with that class. The quirks travel with the documentation they annotate, not as a separate retrieval step.
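A minimal sketch of that loading step, under assumed names (the working-set table and the docs/quirks stores are illustrative placeholders, not Godogen's internals):

```python
# Sketch of lazy-loading a working set instead of the full ~850-class corpus.
# WORKING_SETS contents, docs, and quirks are illustrative placeholders.

WORKING_SETS = {
    "2d_platformer": [
        "CharacterBody2D", "CollisionShape2D", "Sprite2D", "AnimationPlayer",
        "Camera2D", "Timer", "Area2D", "Input",
    ],
}

def load_reference(game_type, docs, quirks):
    """Load docs only for the classes this game type needs, with each
    class's quirks attached directly to its documentation entry."""
    entries = {}
    for cls in WORKING_SETS.get(game_type, []):
        entry = docs.get(cls, "")
        for q in quirks.get(cls, []):          # quirks travel with the docs
            entry += f"\nQUIRK: {q}"
        entries[cls] = entry
    return entries
```

Attaching quirks at load time, rather than retrieving them separately, guarantees the model never sees a class's signatures without also seeing the constraints that apply to them.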

Layer Three: The Operational Knowledge Database

The third layer is where documentation retrieval ends and a different kind of knowledge begins. Godogen calls this the quirks database. It contains runtime behaviors, phase-specific constraints, and implicit API contracts that formal documentation does not capture, because documentation authors assume you already know them.

The canonical example is the owner property during headless scene construction:

var node = Node2D.new()
scene_root.add_child(node)
# Without this line: the node writes to disk without error,
# then silently disappears when the .tscn is reloaded.
node.owner = scene_root

The Node.owner API is documented. What the documentation does not state is that omitting owner during programmatic scene construction produces a specific silent failure: the node is present in memory during construction, the save operation reports success, the .tscn file is syntactically valid, and the node is simply absent when the file is reloaded. No error is emitted at any stage. Discovering this requires building a headless scene, observing the discrepancy, and tracing backward to the cause.

The @onready annotation has the same character. Documentation accurately describes what it does: defers variable assignment until _ready() fires after the node enters the scene tree. During headless scene construction, there is no active scene tree, so _ready() never fires for the scene being built. The annotation parses without error, compiles without warning, and leaves the variable null in a way that fails at runtime with no visible connection to its cause:

# In a running game: works correctly, assigns after _ready()
@onready var sprite: Sprite2D = $Sprite2D

# In a headless construction script: _ready() never fires.
# Variable is null. No compile error. No runtime error at assignment.
# Failure appears later, when code uses sprite and gets a null reference.
@onready var sprite: Sprite2D = $Sprite2D  # silently does nothing

Neither of these constraints appears in the API reference because the API reference documents behavior during normal game execution. Neither appears in the language specification because they are runtime behaviors of a specific execution phase, not language semantics. They require a third layer: curated knowledge of what happens in the specific execution contexts this pipeline uses, built by running the pipeline until it fails and encoding what the failures reveal.
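A quirks entry therefore has to record more than the API name: it needs the execution phase, the observed symptom, and the rule the generator must follow. The record shape below is an assumption — the article describes the content of such entries, not their schema — with the owner quirk as the worked example:

```python
from dataclasses import dataclass

# Illustrative record shape for one quirks-database entry. The schema is an
# assumption; the article describes the content of entries, not their form.

@dataclass
class Quirk:
    api: str       # the documented API the quirk attaches to
    phase: str     # the execution context where the behavior appears
    symptom: str   # what actually happens, often silently
    rule: str      # the constraint generated code must follow

OWNER_QUIRK = Quirk(
    api="Node.owner",
    phase="headless scene construction",
    symptom="save succeeds, .tscn is valid, node is absent on reload",
    rule="set node.owner = scene_root after add_child() for every node "
         "that must persist in the saved scene",
)
```

The phase field is what makes the entry different from documentation: it scopes the rule to the execution context where the documented behavior breaks down.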

Signal connections add a third case of the same kind. Godot 4’s first-class signal objects work correctly in a running game’s _ready() method. During headless construction, the scene tree does not exist as a live graph; node path traversal with $ resolves against the in-memory hierarchy, which behaves differently from the instantiated tree at runtime. Connections established through node paths in a headless context can reference nodes that do not exist at the moment of the call, producing either a null reference or a silent no-op with no diagnostic output.

Each of these cases shares the same structure: the API accepts the call, produces no error at the call site, and fails at a later phase. That specific signature, silent acceptance followed by deferred failure, is precisely what documentation cannot warn you about, because documentation is written for the phase where the API works.
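That shared signature also suggests how a quirks layer can be enforced rather than merely injected: because the failures are invisible at the call site, known patterns can be caught by scanning generated headless-construction scripts before they run. The rules and messages below are illustrative, not Godogen's actual checks:

```python
import re

# Sketch of a phase-aware post-generation check. Because these failures are
# accepted at the call site and only surface in a later phase, known quirks
# can be flagged by scanning generated headless-construction scripts.
# The rules and messages are illustrative, not Godogen's actual checks.

HEADLESS_RULES = [
    (re.compile(r"@onready"),
     "@onready never fires during headless construction; assign directly"),
    (re.compile(r"\$\w+\.\w+\.connect\("),
     "signal connected through a node path; target may not exist yet "
     "during headless construction"),
]

def check_headless_script(source):
    """Return a warning for every headless-phase rule the script triggers."""
    return [msg for pattern, msg in HEADLESS_RULES if pattern.search(source)]
```

For example, `check_headless_script("@onready var sprite: Sprite2D = $Sprite2D")` returns the @onready warning, while ordinary assignments pass clean.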

Why the Layers Are Non-Collapsible

Each layer addresses a failure category the others structurally cannot.

The specification layer corrects structural priors from training data contamination. It cannot be replaced by API documentation, because documentation does not enumerate what the language lacks. A model reading accurate Array documentation still reaches for Python comprehension syntax; only the specification tells it that the syntax is absent. Nor can it be derived from the quirks database, because structural priors are language-wide rather than instance-specific.

The API reference layer provides version-accurate method signatures. It cannot be replaced by the specification, which describes language semantics rather than library behavior. It cannot be replaced by the quirks database, which covers edge cases in specific contexts rather than the full API surface. The Godot 3 versus Godot 4 migration problem is not a quirk; it is a version-accuracy problem that belongs in the reference layer.

The operational knowledge layer covers behaviors in specific execution phases that neither the language specification nor the API documentation addresses, because both are written for normal use contexts. It cannot be derived from the other two layers because it is empirical knowledge, gathered by running code in the specific conditions that produce failure.

A pipeline with two of the three layers fails systematically in the category the third covers. Adding documentation to a pipeline that lacks a specification still produces Python-idiom hallucinations in GDScript output. Adding a specification to a pipeline that lacks an operational knowledge layer still produces silent failures at phase boundaries. The layers are necessary together.

The General Pattern

Godot is not unusual in having this structure. Any runtime with lifecycle phases, a large API surface, and a target language with limited or contaminated training data has the same three knowledge categories.

Unity’s MonoBehaviour lifecycle has documented method names and underdocumented ordering in specific configurations: the difference between Awake, Start, and OnEnable when scripts load in particular orders, the constraint that GetComponent called from a constructor returns null, the interaction of ExecutionOrder attributes across multiple scripts. These are operational knowledge. They are not in the C# specification or the Unity Scripting Reference.

GPU shader programming in HLSL or GLSL has the same structure: a language specification, per-API documentation (D3D12, Vulkan, OpenGL), and a layer of driver-behavior and hardware-specific knowledge that surfaces through running shaders and observing failures on specific hardware. The Microsoft HLSL documentation does not document the driver-specific divergences that graphics programmers accumulate through years of debugging.

Embedded RTOS programming in FreeRTOS or Zephyr has interrupt context restrictions on API calls that are specified in prose documentation but not enforced by the API itself. A model generating embedded code from API documentation alone will produce code that compiles, links, and fails unpredictably at runtime when an interrupt context calls a function that blocks.

In each domain, an LLM pipeline targeting that runtime would need to construct all three layers explicitly. The model does not become more capable through better documentation; it becomes more reliable within a specific execution context when the documentation precisely matches what that context requires.

Godogen spent a year across four rewrites assembling these three layers for a domain where none of them previously existed in pipeline-ready form. The hand-authored specification, the XML conversion pipeline, and the accumulated quirks database are the infrastructure that makes the model’s output reliable. They are not configuration of the model. They are the engineering work that makes the task tractable, and the pattern they form describes what that work looks like in any domain where it has not yet been done.
