Build Time, Runtime, and the Node That Vanishes: What Godogen Reveals About LLM Code Generation

A pipeline that takes a text prompt and outputs a playable Godot 4 game sounds like the kind of project that either works trivially or fails immediately. Godogen is neither. After about a year of development and four major rewrites, the result handles GDScript generation, 2D and 3D asset creation, scene graph construction, and visual testing. The three engineering problems it had to solve are not just Godot-specific; they are a useful lens for understanding what makes any code generation target tractable or not.

The Training Data Problem Is Worse Than It Looks

Godot’s scripting API exposes roughly 850 classes, and GDScript, the language used to drive them, has Python-like syntax: indentation blocks, similar control flow, type annotations that look like Python’s. This creates a specific failure mode that goes beyond the usual sparse-training-data issue. LLMs are not simply uncertain about GDScript; they are confidently wrong about it in ways that are hard to catch.

The MultiPL-E benchmark established that LLM code generation performance tracks training data volume closely. GDScript sits well below even Lua on that curve. Godot 4 shipped in March 2023, and most GDScript in public datasets predates it, so the training data actively teaches behaviors that fail in the current version. A model that has seen substantial Godot 3 code will generate Godot 3 method signatures alongside Godot 4 node names, producing hybrid code that fails in both versions.

The Godot 3 to Godot 4 migration breaks are numerous and non-obvious:

# Godot 3
$Timer.connect("timeout", self, "_on_timer_timeout")
onready var player = $Player
export var speed = 200.0
move_and_slide(velocity, Vector2.UP)

# Godot 4
$Timer.timeout.connect(_on_timer_timeout)
@onready var player: CharacterBody2D = $Player
@export var speed: float = 200.0
# velocity is now a class property; move_and_slide() takes no argument
move_and_slide()
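
Because these breaks are syntactically regular, some of them can be caught mechanically before the engine ever runs. The following is a minimal sketch of such a check, not something the project is known to ship: the pattern table and the function name find_godot3_idioms are hypothetical, and the regexes cover only the idioms listed above plus the KinematicBody2D rename.

```gdscript
extends RefCounted

# Hypothetical lint table: a regex for a Godot 3 idiom, then a Godot 4 hint.
const GODOT3_PATTERNS := {
	"^\\s*onready\\s+var": "use @onready",
	"^\\s*export\\s+var": "use @export",
	# Three-argument connect("signal", target, "method") is the Godot 3 form.
	"\\.connect\\(\\s*\"[^\"]*\"\\s*,\\s*\\w+\\s*,\\s*\"": "use signal.connect(callable)",
	"move_and_slide\\([^)]": "set velocity, then call move_and_slide()",
	"\\bKinematicBody2D\\b": "renamed to CharacterBody2D",
}

# Scans generated source line by line and reports suspected Godot 3 leftovers.
static func find_godot3_idioms(source: String) -> Array[String]:
	var findings: Array[String] = []
	var lines := source.split("\n")
	for pattern in GODOT3_PATTERNS:
		var regex := RegEx.create_from_string(pattern)
		for i in lines.size():
			if regex.search(lines[i]) != null:
				findings.append("line %d: %s" % [i + 1, GODOT3_PATTERNS[pattern]])
	return findings
```

A check like this only flags suspects. It cannot confirm that the surrounding code is valid Godot 4, so it complements rather than replaces running the engine.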

Beyond version contamination, Python similarity activates the wrong priors. len(x) and array.append(x) happen to work, but the native forms are x.size() and array.push_back(x), and the overlap ends quickly: there is no array.pop() (use pop_back()), no list comprehensions, and no tuple returns. Vector2 and Vector2i are distinct types, and Vector2i lacks much of Vector2’s method set, normalized() included. The $NodePath shorthand is a get_node() call evaluated when the line runs, so using it in _init(), before the node has any children, yields null, and the resulting failure tends to surface far from its cause.
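
As an illustration of how these Python habits translate, here is a small sketch. The function names min_max and doubled_positives are made up for the example; the engine calls used (Array.filter, Array.map, minf, maxf, the Vector2 constructor) are standard Godot 4 API.

```gdscript
extends Node

# Python habit: return (lo, hi). GDScript has no tuples, so return an Array
# (or a Dictionary when named fields read better).
func min_max(values: Array) -> Array:
	var lo: float = values[0]
	var hi: float = values[0]
	for v in values:
		lo = minf(lo, v)
		hi = maxf(hi, v)
	return [lo, hi]

# Python habit: [x * 2 for x in xs if x > 0]. There are no comprehensions;
# chain Array.filter() and Array.map(), or write an explicit loop.
func doubled_positives(xs: Array) -> Array:
	return xs.filter(func(x): return x > 0).map(func(x): return x * 2)

func _ready() -> void:
	var bounds := min_max([3.0, -1.0, 7.5])
	print(bounds[0], " ", bounds[1])
	print(doubled_positives([-2, 1, 4]))
	# Vector2i has no normalized(); convert to Vector2 first.
	var cell := Vector2i(3, 4)
	var direction := Vector2(cell).normalized()
	print(direction)
```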

Godogen’s solution is to not rely on training data for the API surface. The project maintains a hand-written language spec documenting where GDScript diverges from Python, API documentation converted from Godot’s own XML class source, and a quirks database of behaviors not captured in any official documentation. The full API catalog is several megabytes, so loading all of it into every generation call is not feasible. Instead, the agent identifies which classes a given game type requires before writing any code, then fetches only those definitions.

A 2D platformer has a predictable, bounded working set: CharacterBody2D, Sprite2D, CollisionShape2D, Area2D, Camera2D, AnimationPlayer, Timer, Input. Loading only these keeps the context compact and reduces the surface for hallucinating plausible-sounding method names from unrelated subsystems. This is closer to structured retrieval against a known catalog than to embedding-similarity RAG, and it is more reliable for this task because the selection logic is deterministic rather than probabilistic. Liu et al. 2023 established that transformer models attend poorly to content in the middle of long contexts; keeping API references compact places the critical definitions in the high-attention zone.
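
The selection step can be pictured as a plain lookup followed by file reads. The sketch below is an assumption about the shape of such a retriever, not Godogen’s actual code: the api_docs directory, the one-file-per-class layout, the .md extension, and the WORKING_SETS table are all hypothetical.

```gdscript
extends RefCounted

# Hypothetical catalog layout: one pre-converted doc file per class.
const DOCS_DIR := "res://api_docs"

# Hypothetical mapping from game type to the classes it needs.
const WORKING_SETS := {
	"platformer_2d": [
		"CharacterBody2D", "Sprite2D", "CollisionShape2D", "Area2D",
		"Camera2D", "AnimationPlayer", "Timer", "Input",
	],
}

# Returns the concatenated reference text for one game type, skipping
# classes whose doc file is missing rather than guessing at their API.
func load_context(game_type: String) -> String:
	var parts: PackedStringArray = []
	for cls in WORKING_SETS.get(game_type, []):
		var path := "%s/%s.md" % [DOCS_DIR, cls]
		if not FileAccess.file_exists(path):
			push_warning("No doc file for %s" % cls)
			continue
		parts.append(FileAccess.get_file_as_string(path))
	return "\n\n".join(parts)
```

The point of the lookup table is that the same game type always yields the same context, which makes a bad generation reproducible and therefore debuggable.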

The Build-Time vs. Runtime Boundary

Godot’s execution has distinct phases, and which APIs are available depends entirely on which phase you are in.

During _init(), the node object is constructed but has no parent and no children. get_node() returns null. get_tree() returns null. The @onready annotation is syntactic sugar for an assignment deferred until just before _ready() fires; a model applying Python variable semantics generates null references with no obvious cause:

# Wrong: $Sprite2D resolves to null during _init()
var sprite = $Sprite2D

# Correct: deferred until the scene tree is fully built
@onready var sprite: Sprite2D = $Sprite2D

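Signal connections attempted before _ready() run into the same problem: the child nodes they target do not exist yet. A CharacterBody2D with no CollisionShape2D child loads cleanly and then falls through the floor when physics runs. Nothing is logged at runtime; the only hint is a configuration warning in the editor, which a headless pipeline never sees.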
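
A structural check run over the built node graph can surface the missing-shape case before any physics runs. This is a minimal sketch under the assumption that shapes are direct children of the body; find_shapeless_bodies is a hypothetical helper name and only the 2D classes are covered.

```gdscript
extends RefCounted

# Returns the names of 2D physics bodies and areas under `root` that have
# no direct CollisionShape2D or CollisionPolygon2D child.
static func find_shapeless_bodies(root: Node) -> PackedStringArray:
	var offenders: PackedStringArray = []
	var pending: Array[Node] = [root]
	while not pending.is_empty():
		var node: Node = pending.pop_back()
		if node is CollisionObject2D and not _has_shape(node):
			offenders.append(str(node.name))
		pending.append_array(node.get_children())
	return offenders

# True if at least one direct child provides collision geometry.
static func _has_shape(body: Node) -> bool:
	for child in body.get_children():
		if child is CollisionShape2D or child is CollisionPolygon2D:
			return true
	return false
```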

Godogen builds scenes using headless GDScript tool scripts that construct the node graph in memory via Godot’s API and serialize it with PackedScene and ResourceSaver. This sidesteps the fragility of generating .tscn text directly. Godot’s .tscn format is an INI-like serialization with strict bookkeeping: the load_steps header count is derived from the resource declarations (the number of resources plus one), UIDs must be unique across the project, and every ID reference must match its declaration exactly. An LLM is structurally unreliable at maintaining exact counts and unique identifiers across a whole document. The headless builder delegates all of that to the engine.

But headless construction introduces its own constraint. During a headless build, nodes are assembled without entering a running scene tree, so _ready() never fires and @onready variables are never assigned. Code must be explicitly structured for the phase it targets. Knowing which APIs are available at build time versus runtime required careful prompting, but the payoff is that the generated scenes are structurally correct before any game logic runs.
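
A build script of this kind might look like the following sketch. The file name build_scene.gd, the node names, the shape size, and the output path res://level.tscn are illustrative assumptions; the script is assumed to be run from the project directory with godot --headless --script build_scene.gd. The owner assignments are required for the nodes to be written to the saved file.

```gdscript
# build_scene.gd (hypothetical file name)
extends SceneTree

# Called once when the script is run via --script.
func _initialize() -> void:
	var root := Node2D.new()
	root.name = "Level"

	var player := CharacterBody2D.new()
	player.name = "Player"
	root.add_child(player)
	player.owner = root  # required for the node to be serialized

	var shape := CollisionShape2D.new()
	shape.name = "CollisionShape2D"
	var rect := RectangleShape2D.new()
	rect.size = Vector2(16, 32)
	shape.shape = rect
	player.add_child(shape)
	shape.owner = root

	# The engine writes the .tscn text, including all IDs and counts.
	var scene := PackedScene.new()
	var err := scene.pack(root)
	if err == OK:
		err = ResourceSaver.save(scene, "res://level.tscn")
	if err != OK:
		push_error("Scene build failed with error %d" % err)
	root.free()
	quit(0 if err == OK else 1)
```

Nothing in this script relies on _ready() or @onready: every reference is a local variable that exists at the moment it is used.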

The analogues from infrastructure tooling are direct: AWS CDK generates CloudFormation JSON by running TypeScript; CDK8s generates Kubernetes YAML the same way. The principle is that delegating format correctness to an API is more reliable than generating the serialized text directly, regardless of whether the generator is a human or an LLM.

The Node That Vanishes

The most subtle failure mode in headless Godot scene construction has no diagnostic output at any stage. When building a scene programmatically, add_child() is not sufficient to include a node in the saved .tscn file. Every descendant’s .owner property must also be explicitly set to the scene root:

var root = Node2D.new()
var player = CharacterBody2D.new()
var sprite = Sprite2D.new()

root.add_child(player)
player.owner = root   # owner = root, not the immediate parent

player.add_child(sprite)
sprite.owner = root   # owner = root at every depth

Omitting .owner causes the following: add_child() succeeds in memory, PackedScene.pack() returns success, ResourceSaver.save() writes the file and returns success, and then the node is simply absent when the file is reloaded. No error, no warning. The design rationale is correct: Godot distinguishes between nodes that are part of the serialized scene and nodes created at runtime that should not be saved. The owner property is the mechanism. It is appropriate behavior for runtime-created ephemeral nodes; it is a silent footgun for headless scene builders. The documentation mentions it without foregrounding it as the primary failure mode of headless construction.
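
Since every individual call reports success, one way to catch the omission is to pack the tree, instantiate the result in memory, and compare node counts. The helpers below are a hedged sketch with hypothetical names (own_descendants, count_nodes, packs_completely). They assume the tree contains no instanced sub-scenes, whose children should keep their existing owner.

```gdscript
extends RefCounted

# Assigns `root` as owner of every descendant so PackedScene.pack() keeps it.
# Assumption: no instanced sub-scenes live under `node`.
static func own_descendants(node: Node, root: Node) -> void:
	for child in node.get_children():
		child.owner = root
		own_descendants(child, root)

# Counts the nodes in a subtree, including the subtree's own root.
static func count_nodes(node: Node) -> int:
	var total := 1
	for child in node.get_children():
		total += count_nodes(child)
	return total

# Packs the tree, reinstantiates it in memory, and compares node counts.
# A mismatch means at least one node was dropped during packing.
static func packs_completely(root: Node) -> bool:
	var scene := PackedScene.new()
	if scene.pack(root) != OK:
		return false
	var copy := scene.instantiate()
	var complete := count_nodes(copy) == count_nodes(root)
	copy.free()
	return complete
```

Calling packs_completely(root) just before ResourceSaver.save() turns a silent omission into an explicit build failure.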

This category of failure, where the failure only manifests on a subsequent operation after several successful ones, is exactly what makes it hard to capture in training data. No blog post or forum answer documents it until someone wastes hours chasing it. Godogen’s quirks database exists because these behaviors are real constraints that cannot be inferred from API documentation alone.

What the Comparison Projects Reveal

Rosebud AI generates Phaser.js browser games in JavaScript. JavaScript has orders of magnitude more training data than GDScript, no version-contamination problem between Phaser 2 and Phaser 3, and evaluation that runs in a browser tab in milliseconds. The reliability difference between browser-game generators and native engine generators reflects training data density and evaluation cost, not fundamental differences in generation capability.

Unity Muse generates C# MonoBehaviour scripts. C# has substantially higher LLM training corpus representation than GDScript, and Unity documentation appears in training data at a density GDScript cannot match. The pattern holds.

Godogen’s architecture is more expensive: a custom language spec, a quirks database, lazy-loading API retrieval, the headless builder pattern, and an external evaluation loop with a virtual framebuffer. That cost is a consequence of targeting a native engine with a sparse and version-contaminated training corpus, not a deliberate design choice toward complexity.

The Evaluation Loop Problem

A coding agent is inherently biased toward its own output. Static analysis can catch syntax errors and type mismatches, but cannot catch a character that falls through the floor because physics layers are misconfigured, or a signal that fires but produces no visible effect because the connection was made at the wrong lifecycle phase. Godogen evaluates generated games by running them in Godot under a virtual framebuffer and capturing screenshots; the --headless flag on its own disables rendering, so it cannot produce frames to inspect. This is slower than browser evaluation by a wide margin, but it is the only way to observe runtime behavior in a native engine.
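
The capture side of such a loop can be small. The sketch below assumes a hypothetical capture.gd node added to the game (for example as an autoload) and run with a real rendering driver under a virtual display; the warm-up frame count and the output path are arbitrary choices for the example.

```gdscript
# capture.gd (hypothetical): saves one rendered frame, then exits.
extends Node

const WARMUP_FRAMES := 30
const OUTPUT_PATH := "user://frame.png"

func _ready() -> void:
	# Let the game run briefly so physics and animation have settled.
	for i in WARMUP_FRAMES:
		await get_tree().process_frame
	# Wait until the current frame has actually been drawn.
	await RenderingServer.frame_post_draw
	var image := get_viewport().get_texture().get_image()
	if image == null:
		# Expected when no renderer is active, e.g. under --headless.
		push_error("No image available from the viewport")
		get_tree().quit(1)
		return
	var err := image.save_png(OUTPUT_PATH)
	if err != OK:
		push_error("Could not save screenshot: error %d" % err)
	get_tree().quit(0 if err == OK else 1)
```

The exit code gives the surrounding pipeline a cheap signal that a frame was produced; judging whether the frame shows a working game is a separate step.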

The evaluation architecture is what separates a generator from a pipeline. Generating code that compiles is a much lower bar than generating code that produces a playable game. The visual evaluation loop closes the gap between the two, at the cost of iteration speed.

Four rewrites over a year is a significant investment for a side project. Each rewrite addressed a new category of failure rather than refining a stable approach. The end result is an architecture shaped by the specific failure modes of its target, which is probably the only way to build something this far outside the density of existing training data.