Most LLM code generation works at the function level. You describe what you want, you get a function back, you paste it in, you test it. The failure modes are obvious: wrong method names, bad syntax, logic that doesn’t match the spec. You iterate quickly.
Generating a complete, playable game is a structurally different problem. Logic, scene structure, and assets form a coupled system where failures only surface at the boundary between components, often at runtime, often silently. Godogen, a pipeline that takes a text prompt and produces a complete Godot 4 project, took about a year and four major rewrites to work through exactly this problem. The three engineering bottlenecks the project solved are worth examining in detail because they generalize well beyond game development.
GDScript Is a Harder LLM Target Than It Looks
GDScript’s Python-like syntax is usually framed as an accessibility win, and for human developers it is. For LLMs, it operates as a liability.
Models trained on the breadth of publicly available code carry strong Python priors. GDScript shares enough surface syntax to activate those priors while diverging in ways that are subtle enough to escape detection until runtime. The failures look correct. The code compiles. It runs. Then it breaks in ways that require Godot-specific knowledge to diagnose.
Signal connections are a good example. The Godot 3 form was:
$Timer.connect("timeout", self, "_on_timer_timeout")
Godot 4 changed this completely:
$Timer.timeout.connect(_on_timer_timeout)
Models blend these. The blended version satisfies neither version of the engine. Similarly, KinematicBody2D became CharacterBody2D in Godot 4, and move_and_slide() changed from accepting a velocity vector as an argument to reading from a property you set beforehand. These aren’t minor API updates; they’re breaking changes that look like working code in a context window.
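For reference, here is a minimal sketch of the Godot 4 forms together in one script. It assumes a child node named Timer and uses a hypothetical SPEED constant; neither comes from Godogen itself.

```gdscript
extends CharacterBody2D

# Hypothetical movement speed in pixels per second (illustrative only).
const SPEED := 200.0

func _ready() -> void:
	# Godot 4: the signal is a member of the node and connect() takes a Callable.
	# The Godot 3 string form connect("timeout", self, "...") is not valid here.
	$Timer.timeout.connect(_on_timer_timeout)

func _physics_process(_delta: float) -> void:
	# Godot 4: write to the built-in velocity property first...
	velocity.x = Input.get_axis("ui_left", "ui_right") * SPEED
	# ...then call move_and_slide() with no arguments.
	move_and_slide()

func _on_timer_timeout() -> void:
	print("timer fired")
```

A blended version typically keeps the Godot 3 argument list on one of these calls while using the Godot 4 class name, which is why it can look plausible in review.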
Python list comprehensions and tuple returns don’t exist in GDScript. The type system treats Vector2 and Vector2i as distinct with separate method sets. PackedFloat32Array looks like a list but has strict element typing. A model producing GDScript from Python training data doesn’t produce obviously broken output; it produces output with the right shape and the wrong semantics.
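A short sketch of how these divergences look in practice. The function names squares and bounds are hypothetical, chosen only to mirror the Python idioms they replace.

```gdscript
extends Node

# Python: squares = [x * x for x in values]
# GDScript has no comprehensions, so the loop is written out.
func squares(values: Array[int]) -> Array[int]:
	var out: Array[int] = []
	for x in values:
		out.append(x * x)
	return out

# Python: return lo, hi
# GDScript has no tuples; a Dictionary (or an Array) carries multiple values.
func bounds(values: Array[float]) -> Dictionary:
	return {"min": values.min(), "max": values.max()}

func _ready() -> void:
	var cell := Vector2i(3, 4)
	# Vector2i has no normalized(); convert explicitly to Vector2 first.
	var direction := Vector2(cell).normalized()
	var samples := PackedFloat32Array([1.0, 2.5])
	# Every element is stored as a 32-bit float; a String here would be rejected.
	samples.append(3.0)
	print(squares([1, 2, 3]), bounds([0.5, 2.0]), direction, samples)
```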
The Godot class reference covers about 850 classes, and most of the GDScript available on the web targets the Godot 3 API. Training data teaches the wrong version at scale.
Godogen’s solution was a three-layer knowledge system. First, a hand-authored language specification that explicitly addresses Python divergence, covering type traps, annotation semantics, and structural differences. Second, the Godot XML source documentation converted to a format the agent can consume, providing version-accurate method signatures that override the mixed Godot 3/4 signal in training data. Third, a quirks database that captures behaviors you can only learn through debugging: things the official docs don’t mention because they’re implicit contracts, not documented behaviors.
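As a rough illustration of the second layer, the sketch below pulls method names out of class reference XML using Godot's XMLParser. The SAMPLE string is a hand-written, abbreviated stand-in for a real class file, and method_names is a hypothetical helper; how Godogen actually converts the XML is not described here.

```gdscript
extends RefCounted

# Hand-written, abbreviated sample in the style of the class reference XML.
const SAMPLE := """
<class name="Timer" inherits="Node">
	<methods>
		<method name="start">
			<return type="void" />
			<param index="0" name="time_sec" type="float" default="-1" />
		</method>
		<method name="stop">
			<return type="void" />
		</method>
	</methods>
</class>
"""

# Returns the name attribute of every <method> element in the given XML text.
static func method_names(xml_text: String) -> PackedStringArray:
	var names := PackedStringArray()
	var parser := XMLParser.new()
	if parser.open_buffer(xml_text.to_utf8_buffer()) != OK:
		return names
	# read() advances one node at a time and stops returning OK at the end.
	while parser.read() == OK:
		if parser.get_node_type() == XMLParser.NODE_ELEMENT \
				and parser.get_node_name() == "method":
			names.append(parser.get_named_attribute_value_safe("name"))
	return names
```

The same loop could collect parameter and return types to build the version-accurate signatures the text mentions.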
Because 850 classes would blow up the context window, the agent lazy-loads only the APIs relevant to the type of game being generated. A 2D platformer needs CharacterBody2D, Sprite2D, CollisionShape2D, Area2D, Camera2D, AnimationPlayer, Timer, and Input. Everything else stays out of context. The quirks relevant to those classes load alongside them.
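The lazy-loading idea can be sketched as a lookup from game type to class list, with quirks attached per class. Everything named here (the "platformer_2d" key, both tables, build_context) is hypothetical, and GDScript is used only to stay consistent with the other examples; the article does not say what language the pipeline itself is written in.

```gdscript
extends RefCounted

# Hypothetical mapping from game type to the classes worth loading.
const CLASSES_BY_GAME_TYPE := {
	"platformer_2d": [
		"CharacterBody2D", "Sprite2D", "CollisionShape2D", "Area2D",
		"Camera2D", "AnimationPlayer", "Timer", "Input",
	],
}

# Hypothetical quirk notes keyed by class name.
const QUIRKS_BY_CLASS := {
	"CharacterBody2D": ["move_and_slide() takes no arguments; set velocity first."],
}

# Builds the context for one game type: each relevant class with its quirks.
# Classes outside the selected list are never touched.
static func build_context(game_type: String) -> Dictionary:
	var context := {}
	for cls in CLASSES_BY_GAME_TYPE.get(game_type, []):
		context[cls] = QUIRKS_BY_CLASS.get(cls, [])
	return context
```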
The Node That Silently Vanishes
Godot’s scene format, .tscn, is a text serialization format designed for tool output, not human editing. It tracks load steps, UIDs, external resource references, and node parent paths in a way that is easy to get almost right while being subtly wrong. Generating this format directly would require the model to maintain invariants that aren’t derivable from the structure of the output.
Godogen avoids this by generating headless GDScript that constructs scenes programmatically through Godot’s API and then serializes them. The engine handles the format; the model only needs to produce correct API calls:
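# Build the node hierarchy in memory; none of this needs a running scene tree.
var root = Node2D.new()
root.name = "GameScene"
var player = CharacterBody2D.new()
player.name = "Player"
player.position = Vector2(100, 300)
root.add_child(player)
# Set owner after add_child() so the node is serialized with the scene.
player.owner = root
var sprite = Sprite2D.new()
sprite.name = "Sprite2D"
player.add_child(sprite)
# Nested nodes are owned by the scene root, not by their direct parent.
sprite.owner = root
# pack() returns an Error code, not the PackedScene, so keep the scene in a variable.
var scene = PackedScene.new()
scene.pack(root)
ResourceSaver.save(scene, "res://scenes/game.tscn")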
The owner assignment is the detail that took careful work to teach. PackedScene.pack() only records nodes owned by the root being packed, so every programmatically created node must have its owner set to the scene root explicitly. If you omit it, the node exists correctly in memory, ResourceSaver.save() succeeds and writes a valid-looking file to disk, and the .tscn file can be reopened in the editor, where the node is simply absent. No error at any stage. The node silently vanishes on reload. This behavior isn’t documented prominently in the API reference; it surfaces through debugging headless scene construction.
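One way to make this failure loud rather than silent is to check ownership before packing. The helper below is a hypothetical sketch, not part of Godogen: collect_unowned walks the in-memory hierarchy and returns the names of nodes whose owner is not the scene root. It is deliberately simplified and would also flag children of instanced sub-scenes, whose owner is legitimately their own scene root.

```gdscript
extends RefCounted

# Returns the names of all descendants of root that would be dropped by
# PackedScene.pack(root) because their owner is not root.
static func collect_unowned(root: Node) -> Array[String]:
	var missing: Array[String] = []
	_walk(root, root, missing)
	return missing

# Depth-first walk; the scene root itself needs no owner and is not checked.
static func _walk(root: Node, node: Node, missing: Array[String]) -> void:
	for child in node.get_children():
		if child.owner != root:
			missing.append(str(child.name))
		_walk(root, child, missing)
```

A build script could call this just before pack() and abort with the returned names if the array is not empty.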
This “generate the builder, not the format” pattern has analogs elsewhere. CDK8s generates Kubernetes YAML through TypeScript objects rather than raw YAML strings. Pulumi expresses infrastructure through code rather than HCL. Playwright generates browser interactions rather than raw HTTP sequences. The principle holds when a format is designed to be tool output: use the tool’s API and delegate format correctness to the implementation.
Build-Time and Runtime Are Different Environments
The headless construction phase and the running game are two distinct execution environments with different available APIs. This distinction doesn’t appear clearly in the official documentation because the docs describe normal runtime use. An agent without explicit guidance about this boundary produces code that is correct for one environment and wrong for the other.
@onready is the clearest example:
@onready var sprite: Sprite2D = $Sprite2D
At runtime, this defers the assignment until just before _ready() fires, after the node enters the scene tree. In headless construction, a node that is only assembled in memory never enters a live tree, so _ready() never fires and the variable stays null. There is no compile error. The failure appears only when the game runs and something tries to use sprite.
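The build-time side of this can be sketched with a standalone script run through godot --headless --script. The file layout and node names are assumptions for illustration. Because the nodes are never added to the tree, nothing triggers _ready(), so the script looks the child up explicitly with a relative path instead of relying on @onready.

```gdscript
extends SceneTree

func _init() -> void:
	var player := CharacterBody2D.new()
	var sprite := Sprite2D.new()
	sprite.name = "Sprite2D"
	player.add_child(sprite)

	# The hierarchy exists only in memory; it is not part of the live tree,
	# so _ready() is not called and @onready variables are not assigned.
	print("inside tree: ", player.is_inside_tree())

	# Relative lookups still resolve against the in-memory hierarchy.
	var found := player.get_node_or_null("Sprite2D") as Sprite2D
	print("found sprite: ", found != null)

	# Nodes outside the tree are not freed automatically.
	player.free()
	quit()
```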
Signal connections that use node path traversal face the same issue. At runtime, $Button.pressed.connect(_on_button_pressed) works against a live scene tree. During headless construction, there is no live tree; path resolution works against an in-memory hierarchy that may or may not contain the target node.
Teaching the model which APIs belong to which phase required building explicit context about the lifecycle that doesn’t exist in any single document. The quirks database carries this: for each relevant API, whether it’s valid at build time, at runtime, or both, with notes on failure modes when used in the wrong phase.
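A phase-aware quirks table might look something like the sketch below. The Phase enum, the entries, and is_valid_in are all hypothetical; the actual schema of Godogen's database is not given in this article.

```gdscript
extends RefCounted

enum Phase { BUILD, RUNTIME, BOTH }

# Illustrative entries only: each API maps to the phase where it is valid
# and a note on what goes wrong in the other phase.
const QUIRKS := {
	"@onready": {
		"phase": Phase.RUNTIME,
		"note": "Stays null if the node never enters the scene tree.",
	},
	"Node.owner": {
		"phase": Phase.BOTH,
		"note": "Must point at the scene root or the node is not saved.",
	},
}

# Returns true when the API may be used in the given phase.
# APIs with no recorded quirk are assumed to be unrestricted.
static func is_valid_in(api: String, phase: Phase) -> bool:
	if not QUIRKS.has(api):
		return true
	var allowed: int = QUIRKS[api]["phase"]
	return allowed == Phase.BOTH or allowed == phase
```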
The Evaluation Loop Problem
A coding agent evaluating its own output has a systematic bias. It tends to see code it wrote as more correct than it is, because the same priors that generated the code also evaluate it. For a pipeline producing entire games, where failures only surface when the game runs, this bias is expensive.
Godogen runs generated projects under godot --headless and captures the console output, feeding errors back into the generation loop. This catches null node references, type mismatches, missing scene dependencies, and runtime physics failures that static analysis can’t detect. The feedback is objective: the engine either runs the game or it doesn’t, and when it doesn’t, it reports why.
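The capture step could be sketched as follows. Both function names are hypothetical, and the sketch assumes that error lines in the console output start with "ERROR:" or "SCRIPT ERROR:" and that the --quit-after flag is available in the Godot 4 release being used; how Godogen parses the output is not specified.

```gdscript
extends RefCounted

# Keeps only the lines that Godot prefixes as errors.
static func extract_errors(log_text: String) -> PackedStringArray:
	var errors := PackedStringArray()
	for line in log_text.split("\n"):
		var trimmed: String = line.strip_edges()
		if trimmed.begins_with("ERROR:") or trimmed.begins_with("SCRIPT ERROR:"):
			errors.append(trimmed)
	return errors

# Runs a project headlessly with the current Godot binary and returns its errors.
# The iteration count of 120 is an arbitrary illustrative value.
static func run_headless(project_dir: String) -> PackedStringArray:
	var output: Array = []
	var args := PackedStringArray(
		["--headless", "--path", project_dir, "--quit-after", "120"])
	# The fourth argument asks OS.execute() to include stderr in the output.
	OS.execute(OS.get_executable_path(), args, output, true)
	return extract_errors("\n".join(PackedStringArray(output)))
```

The returned lines are what would be fed back into the generation loop as the objective signal described above.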
This is harder than it sounds for a complete game because failures can be downstream of the actual cause. A missing owner assignment produces a missing node, which produces a null reference in a script, which produces a runtime error pointing at the script rather than at the headless construction code. Tracing from symptom to cause requires understanding the Godot lifecycle well enough to reason backwards through the dependency chain.
What This Generalizes To
Godogen’s bottlenecks aren’t specific to Godot. Any domain-specific runtime with sparse training data, version-split documentation, and implicit behavioral contracts will look similar. Unity’s DOTS architecture, Unreal’s Blueprint system, and shader languages like WGSL all have corners where models generate plausible-looking code that fails in ways requiring deep runtime knowledge to diagnose.
The architecture Godogen settled on, a structured reference system with explicit phase documentation and a database of empirically-discovered quirks, is essentially a wrapper that compensates for what training data can’t reliably capture: the delta between what the docs say and what the engine actually does. That delta is present in most mature codebases. Building infrastructure to make it legible to a model is the work that snippet-level code generation doesn’t require but system-level generation does.