Generating Games from Text: The Engineering Reality Behind Godogen

When an LLM generates Python, it draws on years of training data: millions of repositories, StackOverflow threads, tutorials, and documentation pages. When it generates GDScript for a Godot 4 project, it is working from a corpus several orders of magnitude smaller, most of which targets the wrong engine version. Godogen is a pipeline by developer htdt, built through four major rewrites over about a year, that takes a text prompt and outputs a complete, playable Godot 4 project. The engineering required to make that reliable reveals something broader about LLM code generation in domains where training data is thin.

The GDScript Corpus Problem Has a Version Layer

The headline challenge is that Godot exposes roughly 850 built-in classes to GDScript, a language whose Python-adjacent syntax gives LLMs misleading confidence. Control flow, indentation, list operations: much of this transfers. The failures concentrate in domain-specific patterns, specifically the signal system, the annotation syntax, and the node lifecycle.

What compounds the scarcity problem is that most GDScript in public code datasets predates Godot 4. Godot 4 shipped in March 2023 and broke a substantial number of established patterns in the process:

# Godot 3 patterns the model likely learned
onready var player = $Player
export var speed = 200.0
connect("body_entered", self, "_on_body_entered")
move_and_slide(velocity, Vector2.UP)

# Godot 4 equivalents the model must generate
@onready var player: CharacterBody2D = $Player
@export var speed: float = 200.0
body_entered.connect(_on_body_entered)
move_and_slide()  # velocity is now a class property, not an argument

The @ prefix on annotations is easy to omit. The signal connection API changed substantially. KinematicBody2D was renamed CharacterBody2D. yield() was removed in favor of await. None of these are obscure edge cases; they appear in nearly every movement or physics script. A model trained predominantly on Godot 3 examples will produce code that looks plausible and fails consistently.

Godogen addresses this with a custom reference system: a hand-written language spec, API documentation converted from Godot’s XML source tree, and a quirks database covering behaviors that do not appear clearly in official documentation. Because loading the full API surface blows up the context window, the agent lazy-loads only the specific class documentation it needs at runtime. This is retrieval-augmented generation applied to language specification rather than general knowledge retrieval, which is a meaningful distinction: the retrieval target is precise and bounded, not a fuzzy similarity search over documents.
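
The per-class lookup idea can be approximated with a short headless script. This is only a sketch under stated assumptions, not Godogen’s reference system: it swaps the converted XML documentation for Godot’s built-in ClassDB introspection, which yields names and signatures but no prose descriptions. The function name summarize_class and the file name class_summary.gd are hypothetical.

```gdscript
extends SceneTree

# Hypothetical helper: build a compact text summary of one engine class,
# listing only the members the class itself declares (no inherited ones).
func summarize_class(cls: String) -> String:
    if not ClassDB.class_exists(cls):
        return "Unknown class: %s" % cls
    var lines := PackedStringArray(["%s extends %s" % [cls, ClassDB.get_parent_class(cls)]])
    for sig in ClassDB.class_get_signal_list(cls, true):
        lines.append("signal %s" % sig["name"])
    for method in ClassDB.class_get_method_list(cls, true):
        var arg_names := PackedStringArray()
        for arg in method["args"]:
            arg_names.append(arg["name"])
        lines.append("func %s(%s)" % [method["name"], ", ".join(arg_names)])
    return "\n".join(lines)

func _init() -> void:
    # Class names are read from the arguments that follow "--" on the command line.
    for cls in OS.get_cmdline_user_args():
        print(summarize_class(cls))
    quit()
```

Run as godot --headless --script class_summary.gd -- CharacterBody2D, this prints a summary for just the requested class, which is the bounded kind of lookup an agent can afford to place in its context.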

Headless Assembly Beats Raw Format Generation

Godot scenes are stored as .tscn files, a readable text serialization format. A minimal scene looks like this:

[gd_scene load_steps=2 format=3 uid="uid://abc123"]

[ext_resource type="Script" path="res://player.gd" id="1_xyz"]

[node name="Player" type="CharacterBody2D"]
script = ExtResource("1_xyz")

[node name="Sprite2D" type="Sprite2D" parent="."]

One approach to LLM scene generation would ask the model to write .tscn text directly. Godogen avoids this entirely. Instead, scenes are assembled by headless GDScript tool scripts that build the node graph in memory and serialize via Godot’s own API:

@tool
extends SceneTree

func _init():
    var root = Node2D.new()
    root.name = "Game"

    var player = CharacterBody2D.new()
    player.name = "Player"
    root.add_child(player)
    player.owner = root  # must be set explicitly

    var packed = PackedScene.new()
    packed.pack(root)
    ResourceSaver.save(packed, "res://scenes/game.tscn")
    quit()

This keeps the LLM away from UID tracking, load step counting, and resource ID assignment, all of which the .tscn format requires to be internally consistent and which are tedious to generate correctly by hand. The model only has to describe the logical structure: node types, properties, script assignments. Godot handles the serialization.

But headless assembly introduces its own trap, one that Godogen explicitly calls out as requiring careful prompting: the owner property. When you build a scene programmatically and call PackedScene.pack(), only nodes whose owner is set to the scene root are included in the packed scene. A node without an owner is silently dropped. There is no exception and no warning; the node is simply absent when the scene loads. The Godot documentation mentions this, but not in a way that foregrounds it as a common failure mode. It requires knowing Godot’s serialization semantics well enough to anticipate exactly where the LLM will go wrong.
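
One way to guard against silently dropped nodes is to assign owners in a single recursive pass just before packing, instead of relying on every add_child() call being followed by an owner assignment. The sketch below is an illustration, not Godogen’s actual code: set_owner_recursive is a hypothetical helper, and the script assumes the res://scenes directory already exists.

```gdscript
extends SceneTree

# Hypothetical helper: make scene_root the owner of every descendant of node,
# so that PackedScene.pack() includes all of them.
func set_owner_recursive(node: Node, scene_root: Node) -> void:
    for child in node.get_children():
        child.owner = scene_root
        set_owner_recursive(child, scene_root)

func _init() -> void:
    var game := Node2D.new()
    game.name = "Game"

    var player := CharacterBody2D.new()
    player.name = "Player"
    game.add_child(player)

    var shape := CollisionShape2D.new()
    shape.name = "CollisionShape2D"
    player.add_child(shape)

    # Owners can only be assigned once the nodes are descendants of the root.
    set_owner_recursive(game, game)

    var packed := PackedScene.new()
    var err := packed.pack(game)
    if err == OK:
        err = ResourceSaver.save(packed, "res://scenes/game.tscn")
    if err != OK:
        push_error("Saving the scene failed with error code %d" % err)
    game.free()
    quit(0 if err == OK else 1)
```

If the tree contains instanced sub-scenes, a blanket pass like this would also re-own their internal nodes, so a real pipeline would likely need to skip those.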

Build Time and Runtime Are Separate Execution Contexts

Godot distinguishes between two execution environments that a generative pipeline must reason about separately. The first is the headless tool context where generation scripts run. The second is the game runtime where the produced project plays.

Several GDScript features only make sense in the runtime context. The @onready annotation defers a variable’s initialization until just before _ready() runs, after the node and its children have entered the scene tree:

@onready var health_bar: ProgressBar = $UI/HealthBar

A headless assembly script running before the game exists cannot meaningfully set up @onready behavior. Signal connections made programmatically at scene-build time are also not serialized by PackedScene.pack() unless they are created with the CONNECT_PERSIST flag, whereas connections made in the editor’s Node dock are always persistent and connections made in _ready() are re-established on every run. If generated code calls methods that presuppose a running scene tree, it either fails outright or produces state that does not persist into the actual game.
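
The persistence rule for build-time signal connections can be illustrated with a small build script. This is a sketch under assumptions, not taken from Godogen: the node names and the res://scenes/hint.tscn path are made up, and the button is wired to the built-in hide() method so that the example needs no extra script file.

```gdscript
extends SceneTree

func _init() -> void:
    var game := Control.new()
    game.name = "Game"

    var label := Label.new()
    label.name = "Hint"
    label.text = "Press the button to hide this hint"
    game.add_child(label)
    label.owner = game

    var button := Button.new()
    button.name = "HideButton"
    button.text = "Hide"
    game.add_child(button)
    button.owner = game

    # Without CONNECT_PERSIST this connection would exist only in the build
    # process and would not be written into the saved scene.
    button.pressed.connect(label.hide, CONNECT_PERSIST)

    var packed := PackedScene.new()
    var err := packed.pack(game)
    if err == OK:
        # SceneState reports what was actually serialized.
        print("Serialized connections: ", packed.get_state().get_connection_count())
        err = ResourceSaver.save(packed, "res://scenes/hint.tscn")
    if err != OK:
        push_error("Building the scene failed with error code %d" % err)
    game.free()
    quit(0 if err == OK else 1)
```

Printing the serialized connection count gives the pipeline a cheap way to confirm that the wiring survived packing before the scene is ever run.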

For an LLM, the practical problem is that the code it generates has to be correct in the right context. Syntax validity and runtime correctness are separate criteria, and the model conflates them without careful prompting. Godogen’s solution was to treat the two execution phases as distinct domains with different constraints and prompt for them separately, teaching the model which APIs are available at which phase rather than leaving it to infer the boundary.

The Evaluation Loop Cannot Rely Solely on the Generator

There is a documented tendency for LLMs to rate their own outputs more favorably than equivalent outputs from other sources. Research into LLM-as-judge evaluation shows this self-preference bias is measurable and consistent. For GDScript, this is especially problematic because there is no fast execution path: you cannot run a quick interpreter check on a two-line function. You have to invoke Godot.

Godogen uses headless Godot execution as the primary validation layer. A model can hallucinate an API method, but if Godot cannot find that method, the build fails and the error feeds back into the loop. This is execution-based validation rather than syntactic checking, and it catches a large class of errors definitively.

That still leaves a category of bugs that pass the parser and only manifest during gameplay. A CharacterBody2D with no CollisionShape2D child loads cleanly, then falls through the floor when physics runs. A signal connected to a method with a mismatched signature also loads without complaint; the error is only logged when the signal is actually emitted, and the handler never runs. An @onready reference to a node path that does not exist in the scene tree logs a node-not-found error at startup and leaves the variable null, so the failure surfaces later, the first time that property is used. Physics layer and mask misconfiguration causes objects to pass through each other with nothing logged.
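
Some of these failures can be caught structurally before any gameplay runs. The following is a minimal sketch of such a check, assuming a scene saved at res://scenes/game.tscn; it is not Godogen’s validator, and find_bodies_without_shapes is a hypothetical name. It flags 2D collision objects that have no collision shape or polygon child, and it will miss shapes that a script attaches at runtime.

```gdscript
extends SceneTree

# Hypothetical check: collect every CollisionObject2D (bodies and areas) that
# has no CollisionShape2D or CollisionPolygon2D among its direct children.
func find_bodies_without_shapes(node: Node, problems: Array[String]) -> void:
    if node is CollisionObject2D:
        var has_shape := false
        for child in node.get_children():
            if child is CollisionShape2D or child is CollisionPolygon2D:
                has_shape = true
                break
        if not has_shape:
            problems.append("%s (%s) has no collision shape" % [node.name, node.get_class()])
    for child in node.get_children():
        find_bodies_without_shapes(child, problems)

func _init() -> void:
    var packed := load("res://scenes/game.tscn") as PackedScene
    if packed == null:
        push_error("Could not load res://scenes/game.tscn")
        quit(1)
        return
    # Instantiating without adding to the tree keeps _ready() from running.
    var scene := packed.instantiate()
    var problems: Array[String] = []
    find_bodies_without_shapes(scene, problems)
    for problem in problems:
        push_error(problem)
    scene.free()
    quit(0 if problems.is_empty() else 1)
```

A non-zero exit code lets the surrounding loop treat a structural problem the same way it treats a failed build. Layer and mask mistakes, by contrast, still need the game to actually run.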

These bugs require behavioral evaluation to detect. Godogen includes a visual testing phase for this purpose, though the source article cuts off before describing it in full. Four major rewrites over a year suggest where the iteration concentrated: not in getting the first valid scene file out of the LLM, but in closing the gap between a parseable output and a game that behaves as described.

What This Generalizes To

The engineering choices in Godogen are specific to Godot and GDScript, but the underlying problems are not. Sparse training data means the model’s priors are calibrated to the wrong version of the target domain. Layered execution contexts mean the model conflates environments that the runtime treats as distinct. And evaluation requires execution, not just syntax checking, because the most consequential bugs are behaviorally silent.

Godogen’s solutions (a custom reference system with lazy loading, headless assembly via the engine’s own API rather than raw format generation, and execution-based validation) form a methodology that applies to other underrepresented runtimes and languages. The year of rewrites is not evidence of an unusually difficult problem; it is evidence of how much domain-specific engineering the problem actually contains, underneath what looks from the outside like asking an LLM to write some game code.