Why Generating Godot Games Required Solving Three Hard Engineering Problems

After a year of development and four major rewrites, Godogen went public on Hacker News as a working pipeline that takes a text prompt and produces a playable Godot 4 project, 2D or 3D assets included. The project sits at an intersection of problems that have no clean solutions: LLM code generation for an underrepresented language, a serialization format that encodes runtime-only state, and an evaluation loop that requires a running game engine to validate anything meaningful.

The three engineering bottlenecks the author identified are not unique to Godot. They are the canonical walls you hit when you try to aim an LLM at any sufficiently niche runtime, and the solutions Godogen developed are worth examining on their own terms.

The Training Data Problem With GDScript

GDScript looks like Python. It uses indentation-based blocks, the same if/for/while keywords, pass, super(), and a broadly similar expression syntax. The resemblance is enough that a language model trained predominantly on Python will fill gaps in its GDScript knowledge with Python idioms. Some of those idioms are rejected by the parser immediately; others parse cleanly and then fail at runtime in ways that are not obvious.

The differences that matter are specific and non-negotiable. Functions are declared with func instead of def. Boolean literals are true and false, not True and False. The null value is null, not None. The $NodeName shorthand is a get_node() call that runs when its expression is evaluated, so a member variable initialized directly from $ at class scope is evaluated when the instance is constructed, before the node has entered the scene tree, and ends up null unless it is marked @onready. Signals connect via my_node.my_signal.connect(handler) in Godot 4, a change from Godot 3’s string-based connect("signal_name", self, "method_name"), which Godot 4 rejects and which still appears in a large fraction of LLM-generated code.
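As a minimal sketch of the Godot 4 forms listed above, the script below assumes it is attached to a Node2D that has a child Button named "Button". The signal name, variable names, and handlers are hypothetical and chosen only for illustration.

```gdscript
extends Node2D

# Hypothetical signal, declared with a typed parameter.
signal health_changed(new_value: int)

var health: int = 100
var target: Node = null      # null, not None
var is_alive: bool = true    # true/false, not True/False

# Assumes a child node named "Button". @onready defers the lookup
# until just before _ready(), when the child is in the tree.
@onready var button: Button = $Button

func _ready() -> void:
    # Godot 4: a signal is a member, and connect() takes a Callable.
    button.pressed.connect(_on_button_pressed)
    health_changed.connect(_on_health_changed)

func _on_button_pressed() -> void:
    health -= 10
    is_alive = health > 0
    health_changed.emit(health)

func _on_health_changed(new_value: int) -> void:
    print("health is now ", new_value)
```

The Godot 3 equivalent of the first connection would have been button.connect("pressed", self, "_on_button_pressed"), which is the form a model with mostly Godot 3 training data tends to produce.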

Compounding the syntax problem is version drift. Godot 4 shipped in March 2023 with sweeping breaking changes from Godot 3. KinematicBody2D became CharacterBody2D. move_and_slide() changed its entire call signature, dropping the velocity argument in favor of a velocity property. Spatial became Node3D, Area became Area3D, yield() became await. Models trained before mid-2023 have predominantly Godot 3 knowledge. Models trained after have a mixture of both versions in their weights, which produces hybrid outputs that neither engine accepts.
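The renames above are easier to see side by side. The following is an illustrative Godot 4 script with the Godot 3 form of each line kept as a comment; the SPEED constant and the flash() helper are hypothetical, and "ui_left" and "ui_right" are Godot's built-in input actions.

```gdscript
# Godot 3: extends KinematicBody2D
extends CharacterBody2D

const SPEED := 200.0  # hypothetical value, in pixels per second

func _physics_process(_delta: float) -> void:
    var direction := Input.get_axis("ui_left", "ui_right")
    # velocity is now a built-in property of CharacterBody2D.
    velocity.x = direction * SPEED
    # Godot 3: velocity = move_and_slide(velocity)
    move_and_slide()

func flash() -> void:
    modulate = Color.RED
    # Godot 3: yield(get_tree().create_timer(0.2), "timeout")
    await get_tree().create_timer(0.2).timeout
    modulate = Color.WHITE
```

A hybrid output typically mixes the two columns, for example extending CharacterBody2D while still calling move_and_slide(velocity), which fails in Godot 4 because the method no longer takes arguments.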

The Godot 4 API has roughly 850 classes. The canonical documentation lives as XML in the Godot source repository under doc/classes/, each file describing a class with its methods, properties, signals, and constants in structured markup. Godogen’s approach to this is a custom reference system: a hand-written language spec, API documentation converted from that XML source, and a quirks database for behaviors that only emerge from using the engine rather than reading about it. Because stuffing 850 class definitions into a context window is not viable, the agent lazy-loads only the specific APIs it needs at runtime, pulling in reference material on demand as the generation proceeds.
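One way to picture the lazy-loading idea is a small cache that parses a class's XML file only when that class is first requested. This is a sketch under stated assumptions, not Godogen's implementation: the class name ApiDocs, the docs_dir argument, and the assumption that each class lives in a file named after it with methods declared as method elements carrying a name attribute are all illustrative.

```gdscript
# api_docs.gd (hypothetical helper)
class_name ApiDocs
extends RefCounted

var _docs_dir: String        # assumed: a local copy of Godot's doc/classes/
var _cache: Dictionary = {}  # class name -> Array of method names

func _init(docs_dir: String) -> void:
    _docs_dir = docs_dir

# Returns the method names documented for one class, parsing its XML
# file on first use and serving later requests from the cache.
func method_names(cls: String) -> Array:
    if _cache.has(cls):
        return _cache[cls]
    var names: Array = []
    var parser := XMLParser.new()
    if parser.open(_docs_dir.path_join(cls + ".xml")) == OK:
        while parser.read() == OK:
            var is_element := parser.get_node_type() == XMLParser.NODE_ELEMENT
            if is_element and parser.get_node_name() == "method":
                names.append(parser.get_named_attribute_value_safe("name"))
    _cache[cls] = names
    return names
```

A caller would create one instance, for example ApiDocs.new("res://doc/classes"), and ask for method_names("CharacterBody2D") only when the generated code is about to touch that class, so the cost of parsing scales with the classes actually used rather than with the whole API.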

This is a pragmatic form of retrieval-augmented generation applied with domain knowledge. The quirks database is the part that cannot be automated. The XML docs can tell you that move_and_slide() exists and what it returns. They cannot tell you that calling it before _physics_process runs produces unexpected results, or that certain collision layer configurations interact with the motion_mode property in non-obvious ways. That knowledge has to come from someone who has built things in Godot and noticed where the documentation stops being sufficient.

Build Time, Runtime, and the Owner Problem

Godot scenes are stored in .tscn files, a text format that describes a node hierarchy by listing node declarations with their parent paths:

[node name="Player" type="CharacterBody2D"]
script = ExtResource("1_abc12")

[node name="Sprite" type="Sprite2D" parent="."]
texture = ExtResource("2_def34")

Hand-editing this format is possible but fragile. A single inconsistency in resource IDs or parent paths produces a scene that either fails to load or loads silently broken. Godogen avoids this by generating scenes through headless GDScript tool scripts that build the node graph in memory and serialize it using Godot’s own PackedScene and ResourceSaver APIs:

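@tool
extends EditorScript

# EditorScript runs from inside the editor (File > Run). A headless run via
# `godot --headless --script` needs a script that extends SceneTree or
# MainLoop instead.
func _run():
    var scene = Node2D.new()
    scene.name = "GeneratedScene"

    var sprite = Sprite2D.new()
    sprite.name = "Sprite"
    scene.add_child(sprite)
    sprite.owner = scene  # without this line, the node vanishes on save

    # pack() and save() both return an Error code; check it instead of
    # assuming the file was written.
    var packed = PackedScene.new()
    var err = packed.pack(scene)
    if err == OK:
        err = ResourceSaver.save(packed, "res://generated_scene.tscn")
    if err != OK:
        push_error("Saving the generated scene failed: %s" % error_string(err))

    scene.free()  # the in-memory tree is not part of any scene, so free it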

The sprite.owner = scene line is the critical detail. When Godot serializes a PackedScene, it only includes nodes whose owner is set to the scene root. Forgetting it causes the node to exist during generation and disappear completely in the saved file, with no error or warning. add_child() handles tree membership correctly on its own, and the separate ownership requirement is mentioned only as a note in its documentation, which makes it surprising the first time you hit it.
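For deeper hierarchies the same rule applies to every descendant: the owner must be the scene root, not the immediate parent. The sketch below is one possible headless variant of the script above, assuming it is saved as build_scene.gd (a hypothetical name) and run with something like godot --headless --path <project_dir> --script res://build_scene.gd. The node names and the _set_owner_recursive helper are illustrative.

```gdscript
# build_scene.gd (hypothetical file name)
extends SceneTree

func _init() -> void:
    var root := Node2D.new()
    root.name = "GeneratedScene"

    var player := CharacterBody2D.new()
    player.name = "Player"
    root.add_child(player)

    var sprite := Sprite2D.new()
    sprite.name = "Sprite"
    player.add_child(sprite)

    # Every descendant is owned by the scene root, including the
    # grandchild "Sprite", whose parent is "Player".
    _set_owner_recursive(root, root)

    var packed := PackedScene.new()
    var err := packed.pack(root)
    if err == OK:
        err = ResourceSaver.save(packed, "res://generated_scene.tscn")
    if err != OK:
        push_error("Scene build failed: %s" % error_string(err))

    root.free()
    # A non-zero exit code lets the calling pipeline detect the failure.
    quit(0 if err == OK else 1)

func _set_owner_recursive(node: Node, scene_root: Node) -> void:
    for child in node.get_children():
        child.owner = scene_root
        _set_owner_recursive(child, scene_root)
```

Setting ownership in one pass after the tree is built, rather than after each add_child() call, removes the chance of forgetting a single node, at the cost of also claiming nodes that were meant to stay out of the saved scene.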

The second boundary Godogen had to map carefully is the distinction between what is available at build time and what only exists at runtime. The @onready annotation defers a member variable’s initializer until just before _ready() is called, once the node and its children have entered the scene tree. LLMs often omit the annotation on a class-scope initializer that uses $, or read an @onready variable from _init() or from another member initializer; all of these are evaluated when the instance is constructed, when the node tree does not yet exist. Signal connections made between nodes in a tool script are established during scene construction, not during gameplay, and are only written into the saved scene if they are made with the CONNECT_PERSIST flag. The model has to understand which operations belong to which phase, and that understanding cannot be derived from the GDScript language specification alone because the phases are a property of the engine’s execution model, not the language’s.
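The phase distinction can be sketched in a few lines. This example assumes a Node2D with a child Sprite2D named "Sprite"; the names are illustrative.

```gdscript
extends Node2D

# Construction time: this initializer would run before the node is in
# the tree, so the lookup fails and the variable ends up null.
# var sprite_too_early = $Sprite

# Deferred: assigned just before _ready(), when children are in the tree.
@onready var sprite: Sprite2D = $Sprite

func _init() -> void:
    # Still construction time, so the @onready assignment has not run.
    if sprite == null:
        print("sprite is not assigned yet")

func _ready() -> void:
    # Safe: the scene tree below this node now exists.
    sprite.modulate = Color(1, 1, 1, 0.5)
```

The language accepts both variable declarations; only the engine's initialization order makes one of them wrong, which is why this class of error survives a syntax check.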

The Evaluation Loop Problem

For web development, an automated eval loop is cheap. Generate code, run a linter, run tests, parse the output, feed errors back to the model. For GDScript, the options are considerably more constrained. The --check-only flag, which is used together with --script (for example, godot --headless --check-only --script path/to/script.gd), parses a single script and reports syntax errors and some type errors without running anything. Beyond that, validating that generated code behaves correctly requires spawning a Godot process, loading the scene, running it, and observing what happens.

Visual validation is the hardest part. A platformer that generates without errors but has the player starting inside a wall, or a collision shape that never overlaps the sprite, or a camera that follows the scene origin instead of the player: none of these fail in any way that grep or a unit test framework can catch. They require looking at the running game. This is the problem that Rosebud AI, a web-based tool for generating Phaser.js games, sidesteps almost entirely. JavaScript in the browser has no binary dependency, the runtime is the tab, and the feedback loop is nearly instantaneous. Godot requires a compiled binary, a display server for anything visual, and enough project structure for Godot to consider the directory a valid project before it will run anything.

The GUT testing framework is the standard community solution for Godot unit tests, but it adds its own dependency and project structure requirements that make it awkward to wire into a generation pipeline that starts from nothing. Godogen’s approach is to run generated scenes in a validation script that instantiates the scene, checks that expected nodes exist with the correct types, and reports failures before a human needs to look at anything. Visual correctness still requires a human in the loop, which is an honest acknowledgment of what automated eval can and cannot verify.
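A structural check of this kind might look like the following sketch. It is not Godogen's validation script: the file name, the scene path, and the EXPECTED table are hypothetical, and the script assumes it is run headless with godot --headless --path <project_dir> --script res://validate_scene.gd.

```gdscript
# validate_scene.gd (hypothetical file name)
extends SceneTree

const SCENE_PATH := "res://generated_scene.tscn"
# Node path (relative to the scene root) -> class it must be or inherit from.
const EXPECTED := {
    ".": "Node2D",
    "Sprite": "Sprite2D",
}

func _init() -> void:
    var failures := check_scene(SCENE_PATH, EXPECTED)
    for message in failures:
        printerr(message)
    # Exit code 1 signals failure to the calling pipeline.
    quit(1 if failures.size() > 0 else 0)

func check_scene(path: String, expected: Dictionary) -> Array[String]:
    var failures: Array[String] = []
    var packed := load(path) as PackedScene
    if packed == null:
        failures.append("could not load scene: %s" % path)
        return failures
    var root := packed.instantiate()
    for key in expected:
        var node := root.get_node_or_null(NodePath(str(key)))
        if node == null:
            failures.append("missing node: %s" % key)
        elif not node.is_class(expected[key]):
            failures.append("%s is %s, expected %s" % [key, node.get_class(), expected[key]])
    root.free()
    return failures
```

Because the scene is instantiated but never added to the tree, _ready() does not run, so this only verifies structure. Anything that depends on behavior or on what appears on screen still needs a running game.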

What This Generalizes To

The three problems Godogen had to solve are structurally identical to the problems you encounter targeting any niche runtime with an LLM: a language or API underrepresented in training data, execution semantics that differ from what the language syntax implies, and a validation environment that is more expensive to stand up than a standard test runner. GDScript and the Godot engine are a particularly concentrated version of this because all three problems are present and severe simultaneously, but the pattern appears whenever you try to generate code for an embedded scripting language, a domain-specific runtime, or a platform where the relevant training data on the public web is thin.

The custom reference system with lazy-loaded API documentation is a transferable approach. The quirks database is the non-transferable part that has to be rebuilt from experience for each target. After a year and four rewrites, that accumulated knowledge of where Godot’s engine behavior diverges from what its documentation describes is probably the most valuable artifact the project produced.