Three Engineering Problems That Stand Between a Text Prompt and a Playable Godot Game
Source: hackernews
After a year of development and four major rewrites, Godogen generates complete Godot 4 games from text prompts. The output is a playable project: GDScript source, a scene tree, assets, the full thing. Getting there required solving three specific problems that are each worth understanding in detail, because they surface in any serious attempt to generate engine-specific code with an LLM.
GDScript Is Python-Shaped But Not Python
The first bottleneck is training data. GDScript has a Python-like syntax, which sounds like an advantage until you realize it creates a specific and predictable failure mode: the model fills gaps in its GDScript knowledge with Python idioms that compile in Python but not in GDScript, or that compile in GDScript but behave differently at runtime.
A few concrete examples. In Godot 4, connecting a signal in code looks like this:
$Timer.timeout.connect(_on_timer_timeout)
In Godot 3, the same operation used a different API:
$Timer.connect("timeout", self, "_on_timer_timeout")
Training data mixes both versions, so models produce code that blends them or defaults to whichever version appears more often in the corpus. The result compiles in neither version, or compiles in one and fails silently in the other.
Godot 4 also exposes roughly 850 built-in classes to GDScript, and the transition from Godot 3 to Godot 4 renamed many of them. KinematicBody2D became CharacterBody2D. Spatial became Node3D. move_and_slide() in Godot 3 took a velocity vector as an argument and returned the modified vector; in Godot 4, velocity is a property on CharacterBody2D that you set before calling the method with no arguments. None of these changes is unreasonable, but each one is a landmine for a model whose training data spans both engine generations.
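To make the move_and_slide() difference concrete, here is a minimal sketch of Godot 4 movement code (illustrative only; the input actions and the constant gravity value are assumptions, not taken from Godogen):

```gdscript
# Godot 4 style: `velocity` is a built-in property on CharacterBody2D,
# and move_and_slide() takes no arguments. In Godot 3 you instead passed
# a velocity vector in and captured the returned, modified vector.
extends CharacterBody2D

const SPEED := 300.0
const GRAVITY := 980.0  # simple constant gravity for illustration

func _physics_process(delta: float) -> void:
	var direction := Input.get_axis("ui_left", "ui_right")
	velocity.x = direction * SPEED
	velocity.y += GRAVITY * delta
	move_and_slide()
```

A model that blends the two generations will typically write move_and_slide(velocity), which is a parse-time error in Godot 4.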
Godogen addresses this with a custom reference system: a hand-written language spec, the full Godot 4 API documentation converted from the engine’s XML source, and a quirks database for engine behaviors that don’t appear anywhere in the official docs. Because loading all 850 classes at once would overflow a context window, the agent lazy-loads only the specific API references it needs at runtime. This is essentially a retrieval-augmented generation setup, but one built specifically around Godot’s class hierarchy rather than a general-purpose vector database.
Godot’s Scene System Is Built Around the Editor
The second problem is more subtle and reveals something important about how Godot 4’s architecture is designed.
Godot’s text scene format, .tscn, is an INI-like serialization format that the Godot editor writes and reads. It stores the node hierarchy, property overrides, external resource references, and signal connections. Hand-editing it is possible but fragile; a single malformed line prevents the scene from loading, and the format requires precise cross-references between sections. Generating it by having an LLM produce the raw text is asking for trouble.
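For readers who have never opened one, a minimal .tscn file looks roughly like this (a hand-made illustration, not output from Godogen; the uid and resource ids are placeholders):

```
[gd_scene load_steps=2 format=3]

[ext_resource type="Script" path="res://player.gd" id="1_player"]

[node name="Player" type="CharacterBody2D"]
script = ExtResource("1_player")

[node name="Timer" type="Timer" parent="."]

[connection signal="timeout" from="Timer" to="." method="_on_timer_timeout"]
```

Note the cross-references: the ExtResource("1_player") call must match the id in the [ext_resource] header, and the [connection] section must name nodes that exist in the hierarchy. Any mismatch prevents the scene from loading, which is exactly why generating this text directly with an LLM is fragile.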
Godogen sidesteps this by having headless scripts build the node graph in memory using Godot’s own runtime API, then serializing to .tscn using the engine’s built-in export functions. This is a cleaner approach: rather than generating the format that the editor produces, you use the same API the editor uses internally.
But that approach introduces a different problem. Several GDScript features are only available after a node has entered the scene tree. @onready variables, for instance, use this annotation:
@onready var sprite: Sprite2D = $Sprite2D
The @onready annotation defers the assignment until _ready() is called, which happens after the node and all its children have entered the tree. If you try to call get_node() during a headless build script, before the scene tree exists, the call returns null. The node is in memory, but the tree isn’t live yet.
Signal connections stored in .tscn files have the same issue. They exist as metadata in the serialized scene but only become live signal bindings when the scene is instantiated and the nodes enter the tree.
Teaching the model which APIs are available at which phase, build time versus runtime, required precise prompting and explicit documentation of Godot’s initialization sequence. There is also a separate gotcha: every node built in a headless script needs its owner property set to the scene root, or it silently vanishes when the scene is saved. This is not documented prominently anywhere; it is the kind of thing you discover by losing nodes and working backwards.
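The headless build approach can be sketched in a few lines of GDScript (a minimal illustration under stated assumptions: the node names and output path are made up, and error handling is reduced to a single check):

```gdscript
# Build a scene in memory with the runtime API, then serialize it with the
# engine's own export functions. Run via: godot --headless --script <this file>
extends SceneTree

func _init():
	var root := CharacterBody2D.new()
	root.name = "Player"

	var sprite := Sprite2D.new()
	sprite.name = "Sprite2D"
	root.add_child(sprite)

	# The gotcha: every descendant must have `owner` pointed at the scene
	# root, or PackedScene.pack() silently drops it from the saved scene.
	sprite.owner = root

	var scene := PackedScene.new()
	if scene.pack(root) == OK:
		ResourceSaver.save(scene, "res://player.tscn")
	quit()
```

Because the engine writes the .tscn itself, all the cross-referencing between sections is handled by the same serialization code the editor uses.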
The Evaluation Loop Problem Is Structural, Not Incidental
The third bottleneck is what the author calls the evaluation loop, and it is the one with the broadest implications beyond Godot specifically.
A code-generation agent is structurally biased toward trusting its own output. It generated the code, it can read the code, and it tends to evaluate the code against its own internal model of what the code should do. This is not the same as running the code and observing what actually happens. An agent reviewing its own GDScript for bugs will miss errors that only manifest at runtime: a null node reference because a path doesn’t match the scene tree, a type mismatch caught by Godot’s type checker at load time, a signal that was connected to a method that doesn’t exist.
The only reliable evaluation loop is execution: run the game, capture output and errors, feed the results back into the model. Godot’s headless mode makes this possible in an automated pipeline. Running godot --headless --path /path/to/project executes the full scene tree, physics engine, and scripting runtime without a window. Errors surface as console output that can be captured and re-injected into the prompt.
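The shape of that loop is simple enough to sketch in shell (a hypothetical harness, not Godogen's actual pipeline; the project path, timeout, and log handling are assumptions):

```shell
# Run the generated project headlessly, capture errors, and feed them back.
# --quit-after bounds the run to a fixed number of frames so the loop
# terminates even if the game has no exit condition.
godot --headless --path ./generated_project --quit-after 600 2> run_errors.log

if [ -s run_errors.log ]; then
	# Non-empty error log: re-inject the captured errors into the next
	# generation prompt (stubbed here as a plain cat).
	cat run_errors.log
fi
```

Script errors, missing-node warnings, and type-check failures all surface on stderr in this mode, which is what makes the loop automatable.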
This is not a Godot-specific insight. Any serious code-generation pipeline for a system with a runtime, whether a game engine, a database, or a web framework, needs an execution feedback loop. The model’s internal verification is insufficient for catching the class of errors that only appear when the code runs against the real system. Godogen builds this loop in, treating it as a first-class part of the generation pipeline rather than an optional testing step.
What This Reveals About LLM Code Generation for Engine-Specific Languages
There are a handful of projects attempting to use LLMs to generate Unity or Godot code, most of them thin wrappers around direct API calls that let users ask for snippets in chat. Godogen is doing something more ambitious: a full generation pipeline with reference injection, headless execution testing, and a feedback loop.
The problems it had to solve are not Godot-specific in principle. Any engine-specific language faces the training data scarcity issue. Any system with a serialization format designed for tool output rather than hand-authoring creates the same build-time versus runtime distinction. Any code-generation agent without an execution loop will produce code that looks right but fails in ways the model cannot anticipate from static analysis alone.
The architectural choice to use the engine’s own runtime API to build scenes programmatically, rather than generating the serialization format directly, is worth noting as a general pattern. When a file format is meant to be written by a tool rather than by humans, the right approach for code generation is usually to use the same tool API that the official editor uses, not to generate the format directly. The engine’s serialization logic already handles the edge cases that would break hand-generated output.
Godogen is open source and available on GitHub. The implementation choices, refined across four rewrites of the same core problem, make it worth reading as a case study in what it actually takes to get reliable, runnable output from an LLM on a target platform with sparse training data.