Three Engineering Problems That Stand Between an LLM and a Playable Godot Game

There is a project called Godogen that takes a text prompt, generates 2D and 3D assets, writes GDScript, and produces a complete playable Godot 4 project. Its author spent about a year on it across four major rewrites. Each rewrite was not a refinement of the same approach; each one was a response to a new category of failure that the previous architecture could not handle.

What makes this interesting is not that it works. It is which specific problems it had to solve, and why those problems don’t arise in the browser-based game generators that are more commonly demonstrated.

The GDScript Training Data Problem

Most code generation benchmarks measure performance on Python, JavaScript, TypeScript, Java, Go, and Rust. These languages have years of public code on GitHub, Stack Overflow questions, documentation, and tutorial content that ended up in LLM training sets. GDScript has almost none of that. Godot’s documentation team publishes class references as XML files in the engine repository, and the community is small enough that the total corpus of public GDScript is a rounding error compared to Python.

The surface problem is volume. The deeper problem is that GDScript’s Python-like syntax makes the volume issue worse, not better. A model with strong Python priors will generate code that looks syntactically close to correct GDScript and is semantically broken in ways that compile without errors.

Consider signal connections. The majority of any GDScript training data that exists predates Godot 4.0, whose stable release in March 2023 replaced string-based signal connections with a callable-based syntax:

# Godot 3 (what most training data contains):
$Timer.connect("timeout", self, "_on_timer_timeout")

# Godot 4 (what actually works):
$Timer.timeout.connect(_on_timer_timeout)

A model generating Godot 4 code will frequently produce Godot 3 signal syntax with Godot 4 class names, creating hybrid code that satisfies neither version. The same pattern appears with KinematicBody2D (renamed to CharacterBody2D in Godot 4), the move_and_slide() API (which changed from taking velocity as an argument to reading it from a class property), and the @export and @onready annotations (which gained their @ prefix in Godot 4).
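
As a minimal sketch of what consistent Godot 4 code looks like, the script below puts the renamed class, the annotations, the property-based move_and_slide(), and the callable signal connection side by side, with the Godot 3 spelling noted in comments. It assumes a scene whose root is a CharacterBody2D with a child Timer node named "Timer"; the variable names and the use of the default ui_left/ui_right input actions are illustrative choices, not taken from Godogen.

```gdscript
extends CharacterBody2D  # Godot 3: extends KinematicBody2D

@export var speed: float = 200.0    # Godot 3: export var speed = 200.0
@onready var timer: Timer = $Timer  # Godot 3: onready var timer = $Timer

func _ready() -> void:
    # Godot 3: timer.connect("timeout", self, "_on_timer_timeout")
    timer.timeout.connect(_on_timer_timeout)

func _physics_process(_delta: float) -> void:
    # velocity is now a built-in property of CharacterBody2D.
    velocity.x = Input.get_axis("ui_left", "ui_right") * speed
    # Godot 3: velocity = move_and_slide(velocity)
    move_and_slide()

func _on_timer_timeout() -> void:
    print("timer fired")
```

Mixing any one of the commented Godot 3 forms into this script is enough to produce the hybrid code described above.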

Pure Python idioms cause a separate failure class. GDScript has no list comprehensions, no tuple returns, no True/False/None (they are true/false/null), and no def keyword (functions are declared with func). Vector2 (float) and Vector2i (int) are distinct types whose method sets only partly overlap: Vector2i has no normalized(), for example, so code that treats the two as interchangeable breaks. Most of these are not warnings; they are parse or type errors that prevent the script from loading.
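
The following sketch shows one GDScript equivalent for each of the Python idioms listed above. The function and variable names are made up for illustration, and returning a Dictionary in place of a tuple is only one of several reasonable conventions.

```gdscript
extends Node

# Python: squares = [n * n for n in range(count)]
func squares_up_to(count: int) -> Array[int]:
    var squares: Array[int] = []
    for n in range(count):
        squares.append(n * n)
    return squares

# Python: return lo, hi
# GDScript has no tuples, so the pair is returned as a Dictionary.
func bounds(values: Array[float]) -> Dictionary:
    var lo := INF
    var hi := -INF
    for v in values:
        lo = minf(lo, v)
        hi = maxf(hi, v)
    return {"lo": lo, "hi": hi}

func _ready() -> void:
    var found = null   # Python: None
    var done := false  # Python: False
    var samples: Array[float] = [1.0, 4.0, 2.0]

    # Vector2i has no normalized(); convert explicitly to Vector2 first.
    var cell := Vector2i(3, 4)
    var direction := Vector2(cell).normalized()

    print(squares_up_to(5))
    print(bounds(samples))
    print(found, " ", done, " ", direction)
```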

Godogen’s solution was to build a custom reference system: a hand-written language spec covering GDScript’s divergence from Python, API documentation converted from Godot’s XML source, and a quirks database for engine behaviors that no documentation describes. Because loading all ~850 Godot 4 classes into context at once would overflow any reasonable context window, the agent lazy-loads only the specific APIs required by the game being generated.

The lazy-loading architecture is worth noting. It is not a prompt optimization; it is a hard architectural requirement imposed by the size of the Godot API surface. The agent has to know in advance which classes a game will need before it has written any code, which requires a planning stage that outputs a class dependency list before generation begins.
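
To make the idea concrete, here is a rough sketch of loading API information only for the classes a plan names. It is not Godogen's mechanism: the project converts Godot's XML class references, whereas this stand-in asks the engine's own ClassDB reflection for method names. The PLANNED_CLASSES list is a hypothetical output of the planning stage, and the script assumes it is launched with godot --headless --script, which requires a script that extends SceneTree.

```gdscript
extends SceneTree

# Hypothetical output of a planning stage: the classes this game needs.
const PLANNED_CLASSES := ["CharacterBody2D", "Timer"]

func _init() -> void:
    for cls_name in PLANNED_CLASSES:
        if not ClassDB.class_exists(cls_name):
            push_warning("unknown class in plan: %s" % cls_name)
            continue
        print("%s (inherits %s)" % [cls_name, ClassDB.get_parent_class(cls_name)])
        # The second argument (true) skips inherited methods, so each
        # class contributes only the methods it declares itself.
        for method in ClassDB.class_get_method_list(cls_name, true):
            var arg_names := PackedStringArray()
            for arg in method["args"]:
                arg_names.append(arg["name"])
            print("  %s(%s)" % [method["name"], ", ".join(arg_names)])
    quit()
```

The point of the sketch is the shape of the flow: the class list exists before any game code does, and only those classes are expanded into reference text.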

The Build-Time vs. Runtime State Boundary

Generating .tscn files directly is brittle enough that Godogen avoids it entirely. Godot’s scene format is an INI-like text serialization with precise structural requirements:

[gd_scene load_steps=2 format=3 uid="uid://abc123"]

[ext_resource type="Script" path="res://player.gd" id="1_abc"]

[node name="Player" type="CharacterBody2D"]
script = ExtResource("1_abc")

The load_steps count has to stay in sync with the ext_resource and sub_resource declarations: the engine writes it as their total plus one. UIDs must be unique across the project. Every ID in a resource declaration must match the reference string exactly. A stale load_steps value goes unreported. A mismatched ID or UID produces a load failure with an unhelpful error. Generating text that satisfies all of these constraints reliably is difficult, and failures compound because a broken scene file produces no useful diagnostic.

The alternative is to generate headless builder scripts that use the Godot engine’s own serialization API:

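@tool
extends EditorScript

# Builds a small scene in memory and lets the engine serialize it.
func _run() -> void:
    var scene := Node2D.new()
    scene.name = "GeneratedScene"

    var sprite := Sprite2D.new()
    sprite.name = "Sprite2D"
    scene.add_child(sprite)
    sprite.owner = scene  # this line is not optional: pack() skips unowned nodes

    var packed := PackedScene.new()
    var err := packed.pack(scene)
    if err == OK:
        err = ResourceSaver.save(packed, "res://generated_scene.tscn")
    if err != OK:
        push_error("Scene build failed: %s" % error_string(err))
    scene.free()  # the tree never entered the editor, so free it manually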

This delegates format correctness to the engine. PackedScene.pack() and ResourceSaver.save() write valid .tscn files by definition. But the builder script approach exposes a different category of problem: certain engine features only exist at runtime, and the boundary between what is available during headless construction and what requires a live scene tree is not clearly documented anywhere.

The @onready annotation is the most common example. It defers variable assignment until just before _ready() runs, which only happens when a node enters a live scene tree. In headless construction, where nodes are assembled off-tree, _ready() never fires. Variables annotated with @onready remain null throughout the build process. This produces no error during construction, no error during serialization, and no error on reload. The null reference surfaces at runtime, often several frames into the game, when the variable is first accessed.

The more obscure version of this problem involves node ownership. Calling add_child(node) adds a node to the in-memory tree but does not set its owner property. PackedScene.pack() only serializes nodes whose owner is set to the scene root. The save operation succeeds, the file is written, no error is reported. On reload, the node simply does not exist. There is no warning, no message in the Godot console, no indication that anything went wrong. The only way to discover this is to reload the scene and notice the node is absent, then trace backward to the construction code that omitted node.owner = scene_root.
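
One way to guard against this, sketched below under stated assumptions, is to assign ownership recursively after the tree is built and then reload the saved file to confirm the nodes survived. The helper name _claim_ownership and the output path are hypothetical, the script assumes a tree built entirely in code (no instanced sub-scenes, whose children should not be re-owned), and it extends SceneTree so that it can be launched with godot --headless --path <project> --script.

```gdscript
extends SceneTree

const OUT_PATH := "res://generated_scene.tscn"  # hypothetical output path

func _init() -> void:
    var scene_root := Node2D.new()
    scene_root.name = "GeneratedScene"
    var body := CharacterBody2D.new()
    body.name = "Player"
    scene_root.add_child(body)
    var shape := CollisionShape2D.new()
    shape.name = "CollisionShape2D"
    body.add_child(shape)

    # Without this call, pack() would save only the root node.
    _claim_ownership(scene_root, scene_root)

    var packed := PackedScene.new()
    var err := packed.pack(scene_root)
    if err == OK:
        err = ResourceSaver.save(packed, OUT_PATH)
    scene_root.free()
    if err != OK:
        push_error("build failed: %s" % error_string(err))
        quit(1)
        return

    # Round trip: reload from disk, bypassing the cache, and check the nodes.
    var reloaded_scene := ResourceLoader.load(
            OUT_PATH, "", ResourceLoader.CACHE_MODE_IGNORE) as PackedScene
    var reloaded := reloaded_scene.instantiate()
    var survived := reloaded.has_node("Player/CollisionShape2D")
    reloaded.free()
    if not survived:
        push_error("nodes were dropped during serialization")
    quit(0 if survived else 1)

# Hypothetical helper: make every descendant owned by the scene root.
func _claim_ownership(node: Node, scene_root: Node) -> void:
    for child in node.get_children():
        child.owner = scene_root
        _claim_ownership(child, scene_root)
```

The round-trip check turns a silent omission into a nonzero exit code, which is something an agent loop can act on.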

Teaching a model where this boundary sits required building a phase-specific API reference that explicitly maps each Godot API call to whether it is available during headless construction, only at runtime, or at both. The $NodePath shorthand is another example: it resolves correctly against the in-memory tree during construction but returns null if used in _init(), before the node's children exist, and the only trace is a generic "Node not found" error in the log.
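
A small script makes the _init() case visible. This is an illustrative sketch, assuming it is attached to the root of a scene that has a child node named "Sprite2D"; it uses get_node_or_null() in place of the $ shorthand so that the failed lookup does not also log an error.

```gdscript
extends Node2D

func _init() -> void:
    # Runs when the object is constructed. When the scene is instantiated,
    # the children declared in the scene file are not attached yet.
    var sprite := get_node_or_null("Sprite2D")
    print("_init sees Sprite2D: ", sprite != null)

func _ready() -> void:
    # Runs after this node and its children have entered the scene tree.
    var sprite := get_node_or_null("Sprite2D")
    print("_ready sees Sprite2D: ", sprite != null)
```

Under those assumptions the first lookup should report false and the second true, which is exactly the kind of phase dependence a per-call reference has to record.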

The Evaluation Loop and Three Tiers of Correctness

The third problem is that a coding agent evaluating its own output is working with the same miscalibrated priors that produced the bug. If a model’s internal representation of GDScript is Python-inflected, it will not detect that True should be true by reviewing its own code, because to the model both look correct. External ground-truth evaluation is the only way to catch systematic failures.

Godogen runs generated projects via godot --headless and captures the console and error streams. A virtual framebuffer allows screenshot capture for failures that produce no log output at all: a black screen where the scene should be, or a camera pointed at nothing, or a node positioned outside the viewport bounds.

But this still only covers two of three correctness tiers. Structural correctness means the project compiles and loads. Functional correctness means the game loop behaves as described. Experiential correctness means the physics feel right, the jump arc is reasonable, the platform spacing is playable. The first two can be partially automated. The third cannot.

This is where native-engine generation diverges most sharply from browser-based alternatives. A tool like Rosebud AI generates Phaser.js games using JavaScript, where training data is abundant, the Godot 3/4 version contamination problem does not exist, and each evaluation iteration runs in milliseconds in a browser tab. Every iteration of a Godogen-generated project requires spawning the Godot binary, waiting for engine initialization, running the project, and capturing output. The cost of a single evaluation cycle shapes the feasible architecture. Discovering a new failure class costs more because each experiment takes longer.

A CharacterBody2D with no CollisionShape2D child will compile cleanly, load in a headless run with nothing in the log (the editor flags the node with a configuration warning, but no editor is open here), and pass the first evaluation tier. The player will fall through the floor when physics runs. A signal connected to a method with a mismatched parameter signature connects without complaint in Godot 4 and fails only when the signal is first emitted. Physics layer and mask misconfiguration produces objects that pass through each other with nothing in the log. All of these pass structural evaluation and fail functional evaluation. Catching them requires running the game long enough for the relevant physics frame or input event to occur.
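
Some of these failures can be pulled forward into a cheaper static pass. The sketch below is one hedged example, not Godogen's evaluator: a headless script that instantiates a saved scene and reports 2D physics bodies with no collision shape child. The scene path and function names are hypothetical, and the check is a heuristic, since shapes can also be attached from code at runtime.

```gdscript
# Run with: godot --headless --path <project> --script check_scene.gd
extends SceneTree

const SCENE_PATH := "res://generated_scene.tscn"  # hypothetical scene path

func _init() -> void:
    var packed := load(SCENE_PATH) as PackedScene
    if packed == null:
        push_error("could not load %s" % SCENE_PATH)
        quit(1)
        return
    var scene_root := packed.instantiate()
    var problems := _bodies_without_shapes(scene_root, str(scene_root.name))
    for path in problems:
        push_error("physics body has no collision shape: %s" % path)
    scene_root.free()
    quit(1 if not problems.is_empty() else 0)

# Walks the tree and collects paths of physics bodies lacking a shape child.
func _bodies_without_shapes(node: Node, path: String) -> Array[String]:
    var found: Array[String] = []
    if node is PhysicsBody2D and not _has_shape_child(node):
        found.append(path)
    for child in node.get_children():
        found.append_array(
                _bodies_without_shapes(child, "%s/%s" % [path, child.name]))
    return found

func _has_shape_child(node: Node) -> bool:
    for child in node.get_children():
        if child is CollisionShape2D or child is CollisionPolygon2D:
            return true
    return false
```

A check like this only narrows the gap. Layer and mask mistakes, or a shape that exists but is the wrong size, still require running the game.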

Four rewrites over a year is what it took to discover and handle these failure classes empirically. No rewrite was optional; each was the response to a category of bug that the previous architecture could not catch before shipping a broken game. The three problems the project identifies are not independent challenges. They are interlocking constraints that have to be addressed together: a model that doesn’t know GDScript will generate broken code; a scene construction approach that ignores the runtime boundary will produce nodes that silently vanish; an evaluation loop that trusts the model’s own judgment will not catch either of those failures.

The result, for anyone building in this space, is that generating playable games from text prompts in a native engine requires more infrastructure than it requires model capability. The model is not the bottleneck. The scaffolding is.