
When the Training Data Isn't There: Engineering an LLM Pipeline for GDScript

Source: hackernews

LLMs are trained on web-scale data, which means they perform well on languages with web-scale representation. Python, JavaScript, Java, and TypeScript collectively dominate public code repositories. GDScript, the scripting language for the Godot game engine, does not.

This is the core constraint that Godogen spent a year solving across four major rewrites. Godogen is a pipeline that takes a text prompt and produces a complete, playable Godot 4 game: GDScript logic, .tscn scene files, the full project structure. The engineering required to make this reliable centers on compensating for what LLMs genuinely don’t know about GDScript, rather than on prompt cleverness.

The Training Data Distribution Problem

The Godot 4 class reference contains approximately 850 documented classes covering more than 4,000 methods across physics, rendering, audio, networking, UI, and more. The full documentation runs to several megabytes. More critically, Godot 4 was released in March 2023, meaning most GDScript content on the web at the time large LLMs were trained still targets the Godot 3 API, which breaks in significant ways when applied to Godot 4 projects.

The MultiPL-E benchmark, which extended HumanEval across 18 programming languages, found that model performance correlates tightly with estimated training data volume. Languages with thin representation score dramatically lower than Python or JavaScript. GDScript sits well below even Lua on any reasonable training data estimate, and the Godot 3 to 4 version split means that much of the GDScript content that does exist in training corpora teaches the wrong API.

The symptoms are predictable: models hallucinate Python idioms that GDScript doesn’t support, emit Godot 3 API calls that were renamed or restructured in Godot 4, invent method names that sound plausible but don’t exist on the class they’re attached to, and get signal connection syntax wrong.

Consider signal connections. Godot 3 used string-based method references:

# Godot 3 — deprecated
$Button.connect("pressed", self, "_on_button_pressed")

Godot 4 replaced this with callable objects:

# Godot 4 — correct
$Button.pressed.connect(_on_button_pressed)

# With a lambda
$Button.pressed.connect(func(): do_something())

A model that has seen more Godot 3 content than Godot 4 content will emit the former. The code will parse but fail at runtime, and the failure message is not always informative about why.

Compensating with a Custom Reference System

Godogen’s solution treats documentation as infrastructure. The project builds a hand-written language specification, converts Godot’s XML API source into a queryable format, and maintains what the author describes as a “quirks database” of engine behaviors that official documentation doesn’t capture: the kind of knowledge that comes from building things until they break.

The context window constraint is the next obstacle. Including all 850 class references would consume the entire prompt budget for any practical model. GPT-4o’s 128K token context is large, but packing it with API reference displaces the task description, examples, and generated code. The cost also scales directly: at roughly $5 per million input tokens, including the full Godot API in every generation step becomes expensive for a pipeline that iterates multiple times per game.

The solution is lazy-loading. The agent identifies which classes it needs for a given game description, then retrieves only those API definitions at query time. For a typical 2D platformer this comes down to a small, predictable set: CharacterBody2D, Sprite2D, CollisionShape2D, Area2D, Camera2D, AnimationPlayer, Timer, Input. Injecting just these, with precise method signatures and parameter types, fits comfortably within a few thousand tokens and prevents the model from inventing methods on classes it half-remembers.

This approach has parallels in retrieval-augmented generation research. What Godogen calls lazy-loading is structurally similar to schema-first generation: the model first declares which node types it plans to use, and the system injects exactly those API definitions before code generation begins. Cursor’s @Docs feature follows the same principle for arbitrary documentation, but Godogen applies it to a domain where the consequences of getting it wrong are concrete and immediate.

Build Time and Runtime Are Different Phases

GDScript has an annotation system with no Python equivalent, and understanding when annotations take effect is essential for generating correct code.

@onready is the common case:

@onready var player: CharacterBody2D = $Player
@onready var health_bar: ProgressBar = %HealthBar

This annotation defers the variable assignment until _ready() fires, which is the moment the node enters the scene tree and the full subtree is live. Without it, $Player resolves to null during object construction because child nodes haven’t been added yet. The %NodeName syntax additionally requires the target node to have “Access as Unique Name” enabled in its scene, a property the generator has no way to verify at code-writing time.
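
The timing difference is easy to demonstrate. A minimal sketch, with an illustrative node name, assuming the scene has a Player child:

```gdscript
extends Node2D

# Member initializers run at object construction, before child nodes
# are attached, so this lookup cannot succeed yet.
var broken_player = get_node_or_null("Player")  # null at this point

# Deferred until _ready(), when the subtree is live, so it resolves.
@onready var player: CharacterBody2D = $Player

func _ready() -> void:
    assert(broken_player == null)
    assert(player != null)
```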

Godogen’s pipeline generates scenes using headless builder scripts that construct the node graph in memory and serialize it to .tscn files. This is more robust than hand-editing the serialization format directly, but it creates a sharp distinction between what the builder scripts can configure and what requires a live scene. @onready only has meaning at runtime. Signal connections that reference specific scene nodes can only be wired when those nodes exist. Teaching the model to respect this phase boundary, and to know which operations belong in each phase, required explicit attention in the prompting system rather than relying on general coding intuitions.

The owner property is a related trap. Nodes added via add_child() without setting their owner to the scene root will silently vanish when the .tscn is saved. Godot uses owner to determine which nodes belong to the serialized scene graph versus which are transient scene-tree members. This is exactly the kind of engine behavior that doesn’t appear prominently in the official class reference but breaks generated scenes reliably when overlooked.
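
A sketch of the builder-script shape, with illustrative node names rather than Godogen’s actual code: construct the graph, set owner on every node that should survive serialization, then pack and save.

```gdscript
# build_scene.gd — run from the project directory with:
#   godot --headless --script build_scene.gd
extends SceneTree

func _init():
    var root_node := Node2D.new()
    root_node.name = "Main"

    var player := CharacterBody2D.new()
    player.name = "Player"
    root_node.add_child(player)
    # Without this line, Player silently vanishes from the saved .tscn:
    player.owner = root_node

    var packed := PackedScene.new()
    if packed.pack(root_node) == OK:
        ResourceSaver.save(packed, "res://main.tscn")
    quit()
```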

The .tscn format itself has its own fragilities. The load_steps header must exactly match the number of ext_resource and sub_resource declarations plus one for the scene itself. UIDs must be unique. The format=3 field marks Godot 4 compatibility, as opposed to the format=2 of Godot 3. Off-by-one errors in load_steps don’t always produce a clear error message; they can corrupt the scene in ways that are difficult to diagnose.
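
A minimal hand-checked example of the bookkeeping, with illustrative resource ids and uid: one ext_resource plus one sub_resource plus the scene itself gives load_steps=3.

```
[gd_scene load_steps=3 format=3 uid="uid://example"]

[ext_resource type="Script" path="res://player.gd" id="1_player"]

[sub_resource type="RectangleShape2D" id="RectangleShape2D_1"]
size = Vector2(32, 32)

[node name="Main" type="Node2D"]

[node name="Player" type="CharacterBody2D" parent="."]
script = ExtResource("1_player")

[node name="CollisionShape2D" type="CollisionShape2D" parent="Player"]
shape = SubResource("RectangleShape2D_1")
```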

The Evaluation Loop Is Structural

The third bottleneck in the pipeline is the evaluation loop. A coding agent assessing its own output has a structural bias toward optimism; it generated the code, so it reads it charitably. For most programming tasks, unit tests provide an external feedback signal that cuts through this. Game logic doesn’t decompose cleanly into testable units. The meaningful failure modes are “the game doesn’t run” and “the physics behaves wrong,” both of which require the engine in the loop.

Projects like Rosebud AI, which generates Phaser.js browser games from natural language prompts, have a structural advantage here. JavaScript runs in any browser, Phaser has significant training data representation, and the evaluation loop for browser-based generation is short and cheap. Godogen’s output requires a running Godot installation, adds engine startup latency per iteration, and makes the evaluation environment substantially heavier.

Godogen closes the loop using Godot’s headless mode, which can run a game without a display and capture screenshots or execution logs. This is the right approach, but it also means the pipeline is doing real infrastructure work: managing a Godot installation, parsing engine output, and using visual state as a correctness signal rather than a test assertion. The Godot headless server documentation describes the execution mode, but integrating it into a generation feedback loop requires additional tooling.
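
A minimal smoke check, sketched as a SceneTree script rather than Godogen’s actual harness: it confirms the generated scene parses and instantiates, and reports the result as an exit code the pipeline can branch on.

```gdscript
# eval.gd — run from the project directory with:
#   godot --headless --script eval.gd
extends SceneTree

func _init():
    var packed := load("res://main.tscn") as PackedScene
    if packed == null:
        push_error("main.tscn failed to load or parse")
        quit(1)
        return
    var instance := packed.instantiate()
    if instance == null:
        push_error("main.tscn loaded but would not instantiate")
        quit(1)
        return
    # A fuller harness would add the instance to the tree, step frames,
    # and inspect logs or visual state before deciding pass/fail.
    instance.free()
    quit(0)
```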

What the Approach Generalizes To

The pattern Godogen demonstrates applies to any LLM pipeline targeting a niche language or domain-specific API. Model knowledge is a function of training data. For thin-representation languages, that knowledge is unreliable, and the application layer has to compensate directly.

The compensation takes three distinct forms. First, curate authoritative documentation and inject it selectively, so the model is operating from ground truth rather than imperfect training recall. Second, teach the model the phase semantics of the target system: the initialization order, the ownership rules, the build-versus-runtime distinction, all the things that experienced developers learn from debugging rather than documentation. Third, close the feedback loop with the actual runtime, because static analysis of generated code against a niche API is not sufficient to catch the failure modes that matter.

The gap between a working demo and a reliable pipeline is an engineering problem at each of these three levels. Godogen’s four rewrites over a year reflect how much of that work required iteration against real engine behavior rather than refinements to prompt wording. The reference system, the quirks database, and the headless evaluation runner are the parts that determine output quality; the prompt operates on top of all of that scaffolding.
