Getting an LLM to generate functional code in a well-represented language is tractable: models have ingested enough Python, TypeScript, and Rust to produce working output with reasonable frequency. Getting one to generate a complete, playable game in GDScript is a different problem, and the engineering required to make it work reveals challenges that apply broadly to any domain-specific code generation pipeline.
Godogen is a pipeline packaged as a set of Claude Code skills that takes a text prompt and returns a complete Godot 4 project: architecture decisions, 2D/3D assets, GDScript source, and a working scene graph. The author spent about a year on it across four major rewrites. Three distinct engineering bottlenecks drove most of that effort, and each one generalizes beyond games.
The Training Data Gap
GDScript occupies a thin slice of public code repositories. Godot 4 has approximately 850 classes in its standard library, and GDScript's syntax is close enough to Python's to cause consistent confusion. A model that has processed millions of Python files reaches for Python idioms instinctively: subtly wrong container behavior, type annotations that look valid but compile differently, and inheritance patterns that GDScript handles differently from Python in ways that matter at runtime.
The Godot 4 API also changed substantially from Godot 3, which means even the GDScript that exists in training corpora may reflect a deprecated interface. Methods were renamed, the signal system was restructured, and the node hierarchy shifted. A model interpolating between Godot 3 and Godot 4 patterns produces code that satisfies neither.
Godogen addresses this with a custom reference system built from three sources: a hand-written language spec that describes GDScript’s actual semantics, full API documentation converted from Godot’s XML source files, and a quirks database of engine behaviors that documentation does not cover. That last category is the most interesting: things like the fact that Vector2i and Vector2 are not interchangeable despite appearing similar, or that certain node properties behave differently depending on whether they are set inside a _ready() callback versus an exported variable initializer.
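A quirks entry of this kind can be made concrete. The snippet below is a hypothetical illustration in the spirit of that database, not an excerpt from it:

```gdscript
extends Node

func _ready() -> void:
	var world_pos: Vector2 = Vector2(3.7, -1.2)
	# Conversion between the two types is explicit and truncates toward zero.
	var cell: Vector2i = Vector2i(world_pos)  # becomes (3, -1)
	# Grid-facing APIs such as TileMap.set_cell expect Vector2i; handing
	# them a Vector2 fails type checking rather than silently coercing,
	# the opposite of what Python habits suggest.
	print(cell)
```

The documentation for each type is individually accurate; the quirk is the relationship between them, which is exactly what a per-class API reference does not surface.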
Injecting all 850 classes into the context window before generation would exhaust the token budget. The solution is lazy-loading: the agent identifies which classes and APIs are relevant to a specific game type and pulls in only that subset at runtime. This is retrieval-augmented generation applied to a structured API catalogue rather than a document store, and it scales to the full API surface without burning the context budget on every call.
The broader principle here is that when a model’s training distribution is sparse for a target domain, curated context built from primary sources outperforms hoping the model interpolates correctly. The Godot XML docs are authoritative in a way that any scraped tutorial or GitHub repository is not. Building from those sources, rather than supplementing whatever the model already knows, is the right foundation.
Build Time vs. Runtime
Godot scenes are stored as .tscn files: a text serialization format describing a tree of nodes, their properties, and signal connections. The format uses resource IDs that must remain internally consistent, external resource references tied to load paths, and node path strings that break silently if the tree structure changes. Generating .tscn content directly as text is fragile because the format's invariants are easy to violate and difficult to validate without actually loading the file in the engine.
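For concreteness, here is a minimal hand-written fragment in the shape of a Godot 4 .tscn file; names and IDs are illustrative:

```
[gd_scene load_steps=2 format=3]

[ext_resource type="Script" path="res://main.gd" id="1_main"]

[node name="Main" type="Node2D"]
script = ExtResource("1_main")

[node name="Player" type="CharacterBody2D" parent="."]

[node name="Hurtbox" type="Area2D" parent="Player"]

[connection signal="area_entered" from="Player/Hurtbox" to="." method="_on_hurt"]
```

Every quoted string here is a cross-reference: the ID "1_main" must match where it is declared and where it is used, and the path "Player/Hurtbox" breaks silently if the Hurtbox node is renamed or reparented. A text generator must keep all of these consistent at once.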
Godogen avoids direct .tscn text generation by writing headless GDScript that constructs the node graph in memory using Godot’s own API, then serializes it through the engine’s native save mechanisms. This delegates format correctness to the engine itself rather than trying to reproduce it through string construction. The resulting files are valid by construction because the engine that wrote them also defines what valid means.
The tradeoff is that headless execution is not the same environment as a running game. During headless scene construction, there is no active game loop and no scene tree in the conventional sense. The @onready annotation, which auto-initializes node references after the scene tree is fully set up, does not apply at build time. Signal connections that use node paths work differently when there is no instantiated scene to traverse. Most consequentially: the owner property of every node must be set explicitly during headless construction. Nodes that lack a proper owner appear correct in memory and write to disk without errors, but they disappear silently when the saved file is reloaded.
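In outline, the build-time path looks like the following sketch; the file name and node choices are illustrative, not Godogen's actual code:

```gdscript
# build_scene.gd — run with: godot --headless --script build_scene.gd
extends SceneTree

func _init() -> void:
	var root := Node2D.new()
	root.name = "Main"

	var player := CharacterBody2D.new()
	player.name = "Player"
	root.add_child(player)
	# The critical line: only nodes whose owner is the scene root get
	# serialized. Omit it and the save succeeds, but the node silently
	# vanishes when the scene is reloaded.
	player.owner = root

	var packed := PackedScene.new()
	if packed.pack(root) == OK:
		ResourceSaver.save(packed, "res://main.tscn")
	quit()
```

Because the engine performs the serialization, resource IDs and node paths in the emitted .tscn are consistent by construction; the code above never touches the text format at all.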
The owner property behavior is a representative example of what makes this engineering hard. It is not highlighted as a requirement in the API reference for any individual method. It is knowledge that surfaces through building scenes programmatically, debugging the results, and comparing what was written against what was loaded back. Encoding this kind of phase-specific, context-dependent behavior into a prompt system requires building an execution model into the context, not just an API reference. The model needs to understand that the same API call means different things depending on whether the engine is constructing a scene versus running one.
The Evaluation Loop
A coding agent evaluating its own output will interpret ambiguous results charitably. In software with deterministic correctness criteria this manifests as agents writing tests that the code they already wrote is guaranteed to pass. In game development it is worse, because correctness is partly visual and behavioral rather than testable through return values.
GDScript that compiles and a Godot scene that opens without crashing are necessary conditions for a working game, not sufficient ones. A game might open to a black screen, exhibit physics that never stabilizes, or present a degenerate loop where nothing happens. These failures produce no log output. Catching them requires observing the running game, not parsing terminal output.
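One cheap probe along these lines is to run the game briefly and inspect a captured frame. A sketch follows; the thresholds, timing, and file names are invented, and it assumes a rendering context (a real display or virtual framebuffer) rather than a fully headless run:

```gdscript
# probe.gd — hypothetically attached as an autoload in the generated project.
extends Node

func _ready() -> void:
	# Give _ready chains and the first rendered frames time to complete.
	await get_tree().create_timer(2.0).timeout
	var img: Image = get_viewport().get_texture().get_image()
	var luminance := 0.0
	var samples := 0
	# Sample a sparse grid of pixels rather than the full frame.
	for y in range(0, img.get_height(), 16):
		for x in range(0, img.get_width(), 16):
			luminance += img.get_pixel(x, y).get_luminance()
			samples += 1
	if samples > 0 and luminance / samples < 0.01:
		push_error("viewport appears black 2s after startup")
	img.save_png("user://startup_frame.png")  # saved for later inspection
```

A probe like this catches the black screen. It says nothing about whether the game is playable, which is why the captured frames still need a separate evaluator.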
The Godogen pipeline includes a visual testing component in its evaluation loop. Running Godot with a display in a non-interactive context requires a virtual framebuffer or the engine’s headless rendering mode, both of which add infrastructure requirements and do not fully replicate interactive play. The bias problem persists regardless of infrastructure: separating the generation context from the evaluation context, even through distinct model invocations with different system prompts, reduces the tendency to declare success on ambiguous outputs. An evaluator that did not write the code is less likely to assume that a black screen is intentional.
That the project took four rewrites over a year is partly explained by how hard it is to establish a reliable quality signal for generated games. Compilation success and crash-free loading are easy to measure. Whether the game is actually playable is not, and that gap is where most of the iteration happens.
Packaging Through Claude Code Skills
Shipping Godogen as Claude Code skills rather than a standalone tool is a practical distribution decision. Claude Code already manages the agent loop, file I/O, and tool orchestration. The skills layer lets the Godogen logic run within that infrastructure without rebuilding it. Users invoke the pipeline from within an existing Claude Code session, and the output lands as files in the working directory. The pipeline also inherits improvements to Claude Code’s underlying capabilities, including better context management and model updates, without requiring changes to the Godogen skills themselves.
This is worth noting because much of the complexity in building agentic pipelines sits in the scaffolding: managing context across turns, handling tool errors, writing output to the right locations. Offloading that to an existing platform and focusing engineering effort on the domain-specific problems is a reasonable architectural choice.
The Broader Pattern
The three problems Godogen solves are not specific to games. Any pipeline generating code for a domain-specific language or runtime faces the same constraints: sparse training data that causes the model to drift toward more familiar idioms, execution phases with different API surfaces that must be encoded into the model’s context, and evaluation criteria that cannot be reduced to pass/fail compilation checks.
Games make all three problems harder. GDScript’s training data representation is thin compared to general-purpose languages. Scene-graph engines have particularly distinct build-time and runtime execution models. Game quality has a behavioral and interactive dimension that most code generation pipelines never encounter. A SQL generation pipeline can validate output by running queries. A game generation pipeline has to decide whether jumping feels right.
That combination is why a year of work and four rewrites are a reasonable cost, and why the resulting architecture (lazy-loaded API references, headless scene construction, phase-aware prompting, and visual evaluation) reflects genuine engineering work. The interesting insight from Godogen is not that LLMs can generate games. It is that making them do so reliably required solving problems that have nothing to do with game design and everything to do with how LLMs fail when the training distribution is thin and the target environment has phases.