Generating Game Code Is Easier Than Knowing Whether It Works

Generating playable games from text prompts places two demands on the pipeline: it must produce code that works, and it must know that the code works. Godogen, a pipeline that takes a text prompt and produces complete Godot 4 projects, spent approximately a year across four major rewrites solving both. The author describes three engineering bottlenecks in the HN announcement. Two concern generation. The third concerns evaluation, and it scales hardest as game complexity increases.

The generation problems have discernible solution shapes. GDScript’s training data is thin and version-contaminated, but the fix is context injection: a hand-written language spec, API documentation converted from Godot’s XML source, and a quirks database built from empirical failures. The build-time versus runtime boundary, which determines which Godot APIs are available during headless scene construction versus during live gameplay, is a knowledge gap that explicit annotation in prompt context can address. Neither problem is trivial, but both have clear engineering paths.

The evaluation problem is structurally different.

Three Tiers of Correctness

A generated Godot project has three levels at which it can be correct or wrong, each requiring a different evaluation strategy.

Structural correctness means the project compiles, loads, and enters the game loop without an engine error. Godot’s scene serialization catches some structural problems at load time: a scene file with a mismatched load_steps count, an ext_resource referencing a missing file, a node type absent from Godot 4. GDScript compilation catches another subset. But structural validation is incomplete. A CharacterBody2D with no CollisionShape2D child is structurally valid and loads without complaint, yet it has no collision geometry, so it will not collide with anything once the game runs.
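The missing-collider case is one that a static pass over the scene text can flag before the engine ever starts. The following is a minimal sketch, assuming Godot 4's text scene format, in which each node is declared by a [node name=... type=... parent=...] header and the root node has no parent attribute. The function names are hypothetical, and the check ignores colliders supplied by instanced sub-scenes or added from script at runtime, so a finding is a warning rather than proof of a bug.

```python
import re

# Quoted attributes inside a [node ...] header, e.g. name="Player".
ATTR = re.compile(r'(\w+)="([^"]*)"')

COLLIDER_TYPES = {"CollisionShape2D", "CollisionPolygon2D"}


def parse_nodes(tscn_text):
    """Return (path, type, parent) for every [node ...] header in the scene text."""
    nodes = []
    for line in tscn_text.splitlines():
        line = line.strip()
        if not line.startswith("[node "):
            continue
        attrs = dict(ATTR.findall(line))
        parent = attrs.get("parent")
        if parent is None:
            path = "."  # the root node carries no parent attribute
        elif parent == ".":
            path = attrs["name"]  # direct child of the root
        else:
            path = parent + "/" + attrs["name"]
        nodes.append((path, attrs.get("type"), parent))
    return nodes


def bodies_missing_collider(tscn_text):
    """List paths of CharacterBody2D nodes with no collider declared as a direct child."""
    nodes = parse_nodes(tscn_text)
    parents_with_collider = {
        parent for _, node_type, parent in nodes if node_type in COLLIDER_TYPES
    }
    return [
        path
        for path, node_type, _ in nodes
        if node_type == "CharacterBody2D" and path not in parents_with_collider
    ]


SAMPLE_SCENE = """[gd_scene format=3]

[node name="Level" type="Node2D"]

[node name="Player" type="CharacterBody2D" parent="."]

[node name="Enemy" type="CharacterBody2D" parent="."]

[node name="Shape" type="CollisionShape2D" parent="Enemy"]
"""

print(bodies_missing_collider(SAMPLE_SCENE))
```

In the sample scene, Enemy has a shape and Player does not, so only Player is reported. A check like this is cheap enough to run on every candidate before paying for an engine launch.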

Functional correctness means the game behaves as described: the player character moves as intended, projectiles intersect hitboxes, state transitions between scenes fire at the right moment, UI elements reflect the correct game state. This tier requires the game to actually run. Headless execution via godot --headless can drive some functional testing: run the game for N frames, observe whether the expected state has been reached, check the error stream. Some failures never reach that stream during a short run. A signal connected to a method with a mismatched parameter signature is accepted when the connection is made; the mismatch surfaces only when the signal is actually emitted, so detecting it requires exercising that code path rather than inspecting load-time logs. Physics layer and mask misconfiguration causes objects to pass through each other with nothing written to the console.
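The layer and mask failure is silent because, from the engine's point of view, nothing is wrong: in Godot, collision_layer marks the layers an object occupies and collision_mask marks the layers it scans, and an object simply ignores anything outside its mask. The rule can be sketched with plain bit arithmetic. The helper names below are hypothetical, and the sketch models only the bitmask test, not the rest of the physics engine.

```python
def layer_bit(layer_number):
    """Convert a 1-based layer number, as shown in the editor, to its bit value."""
    if layer_number < 1:
        raise ValueError("layer numbers start at 1")
    return 1 << (layer_number - 1)


def scans(mask_of_a, layer_of_b):
    """True if an object with this collision_mask scans an object on this collision_layer."""
    return (mask_of_a & layer_of_b) != 0


# Hypothetical setup: bullets are meant to hit enemies on layer 2.
bullet_mask = layer_bit(2)

enemy_layer_intended = layer_bit(2)
enemy_layer_generated = layer_bit(1)  # generator left the enemy on layer 1

print(scans(bullet_mask, enemy_layer_intended))
print(scans(bullet_mask, enemy_layer_generated))
```

The second case is the one the paragraph describes: the bullet passes through the enemy, no error is raised, and only an observation of game state or of rendered frames reveals the problem.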

Experiential correctness is harder to operationalize. The jump arc feels appropriate. The camera lead matches the movement speed. The platform spacing is playable given the character’s expected capabilities. These are continuous parameters with wide acceptable ranges that depend on design intent rather than specification. Most automated testing frameworks in software development address structural and functional correctness and leave experiential correctness to user testing. For game code specifically, the balance of importance shifts. A significant fraction of what makes a game feel like a game lives in the experiential tier. A generator that achieves the first two but ignores the third produces output that technically works but does not play well.

The Self-Evaluation Problem

An LLM evaluating its own generated code brings the same miscalibrated priors that produced the bugs in the first place.

Research on LLM-as-judge evaluation has documented self-preference bias consistently. A model asked to review its own output tends to rate that output more favorably than equivalent code from other sources, independent of actual quality. For GDScript specifically, if a model’s internal representation conflates Godot 3 and Godot 4 signal syntax, it will not detect that $Timer.connect("timeout", self, "_on_timer_timeout") is wrong in Godot 4, where the idiomatic form is $Timer.timeout.connect(_on_timer_timeout), by reading what it generated. Both versions of that pattern look syntactically plausible within the model’s calibration. Systematic errors arising from systematic miscalibration are invisible to the miscalibrated evaluator.
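Version contamination of this kind is one of the few failure classes that a check outside the model can catch by pattern alone. The sketch below scans GDScript source for a handful of well-known Godot 3 idioms that changed in Godot 4. The pattern list is an illustrative assumption, not Godogen's quirks database, and the comment stripping is naive, so it will miss or misreport edge cases such as a # inside a string literal.

```python
import re

# (pattern, description) pairs for Godot 3 idioms that changed in Godot 4.
GODOT3_PATTERNS = [
    (re.compile(r'\.connect\(\s*"[^"]+"\s*,\s*[^,()]+,\s*"[^"]+"'),
     "string-based connect(signal, target, method); Godot 4 uses signal.connect(callable)"),
    (re.compile(r'\byield\s*\('),
     "yield(...); Godot 4 uses await"),
    (re.compile(r'^\s*export\b'),
     "export keyword; Godot 4 uses the @export annotation"),
    (re.compile(r'^\s*onready\s+var\b'),
     "onready keyword; Godot 4 uses the @onready annotation"),
    (re.compile(r'\bKinematicBody(2D)?\b'),
     "KinematicBody; renamed to CharacterBody2D / CharacterBody3D in Godot 4"),
]


def find_godot3_idioms(source):
    """Return (line_number, description) for each suspected Godot 3 idiom."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive: drops everything after the first '#'
        for pattern, description in GODOT3_PATTERNS:
            if pattern.search(code):
                findings.append((number, description))
    return findings


SAMPLE_SCRIPT = '''extends CharacterBody2D
@export var speed = 200
export var health = 3

func _ready():
    $Timer.connect("timeout", self, "_on_timer_timeout")
    $Timer.timeout.connect(_on_timer_timeout)
'''

for line_number, description in find_godot3_idioms(SAMPLE_SCRIPT):
    print(line_number, description)
```

The value of such a scanner is that it does not share the generator's priors: it flags the Godot 3 connect call regardless of how plausible that call looks to the model that wrote it.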

This means external ground-truth evaluation is not optional; it is the only mechanism that can close the feedback loop on systematic failure classes. For most code generation domains, external evaluation is inexpensive: compile the output, run a test suite, check return values. For game code, the cheapest external evaluator is the game engine itself, and invoking the engine carries meaningful time cost relative to running a unit test suite.

What Headless Execution Actually Catches

Godogen runs generated projects via godot --headless with a virtual framebuffer, which allows screenshot capture in environments without a display server. Console output and error streams feed back into the generation loop. This provides external validation at the structural tier and catches a subset of functional problems, specifically those that produce console errors or fail to reach a testable game state within a fixed frame budget.
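A headless evaluation step of this general shape can be sketched as follows. This is not Godogen's implementation: it assumes a Godot 4 binary named godot on the PATH, uses the --headless, --path, and --quit-after command-line options (worth verifying against the installed version), and treats any output line containing "ERROR:" as a failure signal, which is a heuristic rather than a documented contract. The function names and default values are hypothetical.

```python
import subprocess


def build_command(project_dir, frames, godot_bin="godot"):
    """Command line that runs a project headlessly for a fixed number of iterations."""
    return [godot_bin, "--headless", "--path", project_dir, "--quit-after", str(frames)]


def extract_errors(output):
    """Heuristic: collect output lines that look like engine or script errors."""
    return [line.strip() for line in output.splitlines() if "ERROR:" in line]


def run_headless(project_dir, frames=300, timeout_s=120, godot_bin="godot"):
    """Run the project and summarize the result for the generation loop.

    Raises FileNotFoundError if the Godot binary cannot be found.
    """
    command = build_command(project_dir, frames, godot_bin)
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "errors": ["timed out after %d s" % timeout_s]}
    errors = extract_errors(result.stdout + "\n" + result.stderr)
    return {"ok": result.returncode == 0 and not errors, "errors": errors}


# Illustrative log text, not captured from a real run.
sample_log = "Loading scene\nSCRIPT ERROR: something failed\nERROR: another failure\n"
print(extract_errors(sample_log))
```

A result with ok set to True means only that the run produced no error lines and exited cleanly within the frame budget. It says nothing about failures that leave the console empty.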

The evaluation cost shapes what iteration is feasible. Each evaluation cycle requires spawning the Godot binary, waiting for engine initialization, running the project for some duration, and capturing output. This is qualitatively different from running a tight unit test suite. The time per evaluation cycle directly limits how many generation-and-correction iterations fit within a reasonable budget, which constrains how aggressively the pipeline can correct for bugs it discovers.
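The constraint is simple arithmetic. In the sketch below, every number is a hypothetical placeholder chosen for illustration, not a measurement from Godogen or any other pipeline.

```python
def iterations_within(budget_s, generate_s, evaluate_s):
    """Whole generate-then-evaluate cycles that fit in a time budget (seconds)."""
    cycle_s = generate_s + evaluate_s
    if cycle_s <= 0:
        raise ValueError("cycle time must be positive")
    return int(budget_s // cycle_s)


# Hypothetical figures: a 10-minute budget and 5 s of generation per attempt.
slow_eval = iterations_within(budget_s=600, generate_s=5, evaluate_s=15)
fast_eval = iterations_within(budget_s=600, generate_s=5, evaluate_s=0.05)
print(slow_eval, fast_eval)
```

Under these made-up figures, cutting evaluation time from seconds to milliseconds roughly quadruples the number of correction attempts. How much it matters in practice depends on how generation time compares with evaluation time.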

Rosebud AI sidesteps part of this problem by generating Phaser.js browser games rather than native Godot projects. JavaScript runs in a browser tab; each evaluation iteration takes milliseconds rather than seconds; and the visual output renders in the same environment hosting the evaluation. The trade-off is scope. Browser-based Phaser generation is constrained to web capabilities and Phaser’s considerably smaller API surface compared to Godot 4’s roughly 850 classes. Faster evaluation cycles are worth something, but they come attached to a narrower range of achievable games.

The Visual Evaluation Gap

The third bottleneck the author mentions is that a coding agent is biased toward its own output, and the visual testing phase is the intended response. Capturing screenshots of running games provides evaluation signal that the generator does not have during generation, which breaks the self-reference loop.

But visual evaluation of game behavior is harder than it first appears. A screenshot of a black screen indicates failure clearly. A screenshot showing a character standing on a platform could indicate correct behavior, a collision system that happens to work by coincidence, or a physics simulation that has not advanced far enough to trigger the failure mode. Distinguishing between these requires a sequence of frames, game-state annotations, or domain knowledge about what correct platformer behavior looks like visually. Building that classification layer is itself a non-trivial engineering task, and it is one the Godogen announcement, as available, does not describe in detail.
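The first step beyond a single screenshot is comparing frames over time. The sketch below is a deliberately crude illustration, not a description of Godogen's visual testing. It assumes each frame has already been decoded into a flat list of grayscale values from 0 to 255, and the thresholds are arbitrary placeholders.

```python
def is_blank(frame, threshold=8):
    """True if every pixel is at or below a near-black threshold."""
    return max(frame) <= threshold


def changed_fraction(before, after, tolerance=4):
    """Fraction of pixels whose brightness differs by more than the tolerance."""
    if len(before) != len(after):
        raise ValueError("frames must have the same number of pixels")
    changed = sum(1 for a, b in zip(before, after) if abs(a - b) > tolerance)
    return changed / len(before)


def classify_sequence(frames, min_change=0.01):
    """Label a frame sequence as 'blank', 'static', or 'animated'."""
    if not frames:
        raise ValueError("need at least one frame")
    if all(is_blank(frame) for frame in frames):
        return "blank"
    moving = any(
        changed_fraction(a, b) >= min_change for a, b in zip(frames, frames[1:])
    )
    return "animated" if moving else "static"


# Tiny 4-pixel frames standing in for real screenshots.
black = [0, 0, 0, 0]
scene = [10, 200, 200, 10]
scene_moved = [200, 10, 200, 10]

print(classify_sequence([black, black]))
print(classify_sequence([scene, scene]))
print(classify_sequence([scene, scene_moved]))
```

Even this only separates "nothing rendered" and "nothing moved" from "something changed". A static result may be a correct menu screen, and an animated one may be a character falling through the floor, which is why the classification layer needs game-state annotations or domain knowledge on top.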

This is where the evaluation problem becomes recursive: assessing whether generated game code is correct requires something that understands what correct gameplay looks like, and that understanding draws on the same sparse domain knowledge that limits the generator. The evaluation system and the generation system are not independent; they share the same knowledge bottleneck.

Why Each Rewrite Was Necessary

Four major rewrites across a year is consistent with iterative failure-class discovery, where each rewrite responds to a category of failure the previous architecture could not catch. The evaluation problem compounds across rewrites because evaluation quality is itself a function of accumulated domain knowledge.

Early in the project, before the quirks database existed, the generator produced bugs the evaluation loop could not reliably detect because it lacked context for what to look for. As the quirks database grew, generation improved and the evaluation loop could be tuned to the remaining failure classes. The rewrites are not evidence of a misconceived initial approach; they are the expected shape of building toward coverage of an empirically discovered failure space. You cannot know in advance what the failure classes are; you discover them by building, failing, and recording what went wrong.

The ceiling on output quality is set by the evaluation loop, not by the generator. A pipeline that produces plausible-looking game code and cannot reliably distinguish correct output from plausible-but-wrong output has not finished solving the problem. The generation side produces candidates. The evaluation side determines whether the problem is solved.

For anyone building in this space, that asymmetry is worth holding clearly. Improving the generator without improving the evaluator produces better-looking failures. Improving the evaluator first tells you precisely what the generator needs to fix. The year of rewrites Godogen required was, in large part, the cost of building an evaluation system that could see the right things.