Evaluating Generated Games Is a Different Problem Than Evaluating Generated Code
Source: hackernews
Code generation benchmarks operate on a clean premise: run the output against a predefined test suite, count the passes. SWE-bench does this for GitHub issues, scoring agents on whether their patches make the repository’s tests pass. HumanEval scores function synthesis against input/output examples. The verification signal is deterministic, cheap to compute, and unambiguous. A generated game has none of these properties.
Godogen is a pipeline that takes a text prompt and produces a complete, playable Godot 4 project: GDScript source, scene files, assets, the full structure. The author spent a year on it across four major rewrites. The third engineering bottleneck they identify is the evaluation loop, and it is the one that most directly exposes what makes game generation categorically harder than the problems code generation benchmarks measure.
What Correctness Means for Generated Games
Game correctness has three layers that require different verification strategies.
The first is structural: does the project load? Do the scripts compile, do the scenes parse, does the engine reach a running state? This layer is verifiable with standard tooling. Godot’s headless execution mode captures compilation errors, load failures, and runtime crashes as console output. The command godot --headless --path /path/to/project runs the full scene tree, physics engine, and scripting runtime without a display. Pass/fail here is unambiguous.
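A minimal orchestration sketch of this check, assuming a godot binary on PATH; the error-marker strings are an illustrative subset of what the engine prints, not Godot’s full error vocabulary:

```python
import subprocess

# Illustrative subset of markers Godot emits on script/scene failures.
ERROR_MARKERS = ("SCRIPT ERROR", "ERROR:", "Parse Error")

def classify_output(console_output: str, returncode: int) -> dict:
    """Pure pass/fail classification of a headless run's console output."""
    errors = [line for line in console_output.splitlines()
              if any(marker in line for marker in ERROR_MARKERS)]
    return {"ok": returncode == 0 and not errors, "errors": errors}

def run_headless(project_path: str, timeout_s: int = 60) -> dict:
    """Launch the engine headless, let it run a few hundred frames, then quit."""
    proc = subprocess.run(
        ["godot", "--headless", "--path", project_path, "--quit-after", "300"],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return classify_output(proc.stdout + proc.stderr, proc.returncode)
```

Keeping the classification pure makes it testable without an engine install; only run_headless touches the process boundary.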
The second layer is functional: does the game loop behave as described? In a platformer, does the character move and jump? Are collisions registered? Does the scoring system respond to events? This layer is partially verifiable through log output and instrumented game state, but it requires either watching what the engine reports during execution or adding programmatic hooks to the generated code that report on game state over time. Neither approach is as clean as a unit test.
The third layer is experiential: does the game actually feel like what was requested? A generated platformer where the character can technically move and technically collide with platforms is not the same as one where the movement is responsive and the physics feel stable. Testing cannot distinguish these outcomes. The difference lives in parameter tuning, in the relationship between gravity, jump force, and platform spacing, in whether the camera tracking creates disorientation or clarity. These are design properties with no test representation.
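The gravity/jump-force relationship makes the point concrete. Under ballistic motion, apex height is v²/2g and time-to-apex is v/g, so two parameter sets can clear the same platform while feeling completely different. A small sketch (the numbers are invented for illustration, not tuned values from any real game):

```python
def jump_profile(jump_velocity: float, gravity: float) -> tuple[float, float]:
    """Apex height (v^2 / 2g) and time-to-apex (v / g) for a ballistic jump."""
    apex_height = jump_velocity ** 2 / (2 * gravity)
    time_to_apex = jump_velocity / gravity
    return apex_height, time_to_apex

# Both parameter sets clear the same ~66.7 px obstacle, so both "pass"
# any functional check -- but the second jump takes twice as long to peak,
# which players experience as floaty.
snappy = jump_profile(jump_velocity=400.0, gravity=1200.0)
floaty = jump_profile(jump_velocity=200.0, gravity=300.0)
```

No assertion over these outputs can say which game is better; that is exactly the property that has no test representation.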
Top agents on the SWE-bench Verified subset score in the 45 to 55 percent range as of early 2026, and that benchmark measures only the first correctness tier: structural correctness against a predefined test suite. Games expose all three tiers simultaneously.
The Structural Bias in Self-Evaluation
The Godogen documentation describes the evaluation problem partly in terms of agent bias. A code-generation agent reviewing its own output compares that output against its own internal model of what correct GDScript should look like. If the model’s internal representation of a valid Godot signal connection is wrong, the generated code will be wrong and the self-evaluation will confirm it as correct. The model cannot catch errors it does not know it is making.
This is distinct from the confirmation bias that affects human code reviewers. A human author may unconsciously overlook gaps between intent and implementation, but they are working from ground truth about GDScript semantics. An LLM’s internal representation of GDScript semantics is itself the source of the error when training data is sparse or when Godot 3 idioms contaminate Godot 4 generation. Reviewing the output against the same representation that generated it resolves nothing.
The real escape from this bias is execution: running the game against the actual Godot engine. The engine has no training data gaps about its own API; it enforces exactly what that API specifies. An error from a running Godot process is authoritative in a way that model self-evaluation is not. This is why headless execution is a structural requirement in the pipeline rather than an optional testing step.
Separating generation and evaluation into distinct model invocations with different system prompts reduces (but does not eliminate) the bias for cases where the engine does not catch the error. An evaluator context that did not produce the code is less likely to rationalize an ambiguous result as intentional. It still evaluates against the same underlying model knowledge, so systematic errors in that knowledge still propagate through.
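The generator/evaluator split is structurally simple. A sketch of the shape, where call_model is a placeholder for any chat-completion API and both prompts are invented for illustration (not Godogen’s actual prompts):

```python
# Hypothetical system prompts; the point is that they are disjoint.
GENERATOR_PROMPT = "You write Godot 4 GDScript. Output only code."
EVALUATOR_PROMPT = (
    "You review GDScript you did not write. List concrete defects. "
    "Do not assume ambiguous code is intentional."
)

def generate_and_review(task: str, call_model) -> tuple[str, str]:
    """Two model invocations with separate system prompts.
    call_model(system=..., user=...) is a placeholder for whatever
    LLM client the pipeline uses (hypothetical signature)."""
    code = call_model(system=GENERATOR_PROMPT, user=task)
    review = call_model(
        system=EVALUATOR_PROMPT,
        user=f"Task: {task}\n\nCode:\n{code}",
    )
    return code, review
```

Note what this does and does not buy: the evaluator context cannot rationalize its own generation choices, but both invocations draw on the same weights, so a systematic misunderstanding of Godot 4 semantics survives the split.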
The Phaser Comparison and Infrastructure Cost
Rosebud AI generates browser games using Phaser.js, a JavaScript game framework. The structural difference in verification cost is significant. JavaScript runs in any sandboxed browser context; a Phaser game can be loaded and inspected in milliseconds with no external binary. The evaluation loop is fast and cheap. JavaScript also has vastly more training data representation than GDScript, reducing the frequency of generation errors that require execution to catch.
Godogen’s Godot pipeline trades iteration speed for fidelity. A native Godot project has 3D capability, a full physics simulation, a rich node type system, and the spatial audio and animation tools that characterize the kind of games Godot is used for. Verifying that output is correct requires spinning up a Godot installation, paying engine startup latency, and managing process output. Per-iteration cost is real.
The MultiPL-E benchmark, which extended HumanEval across eighteen programming languages, found a tight correlation between model performance and estimated training data volume. GDScript sits well below JavaScript on any reasonable estimate, which means the Godot pipeline generates more errors per output than a Phaser pipeline would, requiring more evaluation iterations to reach a working result. The heavier verification infrastructure is paired with a language that needs it more.
For simple 2D browser games, the JavaScript path offers a faster feedback loop and higher initial generation quality. For the fuller feature surface of a native engine, the heavier infrastructure is the cost of the capability.
Visual Verification and the Frontend Analogy
Frontend development has a parallel problem: verifying rendered output automatically without requiring a human reviewer for every change. Visual regression tools like Playwright and Storybook snapshot testing capture the rendered state of UI components and compare them against baselines. This catches visual regressions without requiring a human to evaluate each commit.
Game verification has the same shape but different constraints. Visual regression testing works when there is a known-good baseline to compare against. For a generated game, no baseline exists. The question is not whether the output matches a previous state, but whether it represents a coherent and functional game at all. That requires a reference for what “functional” means.
Godogen’s screenshot capture component, which runs the Godot process under a virtual framebuffer, handles the most detectable cases: a black screen on launch indicates rendering initialization failure even when no error is logged to the console. Beyond that, screenshot analysis requires making judgments about image content, which static comparison cannot provide.
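The black-screen case reduces to a pixel statistic. A minimal blank-frame heuristic over a raw RGB buffer; the thresholds and the channel-average grayscale are illustrative choices, not values from the Godogen pipeline:

```python
def looks_blank(rgb_bytes: bytes,
                brightness_floor: float = 8.0,
                variance_floor: float = 4.0) -> bool:
    """Heuristic blank-screen check on a raw RGB framebuffer dump.
    Flags frames that are near-black or near-uniform, both of which
    suggest the scene never rendered. Thresholds are illustrative."""
    if len(rgb_bytes) < 3:
        return True
    # Per-pixel grayscale via simple channel average.
    gray = [sum(rgb_bytes[i:i + 3]) / 3
            for i in range(0, len(rgb_bytes) - 2, 3)]
    mean = sum(gray) / len(gray)
    variance = sum((g - mean) ** 2 for g in gray) / len(gray)
    return mean < brightness_floor or variance < variance_floor
```

This catches the failure mode the console misses, and nothing more; distinguishing a rendered-but-wrong scene from a correct one is where the multimodal approaches below come in.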
Multimodal evaluation models are a natural direction here. A vision-capable model shown a screenshot of a generated game mid-run can assess whether the scene appears to contain the elements described in the prompt, whether physics appear to be behaving, whether UI elements are positioned and readable. This is not deterministic, but it is sensitive to the second and third correctness layers in ways that console output alone is not. The Godogen pipeline’s screenshot infrastructure positions it to incorporate this kind of evaluation as multimodal models become more reliable at structured visual assessment.
The Pipeline Architecture This Implies
For a generation pipeline targeting games or any other domain where experiential correctness matters, the verification architecture has to address all three layers explicitly and accept that they require different tools.
Structural verification through runtime execution is non-negotiable and should be treated as a first-class feedback loop, not a final check. The engine’s error output is more authoritative than any static analysis of generated code, and feeding it back into the generation loop is the difference between a pipeline that iterates toward working output and one that produces plausible-looking failures.
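That feedback loop has a simple skeleton. Here generate(task, errors) and verify(project) are placeholders for the model call and an engine-backed check returning {"ok": bool, "errors": [...]}; the attempt cap is an arbitrary illustrative choice:

```python
def iterate_to_working(task, generate, verify, max_attempts: int = 5):
    """Generation loop that feeds engine error output back into the
    next attempt instead of relying on model self-review."""
    errors: list[str] = []
    for _ in range(max_attempts):
        project = generate(task, errors)
        result = verify(project)
        if result["ok"]:
            return project
        # Authoritative runtime errors become context for the next attempt.
        errors = result["errors"]
    return None  # plausible-looking failure: surfaced, not shipped
```

Returning None on exhaustion is the honest outcome the paragraph above describes: the pipeline reports failure rather than emitting output that merely looks plausible.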
Functional verification requires either instrumentation hooks or observational tooling against the running simulation. Instrumented logging of game state during headless execution, capturing entity positions, state machine transitions, and event counts over a short simulated run, can catch functional failures without requiring visual analysis.
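One way to consume such instrumentation, assuming the generated project is injected with a hook that prints one JSON object per physics tick; the field names (player_x, event) are hypothetical, not a format the source specifies:

```python
import json

def check_functional(log_lines: list[str]) -> dict:
    """Derive functional pass/fail signals from instrumented telemetry
    interleaved with ordinary engine console output."""
    # Keep only the JSON telemetry lines; engine chatter is ignored.
    ticks = [json.loads(line) for line in log_lines if line.startswith("{")]
    xs = [t["player_x"] for t in ticks if "player_x" in t]
    return {
        "player_moved": len(set(xs)) > 1,     # position changed over the run
        "events_fired": any(t.get("event") == "score" for t in ticks),
        "ticks_observed": len(ticks),
    }
```

The checks are coarse by design: "the character moved at all" and "a score event fired" are exactly the second-layer properties that console errors alone never report.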
Experiential verification cannot be automated with current tooling and should be scoped honestly. A pipeline that delivers structurally and functionally correct games is delivering real value even if the output requires human tuning for experiential quality. Acknowledging the gap is more useful than attempting to close it with verification approaches that are not ready.
Godogen’s four rewrites over a year reflect working through each of these layers in sequence. The resulting system’s correctness properties come from accumulated constraint knowledge and runtime verification, not from model capability alone. That is the engineering work behind a prompt that generates a playable game.