Build-Time vs. Runtime and the Node That Vanishes: Inside Godogen's Game Generation Pipeline

Source: hackernews

The Godogen project is a pipeline implemented as a set of Claude Code skills that takes a text prompt and outputs a complete, playable Godot 4 project: GDScript source files, scene files, 2D or 3D assets, and a working node graph. Its creator describes it as a year of development across four major rewrites. Four rewrites is a number worth taking seriously. It implies each iteration was not a refinement of the previous architecture but a recognition that the previous architecture was wrong at a deeper level.

Three problems forced those rewrites: GDScript’s sparse and version-contaminated training corpus, the distinction between build-time and runtime execution phases, and the limits of self-evaluation. Each has a specific engineering solution. What makes Godogen interesting is not any one solution in isolation but how they combine into a coherent approach.

Why GDScript Is a Harder LLM Target Than C# or C++

The obvious candidates for LLM-assisted game development are Unity with C# or Unreal with C++. Both languages have training corpora accumulated over decades of open-source projects, Stack Overflow threads, tutorials, and official documentation. Both have a clean conceptual separation between the host language and the engine API layer. If a model forgets what PhysicsBody does in Unreal, it still correctly understands C++ memory semantics and types. The error is confined.

GDScript uses Python-like syntax: indentation-based blocks, similar expression forms, comparable type annotation syntax. This looks like an advantage. It is not. What Python similarity actually produces is systematic substitution failures. When the model lacks a GDScript answer, it reaches for Python. That substitution is syntactically plausible, may parse without error, and is wrong in repeatable and specific ways.

The failures are not random. Boolean literals are true and false, not True and False. Null is null, not None. List comprehensions, f-strings, slice syntax, and try/except do not exist. The $NodePath shorthand for accessing children in the scene tree has no Python analog and requires understanding the scene lifecycle to use correctly. Not every Python habit fails: Array.append() is an alias of push_back(), and len() is a built-in alongside size(). None of the failing forms look wrong to a model steeped in Python. Some are rejected only when the engine parses the script; others surface as runtime failures or silent wrong behavior.
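A few of these substitutions side by side may make the pattern concrete. This is a minimal sketch, not an excerpt from Godogen's language spec; the function and variable names are illustrative, and f-string formatting is included as one more Python habit with no GDScript counterpart.

```gdscript
extends Node

# Python habit:  alive = True; target = None
var alive: bool = true
var target: Node = null

func even_squares(values: Array) -> Array:
    # Python habit:  [v * v for v in values if v % 2 == 0]
    # GDScript has no comprehensions, so the loop is written out.
    var result: Array = []
    for v in values:
        if v % 2 == 0:
            result.push_back(v * v)
    return result

func describe(values: Array) -> String:
    # Python habit:  f"{len(values)} values"
    # GDScript has no f-strings; the % format operator does the job.
    return "%d values" % values.size()
```

Each commented Python form is something a model would plausibly emit, which is why a written list of divergences is more useful as injected context than a general syntax reference.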

Compounding the syntax problem is version contamination. Godot 4 shipped in March 2023 with breaking changes thorough enough that a model trained on a mixed corpus of Godot 3 and Godot 4 documentation produces hybrid code that satisfies neither version:

# Godot 3 signal connection
$Timer.connect("timeout", self, "_on_timer_timeout")

# Godot 4 signal connection
$Timer.timeout.connect(_on_timer_timeout)

Both lines look syntactically plausible. Only one works in Godot 4. Similar divergences affect KinematicBody2D versus CharacterBody2D, Spatial versus Node3D, and move_and_slide() where velocity changed from a parameter to a class property. These are not edge cases. They appear in nearly every physics or movement script. A model trained on mixed-era documentation will produce code that fails consistently on the most common tasks.
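The movement case can be sketched in a few lines. The script below is an illustrative Godot 4 example, not code from Godogen; the SPEED value is arbitrary and the input actions are Godot's default ui_left and ui_right.

```gdscript
extends CharacterBody2D  # Godot 3: extends KinematicBody2D

const SPEED := 200.0  # illustrative value, pixels per second

func _physics_process(_delta: float) -> void:
    var direction := Input.get_axis("ui_left", "ui_right")
    # Godot 3: velocity = move_and_slide(velocity)
    # Godot 4: velocity is a built-in property and move_and_slide() takes no arguments.
    velocity.x = direction * SPEED
    move_and_slide()
```

A hybrid that extends CharacterBody2D but still passes velocity to move_and_slide() is exactly the kind of output a mixed corpus encourages.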

Godogen treats this as unsolvable at the training level. The solution is inference-time context injection: a hand-written GDScript language spec documenting the specific places where GDScript diverges from Python, API documentation converted from Godot’s own XML source files version-locked to Godot 4, and a curated quirks database for behaviors that exist nowhere in official documentation. The version locking matters independently of corpus size. Even abundant GDScript training data would actively teach wrong behaviors across the Godot 3/4 boundary; explicitly scoping the injected documentation to one version narrows the failure surface.

The full Godot 4 API spans approximately 850 classes. Loading all of it as context on every generation call is not practical. Godogen’s agent first identifies which classes are relevant to the requested game type, then retrieves only those definitions before generation begins. A 2D platformer resolves to a predictable, compact set: CharacterBody2D, Sprite2D, CollisionShape2D, Area2D, Camera2D, AnimationPlayer, Timer, Input. This is retrieval-augmented generation applied to a structured API catalog, with game structure as the retrieval key rather than semantic similarity search. A smaller context also reduces the surface area for hallucinating plausible-sounding but nonexistent method names from adjacent subsystems.
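A rough sketch of the idea follows, assuming a hypothetical lookup table keyed by game type; it is not Godogen's actual retrieval code. The class names are the ones listed above, and ClassDB is the engine's own class registry, so a misspelled or Godot 3 name is caught before any documentation is fetched.

```gdscript
extends SceneTree

# Hypothetical retrieval table: game type -> classes whose docs get injected.
const CLASSES_BY_GAME_TYPE := {
    "platformer_2d": [
        "CharacterBody2D", "Sprite2D", "CollisionShape2D", "Area2D",
        "Camera2D", "AnimationPlayer", "Timer", "Input",
    ],
}

func _init() -> void:
    for cls_name in CLASSES_BY_GAME_TYPE["platformer_2d"]:
        if not ClassDB.class_exists(cls_name):
            print("not a Godot 4 class: ", cls_name)
            continue
        # true = skip inherited methods, keeping the per-class listing compact.
        var methods := ClassDB.class_get_method_list(cls_name, true)
        print(cls_name, ": ", methods.size(), " methods declared on the class itself")
    quit()
```

The same registry could be used after generation to check that every method name the model emitted exists on the class it was called on.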

The Headless Builder Pattern

Godot scenes are stored in .tscn files, a text serialization format with specific bookkeeping invariants. load_steps in the header is expected to equal the number of resource declarations plus one for the scene itself. UIDs must be globally unique across the project. Parent path strings must accurately describe tree structure. Resource IDs must precisely match every reference to them. A wrong load_steps count raises no error. A mismatched UID causes a resource to fail loading with no useful diagnostic.

Generating .tscn text directly is a bookkeeping problem, not a reasoning problem. LLMs are poor at maintaining exact counts across a document and tracking globally unique identifiers. Godogen avoids the format entirely. Instead, scenes are assembled by headless GDScript tool scripts that build the node graph in memory using Godot’s own API and serialize via PackedScene and ResourceSaver:

@tool
extends SceneTree

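func _init() -> void:
    # Named scene_root to avoid shadowing SceneTree.root.
    var scene_root := Node2D.new()
    scene_root.name = "Game"

    var player := CharacterBody2D.new()
    player.name = "Player"
    scene_root.add_child(player)
    player.owner = scene_root  # required for serialization

    var packed := PackedScene.new()
    var err := packed.pack(scene_root)
    if err == OK:
        # Assumes res://scenes/ already exists.
        err = ResourceSaver.save(packed, "res://scenes/game.tscn")
    if err != OK:
        push_error("Scene build failed: %s" % error_string(err))
    scene_root.free()
    quit(0 if err == OK else 1)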

The engine handles format correctness by construction. The model only needs to describe logical structure: node types, properties, parent relationships, script assignments. This pattern has analogues in other domains: AWS CDK generates CloudFormation JSON through TypeScript objects rather than raw JSON templates; CDK8s generates Kubernetes manifests through code rather than raw YAML; Pulumi generates infrastructure configuration through general-purpose languages. In each case, the approach delegates format consistency to a system that cannot get it wrong.

But delegating to the engine introduces its own specific failure mode, one that Godogen identifies as requiring explicit prompting to avoid. When you add a node to the scene tree programmatically and then call PackedScene.pack(), only nodes whose .owner property is set to the scene root are included in the packed scene. Adding a node without setting its owner produces no error at any stage. ResourceSaver.save() returns success. The file is written to disk. When the file is reloaded, the node is simply absent.

This behavior is documented, but not in a way that surfaces it as the primary failure mode of headless scene construction. It is the kind of knowledge that accumulates only by running into the failure: building a scene, saving it, reopening it, and finding a hierarchy with missing pieces. This is why the quirks database exists. The API reference describes what add_child() does. It does not foreground that add_child() without setting .owner is semantically incomplete in the specific context of headless scene construction. Operational knowledge of this kind does not appear in training data because training data captures documented behavior, not the contracts between documented behavior and the conditions under which it actually applies.
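One way to guard against the vanishing node is to assign owners in a single pass and then round-trip the packed scene in memory. The sketch below assumes the tree was built from scratch in the builder; own_all and count_nodes are hypothetical helper names, and instanced sub-scenes would need different handling.

```gdscript
extends SceneTree

# Give every descendant the scene root as owner so pack() includes it.
func own_all(node: Node, scene_root: Node) -> void:
    for child in node.get_children():
        child.owner = scene_root
        own_all(child, scene_root)

func count_nodes(node: Node) -> int:
    var total := 1
    for child in node.get_children():
        total += count_nodes(child)
    return total

func _init() -> void:
    var game_root := Node2D.new()
    game_root.name = "Game"
    var player := CharacterBody2D.new()
    player.name = "Player"
    game_root.add_child(player)
    var shape := CollisionShape2D.new()
    shape.name = "Shape"
    player.add_child(shape)

    own_all(game_root, game_root)

    var packed := PackedScene.new()
    packed.pack(game_root)
    # Round trip: a node without an owner would be missing from the copy.
    var copy := packed.instantiate()
    if count_nodes(copy) != count_nodes(game_root):
        push_error("packed scene lost nodes; check .owner assignments")
    copy.free()
    game_root.free()
    quit()
```

Comparing node counts turns a failure that produces no error into one that produces a log line, which is the form an automated loop can act on.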

Build-Time Context and Runtime Context Are Not the Same

The headless assembly context and the game runtime are distinct execution environments with different capabilities. Several GDScript features exist only in the runtime. The @onready annotation defers variable assignment until _ready() fires when a node enters the live scene tree during gameplay:

# Resolved just before _ready(), once the node is in the live scene tree
@onready var health_bar: ProgressBar = $UI/HealthBar

# Without @onready the lookup runs at construction, before the node has children, and yields null
var health_bar = $UI/HealthBar

During headless construction, _ready() never fires. Variables initialized with @onready remain null. Signal connections made on the in-memory nodes at build time are not written into the saved scene unless they carry the CONNECT_PERSIST flag, so wiring done in the builder is typically gone when the game loads. Code that presupposes a running scene tree either fails outright or silently produces state that does not persist into the actual game.

For a code generation model, the practical challenge is that syntax validity and runtime correctness are separate criteria tied to separate execution phases. Without explicit context about which APIs are valid in which phase, the model conflates them. Godogen’s solution was to treat the two phases as distinct generation targets, injecting phase-aware constraints into context when generating code that runs in each environment.
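As an illustration of the runtime side of that split, the script below keeps everything that depends on a live tree inside @onready and _ready(). It is a sketch, not Godogen output: the file name player.gd and the HitTimer child are hypothetical, and the builder is assumed to have created that child, set its owner, and attached this script without calling into it.

```gdscript
# player.gd: runs inside the game, never inside the builder.
extends CharacterBody2D

# Resolved just before _ready(), once this node and its children are in the live tree.
@onready var hit_timer: Timer = $HitTimer  # hypothetical child created at build time

func _ready() -> void:
    # Runtime-only wiring: the connection is made on live nodes each time the scene loads.
    hit_timer.timeout.connect(_on_hit_timer_timeout)

func _on_hit_timer_timeout() -> void:
    print("hit cooldown finished")
```

The builder's job ends at structure and script assignment; anything that reads the tree or connects signals belongs to the script that runs later.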

Evaluation Cannot Go Through the Generator

An agent generating GDScript evaluates its output against its own internal model of what correct GDScript looks like. When that internal model is calibrated to a mixed corpus covering two major versions of the engine and a Python-adjacent language, self-evaluation cannot detect the systematic failures it produces. The model does not know that True is wrong and true is right in GDScript. Its evaluation surface is the same surface that generates the error.

Godogen routes evaluation through the actual engine. Generated games run via godot --headless to capture runtime errors and crashes. A virtual framebuffer enables screenshot capture to catch visual failures that produce no log output, such as a correctly loading scene that renders nothing because a node is missing from the hierarchy due to a missing owner assignment.

Execution-based validation catches a well-defined class of failures, but the correctness hierarchy for games has layers that automated execution does not reach. Structural correctness (does it compile and load) is fully catchable. Functional correctness (does the game loop behave as specified) is partially catchable through instrumented game state and log output; a bug like a CharacterBody2D with no CollisionShape2D passes structural evaluation and manifests only when physics actually runs. Experiential correctness (does the jump arc feel right, is the camera speed appropriate, is platform spacing playable) requires human judgment by definition. No execution trace produces a signal about those qualities.
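Some functional bugs can be pulled forward into a structural check. The sketch below is an assumption about how such a check might look, not part of Godogen: it loads the scene path used in the builder example, walks the tree, and reports any 2D physics body or area with no collision shape child. Shapes added through code at runtime would not be seen by it.

```gdscript
extends SceneTree

# Collect CollisionObject2D nodes (bodies and areas) that have no shape child.
func find_shapeless(node: Node, found: Array) -> void:
    if node is CollisionObject2D:
        var has_shape := false
        for child in node.get_children():
            if child is CollisionShape2D or child is CollisionPolygon2D:
                has_shape = true
                break
        if not has_shape:
            found.push_back(node.name)
    for child in node.get_children():
        find_shapeless(child, found)

func _init() -> void:
    # Path taken from the builder example; adjust for the project under test.
    var packed := load("res://scenes/game.tscn") as PackedScene
    if packed == null:
        push_error("scene failed to load")
        quit(1)
        return
    var scene := packed.instantiate()
    var found: Array = []
    find_shapeless(scene, found)
    for node_name in found:
        print("no collision shape: ", node_name)
    scene.free()
    # A non-zero exit code lets the calling pipeline treat the finding as a failure.
    quit(1 if found.size() > 0 else 0)
```

A script like this can be run with godot --headless --script, so it slots into the same loop that captures runtime errors.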

This three-tier structure maps onto a familiar problem in software testing: unit tests cover structural correctness, integration and end-to-end tests cover functional correctness, and user research covers experiential correctness. Benchmarks like SWE-bench measure the first two tiers through automated test suites, where top agents now reach 45-55% on verified subsets. Game generation exposes all three tiers simultaneously, and the third tier has no equivalent benchmark because it resists automation.

For comparison, Rosebud AI generates browser-based games using Phaser.js and JavaScript. JavaScript has orders of magnitude more training data than GDScript, no version-contamination problem from a breaking engine migration, and an evaluation loop that runs in milliseconds in a sandboxed browser tab. The structural differences explain why browser-based generation tools ship simpler architectures. Godogen’s more expensive evaluation loop, custom reference system, and quirks database are not over-engineering. They are the cost of targeting a native engine with a sparse and contaminated training corpus.

The year of rewrites and the architecture that resulted from them demonstrate that LLM code generation in niche domains is primarily a knowledge engineering problem, not a model capability problem. The leverage points are inference-time context (version-locked docs, operational quirks), a generation strategy that delegates format correctness to the target system’s own API, and evaluation that routes through the ground-truth execution environment rather than the generator itself. The same combination applies to any domain where the gap between what models learned and what the runtime actually requires is wide enough to matter.
