
The Quirks Database: What Formal Documentation Can't Tell an LLM

Source: hackernews

Every mature runtime environment has two layers of knowledge. The first is formal: API references, language specifications, official tutorials. The second is informal: behaviors you learn from debugging rather than reading, the things that only appear in Stack Overflow answers three years old, in Discord servers where someone lost work and posted about it, in the comments of issues that were closed as “working as intended.” Formal documentation covers the specification. The informal layer covers the gap between the specification and what actually happens when you build something real.

Godogen, a pipeline that generates complete, playable Godot 4 projects from text prompts, had to address both layers. The formal layer is the expected part: the project built a reference system from Godot’s XML API source files, a hand-written language specification, and a lazy-loading mechanism to avoid exhausting the context window with all 850 GDScript classes at once. What is less expected, and more instructive, is the additional component: a quirks database for engine behaviors that don’t appear in the official documentation at all.

What the Quirks Database Captures

The example mentioned in the project’s own description is representative. When building Godot scenes using headless scripts rather than the editor, every node added to the tree must have its owner property set explicitly, or it silently vanishes when the scene file is saved to disk and reloaded:

var node = Node2D.new()
scene_root.add_child(node)
node.owner = scene_root  # Omit this and the node disappears on reload

The owner property exists in the API reference. Its purpose is documented: it identifies which scene the node belongs to for serialization. What the API reference does not tell you, at least not in a way you would encounter before running into the problem, is that omitting it during programmatic scene construction produces no error and no warning. The node appears correctly in memory and in the live tree, but packing the scene skips any node whose owner is unset, so it is simply absent from the saved .tscn file and from every subsequent load. The failure is silent, and the documentation provides no mechanism to anticipate it unless you already know to look.
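The full round trip makes the silence visible. A minimal sketch of the failure in a headless Godot 4 script (the path and node names are illustrative, not taken from the project):

```gdscript
var root := Node2D.new()
root.name = "Root"

var kept := Node2D.new()
root.add_child(kept)
kept.owner = root        # owned by the root, so it is serialized

var lost := Node2D.new()
root.add_child(lost)     # owner never set; packing skips this node

var packed := PackedScene.new()
packed.pack(root)        # succeeds with no error and no warning
ResourceSaver.save(packed, "res://example.tscn")

var reloaded := (load("res://example.tscn") as PackedScene).instantiate()
print(reloaded.get_child_count())  # 1, not 2: "lost" is gone
```

Nothing in this sequence reports a problem; the discrepancy only appears if you count the children after reloading.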

A second example: @onready annotations in GDScript defer node reference resolution until _ready() fires after the scene tree is live. In a headless construction context, before any scene tree exists, @onready is effectively a no-op. The annotation parses without error, but the variable it decorates will not be initialized in the way you expect:

@onready var sprite: Sprite2D = $Sprite2D  # Works in a running game,
                                           # silently doesn't work headlessly

The behavior is correct per the specification. @onready does exactly what it says: it awaits readiness. In a headless context there is no readiness signal. The documentation describes the annotation accurately. What it doesn’t tell you is that headless construction requires restructuring how you assign node references, because the contract that @onready depends on does not exist in that execution context.
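The restructuring that headless construction forces is to resolve node references eagerly, at the moment the nodes are created, rather than deferring resolution to a lifecycle event that will never fire. A sketch of the pattern, with illustrative naming:

```gdscript
# Headless-safe: build the tree and hold direct references,
# instead of relying on @onready / $Sprite2D resolution.
var root := Node2D.new()

var sprite := Sprite2D.new()
sprite.name = "Sprite2D"
root.add_child(sprite)
sprite.owner = root

# The reference is valid immediately; no _ready() is required.
sprite.position = Vector2(64, 64)
```

The difference is not cosmetic: @onready encodes an assumption about execution context, and generated code has to pick the pattern that matches the context it will actually run in.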

A third: the .tscn format requires a load_steps count in its header that must match the total resource count in the file: every ext_resource and sub_resource declaration, plus one for the scene itself. Off-by-one errors do not produce a clear error message. They produce scene corruption that varies in its presentation and is difficult to trace back to a header field without already knowing it matters.
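A minimal header makes the constraint concrete (the resource values here are illustrative): one ext_resource and one sub_resource, plus the scene itself, give load_steps=3.

```
[gd_scene load_steps=3 format=3]

[ext_resource type="Texture2D" path="res://icon.png" id="1"]

[sub_resource type="RectangleShape2D" id="RectangleShape2D_1"]

[node name="Root" type="Node2D"]
```

A generator that emits resource declarations and the header independently has to reconcile them at the end, which is exactly the kind of bookkeeping an LLM writing the file top-to-bottom gets wrong.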

Why LLMs Fail at This Layer Specifically

The tribal knowledge layer exists in LLM training data, but in a form that is hard to learn from. API references appear as structured, authoritative documents. Tribal knowledge appears as Stack Overflow questions from someone who hit a bug, GitHub issue comments from a maintainer explaining an edge case, Reddit threads where the person who eventually solved the problem never came back to post a clear solution.

The signal is diffuse and often mixed with noise. A model trained on public code repositories and documentation will internalize the formal layer reasonably well, in proportion to how much of it appears in the training data. The informal layer is encoded in patterns of problem and solution that are harder to generalize from. The model sees that node.owner = scene_root appears in some headless scene construction examples, but it does not reliably learn that omitting it causes a specific silent failure, because that causal chain is not consistently explicit in any single source.

For GDScript specifically, this problem is compounded by the training data scarcity that affects the formal layer as well. When a language has thin representation in the training corpus, the model is operating with uncertain priors on everything, including the informal knowledge. The model that is unsure which signal connection API to use for Godot 4 is also unsure which node lifecycle edge cases to be cautious about. The gaps in formal knowledge make it harder to reason correctly about informal knowledge, because both require the same underlying mental model of the runtime.

The Python trap makes this worse. GDScript’s surface similarity to Python means the model reaches for Python semantics when it lacks GDScript knowledge. Python does not have a concept equivalent to @onready; it does not have a scene tree lifecycle. The model that is applying Python intuitions to GDScript will produce code that reflects Python’s behavior assumptions, none of which include scene-tree-specific initialization phases.

The Database as Executable Memory

Godogen’s response is to treat this knowledge as a first-class engineering artifact. The quirks database is structured so the pipeline can inject relevant entries into the model’s context when they apply to the current generation task. It is not a pile of caveats appended to every prompt. It is a queryable resource, like the API documentation, that the agent accesses when it identifies that a particular class of operation is involved.

This is the right architecture. Appending every known quirk to every prompt would consume context budget on irrelevant information. Lazy-loading quirk entries in the same way the system lazy-loads API class references means the model has access to the relevant informal knowledge when it matters, without paying the token cost for knowledge it doesn’t need.
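The article does not describe the database’s schema, so the following is only a sketch of the selective-retrieval idea, written in GDScript for consistency with the rest of the post; the entry structure, tags, and function names are all assumptions, not Godogen’s actual implementation:

```gdscript
extends RefCounted

# Hypothetical quirk entries, tagged by the operations they apply to.
var quirks := [
	{
		"tags": ["add_child", "scene_construction"],
		"note": "Set node.owner after add_child(), or the node is omitted on save."
	},
	{
		"tags": ["@onready", "scene_construction"],
		"note": "@onready never resolves headlessly; assign references directly."
	},
	{
		"tags": ["tscn_header"],
		"note": "load_steps must match the file's resource count exactly."
	},
]

# Return only the entries relevant to the current generation task,
# so irrelevant caveats never consume context budget.
func quirks_for(operations: Array) -> Array:
	var relevant := []
	for quirk in quirks:
		for op in operations:
			if op in quirk["tags"]:
				relevant.append(quirk)
				break
	return relevant
```

Whatever the real data model, the essential property is the one this sketch has: retrieval is keyed to the operation being generated, not broadcast into every prompt.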

The analogy to other domains is instructive. MDN’s browser compatibility tables exist because browser behavior diverges from the specification in ways that developers need to know about in practice. The W3C spec is the formal layer. The compat tables are the informal layer, systematically documented and queryable by feature. SQLite’s quirks and gotchas documentation is an explicit acknowledgment that some behaviors require documentation beyond the API reference. The Linux kernel’s documentation acknowledges in multiple places that certain behaviors are only fully understood by reading the source or observing the behavior, not the formal interface description.

Godogen built the Godot equivalent for LLM consumption.

The Epistemological Problem

The harder question is: how do you build a quirks database? You cannot derive its contents from the formal documentation, because by definition it covers what formal documentation omits. You cannot generate it from the API source. You discover it through failure.

Godogen’s four major rewrites over a year are partly a record of discovery. Each rewrite reflects a cycle: generate scenes, observe failures, identify which failures stem from undocumented behavior, encode that behavior into the quirks database, regenerate. The owner property requirement was probably discovered by building a scene that looked correct in every observable way and then was missing half its nodes after a save-load cycle. The @onready headless behavior was probably discovered by writing code that worked in a running game and failed in the headless construction context with no informative error.

This is the ongoing cost of the approach. The quirks database is not built once. It must be maintained as the engine evolves. Godot 4 has changed its behavior in minor releases, and some of those changes will affect what belongs in the database. Unlike the formal API reference, which is maintained by the engine team and can be re-derived from the XML source on each release, the quirks database requires active observation of engine behavior. The pipeline depends on someone continuing to notice when generated output breaks for reasons the docs don’t explain.

What This Generalizes To

Any LLM pipeline targeting a domain-specific runtime will eventually hit the same boundary. The formal layer is a baseline; it tells the model what the API surface is and what each entry is supposed to do. The informal layer tells the model what actually happens in the configurations that matter for real use.

For well-represented languages, the informal layer is partially absorbed during pretraining, because enough tribal knowledge is present in the training corpus that the model has learned some of it implicitly. For GDScript, neither layer is well-represented, which means both must be engineered explicitly. The API reference system and the quirks database are complementary responses to the same underlying problem: the model does not know what it needs to know.

What Godogen identifies, by naming the quirks database as a distinct component rather than treating it as a prompt engineering detail, is that this informal knowledge has the same infrastructure requirements as formal documentation. It needs to be curated, maintained, structured for selective retrieval, and kept in sync with engine changes. It is not a set of one-time prompt improvements. It is an ongoing artifact whose quality determines whether the pipeline produces output that runs correctly in edge cases, not just in the scenarios the model already handles well.

For any serious LLM code generation pipeline in a niche domain, building the formal reference system is the expected work. Building the quirks database is the work that separates a pipeline that generates plausible code from one that generates code that consistently runs.

Godogen is open source on GitHub.
