Data Governance Infrastructure Turns Out to Be Agent Infrastructure
Source: simonwillison
The gap between a coding agent that works in a demo and one that works on a production warehouse is almost never about the model. Simon Willison’s exploration of coding agents for data analysis identifies schema quality as the primary performance variable, more consequential than sandbox choice or model selection. That framing is correct, and it points to a practical implication that most agent-building guides skip: data teams are probably already maintaining exactly this context, in dbt, and nothing is using it.
What agents need that raw schema doesn’t provide
A coding agent for data analysis works in a loop: receive a question, inspect the schema, write a query or script, execute it, read the output, and decide whether the answer is complete. The inspect-schema step is where most production failures originate.
Raw schema dumps, the output of DESCRIBE TABLE or information_schema.columns, tell an agent which columns exist and what types they carry. They say nothing about what null values mean in a particular column, whether 0 is a valid quantity or a data entry artifact, which status codes are currently active versus deprecated, or what the business definition of revenue_net is versus revenue_gross. Without this context, the agent fills gaps with training data priors. It assumes timestamp columns are UTC. It assumes user_id is a clean join key. It assumes the status column it’s filtering is exhaustive. These assumptions fail often enough to corrupt results silently, which is worse than a runtime error.
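The silent-corruption failure mode is easy to demonstrate with the timezone assumption alone. A toy sketch, not tied to any warehouse: a naive timestamp stored under a Pacific-time convention, bucketed by date under the agent's UTC assumption, lands on the wrong day with no error raised.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timestamp stored as naive local time (undocumented convention: US/Pacific).
raw = datetime(2024, 3, 1, 23, 30)  # no tz info in the column

# Agent's assumption: the column is UTC.
assumed_day = raw.replace(tzinfo=timezone.utc).date()

# Reality per the undocumented convention: Pacific time,
# which is 2024-03-02 07:30 UTC.
actual_day = (
    raw.replace(tzinfo=ZoneInfo("America/Los_Angeles"))
    .astimezone(timezone.utc)
    .date()
)

print(assumed_day, actual_day)  # 2024-03-01 vs 2024-03-02
```

Every late-evening row shifts into the wrong daily bucket, and the query still returns plausible-looking numbers.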
The BIRD benchmark for text-to-SQL tests leading models against schemas that include explicit column descriptions and value annotations. Top models reach 70-80% accuracy in that setting. Against raw schema with no annotations, performance drops substantially. The difference between benchmark conditions and most production warehouses is the annotations, and the annotations are the data engineering team’s job, not the model’s.
What dbt already has
A mature dbt project contains most of what a coding agent needs. It lives in schema.yml files and gets exported as a structured manifest.json artifact on every dbt build. Almost nobody is using it for anything except dbt docs generate.
A typical dbt column definition:
models:
  - name: orders
    description: "One row per order. Excludes test accounts where domain = 'internal.company.com'."
    columns:
      - name: status
        description: "Current fulfillment state. One of: pending, shipped, delivered, returned. Nulls indicate pre-2022 records migrated from the legacy system."
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'returned']
      - name: revenue_gross
        description: "Pre-return revenue. For P&L analysis, use revenue_net."
      - name: customer_id
        description: "Nullable. Nulls indicate guest checkouts, not data quality issues."
The accepted_values test is particularly useful. It encodes the complete set of valid states for a categorical column, information the agent can't derive from data alone if the column happens to contain only a subset of valid values in the sample it sees. When an agent knows status is always one of four values, it can't accidentally write WHERE status = 'complete', a bug that regularly appears in agent output against undocumented schemas because complete is a common, plausible-sounding status value.
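The same constraint can also work as a post-generation guardrail, not just prompt context: scan generated SQL for equality comparisons against annotated categorical columns and flag literals outside the accepted set. A minimal sketch; the regex is deliberately crude and the function name is illustrative, not from any library.

```python
import re

# Accepted-values sets, as exported from the dbt annotations above.
ACCEPTED_VALUES = {
    "status": {"pending", "shipped", "delivered", "returned"},
}

def check_literals(sql: str) -> list[str]:
    """Return violations: categorical columns compared against
    literals outside their accepted_values set."""
    violations = []
    for column, valid in ACCEPTED_VALUES.items():
        # Matches e.g. status = 'complete' (simplistic; a real check
        # would use a SQL parser).
        for match in re.finditer(rf"\b{column}\s*=\s*'([^']*)'", sql, re.IGNORECASE):
            literal = match.group(1)
            if literal not in valid:
                violations.append(f"{column} = '{literal}' is not a valid value")
    return violations

print(check_literals("SELECT * FROM orders WHERE status = 'complete'"))
# flags the invalid literal; status = 'shipped' would pass cleanly
```

A flagged query can be bounced back to the agent with the valid-values list attached, which usually fixes the filter in one correction round.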
Parsing manifest.json for agent context
The manifest.json file at target/manifest.json is structured for programmatic consumption. Building schema context from it is straightforward:
import json

def build_agent_context(manifest_path: str, model_names: list[str]) -> str:
    with open(manifest_path) as f:
        manifest = json.load(f)

    # manifest.json does not nest tests under columns; tests are separate
    # nodes. Collect accepted_values per (model unique_id, column) first.
    accepted_values: dict[tuple[str, str], list] = {}
    for node in manifest["nodes"].values():
        if (
            node.get("resource_type") == "test"
            and node.get("test_metadata", {}).get("name") == "accepted_values"
        ):
            kwargs = node["test_metadata"].get("kwargs", {})
            column = kwargs.get("column_name")
            if not column:
                continue
            for model_id in node.get("depends_on", {}).get("nodes", []):
                accepted_values[(model_id, column)] = kwargs.get("values", [])

    sections = []
    for model_name in model_names:
        node_key = f"model.your_project.{model_name}"  # your_project = dbt project name
        node = manifest["nodes"].get(node_key)
        if not node:
            continue
        lines = [f"Table: {model_name}"]
        if node.get("description"):
            lines.append(f"Description: {node['description']}")
        lines.append("Columns:")
        for col_name, col_info in node["columns"].items():
            desc = col_info.get("description", "")
            valid = accepted_values.get((node_key, col_name))
            if valid:
                desc += f" Valid values: {valid}."
            lines.append(f"  - {col_name}: {desc}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
The output feeds into the agent’s system prompt or the first turn of an analysis session. For large warehouses with hundreds of tables, selective retrieval via vector embedding (embedding table descriptions and retrieving the relevant subset at query time) scales the approach; LlamaIndex’s SQLContextContainerBuilder is one path to this.
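At smaller scale, the retrieval step can be sketched without an embedding model at all, using lexical overlap as a stand-in for cosine similarity. The function names here are illustrative; swap the scoring function for real embeddings in production.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def select_tables(question: str, table_descriptions: dict[str, str], k: int = 3) -> list[str]:
    """Rank tables by lexical overlap between the question and each
    table's description; a stand-in for embedding similarity."""
    q = tokens(question)
    ranked = sorted(
        table_descriptions,
        key=lambda t: len(q & tokens(table_descriptions[t])),
        reverse=True,
    )
    return ranked[:k]

tables = {
    "orders": "One row per order with fulfillment status and revenue.",
    "customers": "One row per customer account.",
    "web_events": "Raw clickstream events from the site.",
}
print(select_tables("What is monthly revenue by order status?", tables, k=1))  # ['orders']
```

Only the selected tables' schema context goes into the prompt, keeping token usage flat as the warehouse grows.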
DuckDB’s native annotation mechanism
DuckDB supports column comments natively, which is useful when the agent queries DuckDB directly rather than a warehouse:
COMMENT ON COLUMN orders.status IS
'One of: pending, shipped, delivered, returned. Nulls are pre-2022 legacy records.';
COMMENT ON COLUMN orders.revenue_gross IS
'Pre-return revenue. For P&L analysis, use revenue_net instead.';
Agents retrieve annotations programmatically before generating any analysis query:
SELECT table_name, column_name, comment
FROM duckdb_columns()
WHERE schema_name = 'main'
AND comment IS NOT NULL
ORDER BY table_name, column_index;
This surfaces annotations as structured output that drops cleanly into context. When a team works primarily with Parquet exports, these comments can be pre-applied in a setup script that runs before any agent session begins.
For teams already running dbt, a migration script that reads manifest.json and applies comments to the DuckDB layer is a one-time task. The descriptions your team wrote for dbt docs now serve two purposes.
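One shape that migration script can take, sketched here as a pure generator of COMMENT ON statements from the parsed manifest (execute them against DuckDB however you connect; the example assumes model name equals table name, and note the single-quote escaping):

```python
def comment_statements(manifest: dict) -> list[str]:
    """Generate DuckDB COMMENT ON COLUMN statements from dbt column
    descriptions. Assumes each dbt model maps to a same-named table."""
    statements = []
    for node in manifest["nodes"].values():
        if node.get("resource_type") != "model":
            continue
        table = node["name"]
        for col_name, col in node.get("columns", {}).items():
            desc = col.get("description")
            if desc:
                escaped = desc.replace("'", "''")  # SQL single-quote escaping
                statements.append(
                    f"COMMENT ON COLUMN {table}.{col_name} IS '{escaped}';"
                )
    return statements

# Minimal manifest fragment for illustration.
manifest = {
    "nodes": {
        "model.your_project.orders": {
            "resource_type": "model",
            "name": "orders",
            "columns": {
                "status": {"description": "One of: pending, shipped, delivered, returned."},
            },
        }
    }
}
print(comment_statements(manifest)[0])
```

Keeping the generator pure makes it easy to review the statements before they touch the database, and to rerun it idempotently after each dbt build.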
The virtuous cycle in practice
Deploying a coding agent against a production schema surfaces documentation gaps faster than any audit process. When an agent consistently produces wrong results for queries involving a particular table, it is almost always because the schema context for that table is missing or misleading. The agent’s confusion is a direct signal about analyst confusion; the same gaps that trip the agent trip new team members.
Logging which columns appear in queries that fail or require multiple correction rounds gives data engineers a ranked list of documentation priorities. This makes the ROI argument for schema documentation concrete. Teams that have struggled to justify annotation work on abstract productivity grounds now have a measurable quality difference between documented and undocumented tables visible in agent output.
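That logging loop can start as something as simple as counting column references in failing queries. The extraction here is a crude word match against known column names; a real SQL parser would do this properly, and the function names are illustrative.

```python
import re
from collections import Counter

# Column names from the warehouse catalog (illustrative subset).
KNOWN_COLUMNS = {"status", "revenue_gross", "revenue_net", "customer_id"}

def documentation_priorities(failed_queries: list[str]) -> list[tuple[str, int]]:
    """Rank known columns by how often they appear in failing queries."""
    counts: Counter[str] = Counter()
    for sql in failed_queries:
        words = set(re.findall(r"\w+", sql.lower()))
        for col in KNOWN_COLUMNS & words:
            counts[col] += 1
    return counts.most_common()

failures = [
    "SELECT sum(revenue_gross) FROM orders WHERE status = 'complete'",
    "SELECT status, count(*) FROM orders GROUP BY status",
]
print(documentation_priorities(failures))
# status appears in both failures, revenue_gross in one
```

The resulting ranking is the documentation backlog, ordered by measured agent pain rather than guesswork.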
The DABStep benchmark, focused on multi-step analytical reasoning over messy tabular data, found that the winning approach from NVIDIA’s NeMo team used a persistent library of verified, typed Python functions rather than generating fresh code for each step. Building that function library correctly still depends on accurate schema understanding; the annotation layer feeds both single-turn queries and longer analytical workflows.
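The function-library idea, reduced to its skeleton: a registry of typed, once-verified helpers that the agent composes instead of regenerating the same logic each turn. The names and decorator here are illustrative, not NVIDIA's implementation.

```python
from typing import Callable

# Registry of verified analysis functions: each is reviewed once,
# then reused across sessions instead of being regenerated.
VERIFIED: dict[str, Callable] = {}

def verified(fn: Callable) -> Callable:
    """Register a reviewed function under its name."""
    VERIFIED[fn.__name__] = fn
    return fn

@verified
def net_revenue(rows: list[dict]) -> float:
    """Sum revenue_net, skipping nulls (per the schema annotation,
    nulls are legacy records, not zeros)."""
    return sum(r["revenue_net"] for r in rows if r.get("revenue_net") is not None)

rows = [
    {"revenue_net": 10.0},
    {"revenue_net": None},  # legacy record, excluded by design
    {"revenue_net": 2.5},
]
print(net_revenue(rows))  # 12.5
```

The null-handling rule lives in the function once, sourced from the schema annotation, instead of being re-derived (or forgotten) in every generated query.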
Connecting to the existing toolchain
Simon Willison’s Datasette and llm CLI have always treated generated SQL as a first-class output rather than an internal mechanism. The metadata YAML that Datasette uses for human-readable database publishing transfers cleanly to agent context because both consumers need the same information: what the columns mean, not just what they are.
For teams running dbt, the path from existing documentation to working agent context is shorter than it appears. The manifest.json is already being generated. The E2B sandbox provides isolated execution with DuckDB included by default. The schema annotations your team wrote for dbt docs now serve as agent configuration.
The framing shift worth making is this: data governance infrastructure and agent infrastructure are not separate investments. Column descriptions, accepted-values constraints, and model-level documentation are the configuration layer for both human analysts and automated agents. Teams that have done this work get better agent results essentially for free. Teams that have not will find that fixing agent output quality is, structurally, the same problem as improving their data documentation, approached from a different direction.