
Schema Quality Determines Data Analysis Agent Output More Than Execution Does

Source: simonwillison

Most discussion of coding agents for data analysis concentrates on execution infrastructure: which sandbox to use, how to handle errors, whether to generate Python or SQL. Simon Willison’s recent exploration of the space covers this landscape well. The execution environment matters, but it is not where most data analysis agent failures originate.

The majority of failures happen before the first line of code is written, in the gap between what the agent knows about the data and what it needs to know to generate correct analysis. This is the data context problem, and it is underengineered in most deployments.

What “Schema” Means to an Agent

When developers talk about passing schema information to a data analysis agent, they often mean column names and data types: the output of DESCRIBE TABLE or information_schema.columns. This is necessary but not sufficient.

A raw schema dump tells the agent what columns exist. It says nothing about:

  • What null values mean semantically (missing data vs. not applicable vs. historical rows from before the column existed)
  • The valid range of values for numeric columns (is 0 a valid quantity or a data entry error?)
  • What the cardinality of categorical columns looks like
  • Which columns are safe join keys and which have quality issues
  • What business-specific terms mean: revenue_gross vs. revenue_net vs. revenue_recognized
  • Time zone conventions and whether timestamps are stored in local time or UTC
  • Deprecated columns that still exist for backward compatibility but should not be queried

When an agent lacks this context, it fills the gaps with training data priors. It assumes user_id columns are always clean join keys. It assumes timestamp columns are UTC. It assumes status columns with values like active and inactive are exhaustive. These assumptions are wrong often enough to matter.

The Concrete Difference

Consider an agent asked to compute monthly active users from an events table. With a minimal schema:

events: user_id (int), event_type (varchar), timestamp (timestamp), session_id (varchar)

The agent might generate:

monthly_active = (
    events_df[
        (events_df['timestamp'] >= month_start)
        & (events_df['timestamp'] < month_end)
    ]['user_id']
    .nunique()
)

This counts distinct users who appear anywhere in the events table during the month. If the table contains account creation events, password reset events, and notification delivery events alongside actual session activity, the count is inflated. The code runs without error. The number looks plausible. The answer is wrong.

With an annotated schema:

events:
  user_id: int, foreign key to users.id
  event_type: varchar
    valid values: ['session_start', 'session_end', 'page_view', 'account_create', 'password_reset']
    note: 'monthly active user' is defined as a user with at least one 'session_start' event
  timestamp: timestamp, stored as UTC
  session_id: varchar, null for non-session events

With this context, the generated code filters on event_type == 'session_start' before computing distinct user counts. Same execution environment, same model, different output quality. The annotation is about twenty words. The impact on correctness is substantial.
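A sketch of the corrected computation under the same assumptions (a pandas DataFrame and UTC month boundaries), with hypothetical sample rows added so it runs standalone:

```python
import pandas as pd

# Hypothetical sample data matching the annotated schema.
events_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "event_type": ["session_start", "page_view", "password_reset",
                   "session_start", "session_start"],
    "timestamp": pd.to_datetime([
        "2024-05-01", "2024-05-01", "2024-05-02",
        "2024-05-10", "2024-05-20",
    ], utc=True),
})

month_start = pd.Timestamp("2024-05-01", tz="UTC")
month_end = pd.Timestamp("2024-06-01", tz="UTC")

# Apply the business definition: only 'session_start' events qualify.
monthly_active = (
    events_df[
        (events_df["event_type"] == "session_start")
        & (events_df["timestamp"] >= month_start)
        & (events_df["timestamp"] < month_end)
    ]["user_id"]
    .nunique()
)
# monthly_active == 2: users 1 and 3; user 2 only reset a password
```

The one-line difference — the event_type filter — is exactly the information the raw schema could not supply.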

Token Budget and the Retrieval Problem

For a setup with a handful of tables, passing a fully annotated schema in the system prompt is straightforward. For a data warehouse with 200 tables and 3,000 columns, it is not.

A full warehouse schema, even minimally documented, can run to 50,000 or more tokens. That exceeds the practical context limit for most deployments, and it fills the agent’s context with noise: tables it will never use, columns irrelevant to the question at hand.
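The arithmetic behind that estimate is rough but easy to check. Assuming around 15 tokens per documented column and a short header per table (both figures are illustrative):

```python
tables, columns = 200, 3000
tokens_per_column = 15        # name, type, one-line description (rough estimate)
tokens_per_table_header = 25  # table name, description, key relationships

schema_tokens = columns * tokens_per_column + tables * tokens_per_table_header
print(schema_tokens)  # 50000 tokens before any question-specific content
```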

The solution is selective schema retrieval. You embed summaries of each table, store them in a vector index, and retrieve the most relevant tables based on the user’s question before building the agent’s prompt. This is, structurally, a RAG problem: the schema is the knowledge base, the question is the query, and the retrieved table descriptions are the context.

LlamaIndex provides purpose-built components for this with its SQLTableNodeMapping and SQLContextContainerBuilder. The quality of table retrieval depends directly on how well each summary captures its table’s purpose, which loops back to the annotation problem. A table named ev with no description will never be retrieved for a question about “event counts”; a table named events whose description includes “user activity events including session starts and page views” will be retrieved correctly.

A minimal schema retrieval pipeline:

# At index time: embed a one-paragraph summary of each table.
# ('warehouse' and 'index' stand in for your catalog client and vector index.)
for table in warehouse.tables:
    summary = (
        f"{table.name}: {table.description}. "
        f"Columns: {', '.join(col.name + ' - ' + col.description for col in table.columns)}"
    )
    index.add(summary, metadata={"table": table.name})

# At query time: retrieve only the relevant tables and build annotated context.
relevant_tables = index.query(user_question, top_k=5)
schema_context = build_annotated_schema(relevant_tables)

The retrieval step adds a few hundred milliseconds of latency, cuts context usage by an order of magnitude on large warehouses, and produces meaningfully better code because the agent is not distracted by irrelevant table names and columns.

Existing Standards Worth Borrowing From

Two ecosystem tools have independently solved parts of this problem.

dbt requires column and model descriptions in its schema.yml files, and its documentation build generates a full data dictionary from these descriptions. A warehouse with thorough dbt documentation is, incidentally, well-suited for agent consumption. The dbt docs JSON artifact contains structured descriptions of every model, column, test, and relationship. An agent with access to that artifact has substantially more context than one reading raw schema output from the database. If your organization has already invested in dbt documentation for human users, that investment transfers directly to agent quality.
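A sketch of pulling that context out of a dbt manifest. The function name is hypothetical; the "nodes", "description", and "columns" keys follow the structure of dbt's target/manifest.json artifact:

```python
import json

def dbt_model_context(manifest_path, model_names):
    """Extract model and column descriptions from a dbt manifest
    into a plain-text block suitable for an agent's schema context."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    lines = []
    for node in manifest["nodes"].values():
        if node.get("resource_type") != "model":
            continue
        if node["name"] not in model_names:
            continue
        lines.append(f"{node['name']}: {node.get('description', '')}")
        for col in node.get("columns", {}).values():
            lines.append(f"  {col['name']}: {col.get('description', '')}")
    return "\n".join(lines)
```

Passing only the models relevant to the user's question keeps this compatible with the selective retrieval approach above.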

Datasette, Willison’s SQLite publishing tool, uses YAML metadata files that annotate databases with titles, descriptions, column units, and example values. These were designed for human users of a published dataset, but the format is clean enough that agents read it without transformation. A Datasette deployment with thorough metadata.yml is a natural fit for agent-driven analysis. The annotations that help a human understand a dataset through Datasette’s web interface are the same ones that help an agent generate correct queries against it.
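A sketch of what such a metadata.yml might contain. The database, table, and column names are illustrative; description, columns, and units are Datasette's documented table metadata fields:

```yaml
databases:
  analytics:
    tables:
      events:
        description: User activity events, including session starts and page views
        columns:
          user_id: Foreign key to users.id
          event_type: "One of: session_start, session_end, page_view, account_create, password_reset"
          timestamp: Event time, stored as UTC
        units:
          duration_ms: ms
```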

Both tools demonstrate the same principle: documentation that travels with the data, in a structured format, accumulates value over time. Every annotation added for one purpose is an improvement to every other consumer of that data, including agents.

The Business Logic Layer

The hardest part of schema documentation is not describing column types. It is encoding the business logic that experienced analysts carry in their heads but rarely write down.

“Monthly active users” means something specific at each company. So does “revenue”. So does “conversion rate”. An agent asked for these metrics needs the definition, including all the edge cases: which event types count, how to handle users with multiple accounts, whether free trial users are included, what the deduplication window is.
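One lightweight way to get such a definition out of an analyst's head and into a form an agent can read is plain structured data. Every name and value below is illustrative, not a standard:

```python
# Hypothetical metric definition, including the edge cases that
# usually live only in analysts' heads.
MAU_DEFINITION = {
    "metric": "monthly_active_users",
    "description": "Distinct users with at least one qualifying event "
                   "in the calendar month (UTC).",
    "qualifying_event_types": ["session_start"],
    "includes_free_trial_users": True,
    "excluded_users": "internal test accounts",
    "dedup_key": "user_id",
    "dedup_window": "calendar month",
}
```

Serialized into the system prompt, a definition like this answers the edge-case questions before the agent has to guess at them.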

For existing analytical teams, some of this is already encoded in BI tool metric definitions, dbt metrics, or Looker LookML. The most tractable path to high-quality agent analysis is usually not to document the schema from scratch, but to make the existing definitions accessible to the agent. A system prompt that includes the relevant dbt metric definitions, or the LookML explore that corresponds to the user’s question, gives the agent the business logic it needs to generate semantically correct code.

This represents a shift in how data infrastructure teams might think about their documentation: not as something for BI tools specifically, and not as something for agents specifically, but as a shared semantic layer that all analytical consumers read. The investment in dbt model descriptions, column-level tests, and metric definitions pays off across all tooling, whether the consumer is a dashboard, a notebook, or a coding agent.

This is also where the BIRD benchmark for text-to-SQL agents reveals its most consistent finding: agents perform substantially better when given database value annotations (sample values for categorical columns, range information for numeric ones) than when given schema structure alone. The gap between schema-only and schema-plus-annotations runs is large enough that annotation quality separates systems more reliably than model choice does within the same capability tier.
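Value annotations of this kind can be bootstrapped mechanically from a data sample. A sketch, with hypothetical names and thresholds:

```python
import pandas as pd

def value_annotations(df, categorical_max=20):
    """Derive value annotations from a sample: ranges for numeric
    columns, enumerated values for low-cardinality ones."""
    notes = {}
    for col in df.columns:
        series = df[col].dropna()
        if pd.api.types.is_numeric_dtype(series):
            notes[col] = f"numeric, range [{series.min()}, {series.max()}]"
        elif series.nunique() <= categorical_max:
            values = sorted(str(v) for v in series.unique())
            notes[col] = f"categorical, valid values: {values}"
        else:
            notes[col] = f"high cardinality, {series.nunique()} distinct values"
    return notes
```

Mechanically derived annotations are a floor, not a ceiling: they capture what the data contains, but not what the values mean, so they complement rather than replace human-written descriptions.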

Practical Starting Point

If you are deploying a data analysis agent against an existing warehouse and want to improve output quality without changing the execution environment, schema annotation is the highest-leverage starting point. A reasonable first pass:

  • Write a two-sentence description for each of your ten most-queried tables
  • Document the valid values for all categorical columns used in filters
  • Add notes to any column where the business definition differs from the name (anything like net_arr, bounced, activated, or mau warrants a note)
  • Explicitly document foreign key relationships, even when the database does not enforce them

Pass this alongside the raw schema output. The execution infrastructure determines whether the agent can run code at all. The schema context determines whether the code it runs is correct. For most production deployments, execution is solved and annotation is not.
