What Running Code Fixes in AI Data Analysis, and What It Doesn't
Source: simonwillison
Simon Willison has been exploring coding agents for data analysis, and the core promise is real: an agent that writes Python or SQL, runs it in a sandbox, observes the output, and iterates is doing something categorically different from an LLM predicting a plausible number. The grounding matters. But the gains are distributed unevenly across failure types, and the failures that remain after execution are the most dangerous ones precisely because the loop gives no signal that anything went wrong.
The Loop and What It Promises
The architecture is the ReAct pattern: reason, act, observe, repeat. The agent receives a question and schema context, generates code, submits it to an execution environment, reads back stdout and stderr, and either answers or continues. Every iteration extends the context window with new observations.
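The loop above can be sketched in a few lines. This is a hypothetical skeleton, not any framework's actual API: `run_in_sandbox`, `agent_loop`, and the `generate_code` callable are illustrative names, and a subprocess stands in for a real sandbox.

```python
# Hypothetical sketch of the reason-act-observe loop. A subprocess stands
# in for a real sandbox (Docker, E2B, etc.); `generate_code` stands in
# for the model call.
import subprocess
import sys

def run_in_sandbox(code: str) -> tuple[str, str]:
    """Execute code in a subprocess and capture stdout/stderr."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return proc.stdout, proc.stderr

def agent_loop(question: str, generate_code, max_steps: int = 5) -> str:
    """Reason, act, observe, repeat: each iteration appends the sandbox
    output to the context the model sees on the next turn."""
    context = [f"Question: {question}"]
    for _ in range(max_steps):
        code = generate_code("\n".join(context))          # reason + act
        stdout, stderr = run_in_sandbox(code)             # execute
        context.append(f"stdout: {stdout}\nstderr: {stderr}")  # observe
        if not stderr:                                    # crude stopping rule
            return stdout.strip()
    return "gave up"

# A trivial stand-in generator so the loop is runnable without a model:
answer = agent_loop("2 + 2?", lambda ctx: "print(2 + 2)")
```

Real agents replace the stopping rule with a model decision to answer or continue, but the context-accumulation structure is the same.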
This is how OpenAI’s Advanced Data Analysis works, how Jupyter AI works, and how most agent setups plug Claude or GPT into an execution environment such as E2B or a self-hosted Docker sandbox. The execution environment varies. The loop structure is consistent.
What execution changes is the source of quantitative claims. When an agent reports that median order value dropped 12% in Q3, it ran df['amount'].median() against actual data and read the result. It did not predict a plausible number based on training data patterns. That distinction is the entire value of the approach.
The question is which failures the loop catches and which ones survive it undetected.
The Three Error Classes
Failures in coding agents for data analysis divide cleanly into three types, and they behave very differently in the execute-observe loop.
Syntax and reference errors are caught immediately. A column name that doesn’t exist, a function called with wrong argument types, an import that fails. These produce tracebacks. The agent reads the error, corrects the code, and tries again. In practice, well-implemented agents recover from these reliably because the feedback is unambiguous. The error names the problem.
Runtime errors from data shape assumptions are usually caught. The agent writes df['timestamp'].dt.month assuming the column is datetime, but it was loaded as a string. The result is AttributeError: Can only use .dt accessor with datetimelike values. The agent corrects the type conversion and continues. These errors are recoverable because the execution environment reports them explicitly, and the agent has enough context to fix them.
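The recovery path described here is mechanical enough to show directly. A minimal reproduction, assuming the timestamp column arrived as strings:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["2024-01-15", "2024-02-03"]})

# First attempt fails: the column was loaded as strings, not datetimes.
try:
    months = df["timestamp"].dt.month
except AttributeError as e:
    error = str(e)  # names the problem: .dt needs datetimelike values

# The traceback is unambiguous, so the fix is a type conversion:
df["timestamp"] = pd.to_datetime(df["timestamp"])
months = df["timestamp"].dt.month
```

Because the error message names the accessor and the required dtype, the agent has everything it needs to generate the corrected version on the next iteration.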
Here is where the choice of query layer starts to matter. SQL catches more of these class-two errors at parse time rather than runtime. A query referencing a nonexistent column fails immediately with a message naming the column. A GROUP BY clause omitting a non-aggregated select expression is refused before any data is touched. DuckDB’s column-level comment system (COMMENT ON COLUMN orders.status IS 'One of: pending, shipped, delivered, returned') gives the agent structured vocabulary it can read back from the catalog via the duckdb_columns() metadata function, which converts ambiguous categorical assumptions into explicit, checkable facts.
Pandas, by contrast, tends to fail at runtime, with messages tied to the current object state rather than the schema. The failure surface is wider. More subtly, pandas has a notorious class of silent failures: df.groupby('category').sum() sums all numeric columns, not just the one you intended. No error is raised. The output looks reasonable. This is a class-three failure disguised as correct execution.
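The silent groupby failure is easy to reproduce (synthetic data; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "amount":   [10.0, 20.0, 5.0],
    "discount": [1.0, 2.0, 0.5],   # a column we did NOT intend to sum
})

# Silent failure: .sum() aggregates every numeric column, including
# `discount`. No error is raised and the output looks plausible.
careless = df.groupby("category").sum()

# The intended computation names its column explicitly:
intended = df.groupby("category")["amount"].sum()
```

If downstream code then grabs the wrong column from `careless`, nothing in the execution trace flags it.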
Silent semantic errors are never caught by the loop. The code runs, produces a number, and the number is wrong. These are the failures that matter most and are hardest to address architecturally.
Examples that appear in practice:
- Counting all event types to compute “monthly active users” rather than filtering to session-start events only
- Using an inner join where a left join was required, silently dropping rows with no match
- Aggregating by calendar month when the business defines months as billing cycles starting on the 15th
- Running a t-test on a non-normal distribution without checking assumptions, reporting a p-value as if the test were valid
- Treating NULL as zero in a SUM rather than as missing data
None of these produce tracebacks. The agent’s context shows successful execution followed by a plausible result. The loop has no mechanism to surface them. A weak agent will report the wrong number with confidence; a better one might note that it assumed a standard calendar month but cannot verify the business definition.
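The join example from the list above illustrates how invisible these failures are in an execution trace. A sketch with toy data (pandas `merge` rather than SQL, same semantics):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [10, 20], "region": ["EU", "US"]})

# Inner join silently drops order 3 (no matching customer). The code
# runs cleanly and the result looks fine; nothing flags the lost row.
inner = orders.merge(customers, on="customer_id", how="inner")

# A left join keeps every order, and indicator=True makes unmatched
# rows visible so the discrepancy can be checked rather than assumed.
left = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]
```

The inner-join version would report revenue over two orders instead of three, with nothing in stdout to distinguish it from a correct run.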
What Reproducibility Actually Gives You
The meaningful mitigation for silent semantic errors is not more sophisticated error handling. It is the reproducibility property: the agent’s work is visible as code, and that code can be read, questioned, and re-run.
This is part of what distinguishes Simon Willison’s tools from most AI interfaces. Datasette and the llm CLI treat generated SQL as a primary artifact rather than an implementation detail. You see the query. You can read it, run it yourself, modify it, check whether the join type was correct, verify that the date filter matches the intended period.
Contrast this with a system that only returns the final answer. If an agent reports monthly active users as 47,000, you have no way to know whether it counted session_start events or all events. The number might be right or wrong, and you cannot tell.
The DABStep benchmark, which evaluates multi-step analytical reasoning on messy tabular data, found that top-performing systems generate named, typed functions with docstrings and accumulate them in a persistent tool library. The best-performing approach builds its metric definitions empirically through the session, making the definition of “monthly active users” an inspectable artifact of the analysis run rather than an implicit assumption inside a throwaway expression.
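The named-function pattern looks roughly like this. The function below is an illustrative sketch of the idea, not DABStep's actual tool library; the name, signature, and event vocabulary are assumptions:

```python
import pandas as pd

def monthly_active_users(events: pd.DataFrame, month: str) -> int:
    """Count distinct users with at least one session_start event in the
    given calendar month ("YYYY-MM").

    The filter to session-start events IS the metric definition; writing
    it as a named, documented function makes that choice inspectable
    instead of burying it in a throwaway expression.
    """
    ts = pd.to_datetime(events["timestamp"])
    in_month = ts.dt.strftime("%Y-%m") == month
    is_session = events["event_type"] == "session_start"
    return events.loc[in_month & is_session, "user_id"].nunique()

events = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],
    "event_type": ["session_start", "click", "session_start", "click"],
    "timestamp":  ["2024-03-01", "2024-03-02", "2024-03-05", "2024-03-09"],
})
mau = monthly_active_users(events, "2024-03")
```

A reviewer who disputes the number can read the docstring and the filter and argue with the definition, not just the output.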
This property is also why Jupyter-based interfaces have a structural advantage for data analysis over chat interfaces. The notebook is the record. Each cell is a unit of work. A reviewer can step through the agent’s analysis and check each transformation independently.
Designing the Data Layer to Surface More Failures
Some class-three failures can be promoted to class-two by investing in schema annotations. The BIRD benchmark for text-to-SQL consistently finds that schema quality accounts for more performance variance than model capability differences within the same tier. Agents operating against annotated schemas make fewer silent semantic errors because the schema itself encodes business logic.
Annotated DuckDB schemas using COMMENT ON COLUMN capture not just column names and types but semantics: what null means, what valid values look like, what the business term refers to. A comment on orders.status that lists all enum values and their meaning converts a silent assumption into something the agent can verify before generating code.
For Python-first workflows where SQL is not the primary layer, Polars is worth evaluating. Its lazy evaluation API plans operations before execution, catches more structural errors earlier, and provides more structured error messages than pandas. The SQLContext interface allows SQL queries against Polars LazyFrames, combining the declarative safety of SQL with the Python ecosystem.
Where the Limits Actually Lie
Execution solves the prediction problem. It does not solve the interpretation problem. A coding agent can correctly compute a time series decomposition using statsmodels, extract the trend component, and then draw the wrong conclusion about what the trend means for the business. The computation is auditable. The interpretation is still probabilistic text generation.
The most useful frame for coding agents in data analysis is not “does it get the right answer” but “can I verify whether it got the right answer.” A system that shows its code, produces reproducible results, and annotates its assumptions gives you the tools to catch silent failures yourself. That is a meaningful improvement over systems that predict answers directly, but it requires treating the generated code as the primary output and building review practices around it.
For anyone building on top of these capabilities, the investment that pays off most reliably is not sandbox sophistication or model selection. It is schema annotation depth, and the interface decisions that make generated code visible and inspectable by the people who will use the analysis.