The Execution Feedback Loop That Makes Data Analysis Agents Work
Source: simonwillison
When Simon Willison writes about coding agents for data analysis, the distinction that matters most is between a tool that generates code and one that executes it. The execution side is where the useful part happens.
Most early “AI for data” products were thin wrappers around code generation. You described a dataset, the model wrote some pandas or SQL, you copied it into a notebook and ran it yourself. That works sometimes and fails often: the model hallucinates a column name, misunderstands a dtype, or writes a groupby that produces subtly wrong output. The workflow puts the human back in the loop for execution, and execution is the step where the model most needs feedback.
Coding agents close that loop. The model writes code, the environment executes it, the result or the traceback comes back, and the model decides what to do next. This is not a minor workflow improvement. It changes what the model can accomplish, because it can observe consequences and correct course.
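The cycle is small enough to sketch. A minimal version, assuming a hypothetical `generate_code` function that wraps the model call — this is illustrative scaffolding, not Willison's implementation:

```python
import subprocess
import sys

def run_code(code: str, timeout: int = 30) -> tuple[bool, str]:
    """Execute code in a subprocess; return (success, stdout or traceback)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    ok = proc.returncode == 0
    return ok, proc.stdout if ok else proc.stderr

def agent_loop(generate_code, task: str, max_steps: int = 5) -> str:
    """Write code, execute it, feed the result or traceback back, repeat."""
    feedback = ""
    for _ in range(max_steps):
        code = generate_code(task, feedback)  # model call (assumed interface)
        ok, output = run_code(code)
        if ok:
            return output    # success: the result becomes context for the next step
        feedback = output    # failure: the traceback becomes context instead
    return feedback
```

The important property is that failure is not terminal: the traceback flows back into the next generation step as ordinary context.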
Schema Discovery Before the First Query
The first thing a useful coding agent does with a new dataset is not write a query. It reads the schema.
This might look like a DESCRIBE statement against a DuckDB table, a df.info() call, or a pass through column names and sample values of a CSV. The purpose is to ground the model’s understanding in what is present in the data, rather than what it assumes based on the filename or the user’s description.
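For the CSV case, that grounding step can be sketched with only the standard library (`describe_csv` is a hypothetical helper name, not an API from the source):

```python
import csv
from itertools import islice

def describe_csv(path: str, sample_rows: int = 3) -> str:
    """Summarize a CSV's column names and a few sample rows for the model's context."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                     # column names from the first line
        samples = list(islice(reader, sample_rows))  # a few real values, not guesses
    lines = [f"Columns ({len(header)}): {', '.join(header)}"]
    for i, row in enumerate(samples, 1):
        lines.append(f"Row {i}: {row}")
    return "\n".join(lines)
```

The returned string goes straight into the model's context before any query is generated, so the model works from actual column names and values rather than assumptions.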
Without this step, models make confident errors. Column names that sound reasonable but are wrong, aggregations against columns that turn out to be strings, date filters that assume ISO format when the data uses something else. None of these failures are mysterious once you see them, but they are expensive if the model cannot observe the error and adjust.
With schema discovery built into the agent loop, most of these errors become self-correcting. The model tries a query, sees a KeyError: 'transaction_date', looks at the actual column names, discovers the column is called txn_ts, and retries with the correct name. A human analyst would do the same thing in a notebook. The agent just does it without needing you to be there.
The Execution Sandbox Problem
Running arbitrary code in a loop requires a safe execution environment. This is the part of the architecture that receives the least attention in writeups but costs the most in production.
The options fall into a few categories. E2B provides ephemeral cloud sandboxes that spin up in under 500ms and run Python with a full standard library. Modal offers similar capabilities with more control over the execution environment. For local workflows, a subprocess with a timeout and a restricted set of allowed modules is often sufficient. OpenAI’s Code Interpreter runs in its own managed sandbox, which is why it can handle file uploads directly without external storage setup.
The sandbox needs to be stateful within a session but isolated across users and sessions. The model needs to install packages, load files, and maintain dataframe state across multiple code executions, but a runaway computation in one session should not affect others. This is not hard to implement, but it requires real infrastructure decisions: container lifetime, memory limits, filesystem layout, and network access policies.
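For the local-subprocess option, a POSIX-only sketch that enforces memory, CPU, and wall-clock limits on the child process — the limit values and the `run_sandboxed` name are illustrative, and note this restricts resources but not filesystem or network access:

```python
import resource
import subprocess
import sys

def limit_resources(mem_bytes: int = 2 * 1024**3, cpu_seconds: int = 10):
    """Apply POSIX rlimits in the child before the sandboxed code runs."""
    resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

def run_sandboxed(code: str, timeout: int = 15) -> str:
    """Run untrusted code in a subprocess with resource limits (POSIX only)."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env and site
        capture_output=True, text=True, timeout=timeout,
        preexec_fn=limit_resources,          # runs in the child, pre-exec
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr
```

This covers the runaway-computation case for a single session; isolation across users still needs containers or a managed sandbox like E2B or Modal.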
DuckDB simplifies some of this. Because it runs in-process and needs no server, you can spin up a DuckDB instance inside a sandbox, load a CSV or Parquet file into it, and query it with SQL that is both fast and predictable. The model gets a well-structured query interface without needing a running database server, and you get OLAP performance on local files without installing anything extra.
import duckdb

# In-process database: no server to run, reads the CSV file directly
conn = duckdb.connect()
conn.execute("CREATE TABLE events AS SELECT * FROM read_csv_auto('events.csv')")

# Daily event counts by type, most frequent types first within each day
result = conn.execute("""
    SELECT
        date_trunc('day', timestamp) AS day,
        event_type,
        COUNT(*) AS count
    FROM events
    GROUP BY 1, 2
    ORDER BY 1, 3 DESC
""").fetchdf()
print(result.head(20))
The model writes something like this, the sandbox executes it, and the resulting dataframe comes back as context for the next step. If the timestamp column is not a timestamp type, DuckDB will say so clearly, and the model can add a cast.
How the Loop Handles Errors
A coding agent that only succeeds on the first attempt is not more valuable than a code generator. The value comes from what happens when it fails.
When execution produces a traceback, the agent receives the full error message as part of its context. This is different from how most developers think about error handling in LLM workflows, where the goal is usually to avoid errors entirely. For coding agents, errors are information. A ValueError: could not convert string to float: '1,234' tells the agent that the column has locale-formatted numbers and needs a different parsing strategy. A MemoryError tells it the dataset is too large to load into pandas all at once and it should try chunked reading or push the operation to DuckDB.
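One way to picture errors-as-information is a mapping from common tracebacks to recovery hints. In a real agent the model itself does this reasoning from the raw traceback, so the heuristics below are purely illustrative, not an exhaustive or production mapping:

```python
def recover_strategy(traceback_text: str) -> str:
    """Map common data-analysis tracebacks to a recovery hint for the next step.
    Illustrative heuristics only; an agent derives these from context."""
    if "could not convert string to float" in traceback_text and "," in traceback_text:
        return "strip thousands separators before casting to float"
    if "MemoryError" in traceback_text:
        return "read in chunks or push the aggregation down to DuckDB"
    if "KeyError" in traceback_text:
        return "re-read the schema and retry with an existing column name"
    return "inspect the error and adjust"
```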
The quality of error messages matters here. DuckDB produces precise, useful error messages. Pandas error messages are often decent. NumPy error messages can be cryptic. When building a data analysis agent, it is worth testing how the model responds to typical errors from each library you use, because the error message is what the model works from when it decides how to recover.
Context Pressure During Long Sessions
As an agent runs more steps of analysis, the context window fills with intermediate results. This is a real constraint that most agent frameworks handle poorly.
A typical sequence might look like: load schema (200 tokens), first query result (500 tokens), second query result (800 tokens), third query result plus some intermediate computation (1200 tokens). After a dozen steps, you have used a substantial fraction of the context window on intermediate results that the model no longer needs.
The right approach is to summarize intermediate results aggressively. Instead of returning a 500-row dataframe as part of the context, return a short description: “Query returned 487 rows. First 5 rows shown below. Key observations: the ‘status’ column has 4 distinct values; ‘cancelled’ accounts for 38% of rows.” The full dataframe can be written to a file the model can reference if needed, but the context representation should be compact.
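A sketch of that compression step, using plain lists of dicts in place of dataframes to stay self-contained — the `summarize_result` name and the cardinality cutoff are illustrative choices:

```python
from collections import Counter

def summarize_result(rows: list[dict], preview: int = 5) -> str:
    """Compress a query result into a compact context representation:
    row count, a short preview, and stats for low-cardinality columns."""
    if not rows:
        return "Query returned 0 rows."
    lines = [f"Query returned {len(rows)} rows. First {min(preview, len(rows))} shown:"]
    lines += [str(r) for r in rows[:preview]]
    for col in rows[0]:
        counts = Counter(r[col] for r in rows)
        if len(counts) <= 10:  # only summarize columns with few distinct values
            top, n = counts.most_common(1)[0]
            lines.append(
                f"Column '{col}': {len(counts)} distinct values; "
                f"'{top}' accounts for {100 * n // len(rows)}% of rows."
            )
    return "\n".join(lines)
```

The full result still gets written to a file for reference; only this summary occupies context tokens.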
This is one of the places where agent architecture directly affects output quality. Models working against large contexts can lose track of early observations. Agents that compress intermediate results stay more coherent across longer analyses.
What Agents Still Get Wrong
Even well-architected coding agents make predictable mistakes on data analysis tasks.
Subtly wrong aggregations are the hardest to catch. A query that produces plausible-looking numbers but groups by the wrong grain, or that double-counts rows due to a join fanout, can pass through the entire loop without triggering any error. The agent has no mechanism to know that the result is wrong, only that it ran successfully.
Large datasets expose the limits of the sandbox approach. A 50GB Parquet file cannot be loaded into a container with 4GB of memory. Agents that have not been taught to check dataset size first, or to prefer query pushdown over in-memory loading, will fail on production-scale data in ways that are hard to recover from.
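The size check itself is a few lines. The threshold here is a hypothetical budget for a small container, and note that compressed Parquet on disk understates in-memory size, so the budget should be conservative:

```python
import os

def choose_strategy(path: str, max_bytes: int = 1 * 1024**3) -> str:
    """Check dataset size before loading; prefer query pushdown for large files.
    max_bytes is an illustrative budget, e.g. for a 4GB container with headroom.
    Compressed formats (Parquet) expand in memory, so keep the budget conservative."""
    size = os.path.getsize(path)
    if size > max_bytes:
        return "pushdown"   # e.g. query the file in place with DuckDB
    return "in_memory"      # e.g. load into pandas
```

Teaching the agent to run this check first converts an unrecoverable OOM kill into an ordinary branching decision.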
NULL handling produces a category of subtle errors that models handle inconsistently. SQL NULL semantics differ from pandas NaN semantics, which differ from Python None, and the model’s training data contains enough of each to cause occasional confusion about which context it is operating in.
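The divergence is easy to demonstrate with the standard library, using sqlite3 to stand in for SQL semantics:

```python
import math
import sqlite3

# SQL NULL: any comparison with NULL yields NULL, so NULL = NULL is not true
conn = sqlite3.connect(":memory:")
sql_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]  # None, not 1

# Float NaN (what pandas uses for missing numerics): never equal to itself
nan = float("nan")
nan_eq = nan == nan        # False

# Python None: a singleton, equal to itself
none_eq = None == None     # True
```

Three missing-value representations, three different equality behaviors; a model switching between SQL and pandas in one session has to keep track of which one applies.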
None of these are reasons to avoid coding agents for data analysis. They are reasons to treat the agent’s output the same way you would treat code written by a capable but unfamiliar colleague: read it, understand what it did, and verify results that matter.
The loop that makes coding agents useful, the execute-observe-adjust cycle, is also what makes them auditable. Every step is a code execution with a visible result. The analysis is not a black box. The model shows its work, because its work is the code it ran.