Simon Willison has been making the case for coding agents in data analysis for a while, and watching him work through real datasets with these tools clarifies something that gets lost in the broader LLM hype cycle: many of the tasks where people want AI assistance are poorly served by pure text generation, and data analysis is one of the places where that limitation can be engineered away. The agent loop, where a model writes code, runs it, sees the output, and decides what to do next, fits exploratory data work almost perfectly.
This is worth unpacking carefully, because the claim isn’t just that LLMs are useful for data analysis. It’s that the specific architecture of a coding agent, as opposed to a chatbot or a text summarizer, solves problems that have been genuinely hard to solve otherwise.
The Architecture That Makes This Work
A coding agent for data analysis is not a model that generates pandas code for you to paste into a notebook. It’s a model operating in a loop: it receives a question, writes code to answer it, executes that code in a sandboxed interpreter, receives the output (or error), and continues from there. The key word is “executes.” The model is not guessing what the output will be; it is reading actual output.
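To make the loop concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: `agent_loop`, `run_python`, and `scripted_model` are hypothetical names, the "model" is a scripted stand-in rather than a real LLM call, and a plain subprocess is used where a real system would use an isolated sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python(code, timeout=10.0):
    """Run code in a separate interpreter and return what it printed.

    NOTE: a subprocess is NOT a security sandbox; it only stands in for one.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "step.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return "ERROR: execution timed out"
    output = result.stdout
    if result.returncode != 0:
        # On failure, the traceback is what the model needs to see next.
        output += result.stderr
    return output.strip()


def agent_loop(question, ask_model, max_steps=5):
    """Alternate between model and interpreter until the model answers."""
    transcript = [("question", question)]
    for _ in range(max_steps):
        # ask_model is hypothetical: it returns ("code", src) or ("answer", text).
        kind, content = ask_model(transcript)
        if kind == "answer":
            return content, transcript
        transcript.append(("code", content))
        transcript.append(("output", run_python(content)))
    return None, transcript


def scripted_model(transcript):
    """Stand-in for an LLM: writes buggy code, reads the error, fixes it."""
    outputs = [text for kind, text in transcript if kind == "output"]
    if not outputs:
        # First attempt contains a typo (vals instead of values).
        return "code", "values = [10, 20, 60]\nprint(sum(values) / len(vals))"
    if "NameError" in outputs[-1]:
        return "code", "values = [10, 20, 60]\nprint(sum(values) / len(values))"
    return "answer", f"The average is {outputs[-1]}"


if __name__ == "__main__":
    answer, steps = agent_loop("What is the average of the values?", scripted_model)
    print(answer)
```

The scripted model's first attempt fails with a `NameError`, and the second attempt only succeeds because the loop fed the real traceback back in. That round trip through actual output is the part a paste-into-a-notebook workflow lacks.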
This matters enormously for data work. If you ask a plain LLM “what is the average transaction value in this CSV,” it will hallucinate a number with complete confidence. If you give a coding agent the same file and the same question, it writes df['transaction_value'].mean(), runs it, and returns the real answer. The computation is grounded.
ChatGPT’s Advanced Data Analysis feature (formerly Code Interpreter) was the first mainstream demonstration of this at scale. Users discovered they could upload a spreadsheet, ask a question in plain English, and get back a real answer with a real chart. The model handled the translation from intention to pandas to matplotlib, and the sandbox handled the execution. The workflow felt qualitatively different from anything a chatbot had offered before.
Claude’s equivalent is code execution within Claude.ai’s artifact system, and the open-source ecosystem has produced tools like Jupyter AI and Marimo, which embed LLM assistance directly into notebook workflows. On the infrastructure side, services like E2B and Modal provide sandboxed Python environments that developers can wire up to any model, enabling custom coding agent pipelines without depending on a specific product.
What “Exploratory” Actually Means Here
Exploratory data analysis has always been iterative. You look at a dataset, form a hypothesis, write a query or transformation, look at the output, update your hypothesis, repeat. The bottleneck was never the computation; it was the friction of writing the code at each step, especially for analysts who know what they want to know but don’t have pandas idioms memorized.
Coding agents reduce that friction substantially. The agent handles the mechanical translation from intent to syntax. You still need to evaluate whether the output makes sense, catch when the agent misunderstands column names or applies the wrong aggregation, and decide what to look at next. That part doesn’t go away. But the ratio of thinking-time to typing-time shifts in a useful direction.
Willison has noted that this is the use case where he trusts agent output most, precisely because the code is verifiable. You can read the pandas expression the agent wrote. You can re-run it. If the agent made a wrong assumption about the data structure, you can see it in the code rather than just in a confidently wrong text response.
# An agent working through a dataset might generate something like this:
import pandas as pd
df = pd.read_csv('transactions.csv', parse_dates=['date'])
# Check for nulls before aggregating
print(df['amount'].isna().sum())
monthly = df.groupby(df['date'].dt.to_period('M'))['amount'].agg(['sum', 'mean', 'count'])
print(monthly)
The null check before the aggregation is the kind of defensive step an experienced analyst would add. A well-prompted agent adds it automatically, because it has seen enough data analysis code to know that missing values can silently distort an aggregate: pandas skips them by default, so a mean may describe fewer rows than you assume. The model isn’t just translating your intent; it’s applying standard practice.
Where the Approach Still Breaks Down
The main failure mode is context. A model can only reason about data it can see, and for large datasets it cannot see the whole file. The typical workaround is to pass summary statistics, schema information, and sample rows to give the model enough context to write correct code. This works well when the dataset is structured and well-named. It works poorly when column names are cryptic abbreviations, when types don’t match what they appear to be, or when the data has domain-specific semantics the model doesn’t know.
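As a rough illustration of that workaround, the sketch below builds a compact text summary (row count, inferred column types, missing-value counts, a few sample rows) that could be placed in a prompt instead of the full file. It uses only the standard library; `describe_csv`, `infer_type`, and the sample data are hypothetical, and the type inference is deliberately crude.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "text"


def describe_csv(text, sample_rows=3):
    """Summarize CSV text as schema, missing counts, and sample rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []
    lines = [f"rows: {len(rows)}"]
    for col in columns:
        values = [(row[col] or "") for row in rows]
        missing = sum(1 for v in values if v == "")
        distinct = len({v for v in values if v != ""})
        lines.append(
            f"column {col}: type={infer_type(values)}, "
            f"missing={missing}, distinct={distinct}"
        )
    lines.append("sample:")
    for row in rows[:sample_rows]:
        lines.append("  " + ", ".join(f"{k}={v!r}" for k, v in row.items()))
    return "\n".join(lines)


# Hypothetical data with the kind of cryptic column names discussed above.
SAMPLE = "id,amt,rgn\n1,19.99,EU\n2,,US\n3,5,EU\n"

if __name__ == "__main__":
    print(describe_csv(SAMPLE))
```

Note what the summary cannot convey: nothing in it tells the model that `amt` is an amount or what currency it is in. That gap is exactly where the approach degrades.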
There’s also a category of error that’s harder to catch: statistically plausible but wrong analysis. The agent can write syntactically correct code that computes the wrong thing. A .groupby() on the wrong column, a join that creates duplicates, a filter that silently excludes a meaningful subset. These errors won’t throw exceptions; they’ll just return a number. If you don’t have domain knowledge to sanity-check that number, you might not catch it.
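A small, self-contained demonstration of the duplicate-creating join, using the standard library's `sqlite3` module with made-up tables (`orders`, `contacts`) and made-up numbers. Both queries run without error; only one of them is right.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE contacts (customer_id INTEGER, email TEXT);
    INSERT INTO orders VALUES (1, 100, 50.0), (2, 100, 25.0), (3, 200, 10.0);
    -- Customer 100 has two contact rows, so joining on customer_id fans out.
    INSERT INTO contacts VALUES
        (100, 'a@example.com'), (100, 'b@example.com'), (200, 'c@example.com');
""")

# Correct total: each order counted once.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Plausible-looking query: each of customer 100's orders is counted twice.
joined_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN contacts c ON c.customer_id = o.customer_id
""").fetchone()[0]

# A cheap sanity check: a join that should not add rows must preserve the count.
order_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM orders o
    JOIN contacts c ON c.customer_id = o.customer_id
""").fetchone()[0]
fan_out_detected = joined_rows != order_rows

print(true_total, joined_total, fan_out_detected)
conn.close()
```

The row-count comparison is the kind of check worth asking an agent to run whenever it joins tables, since the inflated total looks no less credible than the correct one.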
Security in sandboxed execution is another real constraint. Hosted services handle this by running code in isolated environments with no network access and limited filesystem scope. If you’re building your own pipeline with E2B or similar, you need to configure those boundaries explicitly. Giving an agent unrestricted code execution against a production database is not a configuration you want.
The Comparison to Traditional BI Tools
Traditional business intelligence tools like Tableau and Looker solve the “non-programmers need to query data” problem with point-and-click interfaces and pre-defined metrics. Coding agents solve a different problem: they let people who think analytically but don’t have programming fluency ask arbitrary questions of arbitrary data.
The BI tool approach requires upfront work: schema modeling, metric definitions, dashboard construction. Once it’s built, it’s fast and reliable for the questions it was designed to answer. The coding agent approach requires almost no upfront work but is less reliable for routine reporting. The two are complementary. An agent is good for the initial exploration that eventually justifies building a formal dashboard; the dashboard is good for operationalizing the insights the exploration produced.
DuckDB has made coding agents considerably more useful for data analysis by providing a fast, embedded SQL engine that agents can use on local files. Instead of loading a large CSV into pandas entirely, an agent can query it with SQL directly:
-- DuckDB running locally, no server needed
SELECT
    date_trunc('month', created_at) AS month,
    COUNT(*) AS events,
    AVG(duration_ms) AS avg_duration
FROM 'events.parquet'
WHERE user_segment = 'enterprise'
GROUP BY 1
ORDER BY 1;
DuckDB can read Parquet, CSV, and JSON directly from disk, and it handles files that would overflow pandas’ memory budget. Agents that reach for DuckDB when appropriate are considerably more capable than those limited to pure pandas.
The Deeper Shift
What Willison’s ongoing work on this topic surfaces is that the value of a coding agent for data analysis isn’t primarily about saving time. It’s about lowering the threshold for asking questions. When asking a question costs thirty seconds instead of thirty minutes, you ask more questions. You notice more things. You run down hunches that you would have previously abandoned because the cost of checking wasn’t worth it.
This is similar to how the proliferation of cheap cloud compute changed infrastructure decisions. When a query costs a millisecond instead of running overnight, the set of questions worth asking expands. When translating “show me the distribution of this field broken down by region” into code costs nothing, you look at distributions you would have skipped.
The flip side is that you can also generate a lot of charts that don’t tell you anything. The cognitive overhead shifts from writing code to evaluating output, and evaluation requires judgment the agent doesn’t supply. The tool amplifies whatever analytical instincts you bring to it; it doesn’t replace them.
For developers building data tooling, the practical implication is that integrating LLM-driven code execution into analysis workflows is worth the complexity. The E2B SDK makes it straightforward to spin up a sandboxed Python environment, pass it code from a model, and get back output. Combined with a model that’s been prompted well about data analysis conventions, the result is a substantial improvement over asking users to write code themselves or waiting for a BI team to build a dashboard.
The use case that was supposed to be a demo has turned out to be one of the more durable applications of the whole paradigm.