Simon Willison’s recent piece on coding agents for data analysis is worth reading alongside a harder question: what does code execution actually change about the reliability of these systems, and where does that change not reach?
The answer matters because the two failure modes (a model that hallucinates a number, and a model that correctly computes the wrong thing and then misinterprets the result) look identical in the output but require completely different mitigations.
The Fundamental Architecture
A coding agent for data analysis is not an LLM that knows statistics. It is an LLM embedded in a tool loop where one of the available tools is a Python runtime. The loop looks roughly like this:
user prompt
→ model generates code
→ runtime executes code
→ stdout/stderr/plots go back into context
→ model generates next step or final answer
This cycle repeats until the model decides it has enough information to answer. The key property is that the model’s claims about numbers are grounded by actual execution. When Claude or GPT tells you the mean of a column is 42.7, it arrived there by running df['column'].mean() and reading back 42.7, not by predicting what that number plausibly should be.
The distinction sounds obvious but has real consequences. A pure LLM asked “what is the correlation between these two columns?” will produce a number that fits the narrative it has constructed about the data. A coding agent will produce the number that df.corr() returns. These can differ substantially, especially when the true correlation is counterintuitive.
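The tool loop above can be sketched concretely. This is a minimal illustration, not any vendor's actual implementation: `call_model` is a stand-in for whatever LLM API you use, and the code-extraction and stop conditions are simplified to a tagged return value.

```python
import contextlib
import io


def run_code(code: str, namespace: dict) -> str:
    """Execute generated code, capturing stdout so it can go back into context."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)
    except Exception as e:
        return f"Error: {e!r}"
    return buf.getvalue()


def agent_loop(prompt, call_model, max_steps=10):
    """Minimal observe-compute-observe loop.

    call_model is a hypothetical LLM call: given the context so far, it
    returns either ("code", source) for another execution step or
    ("answer", text) when it has enough information to stop.
    """
    context = [prompt]
    namespace = {}  # persistent state across steps, like a kernel session
    for _ in range(max_steps):
        kind, content = call_model(context)
        if kind == "answer":
            return content
        # The execution result, not a prediction, grounds the next step.
        context.append(run_code(content, namespace))
    return "step limit reached"
```

The key property of the loop is visible in the sketch: the model's next step is conditioned on what the runtime actually printed, not on what the model expects it to print.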
How Different Implementations Handle the Execution Environment
The choice of execution substrate shapes what the agent can and cannot do.
OpenAI’s Advanced Data Analysis (formerly Code Interpreter) runs Python in an air-gapped container. No outbound network access, a curated set of pre-installed libraries including pandas, numpy, matplotlib, scipy, and sklearn, and a file upload mechanism for getting data in. The sandbox is convenient and safe but opaque: you cannot install arbitrary packages, and the container resets between conversations.
E2B offers sandboxed code execution as an API service, designed specifically for AI agents. Each sandbox is a lightweight Firecracker microVM with a Jupyter kernel. You can install packages with pip, upload files, and maintain state across multiple code executions within a session. The Python SDK looks like:
from e2b_code_interpreter import CodeInterpreter

with CodeInterpreter() as sandbox:
    sandbox.notebook.exec_cell("import pandas as pd")
    sandbox.notebook.exec_cell("df = pd.read_csv('/home/user/data.csv')")
    result = sandbox.notebook.exec_cell("df.describe()")
    print(result.text)
This is the approach several agent frameworks take when they need reliable, isolated code execution without managing container infrastructure themselves.
Open Interpreter runs code locally, in your own Python environment, with full system access. This makes it more capable in some respects (access to local files, installed tools, network) and considerably more dangerous. It is the right choice for local development workflows where you trust the agent and want it to interact with your actual environment. It is not the right choice for running against untrusted data or in automated pipelines.
For self-hosted setups, a persistent Jupyter kernel (driven through the jupyter_client messaging API) gives you a session where variables survive between agent steps. This matters for data analysis because loading a large dataset once and then running multiple analysis steps against it in memory is far more efficient than reloading from disk on each tool call.
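The value of the persistent session can be shown with a stdlib stand-in: a shared namespace dict playing the role of the kernel's memory. A real self-hosted setup would drive an actual Jupyter kernel, but the property that matters, variables surviving between tool calls, is the same.

```python
# Persistent state: one namespace shared across all of the agent's tool calls,
# standing in for a long-lived Jupyter kernel session.
session = {}


def exec_step(code: str) -> None:
    """One agent tool call; definitions and data persist in `session`."""
    exec(code, session)


# Step 1: "load" the dataset once (synthesized here instead of read from disk).
exec_step("rows = [{'region': 'east', 'revenue': i} for i in range(1000)]")

# Step 2: a later tool call still sees `rows` without reloading anything.
exec_step("total = sum(r['revenue'] for r in rows)")

print(session["total"])  # → 499500
```

Without the shared namespace, every tool call would start cold and the load step would have to be repeated, which is exactly the cost the persistent kernel avoids.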
What Execution Actually Unlocks for Data Analysis
Three capabilities that pure LLMs cannot reliably offer:
Operating on data that exceeds the context window. A 500MB CSV file cannot be embedded in a prompt. A coding agent loads it with pd.read_csv(), runs df.shape to understand its dimensions, and then works with summaries, samples, and aggregations. The data itself never needs to fit in context; only the outputs of computations do.
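A minimal sketch of that pattern, using a small synthetic CSV in place of the 500MB file: the agent never puts raw rows into context, only the outputs of `shape`, `describe()`, and similar summaries.

```python
import numpy as np
import pandas as pd

# Synthesize a CSV on disk to stand in for a file too large to fit in a prompt.
rng = np.random.default_rng(0)
pd.DataFrame({
    "region": rng.choice(["east", "west"], size=10_000),
    "revenue": rng.normal(100, 15, size=10_000),
}).to_csv("data.csv", index=False)

# The agent's view of the data is built from small computed outputs,
# never from the rows themselves.
df = pd.read_csv("data.csv")
print(df.shape)              # dimensions: a few bytes describing 10,000 rows
print(df["revenue"].mean())  # one number summarizing one column
print(df.describe())         # compact summary table
```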
Iterative exploration. The observe-compute-observe cycle is how data analysis actually works. You look at a distribution, decide it looks skewed, log-transform it, look again. A coding agent does this naturally: each cell execution result goes back into context and informs the next step. A pure LLM can describe this process but cannot perform it, because it has no mechanism to observe the results of a transformation before deciding what to do next.
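The observe-transform-observe cycle written out, with synthetic lognormal data standing in for a real skewed column: check skewness, log-transform, check again.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000))

# Observe: the raw distribution is heavily right-skewed.
print(s.skew())

# Decide and transform: a log transform is the standard response.
logged = np.log(s)

# Observe again: skewness is now near zero, which informs the next step.
print(logged.skew())
```

Each `print` here corresponds to a cell output that would flow back into the agent's context; the decision to log-transform depends on having seen the first number.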
Visualization as a first-class output. Coding agents can generate matplotlib or seaborn figures and either encode them as base64 for the model’s vision input or return them to the user directly. This is meaningfully different from a model describing what a chart would look like, because the chart is produced from the actual data.
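A sketch of the figure-to-vision-input path, assuming matplotlib is available: render the chart to an in-memory PNG and base64-encode it, which is the form a multimodal model's image input typically takes.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: no display needed in a sandbox
import matplotlib.pyplot as plt

# A chart produced from actual data, not a description of one.
fig, ax = plt.subplots()
ax.hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5)
ax.set_title("Distribution of values")

# Encode the rendered PNG for a vision-capable model (or return it to the user).
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
b64 = base64.b64encode(buf.getvalue()).decode("ascii")
print(b64[:20])
```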
The Reproducibility Property
One underappreciated consequence of the coding agent architecture is that the analysis is auditable. The agent’s work is the code it wrote, not just the prose conclusion it reached. If a coding agent tells you there is a significant correlation between two variables, you can look at the code that computed the p-value and ask whether the test assumptions were met.
This is the right way to evaluate these systems. Do not read only the final prose summary. Read the generated code. A good data analysis agent writes code that is clear enough to inspect: sensible variable names, one operation per cell, intermediate print statements that expose what the data looks like at each stage. An agent that produces dense, opaque one-liners to reach a conclusion is harder to trust even when the conclusion is correct.
The code artifact also makes the analysis reproducible independent of the agent. If the agent runs df.groupby('region')['revenue'].agg(['mean', 'std', 'count']), you can take that line, run it yourself, and verify the output. This is a stronger guarantee than “the model said so.”
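The verification step is as simple as it sounds: with any dataframe that has the same columns (the names here are illustrative stand-ins), the agent's line runs unchanged.

```python
import pandas as pd

# A small stand-in dataframe with the columns the agent's code expects.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "revenue": [100.0, 120.0, 90.0, 95.0, 115.0],
})

# The agent's line, rerun verbatim to check its reported numbers.
summary = df.groupby("region")["revenue"].agg(["mean", "std", "count"])
print(summary)
```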
Where Execution Does Not Help
Code execution does not fix the interpretation layer. A coding agent that correctly computes a time series decomposition can still draw wrong conclusions from the components. The model reads back the numerical outputs and generates a prose interpretation, and that interpretation is still probabilistic text generation.
This shows up most often with statistical tests. An agent might correctly run a chi-squared test but misread the degrees of freedom, or run a t-test on data that violates the normality assumption without checking. The code was correct; the analysis was not.
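That failure mode has a concrete mitigation: make the agent (or your review) run the assumption check before the test. A sketch with scipy, assuming it is installed; the fallback to a rank-based test is one reasonable policy, not the only one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.exponential(scale=2.0, size=200)  # clearly non-normal sample
b = rng.exponential(scale=2.5, size=200)

# Check the normality assumption before reaching for a t-test.
_, p_norm = stats.shapiro(a)
if p_norm < 0.05:
    # Assumption violated: fall back to a rank-based test instead.
    stat, p = stats.mannwhitneyu(a, b)
    test_used = "Mann-Whitney U"
else:
    stat, p = stats.ttest_ind(a, b)
    test_used = "t-test"

print(test_used, p)
```

An agent that emits the `shapiro` line before the test is one whose generated code you can trust more, which is exactly the kind of thing to look for when reading its output.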
Sandbox library availability is a real constraint in managed environments. If you need a specialized library like statsmodels, lifelines for survival analysis, or geopandas for geographic data, you need an execution environment where you can install it. OpenAI’s sandbox includes statsmodels but not everything; E2B and local interpreters let you install what you need.
Context window limits apply to outputs as well as inputs. If a dataframe has 10,000 rows and you print it, the agent sees a truncated representation. Agents need to work with .head(), .describe(), .value_counts(), and other summary methods rather than trying to reason over full data dumps. This is standard data analysis practice, but it means the agent’s view of the data is always partial.
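The truncation is easy to see directly: printing a large dataframe yields an elided representation, so the agent's code has to ask for summaries instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10_000)})

# Printing the frame shows head and tail rows with the middle elided,
# so the model never actually "sees" most of the data this way.
print(df)

# Summary methods produce small, complete views the model can reason over.
print(df["x"].describe())
print(df.head())
```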
What This Means for How You Use These Tools
The practical implication of all this is that coding agents for data analysis are most useful when you treat the generated code as the deliverable, not the prose conclusion. Set up your execution environment so you can run the code yourself. Review it for statistical correctness, not just syntactic validity. Ask the agent to add intermediate print() statements so you can trace its reasoning through the data.
The execution loop is a genuine architectural improvement over pure text generation for this class of task. It grounds quantitative claims in computation rather than prediction. That matters. But it does not move the problem of interpretation out of the probabilistic domain, and that is where careful evaluation still needs to happen.