
Code, Execute, Observe: What Coding Agents Actually Do With Your Data

Source: simonwillison

Simon Willison’s recent writeup on coding agents for data analysis lands at an interesting moment. These tools have matured enough that people are using them for real work, not just demos, and the gap between the hype and the practical experience is worth examining carefully.

The core idea is straightforward: instead of asking a language model to answer a question about data directly, you give it a code execution environment and let it write Python or SQL, run that code, read the output, and iterate. The model never needs to hold your entire dataset in context because the code does the heavy lifting. This changes the failure modes considerably, and it changes what you can reasonably expect from these tools.

The Loop That Makes It Work

Most coding agents for data analysis implement some variation of the ReAct pattern, which interleaves reasoning steps with tool calls. In practice, for data work, it looks like this:

  1. The model receives a question and some description of the data (schema, sample rows, column names)
  2. It reasons about what computation would answer the question
  3. It writes code and submits it to an execution tool
  4. It receives stdout, stderr, and any generated artifacts back
  5. It interprets the output and either answers the question or loops back to step 2

The execution environment is typically a sandboxed Python process. E2B has become a popular choice for this because it spins up isolated microVMs quickly and handles the security surface of arbitrary code execution. OpenAI’s Advanced Data Analysis (formerly Code Interpreter) uses its own sandboxing. When you’re building something custom, you can use RestrictedPython for lightweight isolation, though it has sharp edges.
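The minimal home-grown version of that execution tool is just a fresh interpreter process with a timeout. The sketch below, a hypothetical `execute_in_subprocess` helper, isolates state but not security; it stands in for the container or microVM boundary that E2B or similar provides in production:

```python
import os
import subprocess
import sys
import tempfile

def execute_in_subprocess(code: str, timeout: int = 30) -> dict:
    # Write the model-generated code to a temp file and run it in a
    # fresh interpreter. This isolates state, NOT security: arbitrary
    # code still needs a container or microVM boundary around it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr,
                "exit_code": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "execution timed out", "exit_code": -1}
    finally:
        os.unlink(path)
```

The returned dict maps directly onto step 4 of the loop: stdout, stderr, and an exit code go back to the model as the observation.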

A typical agent turn looks something like this from the model’s perspective:

Thought: The user wants to know which product categories had the highest return rate last quarter.
I should load the data and compute returns grouped by category.

Code:
import pandas as pd

df = pd.read_csv('orders.csv')
returns = df[df['status'] == 'returned'].copy()
return_rate = (
    returns.groupby('category').size() /
    df.groupby('category').size()
).sort_values(ascending=False)
print(return_rate.head(10))

Observation:
category
Electronics      0.142
Apparel          0.118
Home & Garden    0.067
...

Thought: I have the return rates by category. Electronics at 14.2% is highest.
Answer: Electronics had the highest return rate last quarter at 14.2%...

The model never needed to read the CSV contents into its context window. It just wrote code that reads the file, and the output is a small summary it can reason about directly.

Where This Beats Direct Q&A

The most obvious win is scalability. A 50MB CSV is far too large to fit in an LLM's context window, but a coding agent can work with it fine because the data lives on disk and only aggregated results come back into context. The same applies to databases: the agent writes SQL, the database executes it, and the model sees only a result set.
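The database-backed version of this can be sketched in a few lines. A hypothetical `run_agent_sql` helper executes model-generated SQL and caps the rows returned, so only a small result set ever re-enters the model's context:

```python
import sqlite3

def run_agent_sql(conn: sqlite3.Connection, sql: str, max_rows: int = 50):
    # The database does the heavy lifting; the model only ever sees a
    # capped result set, never the underlying table.
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    rows = cur.fetchmany(max_rows)
    return cols, rows
```

The `max_rows` cap is the important part: it bounds how much of any query result, however large the table behind it, can flow back into context.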

Less obvious but arguably more important: the analysis is auditable. When a model writes df.groupby('category')['revenue'].sum(), you can read that code and verify it computes what the question asked. When a model answers directly from its weights, you cannot check its work in the same way. For any serious data analysis, the code artifact is valuable beyond the answer itself.

Simon has explored this through his LLM CLI tool and Datasette, which provide lightweight ways to query databases with natural language. The philosophy there is similar: the SQL generated by the model is a visible, inspectable artifact, not a black box.

Tools like pandas-ai and Jupyter AI have brought this capability into familiar data science workflows. Jupyter AI in particular fits naturally because the notebook is already a tool for iterative, observable computation. The model generates cells; you can see what it wrote before it runs.

The Failure Modes Are Different, Not Absent

Coding agents do not hallucinate data values in the same way a direct-answer LLM does, because the values come from actual computation. But they introduce their own failure modes that take some getting used to.

Schema mismatch: The model assumes column names and types based on the schema description it was given. If that description is incomplete or stale, the generated code fails with a KeyError or TypeError. Good agents recover from these errors by reading the traceback and correcting the code. Weaker ones spin in circles or give up.

Silent semantic errors: The code runs without errors and produces a number, but the computation is wrong. A model asked for “monthly active users” might count distinct user IDs without properly deduplicating across event types. The code succeeds; the answer is wrong. This is harder to catch than a visible error.
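A toy illustration of this class of bug, using a hypothetical event log where the same user appears under several event types — both computations run cleanly, and only one is right:

```python
import pandas as pd

# Hypothetical event log: the same user shows up under multiple event types.
events = pd.DataFrame({
    "user_id":    [1, 1, 2, 3, 3, 3],
    "event_type": ["click", "purchase", "click", "click", "view", "purchase"],
})

# Wrong: counts distinct users *per event type*, then sums the counts.
# A user active under two event types is counted twice.
wrong_mau = events.groupby("event_type")["user_id"].nunique().sum()

# Right: distinct users across the whole log.
right_mau = events["user_id"].nunique()

print(wrong_mau, right_mau)  # 6 vs 3
```

No traceback, no warning — only a reader who knows what "active users" should mean catches the factor-of-two inflation.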

Incorrect assumptions about data shape: A model might assume a timestamp column is already parsed as datetime and write df['timestamp'].dt.month, which fails if the column is a string. Or it assumes a left join where an inner join was needed, silently dropping rows. These are the kinds of bugs that would also slip through in human-written analysis code, which is both reassuring and alarming.
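The defensive habit, for humans and agents alike, is to parse before touching datetime accessors. A minimal pandas example:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["2024-01-15", "2024-02-20"]})

# df["timestamp"].dt.month would raise here: the column holds strings.
# Parsing first is cheap, and pd.to_datetime passes datetime input through.
df["timestamp"] = pd.to_datetime(df["timestamp"])
months = df["timestamp"].dt.month.tolist()
```

A well-prompted agent writes this defensively the first time; a weaker one writes the `.dt` access, hits the `AttributeError`, and (hopefully) recovers on the next turn.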

Context window exhaustion on complex schemas: If you have a database with 200 tables and thousands of columns, passing the full schema to the model is impractical. Systems that handle this well do selective schema retrieval, pulling in only the tables likely to be relevant using embedding similarity or keyword matching.
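The keyword-matching variant of that retrieval is simple enough to sketch. `retrieve_relevant_tables` below is a hypothetical helper that scores each table's schema text by token overlap with the question; real systems usually layer embedding similarity on top of something like this:

```python
def retrieve_relevant_tables(question: str, schemas: dict, k: int = 5) -> list:
    # schemas maps table name -> schema text. Score each table by how
    # many question tokens appear in its name or schema, keep the top k.
    q_tokens = set(question.lower().split())
    scored = []
    for name, schema in schemas.items():
        s_tokens = set((name + " " + schema).lower().replace(",", " ").split())
        scored.append((len(q_tokens & s_tokens), name))
    scored.sort(reverse=True)
    return [name for score, name in scored[:k] if score > 0]
```

Only the surviving tables' schemas go into the prompt, which keeps a 200-table database from exhausting the context before the model writes any code.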

Architecturally, What Matters

For anyone building one of these systems rather than just using an existing tool, a few decisions have outsized impact:

How you represent the data to the model is the single biggest factor in output quality. Raw DESCRIBE TABLE output is mediocre. Annotated schemas with example values, foreign key relationships spelled out, and domain-specific notes about gotchas (“this column is nullable for historical rows pre-2023”) produce meaningfully better code.
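The mechanical parts of that richer representation can be automated. A sketch — dtypes and example values come from the data itself, while the hypothetical `notes` argument carries the hand-written domain gotchas no automation supplies:

```python
import pandas as pd

def annotated_schema(df: pd.DataFrame, notes=None) -> str:
    # Dtypes plus a few example values per column beat raw
    # DESCRIBE-style output; notes holds per-column domain caveats.
    notes = notes or {}
    lines = []
    for col in df.columns:
        examples = list(df[col].dropna().unique()[:3])
        line = f"- {col} ({df[col].dtype}), e.g. {examples}"
        if col in notes:
            line += f"  # {notes[col]}"
        lines.append(line)
    return "\n".join(lines)
```

The output is what goes into the system prompt or the user message as the schema description in step 1 of the loop.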

Error recovery strategy separates usable agents from frustrating ones. The simplest approach is to pass the full traceback back to the model on failure and ask it to fix the code. More sophisticated systems categorize errors: a missing column might trigger a schema lookup, while an import error might suggest the execution environment is missing a package.
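A categorized dispatcher might look like the sketch below; the categories and hint strings are illustrative, not taken from any particular system:

```python
def recovery_hint(traceback_text: str) -> str:
    # Map a traceback to a targeted hint for the model's next turn,
    # rather than always replaying the raw traceback.
    if "KeyError" in traceback_text:
        return ("A referenced column does not exist. "
                "Re-check the schema before retrying.")
    if "ModuleNotFoundError" in traceback_text or "ImportError" in traceback_text:
        return ("The sandbox is missing that package. "
                "Use only the preinstalled libraries.")
    return f"The code failed. Fix it based on this traceback:\n{traceback_text}"
```

The fallback branch is the simple strategy the text describes; the specialized branches are where a schema lookup or an environment check would hook in.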

Artifact handling matters more than it seems. Data analysis often produces charts, not just numbers. The agent needs a way to generate a plot, persist it, and reference it in the response. Systems that handle this cleanly feel much more capable than ones that only return text.
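A minimal version of chart persistence, assuming a hypothetical `ARTIFACT_DIR` the host application serves back to the user:

```python
import os
import uuid

import matplotlib
matplotlib.use("Agg")  # headless backend: no display inside a sandbox
import matplotlib.pyplot as plt

ARTIFACT_DIR = os.path.join(os.getcwd(), "agent_artifacts")  # hypothetical store

def save_chart(fig) -> str:
    # Persist the figure and return a path the agent can cite in its
    # final answer instead of trying to describe the chart in text.
    os.makedirs(ARTIFACT_DIR, exist_ok=True)
    path = os.path.join(ARTIFACT_DIR, f"chart-{uuid.uuid4().hex[:8]}.png")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return path
```

The returned path goes into the tool result, so the model can reference the artifact ("see the chart at …") without ever seeing the image bytes.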

A minimal working loop in Python looks roughly like this:

def run_analysis_agent(question: str, schema: str, max_turns: int = 8):
    messages = [
        {"role": "user", "content": f"Schema:\n{schema}\n\nQuestion: {question}"}
    ]

    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,  # system prompt is a top-level param, not a message
            tools=[CODE_EXECUTION_TOOL],
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            return extract_answer(response)

        messages.append({"role": "assistant", "content": response.content})
        tool_result = execute_code(response)  # runs in sandbox; returns tool_use_id + content
        messages.append({
            "role": "user",
            "content": [{"type": "tool_result", **tool_result}],
        })

    return "Max turns reached"

The max_turns guard is not optional. Without it, a confused agent will loop indefinitely.

Where Things Stand

For exploratory analysis on well-structured data with clear questions, coding agents are genuinely useful today. They are good at aggregations, filtering, joins, and basic visualizations. They are weaker on analysis that requires domain knowledge to formulate correctly, multi-step pipelines with many intermediate state dependencies, and anything where the “right” answer requires understanding business context that was not in the prompt.

The practical advice from people actually using these tools converges on a few points: give the model high-quality schema documentation rather than raw metadata, verify results that will influence decisions rather than trusting the first output, and treat the generated code as a first draft worth inspecting. The code agent pattern is not a replacement for data literacy; it is a tool that amplifies it. Someone who understands pandas and SQL will get far more out of these systems than someone who does not, because they can read the code and catch the subtle errors before they matter.
