Schema, Sandbox, and Loop: The Architecture Behind Coding Agents for Data Analysis
Source: simonwillison
Simon Willison’s article on coding agents for data analysis arrives at a moment when code execution has become standard practice for serious LLM applications. The core observation, that an agent which can run code will systematically outperform one that merely generates it, seems obvious in retrospect. The reasons why are worth unpacking carefully.
The architecture looks simple on paper: an LLM receives a question and some data context, generates Python or SQL code, that code runs in a sandbox, the output feeds back into the model’s context, and the process repeats until the agent produces a final answer. What varies enormously is everything surrounding that loop: how the agent inspects the dataset before writing code, how it recovers from errors, what runtime it uses, and how it decides when it has finished.
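That loop can be sketched in a few lines of Python. The `generate_code` callable and the retry limit below are illustrative assumptions rather than anything specified in the article, and a bare subprocess stands in for a real sandbox:

```python
import subprocess
import sys

def run_in_sandbox(code: str) -> tuple[str, str]:
    """Execute generated code in a subprocess; a stand-in for a real sandbox."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return proc.stdout, proc.stderr

def agent_loop(question: str, generate_code, max_turns: int = 5) -> str:
    """Generate, execute, observe: repeat until the code runs cleanly."""
    context = [f"Question: {question}"]
    for _ in range(max_turns):
        code = generate_code("\n".join(context))
        stdout, stderr = run_in_sandbox(code)
        context.append(f"Code:\n{code}\nOutput:\n{stdout}")
        if not stderr:
            return stdout  # clean run: the output becomes the answer
        context.append(f"Error:\n{stderr}")  # feed the traceback back
    return "Gave up after repeated failures."
```

Everything interesting happens in what gets appended to `context`: the traceback from a failed run is exactly what lets the next generation attempt do better.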
Schema Discovery Before Code Generation
A coding agent working on data it has never seen faces the same problem a human analyst does on their first day with a new dataset: you need to understand the shape of the data before you can write meaningful queries. The complication is that an LLM tends to hallucinate column names and data types when it lacks authoritative schema information.
The solution is an explicit inspection step before any analysis code is written. A straightforward Python version:
import pandas as pd

def inspect_dataset(path: str) -> str:
    """Summarize a CSV's shape, dtypes, sample rows, and null counts."""
    df = pd.read_csv(path)
    lines = [
        f"Shape: {df.shape}",
        f"\nColumns and dtypes:\n{df.dtypes.to_string()}",
        f"\nSample (5 rows):\n{df.head().to_string()}",
        f"\nNull counts:\n{df.isnull().sum().to_string()}",
    ]
    return "\n".join(lines)
This output goes into the LLM’s context before any analysis code is generated. The result is sharply fewer hallucinated column references and more accurate type handling, particularly for date columns that CSV parsers often load as plain strings.
DuckDB has become a compelling alternative to pandas for this pattern. Because DuckDB can treat CSV files as queryable tables directly, an agent can skip the loading step entirely and write SQL against the file:
DESCRIBE SELECT * FROM read_csv_auto('data.csv');
This returns type inference results immediately, and subsequent queries can run without loading the full dataset into memory. For analysis tasks where questions are naturally set-oriented, SQL often produces more reliable code from LLMs than equivalent pandas chains. SQL’s declarative structure leaves less room for off-by-one errors in filtering and grouping, and the column-reference errors that plague pandas method chains simply cannot happen when the query engine enforces schema at parse time.
Polars is worth mentioning here too. Its lazy evaluation API, where operations are planned before execution, maps well to the agent pattern because the agent can construct a multi-step query plan and Polars will optimize it before running anything. Its error messages are also more structured than pandas's, which helps when feeding tracebacks back to the LLM.
Error Recovery and the Value of the Loop
The execute-observe loop becomes most valuable when code fails. Pure code generation, where the LLM produces a script and hands it to a human to run, has no recovery path for runtime errors. An agent running in a loop can catch a traceback, add it to context, and regenerate the failing section.
This matters more for data analysis than for general programming because data is reliably surprising. A CSV column named amount might contain dollar signs in some rows, mixed locale decimal separators, or stray newlines inside quoted fields. An agent that sees the ValueError from its first parse attempt and adjusts has a fundamentally better outcome than one that produces code that fails silently downstream.
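A concrete example of that recovery, with an invented amount column: the first parse raises, and the retry applies the kind of cleanup an agent would plausibly generate after seeing the traceback:

```python
from io import StringIO

import pandas as pd

# Messy currency data, quoted so the embedded comma survives CSV parsing.
raw = StringIO('amount\n"$1,200.50"\n300\n$45\n')
df = pd.read_csv(raw)

try:
    # First attempt: fails on the dollar signs and thousands separator.
    values = df["amount"].astype(float)
except ValueError:
    # Second pass, informed by the traceback: strip currency formatting
    # before converting.
    values = df["amount"].str.replace(r"[$,]", "", regex=True).astype(float)
```

An agent without the loop would have shipped the first version and failed at runtime on the user's machine.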
The loop also enables progressive refinement without requiring the agent to anticipate everything upfront. An initial aggregation might reveal outliers that warrant a second pass. A summary statistic that looks wrong might prompt the agent to check for duplicate rows. These mid-analysis pivots reflect how experienced analysts work, and they are only possible when the agent executes code rather than just writes it.
This is also where the choice of data library matters for agent ergonomics specifically. A library with clear, structured errors gives the LLM better signal to recover from. DuckDB's error messages, for example, include the column name and the offending value in most type mismatch errors. Compare that to numpy's "ValueError: could not convert string to float", which gives no indication of which column or row caused the failure.
Sandboxing and the Cost of Safety
Running arbitrary LLM-generated code requires thinking carefully about the execution environment. The sandbox needs to be restrictive enough to prevent accidental or injected damage to the host system, but permissive enough to handle the filesystem access and library imports that real analysis requires.
E2B provides hosted sandboxes specifically designed for this use case, with an isolated environment that includes Python and common data libraries pre-installed. The tradeoff is latency and per-execution cost. For short analysis sessions on data that fits in the sandbox, this works well. Their SDK exposes a simple interface:
from e2b_code_interpreter import Sandbox

with Sandbox() as sbx:
    result = sbx.run_code(generated_code)
    output = result.logs.stdout
    errors = result.logs.stderr
For self-hosted setups, Docker containers with networking disabled and filesystem mounts limited to the data directory provide a reasonable baseline. Pyodide, which compiles CPython to WebAssembly, is worth considering for browser-based interfaces. It eliminates the server-side execution concern by running in the user's browser, though its library ecosystem remains narrower than a full CPython environment. NumPy, pandas, and matplotlib are available, but anything that wraps native extensions usually is not.
One sandboxing detail that gets missed: an LLM might generate code that legitimately takes a long time on large datasets. A hard timeout that kills the process mid-aggregation produces a truncated error that looks to the LLM like a code bug rather than a resource limit. Passing timeout errors back explicitly, along with a suggestion to sample the data or use chunked processing, produces much better recovery behavior than a generic TimeoutError.
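One way to surface that distinction, sketched with the standard library; the wording of the hint is an assumption about what nudges the model toward good recovery, not a prescribed message:

```python
import subprocess
import sys

def run_with_timeout(code: str, seconds: float = 30.0) -> str:
    """Run generated code, converting a hard kill into an explicit,
    actionable message for the model instead of a bare TimeoutError."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=seconds,
        )
        return proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        return (
            f"Execution exceeded the {seconds}s resource limit (not a code "
            "bug). Consider sampling the data or processing it in chunks."
        )
```

The returned string goes straight into the model's context, so the difference between "your code is wrong" and "your code is slow" is preserved.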
What the LLM Contributes to the Loop
The Python runtime does the analysis. The LLM provides translation from a natural language question into executable code, plus interpretation of the output back into a human-readable answer.
The quality of that translation depends heavily on context. An LLM given a vague question about an unfamiliar dataset will produce vague or incorrect code. The same model given the schema, a few sample rows, and a specific question will often produce correct first-pass code. The engineering work in building a good data analysis agent is largely about building that context up before code generation, and about deciding when to ask the user for clarification rather than proceeding on a shaky assumption.
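A minimal sketch of that context assembly; the prompt layout and the clarification instruction are assumptions about what such a preamble might contain, not a recipe from the article:

```python
def build_context(schema: str, sample_rows: str, question: str) -> str:
    """Assemble the schema, a few sample rows, and the user's question
    into the context the model sees before writing any code."""
    return "\n\n".join([
        "You are writing Python to analyze a dataset.",
        f"Schema:\n{schema}",
        f"Sample rows:\n{sample_rows}",
        f"Question: {question}",
        "If the question is ambiguous, ask for clarification "
        "instead of guessing.",
    ])
```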
Willison’s work, grounded in years of building tools like Datasette and sqlite-utils, reflects this practical orientation. His tools have always emphasized making data accessible through well-defined, auditable interfaces. The coding agent pattern extends that principle: the LLM becomes a user of the same data access tools, subject to the same constraints.
Transparency as a Design Requirement
The agents that work well in practice tend to be the ones that surface their work: the code they generated, the output they received, and the errors they recovered from. That transparency serves more than debugging purposes. It is what makes it reasonable for a user to trust the final answer.
A coding agent that returns a number without showing its derivation is no more trustworthy than a spreadsheet cell with no visible formula. When the agent surfaces intermediate steps, including failed attempts, the user can verify the logic and catch cases where the agent confidently produced the wrong result. This is a design choice with real engineering implications. The output format needs to carry not just the answer but an audit trail.
Tools like Marimo, which treats notebooks as reactive Python programs where every cell participates in a dependency graph, point toward one direction for this. An agent producing Marimo-compatible output would let users inspect and re-run the generated analysis interactively rather than treating it as opaque output. That kind of verifiability is what distinguishes a useful data analysis tool from a plausible-sounding one, and it is what separates coding agents that earn user trust from those that eventually erode it.