Simon Willison’s recent post on coding agents for data analysis arrives at a moment when the pattern has moved past novelty. The tools exist, people use them for real work, and the gap between what these systems promise and what they deliver in practice has become clear enough to discuss honestly.
The core loop is well-established: describe a dataset, ask a question, and a language model generates Python or SQL, executes it in a sandboxed environment, reads the output, and returns an answer. For simple aggregations, comparisons, and visualizations against well-documented data, this produces useful results quickly. The question worth examining is when the pattern works, and for whom.
What the Demos Don’t Show
The canonical demo for a data analysis agent involves uploading a clean CSV, asking a question with an obvious answer, and showing the chart that comes back. The demo works because the person running it already knows the correct answer. They chose the dataset carefully, phrased the question clearly, and verified the result before recording.
What the demo does not show is what happens when the schema description omits important context, when the column name the user mentions does not precisely match the column name in the data, or when the correct answer depends on a business logic distinction that was never written down.
Consider a model asked for “monthly active users.” It might generate:
monthly_active = df[df['timestamp'] >= month_start]['user_id'].nunique()
This code runs without error. If the DataFrame contains both session events and account creation events, and the question intended to count only users with actual sessions, the number is inflated in a way that looks plausible. There is no error, no traceback, and no signal that anything went wrong; the output is simply a number that seems reasonable. This failure mode is harder to catch than a KeyError, and it is the one that matters most in practice.
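The inflation is easy to reproduce. A minimal sketch, using an invented events table with a hypothetical event_type column, shows how the naive count and the intended count diverge with no error in sight:

```python
import pandas as pd

# Hypothetical events table mixing session and account-creation events.
df = pd.DataFrame({
    'user_id':    [1, 1, 2, 3, 3],
    'event_type': ['session', 'account_created', 'session',
                   'account_created', 'session'],
    'timestamp':  pd.to_datetime([
        '2024-06-03', '2024-06-03', '2024-06-10',
        '2024-06-12', '2024-05-28',
    ]),
})
month_start = pd.Timestamp('2024-06-01')

# Naive count: any event type in the window counts the user as "active".
naive = df[df['timestamp'] >= month_start]['user_id'].nunique()

# Intended count: only users with actual session events.
sessions = df[(df['event_type'] == 'session') &
              (df['timestamp'] >= month_start)]
actual = sessions['user_id'].nunique()

print(naive, actual)  # 3 2
```

User 3 only created an account in June; the naive query counts them as active anyway, and nothing in the output hints that the two definitions differ.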
The Tools That Actually Hold Up
The tools that produce reliable analytical results are consistently the ones that make human review easy, not the ones that minimize human involvement.
Jupyter AI is the clearest example. The model generates notebook cells; you see the code before it runs. You can read the pandas expression, verify the join condition, check that the groupby aggregates over the right columns, and run it yourself. The loop is: model proposes, human reviews, execution follows. The agent is fast at code generation; the human is essential for catching semantic errors before they produce answers that look right but are not.
Simon Willison’s llm CLI and datasette take the same approach at smaller scale. When you query a database with natural language through these tools, the generated SQL is visible. You can inspect it, copy it, modify it, and run it directly. The answer arrives with its derivation intact.
This transparency is a core feature of these tools, and it is what makes them useful for serious work. When the generated code is visible and editable, a correct query becomes a reusable artifact. You can save it, version it, and build on it the next time a related question comes up. When only the answer is returned, you have to re-derive the analysis from scratch each time, with no guarantee of consistency.
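The artifact idea is concrete: treat the generated SQL as a named, versionable string rather than a hidden intermediate. A minimal sketch, using a toy in-memory SQLite table (the table and query here are invented, not output from any specific tool):

```python
import sqlite3

# Toy database standing in for one you might query through a natural-language tool.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 'EU', 120.0), (2, 'US', 80.0), (3, 'EU', 50.0)])

# The SQL an agent might generate, kept as a visible, versionable artifact.
REVENUE_BY_REGION = """
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
"""

# Because the query is an artifact, you can inspect it, modify it,
# run it yourself, and commit it alongside the analysis it produced.
rows = conn.execute(REVENUE_BY_REGION).fetchall()
print(rows)  # [('EU', 170.0), ('US', 80.0)]
```

The next related question starts from this string rather than from a blank prompt, which is what makes the derivation consistent across analyses.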
What This Means at the Architecture Level
The DABStep benchmark, which evaluates agents on multi-step analytical reasoning over messy tabular data, illustrates what happens when the “show your work” principle is applied at the agent architecture level rather than just the UI level.
NVIDIA’s NeMo Agent Toolkit Data Explorer took first place on DABStep using an approach that generates named, typed Python functions with docstrings and registers them in a persistent tool library. Rather than producing throwaway code for each analytical step, the agent builds a growing library of verified functions it can retrieve and reuse. A function that correctly extracts quarterly revenue figures, once written and verified, becomes available for every subsequent query that needs quarterly revenue. The ideas trace back to the LATM paper from 2023, which demonstrated that language models can create and accumulate tools for themselves, and to the Voyager Minecraft agent, which built a persistent skill library through exploration.
The problem this solves is error accumulation. In a standard execute-observe-iterate agent, each step derives its logic fresh. If a date-parsing approach at step 2 disagrees subtly with the date-parsing approach at step 5, a join between the intermediate results silently drops rows. A verified, reusable function for date parsing eliminates the inconsistency by definition. Code that correctly handles a class of computation should not be re-derived every time that computation is needed.
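A tool registry of this kind is simple to sketch. The structure below is illustrative only, not NVIDIA's actual API: a decorator stores named, documented functions in a persistent dictionary, and later analytical steps call the verified function instead of re-deriving its logic:

```python
import pandas as pd
from typing import Callable

# Minimal sketch of a persistent tool library; names are invented.
TOOL_REGISTRY: dict[str, Callable] = {}

def register(fn: Callable) -> Callable:
    """Store a verified function so later queries can retrieve it by name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn

@register
def parse_event_date(df: pd.DataFrame, column: str = 'date') -> pd.DataFrame:
    """Parse a date column once, consistently, for every downstream step."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column])
    return out

@register
def quarterly_revenue(df: pd.DataFrame) -> pd.Series:
    """Sum revenue by calendar quarter, assuming 'date' and 'revenue' columns."""
    df = parse_event_date(df)  # reuse, never re-derive, the date logic
    return df.groupby(df['date'].dt.to_period('Q'))['revenue'].sum()

df = pd.DataFrame({'date': ['2024-01-15', '2024-02-01', '2024-04-10'],
                   'revenue': [100.0, 50.0, 75.0]})
print(quarterly_revenue(df))
```

Because every step that needs a parsed date goes through the same registered function, the step-2-versus-step-5 disagreement described above cannot occur.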
The BI Tool Comparison
Traditional BI tools such as Tableau, Looker, and Metabase encode business logic into metric definitions. A data engineer specifies what “monthly active users” means, with all the deduplication logic, event-type filtering, and date-range conventions, and that definition is reused across every dashboard that references the metric. Analysts querying those dashboards do not need to reason about schema details; the correct logic has been encoded upfront.
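What such a metric definition looks like can be sketched abstractly. The field names and expansion logic below are invented for illustration, not the schema of any particular BI tool, but they capture the idea: the business logic lives in one declarative definition, and every dashboard query is generated from it:

```python
# Illustrative metric definition of the kind a BI semantic layer encodes
# upfront; field names here are invented, not any tool's actual schema.
MONTHLY_ACTIVE_USERS = {
    'name': 'monthly_active_users',
    'source': 'events',
    'filter': "event_type = 'session'",   # the distinction that never
    'aggregation': 'COUNT(DISTINCT user_id)',  # gets written down otherwise
    'time_grain': 'month',
    'timezone': 'UTC',
}

def to_sql(metric: dict) -> str:
    """Expand a metric definition into the SQL every dashboard reuses."""
    return (
        f"SELECT DATE_TRUNC('{metric['time_grain']}', timestamp) AS period, "
        f"{metric['aggregation']} AS {metric['name']} "
        f"FROM {metric['source']} "
        f"WHERE {metric['filter']} "
        f"GROUP BY period"
    )

print(to_sql(MONTHLY_ACTIVE_USERS))
```

Every analyst who references the metric gets the same deduplication and event-type filtering, which is precisely what a per-question coding agent does not guarantee.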
This is valuable. It is also expensive to build and fragile when business logic evolves faster than the data model does. Coding agents for data analysis sit at the opposite end of this spectrum: they generate bespoke code for each question, without predefined metrics, and can therefore handle questions the BI team never anticipated. The cost is that every question re-derives the logic from scratch, and the correctness of that derivation depends on how well the agent understood the question and the schema.
The NeMo tool registry approach is a way to gradually accumulate verified definitions through agent use rather than upfront data modeling. Each verified function is a de facto metric definition. Over time, the tool library converges toward something like a lightweight, agent-generated semantic layer, built empirically rather than designed upfront.
Practical Implications for Developers
If you are building a system on top of a coding agent for data analysis, a few choices follow from this framing.
Surface the code alongside the result. Users who can read SQL or Python can catch semantic errors before they influence decisions. Hiding the generated code removes the primary mechanism by which wrong-but-plausible answers get caught.
Invest in schema documentation before tuning the model. The quality ceiling for a coding agent on data tasks is largely set by what the model receives before it writes any code. A raw DESCRIBE TABLE dump gives the model column names and types. An annotated schema that explains that user_id in the events table corresponds to id in the accounts table, that revenue_net excludes refunds while revenue_gross does not, and that timestamp is stored in UTC produces measurably better generated code. Relative to the cost of a wrong-but-plausible answer, this documentation costs almost nothing to write.
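The contrast is easy to make concrete. A minimal sketch, with invented table and column names, of the difference between what a raw dump provides and what an annotated schema provides as model context:

```python
# Raw schema dump: column names and types only.
RAW_SCHEMA = """\
events(user_id INTEGER, event_type TEXT, timestamp TEXT)
accounts(id INTEGER, created_at TEXT, plan TEXT)"""

# Annotated schema: the context the model otherwise has to guess.
ANNOTATED_SCHEMA = """\
events(
  user_id INTEGER,   -- joins to accounts.id, not to an accounts.user_id
  event_type TEXT,   -- 'session' or 'account_created'; MAU counts sessions only
  timestamp TEXT     -- ISO 8601, stored in UTC
)
accounts(
  id INTEGER,        -- primary key; events.user_id references this
  created_at TEXT,   -- UTC
  plan TEXT          -- 'free', 'pro', or 'enterprise'
)"""

def build_prompt(question: str, schema: str) -> str:
    """Assemble the context the model sees before it writes any code."""
    return f"Schema:\n{schema}\n\nQuestion: {question}\nWrite SQL."

prompt = build_prompt("How many monthly active users in June?", ANNOTATED_SCHEMA)
print(prompt)
```

With the annotated version, the join key and the session-only definition of "active" are in the prompt rather than left to the model's guess.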
Store verified queries. When a generated query runs correctly and the result checks out against known data, save it. The next related question can reuse or adapt it rather than starting from scratch. This is the same principle behind the NeMo tool registry, applied at whatever scale your use case requires.
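Even a flat file is enough to start. A minimal sketch of a verified-query store; the file layout, function names, and example query are all invented:

```python
import json
import pathlib
import tempfile

# Illustrative store: one JSON file mapping names to verified queries.
store = pathlib.Path(tempfile.mkdtemp()) / 'verified_queries.json'

def save_query(name: str, sql: str, note: str) -> None:
    """Persist a query only after a human has checked its result."""
    queries = json.loads(store.read_text()) if store.exists() else {}
    queries[name] = {'sql': sql, 'note': note}
    store.write_text(json.dumps(queries, indent=2))

def load_query(name: str) -> str:
    """Retrieve a previously verified query instead of re-deriving it."""
    return json.loads(store.read_text())[name]['sql']

save_query(
    'revenue_by_region',
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region",
    'Result checked against the finance dashboard before saving.',
)
print(load_query('revenue_by_region'))
```

The note field matters as much as the SQL: recording how a query was verified is what turns a saved string into a trustworthy definition.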
Treat data literacy as a prerequisite, not an obstacle. These tools amplify the capabilities of people who can already read the generated code. Someone who understands a pandas expression can spot a wrong join; someone who cannot will accept whatever number comes back. No interface design compensates for that gap: a reviewer who cannot read the code cannot catch its errors.
Where Willison’s Approach Fits
Willison’s consistent philosophy across his tools, from sqlite-utils to the llm CLI to datasette, is that the artifact you produce when querying data should be visible, inspectable, and portable. The generated SQL is the primary output, not merely a mechanism for producing a number.
This is more than a UI preference; it is a claim about what makes data analysis trustworthy. Coding agents slot naturally into that model as fast first-draft generators. The human analyst is essential for reviewing, catching semantic errors, and deciding whether the result makes sense given domain knowledge that never made it into the prompt. That combination is more reliable than the agent alone and faster than the analyst alone.
The tools that make this easy to do are more valuable than the tools that try to eliminate the human from the loop. The pattern that consistently works is augmented analysis, not autonomous analysis, and the difference shows up most clearly not in demos but in the analyses that actually get used to make decisions.