Agents That Build Tools: What a DABStep Win Reveals About Data Analysis Architecture

Source: Hugging Face

The DABStep leaderboard has become a meaningful test for data analysis agents. Unlike earlier coding benchmarks that measured whether a model could write a single correct pandas expression, DABStep requires agents to reason through multi-step analytical workflows over realistic, messy tabular data and produce exact answers. Getting there involves loading data, discovering schemas, joining tables, handling edge cases, and not compounding errors across each intermediate step. Most agent frameworks fall apart somewhere in that chain.

NVIDIA’s NeMo Agent Toolkit Data Explorer hit first place on DABStep, and the writeup on Hugging Face explains the core idea: rather than generating ephemeral code for each sub-task, the agent generates named, reusable Python functions and registers them in a persistent tool registry. This is the piece worth understanding in depth, because it represents a design pattern with broader implications for how we build analytical agents.

What DABStep Actually Tests

DABStep evaluates agents on business-style datasets with questions that require 2 to 10 analytical steps to answer correctly. The evaluation is strict: answers must match exactly, and intermediate reasoning is scored at the step level rather than just at the final output. This penalizes agents that stumble onto a correct answer through incorrect reasoning, and it surfaces the real failure mode of most LLM agents on data tasks: error accumulation.

When an agent misidentifies a column name at step 2, everything downstream is wrong. When it re-derives a date-parsing strategy at step 5 differently from how it parsed dates at step 2, the join breaks. These failures are not primarily model intelligence problems; they stem from architectural choices. The agent has no memory of what it already worked out, so it re-derives it, and that re-derivation carries independent probability of subtle variation.

Strong models like GPT-4o with ReAct-style agents achieve around 30-45% on DABStep’s harder splits. The gap between “powerful model” and “correct analytical agent” is where the interesting engineering happens.

The Reusable Tool Generation Pattern

The core idea in the NeMo Data Explorer is not entirely new to this submission, but applying it seriously to tabular data analysis is the contribution. Instead of writing a code block, executing it, and discarding the implementation, the agent writes a named Python function with typed parameters and a docstring, registers that function in a tool registry, and calls it by name for the rest of the session.

The structure of a generated tool looks something like this:

import pandas as pd

def filter_transactions_by_date(
    df: pd.DataFrame,
    start_date: str,
    end_date: str,
) -> pd.DataFrame:
    """
    Filter a transactions DataFrame to rows within [start_date, end_date].
    Expects a 'transaction_date' column parseable by pd.to_datetime.
    """
    dates = pd.to_datetime(df["transaction_date"])  # parse once, reuse below
    mask = (dates >= pd.to_datetime(start_date)) & (dates <= pd.to_datetime(end_date))
    return df[mask]

Once registered, the planner LLM can call filter_transactions_by_date(df, "2024-01-01", "2024-03-31") directly on subsequent steps without regenerating the date-parsing logic. The tool description is indexed semantically, so if the next question involves a different date column on a different table, the planner can recognize that the same logic applies with parameter substitution rather than generating fresh code.
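The register-then-retrieve mechanics can be sketched in a few lines. This is an illustrative toy, not the NeMo Agent Toolkit API: the class and method names are invented, and a crude keyword match over docstrings stands in for the real semantic index over tool descriptions. Plain Python dicts replace pandas to keep the sketch dependency-free.

```python
from typing import Callable

class ToolRegistry:
    """Minimal in-session tool registry sketch (illustrative, not the
    NeMo Agent Toolkit API)."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable] = {}

    def register(self, fn: Callable) -> None:
        """Store a generated function under its own name."""
        self._tools[fn.__name__] = fn

    def call(self, name: str, *args, **kwargs):
        """Invoke a registered tool by name on later steps."""
        return self._tools[name](*args, **kwargs)

    def search(self, query: str) -> list[str]:
        """Keyword match over docstrings, standing in for semantic
        indexing of tool descriptions."""
        words = query.lower().split()
        return [
            name for name, fn in self._tools.items()
            if any(w in (fn.__doc__ or "").lower() for w in words)
        ]


registry = ToolRegistry()

def filter_rows_by_date(rows, start, end, key="date"):
    """Filter rows to those whose date string falls within [start, end]."""
    return [r for r in rows if start <= r[key] <= end]

registry.register(filter_rows_by_date)
```

After registration, the planner never regenerates the filtering logic; it calls `registry.call("filter_rows_by_date", ...)` with new arguments, and `registry.search(...)` lets it discover that an existing tool already covers a new sub-task.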

This sounds simple, but it changes the error dynamics considerably. A validated function called repeatedly is more reliable than the same logic regenerated three times from scratch, because every fresh generation is a fresh chance to introduce a subtle inconsistency.

The Lineage of This Idea

This pattern has been building for a while. The clearest prior art is LATM (Large Language Models as Tool Makers) from 2023, which separated the roles of “tool maker” (a capable LLM that writes functions) and “tool user” (a cheaper LLM that calls them). LATM demonstrated that tools created once and cached could allow a weaker model to solve problems that would otherwise require repeated expensive generation from a stronger model. The key insight was that tool creation is expensive but amortizable.

Voyager, the Minecraft agent from Wang et al. (2023), applied a similar idea to game-playing: the agent generates JavaScript “skill” functions and stores them in a skill library that grows over the agent’s lifetime. Later tasks compose existing skills rather than re-deriving everything from scratch. Voyager demonstrated something important: the agent’s capability grows monotonically with tool accumulation, because each new skill is built on validated prior skills.

The NeMo Data Explorer combines both ideas within a single session, tuned specifically for pandas-style tabular operations. The planner and the tool generator are the same LLM (Llama 3.1 70B or 405B via NVIDIA NIM endpoints), which removes the multi-model coordination overhead that LATM required.

Why Data Analysis Is a Good Fit

Not every agentic task benefits from tool reuse. For a web search agent, most queries are unique enough that generated tools would rarely be reused within a session. Data analysis tasks have high logical repetition by nature.

A realistic multi-step analysis over a sales dataset might require date normalization five times, group-by aggregation four times, and currency conversion twice. Without a tool registry, the agent generates each of those operations independently. With a registry, the first generation of normalize_date_column handles all five cases, and each subsequent call is a function invocation with different arguments.
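The one-generation, many-call pattern can be made concrete with a hypothetical `normalize_date_column` helper (a name invented for this sketch; plain Python rather than pandas, to keep it dependency-free):

```python
from datetime import date, datetime

def normalize_date_column(rows, column):
    """Parse an ISO-format string column into datetime.date objects.
    Hypothetical tool: generated and validated once, then invoked for
    every date column the analysis touches."""
    for row in rows:
        row[column] = datetime.strptime(row[column], "%Y-%m-%d").date()
    return rows

orders = [{"order_date": "2024-01-05", "ship_date": "2024-01-09"}]

# First call: the agent generates and validates the function.
normalize_date_column(orders, "order_date")
# Every later call is pure reuse -- same validated logic, new argument.
normalize_date_column(orders, "ship_date")
```

The five date normalizations in the scenario above become one generation plus four function invocations, each guaranteed to parse dates identically.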

There is also a schema-memory benefit. Loading a dataset and discovering its columns is expensive in tokens and carries potential for misidentification. The NeMo Data Explorer’s persistent execution environment means that df = pd.read_csv("sales_data.csv") happens once, and the resulting DataFrame with its known schema is available for all subsequent tool calls. This is a form of working memory that ReAct-style agents lack: each new code block in a ReAct loop starts from scratch unless the agent explicitly re-loads state.
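At its core, a stateful execution environment is just a matter of running every generated code block in one shared namespace. The toy below (not the NeMo sandbox, which would add isolation and error handling) shows why a DataFrame loaded at step 1 is still in scope at step 5:

```python
class StatefulExecutor:
    """Run each generated code block in one persistent namespace, so
    state created at an early step survives to later steps. A sketch,
    not the NeMo execution environment."""

    def __init__(self) -> None:
        self.ns: dict = {}  # shared namespace across all steps

    def run(self, code: str) -> None:
        exec(code, self.ns)

ex = StatefulExecutor()
ex.run("data = [1, 2, 3]")      # step 1: load the dataset once
ex.run("total = sum(data)")     # a later step: 'data' is still in scope
```

A ReAct-style agent, by contrast, would have to re-emit the loading code inside every block that needs `data`.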

Comparison to ReAct and Code-as-Action

ReAct (Yao et al., 2022) is the baseline for most agent evaluations. The loop is straightforward: the model produces a thought, then an action (usually a tool call or code execution), then receives an observation, and repeats. ReAct is simple to implement and works well for short-horizon tasks, but it has no mechanism for accumulating reusable knowledge within or across sessions.
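For contrast, the ReAct loop fits in a few lines. In this schematic version (the `llm` and `tools` arguments are placeholder stand-ins for a model call and a tool dispatcher), note that nothing the agent derives survives a step except as text appended to the transcript:

```python
def react_loop(llm, tools, question, max_steps=8):
    """Schematic ReAct loop: thought -> action -> observation, repeated.
    'llm' is a stand-in for a model call returning (thought, action, args);
    'tools' maps action names to callables."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        thought, action, args = llm(transcript)   # model proposes next step
        if action == "finish":
            return args                           # args holds the final answer
        observation = tools[action](*args)        # execute and observe
        # The only "memory" is this growing text transcript.
        transcript += (
            f"Thought: {thought}\nAction: {action}{args}\n"
            f"Observation: {observation}\n"
        )
    return None
```

Any logic worked out at step 2 exists only as prose in the transcript; to apply it at step 5, the model must regenerate it.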

Code-as-action approaches, where the agent writes a full Python script per step and runs it, improve on ReAct for complex analytical operations because a full script can express multi-step logic in one generation. But the script is still ephemeral: the next question begins fresh, and any logic that was useful in the prior script must be re-derived.

The NeMo approach sits between a code-as-action framework and a proper tool-use framework. The agent writes code, but code structured as reusable functions rather than procedural scripts. The execution environment is stateful, so loaded DataFrames persist. And the tool registry is queryable by the planner, so the agent can build on its prior work.

| Approach | Code persistence | Tool reuse | Long-horizon efficiency |
| --- | --- | --- | --- |
| ReAct | No | No | Low |
| Code-as-action | No | No | Medium |
| LATM | Yes (cross-problem) | Yes | High |
| NeMo Data Explorer | Yes (in-session) | Yes | High |

The practical difference shows up most clearly on DABStep’s multi-table questions, where a naive agent must repeatedly re-derive join conditions and schema mappings, while a tool-registry agent derives them once and calls the validated function repeatedly.

What This Suggests Going Forward

The DABStep result is a data point in a broader argument: the bottleneck for analytical agents is often memory architecture rather than model capability. A Llama 3.1 70B model with a good tool registry outperforming GPT-4o with a naive ReAct loop suggests that the architectural choice matters more than raw model power at this particular task category.

The tool registry also opens up something LATM gestured toward: cross-session learning. If tools generated during one analysis session are persisted and indexed, a subsequent session over a similar dataset can start with a pre-populated registry rather than building from scratch. This is closer to how a human data scientist works, accumulating a personal library of reusable functions over time.

The NeMo Agent Toolkit is open source, and NVIDIA’s NIM endpoints make it straightforward to swap in different backbone models. The architecture is not model-specific; it works with any LLM capable of generating well-structured Python functions with appropriate signatures.

For anyone building data analysis agents, the structural takeaway is concrete: stop generating ephemeral code blocks and start generating named, typed, documented functions. Register them. Index them. Let the planner retrieve and reuse them. The overhead of structuring generated code as a function rather than a script is minimal, and the benefit to multi-step reasoning is substantial.

DABStep is hard precisely because it rewards this kind of architectural discipline. The agents that do well on it are the ones that remember what they learned.
