From 15% to 90%: The Architecture Behind NVIDIA's DABStep Victory
Source: huggingface
The DABStep leaderboard tells a story that most benchmark announcements obscure: the gap between “LLM can write code” and “LLM can reliably analyze real data” is enormous, and closing it requires architectural choices that go well beyond model selection.
DABStep (Data Agent Benchmark for Multi-step Reasoning) was released by Adyen, the Dutch payments processor, in collaboration with Hugging Face. Its 450 tasks are derived from actual operational workloads: computing merchant fees, analyzing transaction fraud rates, cross-referencing domain manuals with structured CSV data. Eighty-four percent of the tasks are classified “hard,” meaning they require 6+ reasoning steps, 3+ data sources, and meaningful domain knowledge. When the benchmark launched, the best model achieved 14.55% accuracy on the hard set. Claude 3.7 Sonnet, o3-mini, and Gemini 2.5 Pro all clustered around 12-14%. GPT-4o managed 6%.
Those numbers reflect what happens when you take a capable language model, wrap it in a ReAct loop, and point it at real analytical work. The model can write pandas, load CSVs, and compute aggregates. But the hard tasks require recognizing which operations to compose in which order, consulting the right reference tables, and handling intermediate results across multiple files, without losing track of state or misinterpreting domain-specific terminology like Merchant Category Codes or Authorization Characteristics Indicators. Standard agent scaffolding does not reliably handle this.
By March 2026, the leaderboard looked very different. NVIDIA’s NeMo Data Explorer submission using Claude Haiku 4.5 reached 89.95% on the hard set, roughly a 6x improvement over the previous best. DataPilot from Ant Group came in at 87.57%. The interesting part is not the numbers themselves but how these systems achieved them, because neither approach relied on throwing a bigger, more expensive model at the problem.
The Three-Phase Architecture
The NeMo Data Explorer approach is built around a simple observation: the hard tasks in DABStep are parameterized variations of 23 core question templates. Different tasks ask about different merchants, different months, different fee IDs, but the underlying analytical operations are the same. A system that correctly encodes the solution to each core question type can answer all parameterized variants reliably.
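To make the observation concrete, here is a toy illustration (the task strings and regex are invented for this article, not taken from the benchmark): superficially different task texts collapse into one template with different parameter slots.

```python
# Hypothetical illustration of the parameterized-task observation: many
# distinct task strings reduce to one core question template with
# different slots. The phrasing and merchant names are made up.
import re

TEMPLATE = re.compile(
    r"What (?:were|are) the total fees for merchant (?P<merchant>\w+) in (?P<month>\w+)\?"
)

tasks = [
    "What were the total fees for merchant Crossfit_Hanna in March?",
    "What are the total fees for merchant Rafa_AI in July?",
]

for task in tasks:
    match = TEMPLATE.match(task)
    # Same template, different parameters: one correct solution per
    # template answers every variant.
    print(match.groupdict())
```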
This observation leads to a three-phase pipeline.
Phase 1: Offline learning with a heavyweight model. The system takes a representative sample of tasks alongside their ground truth answers and runs them through Claude Opus. The agent solves each task individually, then synthesizes the individual solutions into a master script, which gets refactored into a helper.py library. The library contains optimized, reusable, domain-specific functions: load_payments(), compute_fees_for_merchant(merchant, month), lookup_fee_by_id(fee_id), and so on. Alongside the library, the agent generates few-shot examples demonstrating how to call helper functions for different question types.
This is the DRY principle applied to agent reasoning. The heavyweight model’s job is to discover generalized solutions; it iterates through versions of helper.py, testing against multiple tasks, until the library correctly handles each core question type. The process looks roughly like this:
# What the heavyweight model eventually produces (in helper.py)
import pandas as pd

def compute_fees_for_merchant(merchant_id: str, month: str, df: pd.DataFrame) -> float:
    """Filters payments by merchant and month, joins with fee table, returns total fees."""
    filtered = df[(df['merchant_id'] == merchant_id) & (df['month'] == month)]
    fee_ids = filtered['fee_id'].unique()
    # FEE_TABLE is a module-level lookup built elsewhere in helper.py
    total = sum(FEE_TABLE.get(fid, 0) for fid in fee_ids)
    return total

# What the lightweight model writes at inference time
from helper import compute_fees_for_merchant, load_payments

df = load_payments()
result = compute_fees_for_merchant('MERCHANT_42', '2024-03', df)
print(result)
The bulky per-task reasoning is absorbed into the library. The inference-time code becomes short orchestration over pre-built functions.
Phase 2: Fast inference with a lightweight model. For actual benchmark inference, the system switches to Claude Haiku 4.5. The lightweight model receives only the function signatures from helper.py (not the implementations), a streamlined system prompt, and the few-shot examples. It calls the pre-built functions to answer new tasks in a single pass.
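Building a signature-only view of a library is straightforward with the standard library. The sketch below (helper function names are illustrative stand-ins, not NVIDIA's actual helper.py) shows one way to extract signatures and docstring summaries for the lightweight model's prompt without exposing implementations.

```python
# Sketch: produce a signature-only view of a helper library for the
# lightweight model's prompt. The helper functions here are illustrative
# stand-ins, not the real helper.py.
import inspect
import types

# Stand-in for the distilled helper.py.
helper = types.ModuleType("helper")
exec(
    '''
def load_payments():
    """Load the payments CSV into a DataFrame."""

def compute_fees_for_merchant(merchant_id: str, month: str, df) -> float:
    """Filter payments by merchant and month, return total fees."""
''',
    helper.__dict__,
)

def signatures_with_docs(module):
    """Format 'name(signature)  # docstring summary' for each function."""
    lines = []
    for name, fn in inspect.getmembers(module, inspect.isfunction):
        doc = inspect.getdoc(fn) or ""
        summary = doc.splitlines()[0] if doc else ""
        lines.append(f"{name}{inspect.signature(fn)}  # {summary}")
    return "\n".join(sorted(lines))

print(signatures_with_docs(helper))
```

The model sees what each function does and how to call it, while the implementations stay server-side, which also keeps the prompt short.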
The performance difference is stark: 20 seconds per task instead of the 10 minutes Claude Code Opus required, at significantly lower cost. And Haiku 4.5 with helper.py beat Opus 4.5 running directly: 89.95% versus 66.93% on hard tasks. A smaller, cheaper model with a better tool library outperformed a frontier model working from scratch.
Phase 3: Offline reflection and consistency checking. After initial inference runs, a heavyweight model audits the generated solutions. It reviews reasoning traces, checks that helper functions are being called correctly, and examines groups of semantically similar questions for conflicting logic. When it detects inconsistencies, it identifies the correct approach and injects that insight back into the system prompt for subsequent inference runs.
Reflection happens offline, not during inference. This preserves the speed advantage while still incorporating the error-detection benefits of self-critique.
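One simple form the consistency check could take (the data structures below are assumptions for illustration, not NVIDIA's implementation): group solved tasks by question template and flag any template whose variants were answered with conflicting helper calls.

```python
# Sketch of an offline consistency check: group solved tasks by question
# template and flag templates whose variants used conflicting approaches.
# The records below are invented for illustration.
from collections import defaultdict

solved = [
    {"template": "total_fees", "task": "fees for M42 in March",
     "helper_called": "compute_fees_for_merchant"},
    {"template": "total_fees", "task": "fees for M7 in July",
     "helper_called": "compute_fees_for_merchant"},
    {"template": "total_fees", "task": "fees for M9 in May",
     "helper_called": "lookup_fee_by_id"},  # divergent approach
]

by_template = defaultdict(set)
for record in solved:
    by_template[record["template"]].add(record["helper_called"])

# Templates answered with more than one approach go to the heavyweight
# model, which picks the correct one and updates the system prompt.
inconsistent = {t: fns for t, fns in by_template.items() if len(fns) > 1}
print(inconsistent)
```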
How This Compares to Other Approaches
DS-STAR, from researchers at Google Cloud and KAIST, took a different path to similar territory. Its multi-agent framework uses seven specialized agents: Analyzer, Planner, Coder, Verifier, Router, Debugger, and Retriever. The system runs a data file analysis module before planning, extracting metadata and structure from each data source. A judge agent evaluates plan sufficiency after each planning step, and a Router agent decides whether to add new steps or backtrack and fix an erroneous one.
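The plan-verify-route control flow described above can be sketched as follows. Every agent here is a canned stand-in, not the real LLM-backed agent, so only the loop structure is meaningful.

```python
# Toy sketch of DS-STAR's plan/verify/route loop as described above.
# All agent behaviors are canned stand-ins, not the real LLM agents.
def planner(plan):
    """Propose the next analysis step given the plan so far."""
    return f"step_{len(plan) + 1}"

def verifier(plan):
    """Judge agent: pretend the plan is sufficient once it has three steps."""
    return len(plan) >= 3

def router(plan):
    """Decide whether to add a new step or backtrack; here: always extend."""
    return "add_step"

def ds_star_loop(max_iters=12):
    plan = []
    for _ in range(max_iters):
        if verifier(plan):
            return plan              # judge accepts: finalize the plan
        if router(plan) == "add_step":
            plan.append(planner(plan))
        else:
            plan.pop()               # backtrack and redo the last step
    return plan

print(ds_star_loop())
```

Each pass through the loop is a separate LLM call in the real system, which is where the roughly 12.7 calls per task come from.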
DS-STAR reached 45.24% on DABStep hard tasks using Gemini 2.5 Pro, which was the best published result before the NVIDIA submission. It costs roughly $0.23 per task and makes approximately 12.7 LLM calls per task. The multi-agent coordination overhead is substantial, and it accrues on every single task.
AutoMind from Ant Group (the lab behind DataPilot, which scored second on the live leaderboard) uses an evolutionary tree search with a knowledge base of 3,000+ Kaggle competition solutions. It models the solution space as a tree and alternates between drafting new solutions, improving valid ones, and debugging failing ones. AutoMind is evaluated primarily on ML competition benchmarks rather than DABStep, but the same lab's DataPilot submission reached 87.57% on hard tasks.
All of these approaches are interesting, but the NVIDIA approach has a distinctive property: the expensive reasoning happens once, offline, during helper.py construction. Every subsequent inference call is cheap and fast. DS-STAR incurs its full multi-agent overhead on every task. The NeMo Data Explorer amortizes the hard thinking across a batch of tasks, then reuses the result indefinitely.
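The amortization argument is easy to make concrete. In the back-of-envelope calculation below, only DS-STAR's roughly $0.23-per-task figure comes from the text; the offline build cost and per-task Haiku cost are illustrative assumptions.

```python
# Back-of-envelope amortization. Only the ~$0.23/task DS-STAR estimate is
# from the article; the other two costs are illustrative assumptions.
ds_star_cost = 0.23      # per task, incurred on every task

offline_build = 50.0     # one-time heavyweight distillation (assumed)
haiku_per_task = 0.01    # cheap single-pass inference (assumed)

def amortized(n_tasks):
    """Per-task cost once the offline build is spread over n_tasks."""
    return (offline_build + haiku_per_task * n_tasks) / n_tasks

for n in (100, 1_000, 10_000):
    print(f"{n:>6} tasks: {amortized(n):.3f}/task vs {ds_star_cost}/task")
```

Under these assumptions the offline approach is more expensive for small batches but pulls far ahead as the task volume grows, since the per-task term dominates.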
What This Architecture Actually Means
The conventional framing for improving LLM agent performance is: better model, better prompts, more compute at inference time. The NeMo Data Explorer inverts part of this. The expensive compute is front-loaded into an offline distillation phase. The output of that phase is not a fine-tuned model or a long system prompt, but executable Python code that encodes domain expertise in the most durable, testable format available.
This matters for practical deployments. A helper.py library is inspectable, version-controlled, and unit-testable. If compute_fees_for_merchant is wrong, you can identify and fix it directly. The same is not true of reasoning embedded in model weights or buried in chain-of-thought traces. When the heavyweight model distills its analytical process into a function library, it produces an artifact that software engineers can review, test, and extend through normal development workflows.
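Because the artifact is plain Python, the testability claim is literal. The sketch below (fee logic and schema are invented for illustration, not the actual helper.py) shows a unit test pinning down an easy-to-miss behavior: a fee ID that appears on multiple rows is counted once.

```python
# Sketch: the distilled expertise is plain Python, so it can be unit-
# tested like any library code. Fee logic and schema are assumptions
# for illustration, not the actual helper.py.
FEE_TABLE = {"F1": 10.0, "F2": 5.0}

def compute_fees_for_merchant(merchant_id, month, rows):
    """Sum the fee for each distinct fee_id on the merchant's rows in month."""
    fee_ids = {r["fee_id"] for r in rows
               if r["merchant_id"] == merchant_id and r["month"] == month}
    return sum(FEE_TABLE.get(fid, 0.0) for fid in fee_ids)

def test_duplicate_fee_ids_counted_once():
    rows = [
        {"merchant_id": "M42", "month": "2024-03", "fee_id": "F1"},
        {"merchant_id": "M42", "month": "2024-03", "fee_id": "F1"},
        {"merchant_id": "M7",  "month": "2024-03", "fee_id": "F2"},
    ]
    assert compute_fees_for_merchant("M42", "2024-03", rows) == 10.0

test_duplicate_fee_ids_counted_once()
```

A regression like double-counting duplicate fee IDs would be caught here in milliseconds, with no model in the loop.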
The pattern also suggests a path for domain-specific production systems. An organization with a defined set of analytical workloads, whether payments processing, inventory analysis, or clinical data review, can invest heavyweight model compute once to produce a specialized helper library, then serve that library to lightweight models for ongoing use. The helper library becomes the accumulated analytical expertise of the system, expressed as code rather than as model parameters.
This is not entirely new as a concept. The idea of agents generating tools for future use was explored in Voyager, a Minecraft agent that builds a library of skills expressed as JavaScript functions, and in various tool-creation papers from 2023-2024. What the NeMo Data Explorer demonstrates is that this pattern transfers cleanly to structured data analysis, and that the performance gains are large enough to matter on a competitive benchmark with objective scoring.
The Benchmark Design Deserves Credit
DABStep’s design is what makes these results meaningful. The parameterized task structure, where each hard task is a variation of a core question template, is what allows the helper library approach to generalize so effectively. Adyen and Hugging Face designed the benchmark to require iterative multi-step reasoning, use a hidden test set to prevent overfitting, and evaluate with objective hybrid scoring that handles formatting variations without LLM-as-judge. These choices make the leaderboard trustworthy.
What the 90% score does not tell you is how well the helper library generalizes to analytical questions outside the 23 core question types. DABStep’s parameterized structure is both its strength (clean, reproducible evaluation) and its limitation (the benchmark rewards systems that can identify and generalize the core templates, which may not reflect truly open-ended analytical work). A dataset of genuinely heterogeneous tasks would test a different and perhaps more practically relevant capability.
Still, going from 14.55% to 89.95% on hard tasks in a matter of months, using a model smaller than what was tested at benchmark launch, is a meaningful signal. The constraint was not model capability. It was architecture, specifically the choice to encode domain expertise as reusable code rather than expecting a language model to reconstruct it from scratch on every query.