Amortized Reasoning: What NVIDIA's DABStep Win Reveals About When to Spend Compute
Source: huggingface
The dominant instinct when an LLM agent struggles with hard tasks is to reach for a bigger model. NVIDIA’s team took the opposite bet on the DABStep leaderboard, and the results are worth examining closely. Their entry, built on the NeMo Agent Toolkit, used Claude Haiku 4.5 at inference time and still outperformed Claude Code with Opus 4.5 on hard tasks (89.95% versus 66.93%) while running 30 times faster.
That combination (a lighter model, better accuracy on harder problems, and a fraction of the latency) is not what you would expect from the “just scale it” playbook. Understanding why it works exposes something genuinely useful about agent architecture.
What DABStep Actually Tests
Before getting into the approach, it helps to understand what makes DABStep hard. The benchmark was created by Adyen and contains 450 tasks in the financial payments domain. About 84% of them are classified as “hard.” These are not logic puzzles or trick questions. They model the kind of multi-step data analysis a payment operations analyst would actually run: given 1.4 million rows of transaction data, fee tables, merchant records, and card scheme rules, answer questions like “what would the fee delta be for merchant X if they changed their MCC code to 5411?” or “which card scheme should this merchant steer traffic to in order to minimize fees over the next quarter?”
The tasks require joining across multiple tables, applying temporal filters, reasoning about counterfactuals, and producing numerically exact answers (often to two decimal places). Exact match scoring means a rounding error is a miss. This is the kind of benchmark where writing good pandas code matters as much as understanding what question was asked.
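To make the shape of these tasks concrete, here is a minimal sketch of the kind of pandas computation involved: a join against a fee table, a temporal validity filter, and an exact two-decimal answer. All data and column names are invented for illustration; they are not Adyen's actual schema.

```python
import pandas as pd

# Hypothetical miniature of a DABStep-style task: join transactions to a
# fee table, apply a temporal validity window, and round to two decimals.
transactions = pd.DataFrame({
    "merchant": ["X", "X", "Y"],
    "scheme": ["VisaCo", "VisaCo", "CardNet"],
    "amount": [120.00, 80.00, 50.00],
    "date": pd.to_datetime(["2023-12-05", "2023-12-20", "2023-12-10"]),
})
fees = pd.DataFrame({
    "scheme": ["VisaCo", "VisaCo", "CardNet"],
    "rate": [0.015, 0.020, 0.010],
    "valid_from": pd.to_datetime(["2023-01-01", "2024-01-01", "2023-01-01"]),
    "valid_to": pd.to_datetime(["2023-12-31", "2024-12-31", "2023-12-31"]),
})

# Join on scheme, then keep only fee rows whose validity window covers
# each transaction date -- the temporal filtering the tasks demand.
merged = transactions.merge(fees, on="scheme")
merged = merged[(merged["date"] >= merged["valid_from"]) &
                (merged["date"] <= merged["valid_to"])]

x = merged[merged["merchant"] == "X"]
merchant_x_fees = round((x["amount"] * x["rate"]).sum(), 2)
print(merchant_x_fees)  # 3.0 -- exact-match scoring makes the rounding part of the task
```

Even in this toy version, picking the wrong fee row or skipping the rounding step produces a miss under exact-match scoring.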
The Reusable Library as an Artifact
NVIDIA’s system, which they call the KGMON approach, is organized around a three-phase loop.
In the first phase, a heavyweight model (Opus 4.5 or 4.6) works through a representative batch of tasks from the training set. It has access to a full tool suite: a stateful Python interpreter, a file structure detector, a semantic retriever, and bash utilities. It solves each task by writing a script, validates the result against known answers, and then, instead of discarding those scripts, synthesizes them into a shared helper library.
This is the key architectural decision. Rather than treating each task as independent, the learning phase explicitly looks for structural overlap. The insight is that computing merchant fees for a specific month and listing applicable fee IDs for a merchant share foundational operations: loading the right tables, normalizing date columns, filtering by merchant identifier. If you solve both tasks with separate scripts, you write that logic twice. If you recognize the pattern, you extract it into a function.
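A minimal sketch of that extraction, with invented names (this is illustrative, not NVIDIA's actual helper.py API): the shared loading-and-filtering foundation becomes one function that two different task types reuse.

```python
import pandas as pd

# Hypothetical helper distilled from repeated task structure. The name and
# signature are illustrative, not NVIDIA's actual library.
def load_merchant_month(df: pd.DataFrame, merchant: str, month: str) -> pd.DataFrame:
    """Shared foundation: normalize dates, filter by merchant and YYYY-MM month."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    mask = (out["merchant"] == merchant) & (out["date"].dt.strftime("%Y-%m") == month)
    return out[mask]

# Two distinct "tasks" reuse the same foundation instead of re-deriving it:
txns = pd.DataFrame({
    "merchant": ["X", "X", "Y"],
    "fee_id": [101, 102, 101],
    "fee": [1.20, 0.80, 0.50],
    "date": ["2023-12-05", "2023-12-20", "2023-11-01"],
})
subset = load_merchant_month(txns, "X", "2023-12")
total_fees = round(subset["fee"].sum(), 2)    # task 1: compute total fees
fee_ids = sorted(subset["fee_id"].unique())   # task 2: list applicable fee IDs
print(total_fees, fee_ids)
```

Once the foundation is validated against ground truth during the learning phase, every downstream task that calls it inherits that correctness for free.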
The output of the learning phase is helper.py, a domain-specific library distilled from the training data. It also produces few-shot examples that demonstrate how to use the library.
The second phase is inference: a lightweight model (Haiku 4.5) receives only the function signatures from helper.py, not the implementation bodies, along with the few-shot examples and a streamlined system prompt. Its job is to orchestrate calls to existing functions rather than reinvent the underlying logic from scratch on every query. The context window stays small, latency drops to about 20 seconds per task, and accuracy benefits from the fact that the hard parts of each computation are already validated and abstracted.
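The signature-only exposure can be sketched with Python's `inspect` module. The helper functions here are stand-ins for the distilled library; the point is that only names, signatures, and docstrings reach the inference model's context, never the bodies.

```python
import inspect

# Stand-ins for functions in a distilled helper library (names are
# illustrative, not the real helper.py API).
def compute_monthly_fees(merchant: str, month: str) -> float:
    """Return total fees for a merchant in a given YYYY-MM month."""
    ...  # implementation body is hidden from the inference model

def applicable_fee_ids(merchant: str, month: str) -> list:
    """Return the fee IDs that applied to a merchant in a given month."""
    ...

def signatures_only(*funcs) -> str:
    """Render name + signature + docstring, omitting implementation bodies."""
    lines = []
    for f in funcs:
        lines.append(f"def {f.__name__}{inspect.signature(f)}:")
        lines.append(f'    """{inspect.getdoc(f)}"""')
    return "\n".join(lines)

# This string, not the source code, goes into the lightweight model's prompt.
prompt_context = signatures_only(compute_monthly_fees, applicable_fee_ids)
print(prompt_context)
```

Keeping bodies out of the prompt is what keeps the context window small: the inference model only needs enough to decide which function to call with which arguments.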
The third phase runs offline, asynchronously. A heavyweight reflection model audits the inference agent’s code and reasoning, identifies patterns across similar questions (a technique they call group-consistency checking), and injects corrective insights into the system prompt for subsequent inference runs. This closes the loop without adding latency to individual task completion.
Why the Hard Tasks Benefit Most
The performance gap is widest on hard tasks, and this makes sense once you consider what kind of work the helper library actually captures.
Easy tasks, basic aggregations and simple filters, are within the reach of any competent code-generating model. There is not much to reuse because each task is nearly self-contained. Hard tasks are different. They require chains of domain-specific operations: mapping MCC codes to fee categories, applying temporal validity windows to fee rules, handling the edge cases in Adyen’s particular data schema. A model solving these from scratch on every query has to rediscover the same details repeatedly. A model with access to pre-validated functions that already handle these edge cases can focus on the higher-level structure of the question.
This is not a novel insight in software engineering. It is essentially the DRY principle (Don’t Repeat Yourself) applied to agent reasoning. What is novel is building the machinery to automate that abstraction process, validate the abstractions against ground truth, and compress them into a form that a smaller model can use effectively.
Contrast with Test-Time Compute Scaling
The dominant alternative approach to hard reasoning tasks right now is test-time compute scaling, spending more tokens at inference to reason through difficult problems step by step. OpenAI’s o1 and o3 series, Google’s Gemini Thinking variants, and Anthropic’s extended thinking mode all operate on this principle: more inference compute yields better answers.
That approach works well for problems where the reasoning structure itself is the hard part, mathematics, formal logic, planning. It works less well when the bottleneck is domain-specific procedural knowledge that must be reconstructed on every invocation. DABStep tasks fall into the second category. The question “what fees did merchant X incur in December 2023?” does not become answerable through longer chain-of-thought if the model has never been shown how Adyen’s fee table schema works and what temporal filtering logic is required.
The KGMON approach amortizes the cost of learning that domain knowledge. It pays once during the learning phase and then redeploys the result cheaply. The trade-off is that this approach is dataset-specific: you need labeled training examples to build the library, and the library is not transferable to a different domain without repeating the learning phase. Test-time compute scaling is more general but more expensive per query.
The Architecture of the EDA Agent
The DABStep agent is one of two agents in the NeMo Agent Toolkit described in the article. The other handles open-ended exploratory data analysis, and its architecture is worth noting separately because it solves a different problem.
For EDA, users ask questions like “show me the distribution of transaction values by merchant category,” which produce charts rather than scalar answers. A text-only model cannot evaluate or describe a matplotlib figure. The EDA agent addresses this with a Vision-Language Model in the loop: the ReAct agent generates and executes notebook code, the resulting plots are fed to the VLM, and the VLM produces a text description that the agent can then reason about and pass back to the user.
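The capture-and-describe step of that loop can be sketched as follows. The `describe_image` function is a stub standing in for a real VLM API call; the rest shows how a figure is rendered to bytes and its text description fed back into the reasoning loop.

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend: render to memory, not a window
import matplotlib.pyplot as plt

def describe_image(png_bytes: bytes) -> str:
    # Stand-in for a Vision-Language Model call; a real system would send
    # png_bytes to a multimodal model and return its description.
    return f"A bar chart ({len(png_bytes)} bytes of PNG data)."

def run_plot_step() -> bytes:
    """Agent-generated notebook code: produce a plot, capture it as PNG bytes."""
    fig, ax = plt.subplots()
    ax.bar(["groceries", "travel"], [120, 80])
    ax.set_title("Transaction value by merchant category")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()

png = run_plot_step()
caption = describe_image(png)  # text re-enters the ReAct loop as an observation
print(caption)
```

The design choice is that the figure itself never enters the text model's context; only the VLM's description does, which keeps the reasoning loop purely textual.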
This is a practical example of multimodal tool chaining, not just generating images but using a second model to reintegrate visual output into a text reasoning loop. The notebook as execution environment also provides a natural artifact: the user can open the notebook, see the full execution history, and reproduce or extend the analysis.
What Generalizes
The specific results are benchmark-specific, but the structural ideas transfer. If you have a well-defined domain, labeled examples, and repeated queries that share underlying patterns, the learning-then-inference split is worth considering. The investment is in the learning phase tooling: you need a model capable of synthesizing and generalizing across examples, a validation mechanism to catch mistakes in the generated library, and an orchestration layer that knows when to trigger reflection and library updates.
The group-consistency checking is particularly underappreciated. Evaluating a cluster of semantically similar questions together, rather than independently, exposes logical contradictions that single-question evaluation misses. If your agent answers “yes” to “does merchant X have any transactions in December?” but returns an empty list for “list all transactions for merchant X in December,” you have an inconsistency that only surfaces when you look at both answers simultaneously.
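A minimal version of that check, using the article's own example (the logic is illustrative, not NVIDIA's implementation):

```python
# Consistency rule: a yes/no answer must agree with its corresponding
# list answer when both address the same underlying query.
def consistent(has_any: bool, listed: list) -> bool:
    return has_any == (len(listed) > 0)

# The contradiction from the text: "yes, there are December transactions"
# paired with an empty list for the same merchant and month.
answers = {"has_december_txns": True, "december_txns": []}
ok = consistent(answers["has_december_txns"], answers["december_txns"])
print(ok)  # False -- the inconsistency surfaces only when both answers are compared
```

Neither answer is flagged in isolation; the contradiction exists only in the pair, which is why group-level evaluation catches errors that per-question evaluation cannot.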
NVIDIA’s full system is available through NVIDIA Launchable, and the DABStep dataset and leaderboard are hosted on Hugging Face for anyone who wants to benchmark their own approach.
The result is a reminder that benchmark performance is as much an architecture question as a model selection question. Choosing where in the pipeline to do expensive computation, and building the infrastructure to move that computation earlier, can matter more than which model sits at the end of the chain.