Separating Learning from Inference: Inside NVIDIA's DABStep-Winning Agent Architecture

Source: Hugging Face

The DABStep benchmark is not forgiving. Created by Adyen in collaboration with Hugging Face, it presents 450 tasks drawn from real payment processing data: fee calculations, merchant analysis, multi-table aggregations, and counterfactual scenarios like “what would this merchant have paid if they had changed their MCC code before 2023?” The evaluation method is exact text match. If the answer is 12.34, then 12.3 is wrong. If the required format is a comma-separated list in ascending order by amount, a list in descending order is wrong.
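To make the scoring concrete, here is a minimal sketch of what exact-match evaluation implies for answer formatting. The helper names are hypothetical, not part of the benchmark harness; the point is that rounding and ordering must be produced deliberately, because the grader compares strings, not values.

```python
def format_amount(value: float, decimals: int = 2) -> str:
    """Render a numeric answer with the exact precision the task demands."""
    return f"{value:.{decimals}f}"

def format_list(values) -> str:
    """Render a comma-separated list in ascending order by amount."""
    return ", ".join(str(v) for v in sorted(values, key=float))

def is_correct(predicted: str, gold: str) -> bool:
    """DABStep-style scoring: exact text match only."""
    return predicted == gold
```

Under this scheme, `format_amount(12.3)` yields `"12.30"`, which still fails against a gold answer of `"12.34"`; no partial credit exists anywhere in the pipeline.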

Eighty-four percent of the tasks fall into the hard category, where agents must combine a fee manual document, multiple transaction tables, and knowledge of fictional card schemes with names like GlobalCard and NexPay. The benchmark is domain-specific in a way that favors careful reasoning over general-purpose retrieval.

NVIDIA’s recent blog post describes how their KGMON system, built on the NeMo Agent Toolkit, took first place with a hard-task score of 89.95, running at 20 seconds per task using Claude Haiku 4.5. The previous best from AntGroup’s DataPilot was 87.57. The Anthropic-provided baseline (Claude Code with Opus 4.5) scored 66.93 on hard tasks but ran at 10 minutes per task. NVIDIA’s system is roughly 30 times faster, 35% more accurate on the hard tasks that make up the bulk of the benchmark, and generates code that is 63% shorter on average (1,870 characters versus 5,011).

The result is worth understanding in detail because it reflects a specific architectural principle that has broad applicability beyond this particular benchmark.

The Three-Phase Architecture

The KGMON system separates its work into three distinct phases that run at different times, with different models, doing different kinds of work.

Phase 1: Offline Learning. Before any benchmark tasks are solved, a heavyweight model (Claude Opus 4.5/4.6) works through a small set of representative tasks from the development split. It solves them, synthesizes the individual solutions into a master solution, and distills the common patterns into a reusable Python function library stored in helper.py. The agent notices that Task 1 (list fee IDs for a merchant) and Task 2 (compute transaction fee) share identical initial steps: fetch merchant metadata, locate the applicable fee schedule. Rather than solving each task in isolation, it extracts a get_merchant_fees() function that both tasks can call.

This phase runs once, offline, before evaluation starts. It produces two artifacts: helper.py and a set of few-shot examples demonstrating how to compose its functions.
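A minimal sketch of the kind of function Phase 1 might distill into helper.py. The table contents and field names are invented for illustration (the real library operates on the benchmark's tables, and its functions return pandas DataFrames); what matters is the shape: two tasks share one tested lookup instead of each re-deriving it.

```python
# Invented toy data standing in for the benchmark's merchant and fee tables.
MERCHANTS = {"M001": {"mcc": "5812", "scheme": "GlobalCard"}}
FEE_RULES = [
    {"fee_id": 17, "scheme": "GlobalCard", "mcc": "5812", "rate": 0.019},
    {"fee_id": 42, "scheme": "NexPay", "mcc": "5812", "rate": 0.021},
]

def get_merchant_fees(merchant_id: str) -> list[dict]:
    """Shared first step of many tasks: fetch merchant metadata,
    then locate the fee rules that apply to it."""
    meta = MERCHANTS[merchant_id]
    return [r for r in FEE_RULES
            if r["scheme"] == meta["scheme"] and r["mcc"] == meta["mcc"]]

def transaction_fee(merchant_id: str, amount: float) -> float:
    """A second task composes the same helper instead of re-deriving the lookup."""
    rules = get_merchant_fees(merchant_id)
    return round(amount * rules[0]["rate"], 2)
```

Task 1 ("list fee IDs") calls `get_merchant_fees` and reads off the IDs; Task 2 ("compute the fee") calls the same function and multiplies. The shared step is written, tested, and debugged once.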

Phase 2: Fast Inference. When the system encounters a new task, it uses Haiku 4.5 with only the function signatures from helper.py in context, not the underlying implementation. The model sees something like get_merchant_fees(merchant_id, date_range) -> pd.DataFrame without knowing what SQL operations or file reads happen inside. It chains these pre-built, tested functions to answer the question, then executes the result through a stateful Python interpreter.

The token budget is aggressively pruned. The model does not need to reason about fee calculation logic from scratch; it only needs to understand which tools to call in what order. This reduces both latency and error surface in one move.
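The signature-only context can be built mechanically. A sketch, assuming a Python helper module (the stub functions and docstrings here are hypothetical): extract each function's name, signature, and first docstring line, and let that text, not the implementation, be what the inference model sees.

```python
import inspect

# Stand-ins for helper.py functions; bodies are deliberately elided,
# just as they are hidden from the inference model.
def get_merchant_fees(merchant_id: str, date_range: tuple) -> list:
    """Return fee rules applicable to a merchant over a date range."""
    ...

def total_fees(merchant_id: str, year: int) -> float:
    """Sum the fees a merchant paid in a given year."""
    ...

def signature_context(*funcs) -> str:
    """Build the pruned prompt: names, signatures, and one-line docs only,
    never the implementations."""
    lines = []
    for f in funcs:
        lines.append(f"{f.__name__}{inspect.signature(f)}")
        if f.__doc__:
            lines.append(f"    # {f.__doc__.splitlines()[0]}")
    return "\n".join(lines)
```

The resulting context is a few hundred tokens of interface description rather than thousands of tokens of SQL and file-handling code, which is where the latency and error-surface reduction comes from.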

Phase 3: Offline Reflection. After inference, a heavyweight model audits batches of solutions using two techniques: reflection (checking whether generated code correctly uses helper.py conventions, follows formatting guidelines, handles edge cases) and group-consistency checking (looking for cases where the agent gave contradictory approaches to structurally similar questions). The insights from this phase update the system prompt for the next inference cycle rather than interrupting the inference loop. This runs offline and progressively improves accuracy without adding latency to individual task solving.

Why This Works: Connecting to Prior Art

The approach is not without antecedents. VOYAGER (Wang et al., 2023) built essentially the same mechanism for Minecraft agents: a skill library of reusable, executable code functions, indexed by text embeddings, that an agent retrieves and composes to solve new tasks. VOYAGER’s library-equipped agent explored 3.3 times more unique items and progressed 15.3 times faster through the tech tree than prior state-of-the-art agents. The parallel to KGMON’s helper.py is direct.

CodeAct (Wang et al., 2024) made the case more broadly: agents that express actions as executable Python code outperform those using JSON tool calls by roughly 20% on multi-step reasoning benchmarks, while requiring 30% fewer actions. The control flow available in Python, including for-loops, conditionals, and intermediate variables, makes agents substantially more capable at composing multi-step operations. This finding is corroborated by the Smolagents framework, where code agents score 55.15% on GAIA versus 33% for tool-calling agents.
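The control-flow advantage is easy to see in miniature. In a JSON tool-calling setup, summing fees over three merchants costs one model turn per `lookup_fee` invocation; a code-acting agent emits one program with a loop. A sketch, with an invented `lookup_fee` tool and toy data:

```python
# Invented toy tool and data for illustration.
FEES = {"M001": 1.9, "M002": 2.4, "M003": 0.8}

def lookup_fee(merchant_id: str) -> float:
    return FEES[merchant_id]

# Code-action style: a single emitted program performs the whole
# aggregation, where JSON tool calling would need one round-trip
# per merchant.
code_action = """
total = 0.0
for m in ["M001", "M002", "M003"]:
    total += lookup_fee(m)
result = round(total, 2)
"""
namespace = {"lookup_fee": lookup_fee}
exec(code_action, namespace)
```

The loop, the accumulator, and the intermediate variable are exactly the constructs JSON tool schemas lack, which is why the action count drops.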

The key distinction in KGMON versus applying CodeAct naively is the separation of learning time from inference time. In a standard CodeAct setup, the model reasons about domain logic on every call. In KGMON, domain logic is encoded once into helper.py and the inference model operates purely on function composition. The heavyweight model’s reasoning work is amortized across all inference calls.

This is structurally similar to how an experienced data scientist works. A junior analyst writes a fresh pandas aggregation every time they need to compute a fee. An experienced one has a library of tested utility functions, knows which one to call, and spends their time composing rather than re-deriving. The insight is that this expertise can be extracted from a capable model during an offline learning phase and made available to a cheaper, faster model at query time.

The Trade-offs

The approach has a clear dependency: the quality of helper.py determines the ceiling of inference performance. If the learning phase produces a library with bugs, missing edge cases, or functions that don’t generalize to the full task distribution, the inference phase will silently fail. NVIDIA addresses this through Phase 3’s reflection loop, but there remains a risk that the library encodes incorrect assumptions about the domain.

The easy-task results are instructive here. NVIDIA’s system scores 87.5 on easy tasks; the Anthropic baseline with Opus 4.5 scores 90.2. Haiku 4.5 operating with function signatures slightly underperforms Opus 4.5 with full context on the simpler end of the distribution, where domain expertise matters less and raw reasoning ability matters more. The trade-off is optimized for hard tasks, which is the right call for a benchmark where hard tasks constitute 84% of the evaluation, but it illustrates that the approach is not universally dominant at every difficulty level.

There is also the question of domain specificity. The learning phase is effective here because DABStep tasks share substantial structure: the same fee tables, the same merchant data schema, the same card scheme rules appear across hundreds of tasks. In a benchmark with higher task diversity, the reusable library would shrink and the gains from pre-built tooling would diminish. The approach is most powerful when the task domain has coherent, recurring patterns at a level of abstraction above individual task statements, which makes financial data analysis, code review over a stable codebase, and customer support over a defined product catalog all natural fits.

The Leaderboard Context

For reference, the current DABStep leaderboard shows NVIDIA KGMON at 89.95 on hard tasks, AntGroup DataPilot at 87.57, Google AI DS-STAR at 45.24, and Anthropic’s Claude Code baseline at 66.93. The DS-STAR result is surprisingly low given Google AI’s resources; the leaderboard is open and results reflect submitted system configurations rather than best-effort tuning from each organization.

The 20-second per-task runtime is practically significant. A system that requires 10 minutes per question cannot be used in interactive workflows. A system that answers in 20 seconds using a model the size of Haiku 4.5 is deployable at reasonable cost. In this case, the accuracy improvement and the cost reduction are both moving in the right direction, which is not always the outcome when optimizing agent architectures.

What This Suggests

The broader principle is that offline work should be front-loaded when the task distribution has exploitable structure. An agent that invests in a learning phase once, building a tested library from representative examples, can run faster and more accurately at inference time than one that reasons from scratch on every query.

This pattern generalizes. Engineering agents that operate over a specific API surface can extract a domain library from representative usage patterns. Support agents handling a defined product catalog can encode lookup and calculation logic into reusable tools. Code review agents operating over a stable codebase can pre-build utilities for the common operations they perform repeatedly.

The NeMo Agent Toolkit provides infrastructure for this pattern: config-driven agent composition, built-in evaluation, stateful Python execution, and retrieval integration. The source article details the specific components. The architectural pattern, separating learning from inference and encoding domain expertise into a tested function library, is what makes the result meaningful beyond a single benchmark leaderboard.