
Retrieval as a Reasoning Problem: What NVIDIA's Agentic Pipeline Gets Right

Source: Hugging Face

Vector similarity search has a reliability problem that benchmarks tend to obscure. On clean, well-formed queries against a coherent corpus, a dense embedding model does well. Ask it something multi-hop, ambiguous, or domain-shifted, and the cosine distance between query and document stops being a useful signal. You retrieve the semantically proximate document, not the factually relevant one.

This is not a novel observation. The BEIR benchmark, published in 2021, explicitly designed its 18 datasets to test zero-shot generalization across domains precisely because practitioners had noticed that a model trained on MS MARCO could collapse on biomedical or legal retrieval tasks. BRIGHT, a more recent benchmark from 2024, pushes further: its queries require genuine multi-step reasoning to identify relevance, and a dense retriever’s single-shot embedding is often simply the wrong tool for the problem.

NVIDIA’s NeMo Retriever team recently published results showing a generalizable agentic retrieval pipeline that reached #1 on the ViDoRe v3 visual document retrieval leaderboard (NDCG@10: 69.22) and #2 on the BRIGHT reasoning-intensive retrieval leaderboard (NDCG@10: 50.90). The dense baseline on ViDoRe is 64.36; on BRIGHT it is 38.28. Those are not incremental gains.

What makes the result interesting is the architecture, the engineering decisions behind it, and the honest cost profile that comes with it.

The ReACT Loop Applied to Retrieval

The pipeline is built on a ReACT (Reasoning + Acting) agent pattern, the same structure used in tool-calling agents for code execution or web browsing. The agent has access to three operations: think, which plans the approach; retrieve(query, top_k), which queries the embedding index; and final_results, which emits the ranked document set. The loop runs until the agent is satisfied with the retrieved context or hits a step limit, at which point a Reciprocal Rank Fusion (RRF) merge is used as a fallback across all intermediate result sets.
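The three operations map naturally onto a dispatch loop. A minimal sketch, where `llm_step` and `retrieve` are illustrative stand-ins for the agent's LLM call and the index query (not NVIDIA's actual interfaces), and a plain deduplicated union stands in for the RRF fallback:

```python
def agentic_retrieve(question, llm_step, retrieve, max_steps=15):
    """ReACT-style loop: alternate reasoning and retrieval until the agent
    emits final_results or hits the step limit."""
    history = [("user", question)]
    intermediate = []  # every result set, kept for the fallback merge

    for _ in range(max_steps):
        action = llm_step(history)            # e.g. {"op": "think", ...}
        if action["op"] == "think":           # plan / adjust strategy
            history.append(("think", action["text"]))
        elif action["op"] == "retrieve":      # query the embedding index
            docs = retrieve(action["query"], action.get("top_k", 10))
            intermediate.append(docs)
            history.append(("observation", docs))
        elif action["op"] == "final_results":
            return action["ranking"]          # agent is satisfied

    # Step limit hit: the real pipeline falls back to an RRF merge across
    # all intermediate result sets; a deduplicated union stands in here.
    seen, merged = set(), []
    for docs in intermediate:
        for d in docs:
            if d not in seen:
                seen.add(d)
                merged.append(d)
    return merged


# Toy demo: the agent thinks once, retrieves once, then finalizes.
def _llm_step(history):
    ops = [{"op": "think", "text": "start broad"},
           {"op": "retrieve", "query": "reciprocal rank fusion"},
           {"op": "final_results", "ranking": ["doc_7"]}]
    return ops[min(len(history) - 1, 2)]

ranked = agentic_retrieve("what is RRF?", _llm_step, lambda q, k: ["doc_7", "doc_3"])
```

The step limit is the key safety valve: because every intermediate result set is retained, a run that never converges still returns a merged ranking rather than nothing.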

What emerges from this structure is not just a single query rewrite. The ablation data shows the Opus 4.5 agent averaging 9.2 retrieval calls per query on ViDoRe and 11.8 on BRIGHT. The agent is doing something closer to corpus exploration: starting with a broad query, observing what comes back, adjusting terminology, decomposing multi-part questions into sub-queries, and persisting through rephrasing until the retrieved documents actually address the question.
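With 9 to 12 result sets per query, the RRF merge named above does real work. It is only a few lines: each document scores the sum of 1/(k + rank) over every result set it appears in, so documents that rank well across many of the agent's queries rise to the top. A sketch with the conventional k = 60:

```python
def rrf_merge(result_sets, k=60):
    """Reciprocal Rank Fusion: fuse ranked lists without needing
    comparable scores, only ranks."""
    scores = {}
    for docs in result_sets:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# "b" ranks high in all three lists, so it leads the fused ranking.
merged = rrf_merge([["a", "b", "c"], ["b", "d"], ["b", "a"]])
```

The constant k damps the influence of any single list's top position, which is why RRF is robust as a fallback over heterogeneous sub-query results.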

This generalizes an idea that has been explored in academic RAG work for the past few years. FLARE (Forward-Looking Active Retrieval) from 2023 proposed retrieving only when the language model detected its own uncertainty during generation. Self-RAG introduced learnable retrieval tokens. Corrective RAG added a grading step that re-triggers retrieval when retrieved documents score below a threshold. The NeMo pipeline is a practical instantiation of this class of systems, using a frontier LLM as the orchestrator rather than a fine-tuned retrieval-specialist model.

The generalizability finding is worth dwelling on. INF-X-Retriever, the current #1 on BRIGHT with an NDCG@10 of 63.40, scores only 62.31 on ViDoRe, which is below the dense retrieval baseline of 64.36. Systems specialized for one task profile often regress on another. The NeMo agentic pipeline places in the top two on both benchmarks, with the same architecture. The ReACT loop does not need task-specific tuning because the reasoning step adapts the retrieval strategy to whatever the query demands.

The Embedding Model Still Matters, but Less

One of the ablation results cuts against a common intuition. When you replace the stronger nemotron-colembed-vl-8b-v2 embedding model with the lighter llama-nemotron-embed-vl-1b-v2, the agentic pipeline on ViDoRe drops from 66.38 to 62.42 using the gpt-oss-120b agent. The dense baseline with the stronger embedder is 64.36. So the weaker embedder with an agent is still roughly competitive with the stronger embedder used alone.

This is the key leverage point of the agentic approach. The agent compensates for embedding model weakness by issuing more queries and synthesizing across more result sets. The performance delta between strong and weak embeddings roughly halves when an agent is orchestrating the retrieval. For production deployments where you cannot afford to run an 8B-parameter embedding model, this is a meaningful result: a smaller, commercially viable embedding model paired with an agentic loop can recover much of the quality gap.

The Engineering Behind the Loop

The paper describes a detail that most ML publications skip: the infrastructure required to make the evaluation loop practical. The initial implementation used an MCP (Model Context Protocol) server architecture, with the retrieval system running as a separate process and the agent communicating over network serialization. This introduced latency overhead, complex lifecycle management, and configuration failures under concurrent load.

The team replaced this with an in-process thread-safe singleton: the embedding model and corpus are loaded once, protected by a reentrant lock, and the retrieve() function is called directly from the agent thread. The interface to the agent did not change; the underlying implementation moved from inter-process to in-process. This eliminated network overhead and deployment instability, and made GPU utilization predictable.
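The pattern itself is standard. A minimal sketch, with a toy keyword-overlap corpus standing in for the embedding model and index (all names here are illustrative, not the NeMo API):

```python
import threading

class Retriever:
    """Process-wide singleton: the model and corpus load once; all agent
    threads call retrieve() directly, with no network hop in between."""
    _instance = None
    _init_lock = threading.Lock()

    def __new__(cls):
        with cls._init_lock:  # guard one-time initialization
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._lock = threading.RLock()  # reentrant, as described
                # Stand-in for loading the embedding model + corpus once.
                inst._corpus = {"d1": "reciprocal rank fusion explained",
                                "d2": "dense retrieval with embeddings"}
                cls._instance = inst
        return cls._instance

    def retrieve(self, query, top_k=10):
        with self._lock:  # serialize access to shared model state / GPU
            terms = set(query.lower().split())
            ranked = sorted(self._corpus,
                            key=lambda d: -len(terms & set(self._corpus[d].split())))
            return ranked[:top_k]
```

Because `__new__` returns the same object on every call, concurrent agent threads share one loaded index, and the reentrant lock keeps access serialized without deadlocking a thread that re-enters `retrieve()` from its own call stack.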

This is a systems detail that matters in practice. ReACT agents can issue dozens of tool calls per query. If each tool call pays a network round-trip penalty, the latency compounds quickly. Moving retrieval in-process is the kind of optimization that lets you actually measure what the agentic loop costs on its own, without infrastructure noise masking the signal.

The Cost Profile

The numbers here are honest and worth stating plainly. Dense retrieval on ViDoRe takes 0.67 seconds per query. The agentic pipeline with Opus 4.5 takes 136.3 seconds. That is a 200x latency increase for roughly a 7.5% gain in NDCG@10 (69.22 vs. 64.36).

The token consumption is also substantial: approximately 1.8 billion input tokens and 15 million output tokens across the full ViDoRe v3 evaluation set, which averages to around 760,000 input tokens and 6,300 output tokens per query. At current API pricing for frontier models, this is not a retrieval system you run on every user query in a consumer product.
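The per-query averages are consistent with the reported totals, which together imply an evaluation set of roughly 2,400 queries (a derived figure, not stated in the source):

```python
# Reported totals over the full ViDoRe v3 evaluation (Opus 4.5 agent).
total_input_tokens = 1_800_000_000
total_output_tokens = 15_000_000

# Reported per-query averages.
per_query_input = 760_000
per_query_output = 6_300

# Both ratios land near the same query count, so the figures cohere.
n_queries_from_input = total_input_tokens / per_query_input     # ~2368
n_queries_from_output = total_output_tokens / per_query_output  # ~2381
```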

Using gpt-oss-120b instead of Opus 4.5 drops latency to 78.6 seconds per query and reduces the NDCG@10 from 69.22 to 66.38 on ViDoRe. That is still nearly 120x slower than dense retrieval, but the performance is meaningfully better than the dense baseline, and the cost profile is substantially more tractable if you are running the model on your own hardware.

The BRIGHT results with Opus 4.5 are starker: 50.90 vs. a dense baseline of 38.28. A 33% relative improvement in NDCG@10 on a reasoning-intensive benchmark is significant. For domains where retrieval quality directly affects decision quality, and where queries are genuinely complex, this is a trade-off worth making.

Where This Sits in the RAG Landscape

The broader lesson is about how responsibility is distributed in a retrieval system. Standard RAG places all retrieval intelligence in the embedding model and the index structure. Agentic RAG moves intelligence into the orchestration layer, relying on an LLM to decide what to retrieve, when to retrieve it, and whether what came back is sufficient.

The NVIDIA team’s next stated direction is distilling the agentic reasoning behavior back into smaller specialized models: fine-tuning lightweight models to natively produce think and retrieve sequences without requiring a frontier model in the loop. If that works, you get the generalization benefits of the agentic approach at a latency and cost closer to standard retrieval. That is the same trajectory that instruction following and chain-of-thought prompting have followed: start with large frontier models to establish the capability, then compress it downward.

For now, the architecture is most applicable to high-stakes retrieval tasks where query complexity is high and latency tolerance is generous: enterprise document search, legal and medical information retrieval, research workflows. For high-volume, low-latency applications, the dense baseline or a simple reranker pipeline is still the right tool.

The NeMo Retriever library and retrieval benchmark harness are available on GitHub, and the BRIGHT and ViDoRe v3 leaderboards provide external reference points for comparing approaches. The benchmark positions are a starting point; the more interesting question is what the architecture looks like once the distillation work matures.
