
Closing the Gap: What NVIDIA's Agentic Retrieval Results Say About Embedding Model Selection

Source: huggingface

The Assumption Embedded in Every RAG System

Every retrieval-augmented generation pipeline makes an implicit bet: that the embedding model is the load-bearing component. Pick a better embedder, get better retrieval. This assumption drove years of research and a thriving market in embedding model leaderboards, fine-tuned domain-specific retrievers, and ever-larger vector representations. The logic seems airtight.

NVIDIA’s recent work on NeMo Retriever’s agentic retrieval pipeline puts a number on how wrong that assumption can be. On the ViDoRe v3 visual document retrieval benchmark, the performance gap between a strong embedding model (nemotron-colembed-vl-8b-v2) and a weaker one (llama-nemotron-embed-vl-1b-v2) is 8.5 NDCG@10 points in dense retrieval. Add an orchestrating agent to both, and that gap shrinks to roughly 4 points. On BRIGHT, a reasoning-intensive benchmark, the dense gap is 19 points. With an agent, it narrows to about 7.5 points. The agent closes somewhere between 40 and 60 percent of the quality difference that previously required a larger, more expensive embedding model to cross.

That is a structural shift in how you think about retrieval system design.

How We Got Here: A Brief History of Retrieval

To understand why the agentic approach produces these results, it helps to trace the evolution of retrieval systems and the assumptions each generation built on.

BM25, the classic sparse retrieval method, scores documents by combining term frequency with inverse document frequency. It is fast, interpretable, and competitive on clean keyword queries. Its weakness is vocabulary: if the query uses different words than the document, no amount of tuning helps. The query and document must share terms to match.
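The vocabulary limitation is visible directly in the scoring rule: a term contributes nothing unless it appears in both the query and the document. A minimal sketch of BM25 scoring (illustrative only, not any production implementation):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document against the query with the classic BM25 formula.
    docs is a list of token lists; returns one score per document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue  # no shared term, no contribution: BM25's vocabulary gap
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores
```

A query about "felines" scores zero against a document that only says "cats", which is exactly the gap dense retrieval was built to close.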

Dense retrieval replaced sparse counting with learned vector embeddings. Encode the query and document into the same vector space, retrieve by cosine similarity. This handles synonyms, paraphrases, and semantic proximity that BM25 cannot. The tradeoff is that all retrieval intelligence must be baked into the embedding at training time. If the query requires a reasoning step to identify relevance, the embedding either captures that relationship or it does not.
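Mechanically, dense retrieval reduces to a nearest-neighbor lookup in the shared vector space. A toy sketch, assuming the embeddings were already produced by some encoder:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dense_retrieve(query_vec, doc_vecs, top_k=2):
    """Single-pass dense retrieval: rank documents by cosine similarity
    to the query embedding. All retrieval 'intelligence' lives in the
    vectors themselves; this lookup adds none."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:top_k]
```

The lookup itself is trivial, which is the point: if the encoder did not bake the relevant relationship into the vectors at training time, nothing downstream recovers it.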

ColBERT and its visual descendant ColPali took a different approach. Instead of collapsing a document into a single vector, they produce token-level embeddings and match at the granularity of individual tokens using late interaction. This recovers precision that single-vector models lose, particularly on long documents and structured content. The visual variant NVIDIA uses here, nemotron-colembed-vl-8b-v2, encodes document pages as images rather than extracted text, handling tables, charts, and mixed layouts without depending on OCR.
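Late interaction replaces the single query-document dot product with a token-level MaxSim: each query token matches its best document token, and the per-token maxima are summed. A sketch of the scoring rule (token embeddings assumed normalized):

```python
def late_interaction_score(query_tokens, doc_tokens):
    """ColBERT-style MaxSim scoring. query_tokens and doc_tokens are
    lists of token embedding vectors. Each query token contributes the
    similarity of its single best-matching document token, so a sharp
    match on one token is not averaged away as it would be in a
    single-vector representation."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)
```

This is why multi-vector models hold up better on long documents: a single-vector embedding must compress every token into one point, while MaxSim lets each query token find its needle independently.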

Each generation improved retrieval by asking more of the embedding model: better vocabulary handling, better semantic coverage, better multi-vector representation. The agentic approach asks a different question entirely: instead of building a better retriever, can you build a smarter process for using a retriever?

The ReACT Loop as a Retrieval Operator

The NeMo agentic pipeline gives an LLM three tools: think for internal planning, retrieve(query, top_k) for querying the corpus, and final_results to emit the ranked output and exit. The agent loops until it calls final_results or hits a step limit, at which point Reciprocal Rank Fusion merges results across all retrieval attempts by scoring documents based on their ranks across every call.
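The loop and the fusion step can be sketched as follows. Here `llm_step` and `retrieve` are hypothetical stand-ins for the orchestrating LLM and the retriever tool (the actual tool-calling protocol lives in NVIDIA's retrieval-bench repo); the RRF rule itself is the standard one, score(d) = sum over calls of 1 / (k + rank of d in that call), with k conventionally set to 60:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: a document's fused score is the sum of
    1 / (k + rank) over every retrieval call that returned it."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def agentic_retrieve(llm_step, retrieve, max_steps=12):
    """Skeleton of the ReACT-style loop described above. llm_step stands
    in for the orchestrating LLM: given the transcript so far, it returns
    ("think", text), ("retrieve", query, top_k), or ("final_results",).
    These names are illustrative, not NVIDIA's API."""
    transcript, all_rankings = [], []
    for _ in range(max_steps):
        action = llm_step(transcript)
        if action[0] == "think":
            transcript.append(("think", action[1]))
        elif action[0] == "retrieve":
            _, query, top_k = action
            ranking = retrieve(query, top_k)
            all_rankings.append(ranking)
            transcript.append(("observation", ranking))
        elif action[0] == "final_results":
            break
    # Merge every retrieval attempt into one ranked output.
    return rrf_fuse(all_rankings)
```

Note how RRF rewards documents that keep reappearing across differently-phrased queries: a document ranked second twice beats a document ranked first once, which is exactly the behavior you want when the agent is triangulating on an information need from multiple angles.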

The emergent behavior from this structure is more interesting than the description suggests. The agent develops real search strategies. It generates dynamic queries that adjust based on what prior retrieval calls returned. It persists through rephrasing when early queries come up short. It decomposes multi-part questions into sub-queries targeted at specific aspects of a complex information need.

Claude Opus 4.5 averages 9.2 retrieval calls per query on ViDoRe and 11.8 on BRIGHT. That is not a query rewriter; it is a corpus explorer. The agent treats each result set as evidence that informs the next query. This is how a skilled researcher uses a search engine, and it transfers reasonably well to a language model with tool access.

The Ablation Numbers Worth Sitting With

The top-line benchmark results show what this architecture achieves at the ceiling. The ablation results show the mechanics of the gap-closing phenomenon.

ViDoRe v3 (Visual Document Retrieval) — NDCG@10

| Agent | Embedding Model | NDCG@10 | Latency (sec/query) | Avg Retrieval Calls |
| --- | --- | --- | --- | --- |
| Opus 4.5 | nemotron-colembed-vl-8b-v2 | 69.22 | 136.3 | 9.2 |
| gpt-oss-120b | nemotron-colembed-vl-8b-v2 | 66.38 | 78.6 | 2.4 |
| gpt-oss-120b | llama-nemotron-embed-vl-1b-v2 | 62.42 | 78.1 | 2.5 |
| Dense only | nemotron-colembed-vl-8b-v2 | 64.36 | 0.67 | — |
| Dense only | llama-nemotron-embed-vl-1b-v2 | 55.83 | 0.02 | — |

BRIGHT (Reasoning-Intensive Retrieval) — NDCG@10

| Agent | Embedding Model | NDCG@10 | Latency (sec/query) | Avg Retrieval Calls |
| --- | --- | --- | --- | --- |
| Opus 4.5 | llama-embed-nemotron-reasoning-3b | 50.90 | 148.2 | 11.8 |
| gpt-oss-120b | llama-embed-nemotron-reasoning-3b | 41.27 | 92.8 | 4.5 |
| gpt-oss-120b | llama-nemotron-embed-vl-1b-v2 | 33.85 | 139.1 | 6.6 |

The interesting comparison in the ViDoRe table is gpt-oss-120b with the weaker 1B embedding model: it scores 62.42, within about 2 NDCG points of the stronger 8B model's dense baseline of 64.36. Downgrading the embedder still costs roughly 4 points even with the agent compensating (66.38 versus 62.42), but the agent paired with the cheap embedder nearly matches a much stronger dense retriever while using significantly cheaper embedding infrastructure. The 8.5-point dense gap shrinks to roughly 4 points once the agent is driving retrieval.

The BRIGHT table is more striking. With Opus 4.5 as the orchestrator, the reasoning-specialized embedding model reaches 50.90, a gain of about 12.6 points over what dense retrieval alone achieves with the same embedder. Reasoning ability in the orchestrator matters far more on hard queries than embedding model selection does.

The Tradeoff Is Now About Reasoning Budget

This is the core implication for system design. In a dense retrieval system, the primary engineering variable is which embedding model to deploy. You pick based on capability-cost curves, hardware constraints, domain fit, and latency requirements.

In an agentic retrieval system, the embedding model is partially substitutable. The question shifts to: how many reasoning steps can you afford, and how capable an orchestrator can you run? These are different engineering variables with different cost structures.

A retrieval call on a local in-process retriever is cheap. The singleton design NVIDIA uses loads model weights and corpus embeddings once at startup, keeps them GPU-resident, and protects concurrent access with a reentrant lock. Each retrieve() call is essentially a matrix multiply against an already-resident index, with no network round-trip, no serialization overhead, and no server lifecycle management. The expensive component is the LLM deciding what to retrieve next.
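The singleton pattern described above can be sketched as follows. Class and method names here are illustrative, not NVIDIA's actual code; the load step stands in for moving model weights and corpus embeddings onto the GPU:

```python
import threading

class InProcessRetriever:
    """Sketch of an in-process singleton retriever: the index is loaded
    once at first construction, kept resident for the process lifetime,
    and concurrent retrieve() calls are guarded by a reentrant lock."""
    _instance = None
    _init_lock = threading.Lock()

    def __new__(cls):
        with cls._init_lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._lock = threading.RLock()
                inst._index = cls._load_index()  # expensive, happens once
                cls._instance = inst
        return cls._instance

    @staticmethod
    def _load_index():
        # Placeholder for loading weights + corpus embeddings (GPU-resident
        # in the real system); here just a toy id -> vector mapping.
        return {"doc-0": [1.0, 0.0], "doc-1": [0.0, 1.0]}

    def retrieve(self, query_vec, top_k=1):
        # RLock, not Lock: safe even if a holder re-enters retrieve().
        with self._lock:
            scored = sorted(
                self._index,
                key=lambda d: sum(q * x for q, x in zip(query_vec, self._index[d])),
                reverse=True,
            )
            return scored[:top_k]
```

Every construction after the first returns the same object, so an agent making a dozen retrieve() calls per query never pays the load cost again; each call is just the scoring loop against the already-resident index.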

If you can afford Opus 4.5 at 9.2 calls per query and 136 seconds of latency, you get 69.22 on ViDoRe. Drop to gpt-oss-120b at 2.4 calls and 78 seconds, and you get 66.38. Strip out the agent entirely and use dense retrieval, and you get 64.36 at 0.67 seconds. The quality-latency-cost frontier now has three distinct operating points, and you choose based on your application’s constraints rather than primarily on which embedding model fits in memory.

When Each Approach Makes Sense

Dense retrieval remains the right choice for high-volume, low-latency applications. Sub-second retrieval at reasonable quality covers the majority of production search use cases. If your queries are short, well-formed, and semantically close to your documents, a well-tuned dense retriever with a reranker will outperform the agentic approach on cost-adjusted quality at any practical latency budget.

A lightweight agentic loop (a capable but not frontier orchestrator running 2-4 retrieval calls) makes sense when query complexity is moderate and latency tolerance is in the range of tens of seconds. The gap-closing result means you can use a smaller embedding model and recover quality through iteration, which may be preferable to scaling up the embedding model and paying for it across every query regardless of complexity.

A frontier orchestrator with deep iteration (Opus 4.5 at 9-12 calls per query) is appropriate for offline or near-offline workflows where accuracy matters most and queries are genuinely complex: enterprise document search over heterogeneous corpora, legal or medical information retrieval, research synthesis tasks where the right answer requires multi-hop reasoning across many documents. The 136-second per-query latency on a single A100 is not a conversational search system; it is a high-precision batch retrieval system.

The Direction This Points

NVIDIA notes that their next step is distilling the agentic reasoning patterns from Opus 4.5 into smaller models, fine-tuning lightweight models to natively orchestrate think and retrieve loops without relying on a frontier LLM. If that works, the latency and cost of the agentic approach come down substantially while retaining the generalization benefits. This is the same trajectory that instruction following and chain-of-thought prompting followed: establish a capability with large models, then compress it downward.

The NeMo Retriever retrieval-bench repository contains the implementation, including the in-process singleton retriever and the RRF fusion logic.

The central lesson from this work is not that embedding models no longer matter. They still set the ceiling on what the agent can retrieve per call, and domain-specialized models carry real value on matched tasks. The lesson is that the ceiling is not as fixed as single-pass retrieval implies. A reasoning loop that calls the retriever a dozen times and adapts based on what it finds is doing something qualitatively different from a system that embeds a query and accepts the top-k result. That difference is now measurable, and it changes the set of engineering decisions available to you when designing a retrieval system.
