The Retrieval Generalization Problem That Dense Embeddings Never Solved
Source: huggingface
The Benchmark That Changed How People Think About Dense Retrieval
When Thakur et al. published BEIR in 2021, the most discussed result was not which model performed best. It was DPR scoring below BM25. DPR, then a leading dense passage retriever, achieved an average NDCG@10 of roughly 37 across BEIR’s 18 heterogeneous datasets. BM25 averaged around 43. A 1994 term-frequency algorithm outperformed a fine-tuned BERT encoder on out-of-domain retrieval.
The reason is straightforward in hindsight. DPR was trained on Natural Questions, a web-based Q&A corpus. Its embedding space encoded the semantic patterns of that dataset: web-style prose, named entities, general knowledge questions. When evaluated on TREC-COVID (biomedical literature), SciFact (scientific claims), or FiQA (financial Q&A), the distributions diverged enough that the dense representations became unreliable guides for ranking. BM25 has no such distribution. It counts terms. Terms do not have out-of-domain problems.
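To make "it counts terms" concrete, here is a minimal Okapi BM25 scorer. This is an illustrative standalone sketch, not DPR's or any production implementation — real systems use inverted indexes, and the `bm25_score` function and its tokenized-list inputs are assumptions for the example.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25.

    No learned parameters: only term frequency, document frequency,
    and document length relative to the corpus average.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)        # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = tf[term]
        # Saturating TF component, normalized by document length.
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Because nothing here is fit to a training distribution, the same scoring function behaves identically on biomedical abstracts and financial Q&A — which is exactly why BM25 transfers where DPR did not.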
This result set off a multi-year effort to build retrieval systems that generalize. The progression went roughly: instruction-tuned bi-encoders (E5, GTR), sparse learned representations (SPLADE), late-interaction models (ColBERT v2), and hybrid systems combining BM25 with dense retrieval using Reciprocal Rank Fusion. Each approach narrowed the gap. BGE-M3, a notable model from BAAI, encodes three retrieval signals simultaneously from a single encoder — dense vectors, sparse lexical weights, and ColBERT-style multi-vector representations — reaching BEIR averages around 57. NVIDIA’s own nv-embedqa-mistral-7b-v2 sits in the 59-61 range on MTEB retrieval.
All of these systems are still fundamentally single-pass. The query goes in once, a ranking comes out. The system does not reason about whether the ranking is correct, or whether a different query formulation would have retrieved something better.
What Agentic Retrieval Actually Means Here
NVIDIA’s NeMo Retriever agentic pipeline wraps retrieval in a ReAct-style agent loop. The LLM is given three tools: think (plan the search approach), retrieve(query, top_k) (call the embedding retriever), and final_results (output ranked documents). The agent iterates until it decides the results are sufficient or until it hits a step limit.
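The control flow of that loop can be sketched as follows. The three tool names come from the article; the `llm_step` and `dense_retrieve` callables and the dict-based action format are hypothetical stand-ins, not NeMo APIs.

```python
def agentic_retrieval(query, llm_step, dense_retrieve, max_steps=10):
    """ReAct-style retrieval loop with three tools: think, retrieve, final_results.

    `llm_step(query, history)` returns the LLM's next tool call as a dict.
    `dense_retrieve(query, top_k)` returns a ranked list of document ids.
    """
    history = []   # accumulated trace the LLM conditions on
    all_runs = []  # every ranked list, kept for the RRF fallback
    for _ in range(max_steps):
        action = llm_step(query, history)
        if action["tool"] == "think":
            history.append(("think", action["thought"]))
        elif action["tool"] == "retrieve":
            docs = dense_retrieve(action["query"], action.get("top_k", 10))
            all_runs.append(docs)
            history.append(("retrieve", action["query"], docs))
        elif action["tool"] == "final_results":
            return action["ranking"], all_runs
    return None, all_runs  # step limit hit: caller falls back to RRF over all_runs
```

Returning `None` with the accumulated runs makes the fallback explicit: when the agent never commits to a ranking, every intermediate retrieval call is still available for fusion.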
The patterns that emerge from this loop are not explicitly programmed. The agent decomposes complex queries into simpler sub-queries. It refines its language when initial retrieval returns weak results. It can identify when retrieved documents point toward a different framing of the question and pivot accordingly. This is roughly what a research analyst does with a search engine: the first query is rarely the final one.
When the agent hits its maximum step count or runs out of context, a fallback kicks in: Reciprocal Rank Fusion across all retrieval calls made during the session. RRF scores each document as score(d) = Σᵢ 1/(k + rankᵢ(d)) with k = 60, aggregating ranks from multiple retrieval attempts into a single score without requiring score normalization. Every retrieval call contributes to the final ranking even if the agent never explicitly chose a winner.
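The fusion step is small enough to show in full. This is a generic RRF implementation of the formula above, with the standard k = 60; the `rrf_fuse` name is mine, not NeMo's.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over runs of 1 / (k + rank(d)).

    `rankings` is a list of ranked doc-id lists; ranks are 1-based.
    Documents absent from a run simply contribute nothing for that run,
    so no per-run score normalization is needed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of several retrieval attempts accumulates score across all of them, which is why even the agent's abandoned intermediate queries still influence the fallback ranking.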
The implementation detail that matters most is how the retriever is exposed to the agent. The original design used an MCP server — a separate process with network round-trips. The current version uses an in-process thread-safe singleton: the retriever loads the model and corpus embeddings once, and all agent tool calls hit it directly via a reentrant lock. This eliminates serialization overhead and significantly improves GPU utilization, which matters when the agent averages more than nine retrieval calls per query.
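The singleton pattern described here can be sketched as follows. The class name, the toy index, and the `retrieve` signature are illustrative assumptions; only the design — load once, share in-process, guard tool calls with a reentrant lock — reflects the article.

```python
import threading

class RetrieverSingleton:
    """In-process retriever shared by all agent tool calls.

    Model and corpus embeddings load exactly once; subsequent
    constructions return the same instance, so there is no
    network round-trip or serialization per retrieval call.
    """
    _instance = None
    _init_lock = threading.Lock()

    def __new__(cls):
        with cls._init_lock:          # guard one-time initialization
            if cls._instance is None:
                inst = super().__new__(cls)
                # RLock is reentrant: a tool call that re-enters retrieve
                # on the same thread will not deadlock.
                inst._call_lock = threading.RLock()
                inst._index = cls._load_index()  # expensive, done once
                cls._instance = inst
        return cls._instance

    @staticmethod
    def _load_index():
        # Stand-in for loading the model and corpus embeddings.
        return {"doc1": [0.1, 0.2], "doc2": [0.3, 0.4]}

    def retrieve(self, query, top_k=10):
        with self._call_lock:
            # Stand-in for actual nearest-neighbor search.
            return list(self._index)[:top_k]
```

Compared with the earlier MCP-server design, every `retrieve` here is a direct method call on warm state, which is where the GPU-utilization gain at nine-plus calls per query comes from.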
What the Benchmarks Actually Reveal
The results on two benchmarks tell different stories, and the contrast is worth examining closely.
On ViDoRe v3, a benchmark for visually rich enterprise documents, NeMo’s agentic pipeline using Claude Opus 4.5 and the nemotron-colembed-vl-8b-v2 embedding model scores NDCG@10 of 69.22, ranking first. The dense baseline with the same embedding model scores 64.36. The agentic loop contributes roughly 5 NDCG points over single-pass retrieval.
On BRIGHT, which tests reasoning-intensive retrieval requiring multi-step inference from query to relevant documents, NeMo ranks second at 50.90, behind INF-X-Retriever at 63.40. INF-X-Retriever uses dataset-specific query alignment heuristics tuned for BRIGHT’s structure.
The critical comparison is INF-X-Retriever’s ViDoRe v3 score: 62.31. The non-agentic dense baseline for NeMo’s pipeline on ViDoRe v3 is 64.36. INF-X-Retriever, optimized for BRIGHT, performs below the plain dense baseline on a different benchmark. A system tuned to score well on one retrieval distribution transfers poorly. NeMo’s approach places top-2 on both benchmarks because the agent is domain-agnostic by construction: given a new retrieval task, it adapts its query strategy through reasoning rather than through pre-baked heuristics.
The Latency and Cost Reality
Running the agent loop with a frontier model averages 9.2 retrieval calls and 136 seconds per query on a single A100, consuming roughly 760,000 input tokens. That is not a configuration for latency-sensitive production traffic.
Swapping Claude Opus 4.5 for gpt-oss-120b (an open-weight alternative) cuts the time to 78.6 seconds and drops retrieval calls from 9.2 to 2.4. The NDCG@10 cost is 2.84 points on ViDoRe v3 and 9.6 points on BRIGHT. The larger BRIGHT gap suggests the reasoning-heavy benchmark requires a stronger reasoning model; the agent strategy on ViDoRe v3 is apparently simpler to replicate with a smaller model.
Using a lighter embedding model (llama-nemotron-embed-vl-1b-v2 instead of the 8B ColBERT model) with gpt-oss-120b drops ViDoRe performance to 62.42, slightly below the 8B dense baseline of 64.36. This reveals an important constraint: the embedding model sets the ceiling for what the agent can retrieve. Better queries cannot compensate for an embedding space that cannot represent the relevant documents.
NVIDIA’s production recommendation is pairing the agent with llama-nemotron-embed-vl-1b-v2, trading some top-end accuracy for a more deployable profile. Whether the agentic overhead is worth it at that configuration depends on whether your application tolerates query times measured in tens of seconds.
Contextualizing the Design Choices
Agentic RAG is not a new concept. Query rewriting with LLMs, Hypothetical Document Embeddings (HyDE), multi-query retrieval with RRF fusion, and FLARE-style retrieval triggered by generation uncertainty all predate this pipeline. What NeMo’s implementation contributes is a more fully agentic loop: the LLM controls not just an initial query transformation but the entire iterative retrieval trajectory, deciding when to stop and what to do with partial results.
The ReAct framing was proposed by Yao et al. in 2022 and has become a standard scaffold for tool-using agents. Applying it to retrieval specifically closes a loop that exists in most RAG systems: the generator model knows when retrieved context is insufficient, but standard pipelines provide no mechanism for acting on that knowledge.
The “generalizable” framing in the pipeline’s name reflects an architectural commitment. Rather than fine-tuning retrieval behavior to specific datasets through hand-crafted heuristics, the system delegates adaptation to the reasoning model. The consequence is dependency on that model’s quality and cost, which explains why the ablation results show a large gap between frontier and open-weight agents on reasoning-intensive tasks, and a smaller gap on document retrieval tasks where the query strategy is less complex.
Where This Fits
For production RAG on well-defined, relatively homogeneous corpora, a well-tuned hybrid retriever — BM25 plus dense vectors fused with RRF, followed by a cross-encoder reranker like nv-rerankqa-mistral-4b-v3 — will outperform this pipeline on latency by two to three orders of magnitude while delivering competitive accuracy on known domains.
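For contrast with the agentic loop, the hybrid baseline described above fits in a few lines. This is a hedged sketch: the three callables are placeholders for a BM25 index, a dense retriever, and a cross-encoder reranker such as the one named above, none of which are real APIs here.

```python
def hybrid_retrieve(query, bm25_search, dense_search, rerank, k=60, top_k=10):
    """Single-pass hybrid retrieval: BM25 + dense runs fused with RRF,
    then a cross-encoder rerank over the fused shortlist.

    `bm25_search(q)` and `dense_search(q)` return ranked doc-id lists;
    `rerank(q, docs)` reorders a candidate list.
    """
    runs = [bm25_search(query), dense_search(query)]
    scores = {}
    for run in runs:                       # RRF fusion across both runs
        for rank, doc in enumerate(run, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)[: top_k * 2]
    return rerank(query, fused)[:top_k]    # cross-encoder picks the final order
```

One LLM-free pass, two index lookups, one reranker batch: that fixed cost is the two-to-three-orders-of-magnitude latency advantage over an agent averaging nine retrieval calls per query.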
The NeMo agentic approach makes the most sense for enterprise search scenarios covering genuinely diverse document types and query patterns, where you cannot predict the domain in advance and cannot afford to build specialized pipelines for each one. The benchmark results position it precisely there: consistent top-tier performance across structurally different benchmarks, at the cost of latency and token budget that makes sense only for certain workloads.
The implementation is available in the NeMo-Retriever repository for reproducing the benchmarks or adapting the pipeline to a different embedding stack. The core loop — expose the retriever as a tool, fuse with RRF, run in-process — is straightforward enough to port to any agent framework.