The Retriever as a Tool: Inside NVIDIA NeMo's Agentic RAG Architecture
Source: huggingface
Dense retrieval has a core problem that’s easy to overlook when your queries are clean and your documents are well-indexed. Give it a messy multi-hop question, a query with rare terminology, or something that requires inferring what the user actually needs rather than matching surface tokens, and the single vector lookup falls apart. There’s no feedback path. The model computes one embedding, finds the nearest neighbors, and returns them regardless of whether those neighbors are useful.
The BRIGHT benchmark (Su et al., 2024) makes this concrete. BRIGHT pulls queries from Stack Exchange across twelve domains (math, coding, physics, chemistry, economics, and others) and pairs each query with relevant documents that share almost no vocabulary with the query itself. You have to reason about what the query is actually asking before you can retrieve. BM25 scores in the 4-8 NDCG@10 range on BRIGHT. Dense retrievers score 10-20. That gap exists not because the models are bad but because the task requires something single-shot retrieval structurally cannot do.
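Since NDCG@10 is the metric quoted throughout, it's worth pinning down what it measures. A minimal implementation for binary relevance (the function name and signature here are mine, not from any benchmark harness):

```python
import math

def ndcg_at_k(ranked_ids, relevant, k=10):
    """NDCG@k with binary relevance: the discounted cumulative gain of
    the returned ranking, normalized by the DCG of an ideal ranking
    that places every relevant document first."""
    dcg = sum(
        1.0 / math.log2(i + 2)  # positions are 0-indexed, so rank 1 -> log2(2)
        for i, doc in enumerate(ranked_ids[:k])
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

The log discount is why the metric rewards putting relevant documents early: a relevant hit at rank 1 counts for about 1.58x as much as the same hit at rank 3.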
NVIDIA’s NeMo Retriever agentic pipeline is a direct response to this. The architecture wraps a dense embedding retriever in a ReACT-style reasoning loop (Yao et al., 2022), giving a language model the ability to call retrieval as a tool, observe the results, and decide what to do next. The same pipeline, without any task-specific tuning, reached #1 on ViDoRe v3 (visual document retrieval) and #2 on BRIGHT.
What the Loop Actually Does
The ReACT framework is Thought → Action → Observation, iterated until the model decides it has enough. In this implementation there are four tools: think for internal reasoning steps, retrieve(query, top_k) for fetching documents, final_results for returning the answer, and an RRF fallback for when the agent hits step or context limits.
What makes this interesting is the emergent behavior. The system wasn’t explicitly programmed to decompose queries, but on complex inputs it does: it breaks the original question into sub-queries, retrieves for each, synthesizes what it found, and decides whether to keep going. It also rephrases queries persistently when early retrievals come back weak, and generates new queries based on what previous results suggested. None of this is hard-coded logic. It falls out of giving a capable model a retrieval tool and a loop.
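The loop described above can be sketched in a few lines. This is a simplified illustration, not NVIDIA's implementation: `call_llm` and `dense_retrieve` are hypothetical stand-ins for the reasoning model and the embedding retriever, and the action format is invented for the sketch.

```python
MAX_STEPS = 12  # arbitrary cap for the sketch; the real pipeline has its own limits

def agentic_retrieve(question, call_llm, dense_retrieve):
    """ReACT-style retrieval loop: the model alternates thoughts,
    retrieval calls, and observations until it commits to an answer."""
    history = [f"Question: {question}"]
    all_results = []  # every ranked list of doc ids retrieved so far

    for _ in range(MAX_STEPS):
        # The model emits one action per step:
        # think(...), retrieve(query, top_k), or final_results(doc_ids).
        action = call_llm("\n".join(history))
        if action["tool"] == "think":
            history.append(f"Thought: {action['text']}")
        elif action["tool"] == "retrieve":
            docs = dense_retrieve(action["query"], action.get("top_k", 10))
            ids = [d["id"] for d in docs]
            all_results.append(ids)
            history.append(f"Observation: {ids}")
        elif action["tool"] == "final_results":
            return action["doc_ids"]

    # Step limit hit: the real pipeline falls back to RRF fusion over
    # everything retrieved so far; a first-seen dedupe stands in here.
    seen = []
    for ranking in all_results:
        for doc_id in ranking:
            if doc_id not in seen:
                seen.append(doc_id)
    return seen
```

Query decomposition isn't anywhere in this code, which is the point: it shows up in the `think` steps and the sequence of `retrieve` queries the model chooses to emit.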
The alternative approach, static query decomposition, does some of this but locks in the decomposition strategy at the start. The agent can adapt mid-run based on what it actually finds, which matters when you don’t know in advance what shape the retrieval problem will take.
The Embedding Models
The pipeline ships three embedding models for different contexts:
- nemotron-colembed-vl-8b-v2: highest quality for visual document tasks, the one that drives the ViDoRe v3 results
- llama-nemotron-embed-vl-1b-v2: lighter, production-oriented
- llama-embed-nemotron-reasoning-3b: specialized for reasoning-intensive retrieval, used on BRIGHT
The production deployment uses a thread-safe singleton retriever rather than an MCP server. Single loaded model, single set of corpus embeddings, a reentrant lock for concurrent access. This eliminates network serialization overhead, which matters when you’re running 9-11 retrieval calls per query.
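The singleton pattern here is standard enough to sketch. This is an assumed shape, not NeMo's actual code; the `embed` and `nearest` calls are hypothetical placeholders for the model and index APIs.

```python
import threading

class SingletonRetriever:
    """One loaded model and one set of corpus embeddings shared by every
    agent thread in the process, guarded by a reentrant lock."""
    _instance = None
    _init_lock = threading.Lock()

    def __new__(cls):
        with cls._init_lock:  # double-checked creation under a lock
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._lock = threading.RLock()  # reentrant, safe to re-acquire
                cls._instance._ready = False
            return cls._instance

    def load(self, model, corpus_index):
        with self._lock:
            if not self._ready:  # load weights and embeddings exactly once
                self.model = model
                self.corpus_index = corpus_index
                self._ready = True

    def retrieve(self, query, top_k=10):
        with self._lock:  # serialize concurrent agent calls onto the shared model
            query_vec = self.model.embed(query)          # placeholder API
            return self.corpus_index.nearest(query_vec, top_k)  # placeholder API
```

Compared with an MCP server, every `retrieve` here is an in-process function call: no request serialization, no network hop, which is what the article means by eliminating overhead across 9-11 calls per query.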
What the Benchmark Numbers Say
The NDCG@10 results on ViDoRe v3:
| Configuration | Score | Time/query | Retrieval calls |
|---|---|---|---|
| Opus 4.5 + colembed-vl-8b | 69.22 (#1) | 136 sec | 9.2 |
| gpt-oss-120b + colembed-vl-8b | 66.38 | 78.6 sec | 2.4 |
| gpt-oss-120b + embed-vl-1b | 62.42 | 78.1 sec | 2.4 |
| Dense only (colembed-vl-8b) | 64.36 | 0.67 sec | 1 |
On BRIGHT:
| Configuration | Score | Time/query | Retrieval calls |
|---|---|---|---|
| Opus 4.5 + reasoning-3b | 50.90 (#2) | 148.2 sec | 11.8 |
| INF-X-Retriever | 63.40 (#1) | — | — |
| gpt-oss-120b + reasoning-3b | 41.27 | — | — |
| Dense only (reasoning-3b) | 38.28 | — | — |
A few things stand out here. First, the embedding model sets a ceiling the agent can’t fully climb over. On ViDoRe v3, dense-only colembed-vl-8b scores 64.36. The lighter embed-vl-1b with the same agent scores 62.42. Swapping in the better model gets you to 66.38, and adding a more capable reasoner on top gets you to 69.22. The agent amplifies retrieval quality but doesn’t substitute for it. If your embedding model is weak, the loop just finds more weak results faster.
Second, the difference between Opus 4.5 (9.2 calls, 136 sec) and gpt-oss-120b (2.4 calls, 78.6 sec) on ViDoRe is roughly the difference between a model that explores more thoroughly and one that converges faster. The more thorough exploration buys about 3 NDCG points. Whether that trade-off is worth it depends entirely on what you’re building.
Third, the generalization result is the headline. The same pipeline design, no per-benchmark tuning, competes at the top of two structurally different retrieval problems. ViDoRe v3 is about visually rich PDFs and slides where ColPali-style visual document understanding matters. BRIGHT is about reasoning over text with vocabulary mismatch. Hitting both with one architecture suggests the ReACT loop is genuinely general, not narrowly optimized.
The Cost Question
At 136 seconds and roughly 760,000 input tokens per query, the cost is substantial. For a search box on a website, this is not viable. For a research assistant running overnight analysis over a document corpus, it might be exactly right.
Agentic retrieval is not a better version of dense retrieval; it’s a different mode of operation for different use cases. The scenarios where it makes sense:
- Queries that are genuinely multi-hop and require synthesizing across documents
- High-stakes retrieval where recall quality is worth significant latency
- Offline pipelines where wall-clock time per query doesn’t matter
- Domains with heavy vocabulary mismatch where single-shot retrieval structurally fails
The scenarios where it doesn’t:
- Real-time search with latency budgets under a few seconds
- High-volume retrieval where token costs accumulate
- Clean, well-formed queries over well-indexed corpora where dense retrieval already works well
The 0.67 sec dense baseline on ViDoRe v3 scores 64.36. The 136 sec agentic top result scores 69.22. That’s roughly a 7.5% improvement at 200x the latency and orders of magnitude more token spend. For some applications that trade-off is obviously worth it. For most production systems it’s not.
Where This Points
The interesting direction is what the NVIDIA team mentions about distillation. The agentic pipeline generates traces: sequences of think steps, retrieve calls, and observations that led to good results. Those traces are training data. You can fine-tune a smaller model on them to reproduce the reasoning behavior without running the full agentic loop at inference time.
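What such a trace might look like as a training record is easy to imagine, though the schema below is entirely my assumption, not NeMo's format:

```python
# Hypothetical trace record for distillation. Field names and the
# quality label are illustrative assumptions, not a published schema.
trace = {
    "question": "Why does my recursive solution exceed the time limit?",
    "steps": [
        {"tool": "think", "text": "The user needs memoization references."},
        {"tool": "retrieve", "query": "dynamic programming memoization",
         "top_k": 10, "observation": ["doc_17", "doc_203"]},
        {"tool": "final_results", "doc_ids": ["doc_203", "doc_17"]},
    ],
    "ndcg_at_10": 0.71,  # quality score used to filter which traces to train on
}
```

The filtering step matters: only traces that ended in good retrievals are worth imitating, so a quality metric like NDCG@10 doubles as the selection criterion for the distillation set.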
This is the standard arc for capabilities that emerge in large models: demonstrate the capability in a large expensive system, collect traces, distill into something smaller and cheaper. The benchmark results here are partly a proof of concept for what’s achievable, and partly a data collection exercise for the distillation step.
The RRF fallback (Cormack, Clarke & Buettcher, SIGIR 2009) is a pragmatic design choice worth noting. When the agent hits context or step limits, it merges all retrieved results using reciprocal rank fusion rather than failing or returning a partial answer. RRF is rank-based rather than score-based, which avoids the normalization problem when fusing results from multiple heterogeneous retrieval calls with incomparable similarity scores. It’s a sensible degradation path.
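RRF itself is a few lines. A sketch of the standard formulation from the cited paper (the `k=60` constant is the value the paper uses; the function name is mine):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion (Cormack, Clarke & Buettcher, 2009):
    each ranked list contributes 1 / (k + rank) per document, so only
    rank positions matter, never the underlying similarity scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks enter the sum, a list scored by cosine similarity and a list scored by a ColBERT-style late-interaction score fuse cleanly, which is exactly the heterogeneity problem the fallback faces after many independent retrieval calls.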
The architecture as a whole is a reasonable answer to the question of what retrieval looks like when you stop treating it as a lookup and start treating it as a reasoning task. The trade-offs are real and the costs are high, but the benchmark results make a credible case that for the right problems, the approach works.