Retrieval Beyond Text: What VLM-Backed Embeddings Change About the Retrieve-and-Rerank Stack

From CLIP to Vision-Language Models

For a few years, multimodal retrieval in Python meant reaching for CLIP. OpenAI released it in 2021, it shipped with a clean contrastive objective, and the sentence-transformers library wrapped it neatly enough that you could encode images and text with the same .encode() call. That was enough for most hobby projects.

Sentence Transformers v5.4 changes the baseline substantially. The HuggingFace announcement introduces unified multimodal support across embeddings and rerankers, but the more significant shift is architectural: the library now treats full vision-language models as first-class embedding backbones, not just convenience wrappers around frozen image encoders.

CLIP’s design was fundamentally about a shared latent space between a vision encoder (a ViT) and a text encoder (a small transformer), trained end-to-end on 400M image-text pairs with a contrastive loss. That architecture is good at zero-shot classification and coarse cross-modal similarity. It is less good at understanding document structure, reasoning about image content, or handling arbitrary resolution input. The new models in v5.4, including Qwen3-VL-Embedding and NVIDIA’s llama-nemotron-embed-vl-1b-v2, use full VLM architectures that were pretrained to understand images as language models understand text, then fine-tuned for embedding tasks. The headline numbers are not directly comparable, but they show where the field has moved: Qwen3-VL-Embedding-8B scores 77.8% on MMEB-V2, a retrieval benchmark, while CLIP ViT-L-14’s 75.4% is ImageNet zero-shot classification accuracy. MMEB-V2 is a substantially harder and more representative test of retrieval than ImageNet classification.

The Modality Gap, and Why It Does Not Break Retrieval

Anyone building cross-modal retrieval systems will encounter the modality gap quickly. Text-to-text cosine similarities in a well-tuned model sit somewhere between 0.5 and 1.0 for semantically related pairs. Text-to-image similarities for the same conceptual match typically land lower, between roughly 0.1 and 0.7. This is not a calibration problem you can fix by normalizing vectors. It reflects the fact that even after projection to a shared embedding space, representations retain modality-specific structure.

The practical consequence is that cosine similarity thresholds set for text retrieval will not transfer to cross-modal retrieval. If your pipeline filters out candidates below 0.6 similarity, you will silently discard relevant images even when the model understands the match.

The reason this does not break retrieval is that ranking depends on relative ordering, not absolute values. If your query is “revenue growth chart” and image A shows a bar chart of quarterly revenue while image B shows a sunset, the model will score A higher than B within the cross-modal range. The threshold problem is real, but it is solvable by removing hard similarity cutoffs and relying on top-k retrieval instead.
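To make the threshold-versus-ranking point concrete, here is a minimal sketch in plain Python. The similarity scores are made up for illustration and sit in the lower cross-modal range described above; the function names are hypothetical, not part of any library.

```python
# Illustrative text-to-image similarity scores (invented for this sketch).
scores = {
    "revenue_chart.png": 0.41,
    "org_chart.png": 0.22,
    "sunset.jpg": 0.08,
}


def filter_by_threshold(scores, threshold):
    """Keep only candidates at or above a fixed similarity cutoff."""
    return [doc for doc, score in scores.items() if score >= threshold]


def top_k(scores, k):
    """Keep the k best candidates by relative order, ignoring absolute values."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]


# A cutoff tuned on text-to-text scores discards every image.
print(filter_by_threshold(scores, 0.6))

# Top-k still returns the best matches in the right order.
print(top_k(scores, 2))
```

The cutoff returns an empty list even though the ordering of the three images is exactly what you would want, which is the failure mode top-k retrieval avoids.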

Qwen3-VL-Embedding provides a practical handle on this through configurable embedding dimensionality. The 2B model supports output dimensions from 64 to 2048, and the 8B model goes to 4096. Lower-dimensional projections can narrow the modality gap at some cost to capacity, which is useful when you want more consistent similarity distributions across modalities in a mixed corpus.
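As a rough sketch of what reducing dimensionality involves, the snippet below assumes the configurable output size behaves like Matryoshka-style truncation: keep the leading components, then renormalize. That is an assumption about the mechanism, not something stated above, and the helper name is hypothetical. In practice you would usually request a smaller size through the library, for example with the truncate_dim argument of SentenceTransformer, rather than slicing vectors by hand.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # A zero vector cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.5]  # toy 4-dimensional vector, not a real embedding
small = truncate_and_normalize(full, 2)
print(small)
```

Renormalizing after truncation matters because cosine similarity assumes unit-length vectors; a truncated vector generally no longer has norm 1.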

The Query-Document Prompt Asymmetry

One of the underappreciated shifts in dense retrieval over the past two years is the move away from symmetric encoding toward asymmetric prompting. Models like E5, BGE, and the Qwen embedding series use different instruction prefixes for queries versus documents. A query gets a prompt like “Represent this query for retrieval:” while a document gets a different prefix, or none at all. This asymmetry pushes the model to represent queries in a way that is oriented toward matching rather than self-description.

This pattern carries forward into multimodal retrieval in v5.4. The library exposes encode_query() and encode_document() methods that apply the correct prompts automatically, and this matters more than it might seem. Using the generic encode() for retrieval tasks skips the role-specific prompt prefix, which can meaningfully degrade recall on instruction-tuned models.
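Conceptually, the asymmetry amounts to prepending a role-specific prefix before encoding. The toy sketch below shows only that mechanism; the prompt table and helper are hypothetical stand-ins, and real models ship their own prompt strings in their configuration.

```python
# Hypothetical prompt table for illustration; real models define their own.
PROMPTS = {
    "query": "Represent this query for retrieval: ",
    "document": "",  # many models leave documents unprefixed
}


def with_prompt(texts, role):
    """Prepend the prefix for the given role to each input text."""
    prefix = PROMPTS[role]
    return [prefix + text for text in texts]


print(with_prompt(["revenue growth chart"], "query"))
print(with_prompt(["Q4 revenue rose across three regions."], "document"))
```

Encoding a query without its prefix means the model sees input in a form it was not fine-tuned on, which is why calling the role-specific methods is the safer default.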

With images as queries or documents, the asymmetry takes on a different character. A text query against an image corpus has a clear semantic orientation: the user has intent, the images have content. An image used as the query is less common but increasingly useful, for example in reverse image search or composed retrieval. BGE-VL-large supports composed image retrieval, where the query is an image combined with a text instruction like “same style but with a dark background”. That use case exists outside the symmetric CLIP paradigm entirely.

Installation and Concrete Usage

The new modalities require optional dependencies:

pip install -U "sentence-transformers[image,video,train]"

Loading a multimodal embedding model follows the same pattern as any Sentence Transformer:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={
        "attn_implementation": "flash_attention_2",
        "torch_dtype": "bfloat16"
    },
    processor_kwargs={
        "max_pixels": 600 * 600
    },
    revision="refs/pr/23",
)

The max_pixels processor kwarg matters in practice. VLM-based encoders process images at higher resolution than CLIP, which improves understanding of documents and charts but scales memory consumption quickly. Capping resolution keeps batch sizes reasonable on GPU.
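A back-of-the-envelope estimate shows why the cap matters. The sketch below assumes each visual token covers a fixed square of pixels; the 28-pixel side length is a placeholder assumption, not a value taken from the model, so check the processor configuration of whichever checkpoint you use.

```python
def estimate_visual_tokens(max_pixels, pixels_per_token_side=28):
    """Approximate visual tokens for an image resized to the pixel budget.

    `pixels_per_token_side` is an assumed value: the side length, in pixels,
    of the square region that ends up as one visual token.
    """
    return max_pixels // (pixels_per_token_side ** 2)


capped = estimate_visual_tokens(600 * 600)
uncapped = estimate_visual_tokens(1200 * 1200)
print(capped, uncapped)
```

Under this assumption, doubling each side of the image roughly quadruples the number of visual tokens, and attention cost grows faster than linearly on top of that.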

Cross-modal similarity then works over any combination of strings (text), image paths or URLs, and mixed dicts:

query_embeddings = model.encode_query([
    "quarterly revenue breakdown by region",
    "photo of a mountain lake at sunrise",
])

doc_embeddings = model.encode_document([
    "https://example.com/revenue_chart.png",
    "https://example.com/mountain_lake.jpg",
    "The company reported $45M in Q4 revenue across three regions.",
])

# One row per query, one column per document: shape (2, 3) here.
similarities = model.similarity(query_embeddings, doc_embeddings)

NVIDIA’s llama-nemotron-embed-vl-1b-v2 takes a slightly different path, using an Eagle VLM architecture that combines Llama 3.2 1B with SigLip2 400M as its vision encoder. It requires trust_remote_code=True and supports up to 10,240 tokens with a maximum of six image tiles per request. Its benchmark numbers on multimodal document retrieval are competitive: 73.24% average Recall@5 when using both image and text inputs, compared to 71.04% with text alone. The improvement from fusing modalities is consistent enough to be worth the added complexity.

Rerankers as the Precision Layer

Embedding models optimize for recall. They need to be fast enough to score a corpus of millions of documents and flexible enough to handle out-of-distribution queries. Rerankers operate on a shortlist, so they can afford to be slower and more thorough.

The v5.4 release extends the CrossEncoder API to handle multimodal inputs:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder(
    "Qwen/Qwen3-VL-Reranker-2B",
    revision="refs/pr/11",
)

query = "quarterly revenue by region"
documents = [
    "https://example.com/revenue_chart.png",
    "The company reported Q4 revenue across three regions.",
    {"text": "Annual financial summary", "image": "https://example.com/annual_report_page.png"},
]

# Returns a list of dicts with "corpus_id" and "score", sorted best first.
rankings = reranker.rank(query, documents)

The cross-encoder architecture processes query and document together rather than independently, which allows it to attend to fine-grained alignment between them. For multimodal pairs, this means the reranker can catch cases where an image is visually relevant to a query even though its file metadata or surrounding text is not. A chart labeled “Figure 3” scores low on text similarity to “revenue growth”; the visual content might score much higher.

The practical pipeline is the same two-stage pattern that became standard in text retrieval: embed for recall across the full corpus, rerank the top-k for precision. The embedding model handles millions of documents; the reranker handles tens.
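The control flow of the two stages can be sketched without any model at all. In the snippet below, the toy vectors stand in for embeddings and the rerank_score callable stands in for a cross-encoder; every name and number is hypothetical and chosen only to show the shape of the pipeline.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, shortlist_size, final_size):
    """Stage 1: shortlist by embedding score. Stage 2: reorder the shortlist."""
    by_embedding = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    shortlist = by_embedding[:shortlist_size]
    reranked = sorted(shortlist, key=rerank_score, reverse=True)
    return reranked[:final_size]


# Toy data: four documents, identified by index.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
query_vec = [1.0, 0.0]

# Stand-in for a cross-encoder: a fixed score per document index.
toy_rerank_scores = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}

result = retrieve_then_rerank(
    query_vec, doc_vecs, toy_rerank_scores.get, shortlist_size=3, final_size=2
)
print(result)
```

Document 2 has the highest reranker score in this toy setup but never reaches the second stage, because the embedding stage did not shortlist it. The reranker can only reorder what the first stage recalls, so the shortlist size is worth tuning.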

ColPali and the Document Retrieval Specialization

Before v5.4, the vidore group had already been pushing multimodal document retrieval forward with ColPali and ColQwen. These models take a different approach: rather than producing a single embedding per document page, they produce patch-level embeddings and use late interaction scoring, similar to ColBERT for text. This preserves spatial structure in the document, which matters when the relevant information is in a specific region of a chart or table.
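Late interaction is easier to see in code than in prose. The sketch below implements ColBERT-style MaxSim scoring in plain Python on toy vectors: for each query token vector, take the best-matching patch vector, then sum those maxima. Real implementations operate on batched tensors; this version and its names are illustrative only.

```python
def maxsim_score(query_vectors, patch_vectors):
    """Sum, over query token vectors, of the best dot product with any patch."""
    total = 0.0
    for q in query_vectors:
        total += max(
            sum(x * y for x, y in zip(q, p)) for p in patch_vectors
        )
    return total


# Toy 2-dimensional vectors standing in for token and patch embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match per token
page_b = [[0.5, 0.5], [0.5, 0.5]]              # only diffuse matches

print(maxsim_score(query_tokens, page_a))
print(maxsim_score(query_tokens, page_b))
```

Because each query token picks its own best patch, a page scores well when specific regions match specific parts of the query, which a single pooled embedding cannot express.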

ColQwen2.5-v0.2 currently sits at 74k downloads on HuggingFace, ahead of most of the models just landing in v5.4. The library’s new multimodal support does not replace this line of work; it gives practitioners a unified interface to both approaches and standardizes the retrieve-plus-rerank pattern that works well for most use cases.

What Changes for RAG Pipelines

The practical implication of v5.4 is that mixed-modality corpora (PDFs with charts, codebases with architecture diagrams, product catalogs with images) are now first-class inputs to retrieval pipelines, with no custom preprocessing needed to strip visual content. You can convert PDF pages to images, embed them with a VLM-backed encoder, and retrieve them directly using a text query. The information locked in a bar chart or a network diagram is retrievable without OCR or captioning as an intermediate step.

This changes the design question from “how do I extract text from my documents” to “which modality combination gives me the best recall for my content type”. NVIDIA’s nemotron numbers suggest that fusing image and text signals outperforms text alone by a measurable margin (73.24% versus 71.04% average Recall@5), and the Qwen3-VL models suggest that the approach is not tied to one model size, spanning the 2B-to-8B parameter range.

The retrieve-and-rerank stack is stable enough now that the main engineering work is choosing the right model for your content type, setting appropriate resolution limits for your hardware budget, and removing fixed similarity thresholds from your retrieval code. The library handles the rest.