From CLIP to VLM Embeddings: How Sentence Transformers v5.4 Changes Multimodal Search

For years, CLIP was the default answer for multimodal search. If you wanted to match text queries to images, you used CLIP. The architecture was elegant: two encoders, one for text and one for images, trained together so their outputs landed in the same vector space. The contrastive training objective pulled semantically similar pairs together and pushed dissimilar ones apart, and the models were compact enough to run without specialized hardware.
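
To make the contrastive objective concrete, here is a minimal sketch of a symmetric contrastive loss over a toy batch, in plain Python. The 2-D vectors are made up for illustration, and the temperature of 0.07 is an assumed value; real training operates on large batches of learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]

def contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric loss where text k and image k form the positive pair."""
    n = len(text_vecs)
    logits = [[cosine(t, img) / temperature for img in image_vecs] for t in text_vecs]
    # Text-to-image direction: each row should peak on its own image.
    text_to_image = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Image-to-text direction: each column should peak on its own text.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    image_to_text = sum(cross_entropy_row(columns[k], k) for k in range(n)) / n
    return (text_to_image + image_to_text) / 2

# Hypothetical embeddings: pair 0 and pair 1 each point in similar directions.
text_vecs = [[1.0, 0.1], [0.1, 1.0]]
image_vecs = [[0.9, 0.2], [0.2, 0.9]]

aligned_loss = contrastive_loss(text_vecs, image_vecs)
mismatched_loss = contrastive_loss(text_vecs, list(reversed(image_vecs)))
print(f"aligned: {aligned_loss:.4f}  mismatched: {mismatched_loss:.4f}")
```

The loss is low when each caption sits closest to its own image and high when the pairing is swapped, which is the signal that shapes the shared space.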

Sentence Transformers built on this with their clip-ViT-* series, making CLIP embeddings accessible through a familiar Python API. The models were small by 2026 standards (ViT-B/32 at 149M parameters), the embedding dimensions were modest (512 to 768), and the semantic understanding was bounded by what contrastive training on image-caption pairs could teach. CLIP could tell you that an image of a dog was more similar to the word “dog” than to the word “airplane,” but it had limited capacity for the kind of fine-grained visual reasoning that distinguishes a “pie chart showing declining quarterly revenue” from a “bar chart showing year-over-year growth.”

Version 5.4 of the Sentence Transformers library, published this week, changes the underlying architecture substantially.

The Dual-Encoder Ceiling

CLIP’s dual-encoder design has a structural constraint: the text and vision encoders never attend to each other during inference. Each encoder independently produces its output vector, and similarity is measured post-hoc by comparing those vectors in the shared space. The model cannot use visual context to disambiguate text, or use text context to focus on specific image regions. The shared embedding space is learned during training but not exploited in any adaptive way when a query arrives.
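
A minimal sketch of that post-hoc comparison, in plain Python. The vectors and file names are hypothetical stand-ins for encoder outputs; the point is that the image vectors are fixed before any query arrives and are never adjusted to it.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical image embeddings, computed once and stored ahead of time.
image_index = {
    "dog.jpg": normalize([0.9, 0.1, 0.0]),
    "airplane.jpg": normalize([0.1, 0.2, 0.95]),
}

def search(query_vec, index):
    """Rank stored images against a query vector; the index is read-only."""
    q = normalize(query_vec)
    scored = [(name, dot(q, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical text embedding for a query such as "dog".
results = search([0.8, 0.3, 0.1], image_index)
for name, score in results:
    print(f"{score:.3f}\t{name}")
```

This independence is also what makes dual encoders fast: the corpus can be embedded offline, and a query costs one encoder pass plus a set of dot products.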

This ceiling shows up in practice when queries require compositional understanding. A query like “a document with a table comparing product pricing” requires understanding what a table looks like, what pricing data looks like in tabular form, and the relationship between those visual elements and the query text. CLIP-style models handle these queries imprecisely because contrastive training on image-caption pairs mostly rewards matching an image to a short overall description, a simpler alignment signal than the layout and relationships such a query depends on.

Vision Language Models solve this differently. They feed image tokens through the same transformer layers as text tokens, allowing full cross-modal attention at every layer. A VLM processing an image alongside a text query can attend to specific visual regions when processing certain words, building representations that reflect genuine semantic interaction between the two modalities. This is the architecture behind GPT-4V, LLaVA, and the Qwen-VL series, and it is qualitatively different from CLIP for complex visual reasoning tasks.
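
As a toy illustration of image and text tokens sharing one sequence, the sketch below computes single-head scaled dot-product attention weights in plain Python. The token vectors are made up, and there are no learned projections or masking, so this is a simplification of what a real VLM layer does.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(tokens):
    """Attention weights with the tokens used directly as queries and keys."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    return weights

# Hypothetical token vectors: two image patches followed by one word.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.9, 0.1]]
sequence = image_tokens + text_tokens  # one shared sequence for both modalities

weights = attention_weights(sequence)
text_row = weights[-1]  # how the word distributes attention over all tokens
print([round(w, 3) for w in text_row])
```

The word's attention row includes weights on the image patches, and the patch it resembles most receives more weight than the other. A dual encoder has no equivalent of this row, because the two modalities never appear in the same sequence.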

Sentence Transformers v5.4 brings VLM-based models into the library as first-class embedding and reranking backends. The headline models are Qwen3-VL-Embedding-2B and 8B, which use the Qwen3 vision language model as their backbone. Instead of CLIP’s 149M to 427M parameter range, these are 2B and 8B models outputting 2048-dimensional and 4096-dimensional embeddings respectively. BAAI’s BGE-VL series covers the middle ground from 100M to 8B parameters. NVIDIA’s Nemotron models at 1.7B and 4.7B round out the options.

Getting Started

Installation now uses optional extras per modality:

pip install -U "sentence-transformers[image]"
# for video support and the training extras as well
pip install -U "sentence-transformers[image,video,train]"

The core API is unchanged from the text-only case:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    revision="refs/pr/23",
    model_kwargs={"torch_dtype": "bfloat16"},
    processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 600 * 600,
    },
)

# Images accept URLs, file paths, or PIL objects interchangeably
img_embeddings = model.encode([
    "https://example.com/revenue-chart.jpg",
    "https://example.com/architecture-diagram.png",
])

text_embeddings = model.encode([
    "quarterly revenue trend over time",
    "microservices architecture diagram",
])

similarities = model.similarity(text_embeddings, img_embeddings)
# tensor([[0.71, 0.09],
#         [0.08, 0.68]])

For retrieval scenarios, the library distinguishes between query and document encoding. Some models apply different prompting strategies or pooling modes depending on which role the input plays:

query_embeddings = model.encode_query(["charts showing cost reduction over time"])
doc_embeddings = model.encode_document(["path/to/report-page.png", "path/to/slide.png"])

The input format system accepts heterogeneous lists in a single call: plain strings, image paths, URLs, PIL images, numpy arrays, torch tensors, and dictionaries combining multiple modalities within a single document:

embeddings = model.encode([
    "a text-only query",
    "https://example.com/image.jpg",
    {
        "text": "Q3 financial summary",
        "image": "https://example.com/q3-chart.jpg",
    },
])

That last format is particularly relevant for document indexing. Many real-world corpora contain pages or records that are inherently multimodal: a PDF page with an embedded figure, a product listing with both a description and a photo, a slide with speaker notes. Representing these as single embeddings, rather than indexing text and image separately and merging results at query time, is a meaningful architectural simplification.
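
For a corpus stored as records, a small helper can turn each record into one of the input shapes shown above. This is a sketch in plain Python: the function name and the record fields are hypothetical, and only the "text" and "image" dictionary keys come from the example.

```python
def to_encoder_inputs(records):
    """Map records to encoder inputs, one item per record.

    Records with both fields become a combined dict; records with a single
    field fall back to a plain string (a text, or an image path or URL).
    """
    inputs = []
    for record in records:
        text = record.get("text")
        image = record.get("image")
        if text and image:
            inputs.append({"text": text, "image": image})
        elif image:
            inputs.append(image)
        elif text:
            inputs.append(text)
        else:
            raise ValueError(f"record has neither text nor image: {record!r}")
    return inputs

# Hypothetical records, e.g. rows from a product catalog or parsed PDF pages.
records = [
    {"text": "Q3 financial summary", "image": "https://example.com/q3-chart.jpg"},
    {"image": "https://example.com/image.jpg"},
    {"text": "a text-only note"},
]
inputs = to_encoder_inputs(records)
print(inputs)
```

The resulting list can be passed to a single encode call, so each record maps to exactly one embedding and one row in the index.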

The Two-Stage Pipeline

The retrieve-and-rerank pattern has been standard in text search for years. Embedding-based retrieval is fast but imprecise; you retrieve a broad candidate set, then apply a more expensive cross-encoder model to re-score those candidates for a final ranking. The embedding model trades precision for throughput; the reranker inverts that trade-off.

Sentence Transformers v5.4 extends this to multimodal content with a new class of CrossEncoder models capable of scoring query-document pairs where the document is an image, text, or a combination. The reranker models available include Qwen3-VL-Reranker-2B and 8B, NVIDIA’s llama-nemotron-rerank-vl-1b-v2, and JinaAI’s jina-reranker-m0.

The CrossEncoder API mirrors the existing text reranker interface:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B", revision="refs/pr/11")

query = "quarterly revenue decline pie chart"
documents = [
    "https://example.com/revenue-chart.jpg",           # image
    "Revenue declined 12% in Q3 due to softening...",  # text
    {                                                   # mixed
        "text": "Financial summary Q3",
        "image": "https://example.com/revenue-chart.jpg",
    },
]

rankings = reranker.rank(query, documents)
for rank in rankings:
    print(f"{rank['score']:.4f}\t(document {rank['corpus_id']})")

A complete retrieve-and-rerank pipeline combining both components:

from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")
reranker = CrossEncoder("nvidia/llama-nemotron-rerank-vl-1b-v2", trust_remote_code=True)

query = "revenue decline chart"
corpus = [...]  # list of image paths or URLs

query_emb = embedder.encode_query(query)
corpus_embs = embedder.encode_document(corpus, show_progress_bar=True)

# Retrieve top 10 via cosine similarity
scores = embedder.similarity(query_emb, corpus_embs)[0]  # one row: the single query
top_k_idx = scores.argsort(descending=True)[:10].tolist()  # plain ints for list indexing
top_docs = [corpus[i] for i in top_k_idx]

# Rerank for precision
final_rankings = reranker.rank(query, top_docs)
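
The cost structure of the two stages is easier to see with the models stripped out. The sketch below is plain Python with stand-in scoring functions (word overlap as the cheap scorer, a phrase-aware variant as the expensive one); both are hypothetical and only mimic the roles of the embedder and the reranker.

```python
def retrieve_then_rerank(query, corpus, cheap_score, precise_score, top_k=10):
    """Return corpus indices, best first, scoring only top_k items precisely."""
    # Stage 1: cheap score for every document, keep the top_k candidates.
    by_cheap = sorted(
        range(len(corpus)),
        key=lambda i: cheap_score(query, corpus[i]),
        reverse=True,
    )
    candidates = by_cheap[:top_k]
    # Stage 2: expensive score for the surviving candidates only.
    return sorted(candidates, key=lambda i: precise_score(query, corpus[i]), reverse=True)

precise_calls = []  # records every document the expensive scorer sees

def word_overlap(query, doc):
    """Stand-in for embedding similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_aware(query, doc):
    """Stand-in for a reranker: overlap plus a bonus for the exact phrase."""
    precise_calls.append(doc)
    bonus = 5 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus

corpus = [
    "chart of revenue growth",
    "team photo from the offsite",
    "revenue decline chart for Q3",
    "decline in chart usage",
]
ranking = retrieve_then_rerank("revenue decline chart", corpus, word_overlap, phrase_aware, top_k=3)
print([corpus[i] for i in ranking])
print(f"precise scorer ran {len(precise_calls)} times for {len(corpus)} documents")
```

The expensive scorer runs only on the candidates that survive stage one, which is why the pattern scales: reranking cost depends on top_k, not on corpus size.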

You can also inspect which modalities a model supports at runtime:

print(model.modalities)      # ['text', 'image', 'video', 'message']
print(model.supports("audio"))  # False

Hardware Constraints

The trade-off is direct: better semantic understanding costs more compute. CLIP-ViT-B/32 runs comfortably on a CPU or any GPU with a couple of gigabytes of memory. Qwen3-VL-Embedding-2B needs approximately 8 GB VRAM; the 8B variant needs around 20 GB. The library documentation recommends against CPU inference for VLM-based models entirely.
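
A back-of-the-envelope check on those figures: the weights alone take roughly the parameter count times the bytes per parameter. The sketch below assumes nominal counts of exactly 2 and 8 billion parameters and bfloat16 storage at 2 bytes each; real checkpoints differ somewhat.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory for the model weights alone, in GiB (bfloat16 = 2 bytes each)."""
    return num_params * bytes_per_param / 1024**3

# Nominal parameter counts, assumed for illustration.
for name, params in [("2B", 2e9), ("8B", 8e9)]:
    print(f"{name}: ~{weight_memory_gib(params):.1f} GiB of weights in bfloat16")
```

The remainder of the quoted VRAM figures would then go to activations, visual tokens, and framework overhead, all of which grow with image resolution and batch size.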

The BAAI BGE-VL-base at 100M parameters sits at the lighter end of the VLM-based options, offering reduced resource requirements with correspondingly reduced semantic depth. For high-throughput workloads where the queries are relatively simple, the CLIP models remain defensible, particularly clip-ViT-L-14, which achieves 75.4% ImageNet zero-shot top-1 accuracy at 427M parameters.

For throughput optimization, Flash Attention 2 is supported through model kwargs:

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={
        "attn_implementation": "flash_attention_2",
        "torch_dtype": "bfloat16",
    },
    revision="refs/pr/23",
)

Note that processor_kwargs is new in this release, replacing the older tokenizer_kwargs parameter name, though the old name remains supported for backward compatibility. The min_pixels and max_pixels processor options bound the resolution at which each image is processed: a higher max_pixels preserves more visual detail but uses more memory.
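
To get a feel for what a pixel budget does to an image, the sketch below rescales dimensions so the area falls within the bounds while keeping the aspect ratio. It is an assumption about the general behavior of such bounds, not the library's actual resizing code; real processors typically also snap dimensions to a multiple of the patch size, which this ignores.

```python
import math

def fit_to_pixel_budget(width, height, min_pixels, max_pixels):
    """Return (width, height) rescaled so the area lies in [min_pixels, max_pixels]."""
    area = width * height
    if area > max_pixels:
        # Shrink: rounding down keeps the area at or under the upper bound.
        scale = math.sqrt(max_pixels / area)
        return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
    if area < min_pixels:
        # Enlarge: rounding up keeps the area at or over the lower bound.
        scale = math.sqrt(min_pixels / area)
        return math.ceil(width * scale), math.ceil(height * scale)
    return width, height

# A hypothetical 1200x900 scan under the bounds used in the earlier example.
new_size = fit_to_pixel_budget(1200, 900, min_pixels=28 * 28, max_pixels=600 * 600)
print(new_size, new_size[0] * new_size[1])
```

Under these bounds a 1200x900 page loses about two thirds of its pixels, which is worth keeping in mind for documents with small print or dense tables.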

Text-Only Rerankers

The release also adds text-only reranker support, which is worth noting separately. The Qwen3-Reranker series (0.6B, 4B, and 8B variants) and MixedBread’s mxbai-rerank-v2 models (0.5B base and 1.5B large) are now natively supported through the same CrossEncoder interface. For text-only retrieval pipelines, these specialist models outperform the multimodal rerankers on text tasks:

model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v2")

scores = model.predict([
    ("How do I configure a reverse proxy?", "nginx reverse proxy configuration..."),
    ("How do I configure a reverse proxy?", "the history of HTTP proxies..."),
])
# [ 7.31 -2.62 ]
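
The raw scores are unbounded, so a common follow-up is to sort by score and, if a 0-to-1 value is easier to threshold, squash each score with a sigmoid. The sketch below reuses the two scores from the example output above; treating them as logits is an assumption, not something the output itself confirms.

```python
import math

# Scores and documents copied from the example above.
scores = [7.31, -2.62]
documents = [
    "nginx reverse proxy configuration...",
    "the history of HTTP proxies...",
]

def sigmoid(x):
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Sort best first; the sigmoid is monotonic, so it never changes the order.
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{sigmoid(score):.4f}\t{doc}")
```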

What This Enables

The clearest practical application is visual RAG. Building retrieval-augmented generation over document corpora that contain images, diagrams, and charts has historically required a captioning step: run a separate vision model over each image to produce text descriptions, index the text, and accept the semantic loss that accompanies any lossy conversion. With VLM-based embeddings, images are indexed directly. Queries retrieve on actual visual content, and reranking applies fine-grained scoring before results are passed to a generative model.

This also matters for mixed-modality document collections: slide decks, technical documentation with figures, financial reports with embedded charts, e-commerce catalogs where products have both images and descriptions. The combined input format (a dict with both text and image keys) means these can be represented and retrieved as coherent units rather than as separately indexed fragments that need to be rejoined at query time.

The Qwen models are currently loaded from pull-request revisions on Hugging Face (the revision="refs/pr/23" and revision="refs/pr/11" arguments in the examples above) rather than from fully merged releases, so some caution is warranted in production deployments. The library integration follows established Sentence Transformers patterns closely enough that adoption should be straightforward for teams already using the library. The full documentation and the v5.4 integrations collection cover the complete model catalog and compatibility details.

The shift from CLIP to VLM-based embeddings is a meaningful advance in what multimodal retrieval can accomplish. CLIP made image-text matching accessible; VLM embeddings bring the semantic depth needed to make that matching accurate on the kinds of complex, compositional queries that appear in real applications. The two-stage retrieve-and-rerank pattern extending to multimodal content is the natural next step, and Sentence Transformers has packaged it in a way that slots into existing pipelines without requiring a full rewrite.