jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

2026 · cs.CL · arXiv 2605.08384

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modalities by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We train only the connecting components, which represent 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that GELATO produces results that are competitive with the state of the art, yielding nearly equal performance to larger multimodal embedding models.
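The frozen-tower idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the module names, the parameter counts (chosen only so that the connectors come to 0.35% of the total), the scalar "weights", and the helpers `trainable_fraction` and `sgd_step` are hypothetical and are not taken from the paper or from any Jina code.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    n_params: int
    trainable: bool = False  # towers are frozen unless marked otherwise


def trainable_fraction(modules):
    """Share of all parameters that receive gradient updates."""
    total = sum(m.n_params for m in modules)
    trainable = sum(m.n_params for m in modules if m.trainable)
    return trainable / total


def sgd_step(weights, grads, modules, lr=0.1):
    """Update only trainable modules; frozen weights pass through untouched."""
    trainable = {m.name for m in modules if m.trainable}
    return {
        name: (w - lr * grads[name]) if name in trainable else w
        for name, w in weights.items()
    }


# Hypothetical parameter counts, picked so the connectors are 0.35% of the total.
towers = [
    Module("text_backbone", 600_000_000),
    Module("image_encoder", 300_000_000),
    Module("audio_encoder", 96_500_000),
    Module("image_connector", 2_000_000, trainable=True),
    Module("audio_connector", 1_500_000, trainable=True),
]

# Toy scalar stand-ins for each module's weights and gradients.
weights = {m.name: 1.0 for m in towers}
grads = {m.name: 0.5 for m in towers}
updated = sgd_step(weights, grads, towers)

print(f"trainable share: {trainable_fraction(towers):.2%}")
print("text backbone unchanged:", updated["text_backbone"] == weights["text_backbone"])
```

Because the text backbone never changes in this sketch, anything computed from it alone is the same before and after training, which mirrors the abstract's claim that text embeddings match those of the Jina Embeddings v5 Text models.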

representative citing papers

ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models

cs.SD · 2026-06-27 · unverdicted · novelty 5.0

ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.
