jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

Andreas Koukounas; Florian Hönicke; Han Xiao; Michael Günther; Mohammad Kalim Akram; Saba Sturua; Scott Martens

arxiv: 2605.08384 · v3 · pith:CGB34AKJnew · submitted 2026-05-08 · 💻 cs.CL

Pith reviewed 2026-06-30 22:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords multimodal embeddings · GELATO · frozen encoders · modality alignment · text image audio video · efficient training · VLM architecture

The pith

By freezing text and modality encoders and training only 0.35% of weights, GELATO produces competitive multimodal embeddings while exactly preserving the original text model outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GELATO as a method to extend existing text embedding models to handle images, audio, and video by adding non-text encoders and training only the connecting components. The text backbone and added encoders stay frozen, so the language model produces identical embeddings for text inputs as the base Jina Embeddings v5 models. This yields a unified semantic space for all modalities with performance nearly matching larger state-of-the-art multimodal models. A sympathetic reader would care because the approach enables efficient multimodal extension without full retraining or loss of prior text capabilities.

Core claim

GELATO extends the two Jina Embeddings v5 Text models to support additional modalities by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. Only the connecting components, representing 0.35% of the total weights of the joint model, are trained. This produces the jina-embeddings-v5-omni suite that encodes text, image, audio, and video input into a single semantic embedding space, with evaluations showing results competitive with the state-of-the-art and nearly equal performance to larger multimodal embedding models.

What carries the argument

Locked aligned towers: the VLM-style setup in which frozen non-text encoders feed adapted inputs via trained connectors into a frozen language model that generates embeddings for every modality.
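The data flow described above can be sketched with toy linear maps. This is only an illustration under stated assumptions: the matrices, their sizes, and the function names `embed_text` and `embed_image` are hypothetical and are not taken from the paper. The point it shows is structural: the text path never touches the connector, so changing the connector cannot change a text embedding.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Frozen weights (toy values, hypothetical). Nothing below ever modifies them.
TEXT_BACKBONE: Matrix = [[0.5, -0.25], [0.25, 0.5]]
IMAGE_TOWER: Matrix = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]


def embed_text(text_features: Vector) -> Vector:
    """Text path: frozen backbone only; no connector is involved."""
    return matvec(TEXT_BACKBONE, text_features)


def embed_image(pixels: Vector, connector: Matrix) -> Vector:
    """Image path: frozen tower -> trainable connector -> frozen backbone."""
    tower_out = matvec(IMAGE_TOWER, pixels)
    projected = matvec(connector, tower_out)
    return matvec(TEXT_BACKBONE, projected)


text = [1.0, 2.0]
image = [1.0, 2.0, 3.0]
connector_before: Matrix = [[1.0, 0.0], [0.0, 1.0]]
connector_after: Matrix = [[0.0, 1.0], [1.0, 0.0]]  # stands in for a training update

text_before = embed_text(text)
image_before = embed_image(image, connector_before)
text_after = embed_text(text)
image_after = embed_image(image, connector_after)

print(text_before == text_after)    # text output does not depend on the connector
print(image_before != image_after)  # image output does
```

A real system would use transformer towers and a learned projector rather than 2x2 matrices; the sketch keeps only the wiring.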

If this is right

The language model remains unaltered and produces exactly the same embeddings for text inputs as the base Jina Embeddings v5 Text models.
Training updates only 0.35% of total weights, making extension far more efficient than full-parameter retraining.
The resulting models achieve performance competitive with larger state-of-the-art multimodal embedding models across text, image, audio, and video.
All input types are mapped into one shared semantic embedding space.
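One practical consequence of a single shared space is that items of different modalities can be ranked against the same query with one similarity function. The sketch below assumes cosine similarity and uses made-up three-dimensional vectors; the item names and values are hypothetical, and real embeddings would come from the model.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query: Vector, index: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Return (item, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings of an image, an audio clip, and a video.
index = {
    "image:dog_photo": [0.9, 0.1, 0.0],
    "audio:dog_bark": [0.8, 0.3, 0.1],
    "video:city_traffic": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of the text query "a dog"

ranking = rank(query, index)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because every item lives in the same space, no per-modality score calibration appears in the ranking code; whether that holds up in practice is exactly what the benchmarks are meant to test.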

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could allow existing text-only systems to incorporate new modalities without retraining or breaking compatibility.
Freezing the majority of weights may reduce the data and compute needed when adding further modalities later.
The same locked-tower pattern might apply to other frozen text embedding backbones beyond the Jina v5 series.

Load-bearing premise

That training only the connecting components while keeping the text backbone and non-text encoders frozen is sufficient to align modalities and preserve geometry without degrading performance.
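The premise can be illustrated with a toy optimisation in which only the connector receives gradient updates. The sketch assumes a squared-error alignment loss between an image embedding and the frozen text embedding of its caption; that loss is a stand-in chosen for simplicity, not the paper's training objective, and all matrices and names here are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def transpose(matrix: Matrix) -> Matrix:
    return [list(col) for col in zip(*matrix)]


BACKBONE: Matrix = [[0.5, -0.25], [0.25, 0.5]]   # frozen (toy values)
TOWER_OUT: Vector = [1.0, 0.5]                    # frozen image tower output for one image
TARGET: Vector = matvec(BACKBONE, [2.0, -1.0])    # frozen text embedding of the caption


def loss_and_residual(connector: Matrix) -> Tuple[float, Vector]:
    """Squared error between the image embedding and the text target."""
    pred = matvec(BACKBONE, matvec(connector, TOWER_OUT))
    residual = [p - t for p, t in zip(pred, TARGET)]
    return sum(r * r for r in residual), residual


def train_connector(connector: Matrix, steps: int, lr: float) -> Matrix:
    """Gradient descent on the connector only; BACKBONE is read, never written."""
    backbone_t = transpose(BACKBONE)
    for _ in range(steps):
        _, residual = loss_and_residual(connector)
        back = matvec(backbone_t, residual)  # B^T r
        # d/dW ||B W f - t||^2 = 2 (B^T r) f^T
        connector = [
            [w - lr * 2.0 * back[i] * TOWER_OUT[j] for j, w in enumerate(row)]
            for i, row in enumerate(connector)
        ]
    return connector


backbone_snapshot = [row[:] for row in BACKBONE]
start: Matrix = [[0.0, 0.0], [0.0, 0.0]]
initial_loss, _ = loss_and_residual(start)
trained = train_connector(start, steps=100, lr=0.5)
final_loss, _ = loss_and_residual(trained)

print(initial_loss, final_loss)
print(BACKBONE == backbone_snapshot)  # the frozen weights are unchanged
```

In this toy case the connector alone has enough capacity to reach the target. Whether that remains true at scale, across modalities and tasks, is the empirical question the premise rests on.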

What would settle it

A side-by-side evaluation showing that text embeddings from the omni model differ from those of the original Jina Embeddings v5 Text models on identical inputs, or that multimodal retrieval performance falls substantially below larger models.
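The first half of that test is easy to automate. The sketch below assumes the two sets of text embeddings have already been computed on identical inputs and are available as plain lists of floats; the helper names and sample values are hypothetical. A tolerance of zero checks the "exactly the same" claim literally, and a small positive tolerance can absorb hardware or kernel nondeterminism if that turns out to matter.

```python
from typing import List, Sequence

Vector = Sequence[float]


def max_abs_diff(base: Sequence[Vector], omni: Sequence[Vector]) -> float:
    """Largest element-wise absolute difference between two embedding sets."""
    if len(base) != len(omni):
        raise ValueError("both models must embed the same number of inputs")
    worst = 0.0
    for vec_a, vec_b in zip(base, omni):
        if len(vec_a) != len(vec_b):
            raise ValueError("embedding dimensions differ")
        for x, y in zip(vec_a, vec_b):
            worst = max(worst, abs(x - y))
    return worst


def text_geometry_preserved(
    base: Sequence[Vector], omni: Sequence[Vector], tolerance: float = 0.0
) -> bool:
    """True when no coordinate differs by more than the tolerance."""
    return max_abs_diff(base, omni) <= tolerance


# Hypothetical embeddings of the same two sentences from both models.
base_text: List[List[float]] = [[0.1, 0.2], [0.3, 0.4]]
omni_same: List[List[float]] = [[0.1, 0.2], [0.3, 0.4]]
omni_drifted: List[List[float]] = [[0.1, 0.2], [0.3, 0.4001]]

print(text_geometry_preserved(base_text, omni_same))
print(text_geometry_preserved(base_text, omni_drifted))
```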

Figures

Figures reproduced from arXiv: 2605.08384 by Andreas Koukounas, Florian Hönicke, Han Xiao, Michael Günther, Mohammad Kalim Akram, Saba Sturua, Scott Martens.

**Figure 2.** Architecture of jina-embeddings-v5-omni (jina-embeddings-v5-omni-small shown; jina-embeddings-v5-omni-nano uses a smaller ViT and LLaVA-style tokens). Frozen towers feed trainable modality projectors into the frozen text backbone; task-specific exports select one projector/delimiter set and the matching LoRA adapter.

**Figure 3.** Distribution of input tokens across semantic data types, averaged over the four task-specific checkpoints.

**Figure 5.** Per-language audio retrieval. Tiles show …

**Figure 4.** XM3600 image-language comparison. Tiles show …

**Figure 6.** Vision ablation tests on CIRR-IT2I and NIGHTS …

**Figure 7.** Audio ablation tests on UrbanSound8K, Common …

**Figure 8.** Matryoshka prefix tests across modalities. Curves …
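For readers unfamiliar with the term in the last caption, a Matryoshka prefix test evaluates an embedding after keeping only its first few dimensions. A minimal sketch of that truncation step follows, assuming the usual practice of re-normalising the prefix to unit length before computing cosine similarity; the function name and sample vector are hypothetical and the paper's exact protocol is not described on this page.

```python
import math
from typing import List


def truncate_and_normalize(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale them to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and the embedding size")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm and cannot be normalised")
    return [x / norm for x in prefix]


full = [3.0, 4.0, 12.0]  # toy embedding
short = truncate_and_normalize(full, 2)
print(short)
```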

Original abstract

In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modality by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They froze the text model and new encoders, trained 0.35% of weights to add image/audio, and kept text outputs identical, but the abstract gives no numbers to check if performance is actually competitive.

The main takeaway is the locked-towers setup: start with a strong text embedding model, bolt on frozen image and audio encoders, train only the connectors, and get multimodal output in the same space without touching the original text path. This keeps text embeddings exactly the same by design and cuts training cost sharply.

What the paper actually contributes is a practical recipe for extending an existing text model to more modalities with minimal change. The 0.35% figure and the claim of unchanged text behavior are concrete and reproducible in principle. For teams that already rely on Jina v5 text embeddings and want to add retrieval over images or audio without retraining everything, this is a straightforward engineering path.
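The 0.35% figure is a simple ratio, and checking it on a released checkpoint only requires per-component parameter counts. The sketch below shows the arithmetic. The component names and counts are invented so that the ratio lands on the quoted figure; they are not the real sizes of any jina-embeddings-v5-omni model.

```python
from typing import Dict, Iterable


def trainable_fraction(param_counts: Dict[str, int], trainable: Iterable[str]) -> float:
    """Share of all parameters that belong to the trainable components."""
    total = sum(param_counts.values())
    if total == 0:
        raise ValueError("model has no parameters")
    return sum(param_counts[name] for name in trainable) / total


# Hypothetical parameter counts, chosen only to illustrate the calculation.
param_counts = {
    "text_backbone": 600_000_000,  # frozen
    "image_tower": 300_000_000,    # frozen
    "audio_tower": 96_500_000,     # frozen
    "connectors": 3_500_000,       # trained
}

fraction = trainable_fraction(param_counts, ["connectors"])
print(f"{fraction:.2%}")  # prints 0.35% for these made-up counts
```

With a real checkpoint, the same function would be fed the counts reported by the model's own parameter listing.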

The soft spot is the performance claim. The abstract says the results are competitive with the state of the art and nearly equal to those of larger multimodal embedding models, yet it supplies no scores, no benchmarks, no baselines, and no dataset details. Without those, you cannot tell whether the alignment actually works at the level advertised or whether geometry preservation holds in practice. The architecture description is clear, but the empirical support is missing from the text provided.

This is aimed at applied retrieval and search work rather than theoretical embedding research. A reader who needs cheap multimodal extension and is willing to run their own checks would find the method description useful. The idea is solid enough on its own terms to deserve referee time so the full experiments can be examined.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces GELATO, a VLM-style architecture for creating multimodal embedding models (text, image, audio, video) by extending the Jina Embeddings v5 Text models. Non-text encoders and the text backbone remain frozen while only the connecting components (0.35% of total weights) are trained to align modalities; the authors claim this produces exactly unchanged text embeddings and competitive performance with larger state-of-the-art multimodal models.

Significance. If the empirical performance claims hold under rigorous evaluation, the approach would be significant for enabling efficient, low-cost extension of existing high-quality text embedding models to additional modalities without retraining or degrading the original text geometry.

major comments (1)

[Abstract] The central claim that 'GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models' is presented without any quantitative metrics, baselines, datasets, or evaluation protocol, so the data-to-claim link cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We agree that the central performance claim requires supporting quantitative evidence within the abstract itself to allow immediate assessment of the data-to-claim link.

Referee: [Abstract] The central claim that 'GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models' is presented without any quantitative metrics, baselines, datasets, or evaluation protocol, so the data-to-claim link cannot be assessed.

Authors: We agree that the abstract as written does not contain the requested quantitative support. The body of the manuscript reports full evaluations (including metrics, baselines, datasets, and protocols) demonstrating competitive performance against larger multimodal models while preserving exact text-embedding geometry. In the revised manuscript we will update the abstract to include specific key results (e.g., average scores on standard multimodal retrieval and classification benchmarks together with the primary baselines and evaluation settings) so that the claim is directly verifiable from the abstract.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical architecture (GELATO) that extends frozen Jina Embeddings v5 text models by adding non-text encoders and training only the connecting components (0.35% of weights). The statement that text embeddings remain exactly unchanged follows directly from the explicit design choice to freeze the language model backbone; it is not presented as a derived prediction or result. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the manuscript. Performance claims are framed as empirical outcomes measured on external benchmarks rather than as mathematical necessities, so no load-bearing step is circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities beyond the method name are stated in the abstract; the work relies on standard transfer-learning assumptions common to the domain.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models
cs.SD · 2026-06 · unverdicted · novelty 5.0

ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.

Reference graph

Works this paper leans on

47 extracted references · 23 canonical work pages · cited by 1 Pith paper · 11 internal anchors

[1] Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. 2026. jina-embeddings-v5-text: Task-Targeted Embedding Distillation. arXiv:2602.15547 [cs.CL] https://arxiv.org/abs/2602.15547

[2] Alibaba Tongyi Lab. 2024. gte-Qwen2: General Text Embeddings Based on Qwen2. Hugging Face model collection. https://huggingface.co/collections/Alibaba-NLP/gte-qwen2

[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

[4] Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, and Zhicheng Dou. e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings. arXiv:2601.03666 [cs.CL] https://arxiv.org/abs/2601.03666

[5] Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang. CoMP: Continual Multimodal Pre-training for Vision Foundation Models. arXiv:2503.18931 [cs.CV] https://arxiv.org/abs/2503.18931

[6] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Haojie Zhang, Zhijie Gu, Yuxuan Zhou, Jingren Zhou, Junyang Lin, and Chang Zhou. 2025. Qwen2.5-Omni Technical Report. arXiv:2503.20215 [cs.CL] https://arxiv.org/abs/2503.20215

[7] Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, and Kenneth Enevoldsen. 2026. MAEB: Massive Audio Embedding Benchmark. arXiv:2602.16008 [cs.SD] ...

[8] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP Learning Audio Concepts From Natural Language Supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing. 1–5

[9] Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao...

[10] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv:2303.15389 [cs.CV] https://arxiv.org/abs/2303.15389

[11] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One Embedding Space To Bind Them All. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15180–15190

[12] Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. 2024. E5-V: Universal Embeddings with Multimodal Large Language Models. arXiv:2407.12580 [cs.CL] https://arxiv.org/abs/2407.12580

[13] Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2025. MMEB: Massive Multi-discipline Multimodal Embedding Benchmark. arXiv:2410.05160 [cs.CV] https://arxiv.org/abs/2410.05160 Introduced with VLM2Vec

[14] Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, and Han Xiao. 2024. jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images. arXiv:2412.08802 [cs.CL] https://arxiv.org/abs/2412.08802

[15] Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, and Han Xiao. 2024. Jina CLIP: Your CLIP Model Is Also Your Text Retriever. arXiv:2405.20204 [cs.CL] https://arxiv.org/abs/2405.20204

[16] Aditya Kusupati, Ashish Bhatt, Matthew Wallingford, Aniruddha Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Jain, and Ali Farhadi. Matryoshka Representation Learning. In Advances in Neural Information Processing Systems

[17] Chien Van Lee, Rajarshi Roy, Mengting Xu, Jonathan Raiman, Mohammad Shoeybi, and Bryan Catanzaro. 2025. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv:2412.04252 [cs.CL] https://arxiv.org/abs/2412.04252

[18] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, Vol. 33. 9459–9474. https://proceedings.ne...

[19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the International Conference on Machine Learning, Vol. 202. PMLR, 19730–19742

[20] Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking. arXiv:2601.04720 [cs.CL] https://arxiv.org/abs/2601.04720

[21] Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., New Orleans, LA, USA, 17612–17625. arXiv:2203.02053 [cs.LG] https://arxiv.org/abs/2203.02053

[22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, Vol. 36

[23] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations

[24] Quentin Macé, António Loison, and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arXiv:2505.17166 [cs.IR] https://arxiv.org/abs/2505.17166

[25] Zach Nussbaum, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic Embed Vision: Expanding the Latent Space. arXiv:2406.18587 [cs.CV] https://arxiv.org/abs/2406.18587

[26] Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning, Vol. 139. PMLR, 8748–8763

[28] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. In International Conference on Machine Learning, Vol. 202. PMLR, 28492–28518

[29] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 3982–3992

[30] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1874–1883. https://openaccess.thecvf.com...

[31] Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, and Chao Zhang. WAVE: Learning Unified and Versatile Audio-Visual Embeddings with Multimodal LLM. In International Conference on Learning Representations. https://openreview.net/forum?id=MiV3WXDYJb

[32] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. 2025. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Featur...

[33] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024. Multilingual E5 Text Embeddings: A Technical Report. arXiv:2402.05672 [cs.CL] https://arxiv.org/abs/2402.05672

[34] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https://...

[35] Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, and Yu Rong. 2025. Scaling Language-Centric Omnimodal Representation Learning. arXiv:2510.11693 [cs.CL] https://arxiv.org/abs/2510.11693

[36] Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, and Niklas Muennighoff. 2025. MIEB: Massive Image Embedding Benchmark. arXiv:2504.10471 [cs.CV] https://arxiv.org/abs/2504.10471

[37] Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. In International Conference on Learning Representations. https://openreview.net/forum?id=zG459X3Xge

[38] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11975–11986

[39] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. 2022. LiT: Zero-Shot Transfer With Locked-image Text Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18123–18133

[40] Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. 2025. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs. arXiv:2412.16855 [cs.CL] https://arxiv.org/abs/2412.16855 Includes gme-Qwen2-VL checkpoints.

[1] [1]

Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. 2026. jina-embeddings-v5- text: Task-Targeted Embedding Distillation. arXiv:2602.15547 [cs.CL] https: //arxiv.org/abs/2602.15547

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Alibaba Tongyi Lab. 2024. gte-Qwen2: General Text Embeddings Based on Qwen2. Hugging Face model collection. https://huggingface.co/collections/Alibaba- NLP/gte-qwen2

2024

[3] [3]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, and Zhicheng Dou

[5] [5]

arXiv:2601.03666 [cs.CL] https://arxiv.org/abs/2601.03666

e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings. arXiv:2601.03666 [cs.CL] https://arxiv.org/abs/2601.03666

work page arXiv

[6] [6]

Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang

[7] [7]

arXiv:2503.18931 [cs.CV] https://arxiv.org/abs/2503.18931

CoMP: Continual Multimodal Pre-training for Vision Foundation Models. arXiv:2503.18931 [cs.CV] https://arxiv.org/abs/2503.18931

work page arXiv

[8] [8]

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Haojie Zhang, Zhijie Gu, Yux- uan Zhou, Jingren Zhou, Junyang Lin, and Chang Zhou. 2025. Qwen2.5-Omni Technical Report. arXiv:2503.20215 [cs.CL] https://arxiv.org/abs/2503.20215

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, and Kenneth Enevoldsen. 2026. MAEB: Massive Audio Embedding Benchmark. arXiv:2602.16008 [cs.SD] ...

work page arXiv 2026

[10] [10]

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang

[11] [11]

In IEEE International Conference on Acoustics, Speech and Signal Processing

CLAP Learning Audio Concepts From Natural Language Supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing. 1–5

[12] [12]

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao...

work page arXiv 2025

[13] [13]

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv:2303.15389 [cs.CV] https://arxiv.org/abs/ 2303.15389

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One Embedding Space To Bind Them All. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15180–15190

2023

[15] [15]

Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. 2024. E5-V: Universal Embeddings with Multimodal Large Language Models. arXiv:2407.12580 [cs.CL] https://arxiv. org/abs/2407.12580

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2025. MMEB: Massive Multi-discipline Multimodal Embedding Bench- mark. arXiv:2410.05160 [cs.CV] https://arxiv.org/abs/2410.05160 Introduced with VLM2Vec

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Moham- mad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, and Han Xiao. 2024. jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images. arXiv:2412.08802 [cs.CL] https://arxiv.org/abs/2412.08802

work page arXiv 2024

[18] [18]

Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, and Han Xiao. 2024. Jina CLIP: Your CLIP Model Is Also Your Text Retriever. arXiv:2405.20204 [cs.CL] https://arxiv.org/abs/2405.20204

work page arXiv 2024

[19] [19]

Aditya Kusupati, Ashish Bhatt, Matthew Wallingford, Aniruddha Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Jain, and Ali Farhadi. 8

[20] [20]

InAdvances in Neural Information Processing Systems

Matryoshka Representation Learning. InAdvances in Neural Information Processing Systems

[21] [21]

Chien Van Lee, Rajarshi Roy, Mengting Xu, Jonathan Raiman, Mohammad Shoeybi, and Bryan Catanzaro. 2025. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv:2412.04252 [cs.CL] https://arxiv.org/abs/2412.04252

work page arXiv 2025

[22] [22]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Information Processing Systems, Vol. 33. 9459–9474. https://proceedings.ne...

2020

[23] [23]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Boot- strapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. InProceedings of the International Conference on Machine Learning, Vol. 202. PMLR, 19730–19742

2023

[24]

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking. arXiv:2601.04720 [cs.CL] https://arxiv.org/abs/2601.04720

[26]

Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., New Orleans, LA, USA, 17612–17625. arXiv:2203.02053 [cs.LG] https://arxiv.org/abs/2203.02053

[28]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, Vol. 36

[29]

Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations

[30]

Quentin Macé, António Loison, and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arXiv:2505.17166 [cs.IR] https://arxiv.org/abs/2505.17166

[31]

Zach Nussbaum, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic Embed Vision: Expanding the Latent Space. arXiv:2406.18587 [cs.CV] https://arxiv.org/abs/2406.18587

[32]

Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

[33]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning, Vol. 139. PMLR, 8748–8763

[34]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. In International Conference on Machine Learning, Vol. 202. PMLR, 28492–28518

[35]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 3982–3992

[36]

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1874–1883. https://openaccess.thecvf.com...

[37]

Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, and Chao Zhang. WAVE: Learning Unified and Versatile Audio-Visual Embeddings with Multimodal LLM. In International Conference on Learning Representations. https://openreview.net/forum?id=MiV3WXDYJb

[39]

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. 2025. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Featur...

[40]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024. Multilingual E5 Text Embeddings: A Technical Report. arXiv:2402.05672 [cs.CL] https://arxiv.org/abs/2402.05672

[41]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https://...

[42]

Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, and Yu Rong. 2025. Scaling Language-Centric Omnimodal Representation Learning. arXiv:2510.11693 [cs.CL] https://arxiv.org/abs/2510.11693

[43]

Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, and Niklas Muennighoff. 2025. MIEB: Massive Image Embedding Benchmark. arXiv:2504.10471 [cs.CV] https://arxiv.org/abs/2504.10471

[44]

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. In International Conference on Learning Representations. https://openreview.net/forum?id=zG459X3Xge

[45]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11975–11986

[46]

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. 2022. LiT: Zero-Shot Transfer With Locked-image Text Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18123–18133

[47]

Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. 2025. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs. arXiv:2412.16855 [cs.CL] https://arxiv.org/abs/2412.16855 Includes gme-Qwen2-VL checkpoints.