arxiv: 2605.08384 · v2 · submitted 2026-05-08 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

Florian H\"onicke , Michael G\"unther , Andreas Koukounas , Kalim Akram , Scott Martens , Saba Sturua , Han Xiao

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords multimodal embeddingsfrozen encoderscross-modal alignmentefficient trainingsemantic geometrytext embeddingsimage audio video

0 comments

The pith

GELATO extends existing text embedding models to images, audio and video by freezing nearly all weights and training only the connectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GELATO as a way to add multimodal support to strong text embedding models without retraining everything from scratch. It keeps the original text models and new modality encoders frozen, training only the small set of connecting components that link them into one shared space. This leaves text embeddings exactly unchanged while adding the ability to embed images, audio and video alongside text. The resulting models reach performance levels close to much larger multimodal systems. A reader would care because the method shows how to expand embedding capabilities with far lower compute cost and without sacrificing prior text quality.

Core claim

GELATO produces results that are competitive with the state-of-the-art by extending the Jina Embeddings v5 Text models with frozen non-text encoders for images and audio, training only the connecting components that represent 0.35 percent of total weights, and leaving the language model unaltered so it generates exactly the same embeddings for text inputs as the base models.

What carries the argument

Locked aligned towers consisting of frozen backbone text embedding models and frozen non-text modality encoders whose outputs are aligned into a shared semantic space through newly trained connecting components.

If this is right

Text inputs continue to produce identical embeddings to the original Jina Embeddings v5 Text models.
Training cost drops sharply because only 0.35 percent of the weights are updated.
Images, audio, and video can be encoded directly into the same semantic space as text.
Performance stays nearly equal to larger multimodal embedding models on standard evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same locked-tower pattern could be applied to other strong text embedding models to add new modalities quickly.
If connectors alone can align modalities without touching the core, future work might explore adding even more input types with minimal extra training.
Preservation of text geometry suggests that semantic relationships already learned in text can serve as stable anchors for cross-modal mapping.

Load-bearing premise

Freezing the backbone text embedding models and non-text modality encoders while training only the connecting components will preserve semantic geometry and enable effective cross-modal alignment without degrading original text performance.

What would settle it

A side-by-side test on the original text-only benchmarks showing that GELATO scores drop more than a few points below the base Jina Embeddings v5 Text models, or that its multimodal scores fall well below those of larger comparable models.

Figures

Figures reproduced from arXiv: 2605.08384 by Andreas Koukounas, Florian H\"onicke, Han Xiao, Kalim Akram, Michael G\"unther, Saba Sturua, Scott Martens.

**Figure 2.** Figure 2: Architecture of jina-embeddings-v5-omni (jina-embeddings-v5-omni-small shown; jina-embeddings-v5-omni-nano uses a smaller ViT and LLaVA-style tokens). Frozen towers feed trainable modality projectors into the frozen text backbone; task-specific exports select one projector/delimiter set and the matching LoRA adapter. models such as E5-Mistral [33] and NV-Embed [17]. Jina Embeddings v5 Text [1] draws on th… view at source ↗

**Figure 3.** Figure 3: Distribution of input tokens across semantic data types, averaged over the four task-specific checkpoints. 5 Evaluation We describe each evaluation suite by the types of tasks it covers: • Images: The Massive Image Embedding Benchmark (MIEB) [36] covers classification, clustering, visual semantic textual similarity (STS), retrieval, document retrieval, compositional reasoning, and vision-centric tasks. • … view at source ↗

**Figure 5.** Figure 5: Per-language audio retrieval. Tiles show [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 4.** Figure 4: XM3600 image-language comparison. Tiles show [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 6.** Figure 6: Vision ablations tests on CIRR-IT2I and NIGHTS [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Audio ablation tests on UrbanSound8K, Common [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 6.** Figure 6: Vision ablation tests on CIRR-IT2I and NIGHTS [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 8.** Figure 8: Matryoshka prefix tests across modalities. Curves [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

read the original abstract

In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modality by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GELATO shows a practical low-cost way to add modalities to text embeddings by freezing backbones and training only connectors, but the competitiveness claim needs the actual numbers to hold up.

read the letter

The main point is that they extend the Jina v5 text models to handle images, audio, and video by adding modality encoders, then training just the connectors between them and the frozen language model. This keeps the text embeddings exactly the same as before while using only 0.35% of the total weights for training. That efficiency angle is the real contribution here, and it is a sensible engineering move for anyone who already relies on the original text model and wants to avoid full retraining or risk changing existing outputs. The locked-tower setup directly addresses the goal of geometry preservation, which is a clear step beyond naive adapter methods that often require more updates. The architecture itself follows VLM patterns but narrows the trainable part to the connectors, which is a clean implementation choice. The soft spot is the evidence. The abstract states that results are competitive with larger multimodal models, yet the summary provides no metrics, baselines, error bars, or ablation tables to back that up. The stress-test concern lands: if the frozen non-text encoders produce features that sit far from what the language model expects, the small connectors may only achieve loose alignment, and we would see that in cross-modal retrieval or zero-shot scores. Without those details it is difficult to judge whether the geometry is truly preserved in a useful way or if text performance stayed flat in practice. This is aimed at teams building retrieval or search systems who need multimodal support without high compute costs. A reader working on embedding deployment would find the method description useful even if they have to wait for the full results. It deserves peer review because the core idea is straightforward, reproducible in principle, and addresses a real deployment pain point.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a VLM-style architecture that extends the Jina Embeddings v5 text models to multimodal inputs (text, image, audio, video) by adding frozen non-text encoders and training only the connecting components (0.35% of total weights). The language model backbone remains locked, so text embeddings are identical to the original Jina v5 models. The central claim is that this yields embeddings competitive with larger state-of-the-art multimodal models while preserving semantic geometry.

Significance. If the empirical claims are substantiated, the work would be significant for efficient multimodal embedding development: it demonstrates a low-cost way to add modalities without full-parameter retraining or degradation of existing text performance. The geometry-preservation emphasis and the explicit parameter count (0.35%) are strengths that could influence practical deployment in retrieval and zero-shot tasks.

major comments (2)

[Abstract] Abstract: The assertion that 'our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models' is presented without any quantitative metrics, baselines, tables, error bars, or evaluation protocols. This absence is load-bearing for the central claim of competitiveness and must be addressed with concrete results.
[Method / Training] Method description (training procedure): The claim that freezing the text embedding models and non-text encoders while training only the connectors preserves semantic geometry and enables effective cross-modal alignment lacks supporting ablations. No experiments are referenced that isolate the connectors' contribution, confirm unchanged text-only metrics, or demonstrate that the frozen encoders' feature distributions are successfully mapped into the LM's input space.

minor comments (1)

[Evaluations] The manuscript should include a dedicated evaluation section with explicit task definitions (e.g., cross-modal retrieval, zero-shot classification), datasets, and comparison models to allow readers to assess the 'nearly equal performance' statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and valuable suggestions. We address the major comments point-by-point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion that 'our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models' is presented without any quantitative metrics, baselines, tables, error bars, or evaluation protocols. This absence is load-bearing for the central claim of competitiveness and must be addressed with concrete results.

Authors: We agree that the abstract would be strengthened by including quantitative support. The full paper presents comprehensive evaluations in Sections 4 and 5, including tables with metrics on multiple benchmarks (e.g., image-text retrieval on COCO, audio classification on AudioSet), where GELATO matches or approaches SOTA models within small margins. We will revise the abstract to include key results such as specific accuracy or recall figures and reference the evaluation protocols used. revision: yes
Referee: [Method / Training] Method description (training procedure): The claim that freezing the text embedding models and non-text encoders while training only the connectors preserves semantic geometry and enables effective cross-modal alignment lacks supporting ablations. No experiments are referenced that isolate the connectors' contribution, confirm unchanged text-only metrics, or demonstrate that the frozen encoders' feature distributions are successfully mapped into the LM's input space.

Authors: The design ensures unchanged text metrics because the text embedding model is completely frozen and not updated during training; we explicitly verify and report this in the results section by comparing text-only performance before and after adding the multimodal components. For the alignment, the connectors are trained with a contrastive loss that maps non-text features into the text embedding space. We recognize that dedicated ablations would better isolate the connectors' role and visualize the mapping. We will add such ablations, including a comparison of performance with and without training the connectors, and analysis of embedding similarities. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on empirical evaluation of frozen-tower architecture

full rationale

The manuscript describes an engineering construction (GELATO) in which text and non-text encoders are frozen while only 0.35 % connecting weights are trained. The central assertions—preservation of text geometry and competitive multimodal performance—are presented as outcomes of this training procedure and are justified by reported benchmark numbers rather than by any equation, fitted parameter, or self-citation that reduces the claimed result to the input data by construction. No mathematical derivations appear; the single self-reference to prior Jina v5 text models is merely the frozen starting point and does not carry the load of proving the new cross-modal alignment. Consequently the derivation chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that VLM-style architectures can be extended via frozen encoders and small connectors without loss of geometric properties; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption VLM-style architecture allows non-text encoders to produce inputs for a language model that generates embeddings for all modalities
Invoked in the description of building on VLM-style architecture with added frozen encoders.

pith-pipeline@v0.9.0 · 5531 in / 1237 out tokens · 81900 ms · 2026-05-13T06:48:58.974981+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The backbone text embedding models and the added non-text modality encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights
IndisputableMonolith/Cost/FunctionalEquation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 9 internal anchors

[1]

Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. 2026. jina-embeddings-v5- text: Task-Targeted Embedding Distillation. arXiv:2602.15547 [cs.CL] https: //arxiv.org/abs/2602.15547

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Alibaba Tongyi Lab. 2024. gte-Qwen2: General Text Embeddings Based on Qwen2. Hugging Face model collection. https://huggingface.co/collections/Alibaba- NLP/gte-qwen2

work page 2024
[3]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, and Zhicheng Dou

work page
[5]

arXiv:2601.03666 [cs.CL] https://arxiv.org/abs/2601.03666

e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings. arXiv:2601.03666 [cs.CL] https://arxiv.org/abs/2601.03666

work page arXiv
[6]

Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang

work page
[7]

arXiv:2503.18931 [cs.CV] https://arxiv.org/abs/2503.18931

CoMP: Continual Multimodal Pre-training for Vision Foundation Models. arXiv:2503.18931 [cs.CV] https://arxiv.org/abs/2503.18931

work page arXiv
[8]

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Haojie Zhang, Zhijie Gu, Yux- uan Zhou, Jingren Zhou, Junyang Lin, and Chang Zhou. 2025. Qwen2.5-Omni Technical Report. arXiv:2503.20215 [cs.CL] https://arxiv.org/abs/2503.20215

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, and Kenneth Enevoldsen. 2026. MAEB: Massive Audio Embedding Benchmark. arXiv:2602.16008 [cs.SD] ...

work page arXiv 2026
[10]

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang

work page
[11]

In IEEE International Conference on Acoustics, Speech and Signal Processing

CLAP Learning Audio Concepts From Natural Language Supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing. 1–5

work page
[12]

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao...

work page arXiv 2025
[13]

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv:2303.15389 [cs.CV] https://arxiv.org/abs/ 2303.15389

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One Embedding Space To Bind Them All. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15180–15190

work page 2023
[15]

Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. 2024. E5-V: Universal Embeddings with Multimodal Large Language Models. arXiv:2407.12580 [cs.CL] https://arxiv. org/abs/2407.12580

work page arXiv 2024
[16]

Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2025. MMEB: Massive Multi-discipline Multimodal Embedding Bench- mark. arXiv:2410.05160 [cs.CV] https://arxiv.org/abs/2410.05160 Introduced with VLM2Vec

work page arXiv 2025
[17]

Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Moham- mad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, and Han Xiao. 2024. jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images. arXiv:2412.08802 [cs.CL] https://arxiv.org/abs/2412.08802

work page arXiv 2024
[18]

Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, and Han Xiao. 2024. Jina CLIP: Your CLIP Model Is Also Your Text Retriever. arXiv:2405.20204 [cs.CL] https://arxiv.org/abs/2405.20204

work page arXiv 2024
[19]

Aditya Kusupati, Ashish Bhatt, Matthew Wallingford, Aniruddha Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Jain, and Ali Farhadi. 8

work page
[20]

InAdvances in Neural Information Processing Systems

Matryoshka Representation Learning. InAdvances in Neural Information Processing Systems

work page
[21]

Chien Van Lee, Rajarshi Roy, Mengting Xu, Jonathan Raiman, Mohammad Shoeybi, and Bryan Catanzaro. 2025. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv:2412.04252 [cs.CL] https://arxiv.org/abs/2412.04252

work page arXiv 2025
[22]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Information Processing Systems, Vol. 33. 9459–9474. https://proceedings.ne...

work page 2020
[23]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Boot- strapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. InProceedings of the International Conference on Machine Learning, Vol. 202. PMLR, 19730–19742

work page 2023
[24]

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin

work page
[25]

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking. arXiv:2601.04720 [cs.CL] https://arxiv.org/abs/2601.04720

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou

work page
[27]

InAdvances in Neural Information Processing Systems, Vol

Mind the Gap: Understanding the Modality Gap in Multi-modal Con- trastive Representation Learning. InAdvances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., New Orleans, LA, USA, 17612–17625. arXiv:2203.02053 [cs.LG] https://arxiv.org/abs/2203.02053

work page arXiv
[28]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruc- tion Tuning. InAdvances in Neural Information Processing Systems, Vol. 36

work page 2023
[29]

Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. InInternational Conference on Learning Representations

work page 2019
[30]

Quentin Macé, António Loison, and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arXiv:2505.17166 [cs.IR] https://arxiv. org/abs/2505.17166

work page arXiv 2025
[31]

Zach Nussbaum, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic Embed Vision: Expanding the Latent Space. arXiv:2406.18587 [cs.CV] https://arxiv.org/ abs/2406.18587

work page arXiv 2024
[32]

Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen. ai/blog?id=qwen3.5

work page 2026
[33]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. InInternational Conference on Machine Learning, Vol. 139. PMLR, 8748–8763

work page 2021
[34]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. InInternational Conference on Machine Learning, Vol. 202. PMLR, 28492–28518

work page 2023
[35]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics, 3982–3992

work page 2019
[36]

Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1874–1883. https://openaccess.thecvf.com...

work page 2016
[37]

Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, and Chao Zhang

work page
[38]

InInternational Conference on Learning Representations

WAVE: Learning Unified and Versatile Audio-Visual Embeddings with Multimodal LLM. InInternational Conference on Learning Representations. https: //openreview.net/forum?id=MiV3WXDYJb

work page
[39]

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. 2025. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Featur...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024. Multilingual E5 Text Embeddings: A Technical Report. arXiv:2402.05672 [cs.CL] https://arxiv.org/abs/2402.05672

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https://...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, and Yu Rong. 2025. Scaling Language-Centric Omnimodal Representation Learning. arXiv:2510.11693 [cs.CL] https://arxiv.org/abs/2510.11693

work page arXiv 2025
[43]

Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Már- ton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, and Niklas Muennighoff. 2025. MIEB: Massive Image Embedding Benchmark. arXiv:2504.10471 [cs.CV] https://arxiv.org/abs/2504.10471

work page arXiv 2025
[44]

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zheng- hao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. In International Conference on Learning Representations. https://openreview.net/ forum?id=zG459X3Xge

work page 2025
[45]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sig- moid Loss for Language Image Pre-Training. InProceedings of the IEEE/CVF International Conference on Computer Vision. 11975–11986

work page 2023
[46]

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexan- der Kolesnikov, and Lucas Beyer. 2022. LiT: Zero-Shot Transfer With Locked- image Text Tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18123–18133

work page 2022
[47]

Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. 2025. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs. arXiv:2412.16855 [cs.CL] https://arxiv.org/abs/2412.16855 Includes gme-Qwen2-VL checkpoints. 9

work page internal anchor Pith review arXiv 2025