
arxiv: 2509.18095 · v2 · submitted 2025-09-22 · 💻 cs.IR · cs.CL · cs.CV

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

Pith reviewed 2026-05-18 14:05 UTC · model grok-4.3

classification 💻 cs.IR · cs.CL · cs.CV
keywords multimodal retrieval · meta tokens · matryoshka training · late interaction · test-time scaling · multi-vector embeddings · information retrieval

The pith

MetaEmbed lets users choose how many Meta Tokens to use at test time, trading multimodal retrieval quality against speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MetaEmbed as a framework that appends a fixed number of learnable Meta Tokens to the input sequence of multimodal models during training. At test time the last-layer representations of these tokens become compact multi-vector embeddings for late-interaction retrieval. Matryoshka Multi-Vector Retrieval training teaches the model to spread information by granularity across the tokens so that any smaller subset still supports useful retrieval. This organization removes the need to retrain when a user wants a different quality-efficiency tradeoff. Experiments on MMEB and ViDoRe show state-of-the-art results that hold when the underlying model grows to 32 billion parameters.

Core claim

MetaEmbed appends a fixed set of learnable Meta Tokens to each input. Their contextualized representations serve as multi-vector embeddings. Matryoshka Multi-Vector Retrieval training arranges semantic content by granularity across these vectors, so any prefix or subset of the tokens remains effective for indexing and late-interaction scoring without retraining.
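
A minimal sketch of how prefix-truncated late interaction could be scored, assuming a ColBERT-style MaxSim (for each query vector, take the best dot product against the candidate vectors, then sum). The names `truncate` and `maxsim_score`, the toy vectors, and the choice of MaxSim itself are illustrative assumptions; this page does not state the paper's exact scoring function.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def truncate(vectors: List[Vector], k: int) -> List[Vector]:
    """Keep only the first k Meta Token vectors (a prefix)."""
    if k < 1 or k > len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    return vectors[:k]


def maxsim_score(query: List[Vector], candidate: List[Vector]) -> float:
    """Assumed late interaction: for each query vector, take its
    best-matching candidate vector, then sum those maxima."""
    return sum(max(dot(q, c) for c in candidate) for q in query)


# Hypothetical toy embeddings: 4 Meta Token vectors per side, 2 dimensions.
query_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
cand_vecs = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.4], [0.3, 0.7]]

# Full budget versus a cheaper budget (1 query vector, 2 candidate vectors).
full = maxsim_score(query_vecs, cand_vecs)
cheap = maxsim_score(truncate(query_vecs, 1), truncate(cand_vecs, 2))
```

The cheaper call touches 1 x 2 vector pairs instead of 4 x 4, which is where the latency saving would come from under this assumed scoring.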

What carries the argument

Matryoshka Multi-Vector Retrieval training that distributes information by granularity across the fixed Meta Tokens so subsets stay useful for retrieval.
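
One way such an objective could look is a sum of contrastive (InfoNCE) terms, one per nested prefix size, so that each prefix has to rank the positive candidate first on its own. The sketch below is an assumption for illustration only: the prefix sizes, the temperature, the equal weighting of terms, and the MaxSim scoring are not taken from this page.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
MultiVec = List[Vector]


def maxsim(query: MultiVec, cand: MultiVec) -> float:
    """Assumed late-interaction score: sum over query vectors of the best dot product."""
    return sum(max(sum(a * b for a, b in zip(q, c)) for c in cand) for q in query)


def info_nce(scores: List[float], positive: int, temperature: float) -> float:
    """Negative log-softmax probability of the positive candidate."""
    logits = [s / temperature for s in scores]
    top = max(logits)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[positive]


def matryoshka_loss(
    query: MultiVec,
    candidates: List[MultiVec],
    positive: int,
    prefix_sizes: List[Tuple[int, int]],
    temperature: float = 1.0,
) -> float:
    """Sum one contrastive term per (query_prefix, candidate_prefix) pair."""
    total = 0.0
    for kq, kc in prefix_sizes:
        scores = [maxsim(query[:kq], c[:kc]) for c in candidates]
        total += info_nce(scores, positive, temperature)
    return total


# Hypothetical toy batch: one query, two candidates, two Meta Tokens each.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    [[1.0, 0.0], [0.0, 1.0]],  # matches the query
    [[0.0, 1.0], [0.0, 1.0]],  # distractor
]
sizes = [(1, 1), (2, 2)]  # assumed nested prefix sizes

loss_correct = matryoshka_loss(query, candidates, positive=0, prefix_sizes=sizes)
loss_wrong = matryoshka_loss(query, candidates, positive=1, prefix_sizes=sizes)
```

Because the smallest prefix contributes its own loss term, the first vector cannot rely on later vectors to separate the positive from the distractor, which is the intuition behind a coarse-to-fine ordering.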

If this is right

  • Retrieval pipelines can choose a smaller token count at query time to reduce latency while retaining acceptable accuracy.
  • The same trained model works across efficiency regimes without additional fine-tuning.
  • Performance scales to 32B-parameter backbones on both MMEB and ViDoRe benchmarks.
  • Index size and interaction cost can be traded off by selecting different numbers of tokens per collection or query.
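
The trade-off in these bullets can be made concrete with back-of-the-envelope arithmetic. The sketch assumes uncompressed vectors at a fixed byte width and brute-force MaxSim scoring; the function names and the example numbers (collection size, dimension, token counts) are hypothetical and are not figures from the paper.

```python
def index_bytes(
    num_candidates: int,
    tokens_per_candidate: int,
    dim: int,
    bytes_per_value: int = 2,  # assumes 16-bit floats, no compression
) -> int:
    """Raw storage for a multi-vector index."""
    return num_candidates * tokens_per_candidate * dim * bytes_per_value


def maxsim_multiply_adds(query_tokens: int, candidate_tokens: int, dim: int) -> int:
    """Multiply-adds to score one query against one candidate by brute force."""
    return query_tokens * candidate_tokens * dim


# Hypothetical collection: one million candidates, 1024-dimensional vectors.
small = index_bytes(1_000_000, tokens_per_candidate=4, dim=1024)
large = index_bytes(1_000_000, tokens_per_candidate=64, dim=1024)
ratio = large // small
```

Under these assumptions storage grows linearly in the number of tokens kept per candidate, while per-pair scoring cost grows with the product of query and candidate token counts.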

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production systems could adapt the token count per query according to current load or user priority.
  • The same training pattern might transfer to text-only retrieval or other late-interaction tasks.
  • Vector storage requirements could be lowered by indexing less critical items with fewer tokens.
  • Automatic policies might learn to pick the right token count from query features alone.

Load-bearing premise

Matryoshka training succeeds in organizing the Meta Tokens so any smaller number of them still produces useful retrieval signals without retraining.

What would settle it

An experiment showing that retrieval accuracy drops sharply when using only the first half of the Meta Tokens on MMEB, compared with using all tokens, would falsify the flexible scaling claim.
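
A check of this kind could be scripted roughly as below. The sketch assumes prefix truncation on both the query and candidate sides, MaxSim scoring, and recall@1 as the metric. The toy embeddings are hand-built so that the first vector alone is ambiguous; they are not benchmark data, and all names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
MultiVec = List[Vector]


def maxsim(query: MultiVec, cand: MultiVec) -> float:
    """Assumed late-interaction score: sum over query vectors of the best dot product."""
    return sum(max(sum(a * b for a, b in zip(q, c)) for c in cand) for q in query)


def recall_at_1(queries: List[MultiVec], candidates: List[MultiVec],
                gold: List[int], k: int) -> float:
    """Fraction of queries whose top-scored candidate is the gold one,
    using only the first k vectors on both sides."""
    hits = 0
    for q, g in zip(queries, gold):
        scores = [maxsim(q[:k], c[:k]) for c in candidates]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == g)
    return hits / len(queries)


def recall_by_prefix(queries: List[MultiVec], candidates: List[MultiVec],
                     gold: List[int], ks: List[int]) -> Dict[int, float]:
    """Recall@1 for each prefix length in ks."""
    return {k: recall_at_1(queries, candidates, gold, k) for k in ks}


# Toy data: two candidates whose first vectors are nearly identical.
candidates = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.9, 0.1], [0.0, 1.0]],
]
queries = [
    [[1.0, 0.0], [1.0, 0.0]],  # gold candidate 0
    [[1.0, 0.0], [0.0, 1.0]],  # gold candidate 1
]
gold = [0, 1]

results = recall_by_prefix(queries, candidates, gold, ks=[1, 2])
```

In this toy case the half-length prefix loses one of the two queries. On real data, what counts as a sharp drop would need a threshold chosen in advance.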

Original abstract

Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters. Code is available at https://github.com/facebookresearch/MetaEmbed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces MetaEmbed, a multimodal retrieval framework that appends a fixed number of learnable Meta Tokens to the input sequence during training. The last-layer contextualized representations of these tokens are used as compact multi-vector embeddings at test time. The Matryoshka Multi-Vector Retrieval training is claimed to organize information by granularity across the Meta Tokens, enabling users to select the number of tokens for indexing and retrieval interactions to trade off quality against efficiency. The work reports state-of-the-art results on the Massive Multimodal Embedding Benchmark (MMEB) and Visual Document Retrieval Benchmark (ViDoRe) while scaling to 32B-parameter backbones.

Significance. If the granularity organization holds, the approach would offer a meaningful advance by providing test-time flexibility in multi-vector retrieval without retraining, addressing the expressiveness-efficiency trade-off at scale. The reported scaling to 32B models and SOTA benchmark results would be notable strengths if supported by robust verification of the core mechanism.

major comments (1)
  1. [Abstract and Matryoshka Multi-Vector Retrieval training description] The central claim that Matryoshka Multi-Vector Retrieval training produces ordered, prefix-usable representations (allowing arbitrary subsets of Meta Tokens to remain effective at test time) is load-bearing but not directly verified. No ablation compares retrieval metrics such as recall or mAP when using the first k Meta Tokens versus a random k-subset (or versus the full set), which is required to confirm hierarchical organization rather than incidental post-hoc usability.
minor comments (2)
  1. [Evaluation on MMEB and ViDoRe] The evaluation sections would be strengthened by reporting error bars or results across multiple random seeds, particularly for the scaling experiments with 32B models and varying token counts.
  2. [Method] Clarify the precise architectural integration of the Meta Tokens (e.g., whether they are added only to the query, the candidate, or both) and how late interaction is computed when using partial token sets.
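
The ablation requested in the major comment could be set up roughly as follows. The sketch assumes the same token positions are selected on the query and candidate sides and that scoring is MaxSim; minor comment 2 notes that the actual handling of partial token sets is not specified, so both choices are assumptions, as are the helper names.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]
MultiVec = List[Vector]


def prefix_indices(k: int) -> List[int]:
    """Positions of the first k Meta Tokens."""
    return list(range(k))


def random_indices(total: int, k: int, seed: int) -> List[int]:
    """A reproducible random k-subset of token positions, in ascending order."""
    return sorted(random.Random(seed).sample(range(total), k))


def maxsim(query: MultiVec, cand: MultiVec) -> float:
    """Assumed late-interaction score: sum over query vectors of the best dot product."""
    return sum(max(sum(a * b for a, b in zip(q, c)) for c in cand) for q in query)


def score_subset(query: MultiVec, cand: MultiVec, indices: List[int]) -> float:
    """Score using the same token positions on both sides (an assumption)."""
    q = [query[i] for i in indices]
    c = [cand[i] for i in indices]
    return maxsim(q, c)


# Hypothetical embeddings with 4 Meta Tokens; compare the two selection rules.
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
cand = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

prefix_score = score_subset(query, cand, prefix_indices(2))
random_score = score_subset(query, cand, random_indices(len(query), 2, seed=0))
```

Repeating the random draw over several seeds and reporting the mean retrieval metric next to the prefix result would show whether position in the sequence matters.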

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review. The feedback highlights an important verification point for our central claim, and we address it directly below.

Point-by-point responses
  1. Referee: [Abstract and Matryoshka Multi-Vector Retrieval training description] The central claim that Matryoshka Multi-Vector Retrieval training produces ordered, prefix-usable representations (allowing arbitrary subsets of Meta Tokens to remain effective at test time) is load-bearing but not directly verified. No ablation compares retrieval metrics such as recall or mAP when using the first k Meta Tokens versus a random k-subset (or versus the full set), which is required to confirm hierarchical organization rather than incidental post-hoc usability.

    Authors: We appreciate the referee's emphasis on directly verifying the hierarchical organization induced by Matryoshka Multi-Vector Retrieval training. Our training objective explicitly encourages the Meta Tokens to capture information at progressively finer granularities, with earlier tokens handling coarser semantics and later tokens adding detail; this design is reflected in the consistent performance gains observed when scaling the number of tokens at test time on MMEB and ViDoRe. We acknowledge, however, that the original manuscript did not include an explicit ablation contrasting the first-k prefix against a random k-subset (or the full set) using recall or mAP. Such a comparison would indeed strengthen the evidence that the ordering is not incidental. We will add this ablation to the revised manuscript.

    revision: yes

Circularity Check

0 steps flagged

No circularity: training objective and test-time selection are independently motivated

Full rationale

The paper introduces a training procedure called Matryoshka Multi-Vector Retrieval that is explicitly designed to produce granularity-organized representations across a fixed set of learnable Meta Tokens. The test-time flexibility (selecting k tokens for indexing and interaction) is presented as a direct consequence of that training rather than a redefinition or self-referential fit. No equation or claim reduces the claimed property to a fitted parameter that is then relabeled as a prediction; the organization by granularity is an empirical outcome of the loss, not a definitional tautology. The derivation chain therefore remains self-contained against external benchmarks such as MMEB and ViDoRe, with no load-bearing self-citation or ansatz smuggling required for the central result.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claim rests on the assumption that appending a fixed number of learnable tokens and training them with a Matryoshka objective produces independently useful partial embeddings; no explicit free parameters beyond the token count and model size are named, and no new physical entities are introduced.

free parameters (1)
  • number of Meta Tokens
    A fixed count of learnable special tokens appended to every input; the count is chosen at training time and sets the maximum number of vectors available at test time.
invented entities (1)
  • Meta Tokens (no independent evidence)
    purpose: Learnable tokens whose contextualized representations serve as the multi-vector embedding
    New tokens introduced by the framework; no independent evidence outside the training procedure is provided in the abstract.




Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...

  2. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.

  3. Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

  4. CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

    cs.CL 2026-01 unverdicted novelty 6.0

    CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.

  5. FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

    cs.AI 2026-05 unverdicted novelty 5.0

    FLUID retires candidate-side item IDs in production livestream rankers via cross-domain multimodal hierarchical codes and late-fusion ID-free design, reporting online gains of +0.55% Quality Watch Duration and +2.05% ...

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 4 Pith papers · 12 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Phi-3 technical report: A highly capable language model locally on your phone, 2024

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming - Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical repo...

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, Andr \'e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  5. [5]

    Matryoshka multimodal models

    Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models. In The Thirteenth International Conference on Learning Representations, 2025

  6. [6]

    Moca: Modality-aware continual pre-training makes better bidirectional multimodal embeddings

    Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, and Zhicheng Dou. Moca: Modality-aware continual pre-training makes better bidirectional multimodal embeddings. arXiv preprint arXiv:2506.23115, 2025 a

  7. [7]

    mme5: Improving multimodal multilingual embeddings via high-quality synthetic data

    Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, and Zhicheng Dou. mme5: Improving multimodal multilingual embeddings via high-quality synthetic data. CoRR, abs/2502.08468, 2025 b . doi:10.48550/ARXIV.2502.08468

  8. [8]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  9. [9]

    Meta clip 2: A worldwide scaling recipe

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. Meta clip 2: A worldwide scaling recipe. arXiv preprint arXiv:2507.22062, 2025

  10. [10]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024

  11. [11]

    DESSERT : An efficient algorithm for vector set search with vector set queries

    Joshua Engels, Benjamin Coleman, Vihan Lakshman, and Anshumali Shrivastava. DESSERT : An efficient algorithm for vector set search with vector set queries. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  12. [12]

    Vse++: Improving visual-semantic embeddings with hard negatives

    Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives. 2018

  13. [13]

    Colpali: Efficient document retrieval with vision language models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, C \' e line Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.net, 2025

  14. [14]

    Devise: A deep visual-semantic embedding model

    Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013

  15. [15]

    Deep image retrieval: Learning global representations for image search

    Albert Gordo, Jon Almaz \'a n, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In European conference on computer vision, pages 241--257. Springer, 2016

  16. [16]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  17. [17]

    Breaking the modality barrier: Universal embedding learning with multimodal llms

    Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, and Jiankang Deng. Breaking the modality barrier: Universal embedding learning with multimodal llms. arXiv preprint arXiv:2504.17432, 2025

  18. [18]

    Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P

    Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  19. [19]

    jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval, 2025

    Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, and Han Xiao. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval, 2025

  20. [20]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen - Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen - Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022

  21. [21]

    Learning answer embeddings for visual question answering

    Hexiang Hu, Wei-Lun Chao, and Fei Sha. Learning answer embeddings for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5428--5436, 2018

  22. [22]

    Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities

    Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming-Wei Chang. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12065--12075, October 2023

  23. [23]

    Mimicking or reasoning: Rethinking multi-modal in-context learning in vision-language models

    Chengyue Huang, Yuchen Zhu, Sichen Zhu, Jingyun Xiao, Moises Andrade, Shivang Chopra, and Zsolt Kira. Mimicking or reasoning: Rethinking multi-modal in-context learning in vision-language models. arXiv preprint arXiv:2506.07936, 2025

  24. [24]

    Muvera: Multi-vector retrieval via fixed dimensional encoding

    Rajesh Jayaram, Laxman Dhulipala, Majid Hadian, Jason D Lee, and Vahab Mirrokni. Muvera: Multi-vector retrieval via fixed dimensional encoding. Advances in Neural Information Processing Systems, 37: 0 101042--101073, 2024

  25. [25]

    VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160, 2024

  26. [26]

    VLM 2vec: Training vision-language models for massive multimodal embedding tasks

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM 2vec: Training vision-language models for massive multimodal embedding tasks. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    From generator to embedder: Harnessing innate abilities of multimodal llms via building zero-shot discriminative embedding model

    Yeong-Joon Ju and Seong-Whan Lee. From generator to embedder: Harnessing innate abilities of multimodal llms via building zero-shot discriminative embedding model. arXiv preprint arXiv:2508.00955, 2025

  28. [28]

    R efer I t G ame: Referring to Objects in Photographs of Natural Scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. " R efer I t G ame: Referring to Objects in Photographs of Natural Scenes" . In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , pages 787--798, Doha, Qatar, October 2014. Associat...

  29. [29]

    Colbert: Efficient and effective passage search via contextualized late interaction over bert

    Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39--48, 2020

  30. [30]

    Jamie Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models, 2014

  31. [31]

    Modality curation: Building universal embeddings for advanced multimodal information retrieval

    Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Fuzheng Zhang, Guorui Zhou, et al. Modality curation: Building universal embeddings for advanced multimodal information retrieval. arXiv preprint arXiv:2505.19650, 2025

  32. [32]

    Matryoshka representation learning

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35: 0 30233--30249, 2022

  33. [33]

    Llave: Large language and vision embedding models with hardness-weighted contrastive learning

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. Llave: Large language and vision embedding models with hardness-weighted contrastive learning. arXiv preprint arXiv:2503.04812, 2025

  34. [34]

    Rethinking the role of token retrieval in multi-vector retrieval

    Jinhyuk Lee, Zhuyun Dai, Sai Meher Karthik Duddu, Tao Lei, Iftekhar Naim, Ming-Wei Chang, and Vincent Zhao. Rethinking the role of token retrieval in multi-vector retrieval. Advances in Neural Information Processing Systems, 36: 0 15384--15405, 2023

  35. [35]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

  36. [36]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesv \' a ri, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimor...

  37. [37]

    CITADEL : Conditional token interaction via dynamic lexical routing for efficient and effective multi-vector retrieval

    Minghan Li, Sheng-Chieh Lin, Barlas Oguz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, and Xilun Chen. CITADEL : Conditional token interaction via dynamic lexical routing for efficient and effective multi-vector retrieval. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association fo...

  38. [38]

    Pytorch distributed: experiences on accelerating data parallel training

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13 0 (12): 0 3005--3018, 2020

  39. [39]

    Mm-embed: Universal multimodal retrieval with multimodal llms, 2024

    Sheng - chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms, 2024

  40. [40]

    MM - EMBED : UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS

    Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM - EMBED : UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS . In The Thirteenth International Conference on Learning Representations, 2025

  41. [41]

    Lawrence Zitnick

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \'a r, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision -- ECCV 2014, pages 740--755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1

  42. [42]

    Visual news: Benchmark and challenges in news image captioning

    Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. Visual news: Benchmark and challenges in news image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6761--6771, Online and Punta Cana, Dominican Republi...

  43. [43]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, ...

  44. [44]

    Vidore benchmark v2: Raising the bar for visual retrieval.arXiv preprint arXiv:2505.17166, 2025

    Quentin Mac \'e , Ant \'o nio Loison, and Manuel Faysse. Vidore benchmark v2: Raising the bar for visual retrieval. arXiv preprint arXiv:2505.17166, 2025

  45. [45]

    C hart QA : A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. " C hart QA : A Benchmark for Question Answering about Charts with Visual and Logical Reasoning" . In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 2263--2279, Dublin, Ireland, May 2022. As...

  46. [46]

    VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590, 2025

  47. [47]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  48. [48]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  49. [49]

    UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings

    Jiajun Qin, Yuan Pu, Zhuolun He, Seunggeun Kim, David Z Pan, and Bei Yu. Unimoco: Unified modality completion for robust multi-modal embeddings. arXiv preprint arXiv:2505.11815, 2025

  50. [50]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

  51. [51]

    Multiple instance visual-semantic embedding

    Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. Multiple instance visual-semantic embedding. In Gabriel Brostow Tae-Kyun Kim, Stefanos Zafeiriou and Krystian Mikolajczyk, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 89.1--89.12. BMVA Press, September 2017

  52. [52]

    Plaid: an efficient engine for late interaction retrieval

    Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. Plaid: an efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 1747--1756, 2022 a

  53. [53]

    Colbertv2: Effective and efficient retrieval via lightweight late interaction

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. Colbertv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715--3734, 2022 b

  54. [54]

    Drill-down: Interactive retrieval of complex scenes using natural language queries

    Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, and Vicente Ordonez. Drill-down: Interactive retrieval of complex scenes using natural language queries. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

  55. [56]

    Breaking the batch barrier (b3) of contrastive learning via smart batch mining

    Raghuveer Thirukovalluru, Rui Meng, Ye Liu, Mingyi Su, Ping Nie, Semih Yavuz, Yingbo Zhou, Wenhu Chen, Bhuwan Dhingra, et al. Breaking the batch barrier (b3) of contrastive learning via smart batch mining. arXiv preprint arXiv:2505.11293, 2025b

  56. [57]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238--5248, 2022

  57. [58]

    Particular object retrieval with integral max-pooling of cnn activations

    Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of cnn activations. In ICLR 2016 - International Conference on Learning Representations, pages 1--12, 2016

  58. [59]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. CoRR, abs/2409.12191, 2024. doi:10.4855...

  59. [60]

    Bfloat16: The secret to high performance on cloud tpus

    Shibo Wang and Pankaj Kanwar. Bfloat16: The secret to high performance on cloud tpus. Google Cloud Blog, August 2019

  60. [61]

    Uniir: Training and benchmarking universal multimodal information retrievers

    Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, Septemb...

  61. [62]

    On the theoretical limitations of embedding-based retrieval

    Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee. On the theoretical limitations of embedding-based retrieval. arXiv preprint arXiv:2508.21038, 2025

  62. [63]

    Fashion iq: A new dataset towards retrieving images by natural language feedback

    Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11307--11317, June 2021

  63. [64]

    Grounding language models for visual entity recognition

    Zilin Xiao, Ming Gong, Paola Cascante-Bonilla, Xingyao Zhang, Jie Wu, and Vicente Ordonez. Grounding language models for visual entity recognition. In European Conference on Computer Vision, pages 393--411. Springer, 2024

  64. [65]

    Demystifying CLIP Data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023

  65. [66]

    Llama nemoretriever colembed: Top-performing text-image retrieval model

    Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, and Even Oldridge. Llama nemoretriever colembed: Top-performing text-image retrieval model. arXiv preprint arXiv:2507.05513, 2025

  66. [67]

    FILIP: Fine-grained interactive language-image pre-training

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. FILIP: Fine-grained interactive language-image pre-training. In International Conference on Learning Representations, 2022

  67. [68]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision -- ECCV 2016, pages 69--85, Cham, 2016. Springer International Publishing. ISBN 978-3-319-46475-6

  68. [69]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 11941--11952. IEEE, 2023. doi:10.1109/ICCV51070.2023.01100

  69. [70]

    Magiclens: Self-supervised image retrieval with open-ended instructions

    Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024a

  70. [71]

    Gme: Improving universal multimodal retrieval by multimodal llms

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Gme: Improving universal multimodal retrieval by multimodal llms, 2024b

  71. [72]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

  72. [73]

    Pytorch fsdp: Experiences on scaling fully sharded data parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proceedings of the VLDB Endowment, 16(12):3848--3860, 2023

  73. [74]

    Knowledge base graph embedding module design for visual question answering model

    Wenfeng Zheng, Lirong Yin, Xiaobing Chen, Zhiyang Ma, Shan Liu, and Bo Yang. Knowledge base graph embedding module design for visual question answering model. Pattern recognition, 120:108153, 2021

  74. [75]

    Megapairs: Massive data synthesis for universal multimodal retrieval

    Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, and Yongping Xiong. Megapairs: Massive data synthesis for universal multimodal retrieval. arXiv preprint arXiv:2412.14475, 2024