MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
Pith reviewed 2026-05-18 14:05 UTC · model grok-4.3
The pith
MetaEmbed lets users select how many Meta Tokens to use at test time for balancing multimodal retrieval quality against speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MetaEmbed appends a fixed set of learnable Meta Tokens to each input. Their last-layer contextualized representations serve as multi-vector embeddings. Matryoshka Multi-Vector Retrieval training arranges semantic content by granularity across these vectors, so a nested prefix of the tokens remains effective for indexing and late-interaction scoring without retraining.
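A minimal sketch of how scoring with a truncated token budget might look, assuming the ColBERT-style sum-of-max late-interaction operator and unnormalized dot products. The function name `late_interaction_score` and the budget parameters `r_q` and `r_c` are illustrative and are not taken from the MetaEmbed codebase.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(
    query_vecs: Sequence[Vector],
    cand_vecs: Sequence[Vector],
    r_q: int,
    r_c: int,
) -> float:
    """Sum-of-max score over the first r_q query and first r_c candidate vectors.

    Assumption: Meta Token embeddings are ordered coarse-to-fine, so a
    smaller budget simply keeps a prefix of each list.
    """
    if not 1 <= r_q <= len(query_vecs):
        raise ValueError("r_q must be between 1 and the number of query vectors")
    if not 1 <= r_c <= len(cand_vecs):
        raise ValueError("r_c must be between 1 and the number of candidate vectors")
    q_prefix = query_vecs[:r_q]
    c_prefix = cand_vecs[:r_c]
    # For each kept query vector, take its best-matching candidate vector.
    return sum(max(dot(q, c) for c in c_prefix) for q in q_prefix)


# Toy two-dimensional vectors, invented for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.5, 0.5]]
cheap = late_interaction_score(query, candidate, r_q=1, r_c=1)
full = late_interaction_score(query, candidate, r_q=2, r_c=2)
```

With a budget of one vector per side, the score reduces to a single dot product, which corresponds to the single-vector retrieval case.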
What carries the argument
Matryoshka Multi-Vector Retrieval training that distributes information by granularity across the fixed Meta Tokens so subsets stay useful for retrieval.
If this is right
- Retrieval pipelines can choose a smaller token count at query time to reduce latency while retaining acceptable accuracy.
- The same trained model works across efficiency regimes without additional fine-tuning.
- Performance scales to 32B-parameter backbones on both MMEB and ViDoRe benchmarks.
- Index size and interaction cost can be traded off by selecting different numbers of tokens per collection or query.
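To make the trade-off concrete, the sketch below estimates index storage and per-pair scoring cost as a function of token budget. It assumes dense, uncompressed storage with a fixed number of bytes per value and sum-of-max scoring; the function names and all numbers are hypothetical and are not figures from the paper.

```python
def index_bytes(num_items: int, tokens_per_item: int, dim: int, bytes_per_value: int = 2) -> int:
    """Storage for a dense multi-vector index (no compression assumed)."""
    return num_items * tokens_per_item * dim * bytes_per_value


def pair_multiply_adds(r_q: int, r_c: int, dim: int) -> int:
    """Multiply-adds needed to score one query-candidate pair with sum-of-max."""
    return r_q * r_c * dim


# Hypothetical collection: one million items with 1024-dimensional vectors.
for budget in (1, 4, 16):
    size_gb = index_bytes(1_000_000, budget, 1024) / 1e9
    print(f"{budget:>2} tokens/item: {size_gb:.1f} GB")
```

Both quantities grow linearly in the candidate-side budget, which is why indexing with fewer tokens lowers storage and scoring cost together.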
Where Pith is reading between the lines
- Production systems could adapt the token count per query according to current load or user priority.
- The same training pattern might transfer to text-only retrieval or other late-interaction tasks.
- Vector storage requirements could be lowered by indexing less critical items with fewer tokens.
- Automatic policies might learn to pick the right token count from query features alone.
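One way such a per-query policy could be sketched, assuming the operator has measured a latency for each supported token budget. The function `pick_token_budget` and the latency table are hypothetical; the paper does not describe an automatic policy.

```python
from typing import Dict


def pick_token_budget(latency_ms: Dict[int, float], deadline_ms: float) -> int:
    """Return the largest token budget whose measured latency fits the deadline.

    Falls back to the smallest budget when nothing fits.
    """
    if not latency_ms:
        raise ValueError("latency_ms must contain at least one budget")
    fitting = [r for r, ms in latency_ms.items() if ms <= deadline_ms]
    return max(fitting) if fitting else min(latency_ms)


# Hypothetical measurements: token budget -> milliseconds per query.
measured = {1: 2.0, 4: 5.0, 16: 18.0}
relaxed = pick_token_budget(measured, deadline_ms=20.0)
loaded = pick_token_budget(measured, deadline_ms=6.0)
overloaded = pick_token_budget(measured, deadline_ms=1.0)
```

A rule like this only changes the query-side budget at request time; the candidate-side budget is fixed by how the index was built.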
Load-bearing premise
Matryoshka training succeeds in organizing the Meta Tokens so any smaller number of them still produces useful retrieval signals without retraining.
What would settle it
An experiment showing that retrieval accuracy drops sharply when using only the first half of the Meta Tokens on MMEB, compared with using all tokens, would falsify the flexible scaling claim.
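The check described above can be sketched as a recall sweep over token budgets. The harness below is an assumption about how one might run it: `score_fn` is a hypothetical callable standing in for the model's late-interaction score at a given budget, and the toy data is invented for illustration.

```python
from typing import Callable, Dict, Hashable, Sequence


def recall_by_budget(
    score_fn: Callable[[Hashable, Hashable, int], float],
    queries: Sequence[Hashable],
    candidates: Sequence[Hashable],
    gold: Dict[Hashable, Hashable],
    budgets: Sequence[int],
    k: int = 1,
) -> Dict[int, float]:
    """Recall@k for each token budget, ranking candidates by score_fn."""
    results = {}
    for r in budgets:
        hits = 0
        for q in queries:
            ranked = sorted(candidates, key=lambda c: score_fn(q, c, r), reverse=True)
            hits += gold[q] in ranked[:k]
        results[r] = hits / len(queries)
    return results


# Toy stand-in: with 1 token every query prefers "a"; with 2 tokens the
# gold candidate wins. Real scores would come from the trained model.
toy_gold = {"q1": "a", "q2": "b"}


def toy_score(q, c, r):
    if r == 1:
        return 1.0 if c == "a" else 0.0
    return 1.0 if c == toy_gold[q] else 0.0


sweep = recall_by_budget(toy_score, ["q1", "q2"], ["a", "b"], toy_gold, budgets=[1, 2])
```

A sharp gap between the half-budget and full-budget entries on MMEB would be the falsifying outcome; a gradual decline would be consistent with the claim.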
Original abstract
Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters. Code is available at https://github.com/facebookresearch/MetaEmbed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MetaEmbed, a multimodal retrieval framework that appends a fixed number of learnable Meta Tokens to the input sequence during training. The last-layer contextualized representations of these tokens are used as compact multi-vector embeddings at test time. The Matryoshka Multi-Vector Retrieval training is claimed to organize information by granularity across the Meta Tokens, enabling users to select the number of tokens for indexing and retrieval interactions to trade off quality against efficiency. The work reports state-of-the-art results on the Massive Multimodal Embedding Benchmark (MMEB) and Visual Document Retrieval Benchmark (ViDoRe) while scaling to 32B-parameter backbones.
Significance. If the granularity organization holds, the approach would offer a meaningful advance by providing test-time flexibility in multi-vector retrieval without retraining, addressing the expressiveness-efficiency trade-off at scale. The reported scaling to 32B models and SOTA benchmark results would be notable strengths if supported by robust verification of the core mechanism.
major comments (1)
- [Abstract and Matryoshka Multi-Vector Retrieval training description] The central claim that Matryoshka Multi-Vector Retrieval training produces ordered, prefix-usable representations (allowing arbitrary subsets of Meta Tokens to remain effective at test time) is load-bearing but not directly verified. No ablation compares retrieval metrics such as recall or mAP when using the first k Meta Tokens versus a random k-subset (or versus the full set), which is required to confirm hierarchical organization rather than incidental post-hoc usability.
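The requested ablation could be set up along these lines. This is a sketch under assumptions: the helper name `ablation_arms` is hypothetical, the random arm uses a seeded uniform sample without replacement, and scoring is left to whatever late-interaction function the evaluation already uses.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def ablation_arms(vectors: Sequence[T], k: int, seed: int = 0) -> Dict[str, List[T]]:
    """Build the three token sets to compare: first-k prefix, random k-subset, full set."""
    n = len(vectors)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of Meta Token vectors")
    rng = random.Random(seed)
    # Sort sampled positions so the random arm keeps the original token order.
    random_idx = sorted(rng.sample(range(n), k))
    return {
        "prefix": list(vectors[:k]),
        "random": [vectors[i] for i in random_idx],
        "full": list(vectors),
    }


# Integers stand in for embedding vectors here.
arms = ablation_arms(list(range(8)), k=4, seed=0)
```

If the prefix arm consistently beats the random arm at equal k across several seeds, that would support ordered, hierarchical organization; parity would suggest the tokens are roughly interchangeable.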
minor comments (2)
- [Evaluation on MMEB and ViDoRe] The evaluation sections would be strengthened by reporting error bars or results across multiple random seeds, particularly for the scaling experiments with 32B models and varying token counts.
- [Method] Clarify the precise architectural integration of the Meta Tokens (e.g., whether they are added only to the query, the candidate, or both) and how late interaction is computed when using partial token sets.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The feedback highlights an important verification point for our central claim, and we address it directly below.
Point-by-point responses
- Referee: [Abstract and Matryoshka Multi-Vector Retrieval training description] The central claim that Matryoshka Multi-Vector Retrieval training produces ordered, prefix-usable representations (allowing arbitrary subsets of Meta Tokens to remain effective at test time) is load-bearing but not directly verified. No ablation compares retrieval metrics such as recall or mAP when using the first k Meta Tokens versus a random k-subset (or versus the full set), which is required to confirm hierarchical organization rather than incidental post-hoc usability.
  Authors: We appreciate the referee's emphasis on directly verifying the hierarchical organization induced by Matryoshka Multi-Vector Retrieval training. Our training objective explicitly encourages the Meta Tokens to capture information at progressively finer granularities, with earlier tokens handling coarser semantics and later tokens adding detail; this design is reflected in the consistent performance gains observed when scaling the number of tokens at test time on MMEB and ViDoRe. We acknowledge, however, that the original manuscript did not include an explicit ablation contrasting the first-k prefix against a random k-subset (or the full set) using recall or mAP. Such a comparison would indeed strengthen the evidence that the ordering is not incidental. We will add this ablation to the revised manuscript.
  Revision: yes
Circularity Check
No circularity: training objective and test-time selection are independently motivated
The paper introduces a training procedure called Matryoshka Multi-Vector Retrieval that is explicitly designed to produce granularity-organized representations across a fixed set of learnable Meta Tokens. The test-time flexibility (selecting k tokens for indexing and interaction) is presented as a direct consequence of that training rather than a redefinition or self-referential fit. No equation or claim reduces the claimed property to a fitted parameter that is then relabeled as a prediction; the organization by granularity is an empirical outcome of the loss, not a definitional tautology. The derivation chain therefore remains self-contained against external benchmarks such as MMEB and ViDoRe, with no load-bearing self-citation or ansatz smuggling required for the central result.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of Meta Tokens
invented entities (1)
- Meta Tokens: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: echoes)
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors... fix G group sizes... 1 ≤ r^(1)_q < … < r^(G)_q = R_q … s^(g)(q,c) … L_final = Σ_g w_g L^(g)_NCE
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: the nested design yields a simple accuracy-efficiency knob... select (r^(g)_q, r^(g)_c) based on latency constraints
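The objective quoted in the first passage, L_final = Σ w_g L^(g)_NCE, can be illustrated with a short sketch. It assumes each per-group term is a standard InfoNCE (softmax cross-entropy) loss over the scores computed at that group's token budget; the temperature default, function names, and example numbers are illustrative, and the excerpt does not specify the weights w_g.

```python
import math
from typing import Sequence


def info_nce(scores: Sequence[float], positive_index: int, temperature: float = 1.0) -> float:
    """InfoNCE loss for one query: negative log-softmax of the positive's score."""
    logits = [s / temperature for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[positive_index]


def matryoshka_loss(
    scores_per_group: Sequence[Sequence[float]],
    positive_index: int,
    weights: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Weighted sum of per-group InfoNCE losses, one group per nested token budget."""
    if len(scores_per_group) != len(weights):
        raise ValueError("need one weight per group")
    return sum(
        w * info_nce(s, positive_index, temperature)
        for s, w in zip(scores_per_group, weights)
    )


# Hypothetical scores of one query against three candidates (index 0 is the
# positive), computed at two nested token budgets.
group_scores = [[2.0, 1.5, 1.0], [4.0, 1.0, 0.5]]
loss = matryoshka_loss(group_scores, positive_index=0, weights=[1.0, 1.0])
```

Because every group contributes its own contrastive term, the smallest prefix is trained to rank the positive correctly on its own, which is what would make truncation usable at test time.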
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
  Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
- Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
  A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
- Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
  ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
- CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
  CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.
- FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation
  FLUID retires candidate-side item IDs in production livestream rankers via cross-domain multimodal hierarchical codes and late-fusion ID-free design, reporting online gains of +0.55% Quality Watch Duration and +2.05% ...
Reference graph
Works this paper leans on
-
[1]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
Phi-3 technical report: A highly capable language model locally on your phone, 2024
Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024
work page 2024
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming - Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical repo...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.13923 2025
-
[4]
PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer, Andreas Steiner, Andr \'e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models. In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[6]
Moca: Modality-aware continual pre-training makes better bidirectional multimodal embeddings
Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, and Zhicheng Dou. Moca: Modality-aware continual pre-training makes better bidirectional multimodal embeddings. arXiv preprint arXiv:2506.23115, 2025 a
-
[7]
mme5: Improving multimodal multilingual embeddings via high-quality synthetic data
Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, and Zhicheng Dou. mme5: Improving multimodal multilingual embeddings via high-quality synthetic data. CoRR, abs/2502.08468, 2025 b . doi:10.48550/ARXIV.2502.08468
-
[8]
Training Deep Nets with Sublinear Memory Cost
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[9]
Meta clip 2: A worldwide scaling recipe
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. Meta clip 2: A worldwide scaling recipe. arXiv preprint arXiv:2507.22062, 2025
-
[10]
Flashattention-2: Faster attention with better parallelism and work partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[11]
DESSERT : An efficient algorithm for vector set search with vector set queries
Joshua Engels, Benjamin Coleman, Vihan Lakshman, and Anshumali Shrivastava. DESSERT : An efficient algorithm for vector set search with vector set queries. In Thirty-seventh Conference on Neural Information Processing Systems, 2023
work page 2023
-
[12]
Vse++: Improving visual-semantic embeddings with hard negatives
Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives. 2018
work page 2018
-
[13]
Colpali: Efficient document retrieval with vision language models
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, C \' e line Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.net, 2025
work page 2025
-
[14]
Devise: A deep visual-semantic embedding model
Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013
work page 2013
-
[15]
Deep image retrieval: Learning global representations for image search
Albert Gordo, Jon Almaz \'a n, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In European conference on computer vision, pages 241--257. Springer, 2016
work page 2016
-
[16]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Breaking the modality barrier: Universal embedding learning with multimodal llms
Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, and Jiankang Deng. Breaking the modality barrier: Universal embedding learning with multimodal llms. arXiv preprint arXiv:2504.17432, 2025
-
[18]
Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P
Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
work page 2018
-
[19]
jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval, 2025
Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, and Han Xiao. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval, 2025
work page 2025
-
[20]
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen - Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022
work page 2022
-
[21]
Learning answer embeddings for visual question answering
Hexiang Hu, Wei-Lun Chao, and Fei Sha. Learning answer embeddings for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5428--5436, 2018
work page 2018
-
[22]
Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities
Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming-Wei Chang. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12065--12075, October 2023
work page 2023
-
[23]
Mimicking or reasoning: Rethinking multi-modal in-context learning in vision-language models
Chengyue Huang, Yuchen Zhu, Sichen Zhu, Jingyun Xiao, Moises Andrade, Shivang Chopra, and Zsolt Kira. Mimicking or reasoning: Rethinking multi-modal in-context learning in vision-language models. arXiv preprint arXiv:2506.07936, 2025
-
[24]
Muvera: Multi-vector retrieval via fixed dimensional encoding
Rajesh Jayaram, Laxman Dhulipala, Majid Hadian, Jason D Lee, and Vahab Mirrokni. Muvera: Multi-vector retrieval via fixed dimensional encoding. Advances in Neural Information Processing Systems, 37: 0 101042--101073, 2024
work page 2024
-
[25]
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
VLM 2vec: Training vision-language models for massive multimodal embedding tasks
Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM 2vec: Training vision-language models for massive multimodal embedding tasks. In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[27]
Yeong-Joon Ju and Seong-Whan Lee. From generator to embedder: Harnessing innate abilities of multimodal llms via building zero-shot discriminative embedding model. arXiv preprint arXiv:2508.00955, 2025
-
[28]
R efer I t G ame: Referring to Objects in Photographs of Natural Scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. " R efer I t G ame: Referring to Objects in Photographs of Natural Scenes" . In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , pages 787--798, Doha, Qatar, October 2014. Associat...
work page 2014
-
[29]
Colbert: Efficient and effective passage search via contextualized late interaction over bert
Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39--48, 2020
work page 2020
-
[30]
Jamie Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models, 2014
work page 2014
-
[31]
Modality curation: Building universal embeddings for advanced multimodal information retrieval
Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Fuzheng Zhang, Guorui Zhou, et al. Modality curation: Building universal embeddings for advanced multimodal information retrieval. arXiv preprint arXiv:2505.19650, 2025
-
[32]
Matryoshka representation learning
Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35: 0 30233--30249, 2022
work page 2022
-
[33]
Llave: Large language and vision embedding models with hardness-weighted contrastive learning
Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. Llave: Large language and vision embedding models with hardness-weighted contrastive learning. arXiv preprint arXiv:2503.04812, 2025
-
[34]
Rethinking the role of token retrieval in multi-vector retrieval
Jinhyuk Lee, Zhuyun Dai, Sai Meher Karthik Duddu, Tao Lei, Iftekhar Naim, Ming-Wei Chang, and Vincent Zhao. Rethinking the role of token retrieval in multi-vector retrieval. Advances in Neural Information Processing Systems, 36: 0 15384--15405, 2023
work page 2023
-
[35]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesv \' a ri, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimor...
work page 2022
-
[37]
Minghan Li, Sheng-Chieh Lin, Barlas Oguz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, and Xilun Chen. CITADEL : Conditional token interaction via dynamic lexical routing for efficient and effective multi-vector retrieval. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association fo...
-
[38]
Pytorch distributed: experiences on accelerating data parallel training
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13 0 (12): 0 3005--3018, 2020
work page 2020
-
[39]
Mm-embed: Universal multimodal retrieval with multimodal llms, 2024
Sheng - chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms, 2024
work page 2024
-
[40]
MM - EMBED : UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS
Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM - EMBED : UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS . In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[41]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \'a r, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision -- ECCV 2014, pages 740--755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1
work page 2014
-
[42]
Visual news: Benchmark and challenges in news image captioning
Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. Visual news: Benchmark and challenges in news image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6761--6771, Online and Punta Cana, Dominican Republi...
work page 2021
-
[43]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, ...
work page 2022
-
[44]
Vidore benchmark v2: Raising the bar for visual retrieval.arXiv preprint arXiv:2505.17166, 2025
Quentin Mac \'e , Ant \'o nio Loison, and Manuel Faysse. Vidore benchmark v2: Raising the bar for visual retrieval. arXiv preprint arXiv:2505.17166, 2025
-
[45]
C hart QA : A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. " C hart QA : A Benchmark for Question Answering about Charts with Visual and Logical Reasoning" . In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 2263--2279, Dublin, Ireland, May 2022. As...
work page 2022
-
[46]
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[48]
Pytorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019
work page 2019
-
[49]
UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings
Jiajun Qin, Yuan Pu, Zhuolun He, Seunggeun Kim, David Z Pan, and Bei Yu. Unimoco: Unified modality completion for robust multi-modal embeddings. arXiv preprint arXiv:2505.11815, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...
work page 2021
-
[51]
Multiple instance visual-semantic embedding
Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. Multiple instance visual-semantic embedding. In Gabriel Brostow Tae-Kyun Kim, Stefanos Zafeiriou and Krystian Mikolajczyk, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 89.1--89.12. BMVA Press, September 2017
work page 2017
-
[52]
Plaid: an efficient engine for late interaction retrieval
Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. Plaid: an efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 1747--1756, 2022 a
work page 2022
-
[53]
Colbertv2: Effective and efficient retrieval via lightweight late interaction
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. Colbertv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715--3734, 2022 b
work page 2022
-
[54]
Drill-down: Interactive retrieval of complex scenes using natural language queries
Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, and Vicente Ordonez. Drill-down: Interactive retrieval of complex scenes using natural language queries. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alch\' e -Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019
work page 2019
-
[56]
Breaking the batch barrier (b3) of contrastive learning via smart batch mining
Raghuveer Thirukovalluru, Rui Meng, Ye Liu, Mingyi Su, Ping Nie, Semih Yavuz, Yingbo Zhou, Wenhu Chen, Bhuwan Dhingra, et al. Breaking the batch barrier (b3) of contrastive learning via smart batch mining. arXiv preprint arXiv:2505.11293, 2025 b
-
[57]
Winoground: Probing vision and language models for visio-linguistic compositionality
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238--5248, 2022
work page 2022
-
[58]
Particular object retrieval with integral max-pooling of cnn activations
Giorgos Tolias, Ronan Sicre, and Herv \'e J \'e gou. Particular object retrieval with integral max-pooling of cnn activations. In ICLR 2016-International Conference on Learning Representations, pages 1--12, 2016
work page 2016
-
[59]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. CoRR, abs/2409.12191, 2024. doi:10.4855...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12191 2024
-
[60]
Bfloat16: The secret to high performance on cloud tpus
Shibo Wang and Pankaj Kanwar. Bfloat16: The secret to high performance on cloud tpus. Google Cloud Blog, 8 2019
work page 2019
-
[61]
Uniir: Training and benchmarking universal multimodal information retrievers
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, Septemb...
-
[62]
On the theoretical limitations of embedding-based retrieval
Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee. On the theoretical limitations of embedding-based retrieval. arXiv preprint arXiv:2508.21038, 2025
-
[63]
Fashion iq: A new dataset towards retrieving images by natural language feedback
Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11307--11317, June 2021
-
[64]
Grounding language models for visual entity recognition
Zilin Xiao, Ming Gong, Paola Cascante-Bonilla, Xingyao Zhang, Jie Wu, and Vicente Ordonez. Grounding language models for visual entity recognition. In European Conference on Computer Vision, pages 393--411. Springer, 2024
-
[65]
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023
-
[66]
Llama nemoretriever colembed: Top-performing text-image retrieval model
Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, and Even Oldridge. Llama nemoretriever colembed: Top-performing text-image retrieval model. arXiv preprint arXiv:2507.05513, 2025
-
[67]
FILIP: Fine-grained interactive language-image pre-training
Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. FILIP: Fine-grained interactive language-image pre-training. In International Conference on Learning Representations, 2022
-
[68]
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision -- ECCV 2016, pages 69--85, Cham, 2016. Springer International Publishing. ISBN 978-3-319-46475-6
-
[69]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 11941--11952. IEEE, 2023. doi:10.1109/ICCV51070.2023.01100
-
[70]
Magiclens: Self-supervised image retrieval with open-ended instructions
Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024
-
[71]
Gme: Improving universal multimodal retrieval by multimodal llms
Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Gme: Improving universal multimodal retrieval by multimodal llms, 2024
-
[72]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025
-
[73]
Pytorch fsdp: Experiences on scaling fully sharded data parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proceedings of the VLDB Endowment, 16(12): 3848--3860, 2023
-
[74]
Knowledge base graph embedding module design for visual question answering model
Wenfeng Zheng, Lirong Yin, Xiaobing Chen, Zhiyang Ma, Shan Liu, and Bo Yang. Knowledge base graph embedding module design for visual question answering model. Pattern Recognition, 120: 108153, 2021
-
[75]
Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, and Yongping Xiong. Megapairs: Massive data synthesis for universal multimodal retrieval. arXiv preprint arXiv:2412.14475, 2024