GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Pith reviewed 2026-05-15 06:31 UTC · model grok-4.3
The pith
Training an MLLM on synthetically balanced fused text-image data produces a single dense retriever that leads on universal multimodal search tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The General Multimodal Embedder is an MLLM turned into a dense retriever by training it on a large-scale synthetic fused-modal dataset constructed through a dedicated synthesis pipeline; this training regime lifts performance to state-of-the-art levels on the new Universal Multimodal Retrieval Benchmark across text-only, image-only, and mixed-modality query-candidate pairs.
What carries the argument
The General Multimodal Embedder (GME), an MLLM-based dense retriever whose embeddings are learned from the authors' synthetically generated fused-modal training set to support retrieval regardless of whether queries and candidates are text, images, or combinations.
If this is right
- A single embedding space now supports retrieval when the query is text, an image, or both, and the candidate set can be any of the same combinations.
- Model scaling and careful choice of training strategy continue to raise accuracy on the UMR benchmark once the fused data is available.
- Ablation results isolate the contribution of data diversity and show that removing the synthesis step drops performance back toward prior text-only baselines.
- The new UMRB provides a standardized test bed that future universal retrievers can be measured against directly.
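As a rough illustration of the first point above, retrieval in a single embedding space reduces to ranking candidates by their similarity to the query vector, whatever modality each one came from. The sketch below is not the paper's implementation: the toy vectors stand in for embeddings that an MLLM encoder would produce, cosine similarity is an assumed scoring function, and `retrieve` and the candidate IDs are hypothetical names chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, candidates, top_k=2):
    """Return the IDs of the top_k candidates, ranked by cosine similarity.

    `candidates` maps an ID to its embedding. Text, image, and fused items
    share one dictionary because they are assumed to share one vector space.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:top_k]]


# Toy embeddings (assumption: a real system would obtain these from the encoder).
candidates = {
    "text:doc_a": [0.9, 0.1, 0.0],
    "image:img_b": [0.1, 0.9, 0.1],
    "fused:page_c": [0.7, 0.6, 0.1],
}
query = [1.0, 0.5, 0.0]  # could encode a text, image, or text+image query

ranked = retrieve(query, candidates, top_k=2)
print(ranked)
```

Because every item lives in the same space, the ranking code never branches on modality; that is the practical meaning of a single index.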
Where Pith is reading between the lines
- Similar synthesis pipelines could be applied to add video or audio modalities without requiring massive new human annotation.
- Real-world search engines might adopt one index instead of maintaining separate text and vision indexes, lowering storage and maintenance costs.
- The same training recipe could be tested on open-ended multimodal question answering to check whether retrieval gains translate to generation tasks.
Load-bearing premise
The synthetic fused-modal training dataset is of high quality and sufficiently diverse to unlock MLLM potential for universal retrieval without introducing biases or artifacts.
What would settle it
If an identically sized MLLM trained on an equal volume of real, balanced multimodal data instead of the synthetic set achieves equal or higher accuracy on the UMRB, the necessity of the synthesis pipeline for the claimed gains would be refuted.
Original abstract
Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text, images, or a combination of both. Previous work has attempted to adopt multimodal large language models (MLLMs) to realize UMR using only text data. However, our preliminary experiments demonstrate that more diverse multimodal training data can further unlock the potential of MLLMs. Despite its effectiveness, the existing multimodal training data is highly imbalanced in terms of modality, which motivates us to develop a training data synthesis pipeline and construct a large-scale, high-quality fused-modal training dataset. Based on the synthetic training data, we develop the General Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR. Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the effectiveness of our approach. Experimental results show that our method achieves state-of-the-art performance among existing UMR methods. Last, we provide in-depth analyses of model scaling and training strategies, and perform ablation studies on both the model and synthetic data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the General Multimodal Embedder (GME), an MLLM-based dense retriever for universal multimodal retrieval (UMR) that supports text, image, and fused-modal queries/candidates. To overcome modality imbalance in prior training data, the authors introduce a synthesis pipeline that constructs a large-scale fused-modal dataset; they also release the UMR Benchmark (UMRB) and report that GME attains state-of-the-art results on it, supported by scaling studies, training-strategy analyses, and ablations.
Significance. If the synthetic-data quality and experimental superiority hold, the work would meaningfully advance UMR by showing that carefully balanced multimodal training data can better exploit MLLM capacity for cross-modal retrieval, providing both a practical model and a new evaluation benchmark.
major comments (1)
- [§3 and §4] §3 (Dataset Synthesis Pipeline) and §4 (Experiments): the central SOTA claim on UMRB rests on the assumption that the synthetic fused-modal dataset is high-quality, balanced, and free of systematic artifacts or hallucinations; however, the manuscript reports no independent quantitative checks (e.g., modality-balance statistics, diversity metrics, or human validation of generated pairs) that would confirm this assumption, leaving the performance gains vulnerable to data-induced bias.
minor comments (2)
- [Abstract and §4] Abstract and §4: experimental details on exact baselines, evaluation metrics, statistical significance testing, and hyper-parameter settings are only sketched; these should be expanded with concrete numbers and tables for reproducibility.
- [§5] §5 (Ablations): the scaling and training-strategy analyses would benefit from clearer notation distinguishing the contributions of data volume versus data modality diversity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the major comment point by point below.
Point-by-point responses
-
Referee: [§3 and §4] §3 (Dataset Synthesis Pipeline) and §4 (Experiments): the central SOTA claim on UMRB rests on the assumption that the synthetic fused-modal dataset is high-quality, balanced, and free of systematic artifacts or hallucinations; however, the manuscript reports no independent quantitative checks (e.g., modality-balance statistics, diversity metrics, or human validation of generated pairs) that would confirm this assumption, leaving the performance gains vulnerable to data-induced bias.
Authors: We thank the referee for this important observation. The synthesis pipeline in §3 incorporates explicit balancing steps (equal sampling across text-only, image-only, and fused-modal pairs) and quality filters (heuristic length checks plus MLLM-based relevance scoring) to mitigate imbalance and hallucinations. However, we acknowledge that the original manuscript did not report standalone quantitative validation metrics for the final dataset. In the revised version we will add: (i) modality-balance statistics (exact counts and percentages of each modality combination), (ii) diversity metrics (average token length, unique n-gram coverage, and average pairwise embedding cosine similarity), and (iii) a human validation study on a random 500-pair subset reporting hallucination and relevance rates. These results will appear in §3 with supporting tables and examples moved to the appendix. We believe the added evidence will directly address the concern while preserving the experimental claims.
Revision: yes
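The quantitative checks named in the response above are simple to compute once the dataset is in hand. The following is a minimal sketch under stated assumptions, not the authors' tooling: the record fields (`query_modality`, `candidate_modality`, `query_text`) are hypothetical, "unique n-gram coverage" is interpreted here as the ratio of distinct n-grams to all n-grams, and tokenization is plain whitespace splitting.

```python
import math
from collections import Counter
from itertools import combinations


def modality_balance(pairs):
    """Count and percentage of each (query, candidate) modality combination."""
    counts = Counter((p["query_modality"], p["candidate_modality"]) for p in pairs)
    total = len(pairs)
    return {combo: (n, 100.0 * n / total) for combo, n in counts.items()}


def unique_ngram_ratio(texts, n=2):
    """Distinct n-grams divided by total n-grams (assumed definition)."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of non-zero vectors."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return sum(sims) / len(sims) if sims else 0.0


# Toy records (hypothetical schema, for illustration only).
pairs = [
    {"query_modality": "text", "candidate_modality": "image", "query_text": "red car on road"},
    {"query_modality": "text", "candidate_modality": "image", "query_text": "red car in garage"},
    {"query_modality": "fused", "candidate_modality": "text", "query_text": "who painted this"},
    {"query_modality": "image", "candidate_modality": "fused", "query_text": ""},
]

balance = modality_balance(pairs)
ratio = unique_ngram_ratio([p["query_text"] for p in pairs], n=2)
mean_sim = mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(balance, ratio, mean_sim)
```

A lower mean pairwise similarity and a higher distinct n-gram ratio both point toward a more varied dataset, though neither number says anything about hallucination, which is why the human validation study is listed separately.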
Circularity Check
No significant circularity; empirical training and evaluation chain is self-contained
Full rationale
The paper constructs a synthetic fused-modal dataset to address modality imbalance in existing data, trains the GME MLLM-based retriever on it, builds the UMRB benchmark, and reports SOTA empirical results. No derivation step reduces by construction to its inputs, no fitted parameter is relabeled as a prediction, and no load-bearing claim rests on self-citation chains or imported uniqueness theorems. All central results derive from standard training-plus-evaluation on held-out benchmarks rather than tautological re-expression of the synthesis pipeline or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Diverse multimodal training data improves MLLM performance on UMR tasks.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.Cost.FunctionalEquation washburn_uniqueness_aczel
Tag: unclear. The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "we propose a novel data synthesis pipeline for constructing large-scale, fused-modal training data... This pipeline is more efficient than previous approaches"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
-
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
-
CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
-
Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
-
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
-
MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL
MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.
-
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
Introduces the OACIR task requiring instance-level consistency via bounding-box anchors, a 160K real-world benchmark OACIRR, and the AdaFocal framework that adaptively focuses attention on the anchored region.
-
PLUME: Latent Reasoning Based Universal Multimodal Embedding
PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
-
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
-
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.
-
MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
-
ViLL-E: Video LLM Embeddings for Retrieval
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
-
HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval
HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...
-
A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...
-
TriAlignGR: Triangular Multitask Alignment with Multimodal Deep Interest Mining for Generative Recommendation
TriAlignGR integrates visual content and latent user interests into Semantic IDs via cross-modal alignment, CoT-based interest mining, and triangular multitask training to address content degradation and semantic opac...
-
Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval
SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.
-
BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment
BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...
Reference graph
Works this paper leans on
-
[1]
Overview of touch ´e 2020: Argument retrieval - extended abstract
Alexander Bondarenko, Maik Fr ¨obe, Meriem Beloucif, Lukas Gienapp, Yamen Ajjour, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Pot- thast, and Matthias Hagen. Overview of touch ´e 2020: Argument retrieval - extended abstract. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - CLEF 2020, pages 384–395. ...
work page 2020
-
[2]
A full-text learning to rank dataset for medical information retrieval
Vera Boteva, Demian Gholipour Ghalandari, Artem Sokolov, and Stefan Riezler. A full-text learning to rank dataset for medical information retrieval. In Advances in Information Retrieval - 38th European Conference on IR Research, ECIR 2016, Padua, Italy, pages 716–722. Springer, 2016. 3
work page 2016
-
[3]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz...
work page 2020
-
[4]
Webqa: Multihop and multimodal QA
Yingshan Chang, Guihong Cao, Mridu Narang, Jianfeng Gao, Hisami Suzuki, and Yonatan Bisk. Webqa: Multihop and multimodal QA. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16474–16483, 2022. 1, 3, 12, 14
work page 2022
-
[5]
Training Deep Nets with Sublinear Memory Cost
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. CoRR, abs/1604.06174, 2016. 6
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[6]
Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pages 14948–14968, Singapore, 2023. Association for Computational Linguistics....
work page 2023
-
[7]
How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101,
-
[8]
SPECTER: Document-level representa- tion learning using citation-informed transformers
Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. SPECTER: Document-level representa- tion learning using citation-informed transformers. In Pro- ceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online, 2020. Association for Computational Linguistics. 3
work page 2020
-
[9]
Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B
Zhuyun Dai, Vincent Y . Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. Promptagator: Few-shot dense retrieval from 8 examples. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. 6, 18
work page 2023
-
[10]
Imagenet: A large-scale hierarchical im- age database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami, USA , pages 248–255. IEEE Computer Society, 2009. 4, 6
work page 2009
-
[11]
Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold
Thomas Diggelmann, Jordan L. Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. CLIMATE- FEVER: A dataset for verification of real-world climate claims. CoRR, abs/2012.00614, 2020. 3
-
[12]
Col- pali: Efficient document retrieval with vision language mod- els
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, C ´eline Hudelot, and Pierre Colombo. Col- pali: Efficient document retrieval with vision language mod- els. In The Thirteenth International Conference on Learning Representations, 2025. 2, 3, 7, 14, 16
work page 2025
-
[13]
Dreamsim: Learning new dimensions of human visual simi- larity using synthetic data
Stephanie Fu, Netanel Yakir Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual simi- larity using synthetic data. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 3, 12
work page 2023
-
[14]
SimCSE: Simple contrastive learning of sentence embeddings
Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. 2, 6
work page 2021
-
[15]
Mitko Gospodinov, Sean MacAvaney, and Craig Macdonald. Doc2query-: When less is more. In Advances in Information Retrieval - 45th European Conference on Information Re- trieval, ECIR 2023 , pages 414–422, Dublin, Ireland, 2023. Springer. 5
work page 2023
-
[16]
Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S
Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S. Davis. Au- tomatic spatially-aware fashion concept discovery. In IEEE International Conference on Computer Vision, ICCV 2017 , pages 1472–1480, Venice, Italy, 2017. IEEE Computer Soci- ety. 3, 14
work page 2017
-
[17]
Dbpedia-entity v2: A test collection for entity search
Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. Dbpedia-entity v2: A test collection for entity search. In Proceedings of the 40th International ACM SIGIR Confer- ence on Research and Development in Information Retrieval, page 1265–1268, New York, NY , USA, 2017. Association for Comp...
work page 2017
-
[18]
Verspoor, and Timothy Bald- win
Doris Hoogeveen, Karin M. Verspoor, and Timothy Bald- win. Cqadupstack: A benchmark data set for community question-answering research. In Proceedings of the 20th Australasian Document Computing Symposium , New York, NY , USA, 2015. Association for Computing Machinery. 3
work page 2015
-
[19]
9 LoRA: Low-rank adaptation of large language models
Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 9 LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations , 2022. 6, 15
work page 2022
-
[20]
Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities
Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming- Wei Chang. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. In IEEE/CVF International Conference on Computer Vision, ICCV 2023 , pages 12031–12041, Paris, France, 2023. IEEE. 3, 14
work page 2023
-
[21]
Unsupervised dense information retrieval with con- trastive learning
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebas- tian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with con- trastive learning. Transactions on Machine Learning Re- search, 2022. 3
work page 2022
-
[22]
E5-V: universal embeddings with multi- modal large language models
Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-V: universal embeddings with multi- modal large language models. CoRR, abs/2407.12580, 2024. 1, 2, 3, 4, 6, 7
-
[23]
VLM2vec: Training vision- language models for massive multimodal embedding tasks
Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM2vec: Training vision- language models for massive multimodal embedding tasks. In The Thirteenth International Conference on Learning Representations, 2025. 2, 3
work page 2025
-
[24]
TriviaQA: A large scale distantly supervised chal- lenge dataset for reading comprehension
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettle- moyer. TriviaQA: A large scale distantly supervised chal- lenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers) , pages 1601– 1611, Vancouver, Canada, 2017. Association for Computa- tional Linguistics. 6
work page 2017
-
[25]
Dense passage retrieval for open-domain question answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen- tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, 2020. Association for Computa- tional Linguistics. 1
work page 2020
-
[26]
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Ep- stein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering re- search. Tra...
work page 2019
-
[27]
Building and better understanding vision- language models: insights and future directions
Hugo Laurenc ¸on, Andr´es Marafioti, Victor Sanh, and Leo Tronchon. Building and better understanding vision- language models: insights and future directions. In Work- shop on Responsibly Building the Next Generation of Multi- modal Foundational Models, 2024. 5, 6
work page 2024
-
[28]
NV- embed: Improved techniques for training LLMs as generalist embedding models
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV- embed: Improved techniques for training LLMs as generalist embedding models. In The Thirteenth International Confer- ence on Learning Representations, 2025. 3
work page 2025
-
[29]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. InIn- ternational Conference on Machine Learning, ICML 2022 , pages 12888–12900, Baltimore, Maryland, USA, 2022. PMLR. 2
work page 2022
-
[30]
Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models
Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 14369–14387, Bangkok, Thailand, 2024. Association ...
work page 2024
-
[31]
Towards General Text Embeddings with Multi-stage Contrastive Learning
Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. CoRR, abs/2308.03281, 2023. 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In 13th European Conference on Computer Vision, ECCV 2014 , pages 740–755, Zurich, Switzerland, 2014. Springer. 3, 6, 14
work page 2014
-
[33]
PreFLMR: Scaling up fine-grained late-interaction multi- modal retrievers
Weizhe Lin, Jingbiao Mei, Jinghong Chen, and Bill Byrne. PreFLMR: Scaling up fine-grained late-interaction multi- modal retrievers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5294–5316, Bangkok, Thailand, 2024. Association for Computational Linguistics. 3, 14
work page 2024
-
[34]
Visual news: Benchmark and challenges in news im- age captioning
Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Or- donez. Visual news: Benchmark and challenges in news im- age captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages 6761–6771, Online and Punta Cana, Dominican Republic,
work page 2021
-
[35]
Association for Computational Linguistics. 3, 13
-
[36]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 1, 2
work page 2023
-
[37]
Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 2
work page 2024
-
[38]
EDIS: Entity-driven image search over multimodal web content
Siqi Liu, Weixi Feng, Tsu-Jui Fu, Wenhu Chen, and William Wang. EDIS: Entity-driven image search over multimodal web content. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 4877–4894, Singapore, 2023. Association for Computational Linguistics. 3, 14
work page 2023
-
[39]
Image retrieval on real-life images with pre-trained vision-and-language models
Zheyuan Liu, Cristian Rodriguez Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021 , pages 2105–2114, Montreal, Canada, 2021. IEEE. 3, 5, 14
work page 2021
-
[40]
Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, Zhiyuan Liu, and Ge Yu. Universal vision-language dense retrieval: Learning a unified representation space for multi-modal re- trieval. In The Eleventh International Conference on Learn- ing Representations, 2023. 1, 2
work page 2023
-
[41]
End-to-end knowledge retrieval with multi- 10 modal queries
Man Luo, Zhiyuan Fang, Tejas Gokhale, Yezhou Yang, and Chitta Baral. End-to-end knowledge retrieval with multi- 10 modal queries. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8573–8589, Toronto, Canada, 2023. As- sociation for Computational Linguistics. 3, 14
work page 2023
-
[42]
Unifying multimodal retrieval via document screenshot embedding
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. In Proceedings of the 2024 Confer- ence on Empirical Methods in Natural Language Processing, pages 6492–6505, Miami, Florida, USA, 2024. Association for Computational Linguistics. 2, 6, 7
work page 2024
-
[43]
Www’18 open challenge: Financial opinion min- ing and question answering
Macedo Maia, Siegfried Handschuh, Andr ´e Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. Www’18 open challenge: Financial opinion min- ing and question answering. In Companion of the The Web Conference 2018 on The Web Conference 2018, pages 1941– 1942, Lyon, France, 2018. ACM. 3
work page 2018
-
[44]
OK-VQA: A visual question answer- ing benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answer- ing benchmark requiring external knowledge. In IEEE Con- ference on Computer Vision and Pattern Recognition, CVPR 2019, pages 3195–3204, Long Beach, CA, USA, 2019. Com- puter Vision Foundation / IEEE. 3, 14
work page 2019
-
[45]
Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawa- har. Docvqa: A dataset for VQA on document images. In IEEE Winter Conference on Applications of Computer Vi- sion, WACV 2021 , pages 2199–2208, Waikoloa, HI, USA,
work page 2021
-
[46]
Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V . Jawahar. Infograph- icvqa. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022, pages 2582–2591. IEEE, 2022. 3
work page 2022
-
[47]
Thomas Mensink, Jasper R. R. Uijlings, Llu ´ıs Castrej´on, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, Andr ´e Ara´ujo, and Vittorio Ferrari. Encyclopedic VQA: visual questions about detailed properties of fine-grained cate- gories. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, pages 3090–3101, Paris, France, 2023. IEEE. 3, 5, 14
-
[48]
MS MARCO: A human generated machine reading comprehension dataset
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016, Barcelona, Spain, 2016. CEUR-WS.org. 3, 4, 6
- [49]
-
[50]
OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,
-
[51]
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, pages 2641–2649, Santiago, Chile, 2015. IEEE Computer Society. 3, 14
-
[52]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, pages 8748–8763....
-
[53]
SQuAD: 100,000+ questions for machine comprehension of text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics. 6
-
[54]
Contrastive learning with hard negative samples
Joshua David Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. In International Conference on Learning Representations, 2021. 4
-
[55]
LAION-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text m...
work page 2022
-
[56]
BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. 2, 3, 7, 13, 16, 18
-
[57]
FEVER: a large-scale dataset for fact extraction and VERification
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louis...
-
[58]
Representation Learning with Contrastive Predictive Coding
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. 4
-
[59]
Trec-covid: constructing a pandemic information retrieval test collection
Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. Trec-covid: constructing a pandemic information retrieval test collection. SIGIR Forum, 54(1), 2021. 3
-
[60]
Retrieval of the best counterargument without prior topic knowledge
Henning Wachsmuth, Shahbaz Syed, and Benno Stein. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241–251, Melbourne, Australia, 2018. Association for Computational Linguistics. 3
-
[61]
Fact or fiction: Verifying scientific claims
David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, Online, 2020. Association for Computational Linguistics. 3
-
[62]
A Comprehensive Survey on Cross-modal Retrieval
Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. A comprehensive survey on cross-modal retrieval. CoRR, abs/1607.06215, 2016. 2
-
[63]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. CoRR, abs/2212.03533, 2022. 3
-
[64]
Improving text embeddings with large language models
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11897–11916, Bangkok, Thailand, 2024. Association for Computational Linguistics. 3, 4
-
[65]
ONE-PEACE: exploring one general representation model toward unlimited modalities
Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, and Chang Zhou. ONE-PEACE: exploring one general representation model toward unlimited modalities. CoRR, abs/2305.11172, 2023. 6, 7
-
[66]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. CoRR, abs/2409.12191, 2024. 1, 2, 6, 15
-
[67]
Uniir: Training and benchmarking universal multimodal information retrievers
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers. In 18th European Conference on Computer Vision, page 387–404, Milan, Italy, 2024. Springer-Verlag. 1, 2, 3, 6, 7, 16, 18
-
[68]
Fashion IQ: A new dataset towards retrieving images by natural language feedback
Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogério Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pages 11307–11317. Computer Vision Foundation / IEEE, 2021. 3, 14
-
[69]
C-pack: Packed resources for general chinese embeddings
Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, page 641–649, New York, NY, USA, 2024. Association for Computing Machinery. 3
-
[70]
Approximate nearest neighbor negative contrastive learning for dense text retrieval
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations, 2021. 4
-
[71]
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium, 2018. Association for Computation...
-
[72]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm-v: A GPT-4V level MLLM on your phone. CoRR, abs/2408.01800, 2024. 2
-
[73]
A survey on multimodal large language models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12), 2024. 2
-
[74]
Dense text retrieval based on pretrained language models: A survey
Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. Dense text retrieval based on pretrained language models: A survey. ACM Trans. Inf. Syst., 42(4):89:1–89:60, 2024. 2
-
[75]
VISTA: Visualized text embedding for universal multi-modal retrieval
Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, and Yongping Xiong. VISTA: Visualized text embedding for universal multi-modal retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3185–3200, Bangkok, Thailand, 2024. Association for Computational Linguistics. 1, 2, 6, 7
-
[76]
MARVEL: Unlocking the multi-modal capability of dense retrieval via visual module plugin
Tianshuo Zhou, Sen Mei, Xinze Li, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu, Yu Gu, and Ge Yu. MARVEL: Unlocking the multi-modal capability of dense retrieval via visual module plugin. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14608–14624, Bangkok, Thailand,
-
[77]
Association for Computational Linguistics. 1, 2
-
[78]
Towards complex document understanding by discrete reasoning
Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang, and Tat-Seng Chua. Towards complex document understanding by discrete reasoning. In MM ’22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022, pages 4857–4866. ACM,
-
[79]
I →T”, which retrieves the caption given an image and “T→I
UMRB Details
Table 6 summarizes all UMRB tasks along with their statistics. Table 14 provides examples of different task types. Below is a brief description of each dataset included in the UMRB.
7.1. Single-Modal Tasks
WebQA [4] This dataset is derived from Wikipedia. In the T→T setup, both the query and candidate are text. The objective is to find a Wi...
-
[80]
Additionally, we provide results from other benchmarks, including BEIR, M-BEIR, and ViDoRe
Results Details
In this section, we present the detailed scores achieved by our GME and the baseline models on various tasks. Additionally, we provide results from other benchmarks, including BEIR, M-BEIR, and ViDoRe.
8.1. Detailed Results on UMRB
Table 7 presents the detailed evaluation results of the baseline systems alongside our GME on UMRB task...