Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)
Pith reviewed 2026-06-28 10:29 UTC · model grok-4.3
The pith
Decoder-based Qwen2-VL embedders power the winning systems in the multimodal document retrieval challenge, with a training-free entry nearly matching the fine-tuned leader.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All three winning teams built their systems on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders. The teams differed mainly in whether they reached the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner on the macro-averaged metric.
What carries the argument
Decoder-based Multimodal-LLM embedders from the Qwen2-VL family, which generate the embeddings used for ranking in both retrieval regimes.
If this is right
- Decoder-based embedders from the Qwen2-VL family outperform CLIP-style encoders on both closed-set document page retrieval and open-domain image-based passage retrieval.
- Training-free methods using zero-shot late interaction can reach performance levels within 0.1 points of heavily fine-tuned ensembles.
- Multi-route fusion combined with a vision-language re-ranker offers a competitive alternative to full fine-tuning.
- A single system architecture can effectively address complementary retrieval regimes when evaluated on the combined macro-average metric.
Where Pith is reading between the lines
- The pattern suggests decoder-only multimodal models may generalize better than contrastive encoders when documents contain interleaved figures, tables, and charts.
- The near-parity of training-free systems could reduce the data and compute barriers for deploying multimodal retrievers in practice.
- These findings point toward testing whether similar decoder embedders maintain their edge on larger collections or additional languages.
- The results have direct bearing on retrieval-augmented generation pipelines that must surface visually rich content accurately.
Load-bearing premise
The macro-average of mean Recall at 1, 3 and 5 over the two tasks provides a fair and representative ranking of retrieval systems across the two regimes.
What would settle it
A retrieval system built on a CLIP-style encoder that scores higher than the Qwen2-VL winners on the same macro-averaged Recall metric across both tasks would falsify the observed pattern.
Figures
read the original abstract
Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The \emph{Multimodal Document Retrieval Challenge}, Track~1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a \emph{single} retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@$\{1,3,5\}$ over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within $0.1$ point of the fine-tuned winner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is an overview report on the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1). It describes the challenge motivation for multimodal retrieval over visually-rich documents, the two tasks (closed-set MMDocIR page retrieval from text queries and open-domain M2KR passage retrieval from image or image+text queries), the ranking metric (macro-average of mean Recall@{1,3,5} across tasks), participation (455 entrants, 586 submissions, 22 teams), and the architectures and relative performance of the top three systems, all of which rely on Qwen2-VL decoder-based Multimodal-LLM embedders rather than CLIP-style encoders, with a training-free multi-route system finishing within 0.1 points of the fine-tuned winner.
Significance. If the reported participation numbers, standings, and system descriptions hold, the paper supplies a useful empirical snapshot of current practice in multimodal document retrieval. It documents the shift toward decoder-based MLLM embedders and the viability of training-free fusion approaches, which can inform subsequent work on retrieval-augmented generation over interleaved text, figures, and tables.
minor comments (3)
- [Abstract] The abstract and § on final standings refer to a '0.1-point gap' without quoting the exact macro-average scores of the top two systems; adding these numbers would improve precision.
- A compact table listing the three winning teams, their key design choices (fine-tuning, fusion, late interaction), and per-task Recall values would make the comparative analysis easier to follow.
- [Challenge Design] The description of the M2KR task should explicitly note the size and source of the Wikipedia passage corpus used for open-domain retrieval to allow readers to assess scale.
Simulated Author's Rebuttal
We thank the referee for the careful reading and positive assessment of our manuscript, including the recommendation for minor revision. The referee summary accurately reflects the challenge overview, participation statistics, and key findings regarding Qwen2-VL-based systems.
Circularity Check
No significant circularity
full rationale
The paper is a descriptive challenge overview report with no equations, derivations, predictions, or first-principles claims. It reports external team submissions, architectures, and standings under fixed challenge rules and metrics without advancing any internal mathematical reduction, fitted parameter, or self-referential derivation. All content is observational reporting of independent external results, so no load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 [cs.CL]
Pith/arXiv arXiv 2024
-
[2]
Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. 2023. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computat...
-
[3]
Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. 2024. M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi- document Understanding. https://arxiv.org/abs/2411.04952v1
arXiv 2024
-
[4]
Kuicai Dong, Yujing Chang, Derrick Goh Xin Deik, Dexun Li, Ruiming Tang, and Yong Liu. 2025. MMDocIR: Benchmarking Multimodal Retrieval for Long Documents. InProceedings of the 2025 Conference on Empirical Methods in Nat- ural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Comput...
-
[5]
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2024. ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449 [cs.IR] https://arxiv.org/abs/2407. 01449
Pith/arXiv arXiv 2024
-
[6]
Junchen Fu, Xuri Ge, Xin Xin, Haitao Yu, Yue Feng, Alexandros Karatzoglou, Ioan- nis Arapakis, and Joemon Jose. 2025. The 1st EReL@MIR Workshop on Efficient Representation Learning for Multimodal Information Retrieval. InCompanion Proceedings of the ACM on Web Conference 2025(Sydney NSW, Australia)(WWW ’25). Association for Computing Machinery, New York, ...
-
[7]
Bohan Hou, Haoqiang Lin, Xuemeng Song, Haokun Wen, and Liqiang Nie. 2025. Visual Anchor Point for Multimodal Document Retrieval. https://github.com/ hbhalpha/MDR. Winning solution, EReL@MIR 2025 MIRC Track 1
2025
-
[8]
Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Ken- ton Lee, Kristina Toutanova, and Ming-Wei Chang. 2023. Open-domain Vi- sual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities. arXiv:2302.11154 (Feb. 2023). http://arxiv.org/abs/2302.11154 arXiv:2302.11154 [cs]
arXiv 2023
-
[9]
Bargav Jagatha and Abhishek Varshney. 2025. Multimodal Information Retrieval Challenge Solution. https://github.com/bargav25/MultiModal_ InformationRetrieval. Third-place solution, EReL@MIR 2025 MIRC Track 1
2025
-
[10]
Ting Jiang, Shaohan Huang, Minghui Song, Zihan Zhang, Haizhen Huang, Liang Wang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, deqing wang, and Fuzhen Zhuang. 2025. E5-V: Universal Embeddings with Multimodal Large Language Models. https://openreview.net/forum?id=rD6LQagatR
2025
-
[11]
Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 39–48. https://doi.org/10.1145/3397271.3401075
-
[12]
Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, and Bill Byrne. 2023. Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering. InAdvances in Neu- ral Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 22820–2284...
2023
-
[13]
Weizhe Lin, Jingbiao Mei, Jinghong Chen, and Bill Byrne. 2024. PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok,...
-
[14]
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin
-
[15]
Unifying Multimodal Retrieval via Document Screenshot Embedding. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 6492–6505. https://doi.org/10.18653/v1/2024.emnlp-main.373
-
[16]
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi
-
[17]
InConference on Computer Vision and Pattern Recognition (CVPR)
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. InConference on Computer Vision and Pattern Recognition (CVPR)
-
[18]
Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. 2023. Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, Paris, France, 3090–3101. https://doi.org/10.1109/IC...
-
[19]
Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatu...
2023
-
[20]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. InProceedings of the 38th International Conference on Machine Learning. PMLR, 8748–8763. https...
2021
-
[21]
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguisti...
-
[22]
Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. 2021. WIT: Wikipedia-based Image Text Dataset for Multimodal Multi- lingual Machine Learning. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2443–2449
2021
-
[23]
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191 (Oct. 2024). htt...
Pith/arXiv arXiv 2024
-
[24]
Mingjun Xu, Zehui Wang, Hengxing Cai, and Renxin Zhong. 2025. A Multi- Granularity Retrieval Framework for Visually-Rich Documents. arXiv:2505.01457 (May 2025). https://doi.org/10.48550/arXiv.2505.01457 arXiv:2505.01457 [cs.IR]. Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)
-
[25]
Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zheng- hao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. Vis- RAG: Vision-based Retrieval-augmented Generation on Multi-modality Doc- uments. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=zG459X3Xge
2025
-
[26]
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid Loss for Language Image Pre-Training. In2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, Paris, France, 11941–11952. https: //doi.org/10.1109/ICCV51070.2023.01100
-
[27]
Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. 2025. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs. arXiv:2412.16855 (April 2025). https://doi.org/10.48550/arXiv.2412.16855 arXiv:2412.16855 [cs]
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.16855 2025
-
[28]
Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, and Yongping Xiong. 2024. VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Ban...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.