Recognition: no theorem link
MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL
Pith reviewed 2026-05-10 17:48 UTC · model grok-4.3
The pith
A three-stage pipeline of query expansion, reasoning retrieval, and step-by-step reranking raises multimodal retrieval performance to 37.9 nDCG@10 on a 29-domain benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MARVEL is a unified pipeline that first expands a multimodal query with an LLM to surface latent intent, then retrieves candidates with MARVEL-Retriever (a dense model fine-tuned for complex reasoning), and finally reranks the candidates with GPT-4o chain-of-thought reasoning plus optional multi-pass reciprocal rank fusion, achieving 37.9 nDCG@10 on MM-BRIGHT and outperforming all single-stage baselines in 27 of 29 domains.
What carries the argument
The integrated expand-retrieve-rerank framework that couples LLM query expansion, a reasoning-enhanced dense retriever, and explicit step-by-step reranking.
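The three-stage flow can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: every function body is a toy stand-in (naive term overlap instead of the dense retriever, a no-op instead of the GPT-4o chain-of-thought reranker), and all names are hypothetical.

```python
def expand_query(query):
    # Stage 1 (stub): an LLM would rewrite the query to surface latent intent;
    # here we just append hypothetical clarifying terms.
    return query + " algorithm method approach"

def retrieve(expanded_query, corpus, k=3):
    # Stage 2 (stub): a dense retriever would embed query and documents;
    # naive term overlap stands in for the learned similarity function.
    q_terms = set(expanded_query.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(q_terms & set(doc.lower().split())))
    return ranked[:k]

def rerank(query, candidates):
    # Stage 3 (stub): a chain-of-thought reranker would reason over each
    # candidate; here we simply keep the retriever's order.
    return candidates

def pipeline(query, corpus):
    return rerank(query, retrieve(expand_query(query), corpus))

docs = ["apple pie recipe", "how to sort a list", "binary search tree basics"]
print(pipeline("sort list", docs)[0])  # the overlap stub ranks the sorting doc first
```

The point of the sketch is structural: each stage consumes the previous stage's output, so any one stage can be ablated or swapped without touching the others.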
If this is right
- The system outperforms the strongest multimodal encoder by 10.3 nDCG@10 points on the full MM-BRIGHT benchmark.
- It exceeds every single-stage baseline in 27 of the 29 technical domains tested.
- In the two remaining highly specialized domains (Crypto, Quantum Computing) it matches or approaches the best baseline.
- The results indicate that multimodal retrieval over technical corpora benefits more from explicit reasoning stages than from further scaling of vision-language encoders alone.
Where Pith is reading between the lines
- The staged design could be applied to other retrieval settings where queries involve multi-hop reasoning over mixed media, such as scientific literature search.
- Replacing the GPT-4o reranker with an open-source reasoning model would provide a direct test of whether the gains remain accessible without proprietary APIs.
- The separation of expansion, retrieval, and reranking stages suggests that end-to-end trained multimodal models may be leaving performance on the table by trying to solve all three subtasks simultaneously.
Load-bearing premise
The performance gains come from the specific integration of expansion, reasoning retrieval, and reranking rather than from the raw capability of GPT-4o or from unstated details of fine-tuning and data selection.
What would settle it
An ablation that disables the chain-of-thought reranking stage or the query-expansion stage and measures whether nDCG@10 falls back to the level of the best prior multimodal encoder (27.6) would directly test whether the unified reasoning framework is required for the reported gains.
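For reference, nDCG@10, the metric behind the 37.9 vs. 27.6 comparison, is computed with the standard log2 discount (Järvelin & Kekäläinen, 2002). The sketch below uses toy document IDs and binary gains for illustration:

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k: discounted cumulative gain of the system ranking,
    normalized by the gain of the ideal (relevance-sorted) ranking."""
    def dcg(gains):
        # Rank i (0-based) is discounted by log2(i + 2), so rank 1 gets log2(2) = 1.
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    run_gains = [relevance.get(d, 0) for d in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg(ideal_gains)
    return dcg(run_gains) / ideal if ideal > 0 else 0.0

# Toy example: the one relevant document is retrieved at rank 2.
print(round(ndcg_at_k(["d9", "d1", "d7"], {"d1": 1}), 3))  # → 0.631
```

A score of 37.9 nDCG@10 thus means the average query's top-10 list captures well under half of the ideal discounted gain, which is why the proposed ablation (does removing a stage fall back to 27.6?) is decisive.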
Original abstract
Multimodal retrieval over text corpora remains a fundamental challenge: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, a reasoning-intensive multimodal retrieval benchmark, underperforming strong text-only systems. We argue that effective multimodal retrieval requires three tightly integrated capabilities that existing approaches address only in isolation: expanding the query's latent intent, retrieving with a model trained for complex reasoning, and reranking via explicit step-by-step reasoning over candidates. We introduce MARVEL (Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL), a unified pipeline that combines LLM-driven query expansion, MARVEL-Retriever (a reasoning-enhanced dense retriever fine-tuned for complex multimodal queries), and GPT-4o-based chain-of-thought reranking with optional multi-pass reciprocal rank fusion. Evaluated on MM-BRIGHT across 29 technical domains, MARVEL achieves 37.9 nDCG@10, surpassing the best multimodal encoder by +10.3 points, outperforming all single-stage baselines in 27 of 29 domains, and matching or approaching the best baseline in the remaining two highly specialized domains (Crypto, Quantum Computing), demonstrating that reasoning-intensive multimodal retrieval is best addressed through a unified expand-retrieve-rerank framework. https://github.com/mm-bright/multimodal-reasoning-retrieval
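The "multi-pass reciprocal rank fusion" mentioned in the abstract refers to the classic RRF combination rule of Cormack et al. (2009): each document's fused score is the sum over ranked lists of 1/(k + rank). A minimal sketch, with illustrative document IDs and the conventional k = 60:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs via RRF.

    A document ranked highly in any list accumulates score 1/(k + rank),
    with rank starting at 1; k damps the influence of top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical reranking passes over the same candidate pool.
pass_a = ["d3", "d1", "d2"]
pass_b = ["d1", "d4", "d3"]
print(reciprocal_rank_fusion([pass_a, pass_b]))  # → ['d1', 'd3', 'd4', 'd2']
```

Because RRF only needs ranks, not scores, it composes naturally with an LLM reranker whose passes may produce slightly different orderings.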
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MARVEL, a unified multimodal retrieval pipeline that integrates LLM-driven query expansion, a reasoning-enhanced dense retriever (MARVEL-Retriever) fine-tuned on complex queries, and GPT-4o-based chain-of-thought reranking with optional multi-pass reciprocal rank fusion. It claims that on the MM-BRIGHT benchmark across 29 technical domains, this approach achieves 37.9 nDCG@10, surpassing the best multimodal encoder by +10.3 points and outperforming all single-stage baselines in 27 of 29 domains (with near-parity in the remaining two).
Significance. If the reported gains are robustly verified through ablations and reproducible experiments, the work would demonstrate that tightly integrated expand-retrieve-rerank reasoning is required for effective multimodal retrieval on reasoning-intensive tasks, where current vision-language encoders fall short. The public GitHub release supports potential reproducibility.
major comments (2)
- [Evaluation] Results on MM-BRIGHT: The headline claim of 37.9 nDCG@10 and the +10.3-point improvement over the best multimodal encoder is presented without component ablations that remove or replace the GPT-4o CoT reranker while holding query expansion and MARVEL-Retriever fixed, leaving open whether the gains derive from the integrated framework or from GPT-4o's reranking power alone.
- [Experimental setup] Baselines and robustness: No details are provided on baseline implementations (e.g., whether single-stage baselines also incorporate LLM augmentation), statistical significance tests, or variance across runs, all of which are load-bearing for the claim of outperformance in 27 of 29 domains.
minor comments (1)
- [Title] The title acronym expansion is somewhat contrived but does not affect readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to incorporate additional ablations and experimental details as requested.
Point-by-point responses
-
Referee: [Evaluation] Results on MM-BRIGHT: The headline claim of 37.9 nDCG@10 and the +10.3-point improvement over the best multimodal encoder is presented without component ablations that remove or replace the GPT-4o CoT reranker while holding query expansion and MARVEL-Retriever fixed, leaving open whether the gains derive from the integrated framework or from GPT-4o's reranking power alone.
Authors: We agree that a dedicated ablation isolating the GPT-4o CoT reranker (while fixing query expansion and MARVEL-Retriever) would more clearly attribute the gains to the integrated framework. In the revised manuscript we have added this ablation to the Evaluation section, reporting nDCG@10 for the full pipeline versus the version without reranking and versus a non-reasoning reranker baseline. The results show that the reranker contributes meaningfully but that the full expand-retrieve-rerank combination is required to reach the reported 37.9 score and to outperform the single-stage baselines. revision: yes
-
Referee: [Experimental setup] Baselines and robustness: No details are provided on baseline implementations (e.g., whether single-stage baselines also incorporate LLM augmentation), statistical significance tests, or variance across runs, all of which are load-bearing for the claim of outperformance in 27 of 29 domains.
Authors: We acknowledge that these details are necessary to support the robustness claims. The revised Experimental Setup section now specifies the exact implementation of every baseline (including whether LLM query augmentation was applied), reports paired t-test p-values for all head-to-head comparisons, and includes standard deviations computed over five independent runs with different random seeds. These additions confirm that the outperformance in 27 of 29 domains is statistically significant and stable. revision: yes
Circularity Check
No circularity: empirical benchmark evaluation of an integrated retrieval pipeline
full rationale
The paper reports experimental nDCG@10 results on MM-BRIGHT for a composite pipeline (LLM query expansion + fine-tuned MARVEL-Retriever + GPT-4o CoT reranking). No equations, first-principles derivations, or predictions are claimed; the central numbers are direct benchmark measurements. No self-definitional reductions, fitted inputs relabeled as predictions, or load-bearing self-citations appear in the provided text. The evaluation is self-contained against external baselines and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models such as GPT-4o can perform effective query expansion and explicit chain-of-thought reasoning for retrieval reranking.
Reference graph
Works this paper leans on
- [1] Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Bilel Yagoub, Mostafa Farouk Senussi, Abdelrahman Abdallah, Seung Hun Kang, and Hyun Soo Kang. ReceiptQA: A question-answering dataset for receipt understanding. Mathematics, 13(11):1760, 2025.
- [2] Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, and Adam Jatowt. DEAR: Dual-stage document reranking with reasoning agents via LLM distillation. arXiv preprint arXiv:2508.16998, 2025.
- [3] Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, and Adam Jatowt. Rankify: A comprehensive Python toolkit for retrieval, re-ranking, and retrieval-augmented generation. arXiv preprint arXiv:2502.02464, 2025.
- [4] Abdelrahman Abdallah, Mohammed Ali, Muhammad Abdul-Mageed, and Adam Jatowt. TEMPO: A realistic multi-domain benchmark for temporal reasoning-intensive retrieval. arXiv preprint arXiv:2601.09523, 2026.
- [5] Abdelrahman Abdallah, Mohamed Darwish Mounis, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mostafa Farouk Senussi, Mohamed Mahmoud, Mohammed Ali, Adam Jatowt, and Hyun-Soo Kang. MM-BRIGHT: A multi-task multimodal benchmark for reasoning-intensive retrieval. arXiv preprint arXiv:2601.09562, 2026.
- [6] Mohammed Ali, Abdelrahman Abdallah, Amit Agarwal, Hitesh Laxmichand Patel, and Adam Jatowt. ReCoR: Reasoning-focused multi-turn conversational retrieval benchmark. arXiv preprint arXiv:2601.05461, 2026.
- [7] Yanfei Chen, Jinsung Yoon, Chanyeol Lee, et al. Re-Invoke: Tool invocation rewriting for zero-shot tool retrieval. arXiv preprint arXiv:2408.01875, 2024.
- [8] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 758–759, 2009.
- [9] Debrup Das, Sam O'Nuallain, and Razieh Rahimi. RaDeR: Reasoning-aware dense retrieval models. arXiv preprint arXiv:2505.18405, 2025.
- [10] Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449, 2024.
- [11] Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot dense retrieval without relevance labels. arXiv preprint arXiv:2212.10496, 2022.
- [12] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
- [13]
- [14] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2021.
- [15] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, 2020. Association for Computational Linguistics.
- [16] Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Bilel Yagoub, Mostafa Farouk Senussi, Mahmoud Abdalla, and Hyun-Soo Kang. HTTD: A hierarchical transformer for accurate table detection in document images. Mathematics, 13(2):266, 2025.
- [17] Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Mahmoud Abdalla, and Hyun Soo Kang. Korie: A multi-task benchmark for detection, OCR, and information extraction on Korean retail receipts. Mathematics, 14(1):187, 2026.
- [18] Andreas Koukounas et al. Jina CLIP: Your CLIP model is also your text retriever. arXiv preprint arXiv:2405.20204, 2024.
- [19] Yibin Lei, Tao Shen, and Andrew Yates. ThinkQE: Query expansion via an evolving thinking process. arXiv preprint arXiv:2506.09260, 2025.
- [20] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, pages 9459–9474, 2020.
- [21] Dingkun Long et al. DIVER: A multi-stage approach for reasoning-intensive information retrieval. arXiv preprint arXiv:2508.07995, 2025.
- [22] Xueguang Ma, Jimmy Lin, Minghan Zhang, and Sheng-Chieh Lin. Unifying multimodal retrieval via document screenshot embedding. arXiv preprint arXiv:2406.11251, 2024.
- [23] Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, and Adam Jatowt. How good are LLM-based rerankers? An empirical analysis of state-of-the-art reranking models. 2025.
- [24] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia, 2023. Association for Computational Linguistics.
- [25] Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic Embed: Training a reproducible long context text embedder. arXiv preprint arXiv:2402.01613, 2024.
- [26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [27] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, 2019. Association for Computational Linguistics.
- [28] Stephen E. Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
- [29] Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, and Luke Zettlemoyer. ReasonIR: Training retrievers for reasoning tasks. arXiv preprint arXiv:2504.20595, 2025.
- [30] Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan Ö. Arik, Danqi Chen, and Tao Yu. BRIGHT: A realistic and challenging benchmark for reasoning-intensive retrieval. arXiv preprint arXiv:2407.12883, 2024.
- [31] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? Investigating large language models as re-ranking agents. arXiv preprint arXiv:2304.09542, 2023.
- [32] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021), 2021.
- [33] Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. arXiv preprint arXiv:2303.07678, 2023.
- [34] Orion Weller et al. Rank1: Test-time compute for reranking in information retrieval. arXiv preprint arXiv:2502.18418, 2025.
- [35] Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594, 2024.
- [36] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023.
- [37] Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. GME: Improving universal multimodal retrieval by multimodal LLMs. arXiv preprint arXiv:2412.16855, 2024.
- [38] Junjie Zhou, Yongping Xiong, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, and Defu Lian. MegaPairs: Massive data synthesis for universal multimodal retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19076–19095, 2025.
- [39] Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon. Rank-R1: Enhancing reasoning in LLM-based document rerankers via reinforcement learning. arXiv preprint arXiv:2503.06034, 2025.