Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

Chunyi Peng; Ge Yu; Maosong Sun; Minghe Yu; Shuo Wang; Yishan Li; Yu Gu; Yukun Yan; Zhenghao Liu; Zhipeng Xu

arxiv: 2505.22095 · v2 · submitted 2025-05-28 · 💻 cs.CL

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

Chunyi Peng , Zhipeng Xu , Zhenghao Liu , Yishan Li , Yukun Yan , Shuo Wang , Yu Gu , Minghe Yu

show 2 more authors

Ge Yu Maosong Sun

This is my paper

Pith reviewed 2026-05-19 13:43 UTC · model grok-4.3

classification 💻 cs.CL

keywords multimodal retrieval-augmented generationmixture of retrieval expertsreasoning-guided selectionmultimodal large language modelsstepwise policy optimizationopen-domain question answeringknowledge exploitationdynamic expert coordination

0 comments

The pith

Multimodal models gain accuracy on open-domain questions by learning to pick the right retrieval expert at each step of their reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mixture-of-Retrieval Experts as a way for multimodal large language models to move beyond fixed retrieval routines and instead call on different knowledge sources depending on what the current reasoning step requires. A training procedure called Stepwise Group Relative Policy Optimization supplies rewards at each step so the model learns to coordinate several experts and combine their outputs rather than relying on sparse end-of-task feedback. The authors test this on multiple open-domain question-answering benchmarks and report average gains above 7 percent relative to prior retrieval-augmented methods. The core idea is that effective knowledge use depends on matching the retrieval tool to the model's momentary information gap. If the approach holds, models would treat retrieval as an active, reasoning-guided part of answering rather than a single upfront step.

Core claim

MoRE enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation by dynamically determining which expert to engage with conditioned on the evolving reasoning state, achieving average performance gains of over 7% on diverse open-domain QA benchmarks.

What carries the argument

Mixture-of-Retrieval Experts (MoRE) framework that conditions expert selection on the model's evolving reasoning state and is trained via Stepwise Group Relative Policy Optimization to produce fine-grained coordination rewards.

If this is right

The model can locate relevant external information more precisely by switching experts mid-reasoning.
Sparse outcome rewards are replaced by step-level signals that encourage full use of all available experts.
Heterogeneous retrieval systems become interchangeable tools rather than competing alternatives.
Reasoning traces become explicit about which knowledge source is consulted at each point.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selection logic might extend to non-retrieval tools such as code interpreters or image generators when the reasoning state indicates their utility.
Longer reasoning chains could reveal whether repeated expert switches accumulate errors or compound benefits.
If selection policies prove robust, retrieval-augmented systems could reduce the need for ever-larger internal knowledge stores.

Load-bearing premise

Step-GRPO training can produce stable expert-selection policies that continue to work well on new queries and new expert combinations.

What would settle it

Evaluating MoRE on a fresh collection of multimodal open-domain questions and observing no gain or a drop relative to single-expert baselines would show the dynamic coordination does not generalize.

Figures

Figures reproduced from arXiv: 2505.22095 by Chunyi Peng, Ge Yu, Maosong Sun, Minghe Yu, Shuo Wang, Yishan Li, Yu Gu, Yukun Yan, Zhenghao Liu, Zhipeng Xu.

**Figure 2.** Figure 2: The Architecture of Stepwise GRPO Training Used by MoRE. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 4.** Figure 4: The Prompt Template of MoRE for Answer Genera [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 3.** Figure 3: The Prompt Template of MoRE (Step-GRPO) for [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 5.** Figure 5: Training and Inference Effectiveness of Step-GRPO. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Expert Interaction Distribution across Reasoning [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Case Studies. We select two cases from the Visual QA task to illustrate different expert interaction mechanisms. The [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

read the original abstract

Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge. However, existing methods typically adhere to rigid retrieval paradigms by mimicking fixed retrieval trajectories and thus fail to fully exploit the knowledge of different retrieval experts through dynamic interaction based on the model's knowledge needs or evolving reasoning states. To overcome this limitation, we introduce Mixture-of-Retrieval Experts (MoRE), a novel framework that enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation. Specifically, MoRE learns to dynamically determine which expert to engage with, conditioned on the evolving reasoning state. To effectively train this capability, we propose Stepwise Group Relative Policy Optimization (Step-GRPO), which goes beyond sparse outcome-based supervision by encouraging MLLMs to interact with multiple retrieval experts and synthesize fine-grained rewards, thereby teaching the MLLM to fully coordinate all experts when answering a given query. Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, achieving average performance gains of over 7% compared to competitive baselines. Notably, MoRE exhibits strong adaptability by dynamically coordinating heterogeneous experts to precisely locate relevant information, validating its capability for robust, reasoning-driven expert collaboration. All codes and data are released on https://github.com/OpenBMB/MoRE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MoRE adds dynamic reasoning-state routing to retrieval experts in MRAG with Step-GRPO training, but the abstract leaves the experimental details too thin to confirm the mechanism drives the gains.

read the letter

The main point is that this paper introduces a Mixture-of-Retrieval Experts framework where an MLLM learns to switch among different retrieval specialists depending on its current reasoning state, trained with a stepwise group relative policy optimization that supplies finer-grained rewards than final-answer supervision alone. That conditioning on evolving state is the clearest departure from the fixed-trajectory retrieval setups common in prior MRAG work. The authors also release code and data, which makes the contribution more usable than many similar submissions. The reported average lift above 7 percent on open-domain QA benchmarks is the empirical hook, and the framing of the problem as under-exploitation of heterogeneous experts is straightforward and on target. Step-GRPO itself looks like a sensible attempt to push the model toward multi-expert use rather than single-expert collapse. On the weaker side, the abstract supplies almost no information on baselines, dataset splits, variance, or how expert selections actually change across reasoning steps. That absence makes it hard to tell whether the gains come from true state-dependent collaboration or from learning static preferences per query type. The stress-test worry about mode collapse or benchmark overfitting is therefore still live; the paper would need ablations or usage statistics to close it. The citation pattern is conventional for the subfield and does not raise red flags. This is aimed at people already working on multimodal RAG and MLLM factuality who want a concrete routing mechanism they can try. A reader focused on training methods for dynamic expert selection will get the most out of the Step-GRPO description. The work is coherent enough on its own terms to merit referee time, even if the experiments will need tightening. I would send it for review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mixture-of-Retrieval Experts (MoRE), a framework enabling multimodal large language models (MLLMs) to dynamically select and collaborate with diverse retrieval experts conditioned on the evolving reasoning state during multimodal retrieval-augmented generation. It introduces Stepwise Group Relative Policy Optimization (Step-GRPO) to train this routing via fine-grained rewards synthesized from multi-expert interactions, going beyond sparse outcome supervision. Experiments on open-domain QA benchmarks report average gains exceeding 7% over competitive baselines, with public release of code and data.

Significance. If the empirical gains prove robust and the dynamic routing generalizes, MoRE could advance MRAG by replacing rigid retrieval trajectories with reasoning-state-conditioned expert coordination, offering a path to more effective knowledge exploitation and hallucination mitigation in MLLMs. The public code and data release is a clear strength supporting reproducibility.

major comments (2)

[Abstract] Abstract: the central claim of average performance gains of over 7% is presented without any description of the benchmarks, baselines, number of runs, error bars, or statistical tests. This gap is load-bearing for assessing whether the data support the effectiveness of MoRE and Step-GRPO.
[Method] Step-GRPO description: no explicit mechanism (entropy regularization, diversity penalty, or per-step credit assignment) is stated to prevent mode collapse or ensure the policy evolves with intermediate reasoning steps rather than learning static per-query expert selection. This directly affects the claim of collaborative, state-dependent exploitation.

minor comments (2)

[Abstract] Add a short experimental setup paragraph in the abstract or introduction to orient readers before the results claim.
[Introduction] Ensure consistent definition of acronyms (MLLM, MRAG, Step-GRPO) on first use throughout the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of average performance gains of over 7% is presented without any description of the benchmarks, baselines, number of runs, error bars, or statistical tests. This gap is load-bearing for assessing whether the data support the effectiveness of MoRE and Step-GRPO.

Authors: We acknowledge that the abstract's brevity omits these specifics. The main text (Section 4) fully describes the open-domain QA benchmarks, the set of competitive baselines, the number of runs, and reports means with standard deviations. To improve accessibility, we will revise the abstract to briefly reference the benchmark suite and note that gains are averaged across multiple runs with variability measures. revision: yes
Referee: [Method] Step-GRPO description: no explicit mechanism (entropy regularization, diversity penalty, or per-step credit assignment) is stated to prevent mode collapse or ensure the policy evolves with intermediate reasoning steps rather than learning static per-query expert selection. This directly affects the claim of collaborative, state-dependent exploitation.

Authors: We thank the referee for this observation. Step-GRPO synthesizes fine-grained rewards from multi-expert interactions at each reasoning step, which is intended to drive state-dependent routing rather than static per-query choices. The group-relative formulation inherently compares alternative expert trajectories within the same query. To make the safeguards explicit, we will add a description of the entropy regularization term in the policy objective and clarify the per-step credit assignment in the revised Method section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks

full rationale

The paper introduces MoRE and Step-GRPO as a novel framework for dynamic, reasoning-state-conditioned retrieval expert coordination in MLLMs. Its central claims of >7% average gains are presented as outcomes of experiments on open-domain QA benchmarks rather than any mathematical derivation or prediction that reduces by construction to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are shown that would force the reported results from the method's own inputs. The framework description and training procedure contain no load-bearing self-references that substitute for external validation; results are independently falsifiable via the released code and benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The paper introduces new framework MoRE and training method Step-GRPO; relies on standard assumptions that MLLMs can learn dynamic expert selection and that benchmark gains reflect genuine coordination improvements.

free parameters (1)

Step-GRPO training hyperparameters
Policy optimization parameters chosen or tuned to produce the reported expert coordination and performance gains.

axioms (1)

domain assumption MLLMs can be trained to dynamically select and coordinate retrieval experts based on evolving reasoning states
Core premise invoked in the design of MoRE and Step-GRPO as described in the abstract.

invented entities (2)

Mixture-of-Retrieval Experts (MoRE) no independent evidence
purpose: Framework for dynamic expert collaboration in multimodal retrieval
Newly introduced system whose effectiveness is demonstrated only through the paper's experiments.
Step-GRPO no independent evidence
purpose: Training algorithm for fine-grained expert interaction rewards
New optimization method proposed to train the dynamic selection capability.

pith-pipeline@v0.9.0 · 5802 in / 1461 out tokens · 78219 ms · 2026-05-19T13:43:19.221562+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MoRE learns to dynamically determine which expert to engage with, conditioned on the evolving reasoning state... Stepwise Group Relative Policy Optimization (Step-GRPO)
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

fine-grained rewards by integrating feedback from the interaction process with retrieval experts

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
cs.LG 2026-05 unverdicted novelty 7.0

SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...
Learning Agent Routing From Early Experience
cs.CL 2026-05 unverdicted novelty 6.0

BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 3 Pith papers · 16 internal anchors

[1]

Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Moham- madali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soley- mani Baghshah, and Ehsaneddin Asgari. 2025. Ask in Any Modality: A Compre- hensive Survey on Multimodal Retrieval-Augmented Generation.arXiv preprint arXiv:2502.08826(2025)

work page arXiv 2025
[2]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations

work page 2023
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-llava: Hierarchical retrieval- augmented generation for multimodal llms. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. 1818–1826

work page 2024
[6]

Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. 2022. Webqa: Multihop and multimodal qa. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16495–16504

work page 2022
[7]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Fan Yang, Zenan Zhou, Weipeng Chen, Haofen Wang, Jeff Z Pan, et al . 2025. Learning to Reason with Search for LLMs via Reinforcement Learning.arXiv preprint arXiv:2503.19470(2025). Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation Conference’17, July 2017, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William W Cohen. 2022. Murag: Multimodal retrieval-augmented generator for open question answering over images and text.arXiv preprint arXiv:2210.02928(2022)

work page arXiv 2022
[10]

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. Tabfact: A large-scale dataset for table-based fact verification.arXiv preprint arXiv:1909.02164(2019)

work page arXiv 2019
[11]

Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Lei Liang, Jinjie Gu, and Huajun Chen. 2024. Unified hallucination detection for multimodal large language models.arXiv preprint arXiv:2402.03190 (2024)

work page arXiv 2024
[12]

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. 2023. Can pre-trained vision and language models answer visual information-seeking questions?arXiv preprint arXiv:2302.11713(2023)

work page arXiv 2023
[13]

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. 2025. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Suyu Ge, Chenyan Xiong, Corby Rosset, Arnold Overwijk, Jiawei Han, and Paul Bennett. 2023. Augmenting Zero-Shot Dense Retrievers with Plug-in Mixture-of- Memories. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1796–1812

work page 2023
[15]

Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. 2025. DeepRAG: Thinking to Retrieval Step by Step for Large Language Models.arXiv preprint arXiv:2502.01142(2025)

work page arXiv 2025
[16]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps.arXiv preprint arXiv:2011.01060(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[18]

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Rein- forcement learning: A survey.Journal of artificial intelligence research4 (1996), 237–285

work page 1996
[19]

Sunjun Kweon, Yeonsu Kwon, Seonhee Cho, Yohan Jo, and Edward Choi. 2023. Open-wikitable: Dataset for open domain question answering with complex reasoning over table.arXiv preprint arXiv:2305.07288(2023)

work page arXiv 2023
[20]

Aritra Kumar Lahiri and Qinmin Vivian Hu. 2024. AlzheimerRAG: Multi- modal Retrieval Augmented Generation for PubMed articles.arXiv preprint arXiv:2412.16701(2024)

work page arXiv 2024
[21]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems33 (2020), 9459–9474

work page 2020
[22]

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025. Search-o1: Agentic search-enhanced large reasoning models.arXiv preprint arXiv:2501.05366(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Fei Huang, Jingren Zhou, et al. 2024. Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self- adaptive planning agent.arXiv preprint arXiv:2411.02937(2024)

work page arXiv 2024
[24]

Zhenghao Liu, Pengcheng Huang, Zhipeng Xu, Xinze Li, Shuliang Liu, Chunyi Peng, Haidong Xin, Yukun Yan, Shuo Wang, Xu Han, et al . 2025. Knowledge intensive agents.A vailable at SSRN 5459034(2025)

work page 2025
[25]

Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Yu Gu, Ge Yu, and Maosong Sun. 2025. Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts.arXiv preprint arXiv:2502.17297(2025)

work page arXiv 2025
[26]

Lang Mei, Siyu Mo, Zhihan Yang, and Chong Chen. 2025. A Survey of Multimodal Retrieval-Augmented Generation.arXiv preprint arXiv:2504.08748(2025)

work page arXiv 2025
[27]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems36 (2023), 53728–53741

work page 2023
[28]

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-Context Retrieval-Augmented Lan- guage Models.Transactions of the Association for Computational Linguistics(2023), 1316–1331

work page 2023
[29]

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy.arXiv preprint arXiv:2305.15294(2023)

work page arXiv 2023
[30]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al . 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Sahel Sharifymoghaddam, Shivani Upadhyay, Wenhu Chen, and Jimmy Lin. 2024. Unirag: Universal retrieval augmentation for multi-modal large language models. arXiv preprint arXiv:2405.10311(2024)

work page arXiv 2024
[32]

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models.arXiv preprint arXiv:2301.12652(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2503.05592 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal

work page
[35]

Interleaving retrieval with chain-of-thought reasoning for knowledge- intensive multi-step questions.arXiv preprint arXiv:2212.10509(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. 2024. Uniir: Training and benchmarking universal multimodal information retrievers. InEuropean Conference on Computer Vision. Springer, 387–404

work page 2024
[37]

Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. 2025. MMSearch-R1: Incentivizing LMMs to Search.arXiv preprint arXiv:2506.20670(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Yin Wu, Quanyu Long, Jing Li, Jianfei Yu, and Wenya Wang. 2025. Visual- rag: Benchmarking text-to-image retrieval augmented generation for visual knowledge intensive queries.arXiv preprint arXiv:2502.16636(2025)

work page arXiv 2025
[39]

Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, and Huaxiu Yao. 2024. Rule: Reliable multimodal rag for factuality in medical vision language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 1081–1093

work page 2024
[40]

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. 2025. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, and Sung Ju Hwang. 2025. UniversalRAG: Retrieval-Augmented Generation over Multiple Cor- pora with Diverse Modalities and Granularities.arXiv preprint arXiv:2504.20734 (2025)

work page internal anchor Pith review arXiv 2025
[42]

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zheng- hao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al . 2024. Visrag: Vision-based retrieval-augmented generation on multi-modality documents.arXiv preprint arXiv:2410.10594(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Xiaohan Yu, Zhihan Yang, and Chong Chen. 2025. Unveiling the Potential of Multimodal Retrieval Augmented Generation with Planning.arXiv preprint arXiv:2501.15470(2025)

work page arXiv 2025
[44]

Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, et al. 2024. m2RAG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA.arXiv preprint arXiv:2411.15041(2024)

work page arXiv 2024
[45]

Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, et al. 2023. Retrieving multimodal information for augmented generation: A survey.arXiv preprint arXiv:2303.10868(2023)

work page arXiv 2023

[1] [1]

Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Moham- madali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soley- mani Baghshah, and Ehsaneddin Asgari. 2025. Ask in Any Modality: A Compre- hensive Survey on Multimodal Retrieval-Augmented Generation.arXiv preprint arXiv:2502.08826(2025)

work page arXiv 2025

[2] [2]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations

work page 2023

[3] [3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-llava: Hierarchical retrieval- augmented generation for multimodal llms. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. 1818–1826

work page 2024

[6] [6]

Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. 2022. Webqa: Multihop and multimodal qa. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16495–16504

work page 2022

[7] [7]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Fan Yang, Zenan Zhou, Weipeng Chen, Haofen Wang, Jeff Z Pan, et al . 2025. Learning to Reason with Search for LLMs via Reinforcement Learning.arXiv preprint arXiv:2503.19470(2025). Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation Conference’17, July 2017, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William W Cohen. 2022. Murag: Multimodal retrieval-augmented generator for open question answering over images and text.arXiv preprint arXiv:2210.02928(2022)

work page arXiv 2022

[10] [10]

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. Tabfact: A large-scale dataset for table-based fact verification.arXiv preprint arXiv:1909.02164(2019)

work page arXiv 2019

[11] [11]

Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Lei Liang, Jinjie Gu, and Huajun Chen. 2024. Unified hallucination detection for multimodal large language models.arXiv preprint arXiv:2402.03190 (2024)

work page arXiv 2024

[12] [12]

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. 2023. Can pre-trained vision and language models answer visual information-seeking questions?arXiv preprint arXiv:2302.11713(2023)

work page arXiv 2023

[13] [13]

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. 2025. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Suyu Ge, Chenyan Xiong, Corby Rosset, Arnold Overwijk, Jiawei Han, and Paul Bennett. 2023. Augmenting Zero-Shot Dense Retrievers with Plug-in Mixture-of- Memories. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1796–1812

work page 2023

[15] [15]

Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. 2025. DeepRAG: Thinking to Retrieval Step by Step for Large Language Models.arXiv preprint arXiv:2502.01142(2025)

work page arXiv 2025

[16] [16]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps.arXiv preprint arXiv:2011.01060(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[18] [18]

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Rein- forcement learning: A survey.Journal of artificial intelligence research4 (1996), 237–285

work page 1996

[19] [19]

Sunjun Kweon, Yeonsu Kwon, Seonhee Cho, Yohan Jo, and Edward Choi. 2023. Open-wikitable: Dataset for open domain question answering with complex reasoning over table.arXiv preprint arXiv:2305.07288(2023)

work page arXiv 2023

[20] [20]

Aritra Kumar Lahiri and Qinmin Vivian Hu. 2024. AlzheimerRAG: Multi- modal Retrieval Augmented Generation for PubMed articles.arXiv preprint arXiv:2412.16701(2024)

work page arXiv 2024

[21] [21]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems33 (2020), 9459–9474

work page 2020

[22] [22]

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025. Search-o1: Agentic search-enhanced large reasoning models.arXiv preprint arXiv:2501.05366(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Fei Huang, Jingren Zhou, et al. 2024. Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self- adaptive planning agent.arXiv preprint arXiv:2411.02937(2024)

work page arXiv 2024

[24] [24]

Zhenghao Liu, Pengcheng Huang, Zhipeng Xu, Xinze Li, Shuliang Liu, Chunyi Peng, Haidong Xin, Yukun Yan, Shuo Wang, Xu Han, et al . 2025. Knowledge intensive agents.A vailable at SSRN 5459034(2025)

work page 2025

[25] [25]

Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Yu Gu, Ge Yu, and Maosong Sun. 2025. Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts.arXiv preprint arXiv:2502.17297(2025)

work page arXiv 2025

[26] [26]

Lang Mei, Siyu Mo, Zhihan Yang, and Chong Chen. 2025. A Survey of Multimodal Retrieval-Augmented Generation.arXiv preprint arXiv:2504.08748(2025)

work page arXiv 2025

[27] [27]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems36 (2023), 53728–53741

work page 2023

[28] [28]

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-Context Retrieval-Augmented Lan- guage Models.Transactions of the Association for Computational Linguistics(2023), 1316–1331

work page 2023

[29] [29]

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy.arXiv preprint arXiv:2305.15294(2023)

work page arXiv 2023

[30] [30]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al . 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Sahel Sharifymoghaddam, Shivani Upadhyay, Wenhu Chen, and Jimmy Lin. 2024. Unirag: Universal retrieval augmentation for multi-modal large language models. arXiv preprint arXiv:2405.10311(2024)

work page arXiv 2024

[32] [32]

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models.arXiv preprint arXiv:2301.12652(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2503.05592 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal

work page

[35] [35]

Interleaving retrieval with chain-of-thought reasoning for knowledge- intensive multi-step questions.arXiv preprint arXiv:2212.10509(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[36] [36]

Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. 2024. Uniir: Training and benchmarking universal multimodal information retrievers. InEuropean Conference on Computer Vision. Springer, 387–404

work page 2024

[37] [37]

Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. 2025. MMSearch-R1: Incentivizing LMMs to Search.arXiv preprint arXiv:2506.20670(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Yin Wu, Quanyu Long, Jing Li, Jianfei Yu, and Wenya Wang. 2025. Visual- rag: Benchmarking text-to-image retrieval augmented generation for visual knowledge intensive queries.arXiv preprint arXiv:2502.16636(2025)

work page arXiv 2025

[39] [39]

Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, and Huaxiu Yao. 2024. Rule: Reliable multimodal rag for factuality in medical vision language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 1081–1093

work page 2024

[40] [40]

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. 2025. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, and Sung Ju Hwang. 2025. UniversalRAG: Retrieval-Augmented Generation over Multiple Cor- pora with Diverse Modalities and Granularities.arXiv preprint arXiv:2504.20734 (2025)

work page internal anchor Pith review arXiv 2025

[42] [42]

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zheng- hao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al . 2024. Visrag: Vision-based retrieval-augmented generation on multi-modality documents.arXiv preprint arXiv:2410.10594(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

Xiaohan Yu, Zhihan Yang, and Chong Chen. 2025. Unveiling the Potential of Multimodal Retrieval Augmented Generation with Planning.arXiv preprint arXiv:2501.15470(2025)

work page arXiv 2025

[44] [44]

Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, et al. 2024. m2RAG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA.arXiv preprint arXiv:2411.15041(2024)

work page arXiv 2024

[45] [45]

Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, et al. 2023. Retrieving multimodal information for augmented generation: A survey.arXiv preprint arXiv:2303.10868(2023)

work page arXiv 2023