Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
Pith reviewed 2026-05-19 13:43 UTC · model grok-4.3
The pith
Multimodal models gain accuracy on open-domain questions by learning to pick the right retrieval expert at each step of their reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoRE enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation by dynamically determining which expert to engage with conditioned on the evolving reasoning state, achieving average performance gains of over 7% on diverse open-domain QA benchmarks.
What carries the argument
Mixture-of-Retrieval Experts (MoRE) framework that conditions expert selection on the model's evolving reasoning state and is trained via Stepwise Group Relative Policy Optimization to produce fine-grained coordination rewards.
If this is right
- The model can locate relevant external information more precisely by switching experts mid-reasoning.
- Sparse outcome rewards are replaced by step-level signals that encourage full use of all available experts.
- Heterogeneous retrieval systems become interchangeable tools rather than competing alternatives.
- Reasoning traces become explicit about which knowledge source is consulted at each point.
Where Pith is reading between the lines
- The same selection logic might extend to non-retrieval tools such as code interpreters or image generators when the reasoning state indicates their utility.
- Longer reasoning chains could reveal whether repeated expert switches accumulate errors or compound benefits.
- If selection policies prove robust, retrieval-augmented systems could reduce the need for ever-larger internal knowledge stores.
Load-bearing premise
Step-GRPO training can produce stable expert-selection policies that continue to work well on new queries and new expert combinations.
What would settle it
Evaluating MoRE on a fresh collection of multimodal open-domain questions and observing no gain or a drop relative to single-expert baselines would show the dynamic coordination does not generalize.
Figures
read the original abstract
Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge. However, existing methods typically adhere to rigid retrieval paradigms by mimicking fixed retrieval trajectories and thus fail to fully exploit the knowledge of different retrieval experts through dynamic interaction based on the model's knowledge needs or evolving reasoning states. To overcome this limitation, we introduce Mixture-of-Retrieval Experts (MoRE), a novel framework that enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation. Specifically, MoRE learns to dynamically determine which expert to engage with, conditioned on the evolving reasoning state. To effectively train this capability, we propose Stepwise Group Relative Policy Optimization (Step-GRPO), which goes beyond sparse outcome-based supervision by encouraging MLLMs to interact with multiple retrieval experts and synthesize fine-grained rewards, thereby teaching the MLLM to fully coordinate all experts when answering a given query. Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, achieving average performance gains of over 7% compared to competitive baselines. Notably, MoRE exhibits strong adaptability by dynamically coordinating heterogeneous experts to precisely locate relevant information, validating its capability for robust, reasoning-driven expert collaboration. All codes and data are released on https://github.com/OpenBMB/MoRE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mixture-of-Retrieval Experts (MoRE), a framework enabling multimodal large language models (MLLMs) to dynamically select and collaborate with diverse retrieval experts conditioned on the evolving reasoning state during multimodal retrieval-augmented generation. It introduces Stepwise Group Relative Policy Optimization (Step-GRPO) to train this routing via fine-grained rewards synthesized from multi-expert interactions, going beyond sparse outcome supervision. Experiments on open-domain QA benchmarks report average gains exceeding 7% over competitive baselines, with public release of code and data.
Significance. If the empirical gains prove robust and the dynamic routing generalizes, MoRE could advance MRAG by replacing rigid retrieval trajectories with reasoning-state-conditioned expert coordination, offering a path to more effective knowledge exploitation and hallucination mitigation in MLLMs. The public code and data release is a clear strength supporting reproducibility.
major comments (2)
- [Abstract] Abstract: the central claim of average performance gains of over 7% is presented without any description of the benchmarks, baselines, number of runs, error bars, or statistical tests. This gap is load-bearing for assessing whether the data support the effectiveness of MoRE and Step-GRPO.
- [Method] Step-GRPO description: no explicit mechanism (entropy regularization, diversity penalty, or per-step credit assignment) is stated to prevent mode collapse or ensure the policy evolves with intermediate reasoning steps rather than learning static per-query expert selection. This directly affects the claim of collaborative, state-dependent exploitation.
minor comments (2)
- [Abstract] Add a short experimental setup paragraph in the abstract or introduction to orient readers before the results claim.
- [Introduction] Ensure consistent definition of acronyms (MLLM, MRAG, Step-GRPO) on first use throughout the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of average performance gains of over 7% is presented without any description of the benchmarks, baselines, number of runs, error bars, or statistical tests. This gap is load-bearing for assessing whether the data support the effectiveness of MoRE and Step-GRPO.
Authors: We acknowledge that the abstract's brevity omits these specifics. The main text (Section 4) fully describes the open-domain QA benchmarks, the set of competitive baselines, the number of runs, and reports means with standard deviations. To improve accessibility, we will revise the abstract to briefly reference the benchmark suite and note that gains are averaged across multiple runs with variability measures. revision: yes
-
Referee: [Method] Step-GRPO description: no explicit mechanism (entropy regularization, diversity penalty, or per-step credit assignment) is stated to prevent mode collapse or ensure the policy evolves with intermediate reasoning steps rather than learning static per-query expert selection. This directly affects the claim of collaborative, state-dependent exploitation.
Authors: We thank the referee for this observation. Step-GRPO synthesizes fine-grained rewards from multi-expert interactions at each reasoning step, which is intended to drive state-dependent routing rather than static per-query choices. The group-relative formulation inherently compares alternative expert trajectories within the same query. To make the safeguards explicit, we will add a description of the entropy regularization term in the policy objective and clarify the per-step credit assignment in the revised Method section. revision: yes
Circularity Check
No significant circularity; claims rest on empirical benchmarks
full rationale
The paper introduces MoRE and Step-GRPO as a novel framework for dynamic, reasoning-state-conditioned retrieval expert coordination in MLLMs. Its central claims of >7% average gains are presented as outcomes of experiments on open-domain QA benchmarks rather than any mathematical derivation or prediction that reduces by construction to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are shown that would force the reported results from the method's own inputs. The framework description and training procedure contain no load-bearing self-references that substitute for external validation; results are independently falsifiable via the released code and benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Step-GRPO training hyperparameters
axioms (1)
- domain assumption MLLMs can be trained to dynamically select and coordinate retrieval experts based on evolving reasoning states
invented entities (2)
-
Mixture-of-Retrieval Experts (MoRE)
no independent evidence
-
Step-GRPO
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
MoRE learns to dynamically determine which expert to engage with, conditioned on the evolving reasoning state... Stepwise Group Relative Policy Optimization (Step-GRPO)
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
fine-grained rewards by integrating feedback from the interaction process with retrieval experts
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
-
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...
-
Learning Agent Routing From Early Experience
BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
Reference graph
Works this paper leans on
-
[1]
Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Moham- madali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soley- mani Baghshah, and Ehsaneddin Asgari. 2025. Ask in Any Modality: A Compre- hensive Survey on Multimodal Retrieval-Augmented Generation.arXiv preprint arXiv:2502.08826(2025)
-
[2]
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations
work page 2023
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-llava: Hierarchical retrieval- augmented generation for multimodal llms. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. 1818–1826
work page 2024
-
[6]
Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. 2022. Webqa: Multihop and multimodal qa. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16495–16504
work page 2022
-
[7]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Fan Yang, Zenan Zhou, Weipeng Chen, Haofen Wang, Jeff Z Pan, et al . 2025. Learning to Reason with Search for LLMs via Reinforcement Learning.arXiv preprint arXiv:2503.19470(2025). Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation Conference’17, July 2017, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [9]
- [10]
- [11]
- [12]
-
[13]
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. 2025. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Suyu Ge, Chenyan Xiong, Corby Rosset, Arnold Overwijk, Jiawei Han, and Paul Bennett. 2023. Augmenting Zero-Shot Dense Retrievers with Plug-in Mixture-of- Memories. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1796–1812
work page 2023
- [15]
-
[16]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps.arXiv preprint arXiv:2011.01060(2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[18]
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Rein- forcement learning: A survey.Journal of artificial intelligence research4 (1996), 237–285
work page 1996
- [19]
- [20]
-
[21]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems33 (2020), 9459–9474
work page 2020
-
[22]
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025. Search-o1: Agentic search-enhanced large reasoning models.arXiv preprint arXiv:2501.05366(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Fei Huang, Jingren Zhou, et al. 2024. Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self- adaptive planning agent.arXiv preprint arXiv:2411.02937(2024)
-
[24]
Zhenghao Liu, Pengcheng Huang, Zhipeng Xu, Xinze Li, Shuliang Liu, Chunyi Peng, Haidong Xin, Yukun Yan, Shuo Wang, Xu Han, et al . 2025. Knowledge intensive agents.A vailable at SSRN 5459034(2025)
work page 2025
- [25]
- [26]
-
[27]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems36 (2023), 53728–53741
work page 2023
-
[28]
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-Context Retrieval-Augmented Lan- guage Models.Transactions of the Association for Computational Linguistics(2023), 1316–1331
work page 2023
- [29]
-
[30]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al . 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [31]
-
[32]
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models.arXiv preprint arXiv:2301.12652(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2503.05592 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal
-
[35]
Interleaving retrieval with chain-of-thought reasoning for knowledge- intensive multi-step questions.arXiv preprint arXiv:2212.10509(2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[36]
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. 2024. Uniir: Training and benchmarking universal multimodal information retrievers. InEuropean Conference on Computer Vision. Springer, 387–404
work page 2024
-
[37]
Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. 2025. MMSearch-R1: Incentivizing LMMs to Search.arXiv preprint arXiv:2506.20670(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [38]
-
[39]
Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, and Huaxiu Yao. 2024. Rule: Reliable multimodal rag for factuality in medical vision language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 1081–1093
work page 2024
-
[40]
Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. 2025. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, and Sung Ju Hwang. 2025. UniversalRAG: Retrieval-Augmented Generation over Multiple Cor- pora with Diverse Modalities and Granularities.arXiv preprint arXiv:2504.20734 (2025)
work page internal anchor Pith review arXiv 2025
-
[42]
Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zheng- hao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al . 2024. Visrag: Vision-based retrieval-augmented generation on multi-modality documents.arXiv preprint arXiv:2410.10594(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [43]
- [44]
- [45]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.