
arxiv: 2505.22095 · v2 · submitted 2025-05-28 · 💻 cs.CL

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

Pith reviewed 2026-05-19 13:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal retrieval-augmented generation · mixture of retrieval experts · reasoning-guided selection · multimodal large language models · stepwise policy optimization · open-domain question answering · knowledge exploitation · dynamic expert coordination

The pith

Multimodal models gain accuracy on open-domain questions by learning to pick the right retrieval expert at each step of their reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mixture-of-Retrieval Experts (MoRE) as a way for multimodal large language models to move beyond fixed retrieval routines and instead call on different knowledge sources depending on what the current reasoning step requires. A training procedure called Stepwise Group Relative Policy Optimization (Step-GRPO) supplies rewards at each step, so the model learns to coordinate several experts and combine their outputs rather than relying on sparse end-of-task feedback. The authors test this on multiple open-domain question-answering benchmarks and report average gains of over 7% compared with prior retrieval-augmented baselines. The core idea is that effective knowledge use depends on matching the retrieval tool to the model's momentary information gap. If the approach holds, models would treat retrieval as an active, reasoning-guided part of answering rather than a single upfront step.

Core claim

MoRE enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation by dynamically determining which expert to engage with conditioned on the evolving reasoning state, achieving average performance gains of over 7% on diverse open-domain QA benchmarks.

What carries the argument

The Mixture-of-Retrieval Experts (MoRE) framework, which conditions expert selection on the model's evolving reasoning state and is trained with Stepwise Group Relative Policy Optimization (Step-GRPO), a procedure that synthesizes fine-grained rewards from interactions with multiple experts.
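
As a rough illustration of what selection conditioned on the reasoning state could look like operationally, the loop below is a minimal sketch and not the authors' implementation. The names `ReasoningState`, `run_episode`, `toy_policy`, and the two stub experts are hypothetical; in MoRE the choice would come from the trained MLLM rather than a hand-written rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical expert interface: a query string in, retrieved snippets out.
Expert = Callable[[str], List[str]]


@dataclass
class ReasoningState:
    question: str
    evidence: List[str] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)  # expert consulted at each step


# Hypothetical policy interface: sees the current state and the expert names,
# returns either an expert name or the special action "answer".
Policy = Callable[[ReasoningState, List[str]], str]


def run_episode(question: str, experts: Dict[str, Expert],
                choose_expert: Policy, max_steps: int = 4) -> ReasoningState:
    """Query one expert per step until the policy decides to answer."""
    state = ReasoningState(question)
    for _ in range(max_steps):
        name = choose_expert(state, list(experts))
        if name == "answer":
            break
        state.trace.append(name)
        state.evidence.extend(experts[name](question))
    return state


def toy_policy(state: ReasoningState, names: List[str]) -> str:
    # Stand-in for the trained model: consult each expert once, then answer.
    for name in names:
        if name not in state.trace:
            return name
    return "answer"


experts = {
    "text": lambda q: [f"text passage for: {q}"],
    "image": lambda q: [f"image caption for: {q}"],
}
final = run_episode("Who designed this bridge?", experts, toy_policy)
print(final.trace)
```

The point of the sketch is only the control flow: the selection function is called again after every retrieval, so its input includes the evidence gathered so far.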

If this is right

  • The model can locate relevant external information more precisely by switching experts mid-reasoning.
  • Sparse outcome rewards are replaced by step-level signals that encourage full use of all available experts.
  • Heterogeneous retrieval systems become interchangeable tools rather than competing alternatives.
  • Reasoning traces become explicit about which knowledge source is consulted at each point.
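
The step-level signals in the list above can be pictured with the group-normalized advantage used in standard GRPO, applied once per reasoning step. This is a sketch under assumptions: the exact Step-GRPO reward and normalization are not given on this page, and `stepwise_advantages`, the use of the population standard deviation, and the toy rewards are illustrative choices.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style normalization: (r_i - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not taken from the paper
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_advantages(step_rewards: List[List[float]]) -> List[List[float]]:
    """Normalize each step's candidate rewards separately.

    step_rewards[t][i] is the (hypothetical) reward for consulting
    candidate expert i at reasoning step t.
    """
    return [group_relative_advantages(group) for group in step_rewards]


# Toy rewards for two steps with four candidate expert calls each.
toy_rewards = [
    [1.0, 0.0, 0.0, 1.0],  # step 0: candidates 0 and 3 retrieved useful evidence
    [0.5, 0.5, 0.5, 0.5],  # step 1: all candidates equally useful
]
advantages = stepwise_advantages(toy_rewards)
print(advantages)
```

In this toy case the first step yields a clear preference between candidates, while the second step yields zero advantage everywhere, which is the sense in which a per-step signal is denser than a single end-of-episode reward.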

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection logic might extend to non-retrieval tools such as code interpreters or image generators when the reasoning state indicates their utility.
  • Longer reasoning chains could reveal whether repeated expert switches accumulate errors or compound benefits.
  • If selection policies prove robust, retrieval-augmented systems could reduce the need for ever-larger internal knowledge stores.

Load-bearing premise

Step-GRPO training can produce stable expert-selection policies that continue to work well on new queries and new expert combinations.

What would settle it

Evaluating MoRE on a fresh collection of multimodal open-domain questions and observing no gain or a drop relative to single-expert baselines would show the dynamic coordination does not generalize.
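
One way to make "no gain or a drop" concrete is a paired bootstrap over per-question correctness. The sketch below is an assumption about how such a check might be run, not a procedure described in the paper; `paired_bootstrap_ci` is a hypothetical helper, and the two correctness vectors are invented toy data, not results.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(system: List[int], baseline: List[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the mean per-question accuracy difference.

    system[i] and baseline[i] are 1 if the respective method answered
    question i correctly and 0 otherwise.
    """
    assert len(system) == len(baseline) and system
    rng = random.Random(seed)
    n = len(system)
    diffs = [s - b for s, b in zip(system, baseline)]
    observed = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lo, hi


# Toy data only: eight questions, dynamic-expert system vs. a single-expert baseline.
system_correct = [1, 1, 0, 1, 1, 0, 1, 1]
baseline_correct = [1, 0, 0, 1, 0, 0, 1, 1]
observed, lo, hi = paired_bootstrap_ci(system_correct, baseline_correct)
print(observed, lo, hi)
```

If the interval on a fresh question set includes zero or lies below it, that would be evidence of the kind this section describes.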

Figures

Figures reproduced from arXiv: 2505.22095 by Chunyi Peng, Ge Yu, Maosong Sun, Minghe Yu, Shuo Wang, Yishan Li, Yu Gu, Yukun Yan, Zhenghao Liu, Zhipeng Xu.

Figure 1: Illustration of Our Mixture-of-Retrieval Experts
Figure 2: The Architecture of Stepwise GRPO Training Used by MoRE.
Figure 3: The Prompt Template of MoRE (Step-GRPO) for
Figure 4: The Prompt Template of MoRE for Answer Genera
Figure 5: Training and Inference Effectiveness of Step-GRPO.
Figure 6: Expert Interaction Distribution across Reasoning
Figure 7: Case Studies. We select two cases from the Visual QA task to illustrate different expert interaction mechanisms. The

Original abstract

Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge. However, existing methods typically adhere to rigid retrieval paradigms by mimicking fixed retrieval trajectories and thus fail to fully exploit the knowledge of different retrieval experts through dynamic interaction based on the model's knowledge needs or evolving reasoning states. To overcome this limitation, we introduce Mixture-of-Retrieval Experts (MoRE), a novel framework that enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation. Specifically, MoRE learns to dynamically determine which expert to engage with, conditioned on the evolving reasoning state. To effectively train this capability, we propose Stepwise Group Relative Policy Optimization (Step-GRPO), which goes beyond sparse outcome-based supervision by encouraging MLLMs to interact with multiple retrieval experts and synthesize fine-grained rewards, thereby teaching the MLLM to fully coordinate all experts when answering a given query. Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, achieving average performance gains of over 7% compared to competitive baselines. Notably, MoRE exhibits strong adaptability by dynamically coordinating heterogeneous experts to precisely locate relevant information, validating its capability for robust, reasoning-driven expert collaboration. All codes and data are released on https://github.com/OpenBMB/MoRE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mixture-of-Retrieval Experts (MoRE), a framework enabling multimodal large language models (MLLMs) to dynamically select and collaborate with diverse retrieval experts conditioned on the evolving reasoning state during multimodal retrieval-augmented generation. It introduces Stepwise Group Relative Policy Optimization (Step-GRPO) to train this routing via fine-grained rewards synthesized from multi-expert interactions, going beyond sparse outcome supervision. Experiments on open-domain QA benchmarks report average gains exceeding 7% over competitive baselines, with public release of code and data.

Significance. If the empirical gains prove robust and the dynamic routing generalizes, MoRE could advance MRAG by replacing rigid retrieval trajectories with reasoning-state-conditioned expert coordination, offering a path to more effective knowledge exploitation and hallucination mitigation in MLLMs. The public code and data release is a clear strength supporting reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central claim of average performance gains of over 7% is presented without any description of the benchmarks, baselines, number of runs, error bars, or statistical tests. This gap is load-bearing for assessing whether the data support the effectiveness of MoRE and Step-GRPO.
  2. [Method] Step-GRPO description: no explicit mechanism (entropy regularization, diversity penalty, or per-step credit assignment) is stated to prevent mode collapse or ensure the policy evolves with intermediate reasoning steps rather than learning static per-query expert selection. This directly affects the claim of collaborative, state-dependent exploitation.
minor comments (2)
  1. [Abstract] Add a short experimental setup paragraph in the abstract or introduction to orient readers before the results claim.
  2. [Introduction] Ensure consistent definition of acronyms (MLLM, MRAG, Step-GRPO) on first use throughout the manuscript.
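
The second major comment, on static per-query selection versus state-dependent routing, suggests a simple diagnostic over logged expert traces. The sketch below is hypothetical and not taken from the paper: `switch_rate` measures how often the chosen expert changes between consecutive steps of the same query, and `per_step_entropy` measures how varied the choices are across queries at each step index. The trace data are toy values.

```python
from collections import Counter
from math import log2
from typing import List


def switch_rate(traces: List[List[str]]) -> float:
    """Fraction of consecutive step pairs, within a query, where the expert changes."""
    pairs = [(a, b) for t in traces for a, b in zip(t, t[1:])]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def per_step_entropy(traces: List[List[str]]) -> List[float]:
    """Shannon entropy (bits) of the expert choice at each step index, across queries."""
    max_len = max((len(t) for t in traces), default=0)
    entropies = []
    for step in range(max_len):
        counts = Counter(t[step] for t in traces if len(t) > step)
        total = sum(counts.values())
        entropies.append(-sum((c / total) * log2(c / total) for c in counts.values()))
    return entropies


# Toy traces: one list of consulted experts per query.
traces = [["image", "text"], ["text", "text"], ["image", "table"]]
collapsed = [["text", "text"], ["text", "text"]]

print(switch_rate(traces), per_step_entropy(traces))
print(switch_rate(collapsed), per_step_entropy(collapsed))
```

A switch rate near zero would mean each query is served by a single expert throughout, which is the static behavior the referee is concerned about; entropy near zero at every step would indicate collapse onto one expert overall.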

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of average performance gains of over 7% is presented without any description of the benchmarks, baselines, number of runs, error bars, or statistical tests. This gap is load-bearing for assessing whether the data support the effectiveness of MoRE and Step-GRPO.

    Authors: We acknowledge that the abstract's brevity omits these specifics. The main text (Section 4) fully describes the open-domain QA benchmarks, the set of competitive baselines, the number of runs, and reports means with standard deviations. To improve accessibility, we will revise the abstract to briefly reference the benchmark suite and note that gains are averaged across multiple runs with variability measures.

    Revision: yes

  2. Referee: [Method] Step-GRPO description: no explicit mechanism (entropy regularization, diversity penalty, or per-step credit assignment) is stated to prevent mode collapse or ensure the policy evolves with intermediate reasoning steps rather than learning static per-query expert selection. This directly affects the claim of collaborative, state-dependent exploitation.

    Authors: We thank the referee for this observation. Step-GRPO synthesizes fine-grained rewards from multi-expert interactions at each reasoning step, which is intended to drive state-dependent routing rather than static per-query choices. The group-relative formulation inherently compares alternative expert trajectories within the same query. To make the safeguards explicit, we will add a description of the entropy regularization term in the policy objective and clarify the per-step credit assignment in the revised Method section.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks

Full rationale

The paper introduces MoRE and Step-GRPO as a novel framework for dynamic, reasoning-state-conditioned retrieval expert coordination in MLLMs. Its central claims of >7% average gains are presented as outcomes of experiments on open-domain QA benchmarks rather than any mathematical derivation or prediction that reduces by construction to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are shown that would force the reported results from the method's own inputs. The framework description and training procedure contain no load-bearing self-references that substitute for external validation; results are independently falsifiable via the released code and benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The paper introduces a new framework, MoRE, and a new training method, Step-GRPO. It relies on the standard assumptions that MLLMs can learn dynamic expert selection and that benchmark gains reflect genuine improvements in coordination.

free parameters (1)
  • Step-GRPO training hyperparameters
    Policy optimization parameters chosen or tuned to produce the reported expert coordination and performance gains.
axioms (1)
  • domain assumption: MLLMs can be trained to dynamically select and coordinate retrieval experts based on evolving reasoning states
    Core premise invoked in the design of MoRE and Step-GRPO as described in the abstract.
invented entities (2)
  • Mixture-of-Retrieval Experts (MoRE) · no independent evidence
    purpose: Framework for dynamic expert collaboration in multimodal retrieval
    Newly introduced system whose effectiveness is demonstrated only through the paper's experiments.
  • Step-GRPO · no independent evidence
    purpose: Training algorithm for fine-grained expert interaction rewards
    New optimization method proposed to train the dynamic selection capability.

pith-pipeline@v0.9.0 · 5802 in / 1461 out tokens · 78219 ms · 2026-05-19T13:43:19.221562+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.

  2. VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...

  3. Learning Agent Routing From Early Experience

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
