
arxiv: 2505.17086 · v4 · submitted 2025-05-20 · 💻 cs.CL

Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning

Pith reviewed 2026-05-22 13:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords: multi-agent RAG, reinforcement learning, long context, question answering, retrieval augmented generation, policy gradient optimization, multi-hop reasoning

The pith

Multi-agent decomposition paired with minimalist reinforcement learning overcomes long-context limits in RAG systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mujica-MyGO as a framework that lets large language models manage complex multi-turn reasoning in retrieval-augmented generation without the context lengths growing out of control. It splits multi-turn interactions into smaller cooperative sub-tasks handled by multiple agents, following a divide-and-conquer strategy. A lightweight reinforcement learning procedure called MyGO then trains the models directly, removing the need to stuff few-shot examples into every prompt. The authors supply convergence proofs for the RL step and report stronger results than prior systems on question-answering tests that use both plain text and knowledge graphs. Readers would care because the method keeps effective context short while still supporting deep reasoning chains.

Core claim

Mujica-MyGO combines two components. Mujica is a multi-agent RAG workflow that decomposes multi-turn interactions into cooperative sub-interactions to reduce context length. MyGO is a minimalist policy gradient optimization algorithm that post-trains LLMs inside RAG pipelines without in-context learning and is stated to come with theoretical guarantees of convergence to the optimal policy. The authors report that the combination yields superior empirical performance across text-corpus and knowledge-graph question-answering benchmarks.

What carries the argument

Mujica multi-agent decomposition workflow together with MyGO minimalist policy gradient optimization for post-training.

If this is right

  • Multi-turn RAG reasoning no longer forces exponential growth in prompt length.
  • LLMs can be post-trained for RAG tasks using lightweight RL rather than prompt-based few-shot demonstrations.
  • The same workflow applies to both text-based corpora and structured knowledge graphs.
  • Theoretical convergence of the RL component supports stable optimization inside complex agent pipelines.
  • Overall accuracy on diverse question-answering tasks improves without larger context windows.
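To make "lightweight RL" concrete, the sketch below runs plain REINFORCE with a running-mean baseline on a toy three-armed bandit. It is not MyGO: this page does not give MyGO's update rule, so the algorithm, the `reinforce_bandit` name, the reward values, and the hyperparameters are all illustrative assumptions. It only shows the general shape of a policy-gradient update that needs no critic network and no in-prompt demonstrations.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Plain REINFORCE on a toy bandit (illustrative, NOT the paper's MyGO).

    rewards[i] is the fixed reward of arm i; the policy is a softmax
    over one logit per arm.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # running mean of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_i
        # equals onehot(action)_i - probs_i.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
        baseline += (reward - baseline) / t
    return softmax(logits)


# Hypothetical rewards: arm 1 is the best arm.
probs = reinforce_bandit([0.2, 1.0, 0.5])
print(probs)
```

In an LLM setting the "arm" would be a sampled response and the reward a task score such as answer F1, but that mapping is a loose analogy rather than a description of the paper's training loop.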

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar decomposition patterns could shorten context demands in other multi-step LLM applications such as planning or summarization.
  • The minimalist RL approach may lower the cost of adapting models to specialized retrieval pipelines.
  • Pairing the method with shorter-context base models could produce still more efficient end-to-end systems.

Load-bearing premise

Decomposing multi-turn interactions into cooperative sub-interactions will sufficiently reduce the long-context limitations that LLMs face in RAG pipelines.

What would settle it

A controlled run of the same benchmarks in which the multi-agent decomposition is used, yet either performance remains no better than strong single-agent baselines or long-context errors persist.

Figures

Figures reproduced from arXiv: 2505.17086 by Ho-fung Leung, Irwin King, Jiaming Zhou, Jianye Hao, Jian-Yun Nie, Lei Ding, Liheng Ma, Muzhi Li, Yihong Wu, Yingxue Zhang.

Figure 1: The end-to-end architecture of the proposed Mujica framework.
Figure 2: [caption not recovered; the extracted text is a fragment of the paper body] "... there might be two conditionally independent subquestions, S2,1 and S2,2, which are dependent in the later subquestion S4,1. The dependency relations form a directed acyclic graph (DAG). Directly applying Eq. 1 to such a reasoning process can be both inefficient and suboptimal. Therefore, to effectively handle DAG dependency graphs, we allow our Mujica planner to ask subquestions in multiple iterations …"
Figure 3: The proposed Minimalist Policy Gradient Optimization framework.
Figure 4: F1 score on 2Wiki-KG training. [Plot: x-axis is sample iteration, 1K to 19K; y-axis is average F1 score (pass@1), 0.63 to 0.68; legend entry "Offline".]
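The Figure 2 fragment describes subquestions whose dependencies form a DAG and a planner that asks them over multiple iterations. One way to picture that scheduling is to group subquestions into rounds so that each round contains only questions whose prerequisites were answered earlier. The sketch below does this with Python's standard `graphlib`; the `plan_iterations` helper and the identifiers `S2_1`, `S2_2`, `S4_1` are hypothetical stand-ins modeled on the fragment, and nothing here is taken from the paper's planner.

```python
from graphlib import TopologicalSorter


def plan_iterations(depends_on):
    """Group subquestions into rounds that respect DAG dependencies.

    depends_on maps each subquestion id to the ids it needs answered first.
    Subquestions in the same round are independent of one another.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    rounds = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        rounds.append(ready)
        sorter.done(*ready)
    return rounds


# Hypothetical dependency graph: two independent subquestions
# feed into one later subquestion.
deps = {
    "S2_1": [],
    "S2_2": [],
    "S4_1": ["S2_1", "S2_2"],
}
iterations = plan_iterations(deps)
print(iterations)
```

Independent subquestions land in the same round, so they could be asked together, while the dependent one waits for both answers.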
Original abstract

Large Language Models (LLMs) equipped with modern Retrieval-Augmented Generation (RAG) systems often employ multi-turn interaction pipelines to interface with search engines for complex reasoning tasks. However, such multi-turn interactions inevitably produce long intermediate contexts, as context length grows exponentially with exploration depth. This leads to a well-known limitation of LLMs: their difficulty in effectively leveraging information from long contexts. This problem is further amplified in RAG systems that depend on in-context learning, where few-shot demonstrations must also be included in the prompt, compounding the context-length bottleneck. To address these challenges, we propose Mujica-MyGo, a unified framework for efficient multi-turn reasoning in RAG. Inspired by the divide-and-conquer principle, we introduce Mujica (Multi-hop Joint Intelligence for Complex Question Answering), a multi-agent RAG workflow that decomposes multi-turn interactions into cooperative sub-interactions, thereby mitigating long-context issues. To eliminate the dependency on in-context learning, we further develop MyGO (Minimalist Policy Gradient Optimization), a lightweight and efficient reinforcement learning algorithm that enables effective post-training of LLMs within complex RAG pipelines. We provide theoretical guarantees for MyGO's convergence to the optimal policy. Empirical evaluations across diverse question-answering benchmarks, covering both text corpora and knowledge graphs, show that Mujica-MyGO achieves superior performance.
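The abstract's statement that context length grows exponentially with exploration depth can be illustrated with a simple counting model. The model is an assumption made here for illustration, not one stated in the paper: the question occupies $q$ tokens, every retrieved result at one level spawns $b$ follow-up queries at the next, each result contributes $t$ tokens, and all results stay in a single shared context.

```latex
% Assumed model: q = question tokens, t = tokens per retrieved result,
% b = follow-up queries per result, d = exploration depth.
\[
  C_{\text{shared}}(d) \;=\; q + t \sum_{k=1}^{d} b^{k}
  \;=\; q + t\,\frac{b\,(b^{d}-1)}{b-1} \qquad (b > 1),
\]
\[
  C_{\text{shared}}(d) \;=\; q + t\,d \qquad (b = 1).
\]
% Worked instance: b = 2, t = 500, d = 4 gives
% t * (2 + 4 + 8 + 16) = 500 * 30 = 15000 tokens beyond the question.
```

Under the same assumed model, a sub-agent that sees only the question and one retrieved result works with roughly $q + t$ tokens regardless of depth, which is the kind of saving a decomposition would need to deliver.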

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mujica-MyGO, a unified multi-agent RAG framework for complex question answering. Mujica applies a divide-and-conquer decomposition to break multi-turn interactions into cooperative sub-interactions, aiming to mitigate exponential context growth in LLMs. MyGO introduces a lightweight policy-gradient RL algorithm for post-training that removes reliance on in-context learning, with claimed theoretical convergence guarantees to the optimal policy. Experiments across text-corpus and knowledge-graph QA benchmarks report superior performance over baselines.

Significance. If the central claims hold, the work offers a practical route to scalable multi-turn RAG by combining agent decomposition with minimalist RL, potentially reducing both context-length bottlenecks and few-shot prompting overhead while providing convergence assurances. The emphasis on lightweight, theoretically grounded RL distinguishes it from heavier fine-tuning approaches and could influence future multi-agent retrieval systems.

major comments (2)
  1. [§4, Experiments, and associated figures/tables] No quantitative evidence (such as average token counts, context-length histograms, or an ablation on input length) is provided comparing the Mujica workflow against standard multi-turn RAG baselines. Without these measurements it remains unclear whether the divide-and-conquer decomposition actually shortens effective contexts or whether reported gains arise from the RL component or other factors.
  2. [§3.2–3.3, MyGO algorithm and theory] The convergence guarantee is stated to hold for the minimalist policy gradient, yet the manuscript supplies neither the full proof nor the precise assumptions on the reward function and policy parameterization needed to verify that the result is independent of fitted quantities or in-context demonstrations.
minor comments (2)
  1. [Abstract] Notation for the two components is inconsistent between “Mujica-MyGo” and “Mujica-MyGO” in the abstract and section headings; standardize capitalization.
  2. [§3.1] The description of cooperative sub-interactions in Mujica would benefit from an explicit diagram or pseudocode showing message passing between sub-agents to clarify how history accumulation is avoided.
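The measurement requested in the first major comment could look roughly like the sketch below, which compares per-turn prompt lengths when all history is kept in one context against prompts where each sub-agent sees only the question and its own turn. The isolation rule is a simplifying assumption and not necessarily how Mujica passes messages; the function names are hypothetical, the inputs are synthetic, and whitespace splitting stands in for a real tokenizer.

```python
from statistics import mean


def count_tokens(text):
    """Crude token count; a real study would use the model's tokenizer."""
    return len(text.split())


def accumulated_prompt_lengths(question, turns):
    """Prompt length at each turn when every earlier turn stays in context."""
    lengths = []
    running = count_tokens(question)
    for turn in turns:
        running += count_tokens(turn)
        lengths.append(running)
    return lengths


def isolated_prompt_lengths(question, turns):
    """Prompt length at each turn when a sub-agent sees only its own turn.

    Assumption for illustration: no history is shared between sub-agents.
    """
    base = count_tokens(question)
    return [base + count_tokens(turn) for turn in turns]


# Synthetic data: a 10-token question and four 100-token retrieval turns.
question = "q " * 10
turns = ["t " * 100] * 4

accumulated = accumulated_prompt_lengths(question, turns)
isolated = isolated_prompt_lengths(question, turns)
print("accumulated:", accumulated, "mean:", mean(accumulated))
print("isolated:", isolated, "mean:", mean(isolated))
```

Reporting the two distributions side by side, together with accuracy, would help separate the effect of shorter contexts from the effect of RL post-training.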

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive feedback on our manuscript. We have carefully considered each major comment and provide detailed responses below. Where appropriate, we will revise the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [§4, Experiments, and associated figures/tables] No quantitative evidence (such as average token counts, context-length histograms, or an ablation on input length) is provided comparing the Mujica workflow against standard multi-turn RAG baselines. Without these measurements it remains unclear whether the divide-and-conquer decomposition actually shortens effective contexts or whether reported gains arise from the RL component or other factors.

    Authors: We acknowledge that the original manuscript lacks direct quantitative measurements of context lengths for the Mujica workflow compared to baselines. To address this, in the revised version we will include average token usage statistics, context length histograms, and an ablation study on varying input lengths. These additions will demonstrate the effectiveness of the divide-and-conquer decomposition in reducing context growth and help distinguish its impact from that of the MyGO reinforcement learning component.
    revision: yes

  2. Referee: [§3.2–3.3, MyGO algorithm and theory] The convergence guarantee is stated to hold for the minimalist policy gradient, yet the manuscript supplies neither the full proof nor the precise assumptions on the reward function and policy parameterization needed to verify that the result is independent of fitted quantities or in-context demonstrations.

    Authors: We agree with the referee that the full proof and precise assumptions were not supplied in the manuscript. In the revised version, we will include the complete proof of convergence for the minimalist policy gradient along with the detailed assumptions regarding the reward function and policy parameterization. This will enable verification that the result holds independently of fitted quantities or in-context demonstrations.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains self-contained

The paper presents Mujica as a multi-agent decomposition workflow inspired by divide-and-conquer to address long-context growth in RAG, and MyGO as a new minimalist RL method with independently claimed theoretical convergence guarantees. No equations, fitted parameters renamed as predictions, or load-bearing self-citations are exhibited that reduce the central performance claims or mitigation assertions to tautological inputs by construction. The divide-and-conquer motivation and RL convergence statement stand as external premises rather than self-referential reductions. The derivation is therefore not forced by definition or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the effectiveness of the proposed decomposition strategy and the convergence properties of the new RL algorithm; no numerical free parameters are mentioned.

axioms (1)
  • [domain assumption] Decomposing multi-turn RAG interactions into cooperative sub-interactions via divide-and-conquer mitigates long-context limitations.
    This premise is invoked to justify the Mujica multi-agent workflow.
invented entities (2)
  • Mujica (no independent evidence)
    purpose: Multi-agent RAG workflow that decomposes interactions
    New framework introduced to address context length growth.
  • MyGO (no independent evidence)
    purpose: Minimalist policy gradient optimization for post-training LLMs in RAG
    New RL algorithm claimed to provide convergence guarantees without in-context learning.

pith-pipeline@v0.9.0 · 5802 in / 1396 out tokens · 61127 ms · 2026-05-22T13:30:14.867333+00:00 · methodology
