Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning
Pith reviewed 2026-05-22 13:30 UTC · model grok-4.3
The pith
Multi-agent decomposition paired with minimalist reinforcement learning overcomes long-context limits in RAG systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mujica-MyGO combines two components. Mujica is a multi-agent RAG workflow that decomposes multi-turn interactions into cooperative sub-interactions to reduce context length. MyGO is a minimalist policy gradient optimization algorithm that post-trains LLMs inside RAG pipelines without in-context learning. The paper claims theoretical guarantees that MyGO converges to the optimal policy and reports superior empirical performance on text-corpus and knowledge-graph question-answering benchmarks.
What carries the argument
Mujica multi-agent decomposition workflow together with MyGO minimalist policy gradient optimization for post-training.
If this is right
- Multi-turn RAG reasoning no longer forces exponential growth in prompt length.
- LLMs can be post-trained for RAG tasks using lightweight RL rather than prompt-based few-shot demonstrations.
- The same workflow applies to both text-based corpora and structured knowledge graphs.
- Theoretical convergence of the RL component supports stable optimization inside complex agent pipelines.
- Overall accuracy on diverse question-answering tasks improves without larger context windows.
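One way to read the first consequence above is through a toy token count. The sketch below is not taken from the paper. It assumes that every explored step retrieves a fixed number of passages (`branching`), that every passage costs the same number of tokens, and that a decomposed workflow hands each sub-agent only its own passages plus one short summary per earlier step. All function and parameter names are hypothetical.

```python
def monolithic_context(depth, branching, tokens_per_passage):
    """Tokens in one prompt if every explored branch stays in context.

    Assumption: level k of the exploration contributes branching**k passages.
    """
    passages = sum(branching ** k for k in range(1, depth + 1))
    return passages * tokens_per_passage


def decomposed_context(depth, branching, tokens_per_passage, summary_tokens):
    """Largest prompt a single sub-agent sees under the assumed decomposition.

    Assumption: a sub-agent reads its own passages plus one short summary
    for each earlier step, never the full history.
    """
    return branching * tokens_per_passage + depth * summary_tokens


# Example with made-up numbers: depth 3, 3 passages per step, 200 tokens each.
mono = monolithic_context(depth=3, branching=3, tokens_per_passage=200)
split = decomposed_context(depth=3, branching=3, tokens_per_passage=200,
                           summary_tokens=30)
print(mono, split)  # 7800 690
```

Under these assumptions the monolithic prompt grows geometrically with depth while the per-agent prompt grows linearly. Whether real Mujica runs behave this way is exactly what the missing token-count measurements would show.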
Where Pith is reading between the lines
- Similar decomposition patterns could shorten context demands in other multi-step LLM applications such as planning or summarization.
- The minimalist RL approach may lower the cost of adapting models to specialized retrieval pipelines.
- Pairing the method with shorter-context base models could produce still more efficient end-to-end systems.
Load-bearing premise
Decomposing multi-turn interactions into cooperative sub-interactions will sufficiently reduce the long-context limitations that LLMs face in RAG pipelines.
What would settle it
A controlled run of the same benchmarks in which the multi-agent decomposition is used but performance remains no better than strong single-agent baselines or long-context errors persist.
Original abstract
Large Language Models (LLMs) equipped with modern Retrieval-Augmented Generation (RAG) systems often employ multi-turn interaction pipelines to interface with search engines for complex reasoning tasks. However, such multi-turn interactions inevitably produce long intermediate contexts, as context length grows exponentially with exploration depth. This leads to a well-known limitation of LLMs: their difficulty in effectively leveraging information from long contexts. This problem is further amplified in RAG systems that depend on in-context learning, where few-shot demonstrations must also be included in the prompt, compounding the context-length bottleneck. To address these challenges, we propose Mujica-MyGo, a unified framework for efficient multi-turn reasoning in RAG. Inspired by the divide-and-conquer principle, we introduce Mujica (Multi-hop Joint Intelligence for Complex Question Answering), a multi-agent RAG workflow that decomposes multi-turn interactions into cooperative sub-interactions, thereby mitigating long-context issues. To eliminate the dependency on in-context learning, we further develop MyGO (Minimalist Policy Gradient Optimization), a lightweight and efficient reinforcement learning algorithm that enables effective post-training of LLMs within complex RAG pipelines. We provide theoretical guarantees for MyGO's convergence to the optimal policy. Empirical evaluations across diverse question-answering benchmarks, covering both text corpora and knowledge graphs, show that Mujica-MyGO achieves superior performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mujica-MyGO, a unified multi-agent RAG framework for complex question answering. Mujica applies a divide-and-conquer decomposition to break multi-turn interactions into cooperative sub-interactions, aiming to mitigate exponential context growth in LLMs. MyGO introduces a lightweight policy-gradient RL algorithm for post-training that removes reliance on in-context learning, with claimed theoretical convergence guarantees to the optimal policy. Experiments across text-corpus and knowledge-graph QA benchmarks report superior performance over baselines.
Significance. If the central claims hold, the work offers a practical route to scalable multi-turn RAG by combining agent decomposition with minimalist RL, potentially reducing both context-length bottlenecks and few-shot prompting overhead while providing convergence assurances. The emphasis on lightweight, theoretically grounded RL distinguishes it from heavier fine-tuning approaches and could influence future multi-agent retrieval systems.
major comments (2)
- [§4] §4 (Experiments) and associated figures/tables: No quantitative evidence—such as average token counts, context-length histograms, or ablation on input length—is provided comparing the Mujica workflow against standard multi-turn RAG baselines. Without these measurements it remains unclear whether the divide-and-conquer decomposition actually shortens effective contexts or whether reported gains arise from the RL component or other factors.
- [§3.2–3.3] §3.2–3.3 (MyGO algorithm and theory): The convergence guarantee is stated to hold for the minimalist policy gradient, yet the manuscript supplies neither the full proof nor the precise assumptions on the reward function and policy parameterization needed to verify that the result is independent of fitted quantities or in-context demonstrations.
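The measurements requested in the first major comment are cheap to collect once the prompts sent to the model are logged. A minimal sketch, assuming the prompts are available as plain strings and using a whitespace split as a crude stand-in for the model's real tokenizer (both are assumptions, as are the function names):

```python
from collections import Counter
from statistics import mean


def count_tokens(prompt):
    # Whitespace split is only a rough proxy for a real tokenizer.
    return len(prompt.split())


def context_stats(prompts, bin_width=100):
    """Return mean, max, and a binned histogram of prompt lengths."""
    lengths = [count_tokens(p) for p in prompts]
    # Map each length to the lower edge of its bin, e.g. 150 -> 100.
    histogram = Counter((n // bin_width) * bin_width for n in lengths)
    return {
        "mean": mean(lengths),
        "max": max(lengths),
        "histogram": dict(sorted(histogram.items())),
    }


# Hypothetical logged prompts of 3, 150, and 120 whitespace tokens.
logged = ["a b c", "a " * 150, "x " * 120]
stats = context_stats(logged)
print(stats)
```

Running the same function over prompts logged from the Mujica workflow and from a single-agent multi-turn baseline would give the side-by-side comparison the referee asks for.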
minor comments (2)
- [Abstract] The framework name is written inconsistently as “Mujica-MyGo” and “Mujica-MyGO” in the abstract and section headings; standardize the capitalization.
- [§3.1] The description of cooperative sub-interactions in Mujica would benefit from an explicit diagram or pseudocode showing message passing between sub-agents to clarify how history accumulation is avoided.
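To make the last minor comment concrete, the kind of message passing it asks the authors to document can be sketched as follows. This is a hypothetical illustration, not Mujica itself: the planner and worker roles, the `{prev}` placeholder convention, and the toy corpus are all assumptions, and the two functions that would call an LLM in a real system are replaced by fixed lookups.

```python
from dataclasses import dataclass

# Toy corpus standing in for a search engine (illustrative data only).
CORPUS = {
    "director of Inception": "Christopher Nolan",
    "birthplace of Christopher Nolan": "London",
}


@dataclass
class SubTask:
    template: str  # sub-question; "{prev}" is filled with the previous answer


def plan(question):
    """Hypothetical planner: a real system would ask an LLM to decompose."""
    return [SubTask("director of Inception"), SubTask("birthplace of {prev}")]


def work(sub_question):
    """Hypothetical worker: sees only its sub-question, returns a short answer."""
    return CORPUS.get(sub_question, "unknown")


def answer(question):
    prev = ""
    worker_inputs = []
    for task in plan(question):
        sub_question = task.template.format(prev=prev)
        worker_inputs.append(sub_question)  # everything the worker was shown
        prev = work(sub_question)           # only the short answer is passed on
    return prev, worker_inputs


final_answer, worker_prompts = answer("Where was the director of Inception born?")
print(final_answer)
print(worker_prompts)
```

The point of the sketch is that each worker input contains one sub-question and, at most, the previous short answer. No worker ever receives the accumulated retrieval history, which is the property the referee wants shown explicitly.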
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive feedback on our manuscript. We have carefully considered each major comment and provide detailed responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
Point-by-point responses
- Referee: §4 (Experiments) and associated figures/tables: No quantitative evidence, such as average token counts, context-length histograms, or ablation on input length, is provided comparing the Mujica workflow against standard multi-turn RAG baselines. Without these measurements it remains unclear whether the divide-and-conquer decomposition actually shortens effective contexts or whether reported gains arise from the RL component or other factors.
  Authors: We acknowledge that the original manuscript lacks direct quantitative measurements of context lengths for the Mujica workflow compared to baselines. To address this, in the revised version we will include average token usage statistics, context length histograms, and an ablation study on varying input lengths. These additions will demonstrate the effectiveness of the divide-and-conquer decomposition in reducing context growth and help distinguish its impact from that of the MyGO reinforcement learning component.
  Revision: yes
- Referee: §3.2–3.3 (MyGO algorithm and theory): The convergence guarantee is stated to hold for the minimalist policy gradient, yet the manuscript supplies neither the full proof nor the precise assumptions on the reward function and policy parameterization needed to verify that the result is independent of fitted quantities or in-context demonstrations.
  Authors: We agree with the referee that the full proof and precise assumptions were not supplied in the manuscript. In the revised version, we will include the complete proof of convergence for the minimalist policy gradient along with the detailed assumptions regarding the reward function and policy parameterization. This will enable verification that the result holds independently of fitted quantities or in-context demonstrations.
  Revision: yes
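For readers who want a feel for what a statement of convergence to the optimal policy means in the simplest possible setting, the sketch below runs exact gradient ascent on a three-armed bandit with a softmax policy. It is not MyGO and says nothing about whether the paper's proof holds. The reward values, step count, and learning rate are arbitrary assumptions, and the exact expected gradient is used instead of sampled trajectories so that the run is deterministic.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_policy_gradient(rewards, steps=500, lr=1.0):
    """Gradient ascent on expected reward J = sum_i p_i * r_i."""
    logits = [0.0] * len(rewards)  # start from the uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # For a softmax policy, dJ/dlogit_i = p_i * (r_i - J).
        logits = [x + lr * p * (r - expected)
                  for x, p, r in zip(logits, probs, rewards)]
    return softmax(logits)


# Made-up rewards: the third arm is the best one.
probs = exact_policy_gradient([0.0, 0.2, 1.0])
print(probs)
```

In this toy problem the policy concentrates on the highest-reward arm. The assumptions the referee asks for (on the reward function and the policy parameterization) are what would determine whether a similar statement carries over to an LLM policy inside a multi-agent RAG pipeline.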
Circularity Check
No significant circularity; derivation chain remains self-contained
The paper presents Mujica as a multi-agent decomposition workflow inspired by divide-and-conquer to address long-context growth in RAG, and MyGO as a new minimalist RL method with independently claimed theoretical convergence guarantees. No equations, fitted parameters renamed as predictions, or load-bearing self-citations are exhibited that reduce the central performance claims or mitigation assertions to tautological inputs by construction. The divide-and-conquer motivation and RL convergence statement stand as external premises rather than self-referential reductions. The derivation is therefore not forced by definition or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Decomposing multi-turn RAG interactions into cooperative sub-interactions via divide-and-conquer mitigates long-context limitations.
invented entities (2)
- Mujica: no independent evidence
- MyGO: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Inspired by the divide-and-conquer principle, we introduce Mujica... that decomposes multi-turn interactions into cooperative sub-interactions, thereby mitigating long-context issues... MyGO... samples trajectories from an asymptotically approximate optimal policy... MLE for policy updates."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We provide theoretical guarantees for MyGO's convergence to the optimal policy."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.