arxiv: 2502.13957 · v3 · pith:TV4EXAUDnew · submitted 2025-02-19 · 💻 cs.CL · cs.AI

Supervising the search process produces reliable and generalizable information-seeking agents

Pith reviewed 2026-05-23 02:15 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords process supervision · search agents · RAG-Gym · multi-hop question answering · out-of-domain generalization · LLM agents · reasoning reflection

The pith

Supervising the search process produces more reliable and generalizable information-seeking agents than outcome supervision alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models used as search agents have mostly been trained by rewarding only correct final answers, which encourages reward hacking and heavy reliance on the model's internal knowledge at the expense of generalization. The paper introduces the RAG-Gym framework to supervise intermediate search steps instead. Using it, the authors identify reasoning reflection as a critical capability and build the Re²Search++ agent around it. The resulting agents show gains on multi-hop benchmarks that are larger in out-of-domain tests and come mainly from improved queries, and the learned search critics transfer across models.

Core claim

Previous approaches have primarily relied on outcome supervision, rewarding agents only for producing correct final answers. This often leads to reward hacking and excessive dependence on parametric memory, limiting generalization to out-of-domain tasks. To address these limitations, we introduce RAG-Gym, a framework that shifts supervision from final answers to the search process itself. With RAG-Gym, we systematically investigate architecture design, parameter optimization, and action evaluation, identifying reasoning reflection as a critical capability for search agents. Building on this insight, we propose Re²Search++, a process-supervised agent that achieves substantial improvements on multi-hop information-seeking benchmarks, especially in out-of-domain settings.

What carries the argument

RAG-Gym framework that shifts supervision from final answers to the search process itself, highlighting reasoning reflection as the key capability.
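
One way to picture process-level supervision at inference time is critic-guided action selection: sample several candidate actions at a step, score each with a critic, and execute the best one. The sketch below is illustrative only and is not the paper's implementation. The `State`, `Action`, `select_action`, and `toy_critic` names are hypothetical, and the word-overlap heuristic merely stands in for a learned critic.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical agent state: the question plus search queries issued so far."""
    question: str
    history: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Action:
    """Hypothetical action: either a search query or a final answer."""
    kind: str  # "search" or "answer"
    text: str


Critic = Callable[[State, Action], float]


def select_action(state: State, candidates: Sequence[Action], critic: Critic) -> Action:
    """Return the candidate that the critic scores highest for this state."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    return max(candidates, key=lambda action: critic(state, action))


def toy_critic(state: State, action: Action) -> float:
    """Stand-in for a learned critic (an assumption, not the paper's model)."""
    if action.kind == "answer":
        # Penalize answering before any evidence has been gathered.
        return 1.0 if state.history else -1.0
    question_words = set(state.question.lower().split())
    query_words = set(action.text.lower().split())
    # Fraction of query words that also appear in the question.
    return len(question_words & query_words) / max(len(query_words), 1)


state = State(question="Who directed the film that won Best Picture in 1998")
candidates = [
    Action("answer", "James Cameron"),
    Action("search", "famous movies"),
    Action("search", "best picture 1998 film"),
]
best = select_action(state, candidates, toy_critic)
print(best.kind, "->", best.text)
```

Because the critic only sees a state and a candidate action, the same scoring function can in principle rank candidates produced by any generator, which is the property the transfer claim depends on.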

If this is right

  • Agents generate higher-quality search queries rather than focusing only on final answer optimization.
  • Performance gains are larger in out-of-domain settings than in-domain ones.
  • Learned search critics transfer to other models, including proprietary LLMs.
  • Reasoning reflection becomes a necessary capability for effective search agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Process supervision could reduce reward hacking in LLM agents for tasks other than search.
  • The same shift from outcome to process rewards might improve generalization in planning or tool-using agents.
  • The transferability of critics suggests a path toward reusable search modules that work across base models.

Load-bearing premise

The chosen multi-hop benchmarks and out-of-domain splits accurately measure generalization separate from the model's existing parametric knowledge.

What would settle it

Test both types of agents on questions whose answers consist of facts introduced after the base model's training data cutoff, then check whether the process-supervised version retains its performance edge.
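
A minimal sketch of how that comparison could be scored, assuming each evaluation item records the date its answer fact became public and whether each agent answered correctly. The `Result` record, its field names, and `post_cutoff_edge` are hypothetical, and the data is made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class Result:
    """Hypothetical per-question record for the two agents being compared."""
    fact_date: date        # when the answer fact first became public
    process_correct: bool  # process-supervised agent answered correctly
    outcome_correct: bool  # outcome-supervised agent answered correctly


def post_cutoff_edge(results: Iterable[Result], cutoff: date) -> float:
    """Accuracy gap (process minus outcome) on facts newer than the cutoff."""
    fresh = [r for r in results if r.fact_date > cutoff]
    if not fresh:
        raise ValueError("no questions fall after the training cutoff")
    process_acc = sum(r.process_correct for r in fresh) / len(fresh)
    outcome_acc = sum(r.outcome_correct for r in fresh) / len(fresh)
    return process_acc - outcome_acc


# Illustrative data only; the cutoff date is an assumption.
cutoff = date(2024, 6, 1)
results = [
    Result(date(2023, 1, 10), True, True),    # pre-cutoff, ignored
    Result(date(2024, 9, 2), True, False),
    Result(date(2025, 1, 15), True, True),
    Result(date(2022, 5, 5), False, True),    # pre-cutoff, ignored
]
edge = post_cutoff_edge(results, cutoff)
print(f"post-cutoff edge: {edge:+.2f}")
```

A positive edge that persists on post-cutoff items would be hard to attribute to parametric recall, since neither agent could have memorized those facts.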

Figures

Figures reproduced from arXiv: 2502.13957 by Aidong Zhang, Dengyu Wang, Fangyuan Chen, Guangzhi Xiong, Haolin Liu, Minjia Zhang, Qiao Jin, Xiao Wang, Yifan Yang, Yin Fang, Zhixing Song, Zhiyong Lu.

Figure 1: Overview of the RAG-Gym framework. RAG-Gym employs a modular design, comprising …
Figure 2: Performance improvements across various agents with critics.
Figure 3: Performance of Re²Search agents with critics trained on different numbers of samples. For MedQA, which involves complex reasoning and information-seeking tasks requiring domain-specific knowledge, a different trend is observed. With only 250 training samples, the performance slightly drops below the ZSL baseline, highlighting the challenges of capturing intricate domain-specific processes with limited trai…
Figure 4: Performance of Re²Search agents with different numbers of actions sampled per step.
Figure 5: Pipeline of the process data collection in RAG-Gym. Process reward data is collected by …
Figure 6: Comparison of different agent architectures in handling a multi-hop question from Bam…
Figure 7: Comparison of different agent architectures in handling a multi-hop question from MedQA.
Figure 8: Template used for history knowledge summarization in Search-o1 and Re…
Figure 9: Template used to generate actions for the Re…
Figure 10: Template used by GPT-4o to rank action candidates given the state.
Original abstract

Large language models (LLMs) are transforming web search by shifting from document ranking to synthesizing answers, and are increasingly deployed as autonomous agentic search systems that iteratively interact with external knowledge sources. Despite this progress, building effective search agents remains challenging because high-quality intermediate search steps are difficult to generate. Previous approaches have primarily relied on outcome supervision, rewarding agents only for producing correct final answers. This often leads to reward hacking and excessive dependence on parametric memory, limiting generalization to out-of-domain tasks. To address these limitations, we introduce RAG-Gym, a framework that shifts supervision from final answers to the search process itself. With RAG-Gym, we systematically investigate architecture design, parameter optimization, and action evaluation, identifying reasoning reflection as a critical capability for search agents. Building on this insight, we propose Re²Search++, a process-supervised agent that achieves substantial improvements on multi-hop information-seeking benchmarks, especially in out-of-domain settings. Performance gains are driven primarily by higher-quality search queries rather than answer optimization alone, and the learned search critics transfer across models, including proprietary LLMs. These findings show that supervising the search process produces more reliable and generalizable information-seeking agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RAG-Gym, a framework that shifts supervision from final-answer outcomes to the intermediate search process in LLM-based information-seeking agents. It identifies reasoning reflection as a key capability, proposes the Re²Search++ agent, and reports substantial gains on multi-hop QA benchmarks (HotpotQA, 2WikiMultihopQA, MuSiQue), especially in out-of-domain settings. Gains are attributed primarily to higher-quality search queries produced by the process-supervised policy, with learned search critics shown to transfer across models, including proprietary LLMs.

Significance. If the central empirical claims hold after addressing isolation of search behavior, the work would advance agentic RAG systems by providing evidence that process supervision reduces reward hacking and improves OOD generalization beyond outcome-only training. Systematic investigation of architecture, optimization, and action evaluation, plus the transfer result, are strengths that could influence future agent design.

major comments (2)
  1. [OOD evaluation (Experiments section)] The claim that performance gains derive from the process-supervised search policy rather than residual parametric recall is load-bearing for the generalization argument. Standard multi-hop splits (HotpotQA, 2WikiMultihopQA, MuSiQue) are drawn from corpora overlapping with LLM pretraining data; without explicit controls such as no-retrieval baselines, retrieval-only ablations, or synthetic/post-cutoff facts on the same OOD items, higher OOD scores could reflect improved exploitation of internal knowledge rather than learned search behavior. The abstract's emphasis on 'higher-quality search queries' does not rule this out.
  2. [Transfer of search critics (Experiments section)] The result that learned critics transfer to proprietary LLMs is presented as supporting evidence for process supervision, but the manuscript does not report whether the transferred critics were evaluated with the same no-retrieval or parametric controls as the main agents; this weakens the claim that the critics encode generalizable search policies independent of base-model memory.
minor comments (2)
  1. [Abstract] The list of benchmarks is given without citation or version details; adding these would improve reproducibility.
  2. [Method (§3)] Notation: 'Re²Search++' is introduced without an explicit expansion or comparison table to the base Re²Search variant; a small table in §3 would clarify the incremental changes.
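
The controls requested in the major comments amount to separating what an agent answers from memory from what it answers only with retrieval. A minimal sketch of that bookkeeping follows, assuming each item has been run once closed-book and once with search. The function name `search_attribution` and the data are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterable, Tuple


def search_attribution(items: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize (closed_book_correct, with_search_correct) pairs.

    "search_only" is the share of items answered correctly with search
    but not closed-book, i.e. the part not explained by parametric recall.
    """
    pairs = list(items)
    if not pairs:
        raise ValueError("need at least one evaluated item")
    n = len(pairs)
    return {
        "closed_book": sum(cb for cb, _ in pairs) / n,
        "with_search": sum(ws for _, ws in pairs) / n,
        "search_only": sum((not cb) and ws for cb, ws in pairs) / n,
    }


# Illustrative outcomes for four out-of-domain items (made-up data).
report = search_attribution([
    (True, True),    # already known without retrieval
    (False, True),   # recovered only through search
    (False, False),  # missed in both settings
    (False, True),   # recovered only through search
])
print(report)
```

Reporting the `search_only` share for both the process-supervised and outcome-supervised agents, and for the transferred critics, would show how much of each gain depends on retrieval.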

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [OOD evaluation (Experiments section)] The claim that performance gains derive from the process-supervised search policy rather than residual parametric recall is load-bearing for the generalization argument. Standard multi-hop splits (HotpotQA, 2WikiMultihopQA, MuSiQue) are drawn from corpora overlapping with LLM pretraining data; without explicit controls such as no-retrieval baselines, retrieval-only ablations, or synthetic/post-cutoff facts on the same OOD items, higher OOD scores could reflect improved exploitation of internal knowledge rather than learned search behavior. The abstract's emphasis on 'higher-quality search queries' does not rule this out.

    Authors: We agree that isolating search-policy contributions from parametric recall is critical for the OOD generalization argument. Our Experiments section already reports no-retrieval baselines and retrieval-only ablations on the same OOD splits, where the process-supervised agents outperform these controls; the gains are further supported by direct measurements of query quality. To strengthen the isolation, we will add synthetic/post-cutoff fact experiments on the OOD items in the revision. revision: yes

  2. Referee: [Transfer of search critics (Experiments section)] The result that learned critics transfer to proprietary LLMs is presented as supporting evidence for process supervision, but the manuscript does not report whether the transferred critics were evaluated with the same no-retrieval or parametric controls as the main agents; this weakens the claim that the critics encode generalizable search policies independent of base-model memory.

    Authors: We thank the referee for noting this reporting gap. The transfer experiments used the identical evaluation protocol (including no-retrieval settings) as the main agents. We will revise the manuscript to explicitly report the no-retrieval and parametric-control results for the transferred critics, confirming that the critics encode search policies independent of base-model memory. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical benchmark evaluation

full rationale

The paper introduces the RAG-Gym framework and Re²Search++ agent, then reports performance gains on standard multi-hop QA benchmarks (HotpotQA, 2WikiMultihopQA, MuSiQue) under out-of-domain splits. All load-bearing claims are supported by direct experimental comparisons of process-supervised vs. outcome-supervised agents, with no equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central result to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical framework paper; central claim rests on experimental results from benchmarks rather than mathematical axioms, free parameters, or invented entities.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

    cs.CL 2025-11 unverdicted novelty 7.0

    MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

  2. Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    RSCB-MC is a risk-sensitive contextual bandit memory controller for LLM coding agents that chooses safe actions including abstention, achieving 60.5% proxy success with 0% false positives and low latency in 200-case v...

  3. Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  4. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · cited by 4 Pith papers · 18 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Understanding prompt engineering may not require rethinking generalization

    Victor Akinwande, Yiding Jiang, Dylan Sam, and J Zico Kolter. Understanding prompt engineering may not require rethinking generalization. arXiv preprint arXiv:2310.03957, 2023

  3. [3]

    Retrievalsum: A retrieval enhanced framework for abstractive summarization

    Chenxin An, Ming Zhong, Zhichao Geng, Jianqiang Yang, and Xipeng Qiu. Retrievalsum: A retrieval enhanced framework for abstractive summarization. arXiv preprint arXiv:2109.07943, 2021

  4. [4]

    Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. Open- Review.net, 2024. URL https://openreview.net/forum?id=hSyW5go0v8

  5. [5]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022

  8. [8]

    ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

    Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. Research: Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025

  9. [9]

    A survey on knowledge-oriented retrieval-augmented generation

    Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang, Jiawei Cao, Jie Ma, et al. A survey on knowledge-oriented retrieval-augmented generation. arXiv preprint arXiv:2503.10677, 2025

  10. [10]

    Scaling instruction-finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

  11. [11]

    Reciprocal rank fusion outper- forms condorcet and individual rank learning methods

    Gordon V Cormack, Charles LA Clarke, and Stefan Buettcher. Reciprocal rank fusion outper- forms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 758–759, 2009

  12. [12]

    Progressive multimodal reasoning via active retrieval

    Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. Progressive multimodal reasoning via active retrieval. arXiv preprint arXiv:2412.14835, 2024. 10

  13. [13]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  14. [14]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

  15. [15]

    Enhancing noise robustness of retrieval-augmented language models with adaptive adversarial training

    Feiteng Fang, Yuelin Bai, Shiwen Ni, Min Yang, Xiaojun Chen, and Ruifeng Xu. Enhancing noise robustness of retrieval-augmented language models with adaptive adversarial training. arXiv preprint arXiv:2405.20978, 2024

  16. [16]

    Reward shaping to mitigate reward hacking in rlhf

    Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward shaping to mitigate reward hacking in rlhf. arXiv preprint arXiv:2502.18770, 2025

  17. [17]

    Smartrag: Jointly learn rag-related tasks from the environment feedback

    Jingsheng Gao, Linxu Li, Weiyuan Li, Yuzhuo Fu, and Bin Dai. Smartrag: Jointly learn rag-related tasks from the environment feedback. arXiv preprint arXiv:2410.18141, 2024

  18. [18]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023

  19. [19]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  20. [20]

    Automating systematic literature reviews with retrieval-augmented generation: A comprehensive overview

    Binglan Han, Teo Susnjak, and Anuradha Mathrani. Automating systematic literature reviews with retrieval-augmented generation: A comprehensive overview. Applied Sciences, 14(19): 9103, 2024

  21. [21]

    Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

  22. [22]

    Grounding by trying: Llms with reinforcement learning-enhanced retrieval

    Sheryl Hsu, Omar Khattab, Chelsea Finn, and Archit Sharma. Grounding by trying: Llms with reinforcement learning-enhanced retrieval. arXiv preprint arXiv:2410.23214, 2024

  23. [23]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  24. [24]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Zilin Zhu, Weixun Wang, Dehao Zhang, Yu Cao, et al. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143, 2024

  25. [25]

    Leveraging passage retrieval with generative models for open domain question answering

    Gautier Izacard and Édouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, 2021

  26. [26]

    Atlas: Few-shot learning with retrieval augmented language models

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251): 1–43, 2023

  27. [27]

    Adaptive- rag: Learning to adapt retrieval-augmented large language models through question complexity

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong-Cheol Park. Adaptive- rag: Learning to adapt retrieval-augmented large language models through question complexity. In 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 7036–7050. Association for Computational L...

  28. [28]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023. 11

  29. [29]

    Ras: Retrieval-and-structuring for knowledge-intensive llm generation

    Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, and Jiawei Han. Ras: Retrieval-and-structuring for knowledge-intensive llm generation. arXiv preprint arXiv:2502.10996, 2025

  30. [30]

    Deepretrieval: Hacking real search engines and retrievers with large language models via reinforcement learning

    Pengcheng Jiang, Jiacheng Lin, Lang Cao, Runchu Tian, SeongKu Kang, Zifeng Wang, Jimeng Sun, and Jiawei Han. Deepretrieval: Hacking real search engines and retrievers with large language models via reinforcement learning. arXiv preprint arXiv:2503.00223, 2025

  31. [31]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, 2023

  32. [32]

    Longrag: Enhancing retrieval-augmented generation with long-context llms

    Ziyan Jiang, Xueguang Ma, and Wenhu Chen. Longrag: Enhancing retrieval-augmented generation with long-context llms. arXiv preprint arXiv:2406.15319, 2024

  33. [33]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  34. [34]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  35. [35]

    Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval

    Qiao Jin, Won Kim, Qingyu Chen, Donald C Comeau, Lana Yeganova, W John Wilbur, and Zhiyong Lu. Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval. Bioinformatics, 39(11):btad651, 2023

  36. [36]

    Rag-rewardbench: Benchmarking reward models in retrieval augmented generation for preference alignment

    Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Rag-rewardbench: Benchmarking reward models in retrieval augmented generation for preference alignment. arXiv preprint arXiv:2412.13746, 2024

  37. [37]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020

  38. [38]

    Decomposed prompting: A modular approach for solving complex tasks

    Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum? id=_nGgzQjzaRy

  39. [39]

    PaperQA: Retrieval-Augmented Generative Agent for Scientific Research

    Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research. arXiv preprint arXiv:2312.07559, 2023

  40. [40]

    The role of prompt engineering in improving language understanding and generation

    Divya Lamba. The role of prompt engineering in improving language understanding and generation. International Journal For Multidisciplinary Research, 2024. URL https://api. semanticscholar.org/CorpusID:274939741

  41. [41]

    Ai-powered learning support: A study of retrieval-augmented generation (rag) chatbot effectiveness in an online course

    Guido Lang and Tan Gürpinar. Ai-powered learning support: A study of retrieval-augmented generation (rag) chatbot effectiveness in an online course. Information Systems Education Journal, 23(2), 2025

  42. [42]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  43. [43]

    Llmr: Knowledge distillation with a large language model-induced reward

    Dongheng Li, Yongchang Hao, and Lili Mou. Llmr: Knowledge distillation with a large language model-induced reward. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10657–10664, 2024. 12

  44. [44]

    Search-o1: Agentic Search-Enhanced Large Reasoning Models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366, 2025

  45. [45]

    Can we further elicit reasoning in llms? critic-guided planning with retrieval-augmentation for solving challenging tasks

    Xingxuan Li, Weiwen Xu, Ruochen Zhao, Fangkai Jiao, Shafiq Joty, and Lidong Bing. Can we further elicit reasoning in llms? critic-guided planning with retrieval-augmentation for solving challenging tasks. arXiv preprint arXiv:2410.01428, 2024

  46. [46]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum? id=v8L0pN6EOi

  47. [47]

    Learning to summarize from human feedback

    Fei Liu et al. Learning to summarize from human feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 583–592, 2020

  48. [48]

    Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines

    Siru Liu, Allison B McCoy, and Adam Wright. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. Journal of the American Medical Informatics Association , page ocaf008, 2025

  49. [49]

    Coevolv- ing with the other you: Fine-tuning llm with sequential cooperative multi-agent reinforcement learning

    Hao Ma, Tianyi Hu, Zhiqiang Pu, Liu Boyin, Xiaolin Ai, Yanyan Liang, and Min Chen. Coevolv- ing with the other you: Fine-tuning llm with sequential cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 37:15497–15525, 2024

  50. [50]

    Simpo: Simple preference optimization with a reference-free reward

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  51. [51]

    Reward-rag: Enhancing rag with reward driven supervision

    Thang Nguyen, Peter Chin, and Yu-Wing Tai. Reward-rag: Enhancing rag with reward driven supervision. arXiv preprint arXiv:2410.03780, 2024

  52. [52]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  53. [53]

    Legalbench-rag: A benchmark for retrieval-augmented generation in the legal domain

    Nicholas Pipitone and Ghita Houir Alami. Legalbench-rag: A benchmark for retrieval-augmented generation in the legal domain. arXiv preprint arXiv:2408.10343, 2024

  54. [54]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

  55. [55]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025

  56. [56]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  57. [57]

    In-context retrieval-augmented language models

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023

  58. [58]

    The probabilistic relevance framework: Bm25 and beyond

    Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009

  59. [59]

    Large language models for biomedicine: foundations, opportunities, challenges, and best practices

    Satya S Sahoo, Joseph M Plasek, Hua Xu, Özlem Uzuner, Trevor Cohen, Meliha Yetisgen, Hongfang Liu, Stéphane Meystre, and Yanshan Wang. Large language models for biomedicine: foundations, opportunities, challenges, and best practices. Journal of the American Medical Informatics Association, page ocae074, 2024

  60. [60]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  61. [61]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9248–9274, 2023

  62. [62]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  63. [63]

    Searchrag: Can search engines be helpful for llm-based medical question answering?

    Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, and Ninghao Liu. Searchrag: Can search engines be helpful for llm-based medical question answering? arXiv preprint arXiv:2502.13233, 2025

  64. [64]

    Generate-then-ground in retrieval-augmented generation for multi-hop question answering

    Zhengliang Shi, Shuo Zhang, Weiwei Sun, Shen Gao, Pengjie Ren, Zhumin Chen, and Zhaochun Ren. Generate-then-ground in retrieval-augmented generation for multi-hop question answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7339–7353, 2024

  65. [65]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

  66. [66]

    Retrieval augmentation reduces hallucination in conversation

    Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, 2021

  67. [67]

    Defining and characterizing reward gaming

    Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022

  68. [68]

    Language agents achieve superhuman synthesis of scientific knowledge

    Michael D Skarlinski, Sam Cox, Jon M Laurent, James D Braza, Michaela Hinks, Michael J Hammerling, Manvitha Ponnapati, Samuel G Rodriques, and Andrew D White. Language agents achieve superhuman synthesis of scientific knowledge. arXiv preprint arXiv:2409.13740, 2024

  69. [69]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025

  70. [70]

    Prototyping with prompts: Emerging approaches and challenges in generative ai design for collaborative software teams

    Hari Subramonyam, Divy Thakkar, Andrew Ku, Juergen Dieber, and Anoop K Sinha. Prototyping with prompts: Emerging approaches and challenges in generative ai design for collaborative software teams. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–22, 2025

  71. [71]

    Rearter: Retrieval-augmented reasoning with trustworthy process rewarding

    Zhongxiang Sun, Qipeng Wang, Weijie Yu, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Song Yang, and Han Li. Rearter: Retrieval-augmented reasoning with trustworthy process rewarding. arXiv preprint arXiv:2501.07861, 2025

  72. [72]

    Retrieval-augmented generation (rag) chatbots for education: A survey of applications

    Jakub Swacha and Michał Gracel. Retrieval-augmented generation (rag) chatbots for education: A survey of applications. Applied Sciences, 15(8):4234, 2025

  73. [73]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037, 2023

  74. [74]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  75. [75]

    Trl: Transformer reinforcement learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

  76. [76]

    Knowledge-driven cot: Exploring faithful reasoning in llms for knowledge-intensive question answering

    Keheng Wang, Feiyu Duan, Sirui Wang, Peiguang Li, Yunsen Xian, Chuantao Yin, Wenge Rong, and Zhang Xiong. Knowledge-driven cot: Exploring faithful reasoning in llms for knowledge-intensive question answering. arXiv preprint arXiv:2308.13259, 2023

  77. [77]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  78. [78]

    Factuality of large language models: A survey

    Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Georgiev, Rocktim Das, and Preslav Nakov. Factuality of large language models: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19519–19529, 2024

  79. [79]

    RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation

    Zihao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, and Yitao Liang. Rat: Retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation. arXiv preprint arXiv:2403.05313, 2024

  80. [80]

    Speculative rag: Enhancing retrieval augmented generation through drafting

    Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, et al. Speculative rag: Enhancing retrieval augmented generation through drafting. arXiv preprint arXiv:2407.08223, 2024

Showing first 80 references.