
arxiv: 2505.04588 · v3 · pith:ZYQFF5SAnew · submitted 2025-05-07 · 💻 cs.CL

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Pith reviewed 2026-05-22 16:00 UTC · model grok-4.3

classification 💻 cs.CL
keywords reinforcement learning · large language models · search capabilities · simulated retrieval · curriculum learning · tool augmentation · RL for LLMs

The pith

LLMs can develop strong search capabilities by training with simulated documents from another LLM instead of real search engines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ZeroSearch is a reinforcement learning method that trains large language models to use search by substituting a real search engine with an LLM-based retrieval module. This module is first fine-tuned to generate useful or noisy documents for queries, then used in a curriculum where document quality is gradually reduced to make the task harder. The goal is to build the main model's ability to reason despite imperfect information. Experiments indicate that this leads to performance comparable or superior to training with actual search APIs, and it scales across model sizes without high costs. If successful, it would allow widespread development of search-enhanced reasoning models without relying on expensive external services.
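The idea of a retrieval module that can be asked for either useful or noisy documents can be pictured as a prompt builder with a quality switch. The sketch below is only an illustration: the template wording, the `SimulatedSearchRequest` type and the `build_simulation_prompt` helper are hypothetical names introduced here, not the paper's prompt or code.

```python
from dataclasses import dataclass

# Hypothetical template: the wording is an assumption made for illustration,
# not the prompt used by the paper.
PROMPT_TEMPLATE = (
    "You are simulating a search engine. Write {n_docs} short documents "
    "that are {quality} for answering the query.\n"
    "Query: {query}\n"
)


@dataclass
class SimulatedSearchRequest:
    """One query sent to the simulated retrieval module (hypothetical type)."""
    query: str
    noisy: bool       # True asks for misleading documents, False for helpful ones
    n_docs: int = 5   # number of documents requested (assumed default)


def build_simulation_prompt(req: SimulatedSearchRequest) -> str:
    """Render the prompt that would be sent to the retrieval LLM."""
    if req.n_docs < 1:
        raise ValueError("n_docs must be at least 1")
    quality = "noisy" if req.noisy else "useful"
    return PROMPT_TEMPLATE.format(n_docs=req.n_docs, quality=quality, query=req.query)
```

Keeping the quality switch as a single word in the prompt is what would make document quality controllable from the training loop, which is the property the curriculum relies on.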

Core claim

By converting an LLM into a retrieval module through supervised fine-tuning, ZeroSearch generates documents of controllable quality. A curriculum-based rollout strategy then incrementally degrades these documents during RL training, forcing the primary model to improve its reasoning over noisy retrievals. This training transfers to real search engine interactions, with results showing a 3B retrieval module suffices for effective training, a 7B matches real engines, and a 14B exceeds them.

What carries the argument

The curriculum-based rollout strategy that incrementally degrades the quality of documents generated by the retrieval module to progressively build the main model's resilience to imperfect search results.
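A degradation schedule of this kind can be sketched as a function that maps the training step to the probability of serving a noisy document. This page does not give the schedule's functional form, so the exponential interpolation below, and the parameter names `p_start`, `p_end` and `base` with their default values, are assumptions chosen for illustration.

```python
import random


def noise_probability(step: int, total_steps: int, p_start: float = 0.0,
                      p_end: float = 0.5, base: float = 4.0) -> float:
    """Probability of serving a noisy document at a given training step.

    Assumed form: exponential interpolation from p_start to p_end, so that
    noise rises slowly at first and faster later (for base > 1).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    if base == 1.0:
        weight = frac  # degenerate case: plain linear interpolation
    else:
        weight = (base ** frac - 1.0) / (base - 1.0)
    return p_start + weight * (p_end - p_start)


def sample_document_quality(step: int, total_steps: int, rng: random.Random) -> str:
    """Draw 'noisy' or 'useful' for one simulated search call."""
    p_noisy = noise_probability(step, total_steps)
    return "noisy" if rng.random() < p_noisy else "useful"
```

With these assumed defaults, early rollouts see only useful documents and the final rollouts see noisy ones about half the time; the actual values would be a tuning choice.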

If this is right

  • Scalable RL training for search capabilities becomes feasible without hundreds of thousands of expensive API calls.
  • The method works with various model sizes and both base and instruction-tuned models.
  • Performance can match or surpass real search engines depending on the size of the retrieval module used.
  • It is compatible with a wide range of reinforcement learning algorithms.
  • Training avoids the instability caused by unpredictable real-world search result quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could apply similar simulation techniques to train LLMs on other tool-using behaviors like web browsing or code execution.
  • The success with larger retrieval modules suggests that more powerful simulators might further improve the transfer to real environments.
  • This approach might reduce the barrier for smaller labs to experiment with search-augmented LLMs.
  • Models trained this way may develop general strategies for dealing with uncertain information sources beyond just search engines.

Load-bearing premise

That the reasoning skills developed by handling progressively noisier simulated documents will transfer successfully to interactions with actual search engines.

What would settle it

Evaluate ZeroSearch-trained models on tasks that use a real search engine, and compare them with models trained directly on real search and with models given no search training. If the ZeroSearch-trained models show no advantage, or perform worse, the claim would be undermined.
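A comparison like this needs one metric applied identically to every training condition. The sketch below assumes exact match with the commonly used answer normalization (lowercasing, stripping punctuation and English articles); the function names and the condition labels in the usage note are hypothetical, and the metric choice is an assumption rather than something this page specifies.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], gold: list[list[str]]) -> float:
    """Exact-match accuracy in percent over a set of questions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return 100.0 * hits / len(predictions)


def compare_conditions(outputs: dict[str, list[str]],
                       gold: list[list[str]]) -> dict[str, float]:
    """Score each training condition on the same questions and gold answers."""
    return {name: em_score(preds, gold) for name, preds in outputs.items()}
```

Passing a dictionary with keys such as `"zerosearch"`, `"real_search"` and `"no_search"` would yield one score per condition on identical questions, which is the comparison the test above calls for.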

Figures

Figures reproduced from arXiv: 2505.04588 by Fei Huang, Hao Sun, Jiayan Guo, Jingren Zhou, Pengjun Xie, Xuanbo Fan, Yan Zhang, Yingyan Hou, Yong Jiang, Zile Qiao.

Figure 1: Demonstration of PPO and GRPO training without the search engine.
Figure 2: (a-b) Reward curve comparison between ZeroSearch and Search-R1 using Qwen-2.5-3B. (c) The reward curve and interaction turns during training of LLaMA-3.2-3B-Base.

Table text spilled into this caption (truncated in the source):

Search Engine   NQ      TriviaQA   PopQA   HotpotQA   2Wiki   Musique   Bamboogle   Avg.
Base Model      12.40   19.40      11.20   4.40       6.80    2.00      5.56        8.82
Prompt-3B       35.80   56.00      42.20   25.60      27.00   4.20      15.28       29.44
Prompt-7B       38.40   59.40      43.40   27.80      30.00   11.00     9.72        31.39
Promp…
Figure 3: Reward curve comparison between ZeroSearch and Search-R1 (using a real search engine). Plot text spilled into this caption: x-axis Step (0 to 200), y-axis Train Reward; panels (a) Qwen-2.5-3B (Base vs. Instruct), (b) Qwen-2.5-7B (Base vs. Instruct), (c) Effect of document masking (w Mask vs. w/o Mask).
Figure 4: (a-b) We compare the reward curve between base and instruct models using Qwen-2.5-3B
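The table fragment attached to the Figure 2 entry includes an Avg. column that appears to be the unweighted mean of the seven benchmark scores. The short check below recomputes it for the three complete rows; the numbers are copied from that fragment, while the helper names are hypothetical.

```python
BENCHMARKS = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2Wiki", "Musique", "Bamboogle"]

# (per-benchmark scores, reported Avg.) for the three complete rows of the fragment.
ROWS = {
    "Base Model": ([12.40, 19.40, 11.20, 4.40, 6.80, 2.00, 5.56], 8.82),
    "Prompt-3B": ([35.80, 56.00, 42.20, 25.60, 27.00, 4.20, 15.28], 29.44),
    "Prompt-7B": ([38.40, 59.40, 43.40, 27.80, 30.00, 11.00, 9.72], 31.39),
}


def macro_average(scores: list[float]) -> float:
    """Unweighted mean over benchmarks."""
    return sum(scores) / len(scores)


def check_rows(rows: dict[str, tuple[list[float], float]],
               tol: float = 0.006) -> dict[str, bool]:
    """Report whether each reported average matches the recomputed mean
    to within two-decimal rounding."""
    return {
        name: abs(macro_average(scores) - reported) < tol
        for name, (scores, reported) in rows.items()
    }
```

All three rows agree with an unweighted mean, so the average treats the small Bamboogle set the same as the larger benchmarks.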
Original abstract

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ZeroSearch, a reinforcement learning framework that trains LLMs to use search engines via simulated searches from a separate retrieval LLM module. It begins with SFT to enable the retrieval module to generate useful and noisy documents, then applies RL with a curriculum rollout that incrementally degrades document quality to build reasoning robustness. Experiments report that a 3B retrieval module incentivizes search capabilities, a 7B module matches real search engine performance, and a 14B module surpasses it, with generalization across base/instruction-tuned models and RL algorithms.

Significance. If the transfer from simulated to real search is validated, ZeroSearch provides a practical way to reduce API costs and control document quality noise during RL training of search-augmented LLMs. The curriculum degradation approach offers a controllable way to scale training difficulty. The reported scaling with retrieval module size (3B to 14B) and compatibility with multiple RL methods are positive indicators of broader applicability, though these rest on the unablated transfer assumption.

major comments (2)
  1. [Experiments] Experiments section: The headline result that a 14B retrieval module surpasses real search is central to the paper's claim of effective simulation, yet no ablation is reported that holds retrieval quality fixed or compares directly against standard RL/SFT without the curriculum degradation schedule. This leaves open whether observed gains stem from the proposed curriculum mechanism or from generic RL benefits, undermining attribution of the transfer effect.
  2. [Method] Method section on curriculum rollout: The strategy of incrementally degrading generated document quality is load-bearing for the claim that it 'progressively elicits the model's reasoning ability' in a manner that transfers to real search. However, the manuscript provides no sensitivity analysis on the degradation schedule parameters or direct comparison of rollouts using real search during training, making it impossible to confirm isolation of the curriculum's contribution.
minor comments (2)
  1. [Abstract] Abstract and results: Performance claims for 3B/7B/14B modules lack error bars, detailed baseline descriptions, or statistical significance tests, which would strengthen interpretation of the scaling behavior.
  2. Notation and figures: Ensure consistent use of symbols for retrieval module sizes and document quality metrics across text and figures to improve clarity.
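The request for error bars and significance tests could be met with a paired bootstrap over evaluation questions. The sketch below assumes per-question correctness flags (1 or 0) for two systems on the same questions; the function name, the resample count and the percentile interval are illustrative choices, not something the manuscript reports.

```python
import random


def paired_bootstrap_diff(correct_a: list[int], correct_b: list[int],
                          n_resamples: int = 2000, seed: int = 0,
                          alpha: float = 0.05) -> tuple[float, float, float]:
    """Return (observed accuracy difference A - B, CI lower bound, CI upper bound).

    Questions are resampled with replacement, keeping each question's pair
    of outcomes together so that the comparison stays paired.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    # Percentile interval taken from the sorted bootstrap differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, lower, upper
```

If the resulting interval for, say, a 14B-simulator run versus a real-search run excludes zero, the "surpasses" claim has some statistical backing on that benchmark; if it straddles zero, "comparable" is the safer wording.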

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major point below with clarifications on our design choices and indicate where revisions will be made to strengthen attribution of results.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The headline result that a 14B retrieval module surpasses real search is central to the paper's claim of effective simulation, yet no ablation is reported that holds retrieval quality fixed or compares directly against standard RL/SFT without the curriculum degradation schedule. This leaves open whether observed gains stem from the proposed curriculum mechanism or from generic RL benefits, undermining attribution of the transfer effect.

    Authors: We agree that additional controls would help isolate the curriculum's specific contribution. The reported scaling results (3B module enables search, 7B matches real search, 14B exceeds it) already indicate that simulation quality drives performance rather than generic RL alone. To strengthen this, we will add an ablation in the revised experiments section that trains with standard RL/SFT without the degradation schedule while holding other factors fixed, allowing direct comparison of the curriculum's role in the transfer effect. revision: yes

  2. Referee: [Method] Method section on curriculum rollout: The strategy of incrementally degrading generated document quality is load-bearing for the claim that it 'progressively elicits the model's reasoning ability' in a manner that transfers to real search. However, the manuscript provides no sensitivity analysis on the degradation schedule parameters or direct comparison of rollouts using real search during training, making it impossible to confirm isolation of the curriculum's contribution.

    Authors: The curriculum begins with high-quality documents from the SFT retrieval module and gradually increases noise to build robustness to real-world variability. Parameters were selected via preliminary runs for training stability. A direct real-search rollout comparison during training is not performed because it would reintroduce the exact API costs and uncontrolled quality issues that ZeroSearch is designed to eliminate; transfer is instead validated through post-training evaluation on live search. We will add sensitivity analysis on degradation rate and noise levels to the appendix in revision. revision: partial
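The promised sensitivity analysis could start from a grid over schedule parameters, using the average fraction of noisy documents seen during training as a rough measure of overall difficulty. The exponential schedule form and the parameter names `p_start`, `p_end` and `base` are assumptions made for this sketch, since neither the report nor the rebuttal states the actual schedule.

```python
from itertools import product


def noise_probability(step: int, total_steps: int, p_start: float,
                      p_end: float, base: float) -> float:
    """Assumed schedule: exponential interpolation from p_start to p_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    weight = frac if base == 1.0 else (base ** frac - 1.0) / (base - 1.0)
    return p_start + weight * (p_end - p_start)


def mean_noise_exposure(total_steps: int, p_start: float,
                        p_end: float, base: float) -> float:
    """Average noise probability over all training steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    total = sum(noise_probability(s, total_steps, p_start, p_end, base)
                for s in range(total_steps))
    return total / total_steps


def sweep(total_steps: int, p_ends: list[float], bases: list[float],
          p_start: float = 0.0) -> list[dict[str, float]]:
    """One row per (p_end, base) setting, ready to pair with evaluation scores."""
    return [
        {"p_end": p_end, "base": base,
         "mean_noise": mean_noise_exposure(total_steps, p_start, p_end, base)}
        for p_end, base in product(p_ends, bases)
    ]
```

Each row would then be paired with the score of a model trained under that setting. A flat score across rows would suggest the method is robust to the schedule, while a sharp peak would suggest the reported gains depend on tuning.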

Circularity Check

0 steps flagged

No circularity: standard SFT+RL applied to simulated-search setup with no equations or claims reducing to fitted inputs

full rationale

The paper presents ZeroSearch as an RL framework that first applies lightweight SFT to turn an LLM into a retrieval module generating useful/noisy documents, then uses curriculum-based rollouts that degrade document quality during training. No mathematical derivations, uniqueness theorems, or predictions are shown that reduce by construction to parameters fitted from the target result. Claims rest on experimental comparisons (3B/7B/14B retrieval modules vs real search) rather than self-referential definitions or self-citation chains. The approach is self-contained against external benchmarks and uses standard RL techniques without load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that a learned retrieval module plus controlled quality degradation can substitute for real search during training, plus standard RL assumptions about reward signals and policy improvement.

free parameters (1)
  • curriculum degradation schedule
    The rate and steps by which simulated document quality is reduced during rollouts is a design choice that must be tuned for the reported gains.
axioms (1)
  • domain assumption Simulated documents generated by the fine-tuned retrieval module can effectively train search and reasoning capabilities that transfer to real search engines.
    This premise is required for the entire simulated-training approach to be useful.




Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

  2. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 7.0

    CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

  3. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 7.0

    SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.

  4. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

    cs.CL 2026-05 unverdicted novelty 7.0

    LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

  5. MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

    cs.CL 2025-11 unverdicted novelty 7.0

    MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

  6. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  7. Harnessing LLM Agents with Skill Programs

    cs.AI 2026-05 conditional novelty 6.0

    HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning be...

  8. SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

    cs.CL 2026-05 unverdicted novelty 6.0

    SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.

  9. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

  10. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.

  11. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  12. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  13. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

  14. T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

  15. Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

    cs.CL 2026-04 unverdicted novelty 6.0

    CalibAdv calibrates advantages in GRPO by downscaling negative signals from incorrect final answers using intermediate step correctness and rebalancing answer-level advantages, yielding better performance and training...

  16. Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model

    cs.LG 2026-04 unverdicted novelty 6.0

    TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.

  17. $\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG 2026-04 unverdicted novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  18. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI 2025-09 accept novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

  19. Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

    cs.AI 2026-05 unverdicted novelty 5.0

    Search-E1 interleaves vanilla GRPO with offline self-distillation via token-level forward KL alignment to privileged sibling trajectories, reaching 0.440 average EM on seven QA benchmarks with Qwen2.5-3B and beating o...

  20. Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

    cs.AI 2026-05 unverdicted novelty 5.0

    MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

  21. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 5.0

    CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...

  22. Learning CLI Agents with Structured Action Credit under Selective Observation

    cs.AI 2026-05 unverdicted novelty 5.0

    CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.

  23. Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

    cs.CL 2025-10 unverdicted novelty 5.0

    ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.

  24. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  25. Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning

    cs.CL 2025-05 unverdicted novelty 5.0

    Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.

  26. Agentic Reasoning for Large Language Models

    cs.AI 2026-01 unverdicted novelty 4.0

    The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
