ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Pith reviewed 2026-05-22 16:00 UTC · model grok-4.3
The pith
LLMs can develop strong search capabilities by training with simulated documents from another LLM instead of real search engines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting an LLM into a retrieval module through supervised fine-tuning, ZeroSearch generates documents of controllable quality. A curriculum-based rollout strategy then incrementally degrades these documents during RL training, forcing the primary model to improve its reasoning over noisy retrievals. This training transfers to real search engine interactions: the reported results show that a 3B retrieval module suffices for effective training, a 7B module matches a real search engine, and a 14B module exceeds it.
What carries the argument
The curriculum-based rollout strategy that incrementally degrades the quality of documents generated by the retrieval module to progressively build the main model's resilience to imperfect search results.
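The degradation schedule can be sketched in a few lines. The formula is the one quoted from the paper in the Lean-theorem section of this page, $p_i = p_s + \frac{b^{i/m} - 1}{b - 1}(p_e - p_s)$. Reading $p_i$ as the probability of generating a noisy document at step $i$ of $m$ is an interpretation, and the function name and numeric defaults (`p_start=0.0`, `p_end=0.5`, `base=4.0`, 200 steps) are assumptions chosen for illustration, not values confirmed here.

```python
def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.5,
                      base: float = 4.0) -> float:
    """Curriculum schedule p_i = p_s + (b**(i/m) - 1) / (b - 1) * (p_e - p_s)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if base <= 0 or base == 1.0:
        raise ValueError("base must be positive and different from 1")
    # Clamp i/m into [0, 1] so out-of-range steps stay at the endpoints.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (base ** frac - 1.0) / (base - 1.0) * (p_end - p_start)


# Illustrative values only: noise probability at five points of a 200-step run.
schedule = [round(noise_probability(i, 200), 3) for i in (0, 50, 100, 150, 200)]
print(schedule)
```

With a base above 1 the curve is convex, so most of the degradation arrives late in training and the early steps see mostly clean documents.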
If this is right
- Scalable RL training for search capabilities becomes feasible without hundreds of thousands of expensive API calls.
- The method works with various model sizes and both base and instruction-tuned models.
- Performance can match or surpass real search engines depending on the size of the retrieval module used.
- It is compatible with a wide range of reinforcement learning algorithms.
- Training avoids the instability caused by unpredictable real-world search result quality.
Where Pith is reading between the lines
- Researchers could apply similar simulation techniques to train LLMs on other tool-using behaviors like web browsing or code execution.
- The success with larger retrieval modules suggests that more powerful simulators might further improve the transfer to real environments.
- This approach might reduce the barrier for smaller labs to experiment with search-augmented LLMs.
- Models trained this way may develop general strategies for dealing with uncertain information sources beyond just search engines.
Load-bearing premise
That the reasoning skills developed by handling progressively noisier simulated documents will transfer successfully to interactions with actual search engines.
What would settle it
Evaluate models trained using ZeroSearch on real search engine tasks and compare their performance to models trained directly with real searches or without search training; if no advantage or worse results appear, the claim would be undermined.
Original abstract
Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
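A minimal sketch of what one simulated search call during rollout might look like, assuming the retrieval module is steered toward useful or noisy output by the wording of its prompt. The prompt text, the `generate` callable, the stub `echo_llm`, and both function names are hypothetical stand-ins, not the paper's templates or API.

```python
import random
from typing import Callable, List


def build_retrieval_prompt(query: str, noisy: bool) -> str:
    """Hypothetical prompt asking the simulator for a useful or a noisy document."""
    kind = "noisy" if noisy else "useful"
    return f"Generate one {kind} document for the search query: {query}"


def simulate_search(query: str, p_noise: float,
                    generate: Callable[[str], str],
                    rng: random.Random, num_docs: int = 5) -> List[str]:
    """Return num_docs simulated documents; each is noisy with probability p_noise."""
    docs = []
    for _ in range(num_docs):
        noisy = rng.random() < p_noise  # independent draw per document
        docs.append(generate(build_retrieval_prompt(query, noisy)))
    return docs


def echo_llm(prompt: str) -> str:
    """Stub standing in for the fine-tuned retrieval LLM (assumption for the demo)."""
    return f"[simulated] {prompt}"


docs = simulate_search("example query", p_noise=0.3,
                       generate=echo_llm, rng=random.Random(0))
print(len(docs))
```

Passing the generator as a callable keeps the sketch independent of any particular serving stack; in a real setup `p_noise` would come from the curriculum schedule rather than a constant.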
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ZeroSearch, a reinforcement learning framework that trains LLMs to use search engines via simulated searches from a separate retrieval LLM module. It begins with SFT to enable the retrieval module to generate useful and noisy documents, then applies RL with a curriculum rollout that incrementally degrades document quality to build reasoning robustness. Experiments report that a 3B retrieval module incentivizes search capabilities, a 7B module matches real search engine performance, and a 14B module surpasses it, with generalization across base/instruction-tuned models and RL algorithms.
Significance. If the transfer from simulated to real search is validated, ZeroSearch provides a practical way to reduce API costs and control document quality noise during RL training of search-augmented LLMs. The curriculum degradation approach offers a controllable way to scale training difficulty. The reported scaling with retrieval module size (3B to 14B) and compatibility with multiple RL methods are positive indicators of broader applicability, though these rest on the unablated transfer assumption.
major comments (2)
- [Experiments] Experiments section: The headline result that a 14B retrieval module surpasses real search is central to the paper's claim of effective simulation, yet no ablation is reported that holds retrieval quality fixed or compares directly against standard RL/SFT without the curriculum degradation schedule. This leaves open whether observed gains stem from the proposed curriculum mechanism or from generic RL benefits, undermining attribution of the transfer effect.
- [Method] Method section on curriculum rollout: The strategy of incrementally degrading generated document quality is load-bearing for the claim that it 'progressively elicits the model's reasoning ability' in a manner that transfers to real search. However, the manuscript provides no sensitivity analysis on the degradation schedule parameters or direct comparison of rollouts using real search during training, making it impossible to confirm isolation of the curriculum's contribution.
minor comments (2)
- [Abstract] Abstract and results: Performance claims for 3B/7B/14B modules lack error bars, detailed baseline descriptions, or statistical significance tests, which would strengthen interpretation of the scaling behavior.
- Notation and figures: Ensure consistent use of symbols for retrieval module sizes and document quality metrics across text and figures to improve clarity.
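One common way to produce the error bars requested in the first minor comment is a percentile bootstrap over per-question exact-match outcomes. The sketch below uses only the Python standard library; the function name is hypothetical and the 0/1 score list is made-up toy data, not results from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(scores: List[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of 0/1 scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)
    n = len(scores)
    # Resample the questions with replacement and record each resampled mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


toy_scores = [1] * 40 + [0] * 60  # toy data: 40 of 100 questions answered exactly
low, high = bootstrap_ci(toy_scores)
print(f"EM = {sum(toy_scores) / len(toy_scores):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Comparing two systems on the same questions would call for resampling paired differences instead of each system separately, which this sketch does not cover.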
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major point below with clarifications on our design choices and indicate where revisions will be made to strengthen attribution of results.
Point-by-point responses
-
Referee: [Experiments] Experiments section: The headline result that a 14B retrieval module surpasses real search is central to the paper's claim of effective simulation, yet no ablation is reported that holds retrieval quality fixed or compares directly against standard RL/SFT without the curriculum degradation schedule. This leaves open whether observed gains stem from the proposed curriculum mechanism or from generic RL benefits, undermining attribution of the transfer effect.
Authors: We agree that additional controls would help isolate the curriculum's specific contribution. The reported scaling results (3B module enables search, 7B matches real search, 14B exceeds it) already indicate that simulation quality drives performance rather than generic RL alone. To strengthen this, we will add an ablation in the revised experiments section that trains with standard RL/SFT without the degradation schedule while holding other factors fixed, allowing direct comparison of the curriculum's role in the transfer effect.
Revision: yes
-
Referee: [Method] Method section on curriculum rollout: The strategy of incrementally degrading generated document quality is load-bearing for the claim that it 'progressively elicits the model's reasoning ability' in a manner that transfers to real search. However, the manuscript provides no sensitivity analysis on the degradation schedule parameters or direct comparison of rollouts using real search during training, making it impossible to confirm isolation of the curriculum's contribution.
Authors: The curriculum begins with high-quality documents from the SFT retrieval module and gradually increases noise to build robustness to real-world variability. Parameters were selected via preliminary runs for training stability. A direct real-search rollout comparison during training is not performed because it would reintroduce the exact API costs and uncontrolled quality issues that ZeroSearch is designed to eliminate; transfer is instead validated through post-training evaluation on live search. We will add sensitivity analysis on degradation rate and noise levels to the appendix in revision.
Revision: partial
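A sensitivity analysis of the kind promised here could start from a small grid over the schedule's base $b$. The sketch assumes the schedule form quoted on this page, $p_i = p_s + \frac{b^{i/m} - 1}{b - 1}(p_e - p_s)$, and reports the fraction of training at which the noise probability reaches the midpoint between $p_s$ and $p_e$. The function name and the grid values are illustrative assumptions, not the authors' settings.

```python
import math


def half_noise_fraction(base: float) -> float:
    """Fraction i/m at which p_i reaches the midpoint (p_s + p_e) / 2.

    Solving (b**x - 1) / (b - 1) = 1/2 for x gives x = log((b + 1) / 2) / log(b),
    which does not depend on p_s or p_e.
    """
    if base <= 0 or base == 1.0:
        raise ValueError("base must be positive and different from 1")
    return math.log((base + 1.0) / 2.0) / math.log(base)


for b in (2.0, 4.0, 16.0):  # illustrative grid only
    print(f"b={b:>4}: midpoint noise reached at {half_noise_fraction(b):.3f} of training")
```

Larger bases push the midpoint later, so a single scalar already summarizes how back-loaded the degradation is for each candidate setting.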
Circularity Check
No circularity: standard SFT+RL applied to simulated-search setup with no equations or claims reducing to fitted inputs
Full rationale
The paper presents ZeroSearch as an RL framework that first applies lightweight SFT to turn an LLM into a retrieval module generating useful/noisy documents, then uses curriculum-based rollouts that degrade document quality during training. No mathematical derivations, uniqueness theorems, or predictions are shown that reduce by construction to parameters fitted from the target result. Claims rest on experimental comparisons (3B/7B/14B retrieval modules vs real search) rather than self-referential definitions or self-citation chains. The approach is self-contained against external benchmarks and uses standard RL techniques without load-bearing self-citations or ansatz smuggling.
Axiom & Free-Parameter Ledger
free parameters (1)
- curriculum degradation schedule
axioms (1)
- domain assumption: Simulated documents generated by the fine-tuned retrieval module can effectively train search and reasoning capabilities that transfer to real search engines.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
curriculum-based rollout strategy that incrementally degrades the quality of generated documents... $p_i = p_s + \frac{b^{i/m} - 1}{b - 1}(p_e - p_s)$
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
-
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.
-
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
-
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
-
Group-in-Group Policy Optimization for LLM Agent Training
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
-
Harnessing LLM Agents with Skill Programs
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning be...
-
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
-
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
-
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.
-
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
-
AIPO: Learning to Reason from Active Interaction
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
-
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
-
Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search
CalibAdv calibrates advantages in GRPO by downscaling negative signals from incorrect final answers using intermediate step correctness and rebalancing answer-level advantages, yielding better performance and training...
-
Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model
TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.
-
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Search-E1 interleaves vanilla GRPO with offline self-distillation via token-level forward KL alignment to privileged sibling trajectories, reaching 0.440 average EM on seven QA benchmarks with Qwen2.5-3B and beating o...
-
Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
-
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...
-
Learning CLI Agents with Structured Action Credit under Selective Observation
CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
-
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning
Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...