pith. machine review for the scientific record.

arxiv: 2507.02592 · v1 · pith:LZ6OY7ZGnew · submitted 2025-07-03 · 💻 cs.CL · cs.AI

WebSailor: Navigating Super-human Reasoning for Web Agent

Pith reviewed 2026-05-17 15:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords web agents · LLM post-training · reinforcement learning · information seeking · uncertainty reduction · agentic RL · DUPO · BrowseComp

The pith

WebSailor equips open-source models with the ability to reduce extreme uncertainty in web navigation, allowing them to match proprietary agents on complex information-seeking tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that proprietary web agents succeed on difficult benchmarks because they possess a reasoning pattern for systematically reducing extreme uncertainty across vast information spaces, a pattern missing from open-source models. WebSailor addresses this gap through a post-training pipeline that creates novel high-uncertainty tasks via structured sampling and information obfuscation, applies RFT cold start, and optimizes with the Duplicating Sampling Policy Optimization (DUPO) agentic RL algorithm. This integrated approach produces open-source agents that outperform prior open models and reach the performance of closed systems on tasks like those in BrowseComp. Readers should care because the method provides a concrete route to democratizing advanced web agent capabilities through targeted training rather than proprietary scale.

Core claim

The authors present WebSailor as a complete post-training methodology designed to instill the capability of systematically reducing extreme uncertainty when navigating vast information landscapes. This is achieved by generating novel high-uncertainty tasks through structured sampling and information obfuscation, followed by RFT cold start and an efficient agentic RL training algorithm called Duplicating Sampling Policy Optimization (DUPO). With this pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks while matching proprietary agents and closing the capability gap.

What carries the argument

The integrated pipeline of structured sampling with information obfuscation for high-uncertainty task generation, RFT cold start, and Duplicating Sampling Policy Optimization (DUPO) for agentic reinforcement learning.
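
The task-generation step can be pictured with the toy sketch below. It is an illustration under stated assumptions, not the paper's procedure: this page does not say how structured sampling or obfuscation is implemented, so the tiny graph `GRAPH`, the random-walk sampler `sample_chain`, and the decade-coarsening rule in `obfuscate` are all hypothetical.

```python
import random
import re

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbour).
GRAPH = {
    "Film A": [("directed by", "Director B"), ("released in", "2011")],
    "Director B": [("born in", "City C"), ("won", "Award D")],
    "City C": [("located in", "Country E")],
    "Award D": [("first given in", "1987")],
}

def sample_chain(graph, start, hops, rng):
    """Random walk of up to `hops` edges; an assumed stand-in for structured sampling."""
    chain, node = [], start
    for _ in range(hops):
        edges = graph.get(node)
        if not edges:
            break  # dead end: the walk stops early
        relation, neighbour = rng.choice(edges)
        chain.append((node, relation, neighbour))
        node = neighbour
    return chain

def obfuscate(value):
    """Coarsen an exact four-digit year into a decade; leave other values unchanged."""
    if re.fullmatch(r"\d{4}", value):
        return f"the {value[:3]}0s"
    return value

def render_clues(chain):
    """Turn sampled edges into clue strings with obfuscated tail values."""
    return [f"{head} {relation} {obfuscate(tail)}" for head, relation, tail in chain]

rng = random.Random(0)
chain = sample_chain(GRAPH, "Film A", hops=3, rng=rng)
clues = render_clues(chain)
```

Coarsening a value such as a year widens the set of entities that satisfy each clue, which is one plausible way a generated question could start from high uncertainty.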

If this is right

  • Open-source web agents achieve performance levels on complex information-seeking benchmarks that were previously limited to proprietary systems.
  • The performance gap between open and closed web agents narrows through focused post-training on uncertainty reduction rather than increased model scale.
  • Similar post-training methods could extend the same reasoning pattern to additional agentic tasks involving large unstructured data sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emphasis on uncertainty reduction may apply to agent training in other domains such as scientific database navigation or enterprise document search where information volume creates comparable challenges.
  • Researchers could test whether the task generation approach using information obfuscation transfers to non-web settings like code repositories or knowledge graphs.
  • Open-source developers might integrate the DUPO algorithm into existing agent frameworks to improve navigation without requiring access to proprietary training data or compute.

Load-bearing premise

The central claim depends on the premise that proprietary agents outperform open-source ones primarily due to a trainable reasoning pattern for reducing extreme uncertainty in large information landscapes.

What would settle it

If WebSailor-trained agents do not match or exceed proprietary performance scores on the BrowseComp benchmark, this would indicate that the uncertainty-reduction training pipeline fails to deliver the claimed capability.

read the original abstract

Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WebSailor, a post-training methodology for web agents that generates novel high-uncertainty tasks via structured sampling and information obfuscation, applies RFT cold-start, and uses Duplicating Sampling Policy Optimization (DUPO) to instill systematic uncertainty reduction in open-source LLMs. It claims this pipeline enables significant outperformance over all open-source agents on complex information-seeking benchmarks such as BrowseComp, matching proprietary agents and closing the capability gap.

Significance. If the performance gains and attribution to the uncertainty-reduction mechanism are robustly demonstrated, the work would be significant for agentic LLM training, as it offers a concrete pipeline to replicate super-human web navigation capabilities in open-source models on extremely challenging benchmarks.

major comments (3)
  1. [Abstract and §1] Abstract and §1: The central claim that proprietary success stems from a 'sophisticated reasoning pattern' of systematic uncertainty reduction absent in open-source models, and that WebSailor specifically instills it, lacks isolating evidence; no ablations, entropy metrics over navigation paths, or step-wise information-gain comparisons are reported to distinguish the posited mechanism from generic effects of additional RL compute or data scale.
  2. [§3 and §4] §3 (Training Pipeline) and §4 (Experiments): The choice of task-generation heuristics and DUPO hyperparameters is described as independent of final benchmarks, yet without explicit separation (e.g., via held-out task variants or hyperparameter sensitivity tables), circularity risk remains for the cross-benchmark claims.
  3. [Results tables] Table 1 or equivalent results section: Reported gains over open-source agents are presented without seed-wise variance, multiple benchmark variants, or controls for post-training compute, making it impossible to confirm robustness as required for the 'closing the capability gap' conclusion.
minor comments (2)
  1. [§3.2] Notation for DUPO is introduced in the methods section without clear algorithmic pseudocode or a comparison to standard policy optimization variants.
  2. [Figures 2-3] Figure captions for task-generation examples could more explicitly label the information-obfuscation steps to aid reproducibility.
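
The entropy and step-wise information-gain measurements requested in major comment 1 could take a form like the sketch below. The report does not define these metrics, so this is one assumed formulation: the agent's belief over candidate answers is taken as given after each step (how such beliefs would be elicited is left open), and the function names and example trajectory are hypothetical.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def stepwise_information_gain(beliefs):
    """Entropy drop between consecutive belief states along one trajectory."""
    entropies = [shannon_entropy(b) for b in beliefs]
    return [before - after for before, after in zip(entropies, entropies[1:])]

# Hypothetical belief over four candidate answers after each tool call.
trajectory = [
    [0.25, 0.25, 0.25, 0.25],  # start: uniform over four candidates
    [0.5, 0.5, 0.0, 0.0],      # two candidates ruled out
    [1.0, 0.0, 0.0, 0.0],      # answer pinned down
]
gains = stepwise_information_gain(trajectory)
```

Comparing such per-step gains between a WebSailor-trained agent and a baseline trained with matched compute is the kind of evidence that would separate the posited mechanism from a generic scale effect.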

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions where necessary to enhance the robustness of our claims.

read point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1: The central claim that proprietary success stems from a 'sophisticated reasoning pattern' of systematic uncertainty reduction absent in open-source models, and that WebSailor specifically instills it, lacks isolating evidence; no ablations, entropy metrics over navigation paths, or step-wise information-gain comparisons are reported to distinguish the posited mechanism from generic effects of additional RL compute or data scale.

    Authors: We appreciate the referee's emphasis on the need for isolating evidence to support our central claim. The WebSailor pipeline is specifically engineered to generate high-uncertainty tasks and employ DUPO to promote systematic uncertainty reduction, as detailed in Sections 3 and 4. Our results show that this leads to performance matching proprietary agents on BrowseComp, unlike other open-source approaches. However, we acknowledge the absence of direct metrics such as entropy over paths or information-gain comparisons. In the revised manuscript, we have incorporated an ablation study comparing DUPO against standard RL methods and added entropy analysis to better isolate the mechanism from generic compute effects. revision: yes

  2. Referee: [§3 and §4] §3 (Training Pipeline) and §4 (Experiments): The choice of task-generation heuristics and DUPO hyperparameters is described as independent of final benchmarks, yet without explicit separation (e.g., via held-out task variants or hyperparameter sensitivity tables), circularity risk remains for the cross-benchmark claims.

    Authors: We agree that demonstrating independence from the evaluation benchmarks is crucial to avoid circularity. The task-generation heuristics were developed based on general principles of information-seeking tasks and applied prior to benchmark evaluation. To further address this concern, we have added hyperparameter sensitivity tables and results on held-out task variants in the revised Section 4, confirming that performance gains hold across different configurations. revision: yes

  3. Referee: [Results tables] Table 1 or equivalent results section: Reported gains over open-source agents are presented without seed-wise variance, multiple benchmark variants, or controls for post-training compute, making it impossible to confirm robustness as required for the 'closing the capability gap' conclusion.

    Authors: We recognize the importance of statistical robustness in our results. The original submission reported single-run results for clarity, but we have now performed experiments with multiple random seeds and report mean and standard deviation in the updated results tables. We have also included controls for post-training compute by comparing against baselines with matched training budgets, and added evaluations on multiple benchmark variants to support the conclusion that the capability gap is closed. revision: yes
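
    A minimal sketch of the seed-wise reporting described in this response, assuming one scalar accuracy per seed. The helper name `summarize_runs` is hypothetical and the scores are placeholders, not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation (n - 1 denominator) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder accuracies for three seeds; not results from the paper.
placeholder_scores = [10.0, 12.0, 14.0]
mean, std = summarize_runs(placeholder_scores)
```

    The sample standard deviation is used because a handful of seeds is a sample of possible runs, not the full population.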

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper posits that proprietary agents succeed via systematic uncertainty reduction absent in open-source models, then describes an independent pipeline of high-uncertainty task generation via structured sampling and information obfuscation, followed by RFT cold-start and DUPO training. No equations, self-citations, or fitted parameters are shown in the provided text that reduce the claimed performance gains or the instilled reasoning pattern back to the inputs by construction. The methodology is presented as a novel post-training approach based on external insight from proprietary systems, with evaluation on separate benchmarks; the central claim retains independent content and does not collapse into self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that proprietary success comes from a specific uncertainty-reduction reasoning pattern that can be instilled via the described task generation and RL procedure. No new physical entities or mathematical axioms are introduced; the work is empirical ML training.

axioms (1)
  • [domain assumption] Open-source models lack the uncertainty-reduction reasoning pattern that proprietary agents possess.
    Stated in the abstract as the key insight motivating the entire pipeline.
invented entities (1)
  • DUPO (Duplicating Sampling Policy Optimization) [no independent evidence]
    purpose: Efficient agentic RL training algorithm for high-uncertainty web tasks.
    New algorithm introduced in the paper; no independent evidence provided beyond the training results.
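
This page gives DUPO's name and purpose but not its mechanics. Reading "duplicating sampling" literally, and assuming a group-relative setup in which each task receives several rollouts, one plausible mechanism is to drop tasks whose rollouts all earn the same reward and refill the batch by duplicating the remaining ones. The sketch below illustrates that assumed reading only; the function `fill_batch_by_duplication`, the task ids, and the rewards are hypothetical.

```python
import random
import statistics

def fill_batch_by_duplication(groups, batch_size, rng):
    """Keep groups with non-zero reward spread, then duplicate kept groups to refill.

    `groups` maps a task id to the list of rewards from its rollouts.
    """
    informative = [task for task, rewards in groups.items()
                   if statistics.pstdev(rewards) > 0]
    if not informative:
        raise ValueError("no group carries a learning signal")
    batch = informative[:batch_size]
    while len(batch) < batch_size:
        batch.append(rng.choice(informative))  # duplicate instead of sampling new tasks
    return batch

rollout_rewards = {
    "task-1": [1, 1, 1, 1],  # all correct: zero spread, dropped
    "task-2": [1, 0, 0, 1],  # mixed: kept
    "task-3": [0, 0, 0, 0],  # all wrong: zero spread, dropped
    "task-4": [0, 1, 0, 0],  # mixed: kept
}
batch = fill_batch_by_duplication(rollout_rewards, batch_size=4, rng=random.Random(0))
```

If this reading is right, the efficiency gain would come from refilling the batch without launching new rollouts, which are expensive for multi-step web tasks.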

pith-pipeline@v0.9.0 · 5507 in / 1311 out tokens · 124394 ms · 2026-05-17T15:30:30.256512+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes.

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO).

  • IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

    cs.AI 2026-04 accept novelty 8.0

    AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

  3. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  4. GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

  5. Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

    cs.AI 2026-01 conditional novelty 7.0

    DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.

  6. MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

    cs.CL 2025-11 unverdicted novelty 7.0

    MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

  7. LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.

  8. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

  9. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  10. $\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG 2026-04 unverdicted novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  11. Towards Knowledgeable Deep Research: Framework and Benchmark

    cs.AI 2026-04 unverdicted novelty 6.0

    The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

  12. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  13. Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search

    cs.AI 2026-03 conditional novelty 6.0

    EASP adds a Probe-then-Plan step so LLMs ground their search plans in actual retrieval snapshots and inventory, yielding higher recall and business metrics in sub-second production search.

  14. MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

    cs.CL 2025-11 unverdicted novelty 6.0

    MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.

  15. WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

    cs.IR 2025-08 unverdicted novelty 6.0

    WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.

  16. Don't Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination

    cs.CL 2026-04 unverdicted novelty 5.0

    The EDR system with outline reflection, dependency-controlled information flow, and evidence sufficiency criteria outperforms baselines on sales enablement and DeepResearch Bench by reducing premature stopping and imp...

  17. Mind DeepResearch Technical Report

    cs.AI 2026-04 unverdicted novelty 5.0

    MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

  18. SynthAgent: Adapting Web Agents with Synthetic Supervision

    cs.LG 2025-11 unverdicted novelty 5.0

    SynthAgent uses dual refinement of synthetic tasks and trajectories to produce higher-quality training data that improves web agent adaptation to target environments.

  19. Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    cs.CV 2025-09 unverdicted novelty 5.0

    Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.

  20. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  21. Agentic Reasoning for Large Language Models

    cs.AI 2026-01 unverdicted novelty 4.0

    The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

  22. A Survey of Reinforcement Learning for Large Reasoning Models

    cs.CL 2025-09 accept novelty 3.0

    A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
