WebSailor: Navigating Super-human Reasoning for Web Agent
Pith reviewed 2026-05-17 15:30 UTC · model grok-4.3
The pith
WebSailor equips open-source models with the ability to reduce extreme uncertainty in web navigation, allowing them to match proprietary agents on complex information-seeking tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present WebSailor as a complete post-training methodology designed to instill the capability of systematically reducing extreme uncertainty when navigating vast information landscapes. This is achieved by generating novel high-uncertainty tasks through structured sampling and information obfuscation, followed by RFT cold start and an efficient agentic RL training algorithm called Duplicating Sampling Policy Optimization (DUPO). With this pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks while matching proprietary agents and closing the capability gap.
What carries the argument
The integrated pipeline of structured sampling with information obfuscation for high-uncertainty task generation, RFT cold start, and Duplicating Sampling Policy Optimization (DUPO) for agentic reinforcement learning.
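The task-generation step can be made concrete with a toy sketch. The abstract names "structured sampling" and "information obfuscation" without giving a procedure, so everything here is an assumption made for illustration: the triple store, the random-walk sampler, the blurring rules, and the names `sample_path`, `obfuscate`, and `make_clues` are hypothetical and are not the authors' implementation.

```python
import random

# Toy knowledge graph as (subject, relation, object) triples.
# All entities are invented for illustration; none come from the paper.
TRIPLES = [
    ("Ada Example", "born_in", "1852"),
    ("Ada Example", "founded", "Example Society"),
    ("Example Society", "based_in", "Port Sample"),
    ("Port Sample", "located_in", "Samplia"),
]

def sample_path(triples, start, length, rng):
    """Structured sampling (assumed form): a random walk over linked triples."""
    path, node = [], start
    for _ in range(length):
        options = [t for t in triples if t[0] == node]
        if not options:
            break  # dead end: no outgoing edge from this node
        step = rng.choice(options)
        path.append(step)
        node = step[2]
    return path

def obfuscate(value):
    """Information obfuscation (assumed form): blur precise values."""
    if value.isdigit() and len(value) == 4:
        # A precise year becomes a vague decade.
        return f"the {value[:3]}0s"
    # A name is reduced to its initials.
    return "an entity with initials " + "".join(w[0] for w in value.split())

def make_clues(path):
    """Render each sampled triple as an obfuscated clue."""
    return [f"{obfuscate(s)} --{r}--> {obfuscate(o)}" for s, r, o in path]

if __name__ == "__main__":
    walk = sample_path(TRIPLES, "Ada Example", 3, random.Random(0))
    for clue in make_clues(walk):
        print(clue)
```

The point of the sketch is only that blurred clues leave many candidate entities open at the start, which is the kind of high initial uncertainty the summary describes.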
If this is right
- Open-source web agents achieve performance levels on complex information-seeking benchmarks that were previously limited to proprietary systems.
- The performance gap between open and closed web agents narrows through focused post-training on uncertainty reduction rather than increased model scale.
- Similar post-training methods could extend the same reasoning pattern to additional agentic tasks involving large unstructured data sources.
Where Pith is reading between the lines
- The emphasis on uncertainty reduction may apply to agent training in other domains such as scientific database navigation or enterprise document search where information volume creates comparable challenges.
- Researchers could test whether the task generation approach using information obfuscation transfers to non-web settings like code repositories or knowledge graphs.
- Open-source developers might integrate the DUPO algorithm into existing agent frameworks to improve navigation without requiring access to proprietary training data or compute.
Load-bearing premise
The central claim depends on the premise that proprietary agents outperform open-source ones primarily due to a trainable reasoning pattern for reducing extreme uncertainty in large information landscapes.
What would settle it
If WebSailor-trained agents do not match or exceed proprietary performance scores on the BrowseComp benchmark, this would indicate that the uncertainty-reduction training pipeline fails to deliver the claimed capability.
Original abstract
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WebSailor, a post-training methodology for web agents that generates novel high-uncertainty tasks via structured sampling and information obfuscation, applies RFT cold-start, and uses Duplicating Sampling Policy Optimization (DUPO) to instill systematic uncertainty reduction in open-source LLMs. It claims this pipeline enables significant outperformance over all open-source agents on complex information-seeking benchmarks such as BrowseComp, matching proprietary agents and closing the capability gap.
Significance. If the performance gains and attribution to the uncertainty-reduction mechanism are robustly demonstrated, the work would be significant for agentic LLM training, as it offers a concrete pipeline to replicate super-human web navigation capabilities in open-source models on extremely challenging benchmarks.
major comments (3)
- [Abstract and §1] Abstract and §1: The central claim that proprietary success stems from a 'sophisticated reasoning pattern' of systematic uncertainty reduction absent in open-source models, and that WebSailor specifically instills it, lacks isolating evidence; no ablations, entropy metrics over navigation paths, or step-wise information-gain comparisons are reported to distinguish the posited mechanism from generic effects of additional RL compute or data scale.
- [§3 and §4] §3 (Training Pipeline) and §4 (Experiments): The choice of task-generation heuristics and DUPO hyperparameters is described as independent of final benchmarks, yet without explicit separation (e.g., via held-out task variants or hyperparameter sensitivity tables), circularity risk remains for the cross-benchmark claims.
- [Results tables] Table 1 or equivalent results section: Reported gains over open-source agents are presented without seed-wise variance, multiple benchmark variants, or controls for post-training compute, making it impossible to confirm robustness as required for the 'closing the capability gap' conclusion.
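The entropy and step-wise information-gain measurements requested in the first major comment could look roughly like the sketch below. It assumes one can extract, after each agent step, a probability distribution over candidate answers; how to obtain such a belief state from a real agent is not addressed by the page, and the example trajectory is hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def stepwise_information_gain(beliefs):
    """Entropy drop between consecutive belief states along one trajectory."""
    hs = [entropy(b) for b in beliefs]
    return [hs[i] - hs[i + 1] for i in range(len(hs) - 1)]

# Hypothetical belief over four candidate answers after each agent step.
trajectory = [
    [0.25, 0.25, 0.25, 0.25],  # before any search: 2 bits
    [0.5, 0.5, 0.0, 0.0],      # two candidates ruled out: 1 bit
    [1.0, 0.0, 0.0, 0.0],      # answer identified: 0 bits
]
gains = stepwise_information_gain(trajectory)
print(gains)  # [1.0, 1.0]
```

Comparing such per-step gains between a WebSailor-trained agent and a compute-matched baseline would be one way to separate the posited mechanism from generic training effects.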
minor comments (2)
- [§3.2] DUPO notation is introduced without clear algorithmic pseudocode or a comparison to standard policy optimization variants in the methods section.
- [Figures 2-3] Figure captions for task-generation examples could more explicitly label the information-obfuscation steps to aid reproducibility.
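Since the first minor comment asks for pseudocode, here is a hedged guess at what a duplication-based batch fill might look like. The page does not specify DUPO's procedure, so this sketch assumes a group-relative RL setup in which a task whose rollouts all receive the same reward contributes no learning signal; such tasks are dropped and the remaining ones are duplicated to restore the batch size. The function name and data layout are hypothetical, and this should not be read as the authors' algorithm.

```python
import random
import statistics

def fill_batch_by_duplication(groups, batch_size, rng):
    """Drop zero-variance rollout groups, then duplicate survivors to refill.

    `groups` maps a task id to the list of rewards from its rollouts.
    Returns a list of task ids of length `batch_size`, or [] if no group
    has any reward variance.
    """
    # Keep only tasks whose rollouts disagree (non-zero reward spread).
    informative = [tid for tid, rewards in groups.items()
                   if statistics.pstdev(rewards) > 0]
    if not informative:
        return []
    batch = list(informative[:batch_size])
    # Refill by resampling informative tasks instead of drawing new rollouts.
    while len(batch) < batch_size:
        batch.append(rng.choice(informative))
    return batch

if __name__ == "__main__":
    rewards = {"a": [1, 1, 1, 1], "b": [0, 1, 0, 1],
               "c": [0, 0, 0, 0], "d": [1, 0, 0, 0]}
    print(fill_batch_by_duplication(rewards, 4, random.Random(0)))
```

Under these assumptions the appeal of duplication is that it avoids generating fresh rollouts, which are expensive in an agentic setting with tool calls.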
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions where necessary to enhance the robustness of our claims.
-
Referee: [Abstract and §1] Abstract and §1: The central claim that proprietary success stems from a 'sophisticated reasoning pattern' of systematic uncertainty reduction absent in open-source models, and that WebSailor specifically instills it, lacks isolating evidence; no ablations, entropy metrics over navigation paths, or step-wise information-gain comparisons are reported to distinguish the posited mechanism from generic effects of additional RL compute or data scale.
Authors: We appreciate the referee's emphasis on the need for isolating evidence to support our central claim. The WebSailor pipeline is specifically engineered to generate high-uncertainty tasks and employ DUPO to promote systematic uncertainty reduction, as detailed in Sections 3 and 4. Our results show that this leads to performance matching proprietary agents on BrowseComp, unlike other open-source approaches. However, we acknowledge the absence of direct metrics such as entropy over paths or information-gain comparisons. In the revised manuscript, we have incorporated an ablation study comparing DUPO against standard RL methods and added entropy analysis to better isolate the mechanism from generic compute effects. revision: yes
-
Referee: [§3 and §4] §3 (Training Pipeline) and §4 (Experiments): The choice of task-generation heuristics and DUPO hyperparameters is described as independent of final benchmarks, yet without explicit separation (e.g., via held-out task variants or hyperparameter sensitivity tables), circularity risk remains for the cross-benchmark claims.
Authors: We agree that demonstrating independence from the evaluation benchmarks is crucial to avoid circularity. The task-generation heuristics were developed based on general principles of information-seeking tasks and applied prior to benchmark evaluation. To further address this concern, we have added hyperparameter sensitivity tables and results on held-out task variants in the revised Section 4, confirming that performance gains hold across different configurations. revision: yes
-
Referee: [Results tables] Table 1 or equivalent results section: Reported gains over open-source agents are presented without seed-wise variance, multiple benchmark variants, or controls for post-training compute, making it impossible to confirm robustness as required for the 'closing the capability gap' conclusion.
Authors: We recognize the importance of statistical robustness in our results. The original submission reported single-run results for clarity, but we have now performed experiments with multiple random seeds and report mean and standard deviation in the updated results tables. We have also included controls for post-training compute by comparing against baselines with matched training budgets, and added evaluations on multiple benchmark variants to support the conclusion that the capability gap is closed. revision: yes
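The seed-wise reporting promised in the last response amounts to a small aggregation step, sketched below. The scores are hypothetical placeholders, not results from the paper, and the choice of sample standard deviation is an assumption.

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical accuracies from three seeds; not results from the paper.
runs = {
    "agent_a": [10.0, 12.0, 14.0],
    "agent_b": [11.0, 11.0, 11.0],
}

for name, scores in runs.items():
    mean, std = summarize(scores)
    print(f"{name}: {mean:.1f} ± {std:.1f}")
```

A gap between two agents is only persuasive when it is large relative to these per-seed spreads, which is the robustness check the referee asks for.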
Circularity Check
No significant circularity in derivation chain
The paper posits that proprietary agents succeed via systematic uncertainty reduction absent in open-source models, then describes an independent pipeline of high-uncertainty task generation via structured sampling and information obfuscation, followed by RFT cold-start and DUPO training. No equations, self-citations, or fitted parameters are shown in the provided text that reduce the claimed performance gains or the instilled reasoning pattern back to the inputs by construction. The methodology is presented as a novel post-training approach based on external insight from proprietary systems, with evaluation on separate benchmarks; the central claim retains independent content and does not collapse into self-definition or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Open-source models lack the uncertainty-reduction reasoning pattern that proprietary agents possess.
invented entities (1)
-
DUPO (Duplicating Sampling Policy Optimization)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes.
-
IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi
unclear: Relation between the paper passage and the cited Recognition theorem.
Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO).
-
IndisputableMonolith.Foundation.DimensionForcing dimension_forced
unclear: Relation between the paper passage and the cited Recognition theorem.
With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
-
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.
-
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
-
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
-
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
EASP adds a Probe-then-Plan step so LLMs ground their search plans in actual retrieval snapshots and inventory, yielding higher recall and business metrics in sub-second production search.
-
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
-
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
-
Don't Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination
The EDR system with outline reflection, dependency-controlled information flow, and evidence sufficiency criteria outperforms baselines on sales enablement and DeepResearch Bench by reducing premature stopping and imp...
-
Mind DeepResearch Technical Report
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
-
SynthAgent: Adapting Web Agents with Synthetic Supervision
SynthAgent uses dual refinement of synthetic tasks and trajectories to produce higher-quality training data that improves web agent adaptation to target environments.
-
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.