pith · machine review for the scientific record

arxiv: 2605.01489 · v1 · submitted 2026-05-02 · 💻 cs.AI · cs.CL

Recognition: unknown

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:14 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords scientific reasoning agents · automated data synthesis · frontier science · biology reasoning · chemistry reasoning · agentic reinforcement learning · supervised fine-tuning · information-seeking tasks

The pith

Automated synthesis of conceptual and computational tasks trains an 8B model to set new records on frontier biology and chemistry reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SciResearcher as a fully automated agentic system that generates training data for scientific reasoning by pulling from academic sources to create diverse tasks. These tasks target information-seeking, tool use, and extended reasoning chains that standard web or graph methods struggle to produce for frontier domains. The data then supports supervised fine-tuning followed by agentic reinforcement learning to produce SciResearcher-8B. This model reaches 19.46 percent on the HLE-Bio/Chem-Gold benchmark while posting 13-15 percent absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature. The approach matters because frontier science problems involve scattered, heterogeneous sources and heavy computation, not simple recall, so scalable data construction could accelerate agent development without constant human curation.

Core claim

SciResearcher is a fully automated agentic framework for frontier-science data construction that synthesizes diverse conceptual and computational tasks grounded in academic evidence. The framework elicits information acquisition, tool-integrated reasoning, and long-horizon capabilities. Training on the resulting data via supervised fine-tuning and agentic reinforcement learning yields SciResearcher-8B, which scores 19.46 percent on HLE-Bio/Chem-Gold and delivers 13-15 percent absolute improvements on SuperGPQA-Hard-Biology and TRQA-Literature, establishing a new state of the art at the 8B scale and surpassing several larger proprietary agents.

What carries the argument

The SciResearcher agentic framework, which automatically synthesizes grounded conceptual and computational tasks from academic sources to drive supervised fine-tuning and reinforcement learning for scientific agents.

If this is right

  • Smaller open models can match or exceed larger closed agents on hard scientific benchmarks when trained on suitably synthesized data.
  • The same synthesis loop can be iterated to generate larger and more diverse datasets for continued scaling.
  • Agentic reinforcement learning on these tasks strengthens long-horizon planning beyond what factual pre-training alone provides.
  • The framework reduces dependence on manual knowledge-graph or browsing pipelines that miss computational depth in sparse academic sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could transfer to other data-scarce fields such as materials discovery or theoretical physics by swapping the source academic corpora.
  • Pairing the synthesized tasks with real experimental logs or simulation outputs might close remaining gaps between benchmark performance and practical discovery.
  • The performance edge at 8B scale suggests that future gains may come more from data quality than from raw parameter count in scientific agent work.

Load-bearing premise

Tasks synthesized by the agentic framework accurately reflect the computational and reasoning demands of actual frontier scientific problems rather than simplified or proxy versions.

What would settle it

If a model trained on human-curated real frontier problems performs no better than SciResearcher-8B on the same held-out benchmarks, or if SciResearcher-8B fails on a fresh set of unsolved domain problems never seen during data synthesis, the value of the automated construction process would be in question.

Figures

Figures reproduced from arXiv: 2605.01489 by Rui Wang, Tianqing Fang, Tianshi Zheng, Xiyun Li, Yangqiu Song.

Figure 1. Performance comparison on HLE-Bio/Chem-Gold …
Figure 2. Comparison of ontology and web presence be…
Figure 3. Overview of our SciResearcher data construction framework. …specific and concrete to support further evidence-grounded expansion. After selecting the best anchor, we invoke a new web agent instance to gather additional academic evidence about that anchor and generate a new question whose answer is exactly the anchor entity. This newly generated question is then fused back into the previous question by repla…
Figure 4. A running example of a question evolution pipeline for conceptual task curation. Question…
Figure 5. (a) Word clouds of the curated questions from the two pipelines. (b) Distribution and…
Figure 6. (a) Distribution of trajectory lengths (in macro steps) for SFT and RL checkpoints. (b)…
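The anchor-fusion loop summarized in the Figure 3 caption can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `select_anchor`, `gather_evidence`, and `generate_subquestion` are hypothetical stand-ins for the paper's web-agent calls.

```python
# Hypothetical sketch of the anchor-based question-evolution loop described
# in the Figure 3 caption. Entity selection, evidence gathering, and
# sub-question generation are stubbed out; in the paper they are agent calls.

def evolve_question(question: str, select_anchor, gather_evidence,
                    generate_subquestion, depth: int = 3) -> str:
    """Iteratively deepen a seed question by replacing an anchor entity
    with a sub-question whose answer is exactly that entity."""
    for _ in range(depth):
        anchor = select_anchor(question)      # most specific, concrete entity
        if anchor is None or anchor not in question:
            break
        evidence = gather_evidence(anchor)    # new web-agent instance
        sub_q = generate_subquestion(anchor, evidence)  # answer == anchor
        # Fuse: replace the anchor mention with the sub-question's referent.
        question = question.replace(anchor, f"the entity such that ({sub_q})")
    return question
```

Each pass hides one named entity behind a question that resolves to it, which is how a single-source seed question grows into a multi-hop, evidence-grounded task.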
read the original abstract

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SciResearcher, a fully automated agentic framework for synthesizing diverse conceptual and computational tasks grounded in academic evidence to support information acquisition, tool use, and long-horizon reasoning. The authors apply supervised fine-tuning followed by agentic reinforcement learning on this data to produce SciResearcher-8B, which achieves 19.46% on HLE-Bio/Chem-Gold (new SOTA at 8B scale, surpassing some larger proprietary agents) along with 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature.

Significance. If the synthetic tasks are shown to impose reasoning loads comparable to real frontier scientific problems, the framework would provide a scalable, automated alternative to knowledge-graph or web-browsing curation methods, potentially enabling more capable agents for automated discovery in domains with sparse, heterogeneous sources.

major comments (3)
  1. [§3] §3 (Framework Description): The task-synthesis pipeline is described at a high level but supplies no pseudocode, concrete examples of generated computational tasks, or quantitative metrics (e.g., number of tool calls or reasoning steps required). Without these, it is impossible to verify that the data elicits the claimed long-horizon and tool-integrated reasoning rather than simpler pattern-matching proxies.
  2. [§5] §5 (Experiments): No ablation studies isolate the contribution of the agentic RL stage versus SFT alone, and no error bars or run-to-run variance are reported for the headline scores (19.46% on HLE-Bio/Chem-Gold, 13-15% gains elsewhere). This leaves open whether the reported improvements are robust or attributable to the framework.
  3. [§4] §4 (Data Construction): The manuscript provides no human-expert validation, difficulty calibration against real frontier problems, or comparison of source heterogeneity between synthetic tasks and actual academic literature. This directly bears on the central claim that performance gains reflect genuine scaling of scientific reasoning rather than distribution matching on easier proxies.
minor comments (2)
  1. [Abstract] The abstract states '13-15% absolute gains' without naming the exact baselines or providing the raw scores for each benchmark.
  2. [§2] Notation for the HLE-Bio/Chem-Gold benchmark is introduced without a reference or brief definition of its construction.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity and rigor, and we will revise the manuscript accordingly to address them. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [§3] §3 (Framework Description): The task-synthesis pipeline is described at a high level but supplies no pseudocode, concrete examples of generated computational tasks, or quantitative metrics (e.g., number of tool calls or reasoning steps required). Without these, it is impossible to verify that the data elicits the claimed long-horizon and tool-integrated reasoning rather than simpler pattern-matching proxies.

    Authors: We agree that additional concrete details are needed to substantiate the long-horizon and tool-use claims. In the revised manuscript we will include full pseudocode for the task-synthesis pipeline as an appendix. We will also provide multiple concrete examples of generated computational tasks (including both conceptual and multi-step computational instances) together with quantitative metrics such as the distribution of tool calls per task and average reasoning steps. These additions will enable readers to directly assess the complexity and structure of the synthesized data. revision: yes

  2. Referee: [§5] §5 (Experiments): No ablation studies isolate the contribution of the agentic RL stage versus SFT alone, and no error bars or run-to-run variance are reported for the headline scores (19.46% on HLE-Bio/Chem-Gold, 13-15% gains elsewhere). This leaves open whether the reported improvements are robust or attributable to the framework.

    Authors: We acknowledge the value of isolating the RL contribution and reporting statistical robustness. We will add ablation experiments comparing SFT-only training against the full SFT + agentic RL pipeline on the same data. We will also rerun the primary evaluation runs with multiple random seeds and report error bars (standard deviation) for the key metrics on HLE-Bio/Chem-Gold, SuperGPQA-Hard-Biology, and TRQA-Literature. These changes will clarify the source of the observed gains. revision: yes
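The seed-averaged reporting promised here takes only a few lines; the scores below are placeholders for illustration, not results from the paper.

```python
from statistics import mean, stdev

def summarize_runs(scores_by_seed: list[float]) -> str:
    """Report mean +/- sample standard deviation over random seeds."""
    m = mean(scores_by_seed)
    s = stdev(scores_by_seed) if len(scores_by_seed) > 1 else 0.0
    return f"{m:.2f} ± {s:.2f}"

# Placeholder scores for three hypothetical seeds:
print(summarize_runs([19.1, 19.5, 19.8]))  # → 19.47 ± 0.35
```

Sample (n-1) standard deviation is the appropriate estimator here, since the seeds are a sample of possible training runs rather than the full population.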

  3. Referee: [§4] §4 (Data Construction): The manuscript provides no human-expert validation, difficulty calibration against real frontier problems, or comparison of source heterogeneity between synthetic tasks and actual academic literature. This directly bears on the central claim that performance gains reflect genuine scaling of scientific reasoning rather than distribution matching on easier proxies.

    Authors: We will expand the data construction section to include quantitative comparisons of source heterogeneity (e.g., topic diversity, citation graph statistics, and domain coverage) between the synthetic tasks and samples drawn from the original academic literature. We will also add indirect difficulty calibration by reporting task-complexity statistics (tool-call depth, reasoning-chain length) and relating them to the hardness of the evaluation benchmarks. However, human-expert validation was not performed because the framework is intentionally fully automated; we will explicitly discuss this design choice and its implications as a limitation. revision: partial
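The task-complexity statistics mentioned above (tool-call depth, reasoning-chain length) are straightforward to compute once trajectories are logged. The step format below is a hypothetical assumption, not the paper's actual schema.

```python
from statistics import mean

def complexity_stats(trajectories):
    """Hypothetical trajectory format: each trajectory is a list of step
    dicts with a 'kind' field ('tool_call' or 'reason'). Returns the two
    statistics the rebuttal mentions."""
    tool_depths = [sum(1 for s in t if s["kind"] == "tool_call")
                   for t in trajectories]
    chain_lens = [len(t) for t in trajectories]
    return {"mean_tool_calls": mean(tool_depths),
            "mean_chain_length": mean(chain_lens)}
```

Relating these distributions to benchmark hardness is the indirect calibration the authors propose in place of human-expert validation.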

standing simulated objections (unresolved)
  • Direct human-expert validation of the synthetic tasks and explicit difficulty calibration against real frontier problems, as these steps would require substantial human annotation effort and contradict the core goal of a fully automated, scalable data-construction pipeline.

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline on external benchmarks

full rationale

The paper presents an agentic framework that synthesizes tasks from academic sources, applies standard supervised fine-tuning plus agentic RL, and reports accuracy on independent external benchmarks (HLE-Bio/Chem-Gold, SuperGPQA-Hard-Biology, TRQA-Literature). No equations, fitted parameters, or internal derivations are described. Performance claims are grounded in post-training evaluation rather than any self-referential reduction, self-citation chain, or renaming of known results. The central assumption (synthetic tasks match frontier demands) is an empirical validity question, not a circularity in the reported derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work relies on standard supervised fine-tuning and reinforcement learning applied to newly synthesized tasks.

pith-pipeline@v0.9.0 · 5546 in / 1167 out tokens · 34324 ms · 2026-05-09T14:14:38.432631+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

100 extracted references · 37 canonical work pages · 10 internal anchors

  1. [1]

    Introducing claude sonnet 4.5, September 2025

    Anthropic. Introducing claude sonnet 4.5, September 2025. URL https://www.anthropic.com/news /claude-sonnet-4-5

  2. [2]

    The discovery engine: A framework for ai-driven synthesis and navigation of scientific knowledge landscapes, 2025

    Vladimir Baulin, Austin Cook, Daniel Friedman, Janna Lumiruusu, Andrew Pashea, Shagor Rahman, and Benedikt Waldeck. The discovery engine: A framework for ai-driven synthesis and navigation of scientific knowledge landscapes, 2025. URLhttps://arxiv.org/abs/2505.17500

  3. [3]

    Scimaster: Towards general-purpose scientific ai agents, part i

    Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, and Siheng Chen. Scimaster: Towards general-purpose scientific ai agents, part i. x-master as foundation: Can we lead on humanity’s last exam?, 2025. URL https: //arxiv.org/abs/2507.05241

  4. [4]

    Try deep research and our new experimental model in gemini, your ai assistant, December

    Dave Citron. Try deep research and our new experimental model in gemini, your ai assistant, December

  5. [5]

    URLhttps://blog.google/products/gemini/google-gemini-deep-research/

  6. [6]

    Evolving scientific discovery by unifying data and background knowledge with ai hilbert.Nature Communications, 15(1):5922, July 2024

    Ryan Cory-Wright, Cristina Cornelio, Sanjeeb Dash, Bachir El Khadir, and Lior Horesh. Evolving scientific discovery by unifying data and background knowledge with ai hilbert.Nature Communications, 15(1):5922, July 2024. doi: 10.1038/s41467-024-50074-w

  7. [7]

    Deepresearch bench: A comprehensive benchmark for deep research agents, 2025

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents, 2025. URL https://arxiv.org/abs/2506.117 63

  8. [8]

    Ontologies for increasing the fairness of plant research data.Frontiers in Plant Science, V olume 14 - 2023, 2023

    Kathryn Dumschott, Hannah Dörpholz, Marie-Angélique Laporte, Dominik Brilhaus, Andrea Schrader, Björn Usadel, Steffen Neumann, Elizabeth Arnaud, and Angela Kranz. Ontologies for increasing the fairness of plant research data.Frontiers in Plant Science, V olume 14 - 2023, 2023. ISSN 1664-462X. doi: 10.3389/fpls.2023.1279694. URL https://www.frontiersin.org...

  9. [9]

    arXiv preprint arXiv:2504.21024 , year=

    Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, and Dong Yu. Webevolver: Enhancing web agent self-improvement with coevolving world model, 2025. URL https: //arxiv.org/abs/2504.21024

  10. [10]

    Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

    Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Yonglin Wang, Jingchen Ni, Tianshi Zheng, Chun Chen, Wenhao Yu, Zhenwen Liang, Hongming Zhang, Haitao Mi, and Dong Yu. Cognitive kernel-pro: A framework for deep research agents and agent foundation models training, 2026. URLhttps://arxiv...

  11. [11]

    Deep research bench: Evaluating ai web research agents, arXiv preprint arXiv:2506.06287, 2025

    FutureSearch, :, Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, and Jack Wildman. Deep research bench: Evaluating ai web research agents, 2025. URLhttps://arxiv.org/abs/2506.06287

  12. [12]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

  13. [13]

    Webcot: Enhancing web agent reasoning by reconstructing chain-of- thought in reflection, branching, and rollback, 2025

    Minda Hu, Tianqing Fang, Jianshu Zhang, Junyu Ma, Zhisong Zhang, Jingyan Zhou, Hongming Zhang, Haitao Mi, Dong Yu, and Irwin King. Webcot: Enhancing web agent reasoning by reconstructing chain-of- thought in reflection, branching, and rollback, 2025. URLhttps://arxiv.org/abs/2505.20013

  14. [14]

    Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

  15. [15]

    Aide: Ai-driven exploration in the space of code, 2025

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. Aide: Ai-driven exploration in the space of code, 2025. URL https://arxiv.org/abs/2502.131 38

  16. [16]

    autoresearch: Ai agents running research on single-gpu nanochat training automatically

    Andrej Karpathy. autoresearch: Ai agents running research on single-gpu nanochat training automatically. https://github.com/karpathy/autoresearch, 2026

  17. [17]

    Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning, 2025

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning, 2025. URLhttps://arxiv.org/abs/2509.13305. 10

  18. [18]

    Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor: Navigating super-human reasoning for web agent, 2025. URLhttps://arxiv.org/abs/2507.02592

  19. [19]

    Verified Critical Step Optimization for LLM Agents

    Mukai Li, Qingcheng Zeng, Tianqing Fang, Zhenwen Liang, Linfeng Song, Qi Liu, Haitao Mi, and Dong Yu. Verified critical step optimization for llm agents, 2026. URLhttps://arxiv.org/abs/2602.03412

  20. [20]

    Webexplorer: Explore and evolve for training long-horizon web agents, 2025

    Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, and Junxian He. Webexplorer: Explore and evolve for training long-horizon web agents, 2025. URLhttps://arxiv.org/abs/2509.06501

  21. [21]

    Ml-master: Towards ai-for-ai via integration of exploration and reasoning, 2025b.https://arxiv.org/abs/2506.16499

    Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, and Siheng Chen. Ml-master: Towards ai-for-ai via integration of exploration and reasoning, 2025. URL https://arxiv.org/abs/2506.16499

  22. [22]

    The ai scientist: Towards fully automated open-ended scientific discovery, 2024

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URLhttps://arxiv.org/abs/2408 .06292

  23. [23]

    Llm4sr: A survey on large language models for scientific research, 2025

    Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. Llm4sr: A survey on large language models for scientific research, 2025. URLhttps://arxiv.org/abs/2501.04306

  24. [24]

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines.arXiv preprint arXiv:2502.14739, 2025

    M-A-P, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tyshawn Hsing, Ming Xu, Zh...

  25. [25]

    GAIA: a benchmark for general AI assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net /forum?id=fibxvahvs3

  26. [26]

    Mitchener, A

    Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagora...

  27. [27]

    FEABench: 22 Evaluating language models on multi- physics reasoning ability.arXiv preprint arXiv:2504.06260, 2025

    Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael P. Brenner, and Peter Norgaard. Feabench: Evaluating language models on multiphysics reasoning ability, 2025. URL https: //arxiv.org/abs/2504.06260

  28. [28]

    Introducing deep research

    OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, February 2025

  29. [29]

    Introducing perplexity deep research

    Perplexity AI. Introducing perplexity deep research. https://www.perplexity.ai/hub/blog/intro ducing-perplexity-deep-research, 2025

  30. [30]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, and Jason Hausenloy, et al. A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649(8099): 1139–1146, January 2026. ISSN 1476-4687. doi: 10.1038/s41586-025-09962-4. URL http: //dx.doi.org/10.1038/s41586-025-09962-4. 11

  31. [31]

    Schmidgall, Y

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants, 2025. URLhttps://arxiv.org/abs/2501.04227

  32. [32]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

  33. [33]

    A web-scale system for scientific knowledge exploration

    Zhihong Shen, Hao Ma, and Kuansan Wang. A web-scale system for scientific knowledge exploration. In Fei Liu and Thamar Solorio, editors,Proceedings of ACL 2018, System Demonstrations, pages 87–92, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-4015. URLhttps://aclanthology.org/P18-4015/

  34. [34]

    Si, C., Hashimoto, T., and Yang, D

    Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers, 2024. URLhttps://arxiv.org/abs/2409.04109

  35. [35]

    Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning,

    Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, Wenlong Zhang, Lei Bai, Zhenfei Yin, Philip Torr, Hanrui Wang, and Di Jin. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning,

  36. [36]

    URLhttps://arxiv.org/abs/2509.21193

  37. [37]

    Webshaper: Agentically data synthesizing via information-seeking formalization, 2025

    Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webshaper: Agentically data synthesizing via information-seeking formalization, 2025. URL https://arxiv.org/abs/2507.1 5061

  38. [38]

    MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

    MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, P...

  39. [39]

    Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, and Michael R. Lyu. Inference-time scaling of verification: Self-evolving deep research agents via test-time rubric-guided verification, 2026. URLhttps://arxiv.org/abs/2601.15808

  40. [40]

    arXiv:2601.21165 , institution =

    Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, and Tejal Patwardhan. Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks, 2026. URL https: //arxiv.org/abs/2601.21165

  41. [41]

    WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models

    Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, and Kam-Fai Wong. Webaggregator: Enhancing compositional reasoning capabilities of deep research agent foundation models, 2026. URL https://arxiv.org/abs/2510.14438

  42. [42]

    arXiv preprint arXiv:2307.10635

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models, 2024. URLhttps://arxiv.org/abs/2307.10635

  43. [43]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URLhttps://arxiv.org/abs/2504.12516

  44. [44]

    Deepscientist: Advancing frontier-pushing scientific findings progressively, 2025

    Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively, 2025. URL https://arxiv.org/abs/25 09.26603

  45. [45]

    About 30% of humanity’s last exam chemistry/biology answers are likely wrong, July 2025

    Andrew White, Michael Skarlinski, Jon Laurent, and Albert Bou. About 30% of humanity’s last exam chemistry/biology answers are likely wrong, July 2025. URL https://www.futurehouse.org/rese arch-announcements/hle-exam

  46. [46]

    Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency, 2025. URLhttps://arxiv.org/abs/2505.22648. 12

  47. [47]

    arXiv preprint arXiv:2501.07572 , year=

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. Webwalker: Benchmarking llms in web traversal, 2025. URL https://arxiv.org/abs/2501.07572

  48. [48]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

  49. [49]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  50. [50]

    Origene: A self-evolving virtual disease biologist automating therapeutic target discovery.bioRxiv, 2025

    Zhongyue Zhang, Zijie Qiu, Yingcheng Wu, Shuya Li, Dingyan Wang, Zhuomin Zhou, Duo An, Yuhan Chen, Yu Li, Yongbo Wang, Chubin Ou, Zichen Wang, Jack Xiaoyu Chen, Bo Zhang, Yusong Hu, Wenxin Zhang, Zhijian Wei, Runze Ma, Qingwu Liu, Bo Dong, Yuexi He, Qiantai Feng, Lei Bai, Qiang Gao, Siqi Sun, and Shuangjia Zheng. Origene: A self-evolving virtual disease b...

  51. [51]

    Fine- grained information extraction from biomedical literature based on knowledge-enriched Abstract Meaning Representation

    Zixuan Zhang, Nikolaus Parulian, Heng Ji, Ahmed Elsayed, Skatje Myers, and Martha Palmer. Fine- grained information extraction from biomedical literature based on knowledge-enriched Abstract Meaning Representation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computationa...

  52. [52]

    From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

    Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From automation to autonomy: A survey on large language models in scientific discovery. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proc...

  53. [53]

    Newtonbench: Benchmarking generalizable scientific law discovery in llm agents

    Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, and Simon See. Newtonbench: Benchmarking generalizable scientific law discovery in llm agents, 2026. URL https://arxiv.org/abs/2510.07172

A Technical Details

The generation prompt requires each question to:

- Include the seed entity or be directly grounded in it
- Be concise but scientifically meaningful
- Be answerable from a single authoritative academic source at this stage
- Prefer multiple-choice format with plausible confounders, while allowing short-answer format when more appropriate
- Avoid shortcuts that can be solved by trivia, superficial keyword matching, or generic web search without reading the academic evidence
- Be suitable as the semantic backbone for later anchor-based augmentation

## Pre-Action Protocol: Plan Before Searching

Before browsing, understand the seed entity and its scientific context. Plan 3--5 diverse search queries that target academic sources such as peer-reviewed papers, domain databases, preprints, and reputable scientific venues.
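The planning protocol above can be sketched as a small query-expansion helper. This is a hedged illustration: the `plan_queries` name and the query templates are assumptions made for the sketch, while the actual agent plans its queries with an LLM prompt.

```python
# Sketch of the pre-action planning step: expand a seed entity into
# 3-5 diverse, academically targeted search queries. The templates are
# illustrative stand-ins for LLM-planned queries.

def plan_queries(seed_entity: str, max_queries: int = 5) -> list[str]:
    templates = [
        '"{e}" peer-reviewed mechanism',            # primary literature
        '"{e}" site:ncbi.nlm.nih.gov',              # domain database
        '"{e}" preprint biorxiv OR arxiv',          # preprints
        '"{e}" review nature OR science OR cell',   # reputable venues
        '"{e}" experimental evidence dataset',      # supporting data
    ]
    queries = [t.format(e=seed_entity) for t in templates[:max_queries]]
    # Enforce the 3-5 query budget stated in the protocol.
    assert 3 <= len(queries) <= 5
    return queries
```

Each query keeps the seed entity verbatim so the downstream evidence stays grounded in it.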

The construction agent is further prompted for:

- Meticulousness and persistence in finding high-quality academic evidence
- Task decomposition: search -> evidence extraction -> question generation -> verification
- Adaptive error handling and reuse of progress state when searches fail or evidence is insufficient
- Multi-query scout search and URL selection based on relevance, venue quality, source diversity, and scientific specificity
- Use of the url2evidence sub-agent to access selected academic sources, extract key supporting evidence, and distinguish stand-alone scientific facts from study-specific artifacts
- Evidence quality checks, including source authority, evidence-answer entailment, and avoidance of unsupported assumptions
- Question formulation with plausible, unbiased, and challenging confounders for MCQs; a clear expected answer for short-answer questions; and final quality checks
- Multi-tool coordination following the typical workflow: scout search -> source selection -> url2evidence -> question generation -> verification

## Output Format

The final output MUST be a JSON object with the following structure: '''json { "question": "The question text containing or directly grounded in the seed entity", "answer": "The correct answer...
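The multi-tool workflow and the JSON output contract above can be sketched end to end. Every step below is a stub (the real system dispatches LLM-driven sub-agents over live web tools), and all function names other than the JSON field names are assumptions made for the sketch.

```python
# Hedged sketch of the construction workflow:
# scout search -> source selection -> url2evidence -> question generation -> verification.
import json

def scout_search(queries):
    # Multi-query scout search: return candidate hits with venue metadata (stub).
    return [{"url": "https://example.org/paper", "venue": "peer-reviewed journal"}]

def select_sources(hits):
    # Filter by relevance and venue quality (stub: keep anything with a venue).
    return [h["url"] for h in hits if h.get("venue")]

def url2evidence(url):
    # Evidence-extraction sub-agent (stub).
    return f"Key supporting evidence extracted from {url}"

def generate_question(seed, evidence):
    # The question must contain or be directly grounded in the seed entity.
    return {
        "question": f"Which finding about {seed} does the evidence support?",
        "answer": evidence,
    }

def verify(task):
    # Final quality check; evidence-answer entailment is reduced to non-emptiness here.
    return bool(task["question"]) and bool(task["answer"])

def build_task(seed, queries):
    urls = select_sources(scout_search(queries))
    evidence = url2evidence(urls[0])
    task = generate_question(seed, evidence)
    # The final output is a JSON object with "question" and "answer" fields.
    return json.dumps(task) if verify(task) else None
```

The stubs make the control flow explicit: failed verification yields no task, mirroring the agent's discard-on-failure behavior.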

A valid anchor entity must be:

- **Domain-specific**: It is a concrete scientific entity, such as a gene, protein, pathway, compound, species, technique, disease, mutation, phenotype, material, model, or other scientific concept
- **Question-body only**: It appears in the question stem but does NOT appear in the correct answer or any confounder
- **Decisive**: The question becomes substantially harder or unanswerable if this entity is masked or removed
- **Specific and concrete**: It is sufficiently specific to support further evidence-grounded browsing and question generation

## Your Task

Given the question, correct answer(s), and confounders below, you must:

1. Identify candidate anchor entities in the question body
2. Verify that each candidate does NOT appear in the correct answer or any confounder
3. Evaluate whether each candidate is decisive for deriving the final answer
4. Select the most decisive, specific, and concrete entity

If no valid anchor exists, return an empty string.

## Selection Criteria (in priority order)

1. Prefer the MOST SPECIFIC entity, e.g., "AXL" over "receptor tyrosine kinase"
2. Prefer entities that constrain the answer, such that removing them makes multiple answers plausible
3. Prefer named entities, such as gene, protein, compound, disease, pathway, or model names, over generic scientific terms
4. Prefer entities that are decoupled from the surface form of the answer options
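The anchor rules above can be sketched in a few lines. This is a hedged sketch under stated assumptions: candidate entities are given rather than extracted, and specificity is approximated by a crude named-symbol/word-count heuristic, whereas the paper's agent makes these judgments with an LLM.

```python
# Sketch of anchor-entity selection: keep candidates that appear in the
# question body but in neither the answer nor any confounder, then pick
# the most specific one. Specificity is approximated heuristically here.

def specificity_key(candidate: str):
    # Named symbols like "AXL" outrank generic phrases; fewer words reads as more specific.
    named = candidate.isupper() or any(ch.isupper() for ch in candidate[1:])
    return (named, -len(candidate.split()))

def select_anchor(question: str, answer: str, confounders: list[str],
                  candidates: list[str]) -> str:
    valid = [
        c for c in candidates
        if c in question                                   # question-body only
        and c not in answer
        and all(c not in conf for conf in confounders)
    ]
    if not valid:
        return ""  # no valid anchor exists
    return max(valid, key=specificity_key)
```

On the paper's own example, this heuristic prefers "AXL" over "receptor tyrosine kinase", since the gene symbol scores as a named entity.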
