pith · machine review for the scientific record

arxiv: 2605.01489 · v1 · submitted 2026-05-02 · 💻 cs.AI · cs.CL

Recognition: unknown

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:14 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords scientific reasoning agents · automated data synthesis · frontier science · biology reasoning · chemistry reasoning · agentic reinforcement learning · supervised fine-tuning · information-seeking tasks

The pith

Automated synthesis of conceptual and computational tasks trains an 8B model to set new records on frontier biology and chemistry reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SciResearcher as a fully automated agentic system that generates training data for scientific reasoning by pulling from academic sources to create diverse tasks. These tasks target information-seeking, tool use, and extended reasoning chains that standard web or graph methods struggle to produce for frontier domains. The data then supports supervised fine-tuning followed by agentic reinforcement learning to produce SciResearcher-8B. This model reaches 19.46 percent on the HLE-Bio/Chem-Gold benchmark while posting 13-15 percent absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature. The approach matters because frontier science problems involve scattered, heterogeneous sources and heavy computation, not simple recall, so scalable data construction could accelerate agent development without constant human curation.

Core claim

SciResearcher is a fully automated agentic framework for frontier-science data construction that synthesizes diverse conceptual and computational tasks grounded in academic evidence. The framework elicits information acquisition, tool-integrated reasoning, and long-horizon capabilities. Training on the resulting data via supervised fine-tuning and agentic reinforcement learning yields SciResearcher-8B, which scores 19.46 percent on HLE-Bio/Chem-Gold and delivers 13-15 percent absolute improvements on SuperGPQA-Hard-Biology and TRQA-Literature, establishing a new state of the art at the 8B scale and surpassing several larger proprietary agents.

What carries the argument

The SciResearcher agentic framework, which automatically synthesizes grounded conceptual and computational tasks from academic sources to drive supervised fine-tuning and reinforcement learning for scientific agents.

If this is right

  • Smaller open models can match or exceed larger closed agents on hard scientific benchmarks when trained on suitably synthesized data.
  • The same synthesis loop can be iterated to generate larger and more diverse datasets for continued scaling.
  • Agentic reinforcement learning on these tasks strengthens long-horizon planning beyond what factual pre-training alone provides.
  • The framework reduces dependence on manual knowledge-graph or browsing pipelines that miss computational depth in sparse academic sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could transfer to other data-scarce fields such as materials discovery or theoretical physics by swapping the source academic corpora.
  • Pairing the synthesized tasks with real experimental logs or simulation outputs might close remaining gaps between benchmark performance and practical discovery.
  • The performance edge at 8B scale suggests that future gains may come more from data quality than from raw parameter count in scientific agent work.

Load-bearing premise

Tasks synthesized by the agentic framework accurately reflect the computational and reasoning demands of actual frontier scientific problems rather than simplified or proxy versions.

What would settle it

If a model trained on human-curated real frontier problems performs no better than SciResearcher-8B on the same held-out benchmarks, or if SciResearcher-8B fails on a fresh set of unsolved domain problems never seen during data synthesis, the value of the automated construction process would be in question.

Figures

Figures reproduced from arXiv: 2605.01489 by Rui Wang, Tianqing Fang, Tianshi Zheng, Xiyun Li, Yangqiu Song.

Figure 1. Performance comparison on HLE-Bio/Chem-Gold …
Figure 2. Comparison of ontology and web presence be…
Figure 3. Overview of our SciResearcher data construction framework. …specific and concrete to support further evidence-grounded expansion. After selecting the best anchor, we invoke a new web agent instance to gather additional academic evidence about that anchor and generate a new question whose answer is exactly the anchor entity. This newly generated question is then fused back into the previous question by repla…
Figure 4. A running example of a question evolution pipeline for conceptual task curation. Question…
Figure 5. (a) Word clouds of the curated questions from the two pipelines. (b) Distribution and…
Figure 6. (a) Distribution of trajectory lengths (in macro steps) for SFT and RL checkpoints. (b)…
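The anchor-fusion loop summarized in the Figure 3 caption can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `select_anchor`, `gather_evidence`, and `generate_subquestion` are hypothetical stand-ins for the paper's web-agent calls.

```python
# Hypothetical sketch of the anchor-based question-evolution loop described
# in the Figure 3 caption. Entity selection, evidence gathering, and
# sub-question generation are stubbed out; in the paper they are agent calls.

def evolve_question(question: str, select_anchor, gather_evidence,
                    generate_subquestion, depth: int = 3) -> str:
    """Iteratively deepen a seed question by replacing an anchor entity
    with a sub-question whose answer is exactly that entity."""
    for _ in range(depth):
        anchor = select_anchor(question)      # most specific, concrete entity
        if anchor is None or anchor not in question:
            break
        evidence = gather_evidence(anchor)    # new web-agent instance
        sub_q = generate_subquestion(anchor, evidence)  # answer == anchor
        # Fuse: replace the anchor mention with the sub-question's referent.
        question = question.replace(anchor, f"the entity such that ({sub_q})")
    return question
```

Each pass hides one named entity behind a question that resolves to it, which is how a single-source seed question grows into a multi-hop, evidence-grounded task.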
read the original abstract

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SciResearcher, a fully automated agentic framework for synthesizing diverse conceptual and computational tasks grounded in academic evidence to support information acquisition, tool use, and long-horizon reasoning. The authors apply supervised fine-tuning followed by agentic reinforcement learning on this data to produce SciResearcher-8B, which achieves 19.46% on HLE-Bio/Chem-Gold (new SOTA at 8B scale, surpassing some larger proprietary agents) along with 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature.

Significance. If the synthetic tasks are shown to impose reasoning loads comparable to real frontier scientific problems, the framework would provide a scalable, automated alternative to knowledge-graph or web-browsing curation methods, potentially enabling more capable agents for automated discovery in domains with sparse, heterogeneous sources.

major comments (3)
  1. [§3] §3 (Framework Description): The task-synthesis pipeline is described at a high level but supplies no pseudocode, concrete examples of generated computational tasks, or quantitative metrics (e.g., number of tool calls or reasoning steps required). Without these, it is impossible to verify that the data elicits the claimed long-horizon and tool-integrated reasoning rather than simpler pattern-matching proxies.
  2. [§5] §5 (Experiments): No ablation studies isolate the contribution of the agentic RL stage versus SFT alone, and no error bars or run-to-run variance are reported for the headline scores (19.46% on HLE-Bio/Chem-Gold, 13-15% gains elsewhere). This leaves open whether the reported improvements are robust or attributable to the framework.
  3. [§4] §4 (Data Construction): The manuscript provides no human-expert validation, difficulty calibration against real frontier problems, or comparison of source heterogeneity between synthetic tasks and actual academic literature. This directly bears on the central claim that performance gains reflect genuine scaling of scientific reasoning rather than distribution matching on easier proxies.
minor comments (2)
  1. [Abstract] The abstract states '13-15% absolute gains' without naming the exact baselines or providing the raw scores for each benchmark.
  2. [§2] Notation for the HLE-Bio/Chem-Gold benchmark is introduced without a reference or brief definition of its construction.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity and rigor, and we will revise the manuscript accordingly to address them. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [§3] §3 (Framework Description): The task-synthesis pipeline is described at a high level but supplies no pseudocode, concrete examples of generated computational tasks, or quantitative metrics (e.g., number of tool calls or reasoning steps required). Without these, it is impossible to verify that the data elicits the claimed long-horizon and tool-integrated reasoning rather than simpler pattern-matching proxies.

    Authors: We agree that additional concrete details are needed to substantiate the long-horizon and tool-use claims. In the revised manuscript we will include full pseudocode for the task-synthesis pipeline as an appendix. We will also provide multiple concrete examples of generated computational tasks (including both conceptual and multi-step computational instances) together with quantitative metrics such as the distribution of tool calls per task and average reasoning steps. These additions will enable readers to directly assess the complexity and structure of the synthesized data. revision: yes

  2. Referee: [§5] §5 (Experiments): No ablation studies isolate the contribution of the agentic RL stage versus SFT alone, and no error bars or run-to-run variance are reported for the headline scores (19.46% on HLE-Bio/Chem-Gold, 13-15% gains elsewhere). This leaves open whether the reported improvements are robust or attributable to the framework.

    Authors: We acknowledge the value of isolating the RL contribution and reporting statistical robustness. We will add ablation experiments comparing SFT-only training against the full SFT + agentic RL pipeline on the same data. We will also rerun the primary evaluation runs with multiple random seeds and report error bars (standard deviation) for the key metrics on HLE-Bio/Chem-Gold, SuperGPQA-Hard-Biology, and TRQA-Literature. These changes will clarify the source of the observed gains. revision: yes
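The seed-averaged reporting promised here takes only a few lines; the scores below are placeholders for illustration, not results from the paper.

```python
from statistics import mean, stdev

def summarize_runs(scores_by_seed: list[float]) -> str:
    """Report mean +/- sample standard deviation over random seeds."""
    m = mean(scores_by_seed)
    s = stdev(scores_by_seed) if len(scores_by_seed) > 1 else 0.0
    return f"{m:.2f} ± {s:.2f}"

# Placeholder scores for three hypothetical seeds:
print(summarize_runs([19.1, 19.5, 19.8]))  # → 19.47 ± 0.35
```

Sample (n-1) standard deviation is the appropriate estimator here, since the seeds are a sample of possible training runs rather than the full population.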

  3. Referee: [§4] §4 (Data Construction): The manuscript provides no human-expert validation, difficulty calibration against real frontier problems, or comparison of source heterogeneity between synthetic tasks and actual academic literature. This directly bears on the central claim that performance gains reflect genuine scaling of scientific reasoning rather than distribution matching on easier proxies.

    Authors: We will expand the data construction section to include quantitative comparisons of source heterogeneity (e.g., topic diversity, citation graph statistics, and domain coverage) between the synthetic tasks and samples drawn from the original academic literature. We will also add indirect difficulty calibration by reporting task-complexity statistics (tool-call depth, reasoning-chain length) and relating them to the hardness of the evaluation benchmarks. However, human-expert validation was not performed because the framework is intentionally fully automated; we will explicitly discuss this design choice and its implications as a limitation. revision: partial
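The task-complexity statistics mentioned above (tool-call depth, reasoning-chain length) are straightforward to compute once trajectories are logged. The step format below is a hypothetical assumption, not the paper's actual schema.

```python
from statistics import mean

def complexity_stats(trajectories):
    """Hypothetical trajectory format: each trajectory is a list of step
    dicts with a 'kind' field ('tool_call' or 'reason'). Returns the two
    statistics the rebuttal mentions."""
    tool_depths = [sum(1 for s in t if s["kind"] == "tool_call")
                   for t in trajectories]
    chain_lens = [len(t) for t in trajectories]
    return {"mean_tool_calls": mean(tool_depths),
            "mean_chain_length": mean(chain_lens)}
```

Relating these distributions to benchmark hardness is the indirect calibration the authors propose in place of human-expert validation.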

standing simulated objections (unresolved)
  • Direct human-expert validation of the synthetic tasks and explicit difficulty calibration against real frontier problems, as these steps would require substantial human annotation effort and contradict the core goal of a fully automated, scalable data-construction pipeline.

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline on external benchmarks

full rationale

The paper presents an agentic framework that synthesizes tasks from academic sources, applies standard supervised fine-tuning plus agentic RL, and reports accuracy on independent external benchmarks (HLE-Bio/Chem-Gold, SuperGPQA-Hard-Biology, TRQA-Literature). No equations, fitted parameters, or internal derivations are described. Performance claims are grounded in post-training evaluation rather than any self-referential reduction, self-citation chain, or renaming of known results. The central assumption (synthetic tasks match frontier demands) is an empirical validity question, not a circularity in the reported derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work relies on standard supervised fine-tuning and reinforcement learning applied to newly synthesized tasks.

pith-pipeline@v0.9.0 · 5546 in / 1167 out tokens · 34324 ms · 2026-05-09T14:14:38.432631+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

100 extracted references · 37 canonical work pages · 10 internal anchors

  1. [1]

    Introducing claude sonnet 4.5, September 2025

    Anthropic. Introducing claude sonnet 4.5, September 2025. URL https://www.anthropic.com/news /claude-sonnet-4-5

  2. [2]

    The discovery engine: A framework for ai-driven synthesis and navigation of scientific knowledge landscapes, 2025

    Vladimir Baulin, Austin Cook, Daniel Friedman, Janna Lumiruusu, Andrew Pashea, Shagor Rahman, and Benedikt Waldeck. The discovery engine: A framework for ai-driven synthesis and navigation of scientific knowledge landscapes, 2025. URLhttps://arxiv.org/abs/2505.17500

  3. [3]

    Scimaster: Towards general-purpose scientific ai agents, part i

    Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, and Siheng Chen. Scimaster: Towards general-purpose scientific ai agents, part i. x-master as foundation: Can we lead on humanity’s last exam?, 2025. URL https: //arxiv.org/abs/2507.05241

  4. [4]

    Try deep research and our new experimental model in gemini, your ai assistant, December

    Dave Citron. Try deep research and our new experimental model in gemini, your ai assistant, December

  5. [5]

    URLhttps://blog.google/products/gemini/google-gemini-deep-research/

  6. [6]

    Evolving scientific discovery by unifying data and background knowledge with ai hilbert.Nature Communications, 15(1):5922, July 2024

    Ryan Cory-Wright, Cristina Cornelio, Sanjeeb Dash, Bachir El Khadir, and Lior Horesh. Evolving scientific discovery by unifying data and background knowledge with ai hilbert.Nature Communications, 15(1):5922, July 2024. doi: 10.1038/s41467-024-50074-w

  7. [7]

    Deepresearch bench: A comprehensive benchmark for deep research agents, 2025

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents, 2025. URL https://arxiv.org/abs/2506.117 63

  8. [8]

    Ontologies for increasing the fairness of plant research data.Frontiers in Plant Science, V olume 14 - 2023, 2023

    Kathryn Dumschott, Hannah Dörpholz, Marie-Angélique Laporte, Dominik Brilhaus, Andrea Schrader, Björn Usadel, Steffen Neumann, Elizabeth Arnaud, and Angela Kranz. Ontologies for increasing the fairness of plant research data.Frontiers in Plant Science, V olume 14 - 2023, 2023. ISSN 1664-462X. doi: 10.3389/fpls.2023.1279694. URL https://www.frontiersin.org...

  9. [9]

    arXiv preprint arXiv:2504.21024 , year=

    Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, and Dong Yu. Webevolver: Enhancing web agent self-improvement with coevolving world model, 2025. URL https: //arxiv.org/abs/2504.21024

  10. [10]

    Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

    Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Yonglin Wang, Jingchen Ni, Tianshi Zheng, Chun Chen, Wenhao Yu, Zhenwen Liang, Hongming Zhang, Haitao Mi, and Dong Yu. Cognitive kernel-pro: A framework for deep research agents and agent foundation models training, 2026. URLhttps://arxiv...

  11. [11]

    Deep research bench: Evaluating ai web research agents, arXiv preprint arXiv:2506.06287, 2025

    FutureSearch, :, Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, and Jack Wildman. Deep research bench: Evaluating ai web research agents, 2025. URLhttps://arxiv.org/abs/2506.06287

  12. [12]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

  13. [13]

    Webcot: Enhancing web agent reasoning by reconstructing chain-of- thought in reflection, branching, and rollback, 2025

    Minda Hu, Tianqing Fang, Jianshu Zhang, Junyu Ma, Zhisong Zhang, Jingyan Zhou, Hongming Zhang, Haitao Mi, Dong Yu, and Irwin King. Webcot: Enhancing web agent reasoning by reconstructing chain-of- thought in reflection, branching, and rollback, 2025. URLhttps://arxiv.org/abs/2505.20013

  14. [14]

    Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

  15. [15]

    Aide: Ai-driven exploration in the space of code, 2025

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. Aide: Ai-driven exploration in the space of code, 2025. URL https://arxiv.org/abs/2502.131 38

  16. [16]

    autoresearch: Ai agents running research on single-gpu nanochat training automatically

    Andrej Karpathy. autoresearch: Ai agents running research on single-gpu nanochat training automatically. https://github.com/karpathy/autoresearch, 2026

  17. [17]

    Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning, 2025

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning, 2025. URLhttps://arxiv.org/abs/2509.13305. 10

  18. [18]

    Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor: Navigating super-human reasoning for web agent, 2025. URLhttps://arxiv.org/abs/2507.02592

  19. [19]

    Verified Critical Step Optimization for LLM Agents

    Mukai Li, Qingcheng Zeng, Tianqing Fang, Zhenwen Liang, Linfeng Song, Qi Liu, Haitao Mi, and Dong Yu. Verified critical step optimization for llm agents, 2026. URLhttps://arxiv.org/abs/2602.03412

  20. [20]

    Webexplorer: Explore and evolve for training long-horizon web agents, 2025

    Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, and Junxian He. Webexplorer: Explore and evolve for training long-horizon web agents, 2025. URLhttps://arxiv.org/abs/2509.06501

  21. [21]

    Ml-master: Towards ai-for-ai via integration of exploration and reasoning, 2025b.https://arxiv.org/abs/2506.16499

    Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, and Siheng Chen. Ml-master: Towards ai-for-ai via integration of exploration and reasoning, 2025. URL https://arxiv.org/abs/2506.16499

  22. [22]

    The ai scientist: Towards fully automated open-ended scientific discovery, 2024

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URLhttps://arxiv.org/abs/2408 .06292

  23. [23]

    Llm4sr: A survey on large language models for scientific research, 2025

    Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. Llm4sr: A survey on large language models for scientific research, 2025. URLhttps://arxiv.org/abs/2501.04306

  24. [24]

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines.arXiv preprint arXiv:2502.14739, 2025

    M-A-P, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tyshawn Hsing, Ming Xu, Zh...

  25. [25]

    GAIA: a benchmark for general AI assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net /forum?id=fibxvahvs3

  26. [26]

    Mitchener, A

    Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagora...

  27. [27]

    FEABench: 22 Evaluating language models on multi- physics reasoning ability.arXiv preprint arXiv:2504.06260, 2025

    Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael P. Brenner, and Peter Norgaard. Feabench: Evaluating language models on multiphysics reasoning ability, 2025. URL https: //arxiv.org/abs/2504.06260

  28. [28]

    Introducing deep research

    OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, February 2025

  29. [29]

    Introducing perplexity deep research

    Perplexity AI. Introducing perplexity deep research. https://www.perplexity.ai/hub/blog/intro ducing-perplexity-deep-research, 2025

  30. [30]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, and Jason Hausenloy, et al. A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649(8099): 1139–1146, January 2026. ISSN 1476-4687. doi: 10.1038/s41586-025-09962-4. URL http: //dx.doi.org/10.1038/s41586-025-09962-4. 11

  31. [31]

    Schmidgall, Y

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants, 2025. URLhttps://arxiv.org/abs/2501.04227

  32. [32]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

  33. [33]

    A web-scale system for scientific knowledge exploration

    Zhihong Shen, Hao Ma, and Kuansan Wang. A web-scale system for scientific knowledge exploration. In Fei Liu and Thamar Solorio, editors,Proceedings of ACL 2018, System Demonstrations, pages 87–92, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-4015. URLhttps://aclanthology.org/P18-4015/

  34. [34]

    Si, C., Hashimoto, T., and Yang, D

    Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers, 2024. URLhttps://arxiv.org/abs/2409.04109

  35. [35]

    Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning,

    Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, Wenlong Zhang, Lei Bai, Zhenfei Yin, Philip Torr, Hanrui Wang, and Di Jin. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning,

  36. [36]

    URLhttps://arxiv.org/abs/2509.21193

  37. [37]

    Webshaper: Agentically data synthesizing via information-seeking formalization, 2025

    Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webshaper: Agentically data synthesizing via information-seeking formalization, 2025. URL https://arxiv.org/abs/2507.1 5061

  38. [38]

    MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

    MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, P...

  39. [39]

    Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, and Michael R. Lyu. Inference-time scaling of verification: Self-evolving deep research agents via test-time rubric-guided verification, 2026. URLhttps://arxiv.org/abs/2601.15808

  40. [40]

    arXiv:2601.21165 , institution =

    Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, and Tejal Patwardhan. Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks, 2026. URL https: //arxiv.org/abs/2601.21165

  41. [41]

    WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models

    Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, and Kam-Fai Wong. Webaggregator: Enhancing compositional reasoning capabilities of deep research agent foundation models, 2026. URL https://arxiv.org/abs/2510.14438

  42. [42]

    arXiv preprint arXiv:2307.10635

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models, 2024. URLhttps://arxiv.org/abs/2307.10635

  43. [43]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URLhttps://arxiv.org/abs/2504.12516

  44. [44]

    Deepscientist: Advancing frontier-pushing scientific findings progressively, 2025

    Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively, 2025. URL https://arxiv.org/abs/25 09.26603

  45. [45]

    About 30% of humanity’s last exam chemistry/biology answers are likely wrong, July 2025

    Andrew White, Michael Skarlinski, Jon Laurent, and Albert Bou. About 30% of humanity’s last exam chemistry/biology answers are likely wrong, July 2025. URL https://www.futurehouse.org/rese arch-announcements/hle-exam

  46. [46]

    Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency, 2025. URLhttps://arxiv.org/abs/2505.22648. 12

  47. [47]

    arXiv preprint arXiv:2501.07572 , year=

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. Webwalker: Benchmarking llms in web traversal, 2025. URL https://arxiv.org/abs/2501.07572

  48. [48]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

  49. [49]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  50. [50]

    Origene: A self-evolving virtual disease biologist automating therapeutic target discovery.bioRxiv, 2025

    Zhongyue Zhang, Zijie Qiu, Yingcheng Wu, Shuya Li, Dingyan Wang, Zhuomin Zhou, Duo An, Yuhan Chen, Yu Li, Yongbo Wang, Chubin Ou, Zichen Wang, Jack Xiaoyu Chen, Bo Zhang, Yusong Hu, Wenxin Zhang, Zhijian Wei, Runze Ma, Qingwu Liu, Bo Dong, Yuexi He, Qiantai Feng, Lei Bai, Qiang Gao, Siqi Sun, and Shuangjia Zheng. Origene: A self-evolving virtual disease b...

  51. [51]

    Fine- grained information extraction from biomedical literature based on knowledge-enriched Abstract Meaning Representation

    Zixuan Zhang, Nikolaus Parulian, Heng Ji, Ahmed Elsayed, Skatje Myers, and Martha Palmer. Fine- grained information extraction from biomedical literature based on knowledge-enriched Abstract Meaning Representation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computationa...

  52. [52]

    From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

    Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From automation to autonomy: A survey on large language models in scientific discovery. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proc...

  53. [53]

    Newtonbench: Benchmarking generalizable scientific law discovery in llm agents

    Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, and Simon See. Newtonbench: Benchmarking generalizable scientific law discovery in llm agents, 2026. URL https://arxiv.org/abs/2510.07172

A Technical Details

The generation prompt requires each question to:

- Include the seed entity or be directly grounded in it
- Be concise but scientifically meaningful
- Be answerable from a single authoritative academic source at this stage
- Prefer multiple-choice format with plausible confounders, while allowing short-answer format when more appropriate
- Avoid shortcuts that can be solved by trivia, superficial keyword matching, or generic web search without reading the academic evidence
- Be suitable as the semantic backbone for later anchor-based augmentation

## Pre-Action Protocol: Plan Before Searching

Before browsing, understand the seed entity and its scientific context. Plan 3--5 diverse search queries that target academic sources such as peer-reviewed papers, domain databases, preprints, and reputable scientific venues.
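The planning protocol above can be sketched as a small query-expansion helper. This is a hedged illustration: the `plan_queries` name and the query templates are assumptions made for the sketch, while the actual agent plans its queries with an LLM prompt.

```python
# Sketch of the pre-action planning step: expand a seed entity into
# 3-5 diverse, academically targeted search queries. The templates are
# illustrative stand-ins for LLM-planned queries.

def plan_queries(seed_entity: str, max_queries: int = 5) -> list[str]:
    templates = [
        '"{e}" peer-reviewed mechanism',            # primary literature
        '"{e}" site:ncbi.nlm.nih.gov',              # domain database
        '"{e}" preprint biorxiv OR arxiv',          # preprints
        '"{e}" review nature OR science OR cell',   # reputable venues
        '"{e}" experimental evidence dataset',      # supporting data
    ]
    queries = [t.format(e=seed_entity) for t in templates[:max_queries]]
    # Enforce the 3-5 query budget stated in the protocol.
    assert 3 <= len(queries) <= 5
    return queries
```

Each query keeps the seed entity verbatim so the downstream evidence stays grounded in it.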

The construction agent is further prompted for:

- Meticulousness and persistence in finding high-quality academic evidence
- Task decomposition: search -> evidence extraction -> question generation -> verification
- Adaptive error handling and reuse of progress state when searches fail or evidence is insufficient
- Multi-query scout search and URL selection based on relevance, venue quality, source diversity, and scientific specificity
- Use of the url2evidence sub-agent to access selected academic sources, extract key supporting evidence, and distinguish stand-alone scientific facts from study-specific artifacts
- Evidence quality checks, including source authority, evidence-answer entailment, and avoidance of unsupported assumptions
- Question formulation with plausible, unbiased, and challenging confounders for MCQs; a clear expected answer for short-answer questions; and final quality checks
- Multi-tool coordination following the typical workflow: scout search -> source selection -> url2evidence -> question generation -> verification

## Output Format

The final output MUST be a JSON object with the following structure: '''json { "question": "The question text containing or directly grounded in the seed entity", "answer": "The correct answer...
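The multi-tool workflow and the JSON output contract above can be sketched end to end. Every step below is a stub (the real system dispatches LLM-driven sub-agents over live web tools), and all function names other than the JSON field names are assumptions made for the sketch.

```python
# Hedged sketch of the construction workflow:
# scout search -> source selection -> url2evidence -> question generation -> verification.
import json

def scout_search(queries):
    # Multi-query scout search: return candidate hits with venue metadata (stub).
    return [{"url": "https://example.org/paper", "venue": "peer-reviewed journal"}]

def select_sources(hits):
    # Filter by relevance and venue quality (stub: keep anything with a venue).
    return [h["url"] for h in hits if h.get("venue")]

def url2evidence(url):
    # Evidence-extraction sub-agent (stub).
    return f"Key supporting evidence extracted from {url}"

def generate_question(seed, evidence):
    # The question must contain or be directly grounded in the seed entity.
    return {
        "question": f"Which finding about {seed} does the evidence support?",
        "answer": evidence,
    }

def verify(task):
    # Final quality check; evidence-answer entailment is reduced to non-emptiness here.
    return bool(task["question"]) and bool(task["answer"])

def build_task(seed, queries):
    urls = select_sources(scout_search(queries))
    evidence = url2evidence(urls[0])
    task = generate_question(seed, evidence)
    # The final output is a JSON object with "question" and "answer" fields.
    return json.dumps(task) if verify(task) else None
```

The stubs make the control flow explicit: failed verification yields no task, mirroring the agent's discard-on-failure behavior.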

A valid anchor entity must be:

- **Domain-specific**: It is a concrete scientific entity, such as a gene, protein, pathway, compound, species, technique, disease, mutation, phenotype, material, model, or other scientific concept
- **Question-body only**: It appears in the question stem but does NOT appear in the correct answer or any confounder
- **Decisive**: The question becomes substantially harder or unanswerable if this entity is masked or removed
- **Specific and concrete**: It is sufficiently specific to support further evidence-grounded browsing and question generation

## Your Task

Given the question, correct answer(s), and confounders below, you must:

1. Identify candidate anchor entities in the question body
2. Verify that each candidate does NOT appear in the correct answer or any confounder
3. Evaluate whether each candidate is decisive for deriving the final answer
4. Select the most decisive, specific, and concrete entity

If no valid anchor exists, return an empty string.

## Selection Criteria (in priority order)

1. Prefer the MOST SPECIFIC entity, e.g., "AXL" over "receptor tyrosine kinase"
2. Prefer entities that constrain the answer, such that removing them makes multiple answers plausible
3. Prefer named entities, such as gene, protein, compound, disease, pathway, or model names, over generic scientific terms
4. Prefer entities that are decoupled from the surface form of the answer options
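The anchor rules above can be sketched in a few lines. This is a hedged sketch under stated assumptions: candidate entities are given rather than extracted, and specificity is approximated by a crude named-symbol/word-count heuristic, whereas the paper's agent makes these judgments with an LLM.

```python
# Sketch of anchor-entity selection: keep candidates that appear in the
# question body but in neither the answer nor any confounder, then pick
# the most specific one. Specificity is approximated heuristically here.

def specificity_key(candidate: str):
    # Named symbols like "AXL" outrank generic phrases; fewer words reads as more specific.
    named = candidate.isupper() or any(ch.isupper() for ch in candidate[1:])
    return (named, -len(candidate.split()))

def select_anchor(question: str, answer: str, confounders: list[str],
                  candidates: list[str]) -> str:
    valid = [
        c for c in candidates
        if c in question                                   # question-body only
        and c not in answer
        and all(c not in conf for conf in confounders)
    ]
    if not valid:
        return ""  # no valid anchor exists
    return max(valid, key=specificity_key)
```

On the paper's own example, this heuristic prefers "AXL" over "receptor tyrosine kinase", since the gene symbol scores as a named entity.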
