SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
Pith reviewed 2026-05-09 14:14 UTC · model grok-4.3
The pith
Automated synthesis of conceptual and computational tasks drives an 8B model to a new state of the art at its scale on frontier biology and chemistry reasoning benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SciResearcher is a fully automated agentic framework for frontier-science data construction that synthesizes diverse conceptual and computational tasks grounded in academic evidence. The framework elicits information acquisition, tool-integrated reasoning, and long-horizon capabilities. Training on the resulting data via supervised fine-tuning and agentic reinforcement learning yields SciResearcher-8B, which scores 19.46 percent on HLE-Bio/Chem-Gold and delivers 13-15 percent absolute improvements on SuperGPQA-Hard-Biology and TRQA-Literature, establishing a new state of the art at the 8B scale and surpassing several larger proprietary agents.
What carries the argument
The SciResearcher agentic framework, which automatically synthesizes grounded conceptual and computational tasks from academic sources to drive supervised fine-tuning and reinforcement learning for scientific agents.
If this is right
- Smaller open models can match or exceed larger closed agents on hard scientific benchmarks when trained on suitably synthesized data.
- The same synthesis loop can be iterated to generate larger and more diverse datasets for continued scaling.
- Agentic reinforcement learning on these tasks strengthens long-horizon planning beyond what factual pre-training alone provides.
- The framework reduces dependence on manual knowledge-graph or browsing pipelines that miss computational depth in sparse academic sources.
Where Pith is reading between the lines
- The method could transfer to other data-scarce fields such as materials discovery or theoretical physics by swapping the source academic corpora.
- Pairing the synthesized tasks with real experimental logs or simulation outputs might close remaining gaps between benchmark performance and practical discovery.
- The performance edge at 8B scale suggests that future gains may come more from data quality than from raw parameter count in scientific agent work.
Load-bearing premise
Tasks synthesized by the agentic framework accurately reflect the computational and reasoning demands of actual frontier scientific problems rather than simplified or proxy versions.
What would settle it
If a model trained on human-curated real frontier problems performs no better than SciResearcher-8B on the same held-out benchmarks, or if SciResearcher-8B fails on a fresh set of unsolved domain problems never seen during data synthesis, the value of the automated construction process would be in question.
Original abstract
Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SciResearcher, a fully automated agentic framework for synthesizing diverse conceptual and computational tasks grounded in academic evidence to support information acquisition, tool use, and long-horizon reasoning. The authors apply supervised fine-tuning followed by agentic reinforcement learning on this data to produce SciResearcher-8B, which achieves 19.46% on HLE-Bio/Chem-Gold (new SOTA at 8B scale, surpassing some larger proprietary agents) along with 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature.
Significance. If the synthetic tasks are shown to impose reasoning loads comparable to real frontier scientific problems, the framework would provide a scalable, automated alternative to knowledge-graph or web-browsing curation methods, potentially enabling more capable agents for automated discovery in domains with sparse, heterogeneous sources.
Major comments (3)
- [§3] §3 (Framework Description): The task-synthesis pipeline is described at a high level but supplies no pseudocode, concrete examples of generated computational tasks, or quantitative metrics (e.g., number of tool calls or reasoning steps required). Without these, it is impossible to verify that the data elicits the claimed long-horizon and tool-integrated reasoning rather than simpler pattern-matching proxies.
- [§5] §5 (Experiments): No ablation studies isolate the contribution of the agentic RL stage versus SFT alone, and no error bars or run-to-run variance are reported for the headline scores (19.46% on HLE-Bio/Chem-Gold, 13-15% gains elsewhere). This leaves open whether the reported improvements are robust or attributable to the framework.
- [§4] §4 (Data Construction): The manuscript provides no human-expert validation, difficulty calibration against real frontier problems, or comparison of source heterogeneity between synthetic tasks and actual academic literature. This directly bears on the central claim that performance gains reflect genuine scaling of scientific reasoning rather than distribution matching on easier proxies.
Minor comments (2)
- [Abstract] The abstract states '13-15% absolute gains' without naming the exact baselines or providing the raw scores for each benchmark.
- [§2] Notation for the HLE-Bio/Chem-Gold benchmark is introduced without a reference or brief definition of its construction.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity and rigor, and we will revise the manuscript accordingly to address them. Below we respond point by point to the major comments.
Point-by-point responses
Referee: [§3] §3 (Framework Description): The task-synthesis pipeline is described at a high level but supplies no pseudocode, concrete examples of generated computational tasks, or quantitative metrics (e.g., number of tool calls or reasoning steps required). Without these, it is impossible to verify that the data elicits the claimed long-horizon and tool-integrated reasoning rather than simpler pattern-matching proxies.
Authors: We agree that additional concrete details are needed to substantiate the long-horizon and tool-use claims. In the revised manuscript we will include full pseudocode for the task-synthesis pipeline as an appendix. We will also provide multiple concrete examples of generated computational tasks (including both conceptual and multi-step computational instances) together with quantitative metrics such as the distribution of tool calls per task and average reasoning steps. These additions will enable readers to directly assess the complexity and structure of the synthesized data. revision: yes
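To make the promised pseudocode concrete, here is a minimal sketch of the synthesis loop as the paper's appendix fragments describe it (scout search -> source selection -> url2evidence -> question generation -> verification). Every helper below is an illustrative stub, not the authors' code; only the control flow follows the stated workflow.

```python
# Hypothetical sketch of the task-synthesis loop. All helper functions are
# illustrative stubs standing in for search APIs, an extraction sub-agent,
# and an LLM; they are not the authors' implementation.
from dataclasses import dataclass


def plan_queries(entity: str, n: int = 5) -> list[str]:
    return [f"{entity} mechanism", f"{entity} review", f"{entity} pathway"][:n]

def scout_search(queries: list[str]) -> list[str]:
    return [f"https://example.org/search?q={q}" for q in queries]

def select_sources(urls: list[str]) -> list[str]:
    return urls[:2]  # would rank by relevance, venue quality, diversity

def url2evidence(url: str) -> list[str]:
    return [f"key fact extracted from {url}"]

def generate_question(entity: str, evidence: list[str]) -> tuple[str, str]:
    return f"Which finding involves {entity}?", "placeholder answer"

def verify_question(q: str, a: str, evidence: list[str]) -> bool:
    return bool(evidence)  # would check entailment and confounder quality


@dataclass
class Task:
    question: str
    answer: str
    evidence: list[str]

def synthesize_task(seed_entity: str, max_retries: int = 3) -> Task | None:
    """Synthesize one evidence-grounded task for a seed entity."""
    for _ in range(max_retries):
        queries = plan_queries(seed_entity)           # plan before searching
        urls = select_sources(scout_search(queries))  # multi-query scout search
        evidence = [fact for url in urls for fact in url2evidence(url)]
        if not evidence:
            continue  # adaptive error handling: retry with fresh queries
        question, answer = generate_question(seed_entity, evidence)
        if verify_question(question, answer, evidence):
            return Task(question, answer, evidence)
    return None

print(synthesize_task("AXL"))
```

The real pipeline presumably backs each stub with live tooling; the skeleton above is only what the described workflow implies.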
Referee: [§5] §5 (Experiments): No ablation studies isolate the contribution of the agentic RL stage versus SFT alone, and no error bars or run-to-run variance are reported for the headline scores (19.46% on HLE-Bio/Chem-Gold, 13-15% gains elsewhere). This leaves open whether the reported improvements are robust or attributable to the framework.
Authors: We acknowledge the value of isolating the RL contribution and reporting statistical robustness. We will add ablation experiments comparing SFT-only training against the full SFT + agentic RL pipeline on the same data. We will also rerun the primary evaluation runs with multiple random seeds and report error bars (standard deviation) for the key metrics on HLE-Bio/Chem-Gold, SuperGPQA-Hard-Biology, and TRQA-Literature. These changes will clarify the source of the observed gains. revision: yes
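For context, a minimal sketch of the kind of multi-seed aggregation such reporting would involve; the per-seed scores below are illustrative placeholders, not numbers from the paper.

```python
# Aggregate benchmark accuracy across random seeds into mean +/- std.
# Scores are illustrative placeholders, not results from the paper.
import statistics

runs = {
    "HLE-Bio/Chem-Gold": [19.1, 19.5, 19.8],
    "SuperGPQA-Hard-Biology": [41.2, 40.7, 41.9],
    "TRQA-Literature": [55.0, 54.3, 55.6],
}

for bench, scores in runs.items():
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation across seeds
    print(f"{bench}: {mean:.2f} +/- {std:.2f} (n={len(scores)})")
```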
Referee: [§4] §4 (Data Construction): The manuscript provides no human-expert validation, difficulty calibration against real frontier problems, or comparison of source heterogeneity between synthetic tasks and actual academic literature. This directly bears on the central claim that performance gains reflect genuine scaling of scientific reasoning rather than distribution matching on easier proxies.
Authors: We will expand the data construction section to include quantitative comparisons of source heterogeneity (e.g., topic diversity, citation graph statistics, and domain coverage) between the synthetic tasks and samples drawn from the original academic literature. We will also add indirect difficulty calibration by reporting task-complexity statistics (tool-call depth, reasoning-chain length) and relating them to the hardness of the evaluation benchmarks. However, human-expert validation was not performed because the framework is intentionally fully automated; we will explicitly discuss this design choice and its implications as a limitation. revision: partial
- Not addressed: direct human-expert validation of the synthetic tasks and explicit difficulty calibration against real frontier problems, as these steps would require substantial human annotation effort and contradict the core goal of a fully automated, scalable data-construction pipeline.
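A sketch of the task-complexity statistics proposed above, assuming each synthesized task logs its tool calls and reasoning steps; the record schema and values are hypothetical.

```python
# Summarize task-complexity statistics (tool-call depth, reasoning-chain
# length) and domain coverage from synthesized-task records.
# The record schema and values below are hypothetical.
from collections import Counter
import statistics

tasks = [
    {"tool_calls": 6, "reasoning_steps": 14, "domain": "biology"},
    {"tool_calls": 3, "reasoning_steps": 9, "domain": "chemistry"},
    {"tool_calls": 8, "reasoning_steps": 21, "domain": "biology"},
]

depths = [t["tool_calls"] for t in tasks]
chains = [t["reasoning_steps"] for t in tasks]
print(f"tool-call depth: min={min(depths)}, max={max(depths)}, "
      f"mean={statistics.mean(depths):.1f}")
print(f"reasoning-chain length: mean={statistics.mean(chains):.1f}")
print("domain coverage:", Counter(t["domain"] for t in tasks))
```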
Circularity Check
No significant circularity; empirical pipeline on external benchmarks
Full rationale
The paper presents an agentic framework that synthesizes tasks from academic sources, applies standard supervised fine-tuning plus agentic RL, and reports accuracy on independent external benchmarks (HLE-Bio/Chem-Gold, SuperGPQA-Hard-Biology, TRQA-Literature). No equations, fitted parameters, or internal derivations are described. Performance claims are grounded in post-training evaluation rather than any self-referential reduction, self-citation chain, or renaming of known results. The central assumption (synthetic tasks match frontier demands) is an empirical validity question, not a circularity in the reported derivation.
Reference graph
Works this paper leans on
- [1] Anthropic. Introducing Claude Sonnet 4.5, September 2025. URL https://www.anthropic.com/news/claude-sonnet-4-5
- [2] Vladimir Baulin, Austin Cook, Daniel Friedman, Janna Lumiruusu, Andrew Pashea, Shagor Rahman, and Benedikt Waldeck. The Discovery Engine: A framework for AI-driven synthesis and navigation of scientific knowledge landscapes, 2025. URL https://arxiv.org/abs/2505.17500
- [3] Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, and Siheng Chen. SciMaster: Towards general-purpose scientific AI agents, part I. X-Master as foundation: Can we lead on Humanity's Last Exam?, 2025. URL https://arxiv.org/abs/2507.05241
- [4-5] Dave Citron. Try Deep Research and our new experimental model in Gemini, your AI assistant, December 2024. URL https://blog.google/products/gemini/google-gemini-deep-research/
- [6] Ryan Cory-Wright, Cristina Cornelio, Sanjeeb Dash, Bachir El Khadir, and Lior Horesh. Evolving scientific discovery by unifying data and background knowledge with AI Hilbert. Nature Communications, 15(1):5922, July 2024. doi:10.1038/s41467-024-50074-w
- [7] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents, 2025. URL https://arxiv.org/abs/2506.11763
- [8] Kathryn Dumschott, Hannah Dörpholz, Marie-Angélique Laporte, Dominik Brilhaus, Andrea Schrader, Björn Usadel, Steffen Neumann, Elizabeth Arnaud, and Angela Kranz. Ontologies for increasing the FAIRness of plant research data. Frontiers in Plant Science, Volume 14, 2023. ISSN 1664-462X. doi:10.3389/fpls.2023.1279694
- [9] Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, and Dong Yu. WebEvolver: Enhancing web agent self-improvement with coevolving world model, 2025. URL https://arxiv.org/abs/2504.21024
- [10] Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Yonglin Wang, Jingchen Ni, Tianshi Zheng, Chun Chen, Wenhao Yu, Zhenwen Liang, Hongming Zhang, Haitao Mi, and Dong Yu. Cognitive Kernel-Pro: A framework for deep research agents and agent foundation models training, 2026. URL https://arxiv...
- [11] FutureSearch: Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, and Jack Wildman. Deep Research Bench: Evaluating AI web research agents, 2025. URL https://arxiv.org/abs/2506.06287
- [12] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi... arXiv, 2025.
- [13] Minda Hu, Tianqing Fang, Jianshu Zhang, Junyu Ma, Zhisong Zhang, Jingyan Zhou, Hongming Zhang, Haitao Mi, Dong Yu, and Irwin King. WebCoT: Enhancing web agent reasoning by reconstructing chain-of-thought in reflection, branching, and rollback, 2025. URL https://arxiv.org/abs/2505.20013
- [14] Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical AI agent. bioRxiv, 2025.
- [15] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code, 2025. URL https://arxiv.org/abs/2502.13138
- [16] Andrej Karpathy. autoresearch: AI agents running research on single-GPU nanochat training automatically. https://github.com/karpathy/autoresearch, 2026.
- [17] Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. WebSailor-V2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning, 2025. URL https://arxiv.org/abs/2509.13305
- [18] Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. WebSailor: Navigating super-human reasoning for web agent, 2025. URL https://arxiv.org/abs/2507.02592
- [19] Mukai Li, Qingcheng Zeng, Tianqing Fang, Zhenwen Liang, Linfeng Song, Qi Liu, Haitao Mi, and Dong Yu. Verified critical step optimization for LLM agents, 2026. URL https://arxiv.org/abs/2602.03412
- [20] Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, and Junxian He. WebExplorer: Explore and evolve for training long-horizon web agents, 2025. URL https://arxiv.org/abs/2509.06501
- [21] Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, and Siheng Chen. ML-Master: Towards AI-for-AI via integration of exploration and reasoning, 2025. URL https://arxiv.org/abs/2506.16499
- [22] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292
- [23] Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. LLM4SR: A survey on large language models for scientific research, 2025. URL https://arxiv.org/abs/2501.04306
- [24] M-A-P, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tyshawn Hsing, Ming Xu, Zh...
- [25] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 2024. URL https://openreview.net/forum?id=fibxvahvs3
- [26] Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagora...
- [27] Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael P. Brenner, and Peter Norgaard. FEABench: Evaluating language models on multiphysics reasoning ability, 2025. URL https://arxiv.org/abs/2504.06260
- [28] OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, February 2025.
- [29] Perplexity AI. Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research, 2025.
- [30] Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, et al. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649(8099):1139–1146, January 2026. doi:10.1038/s41586-025-09962-4
- [31] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM agents as research assistants, 2025. URL https://arxiv.org/abs/2501.04227
- [32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
- [33] Zhihong Shen, Hao Ma, and Kuansan Wang. A web-scale system for scientific knowledge exploration. In Fei Liu and Thamar Solorio, editors, Proceedings of ACL 2018, System Demonstrations, pages 87–92, Melbourne, Australia, July 2018. doi:10.18653/v1/P18-4015. URL https://aclanthology.org/P18-4015/
- [34] Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers, 2024. URL https://arxiv.org/abs/2409.04109
- [35] Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, Wenlong Zhang, Lei Bai, Zhenfei Yin, Philip Torr, Hanrui Wang, and Di Jin. Eigen-1: Adaptive multi-agent refinement with monitor-based RAG for scientific reasoning.
- [37] Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. WebShaper: Agentically data synthesizing via information-seeking formalization, 2025. URL https://arxiv.org/abs/2507.15061
- [38] MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, P... arXiv, 2025.
- [39] Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, and Michael R. Lyu. Inference-time scaling of verification: Self-evolving deep research agents via test-time rubric-guided verification, 2026. URL https://arxiv.org/abs/2601.15808
- [40] Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, and Tejal Patwardhan. FrontierScience: Evaluating AI's ability to perform expert-level scientific tasks, 2026. URL https://arxiv.org/abs/2601.21165
- [41] Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, and Kam-Fai Wong. WebAggregator: Enhancing compositional reasoning capabilities of deep research agent foundation models, 2026. URL https://arxiv.org/abs/2510.14438
- [42] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models, 2024. URL https://arxiv.org/abs/2307.10635
- [43] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516
- [44] Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. DeepScientist: Advancing frontier-pushing scientific findings progressively, 2025. URL https://arxiv.org/abs/2509.26603
- [45] Andrew White, Michael Skarlinski, Jon Laurent, and Albert Bou. About 30% of Humanity's Last Exam chemistry/biology answers are likely wrong, July 2025. URL https://www.futurehouse.org/research-announcements/hle-exam
- [46] Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. WebDancer: Towards autonomous information seeking agency, 2025. URL https://arxiv.org/abs/2505.22648
- [47] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. WebWalker: Benchmarking LLMs in web traversal, 2025. URL https://arxiv.org/abs/2501.07572
- [48] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024.
- [49] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... arXiv, 2025.
- [50] Zhongyue Zhang, Zijie Qiu, Yingcheng Wu, Shuya Li, Dingyan Wang, Zhuomin Zhou, Duo An, Yuhan Chen, Yu Li, Yongbo Wang, Chubin Ou, Zichen Wang, Jack Xiaoyu Chen, Bo Zhang, Yusong Hu, Wenxin Zhang, Zhijian Wei, Runze Ma, Qingwu Liu, Bo Dong, Yuexi He, Qiantai Feng, Lei Bai, Qiang Gao, Siqi Sun, and Shuangjia Zheng. OriGene: A self-evolving virtual disease b...
- [51] Zixuan Zhang, Nikolaus Parulian, Heng Ji, Ahmed Elsayed, Skatje Myers, and Martha Palmer. Fine-grained information extraction from biomedical literature based on knowledge-enriched Abstract Meaning Representation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computationa...
- [52] Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From automation to autonomy: A survey on large language models in scientific discovery. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proc...
- [53] Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, and Simon See. NewtonBench: Benchmarking generalizable scientific law discovery in LLM agents, 2026. URL https://arxiv.org/abs/2510.07172
Appendix excerpts
Prompt-template criteria recovered from the paper's appendix on agent technical details.

Each synthesized question must:
- Include the seed entity or be directly grounded in it
- Be concise but scientifically meaningful
- Be answerable from a single authoritative academic source at this stage
- Prefer multiple-choice format with plausible confounders, while allowing short-answer format when more appropriate
- Avoid shortcuts that can be solved by trivia, superficial keyword matching, or generic web search without reading the academic evidence
- Be suitable as the semantic backbone for later anchor-based augmentation

Pre-action protocol (plan before searching): before browsing, understand the seed entity and its scientific context, then plan 3-5 diverse search queries that target academic sources such as peer-reviewed papers, domain databases, preprints, and reputable scientific venues.

Behaviors the prompts elicit:
- Meticulousness and persistence in finding high-quality academic evidence
- Task decomposition: search -> evidence extraction -> question generation -> verification
- Adaptive error handling and reuse of progress state when searches fail or evidence is insufficient
- Multi-query scout search and URL selection based on relevance, venue quality, source diversity, and scientific specificity
- Use of the url2evidence sub-agent to access selected academic sources, extract key supporting evidence, and distinguish stand-alone scientific facts from study-specific artifacts
- Evidence quality checks, including source authority, evidence-answer entailment, and avoidance of unsupported assumptions
- Question formulation with plausible, unbiased, and challenging confounders for MCQs, a clear expected answer for short-answer questions, and final quality checks
- Multi-tool coordination following the typical workflow: scout search -> source selection -> url2evidence -> question generation -> verification

Output format: the final output must be a JSON object of the form { "question": "the question text containing or directly grounded in the seed entity", "answer": "the correct answer..." }.

Anchor-entity criteria. A valid anchor entity is:
- Domain-specific: a concrete scientific entity, such as a gene, protein, pathway, compound, species, technique, disease, mutation, phenotype, material, model, or other scientific concept
- Question-body only: it appears in the question stem but does NOT appear in the correct answer or any confounder
- Decisive: the question becomes substantially harder or unanswerable if the entity is masked or removed
- Specific and concrete: sufficiently specific to support further evidence-grounded browsing and question generation

Given the question, correct answer(s), and confounders, the agent must:
- Identify candidate anchor entities in the question body
- Verify that each candidate does NOT appear in the correct answer or any confounder
- Evaluate whether each candidate is decisive for deriving the final answer
- Select the most decisive, specific, and concrete entity; if no valid anchor exists, return an empty string

Selection criteria, in priority order (a minimal validation sketch follows this list):
- Prefer the MOST SPECIFIC entity, e.g., "AXL" over "receptor tyrosine kinase"
- Prefer entities that constrain the answer, such that removing them makes multiple answers plausible
- Prefer named entities, such as gene, protein, compound, disease, pathway, or model names, over generic scientific terms
- Prefer entities that are decoupled from the surface form of the answer options
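A minimal sketch of how these selection rules could be enforced in code; the specificity heuristic and all names are illustrative assumptions, not the paper's implementation.

```python
# Enforce the recovered anchor-selection rules: the anchor must appear in
# the question stem but in no answer option, and more specific, named
# entities are preferred. The specificity heuristic is an assumption.
def is_valid_anchor(entity: str, question: str, answer: str,
                    confounders: list[str]) -> bool:
    """Question-body only: in the stem, absent from every option."""
    e = entity.lower()
    in_question = e in question.lower()
    in_options = any(e in opt.lower() for opt in [answer, *confounders])
    return in_question and not in_options

def specificity(entity: str) -> tuple[int, int]:
    # Crude proxy: gene-symbol-like tokens (e.g. "AXL") outrank generic
    # lowercase terms; ties break on surface length.
    looks_named = any(tok.isupper() or tok[:1].isupper()
                      for tok in entity.split())
    return (1 if looks_named else 0, len(entity))

def pick_anchor(candidates: list[str], question: str, answer: str,
                confounders: list[str]) -> str:
    """Return the most specific valid anchor, or '' if none is valid."""
    valid = [c for c in candidates
             if is_valid_anchor(c, question, answer, confounders)]
    return max(valid, key=specificity, default="")

question = ("Which small-molecule inhibitor selectively targets AXL, "
            "a receptor tyrosine kinase implicated in tumor metastasis?")
print(pick_anchor(["AXL", "receptor tyrosine kinase"],
                  question, "bemcentinib", ["erlotinib", "imatinib"]))
# -> "AXL": both candidates pass the question-body-only check, but the
#    named gene symbol wins on specificity.
```

The capitalization-plus-length proxy is deliberately crude; a production system would presumably rank specificity with an NER model or an ontology lookup rather than surface features.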