Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Binzhao Luo; Bo Zhang; Chao Chen; Chuyi Peng; Dongchen Huang; Fei Chao; Huaihai Huang; Jiaxing Wan; Lei Bai; Maoli Gao

arxiv: 2606.18648 · v3 · pith:5GF7TP6Knew · submitted 2026-06-17 · ⚛️ physics.comp-ph


Yigeng Jiang, Tengchao Yang, Taoyong Cui, Jiaxing Wan, Yuan Wang, Weida Wang, Zhiyu Liu, Chuyi Peng, Binzhao Luo, Maoli Gao, Huaihai Huang, Yuqianer Zeng, Ziyang Zheng, Dongchen Huang, Chao Chen, Zichao Liu, Weiping Shen, Shuchen Pu, Siyu Zhou, Runmin Ma, Yusong Hu, Fei Chao, Bo Zhang, Xiawu Zheng, Zifu Wang, Lei Bai, Yunqi Cai, Shufei Zhang


Pith reviewed 2026-06-26 19:15 UTC · model grok-4.3

classification ⚛️ physics.comp-ph

keywords: PhySciBench, DelveAgent, multi-agent framework, physical sciences benchmark, LLM agents, scientific reasoning, physics, chemistry


The pith

DelveAgent improves accuracy on scientific benchmarks by up to 7.5 percentage points at roughly one-third of the strongest baseline's inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhySciBench, a benchmark of 200 expert-curated questions, balanced between physics and chemistry, that mirror real research workflows in six task categories. Even the strongest baseline, Gemini Deep Research, reaches only 33.5 percent accuracy, with failures traced to fragile long reasoning chains, weak cross-step knowledge transfer, and absent physics-based self-verification. The authors then present DelveAgent, a multi-agent system that adds an adaptive planning loop, dual-granularity memory, and hierarchical physics-grounded reflection. On four scientific benchmarks this design raises accuracy by up to 7.5 percentage points while cutting inference cost to roughly one-third of the strongest baseline. The work supplies both a domain-specific testbed and evidence that targeted architecture can make autonomous scientific agents more dependable.

Core claim

The paper claims that DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism, improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline across four scientific benchmarks; PhySciBench simultaneously shows that even leading models achieve only 33.5 percent on expert-curated physical-science questions.

What carries the argument

DelveAgent, a modular multi-agent framework with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism.

If this is right

PhySciBench functions as a dedicated benchmark for AI systems in physical sciences.
The three identified deficiencies (fragile reasoning chains, limited knowledge transfer, absent physics-grounded self-verification) explain why current agents underperform.
Adaptive planning, dual-granularity memory, and physics-grounded reflection together address those deficiencies.
Architectural specialization can raise both accuracy and efficiency of autonomous scientific research agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If PhySciBench questions track actual lab workflows, then DelveAgent-style agents could shorten iteration cycles in experimental physics and chemistry.
The modular structure suggests the same components could be ported to adjacent domains such as materials discovery or quantum information.
Lower inference cost makes repeated use of such agents feasible inside resource-limited research groups.

Load-bearing premise

The 200 expert-curated questions accurately represent real-world physical science research challenges and workflows, and the measured gains stem from the proposed architecture rather than prompt or implementation details.

What would settle it

Run the 200 PhySciBench questions with an ablation that removes the hierarchical physics-grounded reflection module and check whether the accuracy and cost gains disappear.
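The ablation described above can be organized as a leave-one-out sweep over the three components. The following is a minimal sketch, not DelveAgent's actual interface: `AgentConfig`, its three switches, `run_agent`, and `stub_agent` are hypothetical names introduced only for illustration, and the stub stands in for a real agent call.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical component switches (illustrative, not the released API)."""
    adaptive_planning: bool = True
    dual_memory: bool = True
    physics_reflection: bool = True


def leave_one_out(full: AgentConfig) -> Dict[str, AgentConfig]:
    """Return the full configuration plus one variant per disabled component."""
    variants = {"full": full}
    for name in ("adaptive_planning", "dual_memory", "physics_reflection"):
        variants[f"no_{name}"] = replace(full, **{name: False})
    return variants


def run_ablation(
    run_agent: Callable[[AgentConfig, str], bool],
    questions: List[str],
) -> Dict[str, float]:
    """Score every variant on the same questions and return accuracy per variant."""
    results = {}
    for label, cfg in leave_one_out(AgentConfig()).items():
        outcomes = [run_agent(cfg, q) for q in questions]
        results[label] = sum(outcomes) / len(outcomes)
    return results


def stub_agent(cfg: AgentConfig, question: str) -> bool:
    """Toy stand-in for a real agent: 'correct' if reflection is on or the question length is even."""
    return cfg.physics_reflection or len(question) % 2 == 0


scores = run_ablation(stub_agent, ["a", "bb", "ccc", "dddd"])
```

Holding the question set, base model, token budget, and tool access fixed across all four variants is what lets a drop in one row be attributed to the disabled component.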

Original abstract

Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research. Our data and code are publicly available at https://github.com/yigengjiang/physci-deepresearch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PhySciBench is a useful new benchmark but DelveAgent's gains lack ablations to tie them to the three proposed components.


The paper introduces PhySciBench, 200 expert-curated questions split between physics and chemistry across six task categories meant to mirror real research steps. It also presents DelveAgent, a multi-agent system that adds an adaptive planning loop, dual-granularity memory, and hierarchical physics-grounded reflection.

The authors do a clear job laying out three failure modes seen in current systems (fragile long chains, weak knowledge carry-over between steps, and missing physics-based checks) and show that even Gemini Deep Research reaches only 33.5% on the new benchmark. Releasing the data and code is a practical step that lets others check the numbers.

The soft spot is the missing link between those three features and the reported accuracy lift of up to 7.5 points plus the drop to one-third cost. The abstract gives no ablations that turn each component off while holding the base model, token budget, and tool access constant. Two hundred questions is a small set; minor differences in prompting or baseline tuning could produce similar deltas without any architectural change. No error bars or significance tests are described.
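To put a rough number on how small a 200-question set is, the sketch below computes a 95% Wilson score interval for the reported 33.5% baseline accuracy. It assumes that 33.5% of 200 corresponds to 67 correct answers and that questions can be treated as independent trials; neither the interval nor the function name comes from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumption: 33.5% accuracy on 200 questions means 67 correct answers.
low, high = wilson_interval(67, 200)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 27% to 40%, so a difference of a few points between two systems on a set of this size can sit within sampling noise unless a paired analysis shows otherwise.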

The stress-test concern holds up on the information given: the causal claim that the specialized components are what drive the improvement rests on an untested assumption. The benchmark itself looks like a reasonable addition for this domain.

This work is aimed at groups building or testing LLM agents for physical science tasks. It deserves peer review because the benchmark fills a visible gap and the framework ideas are concrete, even though the evaluation will need tighter controls to be reliable.

Referee Report

2 major / 1 minor

Summary. The paper introduces PhySciBench, a benchmark of 200 expert-curated questions balanced between physics and chemistry across six task categories reflecting scientific workflows. It evaluates state-of-the-art models and agents, finding limited performance (strongest baseline Gemini Deep Research at 33.5% accuracy), identifies three failure modes (fragile extended reasoning, limited knowledge transfer, lack of physics-grounded verification), and proposes DelveAgent, a multi-agent framework with an adaptive planning loop, dual-granularity memory, and hierarchical physics-grounded reflection. On four scientific benchmarks, DelveAgent achieves accuracy gains of up to 7.5 percentage points and reduces inference costs to ~1/3 of the strongest baseline. Code and data are released publicly.

Significance. If the empirical results hold after addressing attribution and statistical concerns, the work supplies a new, domain-relevant benchmark for physical-science AI agents and shows that targeted architectural specialization can improve both accuracy and efficiency over general baselines. The public release of code and data is a clear strength supporting reproducibility and follow-on work.

major comments (2)

[Experimental results (across the four benchmarks)] The central performance claims (up to 7.5 pp accuracy gain and ~3× cost reduction) are presented without error bars, statistical significance tests, or explicit controls for prompt-engineering differences and baseline implementation details. This information is required to establish that the observed deltas are attributable to the three proposed components rather than implementation or prompting variations.
[DelveAgent framework description and evaluation] No ablation studies are reported that remove or disable each of the three components (adaptive planning loop, dual-granularity memory, hierarchical physics-grounded reflection) while keeping the underlying LLM, total prompt tokens, and tool access fixed. Such ablations are load-bearing for the claim that the architectural features, rather than other factors, produce the reported gains.
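One way to supply the significance test requested in the first major comment is an exact McNemar test on per-question outcomes, which suits two systems scored on the same questions. This is a sketch of a standard test, not an analysis the paper reports; `mcnemar_exact` and the toy outcome lists are illustrative assumptions.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline: Sequence[bool], candidate: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from paired correct/incorrect outcomes."""
    if len(baseline) != len(candidate):
        raise ValueError("both systems must be scored on the same questions")
    # Only discordant pairs (one system right, the other wrong) carry information.
    only_baseline = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    only_candidate = sum(1 for b, c in zip(baseline, candidate) if c and not b)
    n = only_baseline + only_candidate
    if n == 0:
        return 1.0
    k = min(only_baseline, only_candidate)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy outcomes for six questions (made up for illustration only).
baseline_correct = [True, True, False, False, False, False]
candidate_correct = [True, False, True, True, True, True]
p_value = mcnemar_exact(baseline_correct, candidate_correct)
```

Because the test uses only questions where the two systems disagree, it needs per-question results rather than aggregate accuracies, which is another reason the released outputs matter.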

minor comments (1)

[PhySciBench introduction] The abstract and benchmark description would benefit from an explicit statement of how the 200 questions were selected and validated (e.g., inter-annotator agreement or coverage of typical research workflows) to strengthen the claim of representativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each major comment below and commit to revisions that strengthen the empirical validation of our claims.


Referee: [Experimental results (across the four benchmarks)] The central performance claims (up to 7.5 pp accuracy gain and ~3× cost reduction) are presented without error bars, statistical significance tests, or explicit controls for prompt-engineering differences and baseline implementation details. This information is required to establish that the observed deltas are attributable to the three proposed components rather than implementation or prompting variations.

Authors: We agree that providing error bars, statistical tests, and clearer controls for baselines would strengthen the claims. In the revised version, we will report results with standard deviations from multiple independent runs (e.g., 5 seeds), include p-values from appropriate statistical tests, and detail the prompt templates and implementation choices for all baselines to isolate the effect of our architectural components. revision: yes

Referee: [DelveAgent framework description and evaluation] No ablation studies are reported that remove or disable each of the three components (adaptive planning loop, dual-granularity memory, hierarchical physics-grounded reflection) while keeping the underlying LLM, total prompt tokens, and tool access fixed. Such ablations are load-bearing for the claim that the architectural features, rather than other factors, produce the reported gains.

Authors: We acknowledge the importance of ablations for attributing performance gains to specific components. We will add a dedicated ablation study section in the revised manuscript, where we systematically disable each component (adaptive planning loop, dual-granularity memory, and hierarchical physics-grounded reflection) one at a time, while controlling for LLM, token budget, and tools, and report the resulting performance drops on the benchmarks. revision: yes
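The multi-seed reporting promised in the first response amounts to a mean and a sample standard deviation over independent runs. A minimal sketch follows; `summarize_runs` is a hypothetical helper and the five accuracies are placeholder values, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Placeholder accuracies for five seeds (NOT taken from the paper).
seed_accuracies = [0.40, 0.42, 0.38, 0.41, 0.39]
avg, spread = summarize_runs(seed_accuracies)
print(f"accuracy = {avg:.3f} +/- {spread:.3f} over {len(seed_accuracies)} seeds")
```

Reporting the same summary for every baseline and every ablation variant would make the size of a claimed gain directly comparable to run-to-run variation.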

Circularity Check

0 steps flagged

No circularity: empirical measurements on new benchmark


The paper introduces PhySciBench as a new 200-question benchmark and reports measured accuracy and cost improvements for DelveAgent versus baselines across four benchmarks. No equations, first-principles derivations, or predictions are claimed; the three architectural components are motivated by observed failure modes but the reported deltas are direct experimental outcomes rather than quantities forced by construction from fitted parameters or self-referential definitions. No load-bearing self-citations or uniqueness theorems appear in the text. The central claims therefore remain independent empirical results and receive the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the central claims rest on empirical evaluation of a new benchmark and system whose internal design choices are not detailed here.

pith-pipeline@v0.9.1-grok · 5901 in / 1204 out tokens · 20651 ms · 2026-06-26T19:15:34.779743+00:00 · methodology



[1] [1]

Khan, A., Cowen-Rivers, A

John Jumper et al. “Highly accurate protein structure prediction with AlphaFold”. In:Nature 596 (2021), pp. 583–589.doi:10.1038/s41586-021-03819-2

work page doi:10.1038/s41586-021-03819-2 2021

[2] [2]

URL https://doi.org/10.1038/s41586-023-06735-9

Amil Merchant et al. “Scaling deep learning for materials discovery”. In:Nature624 (2023), pp. 80–85.doi:10.1038/s41586-023-06735-9

work page doi:10.1038/s41586-023-06735-9 2023

[3] [3]

An autonomous laboratory for the accelerated synthesis of novel materials

Nathan J. Szymanski et al. “An autonomous laboratory for the accelerated synthesis of novel materials”. In:Nature624 (2023), pp. 86–91.doi:10.1038/s41586-023-06734-w

work page doi:10.1038/s41586-023-06734-w 2023

[4] [4]

& K¨ ohn, A

Anubhav Jain et al. “Commentary: The Materials Project: A materials genome approach to accelerating materials innovation”. In:APL Materials1 (2013), p. 011002.doi:10.1063/1. 4812323

work page doi:10.1063/1 2013

[5] [5]

Mapping cellular interactions from spatially resolved transcriptomics data

James Zhu, Yunguan Wang, et al. “Mapping cellular interactions from spatially resolved transcriptomics data”. In:Nature Methods(2024).doi:10.1038/s41592-024-02408-1

work page doi:10.1038/s41592-024-02408-1 2024

[6] [6]

A.; MacKnight, R.; Kline, B.; Gomes, G.Nature2023,624, 570–578, DOI: 10.1038/s41586-023-06792-0

Daniil A. Boiko et al. “Autonomous chemical research with large language models”. In:Nature 624 (2023), pp. 570–578.doi:10.1038/s41586-023-06792-0

work page doi:10.1038/s41586-023-06792-0 2023

[7] [7]

Jacob Cohen

Andrés M. Bran et al. “Augmenting large language models with chemistry tools”. In:Nature Machine Intelligence6 (2024), pp. 525–535.doi:10.1038/s42256-024-00832-8

work page doi:10.1038/s42256-024-00832-8 2024

[8] [8]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao et al. “ReAct: Synergizing Reasoning and Acting in Language Models”. In:Interna- tional Conference on Learning Representations (ICLR). 2023. arXiv:2210.03629

Pith/arXiv arXiv 2023

[9] [9]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei et al. “Chain-of-thought prompting elicits reasoning in large language models”. In: Advances in Neural Information Processing Systems (NeurIPS). Vol. 35. 2022, pp. 24824–24837. arXiv:2201.11903

Pith/arXiv arXiv 2022

[10] [10]

https://openai.com/index/introducing-deep- research

OpenAI.Introducing Deep Research. https://openai.com/index/introducing-deep- research. 2026

2026

[11] [11]

https://gemini.google/us/overview/deep-research

Google.GeminiDeepResearch. https://gemini.google/us/overview/deep-research. 2025

2025

[12] [12]

Towards Autonomous Mathematics Research

Tony Feng et al. “Towards Autonomous Mathematics Research”. In:arXiv preprint arXiv:2602.10177(2026).url:https://arxiv.org/abs/2602.10177

arXiv 2026

[13] [13]

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

David P. Woodruff et al. “Accelerating Scientific Research with Gemini: Case Studies and Common Techniques”. In:arXiv preprint arXiv:2602.03837(2026).url:https://arxiv. org/abs/2602.03837

arXiv 2026

[14] [14]

Towards an AI co-scientist

Juraj Gottweis et al. “Towards an AI co-scientist”. In:arXiv preprint arXiv:2502.18864(2025). url:https://arxiv.org/abs/2502.18864

Pith/arXiv arXiv 2025

[15] [15]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu et al. “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery”. In:arXiv preprint arXiv:2408.06292(2024).url: https://arxiv.org/abs/2408.06292

Pith/arXiv arXiv 2024

[16] [16]

PhysMaster: Building an Autonomous AI Physicist for Theoretical and Computational Physics Research

Tingjia Miao et al. “PhysMaster: Building an Autonomous AI Physicist for Theoretical and Computational Physics Research”. In:arXiv preprint arXiv:2512.19799(2025)

arXiv 2025

[17] [17]

Physics supernova: Ai agent matches elite gold medalists at ipho 2025

Jiahao Qiu et al. “Physics supernova: Ai agent matches elite gold medalists at ipho 2025”. In: arXiv preprint arXiv:2509.01659(2025)

arXiv 2025

[18] [18]

Agentic AI for multi-stage physics experiments at a large-scale user facility particle accelerator

Thorsten Hellert et al. “Agentic AI for multi-stage physics experiments at a large-scale user facility particle accelerator”. In:arXiv preprint arXiv:2509.17255(2025)

Pith/arXiv arXiv 2025

[19] [19]

From ai for science to agentic science: A survey on autonomous scientific discovery

Jiaqi Wei et al. “From ai for science to agentic science: A survey on autonomous scientific discovery”. In:arXiv preprint arXiv:2508.14111(2025). 17

arXiv 2025

[20] [20]

Random compressed coding with neurons

Simone Blanco Malerba et al. “Random compressed coding with neurons”. In:Cell Reports (2025).doi:10.1016/j.celrep.2025.115412

work page doi:10.1016/j.celrep.2025.115412 2025

[21] [21]

Elbeheiry, María Victoria Gil, Christina Glaubitz, Maximilian Greiner, Caroline T

Adrian Mirza et al. “A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists”. In:Nature Chemistry(2025).doi: 10.1038/s41557-025-01815-x

work page doi:10.1038/s41557-025-01815-x 2025

[22] [22]

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Jon M. Laurent et al. “LAB-Bench: Measuring Capabilities of Language Models for Biology Research”. In:arXiv preprint arXiv:2407.10362(2024).url: https://arxiv.org/abs/ 2407.10362

Pith/arXiv arXiv 2024

[23] [23]

SciCode: A Research Coding Benchmark Curated by Scientists

Minyang Tian et al. “SciCode: A Research Coding Benchmark Curated by Scientists”. In: Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track

[24] [24]

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Ziru Chen et al. “ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery”. In: International Conference on Learning Representations (ICLR). 2025. arXiv: 2410.05080

arXiv 2025

[25] [25]

PHYBench: Holistic evaluation of physical perception and reasoning in large language models

Shi Qiu et al. “PHYBench: Holistic evaluation of physical perception and reasoning in large language models”. In: NeurIPS (2025)

2025

[26] [26]

CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Weida Wang et al. “CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics”. In: ICLR (2026)

2026

[27] [27]

CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

Haining Pan et al. “CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers”. In: ICLR (2026)

2026

[28] [28]

Humanity’s Last Exam

Long Phan et al. “Humanity’s Last Exam”. In: Nature (2025). doi: 10.1038/s41586-025-09962-4

work page doi:10.1038/s41586-025- 2025

[29] [29]

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Shanghai Artificial Intelligence Laboratory. “Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows”. In: CoRR abs/2512.16969 (2025). doi: 10.48550/ARXIV.2512.16969. arXiv: 2512.16969. url: https://doi.org/10.48550/arXiv.2512.16969

work page internal anchor Pith review doi:10.48550/arxiv 2025

[30] [30]

FrontierScience: Evaluating AI’s Ability to Perform Expert-Level Scientific Tasks

Miles Wang et al. “FrontierScience: Evaluating AI’s Ability to Perform Expert-Level Scientific Tasks”. In: CoRR abs/2601.21165 (2026). doi: 10.48550/ARXIV.2601.21165. arXiv: 2601.21165. url: https://doi.org/10.48550/arXiv.2601.21165

work page doi:10.48550/arxiv.2601.21165 2026

[31] [31]

Gemini Team, Google DeepMind. Gemini 3: Frontier Intelligence Built for Speed and Scale. 2025. url: https://deepmind.google/models/gemini/flash/

2025

[32] [32]

Grok 4.1 Fast and Agent Tools API

xAI. Grok 4.1 Fast and Agent Tools API. 2026. url: https://x.ai/news/grok-4-1-fast

2026

[33] [33]

Integrating physical units into high-performance AI-driven scientific computing

Chaoming Wang et al. “Integrating physical units into high-performance AI-driven scientific computing”. In: Nature Communications (2025). doi: 10.1038/s41467-025-58626-4

work page doi:10.1038/s41467-025-58626-4 2025

[34] [34]

Probing the limitations of multimodal language models for chemistry and materials research

Nawaf Alampara et al. “Probing the limitations of multimodal language models for chemistry and materials research”. In: Nature Computational Science 5.10 (2025), pp. 952–961

2025

[35] [35]

Towards Multimodal Data-Driven Scientific Discovery Powered by LLM Agents

Fan Liu, Xiaozhao Zeng, and Hao Liu. “Towards Multimodal Data-Driven Scientific Discovery Powered by LLM Agents”. In: The Fourteenth International Conference on Learning Representations

[36] [36]

No free labels: Limitations of LLM-as-a-judge without human grounding

Michael Krumdick et al. “No free labels: Limitations of LLM-as-a-judge without human grounding”. In: arXiv preprint arXiv:2503.05061 (2025)

arXiv 2025

[37] [37]

Limitations of the LLM-as-a-judge approach for evaluating LLM outputs in expert knowledge tasks

Annalisa Szymanski et al. “Limitations of the LLM-as-a-judge approach for evaluating LLM outputs in expert knowledge tasks”. In: Proceedings of the 30th International Conference on Intelligent User Interfaces. 2025, pp. 952–966

2025

[38] [38]

Autonomous artificial intelligence, scientific research, and human values

David B Resnik, Mohammad Hosseini, and Rico Hauswald. “Autonomous artificial intelligence, scientific research, and human values”. In: AI and Ethics 6.1 (2026), p. 141

2026

[39] [39]

The ethics of using artificial intelligence in scientific research: new guidance needed for a new tool

David B Resnik and Mohammad Hosseini. “The ethics of using artificial intelligence in scientific research: new guidance needed for a new tool”. In: AI and Ethics 5.2 (2025), pp. 1499–1521

2025

[40] [40]

Introducing GPT-5.2

OpenAI. Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/. 2025

2025

[41] [41]

A new era of intelligence with Gemini 3

Google DeepMind. A new era of intelligence with Gemini 3. https://blog.google/products-and-platforms/products/gemini/gemini-3/. 2025

2025

[42] [42]

Anthropic. Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5. 2025

2025

[43] [43]

Kimi K2.5: Visual Coding Meets Agent Swarm

Moonshot AI. Kimi K2.5: Visual Coding Meets Agent Swarm. https://github.com/MoonshotAI/Kimi-K2.5. 2026

2026

[44] [44]

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Intern-S1-Pro Team. “Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale”. In: CoRR abs/2603.25040 (2026). doi: 10.48550/ARXIV.2603.25040. arXiv: 2603.25040. url: https://doi.org/10.48550/arXiv.2603.25040

work page doi:10.48550/arxiv.2603.25040 2026

[45] [45]

DeepSeek AI. DeepSeek-V3.2 Release: DeepSeek-V3.2 & DeepSeek-V3.2-Speciale. https://api-docs.deepseek.com/news/news251201. 2025

2025

[46] [46]

Qwen3-VL Technical Report

Shuai Bai et al. “Qwen3-VL Technical Report”. In: arXiv preprint arXiv:2511.21631 (2025)

Pith/arXiv arXiv 2025

[47] [47]

Aymeric Roucher et al. `smolagents`: a smol library to build great agentic systems. https://github.com/huggingface/smolagents. 2025

2025

[48] [48]

SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam?

Chai Jingyi et al. “SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam?” In: arXiv preprint arXiv:2507.05241 (2025)

arXiv 2025

[49] [49]

Tongyi DeepResearch Technical Report

Tongyi DeepResearch Team et al. “Tongyi DeepResearch Technical Report”. In: arXiv preprint arXiv:2510.24701 (2025)

Pith/arXiv arXiv 2025

[50] [50]

Yuchen Shi et al. Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization. 2025. arXiv: 2512.24615 [cs.AI]. url: https://arxiv.org/abs/2512.24615

arXiv 2025

[51] [51]

OWL: Optimized workforce learning for general multi-agent assistance in real-world task automation

Mengkang Hu et al. “OWL: Optimized workforce learning for general multi-agent assistance in real-world task automation”. In: NeurIPS (2025)

2025