
arxiv: 2606.31478 · v1 · pith:NEX3B4HSnew · submitted 2026-06-30 · 💻 cs.AI · cs.CV

One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution

Pith reviewed 2026-07-01 05:42 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords autonomous research agents · failure recovery · multi-hypothesis attribution · grounded reporting · self-correcting agents · AI scientific discovery · experiment diagnosis

The pith

Autonomous research agents recover from experiment failures more reliably by generating and routing multiple evidence-grounded hypotheses instead of using a single reflection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the common practice of compressing experiment trajectories into one free-form reflection leaves autonomous research agents brittle, because the single critique often produces only local fixes or discards useful context. SAGE replaces this with Multi-Hypothesis Failure Attribution, which examines trajectory features to create several distinct causal explanations, ranks their severity, and sends the root cause to the matching intervention level while enforcing that every reported number matches an actual measurement. On a 12-topic benchmark spanning five domains the method raises the share of outputs that contain real metrics from 42 percent to 92 percent, lifts average artifact quality from 5.00 to 6.75 out of 10, and outscores an earlier autonomous system in blind ratings. A reader would care because the change addresses a concrete bottleneck that has kept fully autonomous scientific agents unreliable.
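The headline percentages can be sanity-checked against the benchmark size. The short sketch below assumes one output per topic, which the abstract does not state; under that assumption, 42 percent and 92 percent are what 5 of 12 and 11 of 12 topics round to. The counts 5 and 11 are inferred for illustration, not reported figures.

```python
from fractions import Fraction

TOPICS = 12  # benchmark size stated in the paper


def share_as_percent(count: int, total: int = TOPICS) -> int:
    """Share of topics as a percentage, rounded to the nearest whole percent."""
    return round(Fraction(count, total) * 100)


# Assumed counts (one output per topic); the paper reports only percentages.
baseline_pct = share_as_percent(5)   # 5/12
sage_pct = share_as_percent(11)      # 11/12

# Artifact-quality gain on the 10-point scale, from the reported means.
quality_gain = 6.75 - 5.00

print(baseline_pct, sage_pct, quality_gain)  # 42 92 1.75
```

If the assumption holds, the metrics-bearing gain corresponds to six additional topics out of twelve, which gives a sense of how small the underlying sample is.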

Core claim

SAGE treats failure recovery as structured causal diagnosis through Multi-Hypothesis Failure Attribution. The system analyzes dynamic trajectory features to produce multiple evidence-grounded explanations for each failure, evaluates their severity independently, and deterministically routes the verified root cause to the appropriate intervention level (hypothesis, design, or implementation). It further applies grounded reporting that redacts any numbers not directly measured. Together these yield higher rates of metrics-bearing outputs and higher-quality artifacts than single-reflection baselines.
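The grounded-reporting idea, redacting any number that was not directly measured, can be illustrated with a small sketch. This is not the paper's implementation: the function name `redact_ungrounded`, the regular expression, the tolerance, and the `[REDACTED]` marker are all assumptions chosen for illustration, and the pattern only handles plain unsigned decimals.

```python
import re
from typing import Set

# Matches unsigned integers and decimals such as "3" or "0.91" (assumed format).
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def redact_ungrounded(draft: str, measured: Set[float], tol: float = 1e-9) -> str:
    """Replace every number in `draft` that matches no measured value."""

    def check(match):
        value = float(match.group())
        if any(abs(value - m) <= tol for m in measured):
            return match.group()  # grounded: keep the number as written
        return "[REDACTED]"       # ungrounded: no recorded measurement backs it

    return NUMBER.sub(check, draft)


# Hypothetical draft sentence and measurement log.
draft = "Accuracy reached 0.91 after 3 epochs, beating the 0.95 target."
measured = {0.91, 3.0}
print(redact_ungrounded(draft, measured))
# Accuracy reached 0.91 after 3 epochs, beating the [REDACTED] target.
```

A real system would also need to handle units, percentages, and rounding conventions, and would likely match numbers to named metrics rather than to a flat set of values.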

What carries the argument

Multi-Hypothesis Failure Attribution (MHFA): a causal-diagnosis process that generates multiple evidence-grounded explanations from trajectory features, evaluates their severity, and routes the root cause to the correct intervention level.
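The generate, score, and route steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the `Attribution` record, the 0-to-1 severity scale, the rule that explanations without cited evidence are discarded, and the "highest severity wins" selection are all hypothetical. Only the three intervention levels come from the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Level(Enum):
    # The three intervention levels named in the paper.
    HYPOTHESIS = "hypothesis"
    DESIGN = "design"
    IMPLEMENTATION = "implementation"


@dataclass(frozen=True)
class Attribution:
    explanation: str
    level: Level
    severity: float            # assumed scale: 0 (benign) to 1 (fatal)
    evidence: Tuple[str, ...]  # trajectory features cited as support


def route(candidates: List[Attribution]) -> Optional[Attribution]:
    """Pick the most severe evidence-backed explanation, or None if none qualify."""
    grounded = [c for c in candidates if c.evidence]
    if not grounded:
        return None
    # max() returns the first maximal item, so ties resolve deterministically
    # in list order.
    return max(grounded, key=lambda c: c.severity)


# Hypothetical candidate explanations for one failed run.
candidates = [
    Attribution("learning rate too high", Level.IMPLEMENTATION, 0.4, ("loss oscillates",)),
    Attribution("dataset too small for the claim", Level.DESIGN, 0.7, ("validation gap widens",)),
    Attribution("hypothesis is false", Level.HYPOTHESIS, 0.9, ()),  # no evidence: dropped
]
root = route(candidates)
print(root.level.value)  # design
```

The selection rule is a pure function of the candidate list, which is one simple way to make routing deterministic even if the explanations themselves come from a stochastic model.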

If this is right

  • A larger share of autonomous research outputs will contain verifiable measured metrics rather than ungrounded numbers.
  • Artifact quality will rise most in code-development and execution tasks.
  • The approach will produce higher blind-evaluation scores than earlier autonomous research systems.
  • Structured recovery plus explicit grounding will supply a more consistent base for future autonomous scientific work.
  • Targeted fixes at the right level will preserve useful context that single reflections often discard.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Longer experiment chains could become feasible if the same diagnosis loop is applied repeatedly without resetting the entire trajectory.
  • The same multi-explanation structure might reduce error propagation in other iterative AI-agent settings such as automated planning or tool-use sequences.
  • Requiring every numeric claim to trace to a recorded measurement could serve as a general safeguard against hallucination in any report-generating agent.
  • Evaluating the method on research problems that involve longer time scales or noisier data would test whether the benchmark gains generalize.

Load-bearing premise

The 12-topic, 5-domain benchmark represents typical autonomous-research challenges and the observed gains result specifically from the multi-hypothesis mechanism and grounding constraints rather than from other prompting or implementation differences.

What would settle it

Running the same comparison on a fresh collection of research topics and finding no rise in the fraction of metrics-bearing outputs or no quality-score gain would show that the claimed mechanism does not produce the reported benefits.
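One way to make "no quality-score gain" concrete on a fresh topic set is an exact paired sign-flip test on per-topic score differences (SAGE minus baseline). The sketch below is an illustration only: the paper's abstract does not say which test, if any, was used, and the example differences are invented. With 12 topics the full enumeration covers 2^12 = 4096 sign patterns, so no sampling is needed.

```python
from itertools import product
from typing import Sequence


def sign_flip_p_value(diffs: Sequence[float]) -> float:
    """One-sided exact p-value for sum(diffs) > 0 under random sign flips.

    Under the null hypothesis of no systematic gain, each paired difference
    is equally likely to have either sign.
    """
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = sum(s * d for s, d in zip(signs, diffs))
        # Small tolerance guards against floating-point noise in the sums.
        if flipped >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / total


# Invented per-topic differences for a hypothetical fresh 12-topic run.
fresh_topic_diffs = [2, 1, 0, 3, -1, 2, 1, 1, 0, 2, 1, -1]
print(f"one-sided p = {sign_flip_p_value(fresh_topic_diffs):.4f}")
```

A large p-value on a fresh topic set would be the kind of result that counts against the claimed mechanism; a small one would not by itself isolate MHFA as the cause.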

Original abstract

Autonomous research agents can now draft hypotheses, write code, run experiments, and produce papers, but they remain brittle when experiments fail. Under the prevailing paradigm, failure recovery is usually delegated to a single free-form reflection: a rich trajectory of metrics, logs, and design choices is compressed into one verbal critique, which often leads either to localized trial-and-error or to hard pivots that discard useful context. We propose SAGE, a Self-correcting, Autonomous, Grounded Experimenter, to tackle this failure-recovery bottleneck. Its core mechanism, Multi-Hypothesis Failure Attribution (MHFA), treats recovery as a structured causal diagnosis. By analyzing dynamic trajectory features, MHFA systematically generates multiple evidence-grounded explanations for a failure, independently evaluates their severity, and deterministically routes the verified root cause to the correct intervention level (hypothesis, experimental design, or implementation). To guarantee scientific honesty, SAGE further employs a grounded reporting mechanism that explicitly constrains drafted results to actual measured values, redacting hallucinated numbers. On a 12-topic, 5-domain benchmark, SAGE increases metrics-bearing outputs from 42% to 92% over a reflection baseline, improves artifact quality from 5.00 to 6.75/10, and blindly outscores AI-Scientist-v2 (52.0 vs. 48.2), with gains concentrated in code development and execution. While fully autonomous scientific writing and generating conference-ready papers remain notoriously difficult open problems for the entire field, SAGE successfully produces significantly more reliable and higher-quality scientific artifacts. Ultimately, by coupling structured recovery with explicit grounding constraints, SAGE significantly outperforms monolithic reflection paradigms, establishing a highly trustworthy foundation for future autonomous research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SAGE, a Self-correcting Autonomous Grounded Experimenter for autonomous research agents. Its core contribution is Multi-Hypothesis Failure Attribution (MHFA), which replaces single free-form reflection with structured causal diagnosis: it generates multiple evidence-grounded failure explanations from trajectory features, evaluates their severity, and routes the root cause to the appropriate intervention level (hypothesis, design, or implementation). A grounded reporting mechanism further constrains outputs to measured values to prevent hallucination. On a 12-topic, 5-domain benchmark, SAGE is reported to raise metrics-bearing outputs from 42% to 92%, artifact quality from 5.00 to 6.75/10, and blind scores over AI-Scientist-v2 (52.0 vs. 48.2), with gains concentrated in code development and execution.

Significance. If the performance lifts can be causally attributed to MHFA rather than uncontrolled implementation differences, the work would supply a concrete, mechanism-level improvement to failure recovery in autonomous scientific agents—an acknowledged bottleneck. The emphasis on deterministic routing and explicit grounding constraints addresses hallucination and context-loss problems that single-reflection baselines commonly exhibit. Reproducibility of the benchmark and isolation of the MHFA module would make the result a useful reference point for subsequent agent designs.

major comments (2)
  1. [Experimental Evaluation] The manuscript provides no description of how the reflection baseline was implemented (model version, prompt scaffolding, temperature/sampling settings, retry logic, or termination criteria). Without an explicit parity control, that is, a baseline that differs from SAGE only in the MHFA module, the reported gains (42% → 92% metrics-bearing outputs, 5.00 → 6.75 artifact quality) cannot be attributed specifically to the proposed mechanism rather than to richer prompting or other unstated differences.
  2. [Benchmark and Results] No information is given on topic-selection criteria, exclusion rules, statistical tests, or inter-rater reliability for the blind scoring (52.0 vs. 48.2). These omissions make it impossible to assess whether the 12-topic, 5-domain set is representative or whether evaluator bias could explain the observed differences.
minor comments (2)
  1. [Abstract] The abstract states quantitative improvements but does not cross-reference the sections that would contain the required experimental controls and statistical details.
  2. [Method] Notation for MHFA components (e.g., severity scoring, routing rules) should be introduced with explicit definitions or pseudocode to allow replication.
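The parity control requested in major comment 1 can be expressed as a simple check: the baseline and SAGE run configurations should differ in exactly one field, the recovery module. The sketch below is hypothetical throughout; the configuration keys and values are invented placeholders, since the abstract gives no implementation details.

```python
from typing import Any, Dict, Set

_MISSING = object()  # sentinel so an absent key differs from any real value


def config_diff(baseline: Dict[str, Any], treatment: Dict[str, Any]) -> Set[str]:
    """Return the set of keys whose values differ between two run configs."""
    keys = baseline.keys() | treatment.keys()
    return {
        k for k in keys
        if baseline.get(k, _MISSING) != treatment.get(k, _MISSING)
    }


# Invented placeholder configs; none of these values come from the paper.
baseline_cfg = {"model": "model-x", "temperature": 0.7, "max_retries": 3,
                "recovery": "single_reflection"}
sage_cfg = {"model": "model-x", "temperature": 0.7, "max_retries": 3,
            "recovery": "mhfa"}

differing = config_diff(baseline_cfg, sage_cfg)
print(differing)  # {'recovery'}
```

Publishing both configurations alongside such a check would let readers verify that the comparison isolates the recovery mechanism.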

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental controls and benchmark transparency. We address each major comment below and will revise the manuscript to improve reproducibility and attribution of results to the MHFA mechanism.

Point-by-point responses
  1. Referee: [Experimental Evaluation] The manuscript provides no description of how the reflection baseline was implemented (model version, prompt scaffolding, temperature/sampling settings, retry logic, or termination criteria). Without an explicit parity control, that is, a baseline that differs from SAGE only in the MHFA module, the reported gains (42% → 92% metrics-bearing outputs, 5.00 → 6.75 artifact quality) cannot be attributed specifically to the proposed mechanism rather than to richer prompting or other unstated differences.

    Authors: We agree that the baseline implementation details were insufficiently specified, preventing clear isolation of the MHFA contribution. In the revised manuscript we will add a dedicated subsection describing the reflection baseline in full, including the exact model version, prompt scaffolding, temperature and sampling settings, retry logic, and termination criteria. We will also document the parity controls, stating that all other components (trajectory logging, grounded reporting, intervention routing outside MHFA, etc.) are identical between conditions. revision: yes

  2. Referee: [Benchmark and Results] No information is given on topic-selection criteria, exclusion rules, statistical tests, or inter-rater reliability for the blind scoring (52.0 vs. 48.2). These omissions make it impossible to assess whether the 12-topic, 5-domain set is representative or whether evaluator bias could explain the observed differences.

    Authors: We acknowledge the lack of these details. The revised version will include explicit topic-selection criteria, exclusion rules, and a description of how the 12-topic, 5-domain set was assembled. For the blind scoring, we will report any available inter-rater reliability statistics or note that single-rater scoring was used if that was the case; we will also add a brief discussion of representativeness. If formal statistical tests were not performed, we will state this limitation and, where feasible, compute and report basic significance measures on the existing data. revision: partial
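The inter-rater reliability statistic mentioned in the second response could take several forms. One common choice for two raters assigning categorical labels is Cohen's kappa, sketched below. This is an assumption about what might be reported, not something the paper or the rebuttal specifies, and the example ratings are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when both raters use one label
    return (observed - expected) / (1 - expected)


# Invented ratings for four artifacts from two hypothetical raters.
rater_1 = ["high", "high", "low", "low"]
rater_2 = ["high", "low", "low", "low"]
print(cohens_kappa(rater_1, rater_2))  # 0.5
```

For numeric scores such as the 52.0 vs. 48.2 ratings, a correlation-based or intraclass measure may be more appropriate than a categorical kappa.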

Circularity Check

0 steps flagged

No circularity; purely empirical claims with no derivations or self-referential reductions

Full rationale

The paper contains no equations, mathematical derivations, fitted parameters presented as predictions, or load-bearing self-citations that reduce a central claim to its own inputs by construction. All reported results (42%→92% metrics-bearing outputs, 5.00→6.75 artifact quality, 52.0 vs 48.2 blind scores) are empirical comparisons on a described benchmark. While experimental controls for isolating MHFA could be debated on methodological grounds, this is not circularity under the defined patterns (self-definitional, fitted-input-as-prediction, etc.). The work is self-contained against its external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, background axioms, or new postulated entities beyond the named system and mechanism.

pith-pipeline@v0.9.1-grok · 5868 in / 1277 out tokens · 33906 ms · 2026-07-01T05:42:19.449203+00:00 · methodology



    Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://op...

  70. [102]

    B. P. MacLeod, F. G. L. Parlane, T. D. Morrissey, F. Häse, L. M. Roch, K. E. Dettelbach, R. Moreira, L. P. E. Yunker, M. B. Rooney, J. R. Deeth, V. Lai, G. J. Ng, H. Situ, R. H. Zhang, M. S. Elliott, T. H. Haley, D. J. Dvorak, A. Aspuru-Guzik, J. E. Hein, and C. P. Berlinguette. Self-driving laboratory for accelerated discovery of thin-film materials. Sci...

  71. [103]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In NeurIPS, 2023

  72. [104]

    Discoverybench: Towards data-driven discovery with large language models

    Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28...

  73. [105]

    MLE-STAR: machine learning engineering agent via search and targeted refinement

    Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arik, and Tomas Pfister. MLE-STAR: machine learning engineering agent via search and targeted refinement. CoRR, abs/2506.15692, 2025. doi:10.48550/ARXIV.2506.15692. URL https://doi.org/10.48550/arXiv.2506.15692

  74. [106]

    Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and...

  75. [107]

    John R. Platt. Strong inference. Science, 146(3642):347--353, 1964. doi:10.1126/science.146.3642.347. URL https://www.science.org/doi/abs/10.1126/science.146.3642.347

  76. [108]

    WebResearcher: Unleashing unbounded reasoning capability in long-horizon agents

    Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents. CoRR, abs/2509.13309, 2025. doi:10.48550/ARXIV.2509.13309. URL https://doi.o...

  77. [109]

    Agent laboratory: Using LLM agents as research assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In EMNLP (Findings), pages 5977--6043. Association for Computational Linguistics, 2025

  78. [110]

    Assisting in writing wikipedia-like articles from scratch with large language models

    Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, and Monica S. Lam. Assisting in writing wikipedia-like articles from scratch with large language models, 2024. URL https://arxiv.org/abs/2402.14207

  79. [111]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In NeurIPS, 2023

  80. [112]

    Language agents achieve superhuman synthesis of scientific knowledge

    Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela M. Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge. CoRR, abs/2409.13740, 2024. doi:10.48550/ARXIV.2409.13740. URL https://doi.org/10.48550/arXiv.2409.13740
