One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution

Binfei Chu; Jiayi Ji; Jie Gao; Jie Ma; Jinlu Zhang; Rongrong Ji; Xiaoshuai Sun; Yi Tan; Yiwei Ma

arxiv: 2606.31478 · v1 · pith:NEX3B4HSnew · submitted 2026-06-30 · 💻 cs.AI · cs.CV

Jie Ma, Binfei Chu, Jie Gao, Jinlu Zhang, Yiwei Ma, Yi Tan, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Pith reviewed 2026-07-01 05:42 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords autonomous research agents · failure recovery · multi-hypothesis attribution · grounded reporting · self-correcting agents · AI scientific discovery · experiment diagnosis

The pith

Autonomous research agents recover from experiment failures more reliably by generating and routing multiple evidence-grounded hypotheses instead of using a single reflection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the common practice of compressing experiment trajectories into one free-form reflection leaves autonomous research agents brittle, because the single critique often produces only local fixes or discards useful context. SAGE replaces this with Multi-Hypothesis Failure Attribution, which examines trajectory features to create several distinct causal explanations, ranks their severity, and sends the root cause to the matching intervention level while enforcing that every reported number matches an actual measurement. On a 12-topic benchmark spanning five domains, the method raises the share of outputs that contain real metrics from 42 percent to 92 percent, lifts average artifact quality from 5.00 to 6.75 out of 10, and outscores AI-Scientist-v2 in blind ratings (52.0 vs. 48.2). A reader would care because the change addresses a concrete bottleneck that has kept fully autonomous scientific agents unreliable.

Core claim

SAGE treats failure recovery as structured causal diagnosis through Multi-Hypothesis Failure Attribution: the system analyzes dynamic trajectory features to produce multiple evidence-grounded explanations for each failure, evaluates their severity independently, and deterministically routes the verified root cause to the appropriate intervention level (hypothesis, design, or implementation); it further applies grounded reporting that redacts any numbers not directly measured, yielding higher rates of metrics-bearing outputs and higher-quality artifacts than single-reflection baselines.

What carries the argument

Multi-Hypothesis Failure Attribution (MHFA): a causal-diagnosis process that generates multiple evidence-grounded explanations from trajectory features, evaluates their severity, and routes the root cause to the correct intervention level.
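
The generate, score, and route loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Attribution` schema, the 0 to 1 severity scale, the `min_evidence` threshold, and the tie-breaking rule are all hypothetical and are not taken from the paper, which this page reviews from the abstract alone.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    HYPOTHESIS = "hypothesis"
    DESIGN = "design"
    IMPLEMENTATION = "implementation"


@dataclass(frozen=True)
class Attribution:
    """One candidate explanation for a failed experiment (hypothetical schema)."""
    cause: str
    level: Level
    severity: float   # assumed scale: 0.0 (benign) to 1.0 (blocking)
    evidence: tuple   # trajectory features that support this explanation


def route(candidates, min_evidence=1):
    """Return the most severe evidence-backed attribution, or None if none qualifies.

    Ties on severity go to the earlier Level, so the same input always
    yields the same route.
    """
    order = list(Level)
    grounded = [c for c in candidates if len(c.evidence) >= min_evidence]
    if not grounded:
        return None
    return max(grounded, key=lambda c: (c.severity, -order.index(c.level)))


candidates = [
    Attribution("learning rate too high", Level.IMPLEMENTATION, 0.6,
                ("loss diverges after step 200",)),
    Attribution("metric does not measure the claim", Level.DESIGN, 0.8,
                ("accuracy flat across all seeds",)),
    # No supporting evidence, so this candidate is filtered out before ranking.
    Attribution("hypothesis is false", Level.HYPOTHESIS, 0.9, ()),
]
chosen = route(candidates)
print(chosen.level.value, "-", chosen.cause)
```

The point of the sketch is the separation of concerns: explanations are scored independently, ungrounded ones are dropped, and the final choice is a pure function of the scored list rather than another free-form generation step.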

If this is right

A larger share of autonomous research outputs will contain verifiable measured metrics rather than ungrounded numbers.
Artifact quality will rise most in code-development and execution tasks.
The approach will produce higher blind-evaluation scores than earlier autonomous research systems.
Structured recovery plus explicit grounding will supply a more consistent base for future autonomous scientific work.
Targeted fixes at the right level will preserve useful context that single reflections often discard.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Longer experiment chains could become feasible if the same diagnosis loop is applied repeatedly without resetting the entire trajectory.
The same multi-explanation structure might reduce error propagation in other iterative AI-agent settings such as automated planning or tool-use sequences.
Requiring every numeric claim to trace to a recorded measurement could serve as a general safeguard against hallucination in any report-generating agent.
Evaluating the method on research problems that involve longer time scales or noisier data would test whether the benchmark gains generalize.
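
The idea of tracing every numeric claim to a recorded measurement can be illustrated with a small filter. This is a minimal sketch, not the paper's grounded reporting mechanism: the function name `redact_ungrounded`, the `[REDACTED]` marker, and the exact-match tolerance are assumptions made for the example.

```python
import re

# Matches plain integers and decimals; signs, percentages, and scientific
# notation are deliberately out of scope for this sketch.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def redact_ungrounded(draft, measured, tol=1e-9):
    """Replace every numeric literal in `draft` that matches no measured value."""
    def check(match):
        value = float(match.group())
        grounded = any(abs(value - m) <= tol for m in measured)
        return match.group() if grounded else "[REDACTED]"
    return NUMBER.sub(check, draft)


measured = [0.913, 12]
draft = "Accuracy reached 0.913 on 12 topics, a 7.5 point gain."
print(redact_ungrounded(draft, measured))
```

A real system would also need an allowlist for numbers that are not results (years, section numbers, dataset sizes), since a filter this strict redacts them as well.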

Load-bearing premise

The 12-topic, 5-domain benchmark represents typical autonomous-research challenges and the observed gains result specifically from the multi-hypothesis mechanism and grounding constraints rather than from other prompting or implementation differences.

What would settle it

Running the same comparison on a fresh collection of research topics and finding no rise in the fraction of metrics-bearing outputs or no quality-score gain would show that the claimed mechanism does not produce the reported benefits.

Original abstract

Autonomous research agents can now draft hypotheses, write code, run experiments, and produce papers, but they remain brittle when experiments fail. Under the prevailing paradigm, failure recovery is usually delegated to a single free-form reflection: a rich trajectory of metrics, logs, and design choices is compressed into one verbal critique, which often leads either to localized trial-and-error or to hard pivots that discard useful context. We propose SAGE, a Self-correcting, Autonomous, Grounded Experimenter, to tackle this failure-recovery bottleneck. Its core mechanism, Multi-Hypothesis Failure Attribution (MHFA), treats recovery as a structured causal diagnosis. By analyzing dynamic trajectory features, MHFA systematically generates multiple evidence-grounded explanations for a failure, independently evaluates their severity, and deterministically routes the verified root cause to the correct intervention level (hypothesis, experimental design, or implementation). To guarantee scientific honesty, SAGE further employs a grounded reporting mechanism that explicitly constrains drafted results to actual measured values, redacting hallucinated numbers. On a 12-topic, 5-domain benchmark, SAGE increases metrics-bearing outputs from 42% to 92% over a reflection baseline, improves artifact quality from 5.00 to 6.75/10, and blindly outscores AI-Scientist-v2 (52.0 vs. 48.2), with gains concentrated in code development and execution. While fully autonomous scientific writing and generating conference-ready papers remain notoriously difficult open problems for the entire field, SAGE successfully produces significantly more reliable and higher-quality scientific artifacts. Ultimately, by coupling structured recovery with explicit grounding constraints, SAGE significantly outperforms monolithic reflection paradigms, establishing a highly trustworthy foundation for future autonomous research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The reported gains from MHFA are hard to credit to the mechanism because the baseline comparison shows no sign of being isolated from other implementation differences.

The main thing to know is that MHFA gives a structured way to generate and rank multiple failure explanations instead of one free-form reflection, plus a check to keep reported numbers tied to actual measurements. That is the concrete addition over the usual single-reflection setup.

The paper describes SAGE using dynamic trajectory features to produce several evidence-based hypotheses for a failure, score their severity, and route the root cause to the right intervention level. It also constrains outputs to measured values to reduce hallucinated results. On the 12-topic benchmark the numbers show a jump from 42% to 92% metrics-bearing outputs, quality from 5.00 to 6.75, and a blind score of 52.0 versus 48.2 for AI-Scientist-v2, with most of the lift in code and execution.

The soft spot is exactly the isolation problem. The abstract and summary give no indication that the reflection baseline was re-run with identical model, prompt scaffolding, temperature, or termination rules; without that, the lift could come from any of those other factors rather than from the multi-hypothesis diagnosis. The benchmark is narrow, and no statistical tests or variance numbers are mentioned, so the results stay hard to read as clean evidence for the mechanism.
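
One way to make the isolation requirement checkable is to record both conditions as configuration objects and assert that they differ in exactly one field. The sketch below is hypothetical throughout: the `RunConfig` fields and the values `"model-x"`, `"single_reflection"`, and `"mhfa"` are placeholders, not settings reported by the paper.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run settings; field names are illustrative only."""
    model: str
    temperature: float
    max_retries: int
    max_steps: int
    recovery: str  # "single_reflection" or "mhfa"


def differing_fields(a, b):
    """Return the names of all fields on which two configs disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


baseline = RunConfig("model-x", 0.7, 3, 50, "single_reflection")
treatment = RunConfig("model-x", 0.7, 3, 50, "mhfa")

diff = differing_fields(baseline, treatment)
# The comparison is only a clean ablation if the recovery module is the sole difference.
assert diff == ["recovery"], f"conditions are not matched: {diff}"
print(diff)
```

Publishing such a pair of configs alongside the results would let a reader confirm the parity claim without rerunning anything.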

This is for people working on agentic research systems who want explicit failure-handling recipes. It shows clear thinking about the recovery bottleneck and honest engagement with the limits of current agents. It deserves a serious referee because the problem is real and the proposal is specific, even though the current evidence needs controlled ablations and released code to hold up.

Referee Report

2 major / 2 minor

Summary. The paper proposes SAGE, a Self-correcting Autonomous Grounded Experimenter for autonomous research agents. Its core contribution is Multi-Hypothesis Failure Attribution (MHFA), which replaces single free-form reflection with structured causal diagnosis: it generates multiple evidence-grounded failure explanations from trajectory features, evaluates their severity, and routes the root cause to the appropriate intervention level (hypothesis, design, or implementation). A grounded reporting mechanism further constrains outputs to measured values to prevent hallucination. On a 12-topic, 5-domain benchmark, SAGE is reported to raise metrics-bearing outputs from 42% to 92%, artifact quality from 5.00 to 6.75/10, and blind scores over AI-Scientist-v2 (52.0 vs. 48.2), with gains concentrated in code development and execution.

Significance. If the performance lifts can be causally attributed to MHFA rather than uncontrolled implementation differences, the work would supply a concrete, mechanism-level improvement to failure recovery in autonomous scientific agents—an acknowledged bottleneck. The emphasis on deterministic routing and explicit grounding constraints addresses hallucination and context-loss problems that single-reflection baselines commonly exhibit. Reproducibility of the benchmark and isolation of the MHFA module would make the result a useful reference point for subsequent agent designs.

major comments (2)

[Experimental Evaluation] The manuscript provides no description of how the reflection baseline was implemented (model version, prompt scaffolding, temperature/sampling settings, retry logic, or termination criteria). Without explicit parity controls under which the baseline differs from SAGE only in the MHFA module, the reported gains (42% → 92% metrics-bearing outputs, 5.00 → 6.75 artifact quality) cannot be attributed specifically to the proposed mechanism rather than to richer prompting or other unstated differences.
[Benchmark and Results] No information is given on topic-selection criteria, exclusion rules, statistical tests, or inter-rater reliability for the blind scoring (52.0 vs. 48.2). These omissions make it impossible to assess whether the 12-topic, 5-domain set is representative or whether evaluator bias could explain the observed differences.
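
To show what a basic significance check on the headline number could look like, the sketch below runs a two-sided Fisher exact test using only the standard library. The counts are an assumption: with 12 topics and one run per topic, 42% and 92% are consistent with 5 of 12 and 11 of 12 metrics-bearing outputs, but the source does not state the raw counts. Fisher's test also treats the two conditions as independent samples; because the same topics are used in both, a paired test such as McNemar's would be more appropriate if per-topic outcomes were available.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Hypergeometric probability of k successes in the first row.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    # Sum every table at least as unlikely as the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Assumed counts (not from the paper): rows are SAGE and baseline,
# columns are outputs with and without real metrics.
p = fisher_exact_two_sided(11, 1, 5, 7)
print(round(p, 4))
```

Under these assumed counts the p-value comes out near 0.027, which illustrates how little room a 12-topic benchmark leaves: a change of one or two outcomes would move the result across conventional thresholds.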

minor comments (2)

[Abstract] The abstract states quantitative improvements but does not cross-reference the sections that would contain the required experimental controls and statistical details.
[Method] Notation for MHFA components (e.g., severity scoring, routing rules) should be introduced with explicit definitions or pseudocode to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental controls and benchmark transparency. We address each major comment below and will revise the manuscript to improve reproducibility and attribution of results to the MHFA mechanism.

Referee: [Experimental Evaluation] The manuscript provides no description of how the reflection baseline was implemented (model version, prompt scaffolding, temperature/sampling settings, retry logic, or termination criteria). Without explicit parity controls under which the baseline differs from SAGE only in the MHFA module, the reported gains (42% → 92% metrics-bearing outputs, 5.00 → 6.75 artifact quality) cannot be attributed specifically to the proposed mechanism rather than to richer prompting or other unstated differences.

Authors: We agree that the baseline implementation details were insufficiently specified, preventing clear isolation of the MHFA contribution. In the revised manuscript we will add a dedicated subsection describing the reflection baseline in full, including the exact model version, prompt scaffolding, temperature and sampling settings, retry logic, and termination criteria. We will also document the parity controls, stating that all other components (trajectory logging, grounded reporting, intervention routing outside MHFA, etc.) are identical between conditions.
revision: yes

Referee: [Benchmark and Results] No information is given on topic-selection criteria, exclusion rules, statistical tests, or inter-rater reliability for the blind scoring (52.0 vs. 48.2). These omissions make it impossible to assess whether the 12-topic, 5-domain set is representative or whether evaluator bias could explain the observed differences.

Authors: We acknowledge the lack of these details. The revised version will include explicit topic-selection criteria, exclusion rules, and a description of how the 12-topic, 5-domain set was assembled. For the blind scoring, we will report any available inter-rater reliability statistics or note that single-rater scoring was used if that was the case; we will also add a brief discussion of representativeness. If formal statistical tests were not performed, we will state this limitation and, where feasible, compute and report basic significance measures on the existing data.
revision: partial
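
If the blind ratings can be reduced to a categorical judgment per item (for example, which system each rater preferred), Cohen's kappa is one common inter-rater reliability statistic. The sketch below is illustrative only: the rating lists are made up for the example and are not data from the paper, and the source does not say how many raters were used or on what scale they scored.

```python
from collections import Counter


def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(r1)
    observed = sum(x == y for x, y in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


# Made-up ratings for illustration only; not data from the paper.
rater_a = ["A", "A", "B", "B", "A", "B"]
rater_b = ["A", "B", "B", "B", "A", "A"]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

For numeric scores such as the 52.0 vs. 48.2 totals, an intraclass correlation or a weighted kappa would be a closer fit than the unweighted form shown here.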

Circularity Check

0 steps flagged

No circularity; purely empirical claims with no derivations or self-referential reductions

The paper contains no equations, mathematical derivations, fitted parameters presented as predictions, or load-bearing self-citations that reduce a central claim to its own inputs by construction. All reported results (42%→92% metrics-bearing outputs, 5.00→6.75 artifact quality, 52.0 vs 48.2 blind scores) are empirical comparisons on a described benchmark. While experimental controls for isolating MHFA could be debated on methodological grounds, this is not circularity under the defined patterns (self-definitional, fitted-input-as-prediction, etc.). The work is self-contained against its external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, background axioms, or new postulated entities beyond the named system and mechanism.


work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.10362 2024
[96]

Tongyi DeepResearch Technical Report

Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.24701 2025
[97]

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Meng Chen, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wa...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[98]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning ...

2024
[99]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. CoRR, abs/2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[100]

Evoscientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. Evoscientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery. CoRR, abs/2603.08127, 2026

work page arXiv 2026
[101]

Eureka: Human-level reward design via coding large language models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De - An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://op...

2024
[102]

B. P. MacLeod, F. G. L. Parlane, T. D. Morrissey, F. Häse, L. M. Roch, K. E. Dettelbach, R. Moreira, L. P. E. Yunker, M. B. Rooney, J. R. Deeth, V. Lai, G. J. Ng, H. Situ, R. H. Zhang, M. S. Elliott, T. H. Haley, D. J. Dvorak, A. Aspuru-Guzik, J. E. Hein, and C. P. Berlinguette. Self-driving laboratory for accelerated discovery of thin-film materials. Sci...

work page doi:10.1126/sciadv.aaz8867 2020
[103]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In NeurIPS, 2023

2023
[104]

Discoverybench: Towards data-driven discovery with large language models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28...

2025
[105]

Arik, and Tomas Pfister

Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan \" O . Arik, and Tomas Pfister. MLE-STAR: machine learning engineering agent via search and targeted refinement. CoRR, abs/2506.15692, 2025. doi:10.48550/ARXIV.2506.15692. URL https://doi.org/10.48550/arXiv.2506.15692

work page doi:10.48550/arxiv.2506.15692 2025
[106]

Alexander Novikov, Ng \^ a n Vu, Marvin Eisenberger, Emilien Dupont, Po - Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.13131 2025
[107]

John R. Platt. Strong inference. Science, 146 0 (3642): 0 347--353, 1964. doi:10.1126/science.146.3642.347. URL https://www.science.org/doi/abs/10.1126/science.146.3642.347

work page doi:10.1126/science.146.3642.347 1964
[108]

WebResearcher: Unleashing unbounded reasoning capability in long-horizon agents.arXiv preprint arXiv:2509.13309,

Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents. CoRR, abs/2509.13309, 2025. doi:10.48550/ARXIV.2509.13309. URL https://doi.o...

work page doi:10.48550/arxiv.2509.13309 2025
[109]

Agent laboratory: Using LLM agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In EMNLP (Findings) , pages 5977--6043. Association for Computational Linguistics, 2025

2025
[110]

Kanell, Peter Xu, Omar Khattab, and Monica S

Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, and Monica S. Lam. Assisting in writing wikipedia-like articles from scratch with large language models, 2024. URL https://arxiv.org/abs/2402.14207

work page arXiv 2024
[111]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In NeurIPS, 2023

2023
[112]

Skarlinski, Sam Cox, Jon M

Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela M. Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge. CoRR, abs/2409.13740, 2024. doi:10.48550/ARXIV.2409.13740. URL https://doi.org/10.48550/arXiv.2409.13740

work page doi:10.48550/arxiv.2409.13740 2024

[41] [73]

Hwang, Varsha Kishore, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke S

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Jenna Sparks, Jena D. Hwang, Varsha Kishore, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke S. Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang...

2026

[42] [74]

17682–17690

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Confe...

work page doi:10.1609/aaai.v38i16.29720 2024

[43] [75]

Emergent autonomous scientific research capabilities of large language models

Daniil A. Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models. CoRR, abs/2304.05332, 2023. doi:10.48550/ARXIV.2304.05332. URL https://doi.org/10.48550/arXiv.2304.05332

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.05332 2023

[44] [76]

Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D

Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D. White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nat. Mac. Intell., 6 0 (5): 0 525--535, 2024. doi:10.1038/S42256-024-00832-8. URL https://doi.org/10.1038/s42256-024-00832-8

work page doi:10.1038/s42256-024-00832-8 2024

[45] [77]

T. C. Chamberlin. The method of multiple working hypotheses. Science, 148 0 (3671): 0 754--759, 1965. doi:10.1126/science.148.3671.754. URL https://www.science.org/doi/abs/10.1126/science.148.3671.754

work page doi:10.1126/science.148.3671.754 1965

[46] [78]

Mle-bench: Evaluating machine learning agents on machine learning engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. Mle-bench: Evaluating machine learning agents on machine learning engineering. In ICLR . OpenReview.net, 2025

2025

[47] [79]

Mlr-bench: Evaluating ai agents on open-ended machine learning research.arXiv preprint arXiv:2505.19955, 2025

Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, and Bryan Hooi. Mlr-bench: Evaluating AI agents on open-ended machine learning research. CoRR, abs/2505.19955, 2025 a . doi:10.48550/ARXIV.2505.19955. URL https://doi.org/10.48550/arXiv.2505.19955

work page doi:10.48550/arxiv.2505.19955 2025

[48] [80]

Teaching large language models to self-debug

Xinyun Chen, Maxwell Lin, Nathanael Sch \" a rli, and Denny Zhou. Teaching large language models to self-debug. In ICLR . OpenReview.net, 2024

2024

[49] [81]

Baker, Benjamin Burns, Daniel Adu - Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun

Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu - Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. I...

2025

[50] [82]

ORGANA: A robotic assistant for automated chemistry experimentation and characterization

Kourosh Darvish, Marta Skreta, Yuchi Zhao, Naruki Yoshikawa, Sagnik Som, Miroslav Bogdanovic, Yang Cao, Han Hao, Haoping Xu, Al \' a n Aspuru - Guzik, Animesh Garg, and Florian Shkurti. ORGANA: A robotic assistant for automated chemistry experimentation and characterization. CoRR, abs/2401.06949, 2024. doi:10.48550/ARXIV.2401.06949. URL https://doi.org/10...

work page doi:10.48550/arxiv.2401.06949 2024

[51] [83]

Tenenbaum, and Igor Mordatch

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In ICML , volume 235 of Proceedings of Machine Learning Research, pages 11733--11763. PMLR / OpenReview.net, 2024

2024

[52] [84]

Camyla: Scaling Autonomous Research in Medical Image Segmentation

Yifan Gao, Haoyue Li, Feng Yuan, Xin Gao, Weiran Huang, and Xiaosong Wang. Camyla: Scaling autonomous research in medical image segmentation. CoRR, abs/2604.10696, 2026. doi:10.48550/ARXIV.2604.10696. URL https://doi.org/10.48550/arXiv.2604.10696

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.10696 2026

[53] [85]

Juraj Gottweis, Wei - Hung Weng, Alexander N. Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Gua...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [86]

CRITIC: large language models can self-correct with tool-interactive critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: large language models can self-correct with tool-interactive critiquing. In ICLR . OpenReview.net, 2024

2024

[55] [87]

Data interpreter: An LLM agent for data science

Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Ceyao Zhang, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Robert Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Yongxin Ni, Zhibin Gou, Zongze Xu, Yuyu Luo, and Chenglin Wu. Data int...

work page doi:10.18653/v1/2025.findings-acl.1016 2025

[56] [88]

Carter, Xin Zhou, Matthew Wheeler, Jonathan A

Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, Di Yin, Shruti Marwaha, Jennefer N. Carter, Xin Zhou, Matthew Wheeler, Jonathan A. Bernstein, Mengdi Wang, Peng He, Jingtian Zhou, Michael Snyder, Le Cong, Aviv Regev, and Jure Leskovec. Biomni: A general-purpose biomedical ai agent. b...

work page doi:10.1101/2025.05.30.656746 2025

[57] [89]

Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery

InternScience Team . Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery. CoRR, abs/2602.08990, 2026. doi:10.48550/ARXIV.2602.08990. URL https://doi.org/10.48550/arXiv.2602.08990

work page doi:10.48550/arxiv.2602.08990 2026

[58] [90]

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: ai-driven exploration in the space of code. CoRR, abs/2502.13138, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [91]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

2024

[60] [92]

King, Ken E

Ross D. King, Ken E. Whelan, Ffion Mair Jones, Philip G. K. Reiser, Christopher H. Bryant, Stephen H. Muggleton, Douglas B. Kell, and Stephen G. Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 427: 0 247--252, 2004. URL https://api.semanticscholar.org/CorpusID:4428725

2004

[61] [93]

King, Jem Rowland, Stephen G

Ross D. King, Jem Rowland, Stephen G. Oliver, Michael Young, Wayne Aubrey, Emma Byrne, Maria Liakata, Magdalena Markham, Pinar Pir, Larisa N. Soldatova, Andrew Sparkes, Kenneth E. Whelan, and Amanda Clare. The automation of science. Science, 324 0 (5923): 0 85--89, 2009. doi:10.1126/science.1165620. URL https://www.science.org/doi/abs/10.1126/science.1165620

work page doi:10.1126/science.1165620 2009

[62] [94]

org/abs/2505.24785

Patrick Tser Jern Kon, Jiachen Liu, Xinyi Zhu, Qiuyi Ding, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia, and Ang Chen. Exp-bench: Can AI conduct AI research experiments? CoRR, abs/2505.24785, 2025. doi:10.48550/ARXIV.2505.24785. URL https://doi.org/10.48550/arXiv.2505.24785

work page doi:10.48550/arxiv.2505.24785 2025

[63] [95]

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. Lab-bench: Measuring capabilities of language models for biology research. CoRR, abs/2407.10362, 2024. doi:10.48550/ARXIV.2407.10362. URL https://doi.org/10.48550/arXiv.2407.10362

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.10362 2024

[64] [96]

Tongyi DeepResearch Technical Report

Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.24701 2025

[65] [97]

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Meng Chen, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wa...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[66] [98]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning ...

2024

[67] [99]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. CoRR, abs/2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [100]

Evoscientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. Evoscientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery. CoRR, abs/2603.08127, 2026

work page arXiv 2026

[69] [101]

Eureka: Human-level reward design via coding large language models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De - An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://op...

2024

[70] [102]

B. P. MacLeod, F. G. L. Parlane, T. D. Morrissey, F. Häse, L. M. Roch, K. E. Dettelbach, R. Moreira, L. P. E. Yunker, M. B. Rooney, J. R. Deeth, V. Lai, G. J. Ng, H. Situ, R. H. Zhang, M. S. Elliott, T. H. Haley, D. J. Dvorak, A. Aspuru-Guzik, J. E. Hein, and C. P. Berlinguette. Self-driving laboratory for accelerated discovery of thin-film materials. Sci...

work page doi:10.1126/sciadv.aaz8867 2020

[71] [103]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In NeurIPS, 2023

2023

[72] [104]

Discoverybench: Towards data-driven discovery with large language models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28...

2025

[73] [105]

MLE-STAR: machine learning engineering agent via search and targeted refinement

Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arik, and Tomas Pfister. MLE-STAR: machine learning engineering agent via search and targeted refinement. CoRR, abs/2506.15692, 2025. doi:10.48550/ARXIV.2506.15692. URL https://doi.org/10.48550/arXiv.2506.15692

work page doi:10.48550/arxiv.2506.15692 2025

[74] [106]

Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and...

work page doi:10.48550/arxiv.2506.13131 2025

[75] [107]

John R. Platt. Strong inference. Science, 146(3642):347–353, 1964. doi:10.1126/science.146.3642.347. URL https://www.science.org/doi/abs/10.1126/science.146.3642.347

work page doi:10.1126/science.146.3642.347 1964

[76] [108]

WebResearcher: Unleashing unbounded reasoning capability in long-horizon agents

Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents. CoRR, abs/2509.13309, 2025. doi:10.48550/ARXIV.2509.13309. URL https://doi.o...

work page doi:10.48550/arxiv.2509.13309 2025

[77] [109]

Agent laboratory: Using LLM agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In EMNLP (Findings), pages 5977–6043. Association for Computational Linguistics, 2025

2025

[78] [110]

Assisting in writing wikipedia-like articles from scratch with large language models

Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, and Monica S. Lam. Assisting in writing wikipedia-like articles from scratch with large language models, 2024. URL https://arxiv.org/abs/2402.14207

work page arXiv 2024

[79] [111]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In NeurIPS, 2023

2023

[80] [112]

Language agents achieve superhuman synthesis of scientific knowledge

Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela M. Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge. CoRR, abs/2409.13740, 2024. doi:10.48550/ARXIV.2409.13740. URL https://doi.org/10.48550/arXiv.2409.13740

work page doi:10.48550/arxiv.2409.13740 2024