Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy

Chugang Yi; Haizhao Yang; Jianda Du; Jiaxuan Guo; Kejia Zhang; Xingyu Ren; Youran Sun

arXiv: 2606.24177 · v1 · pith:DCXNYLKXnew · submitted 2026-06-23 · 💻 cs.SE · cs.AI · cs.CL · cs.MA

Youran Sun, Xingyu Ren, Chugang Yi, Jiaxuan Guo, Kejia Zhang, Jianda Du, Haizhao Yang

Pith reviewed 2026-06-25 23:22 UTC · model grok-4.3

classification 💻 cs.SE cs.AI cs.CL cs.MA

keywords autonomous research, prompt economy, large language models, research orchestration, failure taxonomy, omnidisciplinary systems, zero-code workflows

The pith

Agon shows prompt economy loops can scale research production across domains while a taxonomy separates machine-fixable failures from those needing human judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agon as a research orchestrator that validates checkable claims inside automated workflows and leaves the rest to human scientists. It operated for 444 iterations across multiple domains using only small starting topics and no human-written experimental code. The runs establish scalability of the approach while surfacing failures that are organized into a taxonomy along severity, fixability, visibility, and capability locus. This taxonomy identifies which issues the loops can detect and correct versus those that require external judgment. The overall result frames a division in which machines handle research scale and humans provide steering.

Core claim

Agon is an autonomous large-scale omnidisciplinary research system built on the six principles of Prompt Economy, Future-Facing, Minimal Prompts, OmniDisciplinary, Massive Parallelism, and Zero-Code. When run for 444 iterations across domains from minimal starting topics and without any human-written experimental code, it validates what can be checked inside the workflow and exposes new classes of failure. These failures are organized into a taxonomy along severity, fixability, visibility, and capability locus that separates issues the loops can see and fix from those that require human judgment.

What carries the argument

Prompt Economy loops that autonomously generate research artifacts and validate checkable claims inside the workflow.

If this is right

Research production can proceed at larger scales when only initial topics are supplied by humans.
Failures in autonomous research systems can be systematically grouped to clarify the boundary between machine and human roles.
The taxonomy provides a practical way to route tasks so that loops handle visible and fixable problems while humans address the remainder.
Omnidisciplinary operation becomes feasible when the same loop structure applies across fields without custom code.
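The routing idea in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the source names the four axes (severity, fixability, visibility, capability locus) but gives no values for them, so the enum members, the 1 to 3 severity scale, and the `route` function are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    VISIBLE = "visible"  # assumed: the loop can observe the failure
    HIDDEN = "hidden"    # assumed: the loop cannot observe it


class Fixability(Enum):
    FIXABLE = "fixable"          # assumed: the loop can correct it
    NOT_FIXABLE = "not_fixable"  # assumed: correction needs outside input


@dataclass(frozen=True)
class Failure:
    description: str
    severity: int            # hypothetical scale: 1 (minor) to 3 (critical)
    fixability: Fixability
    visibility: Visibility
    capability_locus: str    # hypothetical free-text label, e.g. "model"


def route(failure: Failure) -> str:
    """Send a failure to the loop only if it is both visible and fixable."""
    if (failure.visibility is Visibility.VISIBLE
            and failure.fixability is Fixability.FIXABLE):
        return "loop"
    return "human"


if __name__ == "__main__":
    example = Failure("unit mismatch in a plot axis", 1,
                      Fixability.FIXABLE, Visibility.VISIBLE, "tooling")
    print(route(example))
```

Under this sketch everything that is not both visible and fixable goes to a human, which is the conservative reading of the division described above; a real system might also weigh severity.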

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Testing the same loops on narrower domains could reveal whether the machine-human boundary shifts with field-specific checkability.
The taxonomy might serve as a design tool for other agent systems by making explicit which capabilities must remain with humans.
Repeated deployments could track whether the fraction of machine-fixable failures decreases as the loops accumulate experience.

Load-bearing premise

That the prompt economy loops can reliably validate checkable claims inside the workflow and that the derived failure taxonomy accurately distinguishes machine-fixable issues from those requiring human judgment, all without any human-written experimental code.

What would settle it

A run in which Agon produces a claim whose truth value is independently verifiable yet the internal loops accept an incorrect conclusion, or a failure whose classification under the taxonomy does not match independent review of its fixability.
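The second test described here, comparing the taxonomy's classification against independent review, amounts to a label-agreement check. The sketch below assumes each failure has an identifier and a routing label ("loop" or "human"); those identifiers, labels, and the `compare_labels` helper are hypothetical and do not come from the paper.

```python
def compare_labels(system_labels, reviewer_labels):
    """Compare two {failure_id: label} mappings on the ids they share.

    Returns (agreement, mismatches): agreement is the fraction of shared
    ids with identical labels (None if nothing is shared), and mismatches
    lists the ids whose labels differ.
    """
    shared = sorted(set(system_labels) & set(reviewer_labels))
    mismatches = [fid for fid in shared
                  if system_labels[fid] != reviewer_labels[fid]]
    if not shared:
        return None, mismatches
    agreement = 1 - len(mismatches) / len(shared)
    return agreement, mismatches


if __name__ == "__main__":
    # Toy labels invented for illustration only.
    system = {"f1": "loop", "f2": "human", "f3": "loop"}
    reviewer = {"f1": "loop", "f2": "loop", "f3": "loop", "f4": "human"}
    print(compare_labels(system, reviewer))
```

Any id in the mismatch list is a candidate counterexample of the kind this section describes; a single mismatch does not refute the taxonomy, but a pattern of them would.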

Figures

Figures reproduced from arXiv: 2606.24177 by Chugang Yi, Haizhao Yang, Jianda Du, Jiaxuan Guo, Kejia Zhang, Xingyu Ren, Youran Sun.

**Figure 1.** System overview of Agon. The workflow starts from either topic-radar or human topic selection, then proceeds through idea, proposal, experiment, and paper factories. Each factory advances a research artifact through role-specific agent loops, while deep-literature research supplies reusable context to multiple stages.
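The stage order in the Figure 1 caption can be made concrete with a small sketch. Only the stage names and the two topic sources come from the caption; the `Artifact` class, `run_factory`, and the way literature context is passed around are placeholders invented here, and the agent loops inside each factory are not modeled.

```python
from dataclasses import dataclass, field

# Stage order taken from the Figure 1 caption.
STAGES = ["idea", "proposal", "experiment", "paper"]
TOPIC_SOURCES = ("topic-radar", "human")


@dataclass
class Artifact:
    topic: str
    source: str
    history: list = field(default_factory=list)


def run_factory(stage, artifact, literature_context):
    """Placeholder for a factory's role-specific agent loops.

    Here it only records that the stage ran and how many literature
    notes were available to it.
    """
    artifact.history.append((stage, len(literature_context)))
    return artifact


def run_pipeline(topic, source, literature_context):
    if source not in TOPIC_SOURCES:
        raise ValueError(f"unknown topic source: {source}")
    artifact = Artifact(topic, source)
    for stage in STAGES:
        # The same literature context is reused by every stage.
        artifact = run_factory(stage, artifact, literature_context)
    return artifact


if __name__ == "__main__":
    result = run_pipeline("a small starting topic", "human", ["note 1"])
    print([stage for stage, _ in result.history])
```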

Original abstract

Large language models are making research production scalable, shifting the bottleneck from producing artifacts to judging claims. We present Agon, a research orchestrator that validates what can be checked inside the workflow and leaves the remaining judgments to human scientists. Agon is built on six design principles: Prompt Economy, Future-Facing, Minimal Prompts, OmniDisciplinary, Massive Parallelism, and Zero-Code. We ran Agon across domains for 444 iterations of Prompt Economy loops, using only small starting topics and no human-written experimental code. These deployments demonstrate scalability while exposing new classes of failure. We organize these failures into a taxonomy along severity, fixability, visibility, and capability locus. The taxonomy separates failures the loops can see and fix from those that require human judgment. Together, these results show that Agon is pushing research toward a new paradigm: machine scales, human steers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Agon asserts 444 autonomous research iterations and a new failure taxonomy but supplies no examples, metrics, outputs, or validation details to support those claims.

The paper introduces Agon, an LLM orchestrator built on six principles including Prompt Economy and Zero-Code. It claims to have run 444 iterations across domains starting from small topics, with no human-written experimental code, and to have produced a taxonomy of failures organized by severity, fixability, visibility, and capability locus. The taxonomy is meant to separate issues the loops can detect and fix from those needing human judgment.

What the work does is name the shift from artifact production to claim judgment as the new bottleneck, and it sketches a framework that tries to keep the loops self-contained where possible. The six principles are stated clearly enough to understand the intended design.

The central problem is that none of the claims are backed by evidence in the text. There are no reported research outputs, no success rates on validated claims, no concrete examples of a classified failure, and no trace of how any loop actually checked a result or observed a failure. The number 444 stands alone without context on what domains were covered or what was achieved. Without those details the distinction between machine-fixable and human-required failures cannot be evaluated, and the scalability assertion remains untested.

The absence of comparisons to prior LLM-agent or automated-research systems also leaves the novelty of the framework and taxonomy hard to judge. The circularity risk around self-defined concepts like Prompt Economy is present but secondary to the missing empirical grounding.

This is the kind of high-level system sketch that might interest groups already building multi-agent research tools, but it offers little for readers who need methods or results they can examine. It does not look ready for peer review in its current state; the missing validation steps would need to be added before a referee could assess whether the taxonomy actually works as described.

Referee Report

3 major / 1 minor

Summary. The manuscript presents Agon, an autonomous research orchestrator built on six design principles (Prompt Economy, Future-Facing, Minimal Prompts, OmniDisciplinary, Massive Parallelism, Zero-Code). It reports running the system for 444 iterations across domains from small starting topics with no human-written experimental code, claims these runs demonstrate scalability while exposing failures, and organizes the failures into a taxonomy along severity, fixability, visibility, and capability locus that separates machine-fixable issues from those requiring human judgment, advancing a paradigm of machine scaling with human steering.

Significance. If the empirical claims and taxonomy validation hold with supporting data, the work could represent a notable contribution to automated research systems by showing how LLM-based loops can handle internal validation at scale while deferring only select judgments. The zero-code and omnidisciplinary framing would be strengths if demonstrated with concrete outputs and reproducible traces.

major comments (3)

[Abstract] The central claim that 444 iterations demonstrate scalability and yield a taxonomy separating machine-fixable from human-required failures lacks any reported metrics, success rates, concrete research outputs, examples of autonomously validated claims, or traces of how failures were observed and classified.
[Abstract/Results] The taxonomy is asserted to distinguish failures the loops can see and fix from those requiring human judgment, but no evidence, examples, or classification procedure is supplied, rendering the distinction unevaluable and the paradigm-shift assertion unsupported.
[Methods/Implementation] No details are provided on how Prompt Economy loops validate checkable claims inside the workflow, what the small starting topics were, or how the zero-code constraint was maintained across iterations, all of which are load-bearing for the autonomy and validation claims.

minor comments (1)

[Abstract] The term 'Prompt Economy' is used as a foundational concept without an explicit definition or grounding in the provided text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make targeted revisions to improve clarity, evidence, and reproducibility.

Referee: [Abstract] The central claim that 444 iterations demonstrate scalability and yield a taxonomy separating machine-fixable from human-required failures lacks any reported metrics, success rates, concrete research outputs, examples of autonomously validated claims, or traces of how failures were observed and classified.

Authors: We agree the abstract would be strengthened by including quantitative indicators. In revision we will add summary statistics (e.g., overall iteration throughput, fraction of claims autonomously validated, and counts of each failure category) drawn from the results section, plus one or two concrete examples of validated outputs and failure traces. The main text already reports the 444 iterations and taxonomy, but these numbers and examples will be elevated to the abstract for immediate visibility.
revision: yes

Referee: [Abstract/Results] The taxonomy is asserted to distinguish failures the loops can see and fix from those requiring human judgment, but no evidence, examples, or classification procedure is supplied, rendering the distinction unevaluable and the paradigm-shift assertion unsupported.

Authors: The results section presents the taxonomy along the four axes (severity, fixability, visibility, capability locus) with illustrative cases. To make the machine-vs-human distinction directly evaluable we will add (1) an explicit classification procedure subsection describing how visibility and fixability were assessed in each iteration and (2) a table or set of annotated examples showing which failures were resolved inside the loop versus those escalated. This will also reinforce the paradigm claim with traceable evidence.
revision: yes

Referee: [Methods/Implementation] No details are provided on how Prompt Economy loops validate checkable claims inside the workflow, what the small starting topics were, or how the zero-code constraint was maintained across iterations, all of which are load-bearing for the autonomy and validation claims.

Authors: We will expand the Methods section with three new subsections: (a) the internal validation protocol used by the Prompt Economy loops (including prompt templates for self-checking and cross-verification), (b) the exact list of small starting topics and domains, and (c) the engineering steps that enforce the zero-code rule (e.g., all code generation and execution handled exclusively by the LLM agents). These additions will directly support the autonomy and validation claims.
revision: yes
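The summary statistics promised in the first response (fraction of claims autonomously validated, counts per failure category) would be straightforward to tally from per-iteration logs. The sketch below assumes a log format that the paper does not describe: each record is a dict with a boolean `claim_validated` and a list of failure-category strings under `failures`. Both field names and the category strings are hypothetical.

```python
from collections import Counter


def summarize(iterations):
    """Tally validated-claim fraction and failure counts from log records.

    Each record is assumed to look like:
        {"claim_validated": bool, "failures": [category, ...]}
    """
    total = len(iterations)
    validated = sum(1 for it in iterations if it["claim_validated"])
    failure_counts = Counter(
        category for it in iterations for category in it["failures"]
    )
    return {
        "iterations": total,
        "validated_fraction": validated / total if total else 0.0,
        "failure_counts": dict(failure_counts),
    }


if __name__ == "__main__":
    # Toy records invented for illustration; not results from the paper.
    logs = [
        {"claim_validated": True, "failures": []},
        {"claim_validated": False, "failures": ["visible-fixable"]},
        {"claim_validated": True, "failures": ["visible-fixable", "hidden"]},
        {"claim_validated": False, "failures": []},
    ]
    print(summarize(logs))
```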

Circularity Check

0 steps flagged

No circularity: empirical description of system runs and taxonomy

The paper presents Agon as a system built on explicitly listed design principles including Prompt Economy, then reports running it for 444 iterations across domains with no human-written experimental code, observing failures, and organizing those failures into a taxonomy along four axes. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the text. The taxonomy is described as derived directly from the observed failures in the deployments rather than presupposed or fitted to match a prior result. The central claim that the taxonomy separates machine-fixable from human-required failures follows from the reported runs without any reduction to inputs by construction. This is a standard empirical systems paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces 'Prompt Economy' and the six design principles as foundational without external references or evidence in the abstract; no free parameters or axioms are explicitly stated.

invented entities (1)

Prompt Economy: no independent evidence
Purpose: core mechanism enabling autonomous research orchestration via prompt loops.
Presented as a new design principle without prior literature support visible in the abstract.

pith-pipeline@v0.9.1-grok · 5716 in / 1113 out tokens · 32046 ms · 2026-06-25T23:22:59.470429+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 12 linked inside Pith

[1]

Mlr-bench: Evaluating ai agents on open-ended machine learning research

Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, and Bryan Hooi. Mlr-bench: Evaluating ai agents on open-ended machine learning research. In Proceedings of the Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS 2025), Datasets and Benchmarks Track ,

2025
[2]

Ruiying Chen

URL https://arxiv.org/abs/2505.19955. Ruiying Chen. Evidence-bound autonomous research (evibound): A governance framework for eliminating false claims,

arXiv
[3]

Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, and Furong Huang

URL https://arxiv.org/abs/2511.05524. Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, and Furong Huang. Soundnessbench: Can your ai scientist really tell good research ideas from bad ones?,

arXiv
[4]

Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder

URL https://arxiv.org/abs/2605.30329. Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. Cas- cade: Cumulative agentic skill creation through autonomous development and evolution,

Pith/arXiv arXiv
[5]

Fengqing Jiang, Yichen Feng, Yuetai Li, Luyao Niu, Basel Alomair, and Radha Poovendran

URL https://arxiv.org/abs/2512.23880. Fengqing Jiang, Yichen Feng, Yuetai Li, Luyao Niu, Basel Alomair, and Radha Poovendran. Bad- scientist: Can a research agent write convincing but unsound papers that fool llm reviewers? In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026),

arXiv 2026
[6]

Priyanka Kargupta, Ishika Agarwal, Tal August, and Jiawei Han

URL https://arxiv.org/abs/2510.18003. Priyanka Kargupta, Ishika Agarwal, Tal August, and Jiawei Han. Tree-of-debate: Multi-persona debate trees elicit critical thinking for scientific comparative analysis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL

Pith/arXiv arXiv
[7]

URL https://arxiv.org/abs/2502.14767. Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Meng Chen, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie ...

arXiv
[8]

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha

URL https://arxiv.org/abs/2605.20025. Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery,

Pith/arXiv arXiv
[9]

org/abs/2408.06292

URL https://arxiv. org/abs/2408.06292. Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, and Tomas Pfister. Scientistone: Towards human-level autonomous research via chain-of- evidence,

Pith/arXiv arXiv
[10]

Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J

URL https://arxiv.org/abs/2605.26340. Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A ...

Pith/arXiv arXiv
[11]

39 Razeen A Rasheed, Somnath Banerjee, Animesh Mukherjee, and Rima Hazra

URL https://arxiv.org/abs/2506.13131. 39 Razeen A Rasheed, Somnath Banerjee, Animesh Mukherjee, and Rima Hazra. From fluent to verifiable: Claim-level auditability for deep research agents,

Pith/arXiv arXiv
[12]

Xingyu Ren, Youran Sun, Chugang Yi, Kejia Zhang, Jiaxuan Guo, Jianda Du, and Haizhao Yang

URL https://arxiv.org/ abs/2602.13855. Xingyu Ren, Youran Sun, Chugang Yi, Kejia Zhang, Jiaxuan Guo, Jianda Du, and Haizhao Yang. What’s missing in autonomous research? a systematization of systems, benchmarks, and verifica- tion,

arXiv
[13]

Preprint available on ResearchGate

URL https://www.researchgate.net/publication/406952713. Preprint available on ResearchGate. Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants,

arXiv
[14]

Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, and Stella Biderman

URL https://arxiv.org/abs/2501.04227. Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, and Stella Biderman. When ai co-scientists fail: Spot- a benchmark for automated verification of scientific research,

Pith/arXiv arXiv
[15]

URL https://arxiv.org/ abs/2505.11855. Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. Paperbench: Evaluating ai’s ability to replicate ai research,

arXiv
[16]

Youran Sun, Xingyu Ren, Kejia Zhang, Xinpeng Liu, and Jiaxuan Guo

URL https://arxiv.org/abs/2504.01848. Youran Sun, Xingyu Ren, Kejia Zhang, Xinpeng Liu, and Jiaxuan Guo. Perspectivegap: A benchmark for multi-agent orchestration prompting,

Pith/arXiv arXiv
[17]

Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, and James Zou

URL https://arxiv.org/abs/2505.18705. Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, and James Zou. Can llm feedback enhance review quality? a randomized study of 20k reviews at iclr 2025,

arXiv 2025
[18]

Chengcheng Wang, Qinhua Xie, Wei He, Jianyuan Guo, Shiqi Wang, and Chang Xu

URL https://arxiv.org/abs/2504.09737. Chengcheng Wang, Qinhua Xie, Wei He, Jianyuan Guo, Shiqi Wang, and Chang Xu. Sibyl- autoresearch: Autonomous research needs self-evolving trial-and-error harnesses, not paper gen- erators, 2026a. URL https://arxiv.org/abs/2605.22343. Yuanli Wang, Yaoyao Qian, Yue Zhang, Hanhan Zhou, Jindan Huang, Tianfu Fu, Qiuyang Ma...

arXiv
[19]

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha

URL https://arxiv.org/abs/2411.00816. Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search,

arXiv
[20]

Ruofeng Yang, Yongcan Li, and Shuai Li

URL https://arxiv.org/abs/2504.08066. Ruofeng Yang, Yongcan Li, and Shuai Li. Aris: Autonomous research via adversarial multi-agent collaboration,

Pith/arXiv arXiv
[21]

40 Jiakang Yuan, Xiangchao Yan, Shiyang Feng, Bo Zhang, Tao Chen, Botian Shi, Wanli Ouyang, Yu Qiao, Lei Bai, and Bowen Zhou

URL https://arxiv.org/abs/2605.03042. 40 Jiakang Yuan, Xiangchao Yan, Shiyang Feng, Bo Zhang, Tao Chen, Botian Shi, Wanli Ouyang, Yu Qiao, Lei Bai, and Bowen Zhou. Dolphin: Moving towards closed-loop auto-research through thinking, practice, and feedback,

Pith/arXiv arXiv
[22]

URL https://arxiv.org/abs/2501.03916. Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, and Lei Bai....

arXiv
[23]

URL https: //arxiv.org/abs/2505.16938. Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, Zhicheng Liu, Haojie Pan, Dingwei Zhu, Tao Gui, Qi Zhang, and Xuanjing Huang. Agentv-rl: Scaling reward modeling with agentic verifier. In Proceedings of the 64th Annual Meeting of the Ass...

arXiv 2026
[24]

Bing Zhou, Xiao Huang, Huan Ning, Qiusheng Wu, Diya Li, and Ziyi Zhang

URL https://arxiv.org/abs/2604.16004. Bing Zhou, Xiao Huang, Huan Ning, Qiusheng Wu, Diya Li, and Ziyi Zhang. Nora: A harness- engineered autonomous research agent for end-to-end spatial data science,

Pith/arXiv arXiv
[25]

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Weinan E, Siheng Chen, and Yanfeng Wang

URL https: //arxiv.org/abs/2605.02092. Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Weinan E, Siheng Chen, and Yanfeng Wang. Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering,

Pith/arXiv arXiv
[26]

URL https://arxiv.org/abs/ 2601.10402. 41

arXiv
