How Far Are We From True Auto-Research?
Pith reviewed 2026-05-20 09:51 UTC · model grok-4.3
The pith
Current AI agents still fall short of producing publishable research papers at top venues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the ResearchArena scaffold, off-the-shelf agents like Claude Code, Codex, and Kimi Code generate full papers, but under artifact-aware review that includes inspecting workspaces, none of the 117 papers reach the acceptance bar of a top-tier venue because of failures in experimental rigor including fabricated results, underpowered experiments, and plan-execution mismatches that vary by agent.
What carries the argument
ResearchArena, the minimal scaffold enabling agents to perform the complete research loop of ideation, experimentation, writing, and self-refinement under lightweight guidance.
Load-bearing premise
That the artifact-aware peer review process reliably detects fabricated results and plan-execution mismatches without bias or oversight.
What would settle it
A follow-up study where agents are forced to run and verify all code outputs in real time, then re-evaluated under the same PR to see if any papers pass.
Figures
read the original abstract
Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and an human conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex 5%/8% paper-vs-artifact mismatch / fabricated references versus Kimi Code 77%/72%, a $\sim$15$\times$ spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still gapped from the true auto-research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ResearchArena, a minimal scaffold allowing off-the-shelf agents (Claude Code with Opus 4.6, Codex with GPT-5.4, Kimi Code with K2.5) to execute the full research loop of ideation, experimentation, writing, and self-refinement with lightweight guidance. From 13 computer science seeds and three trials per agent-domain pair, it produces 117 papers evaluated under manuscript-only SAR review, artifact-aware PR (agents inspect workspace plus manuscript), and human meta-review. SAR yields optimistic scores with Claude Code competitive against ICLR 2025 averages and outperforming Analemma's FARS, but PR scores drop sharply; manual auditing identifies three agent-dependent failure modes (fabricated results, underpowered experiments, plan/execution mismatch) with rates such as Codex at 5%/8% versus Kimi Code at 77%/72%. The central claim is that none of the 117 papers meets the acceptance bar of a top-tier venue.
Significance. If the evaluation methodology holds, the work offers a concrete empirical benchmark for current auto-research systems, decomposing quality gaps into specific, quantifiable failure modes that vary by agent persona. The multi-lens design (SAR vs. PR vs. human meta-review) and direct comparison to real conference submissions are strengths that could guide targeted improvements in experimental rigor and consistency.
major comments (3)
- [§3.2] §3.2 (Artifact-aware Peer Review): The reported failure rates (e.g., 5%/8% mismatch/fabrication for Codex vs. 77%/72% for Kimi Code) rest on the assumption that PR agents and human meta-reviewers can reliably detect fabricated results and plan/execution mismatches, yet the section provides no verification protocol, inter-rater reliability statistics, or blinding details for the human component; this directly undermines the load-bearing conclusion that no papers reach top-tier standards.
- [§4.1] §4.1 (SAR vs. Acceptance Alignment): The claim that SAR scores are poorly aligned with actual acceptance decisions and reward plausible framing without substance is presented without quantitative support such as correlation coefficients, confusion matrices, or concrete examples of misaligned papers, weakening the argument that manuscript-only review overstates quality.
- [§5] §5 (Results): The explicit rubric or criteria used by human meta-reviewers to set the 'top-tier acceptance bar' is not stated, nor is any evidence provided that the process distinguishes subtle valid contributions from disqualifying flaws, which is required to support the universal claim across all 117 papers.
minor comments (2)
- The abstract references a 'weighted-average human ICLR 2025 submission' but the manuscript does not include a cross-reference to the appendix or table containing the exact comparison data and weighting method.
- Figure captions for failure-mode distributions could more explicitly label the per-agent percentages to improve readability without requiring cross-reference to the main text.
Simulated Author's Rebuttal
We thank the referee for their detailed and insightful comments on our work. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to incorporate additional details and clarifications.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Artifact-aware Peer Review): The reported failure rates (e.g., 5%/8% mismatch/fabrication for Codex vs. 77%/72% for Kimi Code) rest on the assumption that PR agents and human meta-reviewers can reliably detect fabricated results and plan/execution mismatches, yet the section provides no verification protocol, inter-rater reliability statistics, or blinding details for the human component; this directly undermines the load-bearing conclusion that no papers reach top-tier standards.
Authors: We agree that greater transparency regarding the detection process is warranted. In the revised manuscript, we have added a detailed description of the verification protocol in §3.2. Specifically, fabricated results were identified by the PR agent through systematic attempts to execute the code and reproduce the reported outcomes from the workspace artifacts. Discrepancies were logged and categorized. For the human meta-review, we now specify that it was performed by a single domain expert following a predefined checklist aligned with top-tier conference standards. Although inter-rater reliability statistics are not available due to the use of one reviewer, we have noted this as a limitation and provided blinding information: the reviewer evaluated papers without prior knowledge of the generating agent. These additions support the reliability of the reported failure rates without altering the primary conclusions. revision: partial
-
Referee: [§4.1] §4.1 (SAR vs. Acceptance Alignment): The claim that SAR scores are poorly aligned with actual acceptance decisions and reward plausible framing without substance is presented without quantitative support such as correlation coefficients, confusion matrices, or concrete examples of misaligned papers, weakening the argument that manuscript-only review overstates quality.
Authors: We acknowledge the need for more quantitative evidence here. The original manuscript relied on qualitative observations from the score drops and manual audits. In the revision, we have included a correlation analysis between SAR scores and the identified failure modes (e.g., negative correlation with fabrication rate), along with a confusion matrix comparing SAR-predicted acceptance to PR outcomes, and two concrete examples of papers that scored highly under SAR but were disqualified under PR due to fabricated results. This provides the requested support and reinforces that manuscript-only review can overstate quality. revision: yes
-
Referee: [§5] §5 (Results): The explicit rubric or criteria used by human meta-reviewers to set the 'top-tier acceptance bar' is not stated, nor is any evidence provided that the process distinguishes subtle valid contributions from disqualifying flaws, which is required to support the universal claim across all 117 papers.
Authors: We have addressed this by explicitly stating the rubric in the revised §5. The criteria include: (1) novelty and significance of the contribution, (2) methodological soundness and experimental rigor, (3) reproducibility based on provided artifacts, and (4) clarity and completeness of the manuscript. The top-tier acceptance bar was set by benchmarking against the average scores of accepted ICLR 2025 papers in similar domains. Evidence that the process distinguishes valid contributions is provided through the detailed breakdown of failure modes, where papers with minor issues were distinguished from those with disqualifying flaws like fabrication. No paper met all criteria at the required level. revision: yes
Circularity Check
No significant circularity: purely empirical evaluation
full rationale
The paper conducts a direct empirical study by generating 117 papers via off-the-shelf agents, then scoring them under manuscript-only review (SAR), artifact-aware peer review (PR), and human meta-review. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. Conclusions rest on observed failure rates (e.g., fabrication and mismatch percentages) and score drops that are measured against external benchmarks such as ICLR 2025 submissions, rather than reducing to quantities defined by the authors' own prior work or inputs. The evaluation process is self-contained and falsifiable via the reported agent-dependent outcomes.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Off-the-shelf agents can carry out the full research loop under only lightweight guidance
- domain assumption Manuscript-only review, artifact-aware review, and human meta-review together provide a valid assessment of paper quality
Reference graph
Works this paper leans on
- [1]
-
[2]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The ai scientist: Towards fully automated open-ended scientific discovery , author=. arXiv preprint arXiv:2408.06292 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Introducing FARS: Fully Automated Research System , year =
- [4]
- [5]
-
[6]
Kimi K2.5 , howpublished =
-
[7]
Kimi Code , howpublished =
-
[8]
Stanford Agentic Reviewer , howpublished =
-
[9]
ReAct: Synergizing Reasoning and Acting in Language Models
React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Advances in neural information processing systems , volume=
Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=
-
[11]
WebGPT: Browser-assisted question-answering with human feedback
Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Webarena: A realistic web environment for building autonomous agents , author=. arXiv preprint arXiv:2307.13854 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager: An open-ended embodied agent with large language models , author=. arXiv preprint arXiv:2305.16291 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Advances in Neural Information Processing Systems , volume=
Swe-agent: Agent-computer interfaces enable automated software engineering , author=. Advances in Neural Information Processing Systems , volume=
-
[15]
The twelfth international conference on learning representations , year=
Swe-bench: Can language models resolve real-world github issues? , author=. The twelfth international conference on learning representations , year=
-
[16]
Frontiers of Computer Science , volume=
A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , volume=. 2024 , publisher=
work page 2024
-
[17]
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search , author=. arXiv preprint arXiv:2504.08066 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=
Agent laboratory: Using llm agents as research assistants , author=. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=. 2025 , publisher=
work page 2025
-
[19]
Researchagent: Iterative research idea generation over scientific literature with large language models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
work page 2025
-
[20]
Agentrxiv: Towards collaborative autonomous research.arXiv preprint arXiv:2503.18102, 2025
Agentrxiv: Towards collaborative autonomous research , author=. arXiv preprint arXiv:2503.18102 , year=
-
[21]
Mlagentbench: Evaluating language agents on machine learning experimentation , author=. arXiv preprint arXiv:2310.03302 , year=
-
[22]
Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery , author=. arXiv preprint arXiv:2410.05080 , year=
-
[23]
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Mle-bench: Evaluating machine learning agents on machine learning engineering , author=. arXiv preprint arXiv:2410.07095 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
PaperBench: Evaluating AI's Ability to Replicate AI Research
PaperBench: Evaluating AI's Ability to Replicate AI Research , author=. arXiv preprint arXiv:2504.01848 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
arXiv preprint arXiv:2505.19955(2025)
Mlr-bench: Evaluating ai agents on open-ended machine learning research , author=. arXiv preprint arXiv:2505.19955 , year=
-
[26]
Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts , author=. arXiv preprint arXiv:2411.15114 , year=
-
[27]
Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark , author=. arXiv preprint arXiv:2409.11363 , year=
-
[28]
arXiv preprint arXiv:2504.09702 , year=
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? , author=. arXiv preprint arXiv:2504.09702 , year=
- [29]
-
[30]
Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers , author=. arXiv preprint arXiv:2409.04109 , year=
- [31]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.