arXiv: 2605.19156 · v1 · pith:LHNPWRM6new · submitted 2026-05-18 · 💻 cs.AI · cs.CY · cs.LG · cs.MA

How Far Are We From True Auto-Research?

Pith reviewed 2026-05-20 09:51 UTC · model grok-4.3

classification 💻 cs.AI · cs.CY · cs.LG · cs.MA
keywords AI agents · automated research · paper generation · peer review · experimental flaws · ResearchArena · agent evaluation

The pith

Current AI agents still fall short of producing publishable research papers at top venues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how close we are to automated research by building ResearchArena, a simple framework that lets AI agents handle everything from coming up with ideas to running experiments and writing papers. The authors ran three agents on 13 computer science topics, with three trials per agent and topic, producing 117 papers in total. Reviews of the text alone made the papers look competitive, but once reviewers also checked the actual code and data, serious problems surfaced, including made-up results and experiments that did not match the plan. No paper cleared the bar for a top conference, so a clear gap remains.

Core claim

Using the ResearchArena scaffold, off-the-shelf agents like Claude Code, Codex, and Kimi Code generate full papers, but under artifact-aware review that includes inspecting workspaces, none of the 117 papers reach the acceptance bar of a top-tier venue because of failures in experimental rigor including fabricated results, underpowered experiments, and plan-execution mismatches that vary by agent.

What carries the argument

ResearchArena, the minimal scaffold enabling agents to perform the complete research loop of ideation, experimentation, writing, and self-refinement under lightweight guidance.

Load-bearing premise

That the artifact-aware peer review process reliably detects fabricated results and plan-execution mismatches without bias or oversight.

What would settle it

A follow-up study in which agents are forced to run and verify all code outputs in real time, then are re-evaluated under the same artifact-aware peer review (PR) to see whether any papers pass.
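The kind of output verification described here can be sketched as a comparison between the numbers a manuscript reports and the numbers actually present in the workspace. This is a minimal sketch and not the paper's method: the function name `audit_reported_metrics`, the metric names, the values, and the 1% tolerance are all assumptions made for illustration.

```python
import math


def audit_reported_metrics(reported, artifact, rel_tol=0.01):
    """Label each reported metric as 'ok', 'mismatch', or 'unsupported'.

    reported: metric name -> value claimed in the manuscript (hypothetical input)
    artifact: metric name -> value recovered from workspace outputs (hypothetical input)
    rel_tol:  relative tolerance for treating two values as equal (assumed 1%)
    """
    findings = {}
    for name, claimed in reported.items():
        if name not in artifact:
            # No artifact backs the claim; a reviewer would need to look closer.
            findings[name] = "unsupported"
        elif math.isclose(claimed, artifact[name], rel_tol=rel_tol):
            findings[name] = "ok"
        else:
            findings[name] = "mismatch"
    return findings


# Invented example values, not data from the paper.
reported = {"accuracy": 0.912, "f1": 0.88, "auroc": 0.95}
artifact = {"accuracy": 0.911, "f1": 0.81}
findings = audit_reported_metrics(reported, artifact)
print(findings)
```

A check like this only flags candidates for inspection. Deciding whether an unsupported number is fabricated or merely unlogged still requires a reviewer's judgment.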

Figures

Figures reproduced from arXiv: 2605.19156 by Claire Cardie, Ning Wang, Sainyam Galhotra, Zhengxin Zhang.

Figure 1: The ResearchArena pipeline.
From §3.4, Artifacts-Aware Peer Review (PR): All three agents review every paper (351 reviews = 117 papers × 3). We distill a domain-specific reviewer guideline (Appendix B), standardize all domains on the ICLR 0–10 scoring scale, and break each review down into nine dimensions: novelty, soundness, significance, clarity, reproducibility, experimental rigor, references, reference integri…
Figure 2: SAR score distributions.
Figure 3: Word cloud of the most frequent content words in each agent’s paper titles.
Figure 4: Programming-language usage (left) and wall-clock time per stage (right).
Figure 5: PR breakdown scores.
Plot legend captured alongside Figure 5: categories are Results mismatch only, Setting mismatch only, Both (results + setting), and Fake reference; the axis is % of papers (n=39 per agent); the agents are Claude Code, Codex, and Kimi Code.
Figure 7: SAR vs. PR. Under artifacts-aware PR review, every agent’s score drops below its SAR score (…
Figure 8: Per-domain mean PR scores by agent across the 13 research domains.
Figure 9: Per-domain mean SAR scores by agent (companion to Figure 8).
Figure 10: Per-(seed, trial) SAR (top) and PR (bottom) score heatmaps per agent. CPU seeds above…
Figure 11: Mean wall-clock per pipeline stage (left, in minutes) and total time per average run with…
Figure 12: Per-paper total wall-clock distribution per agent, from…
Figure 13: Score distribution by (reviewer, reviewee) across all 117 papers. Codex is the strictest…
Figure 3 (within the Case 1 paper): Figure 3 shows the Sachs network case study. The framework correctly identifies several high-confidence edges while assigning lower confidence to more ambiguous connections.
Figure 14: Page-level thumbnail of the Case 1 paper.
Figure 1 (within the Case 2 paper): Per-pass idempotency rate (fraction of 87 benchmarks on which P² = P structurally). Passes are sorted by rate and colored by semantic category. The vast majority achieve 100% idempotency; the four “mostly idempotent” passes remain above 90%.
Figure 4 (within the Case 2 paper): Instruction count trajectories under iterative pipeline application. Most benchmarks converge within 3–5 iterations. Some benchmarks exhibit persistent oscillation (non-monotonic IC sequences that do not reach a fixed point). Panel shown: licm+loop-rotate+instcombine+simplifycfg on pb_2mm (length=4, amp=68), with a marked cycle region; axes are Iteration (x) and Instruction Count (y).
Figure 15: Page-level thumbnail of the Case 2 paper.
Figure 3 (within the Case 3 paper): Per-family corruption accuracy for the same six plotted methods. Family residuals do not…
Figure 16: Page-level thumbnail of the Case 3 paper.
Figure 17: Page-level thumbnail of the Case 4 paper.
Figure 18: Page-level thumbnail of the Case 5 paper.
Figure 3 (within the Case 6 paper): Pipeline construction method comparison. Mean F1-score (macro) with standard deviation error bars across 18 datasets × 3 seeds. The dashed line shows the exhaustive optimum. IAPO (10 evaluations) achieves near-exhaustive quality, outperforming Greedy (15 evals) and Canonical (1 eval), while Random Search (50 evals) is competitive due to the small search space.
Figure 5 (within the Case 6 paper): Figure 5 shows that quality improves with the number of candidates K, with most gains by K = 10 and diminishing returns beyond.
Figure 19: Page-level thumbnail of the Case 6 paper.
Original abstract

Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human-conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals that this picture is overstated: SAR scores are poorly aligned with actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR, scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex shows 5%/8% paper-vs-artifact mismatch/fabricated references versus 77%/72% for Kimi Code, a ~15× spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that a substantial gap still separates us from true auto-research.
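The counts and ratios quoted in the abstract can be checked with simple arithmetic. The sketch below uses only figures stated on this page (13 seeds, 3 trials, 3 agents, every agent reviewing every paper, and the Codex and Kimi Code percentages); the variable names are illustrative.

```python
# Study design figures as stated on this page; names are illustrative.
AGENTS = ["Claude Code", "Codex", "Kimi Code"]
N_SEEDS = 13
N_TRIALS = 3

papers_per_agent = N_SEEDS * N_TRIALS      # papers written by one agent
papers = papers_per_agent * len(AGENTS)    # all agent-generated papers
reviews = papers * len(AGENTS)             # every agent reviews every paper

# (paper-vs-artifact mismatch %, fabricated references %) from the abstract
rates = {"Codex": (5, 8), "Kimi Code": (77, 72)}
mismatch_ratio = rates["Kimi Code"][0] / rates["Codex"][0]
fake_ref_ratio = rates["Kimi Code"][1] / rates["Codex"][1]

print(papers_per_agent, papers, reviews)
print(round(mismatch_ratio, 1), round(fake_ref_ratio, 1))
```

On these numbers the quoted ~15× spread corresponds to the mismatch rate (77% vs. 5%), while the fabricated-reference rate differs by a factor of 9 (72% vs. 8%).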

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ResearchArena, a minimal scaffold allowing off-the-shelf agents (Claude Code with Opus 4.6, Codex with GPT-5.4, Kimi Code with K2.5) to execute the full research loop of ideation, experimentation, writing, and self-refinement with lightweight guidance. From 13 computer science seeds and three trials per agent-domain pair, it produces 117 papers evaluated under manuscript-only SAR review, artifact-aware PR (agents inspect workspace plus manuscript), and human meta-review. SAR yields optimistic scores with Claude Code competitive against ICLR 2025 averages and outperforming Analemma's FARS, but PR scores drop sharply; manual auditing identifies three agent-dependent failure modes (fabricated results, underpowered experiments, plan/execution mismatch) with rates such as Codex at 5%/8% versus Kimi Code at 77%/72%. The central claim is that none of the 117 papers meets the acceptance bar of a top-tier venue.

Significance. If the evaluation methodology holds, the work offers a concrete empirical benchmark for current auto-research systems, decomposing quality gaps into specific, quantifiable failure modes that vary by agent persona. The multi-lens design (SAR vs. PR vs. human meta-review) and direct comparison to real conference submissions are strengths that could guide targeted improvements in experimental rigor and consistency.

major comments (3)
  1. [§3.2] §3.2 (Artifact-aware Peer Review): The reported failure rates (e.g., 5%/8% mismatch/fabrication for Codex vs. 77%/72% for Kimi Code) rest on the assumption that PR agents and human meta-reviewers can reliably detect fabricated results and plan/execution mismatches, yet the section provides no verification protocol, inter-rater reliability statistics, or blinding details for the human component; this directly undermines the load-bearing conclusion that no papers reach top-tier standards.
  2. [§4.1] §4.1 (SAR vs. Acceptance Alignment): The claim that SAR scores are poorly aligned with actual acceptance decisions and reward plausible framing without substance is presented without quantitative support such as correlation coefficients, confusion matrices, or concrete examples of misaligned papers, weakening the argument that manuscript-only review overstates quality.
  3. [§5] §5 (Results): The explicit rubric or criteria used by human meta-reviewers to set the 'top-tier acceptance bar' is not stated, nor is any evidence provided that the process distinguishes subtle valid contributions from disqualifying flaws, which is required to support the universal claim across all 117 papers.
minor comments (2)
  1. The abstract references a 'weighted-average human ICLR 2025 submission' but the manuscript does not include a cross-reference to the appendix or table containing the exact comparison data and weighting method.
  2. Figure captions for failure-mode distributions could more explicitly label the per-agent percentages to improve readability without requiring cross-reference to the main text.
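Major comment 1 asks for inter-rater reliability statistics. One common choice for two raters assigning categorical labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch under the assumption that two auditors each label every paper; the function name and the labels are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical audit labels for eight papers ("fab" = fabricated result flagged).
rater_1 = ["fab", "ok", "ok", "fab", "ok", "ok", "fab", "ok"]
rater_2 = ["fab", "ok", "fab", "fab", "ok", "ok", "ok", "ok"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

Kappa needs at least two independent raters, so a statistic of this kind cannot be computed from a single-reviewer audit.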

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our work. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to incorporate additional details and clarifications.

  1. Referee: [§3.2] §3.2 (Artifact-aware Peer Review): The reported failure rates (e.g., 5%/8% mismatch/fabrication for Codex vs. 77%/72% for Kimi Code) rest on the assumption that PR agents and human meta-reviewers can reliably detect fabricated results and plan/execution mismatches, yet the section provides no verification protocol, inter-rater reliability statistics, or blinding details for the human component; this directly undermines the load-bearing conclusion that no papers reach top-tier standards.

    Authors: We agree that greater transparency regarding the detection process is warranted. In the revised manuscript, we have added a detailed description of the verification protocol in §3.2. Specifically, fabricated results were identified by the PR agent through systematic attempts to execute the code and reproduce the reported outcomes from the workspace artifacts. Discrepancies were logged and categorized. For the human meta-review, we now specify that it was performed by a single domain expert following a predefined checklist aligned with top-tier conference standards. Although inter-rater reliability statistics are not available due to the use of one reviewer, we have noted this as a limitation and provided blinding information: the reviewer evaluated papers without prior knowledge of the generating agent. These additions support the reliability of the reported failure rates without altering the primary conclusions. revision: partial

  2. Referee: [§4.1] §4.1 (SAR vs. Acceptance Alignment): The claim that SAR scores are poorly aligned with actual acceptance decisions and reward plausible framing without substance is presented without quantitative support such as correlation coefficients, confusion matrices, or concrete examples of misaligned papers, weakening the argument that manuscript-only review overstates quality.

    Authors: We acknowledge the need for more quantitative evidence here. The original manuscript relied on qualitative observations from the score drops and manual audits. In the revision, we have included a correlation analysis between SAR scores and the identified failure modes (e.g., negative correlation with fabrication rate), along with a confusion matrix comparing SAR-predicted acceptance to PR outcomes, and two concrete examples of papers that scored highly under SAR but were disqualified under PR due to fabricated results. This provides the requested support and reinforces that manuscript-only review can overstate quality. revision: yes

  3. Referee: [§5] §5 (Results): The explicit rubric or criteria used by human meta-reviewers to set the 'top-tier acceptance bar' is not stated, nor is any evidence provided that the process distinguishes subtle valid contributions from disqualifying flaws, which is required to support the universal claim across all 117 papers.

    Authors: We have addressed this by explicitly stating the rubric in the revised §5. The criteria include: (1) novelty and significance of the contribution, (2) methodological soundness and experimental rigor, (3) reproducibility based on provided artifacts, and (4) clarity and completeness of the manuscript. The top-tier acceptance bar was set by benchmarking against the average scores of accepted ICLR 2025 papers in similar domains. Evidence that the process distinguishes valid contributions is provided through the detailed breakdown of failure modes, where papers with minor issues were distinguished from those with disqualifying flaws like fabrication. No paper met all criteria at the required level. revision: yes
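The quantitative support discussed in response 2 (a correlation analysis and a confusion matrix comparing SAR-predicted acceptance with PR outcomes) could take roughly the following shape. This is a sketch under stated assumptions rather than the authors' analysis: the acceptance threshold of 6.0, the function names, and all scores are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def confusion(sar_scores, pr_pass, threshold):
    """Count agreement between 'SAR score >= threshold' and the PR outcome."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, passed in zip(sar_scores, pr_pass):
        predicted = score >= threshold
        if predicted and passed:
            counts["tp"] += 1
        elif predicted and not passed:
            counts["fp"] += 1  # looked acceptable on the manuscript alone
        elif passed:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Invented scores for six papers, not data from the paper.
sar_scores = [6.5, 6.0, 5.5, 7.0, 4.5, 6.2]
pr_scores = [3.0, 3.5, 4.0, 5.0, 3.0, 2.5]
pr_pass = [False, False, False, True, False, False]

matrix = confusion(sar_scores, pr_pass, threshold=6.0)  # threshold is an assumption
r = pearson(sar_scores, pr_scores)
print(matrix)
print(round(r, 2))
```

A large false-positive count in such a matrix would be the quantitative form of the claim that manuscript-only review overstates quality.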

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation


The paper conducts a direct empirical study by generating 117 papers via off-the-shelf agents, then scoring them under manuscript-only review (SAR), artifact-aware peer review (PR), and human meta-review. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. Conclusions rest on observed failure rates (e.g., fabrication and mismatch percentages) and score drops that are measured against external benchmarks such as ICLR 2025 submissions, rather than reducing to quantities defined by the authors' own prior work or inputs. The evaluation process is self-contained and falsifiable via the reported agent-dependent outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard assumptions in AI agent evaluation rather than new fitted parameters or invented entities.

axioms (2)
  • domain assumption: Off-the-shelf agents can carry out the full research loop under only lightweight guidance
    Explicitly stated in the abstract as the operating condition for ResearchArena.
  • domain assumption: Manuscript-only review, artifact-aware review, and human meta-review together provide a valid assessment of paper quality
    The central conclusions depend on these three complementary lenses being informative.


