arxiv: 2605.26340 · v1 · pith:243VABJPnew · submitted 2026-05-25 · 💻 cs.AI · cs.CL · cs.MA

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Pith reviewed 2026-06-29 21:16 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA
keywords autonomous research agents · Chain-of-Evidence · verifiability · hallucinated references · method-code alignment · research automation · AI agents · score verification

The pith

ScientistOne produces research papers with zero hallucinated references by enforcing traceable evidence chains for every claim.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that autonomous research agents can avoid verifiability failures such as fabricated citations and unreproducible scores by adopting Chain-of-Evidence, a framework that requires every claim to link directly to its supporting evidence. ScientistOne implements this requirement by construction across literature review, solution discovery, and manuscript writing. In evaluations covering 75 papers from five systems on five tasks, baselines exhibited failures including hallucinated reference rates up to 21 percent and score verification passing in as few as 42 percent of cases, while ScientistOne recorded zero hallucinated references, perfect score verification, and top method-code alignment. It also matched or exceeded human expert performance on the core tasks and generalized to six more with strong results. This addresses the gap between surface-level quality in AI outputs and actual trustworthiness needed for scientific use.

Core claim

Chain-of-Evidence is a verifiability framework requiring every claim to be traceable to its evidence source. ScientistOne is an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. CoE Audit is a post-hoc procedure whose four integrity checks apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, ScientistOne achieves zero hallucinated references, perfect score verification, and the highest method-code alignment while matching or exceeding human expert performance on all five tasks and generalizing further with state-of-the-art results on additional tasks.

What carries the argument

Chain-of-Evidence, a framework requiring every claim to be traceable to its evidence source, which ScientistOne maintains by construction during the full research pipeline.
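
One way to picture "by construction" is a data model in which a claim cannot exist without at least one evidence link. The Python sketch below is illustrative only: the class names (`Evidence`, `Claim`, `Manuscript`), the evidence categories, and the locator strings are assumptions made for this example, not ScientistOne's actual implementation, which the abstract does not describe.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvidenceKind(Enum):
    # Hypothetical categories, loosely mirroring the pipeline stages named above.
    REFERENCE = "reference"
    EXPERIMENT_LOG = "experiment_log"
    CODE = "code"


@dataclass(frozen=True)
class Evidence:
    kind: EvidenceKind
    locator: str  # hypothetical pointer, e.g. a log path or a file and line range


@dataclass(frozen=True)
class Claim:
    text: str
    evidence: tuple[Evidence, ...]

    def __post_init__(self) -> None:
        # "By construction": a claim with no evidence cannot be created at all.
        if not self.evidence:
            raise ValueError(f"claim has no evidence: {self.text!r}")


@dataclass
class Manuscript:
    claims: list[Claim] = field(default_factory=list)

    def add_claim(self, text: str, *evidence: Evidence) -> Claim:
        # Construction fails before the claim can be appended.
        claim = Claim(text, tuple(evidence))
        self.claims.append(claim)
        return claim

    def trace(self, claim: Claim) -> list[str]:
        # Return the chain of evidence locators backing a claim.
        return [f"{e.kind.value}:{e.locator}" for e in claim.evidence]


paper = Manuscript()
supported = paper.add_claim(
    "The reported score comes from a logged run.",       # made-up claim text
    Evidence(EvidenceKind.EXPERIMENT_LOG, "runs/example.json"),  # made-up path
)
print(paper.trace(supported))
```

The design choice worth noting is that the check lives in the constructor rather than in a later validation pass, so an unsupported claim is rejected when it is written, not discovered afterwards by an audit.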

If this is right

  • ScientistOne matches or exceeds human expert performance on all five evaluated tasks.
  • The system generalizes to six additional tasks with state-of-the-art results on Parameter Golf and gold medals on MLE-Bench tasks.
  • Every baseline system exhibits at least one systematic failure mode under CoE Audit.
  • CoE Audit detects failures such as hallucinated references and method-code misalignment that surface-level checks miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evidence chaining could become a required component in future autonomous research pipelines to ensure outputs remain reproducible.
  • The same traceability requirement might transfer to other AI-generated technical documents where accuracy matters more than polish.
  • Human researchers could borrow the evidence-link structure to reduce their own citation and reproduction errors.

Load-bearing premise

The four integrity checks in CoE Audit are sufficient to catch all relevant verifiability failures and the 75-paper evaluation is representative of broader autonomous research performance.

What would settle it

A ScientistOne-generated paper that passes all four CoE Audit checks but is later shown to contain an unverifiable claim, or a baseline system that produces fully verifiable outputs without using Chain-of-Evidence.

Original abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
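
The ScientistOne counts in the abstract are given as fractions with different denominators, so converting them to rates makes them easier to compare with the baseline percentages. The short Python sketch below does only that arithmetic; the counts come from the abstract, while the dictionary labels and the `percent` helper are names chosen for this example.

```python
from fractions import Fraction

# (numerator, denominator) pairs as reported in the abstract for ScientistOne.
reported = {
    "hallucinated references": (0, 337),
    "score verification passed": (12, 12),
    "method-code alignment": (14, 15),
}


def percent(numerator: int, denominator: int) -> float:
    # Exact rational arithmetic first, converted to float only at the end.
    return float(Fraction(numerator, denominator) * 100)


rates = {name: percent(n, d) for name, (n, d) in reported.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
# hallucinated references: 0.0%
# score verification passed: 100.0%
# method-code alignment: 93.3%
```

The denominators are kept exactly as reported. The abstract does not say why score verification covers 12 papers while method-code alignment covers 15, so the two rates should not be assumed to describe the same set of papers.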

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source; ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction; and CoE Audit, a post-hoc audit with four integrity checks (score verification, specification violation, reference verification, method-code alignment). Across 75 papers from five systems on five frontier tasks, baselines exhibit failures (e.g., 21% hallucinated references, 42% score verification pass rate), while ScientistOne achieves 0/337 hallucinated references, 12/12 score verification, 14/15 method-code alignment, matches or exceeds human expert performance, and generalizes to six additional tasks with SOTA results on Parameter Golf and gold medals on MLE-Bench.

Significance. If the results hold, the work would matter for autonomous AI research: it offers a concrete framework and system for mitigating the verifiability failures that current agents exhibit, a uniform audit that enables cross-system comparison, and evidence of generalization across diverse tasks.

major comments (2)
  1. [Abstract] Abstract: The abstract states quantitative results such as zero hallucinated references (0/337) and perfect score verification (12/12) but supplies no implementation details, baseline descriptions, experimental protocol, or error analysis, making it impossible to determine whether the data support the claims.
  2. [CoE Audit] CoE Audit description: The four integrity checks are assumed sufficient to catch all relevant verifiability failures (as evidenced by catching baseline issues while ScientistOne passes), but the manuscript provides no argument or evidence that other failure modes—such as contextually incorrect citations that still resolve to real papers, or method descriptions that match code at the checked granularity but omit critical hyperparameters—would be detected.
minor comments (1)
  1. [Abstract] Abstract: The five frontier research tasks and six additional tasks are referenced but not enumerated, which would aid immediate understanding of the evaluation scope.
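
The second major comment can be made concrete with a toy example. The Python sketch below assumes a reference check that only tests whether a cited identifier resolves to a known paper; that is an assumption for illustration, since the abstract does not specify how CoE Audit's reference verification works. The lookup table, the function name `reference_exists`, the claim sentences, and the identifier `arXiv:0000.00000` are all made up for this example.

```python
# Hypothetical in-memory index standing in for a bibliographic lookup service.
KNOWN_TITLES = {
    "arXiv:2410.07095": (
        "MLE-Bench: Evaluating machine learning agents "
        "on machine learning engineering"
    ),
}


def reference_exists(ref_id: str) -> bool:
    # Existence-only check: does the identifier resolve to a known paper?
    return ref_id in KNOWN_TITLES


# (claim sentence, cited reference id); the claims are invented for illustration.
citations = [
    ("MLE-Bench evaluates agents on machine learning engineering.", "arXiv:2410.07095"),
    ("This benchmark measures protein folding accuracy.", "arXiv:2410.07095"),
    ("An unpublished result supports this.", "arXiv:0000.00000"),
]

results = [reference_exists(ref) for _, ref in citations]
for (claim, ref), ok in zip(citations, results):
    print(f"{'pass' if ok else 'FAIL'}  {ref}  {claim}")
```

Under this assumption the second citation passes even though the cited paper does not support the sentence, which is the kind of contextually incorrect citation the referee asks the authors to address. Only the third, non-resolving identifier is flagged.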

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to improve clarity and completeness. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract states quantitative results such as zero hallucinated references (0/337) and perfect score verification (12/12) but supplies no implementation details, baseline descriptions, experimental protocol, or error analysis, making it impossible to determine whether the data support the claims.

    Authors: We acknowledge that the abstract, due to standard length constraints, omits full experimental details. The manuscript body (Sections 3, 4, and 5) contains the requested information on implementation, baselines, protocol, and error analysis. To address the concern, we will revise the abstract to include a concise reference to the experimental scale (75 papers, five systems, five tasks) and explicitly direct readers to the methods and results sections for supporting details. revision: yes

  2. Referee: [CoE Audit] CoE Audit description: The four integrity checks are assumed sufficient to catch all relevant verifiability failures (as evidenced by catching baseline issues while ScientistOne passes), but the manuscript provides no argument or evidence that other failure modes—such as contextually incorrect citations that still resolve to real papers, or method descriptions that match code at the checked granularity but omit critical hyperparameters—would be detected.

    Authors: The comment is correct: the current manuscript does not provide an explicit argument or evidence for the sufficiency of the four checks against all possible failure modes. The checks were chosen to target the dominant observed failures (hallucinated references, score errors, specification violations, and method-code divergence). We will add a dedicated paragraph in the CoE Audit section that (a) justifies the selection based on baseline failure patterns and prior literature, (b) acknowledges limitations for subtler issues such as contextual citation errors or omitted hyperparameters, and (c) outlines potential extensions for future audits. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical evaluation

The paper contains no equations, derivations, or first-principles predictions. All claims are empirical performance results on defined tasks, with CoE and CoE Audit introduced as explicit frameworks whose checks are applied uniformly. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear. The evaluation stands on its own reported metrics against the stated benchmarks and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy

    cs.SE 2026-06 unverdicted novelty 5.0

    Agon is a new autonomous research system using prompt economy loops across 444 iterations to demonstrate scalable omnidisciplinary research and a taxonomy separating machine-fixable failures from those needing human judgment.

Reference graph

Works this paper leans on

35 extracted references · 24 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    AdaEvolve: Adaptive LLM driven zeroth-order optimization

    M. Cemri, S. Agrawal, A. Gupta, S. Liu, A. Cheng, Q. Mang, A. Naren, L. E. Erdogan, K. Sen, M. Zaharia, et al. AdaEvolve: Adaptive LLM driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026

  2. [2]

    J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, et al. MLE-Bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

  3. [3]

    T. Chen, S. Anumasa, B. Lin, V. Shah, A. Goyal, and D. Liu. Auto-Bench: An automated benchmark for scientific discovery in LLMs. arXiv preprint arXiv:2502.15224, 2025

  4. [4]

    A. Cheng, S. Liu, M. Pan, Z. Li, S. Agarwal, M. Cemri, B. Wang, A. Krentsel, T. Xia, J. Park, et al. Let the barbarians in: How AI can accelerate systems performance research. arXiv preprint arXiv:2512.14806, 2025a

  5. [5]

    A. Cheng, S. Liu, M. Pan, Z. Li, B. Wang, A. Krentsel, T. Xia, M. Cemri, J. Park, S. Yang, et al. Barbarians at the gate: How AI is upending systems research. arXiv preprint arXiv:2510.06189, 2025b

  6. [6]

    ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review

    P. Goyal, M. Parmar, Y. Song, H. Palangi, T. Pfister, and J. Yoon. ScholarPeer: A context-aware multi-agent framework for automated peer review. arXiv preprint arXiv:2601.22638, 2026

  7. [7]

    T. Härder and A. Reuter. Principles of transaction-oriented database recovery. ACM Computing Surveys, 15(4):287–317, 1983

  8. [8]

    MLAgentBench: Evaluating language agents on machine learning experimentation

    Q. Huang, J. Vora, P. Liang, and J. Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2023

  9. [9]

    P. Jansen, O. Tafjord, M. Radensky, P. Siangliulue, T. Hope, B. Dalvi, B. P. Majumder, D. S. Weld, and P. Clark. CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13370–13467, 2025

  10. [10]

    P. T. J. Kon, J. Liu, Q. Ding, Y. Qiu, Z. Yang, Y. Huang, J. Srinivasa, M. Lee, M. Chowdhury, and A. Chen. Curie: Toward rigorous and automated scientific experimentation with AI agents. arXiv preprint arXiv:2502.16069, 2025a

  11. [11]

    P. T. J. Kon, J. Liu, X. Zhu, Q. Ding, J. Peng, J. Xing, Y. Huang, Y. Qiu, J. Srinivasa, M. Lee, et al. EXP-Bench: Can AI conduct AI research experiments? arXiv preprint arXiv:2505.24785, 2025b

  12. [12]

    J. Liu, S. Qiu, M. Li, B. Li, H. Ji, S. Han, X. Ye, P. Xia, Z. Dong, C. Zhang, et al. Autoresearchclaw: Self-reinforcing autonomous research with human-ai collaboration. arXiv preprint arXiv:2605.20025, 2026a

  13. [13]

    N. F. Liu, T. Zhang, and P. Liang. Evaluating verifiability in generative search engines. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7001–7025, 2023a

  14. [14]

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  15. [15]

    S. Liu, S. Agarwal, M. Maheswaran, M. Cemri, Z. Li, Q. Mang, A. Naren, E. Boneh, A. Cheng, M. Z. Pan, et al. EvoX: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026b

  16. [16]

    S. Liu, M. Cemri, S. Agarwal, A. Krentsel, A. Naren, Q. Mang, Z. Li, A. Gupta, M. Maheswaran, A. Cheng, M. Pan, E. Boneh, K. Ramchandran, K. Sen, A. G. Dimakis, M. Zaharia, and I. Stoica. Skydiscover: A flexible framework for ai-driven scientific and algorithmic discovery, 2026c. URL https://skydiscover-ai.github.io/blog.html

  17. [18]

    Y. Liu, Z. Yang, T. Xie, J. Ni, B. Gao, Y. Li, S. Tang, W. Ouyang, E. Cambria, and D. Zhou. ResearchBench: Benchmarking LLMs in scientific discovery via inspiration-based task decomposition. arXiv preprint arXiv:2503.21248, 2025

  18. [19]

    C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024

  19. [20]

    A. Lupidi, B. Gauri, T. S. Foster, B. A. Omari, D. Magka, A. Pepe, A. Audran-Reiss, M. Aghamelu, N. Baldwin, L. Cipolina-Kun, et al. AIRS-Bench: A suite of tasks for frontier AI research science agents. arXiv preprint arXiv:2602.06855, 2026

  20. [21]

    Y. Lyu, X. Zhang, X. Yi, Y. Zhao, S. Guo, W. Hu, J. Piotrowski, J. Kaliski, J. Urbani, Z. Meng, et al. EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery. arXiv preprint arXiv:2603.08127, 2026

  21. [22]

    S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

  22. [23]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  23. [24]

    Parameter golf: OpenAI model craft challenge

    OpenAI. Parameter golf: OpenAI model craft challenge. https://github.com/openai/parameter-golf, 2026

  24. [25]

    O. Press, A. Hochlehnert, A. Prabhu, V. Udandarao, O. Press, and M. Bethge. CiteME: Can language models accurately cite scientific claims? Advances in Neural Information Processing Systems, 37:7847–7877, 2024

  25. [26]

    Y. Pu, T. Lin, and H. Chen. PiFlow: Principle-aware scientific discovery with multi-agent collaboration. arXiv preprint arXiv:2505.15047, 2025

  26. [27]

    S. Schmidgall, Y. Su, Z. Wang, X. Sun, J. Wu, X. Yu, J. Liu, M. Moor, Z. Liu, and E. Barsoum. Agent laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

  27. [28]

    PaperBench: Evaluating AI's Ability to Replicate AI Research

    G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompson, et al. PaperBench: Evaluating AI's ability to replicate AI research. arXiv preprint arXiv:2504.01848, 2025

  28. [29]

    J. Tang, L. Xia, Z. Li, and C. Huang. AI-Researcher: Autonomous scientific innovation. arXiv preprint arXiv:2505.18705, 2025

  29. [30]

    Z. Wang, F. Bai, Z. Luo, J. Su, K. Sun, X. Yu, J. Liu, K. Zhou, C. Cardie, M. Dredze, et al. FIRE-Bench: Evaluating agents on the rediscovery of scientific insights. arXiv preprint arXiv:2602.02905, 2026

  30. [31]

    Y. Weng, M. Zhu, Q. Xie, Q. Sun, Z. Lin, S. Liu, and Y. Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively. arXiv preprint arXiv:2509.26603, 2025

  31. [32]

    T. Xu, P. Lu, L. Ye, X. Hu, and P. Liu. ResearcherBench: Evaluating deep AI research systems on the frontiers of scientific inquiry. arXiv preprint arXiv:2507.16280, 2025

  32. [33]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Y. Yamada, R. T. Lange, C. Lu, S. Hu, C. Lu, J. Foerster, J. Clune, and D. Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025
