ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

arXiv: 2605.26340 · v1 · pith:243VABJPnew · submitted 2026-05-25 · 💻 cs.AI · cs.CL · cs.MA

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister

Pith reviewed 2026-06-29 21:16 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA

keywords: autonomous research agents, Chain-of-Evidence, verifiability, hallucinated references, method-code alignment, research automation, AI agents, score verification

The pith

ScientistOne produces research papers with zero hallucinated references by enforcing traceable evidence chains for every claim.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that autonomous research agents can avoid verifiability failures such as fabricated citations and unreproducible scores by adopting Chain-of-Evidence, a framework that requires every claim to link directly to its supporting evidence. ScientistOne implements this requirement by construction across literature review, solution discovery, and manuscript writing. In evaluations covering 75 papers from five systems on five tasks, baselines exhibited failures including hallucinated reference rates up to 21 percent and score verification passing in as few as 42 percent of cases, while ScientistOne recorded zero hallucinated references, perfect score verification, and top method-code alignment. It also matched or exceeded human expert performance on the core tasks and generalized to six more with strong results. This addresses the gap between surface-level quality in AI outputs and actual trustworthiness needed for scientific use.

Core claim

Chain-of-Evidence is a verifiability framework requiring every claim to be traceable to its evidence source. ScientistOne is an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. CoE Audit is a post-hoc procedure whose four integrity checks apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, ScientistOne achieves zero hallucinated references, perfect score verification, and the highest method-code alignment while matching or exceeding human expert performance on all five tasks and generalizing further with state-of-the-art results on addi

What carries the argument

Chain-of-Evidence, a framework requiring every claim to be traceable to its evidence source, which ScientistOne maintains by construction during the full research pipeline.
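
To make the traceability requirement concrete, here is a minimal sketch of a claim-to-evidence structure. The names `Evidence`, `Claim`, and `untraceable_claims`, and the evidence kinds, are hypothetical illustrations. They are not taken from the paper and do not describe how ScientistOne is implemented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Evidence:
    kind: str     # illustrative label, e.g. "reference", "experiment_log", "code"
    locator: str  # where the evidence lives, e.g. an arXiv ID or a file path


@dataclass
class Claim:
    text: str
    evidence: List[Evidence] = field(default_factory=list)


def untraceable_claims(claims: List[Claim]) -> List[Claim]:
    """Return the claims that carry no evidence link at all."""
    return [c for c in claims if not c.evidence]


claims = [
    Claim("MLE-Bench evaluates machine learning engineering agents.",
          [Evidence("reference", "arXiv:2410.07095")]),
    Claim("Our method improves accuracy."),  # hypothetical claim with no evidence
]
missing = untraceable_claims(claims)
print([c.text for c in missing])
```

Under this assumed structure, a claim with an empty evidence list is flagged. Checking that a link exists is much weaker than checking that the linked evidence actually supports the claim.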

If this is right

ScientistOne matches or exceeds human expert performance on all five evaluated tasks.
The system generalizes to six additional tasks with state-of-the-art results on Parameter Golf and gold medals on MLE-Bench tasks.
Every baseline system exhibits at least one systematic failure mode under CoE Audit.
CoE Audit detects failures such as hallucinated references and method-code misalignment that surface-level checks miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evidence chaining could become a required component in future autonomous research pipelines to ensure outputs remain reproducible.
The same traceability requirement might transfer to other AI-generated technical documents where accuracy matters more than polish.
Human researchers could borrow the evidence-link structure to reduce their own citation and reproduction errors.

Load-bearing premise

The four integrity checks in CoE Audit are sufficient to catch all relevant verifiability failures and the 75-paper evaluation is representative of broader autonomous research performance.

What would settle it

A ScientistOne-generated paper that passes all four CoE Audit checks but is later shown to contain an unverifiable claim, or a baseline system that produces fully verifiable outputs without using Chain-of-Evidence.

Original abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
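
The counts reported in the abstract can be turned into rates with simple arithmetic. The short sketch below does only that. The dictionary keys are descriptive labels chosen here, and the code assumes nothing about what the denominators count beyond what the abstract states.

```python
# (count, total) pairs exactly as reported in the abstract for ScientistOne.
reported = {
    "hallucinated references": (0, 337),
    "score verification passed": (12, 12),
    "method-code alignment": (14, 15),
}


def as_percent(count: int, total: int) -> float:
    """Convert a count out of a total into a percentage with one decimal."""
    return round(100.0 * count / total, 1)


rates = {name: as_percent(c, t) for name, (c, t) in reported.items()}
for name, pct in rates.items():
    print(f"{name}: {pct}%")

# Size of one unit of change for the smallest denominators.
step_15 = as_percent(1, 15)
step_12 = as_percent(1, 12)
```

With denominators of 12 and 15, a single item moves the rate by several percentage points, so these two figures are best read as counts rather than precise rates.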

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ScientistOne cuts hallucinated references and verification failures to zero on its test set by enforcing evidence chains, but the four audit checks are not shown to be exhaustive.

ScientistOne improves verifiability in autonomous research agents by building in traceability from the start. The main contributions are the Chain-of-Evidence requirement that every claim links back to a source, the ScientistOne system that follows that rule through review, discovery, and writing, and the CoE Audit that applies four checks uniformly: score verification, specification violation, reference verification, and method-code alignment.

On the 75 papers from five systems and five tasks, the baselines show real problems: hallucinated reference rates as high as 21%, score verification passing in as few as 42% of papers, and method-code alignment as low as 20%. ScientistOne reports 0 hallucinated references out of 337, 12 out of 12 score verifications passed, 14 out of 15 on method-code alignment, and performance that matches or beats human experts on the core tasks. It also carries over to six more tasks in medical imaging, fine-grained recognition, 3D perception, and language modeling, with state-of-the-art results on Parameter Golf and gold medals on MLE-Bench tasks where the baselines fail.

The concrete numbers on a shared audit are the useful part. They turn a known complaint about agent outputs into something measurable and show that keeping evidence chains during generation reduces the failures.

The soft spot is that the argument rests on the four checks catching the important problems. Nothing in the abstract shows why other issues would be detected, such as real but misused citations, or method descriptions that line up at the checked level but leave out critical hyperparameters. The core evaluation stays within five AI research tasks and the five systems compared, so it is not clear how far the results generalize. Implementation details and baseline descriptions are missing from what is visible here, which makes the numbers harder to assess directly.

This is for groups building autonomous agents for research work. It gives them a framework and a set of metrics worth testing against. It deserves peer review so the audit coverage and the system internals can be examined on a wider set of cases.

Referee Report

2 major / 1 minor

Summary. The paper introduces Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source; ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction; and CoE Audit, a post-hoc audit with four integrity checks (score verification, specification violation, reference verification, method-code alignment). Across 75 papers from five systems on five frontier tasks, baselines exhibit failures (e.g., 21% hallucinated references, 42% score verification pass rate), while ScientistOne achieves 0/337 hallucinated references, 12/12 score verification, 14/15 method-code alignment, matches or exceeds human expert performance, and generalizes to six additional tasks with SOTA results on Parameter Golf and gold medals on MLE-Bench.

Significance. If the results hold, this would be significant for autonomous AI research by providing a concrete framework and system to mitigate verifiability failures that current agents exhibit, supported by a uniform audit enabling cross-system comparison and evidence of generalization across diverse tasks.

major comments (2)

[Abstract] The abstract states quantitative results such as zero hallucinated references (0/337) and perfect score verification (12/12) but supplies no implementation details, baseline descriptions, experimental protocol, or error analysis, making it impossible to determine whether the data support the claims.
[CoE Audit] The four integrity checks are assumed sufficient to catch all relevant verifiability failures (as evidenced by catching baseline issues while ScientistOne passes), but the manuscript provides no argument or evidence that other failure modes would be detected, such as contextually incorrect citations that still resolve to real papers, or method descriptions that match code at the checked granularity but omit critical hyperparameters.
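
The second comment can be illustrated with a toy example of a citation that resolves to a real paper yet does not support the claim it is attached to. Everything in this sketch is an assumption made for illustration: `INDEX`, `reference_exists`, and `reference_supports` are hypothetical, and keyword overlap is only a crude stand-in for contextual correctness. It is not the paper's reference verification check.

```python
# Toy index of resolvable works (hypothetical stand-in for a bibliographic lookup).
INDEX = {
    "arXiv:2410.07095": (
        "MLE-Bench: Evaluating machine learning agents "
        "on machine learning engineering"
    ),
}


def reference_exists(ref_id: str) -> bool:
    """Existence check: does the cited identifier resolve to a known entry?"""
    return ref_id in INDEX


def reference_supports(ref_id: str, claim_terms: list) -> bool:
    """Crude support check: does any claim term appear in the cited title?"""
    title = INDEX.get(ref_id, "").lower()
    return any(term.lower() in title for term in claim_terms)


# A real paper cited for an unrelated claim about database recovery.
ref = "arXiv:2410.07095"
exists = reference_exists(ref)
supported = reference_supports(ref, ["database recovery", "transaction"])
print(f"exists={exists}, supported={supported}")
```

An audit that stops at the existence check would accept this citation. Catching it requires some comparison between the claim and the content of the cited work.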

minor comments (1)

[Abstract] The five frontier research tasks and six additional tasks are referenced but not enumerated; listing them would aid immediate understanding of the evaluation scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to improve clarity and completeness. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The abstract states quantitative results such as zero hallucinated references (0/337) and perfect score verification (12/12) but supplies no implementation details, baseline descriptions, experimental protocol, or error analysis, making it impossible to determine whether the data support the claims.

Authors: We acknowledge that the abstract, due to standard length constraints, omits full experimental details. The manuscript body (Sections 3, 4, and 5) contains the requested information on implementation, baselines, protocol, and error analysis. To address the concern, we will revise the abstract to include a concise reference to the experimental scale (75 papers, five systems, five tasks) and explicitly direct readers to the methods and results sections for supporting details.

Revision: yes

Referee: [CoE Audit] The four integrity checks are assumed sufficient to catch all relevant verifiability failures (as evidenced by catching baseline issues while ScientistOne passes), but the manuscript provides no argument or evidence that other failure modes would be detected, such as contextually incorrect citations that still resolve to real papers, or method descriptions that match code at the checked granularity but omit critical hyperparameters.

Authors: The comment is correct: the current manuscript does not provide an explicit argument or evidence for the sufficiency of the four checks against all possible failure modes. The checks were chosen to target the dominant observed failures (hallucinated references, score errors, specification violations, and method-code divergence). We will add a dedicated paragraph in the CoE Audit section that (a) justifies the selection based on baseline failure patterns and prior literature, (b) acknowledges limitations for subtler issues such as contextual citation errors or omitted hyperparameters, and (c) outlines potential extensions for future audits.

Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical evaluation

The paper contains no equations, derivations, or first-principles predictions. All claims are empirical performance results on defined tasks, with CoE and CoE Audit introduced as explicit frameworks whose checks are applied uniformly. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear. The evaluation stands on its own reported metrics against the stated benchmarks and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5851 in / 1189 out tokens · 35303 ms · 2026-06-29T21:16:27.134419+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy
cs.SE · 2026-06 · unverdicted · novelty 5.0

Agon is a new autonomous research system using prompt economy loops across 444 iterations to demonstrate scalable omnidisciplinary research and a taxonomy separating machine-fixable failures from those needing human judgment.
