MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

Diyana Muhammed; Farhana Keya; Gollam Rabby; Sasi Kiran Gaddipati; Sören Auer

arxiv: 2605.16616 · v1 · pith:PYGVHWY3new · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-20 19:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords autonomous research systems · machine learning reproducibility · benchmarking · AI-generated manuscripts · scientific discovery · workflow design

The pith

Autonomous research systems produce flawed machine learning papers, and workflow design predicts quality better than compute scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MLReplicate, a benchmark that feeds reformulated outstanding ICML papers as inputs to six current autonomous research systems and scores the resulting manuscripts. Both automated conference-style reviews and human expert evaluations are applied while recording token use, runtime, and human intervention. Automated reviews accept some outputs, yet human reviewers flag methodological errors, hallucinated results, and reproducibility shortfalls in every system. The study reports that neither higher token budgets nor greater computational cost reliably improves the scientific quality of the generated work.

Core claim

Using the MLReplicate benchmark on ICML 2025 papers, the authors show that all tested autonomous systems generate manuscripts containing methodological flaws and reproducibility failures according to human experts, even when some pass automated review. Neither token budget nor computational cost predicts output quality, and the cheapest system outperforms the most resource-intensive system in human evaluation despite a 38-fold difference in input tokens.

What carries the argument

The MLReplicate benchmark, which standardizes outstanding ML papers as inputs to autonomous systems and applies a dual automated-plus-human evaluation protocol while tracking cost and intervention metrics.

If this is right

Automated conference-style reviews accept outputs that contain fabricated or unsupported claims according to human experts.
No tested system achieves consistent reproducibility across the benchmark tasks.
Token budget and computational cost do not correlate with higher-quality generated research.
Differences in system workflows produce measurable differences in human-assessed rigor even when resources vary widely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future systems could prioritize improvements to internal reasoning loops and experiment validation steps rather than simply increasing model size or token limits.
Extending the benchmark to papers from other fields would test whether the observed workflow-over-compute pattern holds outside machine learning.
The gap between automated acceptance and human rejection points to a need for stronger automated checks that better approximate expert scrutiny of methods and results.

Load-bearing premise

Human expert evaluations supply an unbiased and reliable measure of scientific rigor and reproducibility.

What would settle it

A replication study that blinds reviewers to system identity, measures inter-rater agreement, and reports reviewer selection criteria; low agreement or detectable bias would undermine the human evaluation results.

Figures

Figures reproduced from arXiv: 2605.16616 by Diyana Muhammed, Farhana Keya, Gollam Rabby, Sasi Kiran Gaddipati, Sören Auer.

**Figure 1.** MLReplicate evaluation pipeline. (A) Source Curation & Paper Generation: 8 ICML 2025 papers are standardized into ground-truth specifications; 6 autonomous systems execute end-to-end research cycles (ideation, code generation, writing, and internal review) to produce generated manuscripts. (B) Review Process: Outputs undergo human intervention (formatting, citation alignment, LaTeX compilation) before eval…

**Figure 2.** Automated vs. human evaluation. Left: correlation analysis between human and AI evaluations. Right: overall rating score distributions across systems. AIR = AI RESEARCHER; AL = AGENT LABORATORY; CR = CYCLERESEARCHER; TS = TINY SCIENTIST; Pn denotes the paper ID as listed in …

**Figure 3.** Generated experiments and acceptance distribution for each research system from automated reviews. AIS1 = AI SCIENTIST V1; AIS2 = AI SCIENTIST V2.

Post-Generation Processing. All system outputs undergo limited post-processing before evaluation, strictly confined to formatting corrections, citation alignment, and figure integration fixes; no modifications are made to scientific content, experimental resu…

**Figure 4.** Overview of acceptance statistics and dimension analysis of the human reviewers. The horizontal axis of the heatmap represents Quality, Clarity, Significance, Originality, Format, and Citation Relevance, in that order. AIR = AI RESEARCHER; AL = AGENT LABORATORY; CR = CYCLERESEARCHER; TS = TINY SCIENTIST. (a) Average reviewer rating …

**Figure 5.** Overview of overall rating and dimension analysis of the automated reviews. The short forms represent the relevant research systems. AIR = AI RESEARCHER; AIS2 = AI SCIENTIST V2; AL = AGENT LABORATORY; CR = CYCLERESEARCHER; TS = TINY SCIENTIST.

… presentation, and contribution. Nevertheless, no system exceeded an average dimensional score of 2 out of 4, underscoring a substantial gap between current autonomous res…

**Figure 6.** Overall resource distribution and a paper similarity heatmap illustrating the similarities between generated and actual papers. AIS2 = AI SCIENTIST-V2; AIR = AI RESEARCHER; AL = AGENT LABORATORY; CR = CYCLERESEARCHER; TS = TINY SCIENTIST; Pn denotes the paper ID as listed in …

**Figure 7.** Excerpts from the human evaluation forms, designed in the style of NeurIPS review …

Original abstract

Autonomous research systems capable of generating complete scientific manuscripts have advanced rapidly, yet robust and realistic evaluation frameworks have failed to keep pace. To bridge this gap, we introduce MLReplicate, an end-to-end benchmark evaluating autonomous research systems on machine learning reproducibility. The benchmark was constructed from ICML 2025 outstanding papers reformulated into standardized input specifications and evaluated across 6 state-of-the-art research systems: AI SCIENTIST-V1, AI SCIENTIST-V2, AGENT LABORATORY, CYCLERESEARCHER, AI RESEARCHER, and TINY SCIENTIST, yielding 45 generated manuscripts, with 3 failed experiments. Outputs are assessed using a dual-protocol approach that combines automated conference-style review and structured expert human evaluation, while tracking computational cost, runtime, and the amount of required human intervention. The automated conference-style review accepted 10 out of 37 valid submissions. An additional 8 submissions were desk-rejected before review for failing to meet the minimum page threshold. In contrast to automated reviews, human reviewers consistently identified methodological flaws, hallucinated experimental results, and reproducibility failures across all systems, and 59% of accepted automated reviews contained fabricated or unsupported claims. We further find that neither token budget nor computational cost predicts output quality: the cheapest system outperforms the most resource-intensive system in human evaluation, despite a 38-fold difference in input tokens. We thus demonstrate that autonomous research workflow design matters more than the scale of compute. MLReplicate exposes a substantial gap between current autonomous research systems and genuine scientific rigor, and establishes a practical, extensible evaluation framework for systematic progress toward trustworthy AI-driven scientific discovery.
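The submission counts in the abstract can be reconciled with a few lines of arithmetic. The sketch below is only a consistency check, not something the paper provides: it assumes that each of the 6 systems was run once on each of the 8 source papers mentioned in the Figure 1 caption, which the abstract does not state explicitly. All variable names are illustrative.

```python
# Counts as stated in the abstract and the Figure 1 caption.
NUM_SYSTEMS = 6          # autonomous research systems evaluated
NUM_SOURCE_PAPERS = 8    # ICML 2025 papers (Figure 1 caption)
FAILED_RUNS = 3          # "3 failed experiments"
DESK_REJECTED = 8        # below the minimum page threshold
ACCEPTED = 10            # accepted by the automated review

# Assumption (not stated in the source): one run per system per paper.
attempted = NUM_SYSTEMS * NUM_SOURCE_PAPERS
generated = attempted - FAILED_RUNS
reviewed = generated - DESK_REJECTED
acceptance_rate = ACCEPTED / reviewed

print(f"attempted={attempted} generated={generated} reviewed={reviewed}")
print(f"automated acceptance rate = {acceptance_rate:.1%}")
```

Under that assumption the numbers line up: 48 attempted runs, 45 generated manuscripts, 37 valid submissions, and an automated acceptance rate of roughly 27%.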

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces MLReplicate, a benchmark for autonomous research systems in machine learning reproducibility. It reformulates ICML 2025 outstanding papers into standardized inputs and evaluates six systems (AI SCIENTIST-V1, AI SCIENTIST-V2, AGENT LABORATORY, CYCLERESEARCHER, AI RESEARCHER, TINY SCIENTIST) to produce 45 manuscripts (with 3 failures). Outputs are assessed via automated conference-style review (10/37 valid submissions accepted, 8 desk-rejected for page count) plus structured human expert evaluation, while tracking token budgets, costs, runtime, and human intervention. Key results: humans identify methodological flaws, hallucinations, and reproducibility failures across systems; 59% of automated accepts contain fabricated claims; neither token budget nor cost predicts quality, with the cheapest system outperforming the most expensive despite a 38-fold token difference. The central claim is that workflow design matters more than compute scale.

Significance. If the human evaluation protocol is shown to be reliable, this work offers a concrete, extensible benchmark grounded in real conference papers and dual automated/human assessment. The empirical finding that design trumps scale, supported by cross-system comparisons and cost/token tracking, would be a useful contribution to evaluating AI-driven scientific discovery. The concrete acceptance rates, fabrication incidence, and failure counts provide falsifiable reference points for future systems.

major comments (1)

[Human evaluation protocol] Human evaluation section: the manuscript reports that human reviewers 'consistently identified methodological flaws, hallucinated experimental results, and reproducibility failures' and that the cheapest system outperforms the most resource-intensive one in human evaluation, yet provides no details on reviewer selection criteria, blinding to system identity, inter-rater agreement statistics, or rubrics for scoring rigor/reproducibility. Because the central claim (workflow design > compute scale, with no correlation between token budget/cost and quality) rests directly on these human judgments being an unbiased and reliable measure, the absence of these methodological safeguards is load-bearing and must be addressed before the ranking and correlation results can be interpreted with confidence.

minor comments (2)

[Abstract] The abstract states '3 failed experiments' without defining failure criteria or how these cases were excluded from the 37 valid submissions and subsequent analyses.
[Results] Table or results section reporting the 38-fold token difference and cost/quality correlations should include the exact per-system token counts, costs, and human scores to allow direct verification of the 'neither predicts quality' claim.
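Once per-system token counts and human scores are available, the "neither predicts quality" claim could be checked with a rank correlation. The following is a minimal sketch using Spearman's coefficient; the source does not say which statistic the authors used, so that choice and the function names `average_ranks` and `spearman` are assumptions made here for illustration. No values from the paper are included.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5
```

A reader would call `spearman(tokens_per_system, mean_human_score_per_system)` with the reported figures. With only a handful of systems, a coefficient near zero is weak evidence in either direction, which is one reason the exact per-system numbers matter.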

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment on the human evaluation protocol below and will revise the manuscript to incorporate the requested details, thereby strengthening the interpretability of our results.

Point-by-point responses

Referee: [Human evaluation protocol] Human evaluation section: the manuscript reports that human reviewers 'consistently identified methodological flaws, hallucinated experimental results, and reproducibility failures' and that the cheapest system outperforms the most resource-intensive one in human evaluation, yet provides no details on reviewer selection criteria, blinding to system identity, inter-rater agreement statistics, or rubrics for scoring rigor/reproducibility. Because the central claim (workflow design > compute scale, with no correlation between token budget/cost and quality) rests directly on these human judgments being an unbiased and reliable measure, the absence of these methodological safeguards is load-bearing and must be addressed before the ranking and correlation results can be interpreted with confidence.

Authors: We agree that the current manuscript would benefit from expanded methodological details on the human evaluation to support the reliability of the judgments underlying our central claims. In the revised version, we will add a dedicated subsection describing: reviewer selection criteria (PhD-level ML researchers with publications at ICML/NeurIPS and at least five years of relevant experience); blinding procedures (reviewers received anonymized manuscripts with no information on the generating system or token budgets); inter-rater agreement (Fleiss' kappa computed across the three reviewers per manuscript, with values to be reported); and the scoring rubrics (explicit criteria and examples for detecting methodological flaws, hallucinated results, and reproducibility failures). These additions will provide the necessary transparency without altering the reported outcomes or conclusions. We believe this revision directly addresses the concern and allows readers to assess the strength of the evidence that workflow design matters more than scale.

Revision: yes
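The simulated rebuttal names Fleiss' kappa as the inter-rater agreement statistic. As a hedged illustration of what that computation involves, the sketch below implements the standard formula for a table of per-item category counts. The function name `fleiss_kappa`, the two-category accept/reject framing, and the toy table are assumptions for illustration; none of the numbers come from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = raters placing item i in category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Share of all assignments that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy table (hypothetical): 4 manuscripts, 3 reviewers, columns = [accept, reject].
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
```

Values near 1 indicate agreement well above chance and values near 0 indicate agreement no better than chance, so a low reported kappa would weaken the human rankings that the central claim depends on.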

Circularity Check

0 steps flagged

Empirical benchmark self-contained; no circular reductions in derivation chain

Full rationale

The paper constructs MLReplicate from external ICML 2025 outstanding papers as standardized inputs, then evaluates six autonomous systems to produce 45 manuscripts assessed via automated reviews and independent human evaluations. The central claim—that workflow design matters more than compute scale because neither token budget nor cost predicts quality, with the cheapest system outperforming the most expensive despite a 38-fold token difference—follows directly from these observed performance metrics and reviewer judgments. No equations, fitted parameters, or results reduce by construction to quantities defined within the paper itself; the findings emerge from new empirical comparisons against external benchmarks rather than self-definitional loops, self-citation chains, or renamed known results. The derivation remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the domain assumption that ICML 2025 outstanding papers form a suitable high-quality test set and that human reviewers can reliably detect methodological and reproducibility issues; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: ICML 2025 outstanding papers are representative of high-quality machine learning research suitable for reproducibility benchmarking.
The benchmark is constructed directly from these papers as described in the abstract.


work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Truong, Weixin Liang, Fan-Yun Sun, and Nick Haber

Researchcodebench: Benchmarking llms on implementing novel machine learning research code.arXiv preprint arXiv:2506.02314(2025). Jin Huang, Silviu Cucerzan, Sujay Kumar Jauhar, and Ryen W White

work page arXiv 2025

[14] [14]

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec

Idea2Plan: Exploring AI-Powered Research Planning.arXiv preprint arXiv:2510.24891(2025). Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec

work page arXiv 2025

[15] [15]

Mlagentbench: Evaluating language agents on ma- chine learning experimentation.arXiv preprint arXiv:2310.03302, 2023

Mlagentbench: Evaluating language agents on machine learning experimentation.arXiv preprint arXiv:2310.03302(2023). Maximilian Idahl and Zahra Ahmadi

work page arXiv 2023

[16] [16]

InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

Openreviewer: A specialized large language model for generating critical scientific paper reviews. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations). 550–562. Intology AI

work page 2025

[17] [17]

https://www.intology.ai/blog/ zochi-tech-report

Zochi Technical Report. https://www.intology.ai/blog/ zochi-tech-report. Accessed: 2025-04-24. Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, Daniel S Weld, and Peter Clark

work page 2025

[18] [18]

AIDE: AI-Driven Exploration in the Space of Code

Aide: Ai-driven exploration in the space of code.arXiv preprint arXiv:2502.13138 (2025). Farhana Keya, Gollam Rabby, Sören Auer, Sahar Vahdati, Prasenjit Mitra, and Mohamad Yaser Jaradeh

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Machine Learning115, 5 (2026),

Sci-idea: Context-aware scientific ideation using token and sentence embeddings. Machine Learning115, 5 (2026),

work page 2026

[20] [20]

Jaeho Kim, Yunseok Lee, and Seulki Lee. 2025a. Position: The AI conference peer review crisis demands author feedback and reviewer rewards.arXiv preprint arXiv:2505.04966(2025). Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, and Sitan Chen. 2025b. Train for the worst, plan for the best: Understanding token ordering in masked diffusions.arXiv prep...

work page arXiv 2025

[21] [21]

Curie: Toward rigorous and automated scientific experimentation with ai agents.arXiv preprint arXiv:2502.16069, 2025

Curie: Toward rigorous and automated scientific experimentation with ai agents.arXiv preprint arXiv:2502.16069(2025). Ruochen Li, Teerth Patel, Qingyun Wang, and Xinya Du

work page arXiv 2025

[22] [22]

Mlr-copilot: Autonomous machine learning research based on large language models agents.arXiv preprint arXiv:2408.14033, 2024

Mlr-copilot: Autonomous machine learning research based on large language models agents.arXiv preprint arXiv:2408.14033(2024). Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha

work page arXiv 2024

[23] [23]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

The ai sci- entist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292 (2024). Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, and Aditi Raghunathan

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction.arXiv preprint arXiv:2504.15266,

Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction.arXiv preprint arXiv:2504.15266(2025). 11 Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav V orotilov, Gaurav Chaurasia, et al

work page arXiv 2025

[25] [25]

Mlgym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499, 2025

Mlgym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499(2025). Yansheng Qiu, Haoquan Zhang, Zhaopan Xu, Ming Li, Diping Song, Zheng Wang, and Kaipeng Zhang

work page arXiv 2025

[26] [26]

Gollam Rabby, Diyana Muhammed, Prasenjit Mitra, and Sören Auer

Ai idea bench 2025: Ai research idea generation benchmark.arXiv preprint arXiv:2504.14191(2025). Gollam Rabby, Diyana Muhammed, Prasenjit Mitra, and Sören Auer

work page arXiv 2025

[27] [27]

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum

Iterative hypothesis generation for scientific discovery with Monte Carlo Nash equilibrium self-refining trees.arXiv preprint arXiv:2503.19309(2025). Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum

work page arXiv 2025

[28] [28]

Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang

Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025 (2025), 5977–6043. Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang

work page 2025

[29] [29]

Zachary S Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan

Paper2code: Automating code generation from scientific papers in machine learning.arXiv preprint arXiv:2504.17192(2025). Zachary S Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan

work page arXiv 2025

[30] [30]

Core-bench: Fos- tering the credibility of published research through a computational reproducibility agent benchmark.arXiv preprint arXiv:2409.11363, 2024

Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark.arXiv preprint arXiv:2409.11363(2024). Jake C Snell and Thomas L Griffiths

work page arXiv 2024

[31] [31]

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al

Conformal prediction as bayesian quadrature.arXiv preprint arXiv:2502.13228(2025). Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al

work page arXiv 2025

[32] [32]

PaperBench: Evaluating AI's Ability to Replicate AI Research

PaperBench: Evaluating AI’s Ability to Replicate AI Research.arXiv preprint arXiv:2504.01848(2025). Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

CoRR , volume =

Ai-researcher: Autonomous scientific innovation.arXiv preprint arXiv:2505.18705(2025). Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al

work page arXiv 2025

[34] [34]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023). Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang

Scientific discovery in the age of artificial intelligence.Nature620, 7972 (2023), 47–60. Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang

work page 2023

[36] [36]

Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao

Cycleresearcher: Improving automated research via automated review.arXiv preprint arXiv:2411.00816(2024). Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao

work page arXiv 2024

[37] [37]

arXiv preprint arXiv:2502.00640(2025)

Collabllm: From passive responders to active collaborators. arXiv preprint arXiv:2502.00640(2025). Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, and Yulan He

work page arXiv 2025

[38] [38]

Scireplicate-bench: Benchmarking llms in agent-driven algorithmic reproduction from research papers.arXiv preprint arXiv:2504.00255, 2025

Scireplicate-bench: Benchmarking llms in agent-driven algorithmic reproduction from research papers.arXiv preprint arXiv:2504.00255(2025). Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Hahn, Michael Backes, Yue Zhang, and Linyi Yang

work page arXiv 2025

[39] [39]

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts.arXiv preprint arXiv:2604.26506(2026). Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066(2025). 12 Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, and Jiaxuan You

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Evaluation conducted exclusively on synthetic datasets

Tinyscientist: An interactive, extensible, and controllable framework for building research agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 558–590. Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. 2025a. Deepreview: Improving llm-based paper review with human-like deep thinking p...

work page arXiv 2025

[42] [42]

experiment_name

6, organized by the Zhongguancun Academy, the Zhongguancun Institute of Artificial Intelligence, Tsinghua University, Westlake University, and the University of Chicago. We leveraged the ICAIS infrastructure because it enables the generation of multiple automated reviews at scale while maintaining practical usability. Looking ahead, we believe this setup ...
