MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility
Pith reviewed 2026-05-20 19:56 UTC · model grok-4.3
The pith
Autonomous research systems produce flawed machine learning papers, and workflow design predicts quality better than compute scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the MLReplicate benchmark on ICML 2025 papers, the authors show that all tested autonomous systems generate manuscripts containing methodological flaws and reproducibility failures according to human experts, even when some pass automated review. Neither token budget nor computational cost predicts output quality, and the cheapest system outperforms the most resource-intensive system in human evaluation despite a 38-fold difference in input tokens.
What carries the argument
The MLReplicate benchmark, which standardizes outstanding ML papers as inputs to autonomous systems and applies a dual automated-plus-human evaluation protocol while tracking cost and intervention metrics.
If this is right
- Automated conference-style reviews accept outputs that contain fabricated or unsupported claims according to human experts.
- No tested system achieves consistent reproducibility across the benchmark tasks.
- Token budget and computational cost do not correlate with higher-quality generated research.
- Differences in system workflows produce measurable differences in human-assessed rigor even when resources vary widely.
Where Pith is reading between the lines
- Future systems could prioritize improvements to internal reasoning loops and experiment validation steps rather than simply increasing model size or token limits.
- Extending the benchmark to papers from other fields would test whether the observed workflow-over-compute pattern holds outside machine learning.
- The gap between automated acceptance and human rejection points to a need for stronger automated checks that better approximate expert scrutiny of methods and results.
Load-bearing premise
Human expert evaluations supply an unbiased and reliable measure of scientific rigor and reproducibility.
What would settle it
A replication study that blinds reviewers to system identity, measures inter-rater agreement, and reports reviewer selection criteria; low agreement or detectable bias would undermine the human evaluation results.
Original abstract
Autonomous research systems capable of generating complete scientific manuscripts have advanced rapidly, yet robust and realistic evaluation frameworks have failed to keep pace. To bridge this gap, we introduce MLReplicate, an end-to-end benchmark evaluating autonomous research systems on machine learning reproducibility. The benchmark was constructed from ICML 2025 outstanding papers reformulated into standardized input specifications and evaluated across 6 state-of-the-art research systems: AI SCIENTIST-V1, AI SCIENTIST-V2, AGENT LABORATORY, CYCLERESEARCHER, AI RESEARCHER, and TINY SCIENTIST, yielding 45 generated manuscripts, with 3 failed experiments. Outputs are assessed using a dual-protocol approach that combines automated conference-style review and structured expert human evaluation, while tracking computational cost, runtime, and the amount of required human intervention. The automated conference-style review accepted 10 out of 37 valid submissions. An additional 8 submissions were desk-rejected before review for failing to meet the minimum page threshold. In contrast to automated reviews, human reviewers consistently identified methodological flaws, hallucinated experimental results, and reproducibility failures across all systems, and 59% of accepted automated reviews contained fabricated or unsupported claims. We further find that neither token budget nor computational cost predicts output quality: the cheapest system outperforms the most resource-intensive system in human evaluation, despite a 38-fold difference in input tokens. We thus demonstrate that autonomous research workflow design matters more than the scale of compute. MLReplicate exposes a substantial gap between current autonomous research systems and genuine scientific rigor, and establishes a practical, extensible evaluation framework for systematic progress toward trustworthy AI-driven scientific discovery.
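The counts in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 37 valid submissions and the 8 desk rejections together make up the 45 generated manuscripts; the numbers add up that way, but the abstract does not state it explicitly, and it does not say where the 3 failed experiments fit. The function name `tally` is hypothetical.

```python
# Counts as reported in the abstract.
GENERATED = 45      # manuscripts produced by the six systems
DESK_REJECTED = 8   # below the minimum page threshold
VALID = 37          # submissions that reached automated review
ACCEPTED = 10       # accepted by the automated reviewer


def tally(generated, desk_rejected, valid, accepted):
    """Check that the counts are consistent and return acceptance rates.

    Assumption (not stated in the abstract): desk-rejected and valid
    submissions partition the generated manuscripts.
    """
    if desk_rejected + valid != generated:
        raise ValueError("desk-rejected + valid must equal generated")
    if accepted > valid:
        raise ValueError("accepted cannot exceed valid submissions")
    return {
        "accept_rate_valid": accepted / valid,
        "accept_rate_all": accepted / generated,
    }


rates = tally(GENERATED, DESK_REJECTED, VALID, ACCEPTED)
print(f"accepted among reviewed:  {rates['accept_rate_valid']:.1%}")
print(f"accepted among generated: {rates['accept_rate_all']:.1%}")
```

Under that assumption, the automated reviewer accepted roughly 27% of the submissions it actually reviewed and roughly 22% of everything the systems generated.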
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MLReplicate, a benchmark for autonomous research systems in machine learning reproducibility. It reformulates ICML 2025 outstanding papers into standardized inputs and evaluates six systems (AI SCIENTIST-V1, AI SCIENTIST-V2, AGENT LABORATORY, CYCLERESEARCHER, AI RESEARCHER, TINY SCIENTIST) to produce 45 manuscripts (with 3 failures). Outputs are assessed via automated conference-style review (10/37 valid submissions accepted, 8 desk-rejected for page count) plus structured human expert evaluation, while tracking token budgets, costs, runtime, and human intervention. Key results: humans identify methodological flaws, hallucinations, and reproducibility failures across systems; 59% of automated accepts contain fabricated claims; neither token budget nor cost predicts quality, with the cheapest system outperforming the most expensive despite a 38-fold token difference. The central claim is that workflow design matters more than compute scale.
Significance. If the human evaluation protocol is shown to be reliable, this work offers a concrete, extensible benchmark grounded in real conference papers and dual automated/human assessment. The empirical finding that design trumps scale, supported by cross-system comparisons and cost/token tracking, would be a useful contribution to evaluating AI-driven scientific discovery. The concrete acceptance rates, fabrication incidence, and failure counts provide falsifiable reference points for future systems.
major comments (1)
- [Human evaluation protocol] Human evaluation section: the manuscript reports that human reviewers 'consistently identified methodological flaws, hallucinated experimental results, and reproducibility failures' and that the cheapest system outperforms the most resource-intensive one in human evaluation, yet provides no details on reviewer selection criteria, blinding to system identity, inter-rater agreement statistics, or rubrics for scoring rigor/reproducibility. Because the central claim (workflow design > compute scale, with no correlation between token budget/cost and quality) rests directly on these human judgments being an unbiased and reliable measure, the absence of these methodological safeguards is load-bearing and must be addressed before the ranking and correlation results can be interpreted with confidence.
minor comments (2)
- [Abstract] The abstract states '3 failed experiments' without defining failure criteria or how these cases were excluded from the 37 valid submissions and subsequent analyses.
- [Results] Table or results section reporting the 38-fold token difference and cost/quality correlations should include the exact per-system token counts, costs, and human scores to allow direct verification of the 'neither predicts quality' claim.
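If the per-system numbers requested above were published, the "neither predicts quality" claim could be checked directly with a rank correlation. The following is a minimal sketch of Spearman's coefficient in plain Python, not the authors' analysis. The system labels, token counts, and scores are hypothetical placeholders, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# HYPOTHETICAL placeholder data, not taken from the paper:
# (system label, input tokens, mean human score)
records = [
    ("system_a", 1_000_000, 3.1),
    ("system_b", 4_000_000, 2.4),
    ("system_c", 9_000_000, 2.9),
    ("system_d", 15_000_000, 2.2),
    ("system_e", 22_000_000, 3.0),
    ("system_f", 38_000_000, 2.6),
]
tokens = [r[1] for r in records]
scores = [r[2] for r in records]
print(f"Spearman rho (tokens vs. human score): {spearman(tokens, scores):.3f}")
```

With only six systems, any such coefficient carries wide uncertainty, which is one more reason the exact per-system values matter for verification.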
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address the major comment on the human evaluation protocol below and will revise the manuscript to incorporate the requested details, thereby strengthening the interpretability of our results.
Referee: [Human evaluation protocol] Human evaluation section: the manuscript reports that human reviewers 'consistently identified methodological flaws, hallucinated experimental results, and reproducibility failures' and that the cheapest system outperforms the most resource-intensive one in human evaluation, yet provides no details on reviewer selection criteria, blinding to system identity, inter-rater agreement statistics, or rubrics for scoring rigor/reproducibility. Because the central claim (workflow design > compute scale, with no correlation between token budget/cost and quality) rests directly on these human judgments being an unbiased and reliable measure, the absence of these methodological safeguards is load-bearing and must be addressed before the ranking and correlation results can be interpreted with confidence.
Authors: We agree that the current manuscript would benefit from expanded methodological details on the human evaluation to support the reliability of the judgments underlying our central claims. In the revised version, we will add a dedicated subsection describing: reviewer selection criteria (PhD-level ML researchers with publications at ICML/NeurIPS and at least five years of relevant experience); blinding procedures (reviewers received anonymized manuscripts with no information on the generating system or token budgets); inter-rater agreement (Fleiss' kappa computed across the three reviewers per manuscript, with values to be reported); and the scoring rubrics (explicit criteria and examples for detecting methodological flaws, hallucinated results, and reproducibility failures). These additions will provide the necessary transparency without altering the reported outcomes or conclusions. We believe this revision directly addresses the concern and allows readers to assess the strength of the evidence that workflow design matters more than scale.
revision: yes
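For readers unfamiliar with the agreement statistic named in the rebuttal, here is a minimal sketch of Fleiss' kappa in plain Python. It assumes every manuscript is rated by the same number of reviewers (three in the rebuttal) and that each reviewer assigns exactly one category per manuscript. The category labels and the toy ratings are hypothetical and are not data from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of raters who assigned item i to
    category j. Every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    n_cats = len(counts[0])
    total = n_items * n_raters
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed agreement per item, averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_exp) / (1 - p_exp)


# HYPOTHETICAL toy ratings: 4 manuscripts, 3 reviewers each,
# categories = ("flawed", "sound"). Not data from the paper.
toy = [
    [3, 0],
    [2, 1],
    [3, 0],
    [0, 3],
]
print(f"Fleiss' kappa on toy data: {fleiss_kappa(toy):.3f}")
```

Values near 1 indicate agreement well above chance, while values near 0 or below indicate agreement no better than chance, so reporting the statistic per rubric dimension would let readers judge how much weight the human rankings can bear.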
Circularity Check
Empirical benchmark self-contained; no circular reductions in derivation chain
The paper constructs MLReplicate from external ICML 2025 outstanding papers as standardized inputs, then evaluates six autonomous systems to produce 45 manuscripts assessed via automated reviews and independent human evaluations. The central claim—that workflow design matters more than compute scale because neither token budget nor cost predicts quality, with the cheapest system outperforming the most expensive despite a 38-fold token difference—follows directly from these observed performance metrics and reviewer judgments. No equations, fitted parameters, or results reduce by construction to quantities defined within the paper itself; the findings emerge from new empirical comparisons against external benchmarks rather than self-definitional loops, self-citation chains, or renamed known results. The derivation remains self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: ICML 2025 outstanding papers are representative of high-quality machine learning research suitable for reproducibility benchmarking.