Position: The AI conference peer review crisis demands author feedback and reviewer rewards. arXiv preprint arXiv:2505.04966
3 Pith papers cite this work. Polarity classification is still indexing.
Citations by year: 2026 (3)
Citations by role: background (2)
Citing papers:
- MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility
  The MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers. It finds that automated reviews accept flawed outputs containing fabricated claims, whereas human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin.
- Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents
  Malicious actors could use AI agents to submit large numbers of fake papers, inflating the submission count and, under a stable conference acceptance rate, raising the acceptance odds for a small set of chosen legitimate papers.
- Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI
  AI lowers the cost of generating plausible scientific artifacts without lowering the cost of verifying them. The paper therefore proposes blueprints as typed graph components that decompose claims, evidence, and assumptions to enable cheaper downstream verification.