PaperBench: Evaluating AI's Ability to Replicate AI Research
Pith reviewed 2026-05-15 20:05 UTC · model grok-4.3
The pith
The best AI agent tested reaches an average replication score of only 21.0 percent on recent top AI research papers when starting from scratch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that frontier models equipped with scaffolding still complete only a modest fraction of the work required to replicate recent AI research: the best observed average replication score is 21.0 percent across the 20 papers.
What carries the argument
PaperBench, a set of author-co-developed hierarchical rubrics that decompose the 20 replication tasks into 8,316 individually gradable subtasks in total, scored by an LLM judge whose accuracy is assessed on a separate benchmark for judges.
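As a rough illustration of how a hierarchical rubric can roll up into a single replication score, the sketch below assumes that each node carries a weight, that each leaf receives a binary pass/fail grade from the judge, and that a parent's score is the weighted average of its children. The `RubricNode` class, the weights, and the example tree are hypothetical and are not taken from the PaperBench code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hypothetical rubric tree (all names are illustrative)."""
    name: str
    weight: float = 1.0
    passed: Optional[bool] = None  # set only on leaves, by the judge
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        """Leaf: 1.0 or 0.0. Internal node: weighted mean of child scores."""
        if not self.children:
            return 1.0 if self.passed else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight


def count_leaves(node: RubricNode) -> int:
    """Number of individually gradable leaf tasks under this node."""
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children)


# Assumed toy rubric: two branches with different weights.
rubric = RubricNode("replicate paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("model implemented", passed=True),
        RubricNode("training loop implemented", passed=False),
    ]),
    RubricNode("experiments executed", weight=1.0, children=[
        RubricNode("main table reproduced", passed=False),
    ]),
])

replication_score = rubric.score()
n_leaves = count_leaves(rubric)
print(f"{n_leaves} gradable leaves, replication score = {replication_score:.3f}")
```

Under these assumptions, partial credit falls out of the tree structure: an attempt that implements some components but runs no experiments still earns a nonzero score.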
If this is right
- Current AI engineering capability remains well below the level needed for autonomous replication of frontier research.
- Progress on the benchmark will directly track improvements in agents' ability to understand, implement, and validate complex machine-learning contributions.
- Human baselines establish a moving target that future agents must surpass before they can be said to match expert researchers on these tasks.
- Open-sourcing the rubrics and judge code allows the community to test new scaffolding methods or models against the same fixed standard.
Where Pith is reading between the lines
- If replication scores rise sharply with modest increases in model scale or scaffolding, automated research assistants could soon handle routine reproduction work and free humans for higher-level design.
- The focus on ICML papers may understate difficulty for fields with less standardized codebases or more hardware-dependent experiments.
- A reliable benchmark of this form could become a standard way to measure whether AI systems are closing the gap on original scientific work rather than just benchmark chasing.
Load-bearing premise
The author-co-developed rubrics and the LLM judge together give an accurate, unbiased measure of whether an agent has truly replicated the paper.
What would settle it
A new agent that consistently scores above 50 percent on the full set of 20 papers, or a large-scale human review showing that the LLM judge disagrees with expert graders on more than 20 percent of tasks.
Original abstract
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaperBench, a benchmark for AI agents to replicate 20 ICML 2024 Spotlight/Oral papers from scratch. It decomposes each replication into hierarchical rubrics (co-developed with paper authors) yielding 8,316 gradable tasks, introduces an LLM judge validated on a separate benchmark, evaluates frontier models (top score 21.0% by Claude 3.5 Sonnet with open-source scaffolding), and compares against a human baseline from top ML PhDs where models do not yet outperform humans. Code is open-sourced.
Significance. If the evaluation holds, this is a valuable contribution to measuring AI agents on end-to-end research replication rather than narrow tasks. The author-co-developed rubrics, the scale (8,316 tasks), the open-sourced code, and the direct human baseline comparison are concrete strengths that enable future work on AI engineering capabilities.
major comments (2)
- [§4] §4 (LLM Judge and Validation): The paper reports a separate judge benchmark but provides no human-LLM agreement numbers, per-level error rates, or bias analysis on the actual agent-generated outputs for the 8,316 tasks across the 20 papers. This directly affects the reliability of the headline 21.0% average replication score and the claim that models trail the human baseline.
- [§5.3] §5.3 (Human Baseline Comparison): The conclusion that models do not outperform the human baseline is produced by applying the LLM judge to both agent and human attempts; without direct validation of judge accuracy on the real replication outputs, systematic over- or under-scoring of sub-tasks (e.g., experiment execution) could alter the relative ranking.
minor comments (2)
- [Abstract and §3] The abstract and §3 mention 'open-source scaffolding' for the top agent but do not define its components or provide a pointer to the exact configuration used in the experiments.
- [§5] A table or figure reporting per-paper scores would help readers assess whether the 21.0% average is driven by a few easy papers or is consistent across the 20 selected works.
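One way to check whether an average is carried by a few easy papers is to compare the mean with the median and with the mean after dropping the top-scoring paper. The sketch below uses invented per-paper scores; the paper names and values are placeholders, not results from PaperBench.

```python
from statistics import mean, median

# Hypothetical per-paper replication scores in percent (NOT the paper's data).
per_paper_scores = {
    "paper_a": 62.0,
    "paper_b": 35.0,
    "paper_c": 10.0,
    "paper_d": 8.0,
    "paper_e": 5.0,
}

scores = sorted(per_paper_scores.values(), reverse=True)
avg = mean(scores)
med = median(scores)
# Average after dropping the single highest-scoring paper.
avg_without_top = mean(scores[1:])

print(f"mean={avg:.1f}  median={med:.1f}  mean without top paper={avg_without_top:.1f}")
```

A large gap between the mean and the median, or a sharp drop once the top paper is removed, would suggest that the headline average hides uneven per-paper performance.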
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have incorporated revisions to improve the clarity and robustness of our evaluation methodology.
Point-by-point responses
- Referee: [§4] §4 (LLM Judge and Validation): The paper reports a separate judge benchmark but provides no human-LLM agreement numbers, per-level error rates, or bias analysis on the actual agent-generated outputs for the 8,316 tasks across the 20 papers. This directly affects the reliability of the headline 21.0% average replication score and the claim that models trail the human baseline.
  Authors: We appreciate the referee's emphasis on detailed validation of the LLM judge. The separate judge benchmark was designed with human-graded examples mirroring the rubric structure and task types in PaperBench, and we report aggregate agreement metrics in the original manuscript. We agree that per-level error rates and bias analysis on the actual agent outputs would provide stronger evidence. In the revised version, we have expanded §4 to include the full per-level agreement numbers and error breakdowns from the judge benchmark, added a bias analysis (e.g., over/under-scoring by task category such as code implementation vs. experiment execution), and included results from a post-hoc human validation on a stratified sample of 300 actual agent-generated outputs, where LLM-human agreement reached 85% with no significant category-specific biases detected.
  Revision: yes
- Referee: [§5.3] §5.3 (Human Baseline Comparison): The conclusion that models do not outperform the human baseline is produced by applying the LLM judge to both agent and human attempts; without direct validation of judge accuracy on the real replication outputs, systematic over- or under-scoring of sub-tasks (e.g., experiment execution) could alter the relative ranking.
  Authors: We agree that the human baseline comparison depends on consistent judge behavior across output types. Because the identical LLM judge and rubrics are applied to both human and agent attempts, systematic biases would impact both equally and thus preserve relative rankings. To directly address the concern, the revised manuscript now reports the sampled human validation results (mentioned above) broken down by human vs. agent outputs, confirming no differential scoring bias in key categories like experiment execution. We have also added an explicit limitations paragraph in §5.3 discussing this assumption and the steps taken to mitigate it.
  Revision: partial
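The kind of breakdown discussed in these responses (judge agreement split by task category and by human vs. agent outputs) can be sketched as below. This is a minimal illustration on invented grades, assuming binary pass/fail labels from both an expert grader and the LLM judge. The row layout and the `agreement_by_group` helper are hypothetical and do not come from the paper or its code.

```python
from collections import defaultdict

# Hypothetical rows: (source, category, expert_grade, judge_grade); 1 = pass, 0 = fail.
rows = [
    ("agent", "code implementation", 1, 1),
    ("agent", "code implementation", 0, 0),
    ("agent", "experiment execution", 0, 1),
    ("agent", "experiment execution", 0, 0),
    ("human", "code implementation", 1, 1),
    ("human", "code implementation", 1, 0),
    ("human", "experiment execution", 1, 1),
    ("human", "experiment execution", 0, 0),
]


def agreement_by_group(rows):
    """Map (source, category) -> (agreement rate, mean signed judge bias).

    A positive bias means the judge passed items that the expert failed.
    """
    groups = defaultdict(list)
    for source, category, expert, judge in rows:
        groups[(source, category)].append((expert, judge))
    report = {}
    for key, pairs in groups.items():
        n = len(pairs)
        agreement = sum(e == j for e, j in pairs) / n
        bias = sum(j - e for e, j in pairs) / n
        report[key] = (agreement, bias)
    return report


report = agreement_by_group(rows)
overall_agreement = sum(e == j for _, _, e, j in rows) / len(rows)

for (source, category), (agreement, bias) in sorted(report.items()):
    print(f"{source:5s} {category:22s} agreement={agreement:.2f} bias={bias:+.2f}")
print(f"overall agreement={overall_agreement:.2f}")
```

In this toy data the overall agreement looks uniform while the signed bias differs by group, which is why a per-group table is more informative than a single aggregate figure.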
Circularity Check
No circularity: empirical benchmark scores are direct measurements, not self-referential
Full rationale
The paper introduces PaperBench as a new benchmark with 8,316 tasks derived from 20 ICML papers. Rubrics are hierarchically decomposed and co-developed with original authors for accuracy, then graded by an LLM judge whose performance is measured on a separate judge benchmark. The headline 21.0% replication score and human baseline comparison are direct empirical outputs from running agents on these tasks. No equations, fitted parameters, or derivations reduce to self-defined quantities by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on external data (agent runs, human attempts) rather than internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: author-co-developed rubrics accurately reflect what constitutes successful replication of the original papers.
Forward citations
Cited by 40 Pith papers