{"paper":{"title":"PaperBench: Evaluating AI's Ability to Replicate AI Research","license":"http://creativecommons.org/licenses/by/4.0/","headline":"AI agents replicate only 21 percent of recent top AI research papers when starting from scratch.","cross_cats":["cs.CL"],"primary_cat":"cs.AI","authors_text":"Amelia Glaese, Benjamin Kinsella, Dane Sherburn, Evan Mays, Giulio Starace, James Aung, Johannes Heidecke, Jun Shern Chan, Leon Maksin, Oliver Jaffe, Rachel Dias, Tejal Patwardhan, Wyatt Thompson","submitted_at":"2025-04-02T15:55:24Z","abstract_excerpt":"We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the author-co-developed rubrics and the LLM judge together provide a reliable, unbiased measure of successful replication that generalizes beyond the 20 selected papers.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"PaperBench is a new benchmark showing frontier AI agents replicate only 21% of tasks needed to reproduce state-of-the-art AI papers, below human expert performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"AI agents replicate only 21 percent of recent top AI research papers when starting from scratch.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"071514f2404d401b85e9d6424f00a09fc9599133491b60f2e84d3924c62bfcbe"},"source":{"id":"2504.01848","kind":"arxiv","version":3},"verdict":{"id":"06b8b154-26d4-4c2b-aff4-c4e871f4e42e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:05:52.270470Z","strongest_claim":"the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.","one_line_summary":"PaperBench is a new benchmark showing frontier AI agents replicate only 21% of tasks needed to reproduce state-of-the-art AI papers, below human expert performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the author-co-developed rubrics and the LLM judge together provide a reliable, unbiased measure of successful replication that generalizes beyond the 20 selected papers.","pith_extraction_headline":"AI agents replicate only 21 percent of recent top AI research papers when starting from scratch."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"6a03193263d61e6f953634c3d426c40e00f27d0d404963d19daaf61fef2d5872"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}