Pith Number

pith:XKQHKK55

pith:2025:XKQHKK55KWAFKFJR23M2XUM7YW
not attested · not anchored · not stored · refs pending

PaperBench: Evaluating AI's Ability to Replicate AI Research

Amelia Glaese, Benjamin Kinsella, Dane Sherburn, Evan Mays, Giulio Starace, James Aung, Johannes Heidecke, Jun Shern Chan, Leon Maksin, Oliver Jaffe, Rachel Dias, Tejal Patwardhan, Wyatt Thompson

AI agents achieve an average replication score of only 21 percent on recent top AI research papers when starting from scratch.

arxiv:2504.01848 v3 · 2025-04-02 · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{XKQHKK55KWAFKFJR23M2XUM7YW}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

The best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.

C2 weakest assumption

That the author-co-developed rubrics and the LLM judge together provide a reliable, unbiased measure of successful replication that generalizes beyond the 20 selected papers.

C3 one line summary

PaperBench is a new benchmark showing that frontier AI agents reach an average replication score of only 21% when reproducing state-of-the-art AI papers, below human expert performance.

Formal links

2 machine-checked theorem links

Cited by

38 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:50.314464Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

baa0752bbd5580551531d6d9abd19fc593ec6510e2ed23dc614365d35a1f45d9

Aliases

arxiv: 2504.01848 · arxiv_version: 2504.01848v3 · doi: 10.48550/arxiv.2504.01848 · pith_short_12: XKQHKK55KWAF · pith_short_16: XKQHKK55KWAFKFJR · pith_short_8: XKQHKK55
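
For this record, the Pith Number and its short aliases line up with the canonical hash: the first 26 characters of the RFC 4648 base32 encoding of the hash bytes equal the full identifier, and the short forms are its 8-, 12-, and 16-character prefixes. This relationship is inferred from the values shown on this page, not from a published specification, so the sketch below records an observation rather than the builder's documented method. The helper name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "baa0752bbd5580551531d6d9abd19fc593ec6510e2ed23dc614365d35a1f45d9"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: base32-encode the digest bytes and keep a prefix.

    The 26-character length is an assumption read off this record.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)     # XKQHKK55KWAFKFJR23M2XUM7YW
print(aliases)
```

If the printed identifier differs from the one in the page header, either the hash was copied incorrectly or the identifier scheme differs from what is assumed here.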
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/XKQHKK55KWAFKFJR23M2XUM7YW \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: baa0752bbd5580551531d6d9abd19fc593ec6510e2ed23dc614365d35a1f45d9
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "b7772e59a981a43b3b67b065ee3e42aea25164a0b0a1fa578fcd63564266fb7f",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2025-04-02T15:55:24Z",
    "title_canon_sha256": "dc269e44cb124817f631b9c1f198ba938035cc4594ee73b20aa8047ea4375577"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2504.01848",
    "kind": "arxiv",
    "version": 3
  }
}
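
The same check can be run offline against the record printed above, without calling the API. The sketch below applies the canonicalization used in the shell one-liner (sorted keys, compact separators, non-ASCII left unescaped, UTF-8 bytes) and hashes the result with SHA-256. It assumes the JSON shown on this page is exactly what the API returns as `canonical_record`; the function name `canonical_sha256` is hypothetical, and the digest was not recomputed here, so compare the printed value with the canonical hash yourself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hypothetical helper mirroring the canonicalization in the shell check."""
    payload = json.dumps(
        record,
        sort_keys=True,            # key order must not affect the hash
        separators=(",", ":"),     # no whitespace between tokens
        ensure_ascii=False,        # keep non-ASCII characters as-is
    ).encode()                     # UTF-8 by default
    return hashlib.sha256(payload).hexdigest()


# Record copied from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "b7772e59a981a43b3b67b065ee3e42aea25164a0b0a1fa578fcd63564266fb7f",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2025-04-02T15:55:24Z",
        "title_canon_sha256": "dc269e44cb124817f631b9c1f198ba938035cc4594ee73b20aa8047ea4375577",
    },
    "schema_version": "1.0",
    "source": {"id": "2504.01848", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)
```

Because keys are sorted before hashing, a mirror that stores the record with a different key order should still arrive at the same digest.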