ReplicationBench: Can AI agents replicate astrophysics research papers?

2025 · arXiv 2510.24591

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

cs.AI · 2026-02-11 · accept · novelty 8.0

ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

cs.DL · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

ARA uses LLMs to build workflow graphs linking sources, methods, and outputs in papers, then scores reproducibility, reaching ~61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.

Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics

physics.comp-ph · 2026-04-14 · conditional · novelty 6.0

An LLM agent autonomously runs read-plan-compute-compare loops on 111 computational physics papers, raising substantive concerns in 42% of them (97.7% only after execution), and generates a full publishable Comment revising the headline conclusion of a Nature Communications paper on 2D-material MOFs.

citing papers explorer

Showing 4 of 4 citing papers.

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences cs.AI · 2026-02-11 · accept · none · ref 23
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction cs.LG · 2026-05-13 · unverdicted · none · ref 29
ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review cs.DL · 2026-05-04 · unverdicted · none · ref 44 · 2 links
Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics physics.comp-ph · 2026-04-14 · conditional · none · ref 16
