CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Arvind Narayanan; Benedikt Stroebl; Nitya Nadgir; Sayash Kapoor; Zachary S. Siegel

arXiv: 2409.11363 · v2 · pith:SQNMKGXMnew · submitted 2024-09-17 · 💻 cs.CL · cs.AI · cs.MA

Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan

classification 💻 cs.CL · cs.AI · cs.MA

keywords agents, research, tasks, scientific, agent, benchmark, core-bench, reproducibility

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
cs.AI · 2026-02 · accept · novelty 8.0

ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
cs.CL · 2026-06 · unverdicted · novelty 7.0

NatureBench evaluates ten frontier AI coding agents on 90 tasks from Nature papers under web-search-disabled conditions and finds the strongest agent surpasses published SOTA on only 17.8% of tasks, succeeding mainly ...

PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation
cs.CR · 2026-06 · unverdicted · novelty 7.0

PixJail automates construction of paper-specific attack modules and unified evaluation pipelines for text-to-image jailbreaks, reproducing eleven methods with 2.1% average and 0% median error.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
cs.LG · 2026-05 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
cs.LG · 2026-05 · unverdicted · novelty 7.0

AI agents handle individual data-loading and reformatting steps on neuroscience datasets but rarely complete fully error-free end-to-end pipelines, and AI judges are unreliable without ground-truth references.

Evaluating LLM Agents on Automated Software Analysis Tasks
cs.SE · 2026-04 · unverdicted · novelty 7.0

A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its ...

How Far Are We From True Auto-Research?
cs.AI · 2026-05 · unverdicted · novelty 6.0

ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.

ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery
cs.LG · 2026-05 · unverdicted · novelty 6.0

ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.

MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility
cs.LG · 2026-05 · conditional · novelty 6.0

MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological fai...

Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
cs.LG · 2026-05 · unverdicted · novelty 6.0

Agentic AI handles individual data-loading subtasks well but rarely produces fully error-free end-to-end solutions for reusing diverse neuroscience datasets.

Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches
cs.SE · 2026-02 · unverdicted · novelty 6.0

Agent-based AI workflows repair injected reproducibility failures in R social-science code at 69-96% success, substantially outperforming prompt-based LLM approaches at 31-79%.

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
cs.CL · 2025-09 · unverdicted · novelty 6.0

CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.

ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
cs.SE · 2026-05 · unverdicted · novelty 5.0

ReproScore separates readiness (26 static sub-metrics) from outcome (execution probes) and shows near-zero correlation between them on 423 repositories, validating the separation.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
cs.AI · 2026-05 · unverdicted · novelty 4.0

A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.