Scireplicate-bench: Benchmarking llms in agent-driven algorithmic reproduction from research papers

Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He · 2025 · arXiv 2504.00255

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

dataset 2 background 1

citation-polarity summary

background 2 use dataset 1

representative citing papers

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

cs.AI · 2026-02-11 · accept · novelty 8.0

ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.

ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.

MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

cs.LG · 2026-05-15 · conditional · novelty 6.0

MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin

RExBench: Can coding agents autonomously implement AI research extensions?

cs.CL · 2025-06-27 · unverdicted · novelty 6.0

RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

cs.AI · 2026-05-22 · unverdicted · novelty 4.0

A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

cs.AI · 2025-04-28 · accept · novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

cs.CL · 2026-04-07

citing papers explorer

Showing 8 of 8 citing papers.

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences cs.AI · 2026-02-11 · accept · none · ref 20
ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.
ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery cs.LG · 2026-05-16 · unverdicted · none · ref 29
ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility cs.LG · 2026-05-15 · conditional · none · ref 38
MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin
RExBench: Can coding agents autonomously implement AI research extensions? cs.CL · 2025-06-27 · unverdicted · none · ref 50
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery cs.AI · 2026-05-22 · unverdicted · none · ref 116
A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.
AI for Auto-Research: Roadmap & User Guide cs.AI · 2026-05-18 · unverdicted · none · ref 225
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 110
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery cs.CL · 2026-04-07 · unreviewed · ref 9

Scireplicate-bench: Benchmarking llms in agent-driven algorithmic reproduction from research papers

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer