hub Mixed citations

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin · 2025 · cs.AI · arXiv 2504.01848

Mixed citation behavior. Most common role is background (67%).

34 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 34 citing papers arXiv PDF

abstract

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 dataset 2 method 1

citation-polarity summary

background 6 use dataset 2 use method 1

representative citing papers

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

cs.AI · 2026-02-11 · accept · novelty 8.0

ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

cs.AI · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

AI agents handle individual data-loading and reformatting steps on neuroscience datasets but rarely complete fully error-free end-to-end pipelines, and AI judges are unreliable without ground-truth references.

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

physics.flu-dyn · 2026-05-07 · conditional · novelty 7.0

AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.

AcademiClaw: When Students Set Challenges for AI Agents

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

Evaluating LLM Agents on Automated Software Analysis Tasks

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

cs.AI · 2026-04-05 · conditional · novelty 7.0

FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.

Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

cs.DB · 2026-02-25 · unverdicted · novelty 7.0

New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on large inputs.

Evalet: Evaluating Large Language Models through Functional Fragmentation

cs.HC · 2025-09-14 · conditional · novelty 7.0

Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.

How Far Are We From True Auto-Research?

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.

ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.

MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

cs.LG · 2026-05-15 · conditional · novelty 6.0

MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin

Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.

ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

cs.DL · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

ARA uses LLMs to build workflow graphs linking sources, methods, and outputs in papers, then scores reproducibility, reaching ~61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.

Evaluation-driven Scaling for Scientific Discovery

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.

In-Place Test-Time Training

cs.LG · 2026-04-07 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations

physics.comp-ph · 2026-03-31 · unverdicted · novelty 6.0

QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperforming baselines.

Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches

cs.SE · 2026-02-09 · unverdicted · novelty 6.0

Agent-based AI workflows repair injected reproducibility failures in R social-science code at 69-96% success, substantially outperforming prompt-based LLM approaches at 31-79%.

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

cs.CL · 2025-09-19 · unverdicted · novelty 6.0

CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.

RExBench: Can coding agents autonomously implement AI research extensions?

cs.CL · 2025-06-27 · unverdicted · novelty 6.0

RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

cs.MA · 2026-05-21 · unverdicted · novelty 5.0

Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.

citing papers explorer

Showing 34 of 34 citing papers.

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences cs.AI · 2026-02-11 · accept · none · ref 15 · internal anchor
ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games cs.AI · 2026-05-17 · unverdicted · none · ref 11 · 2 links · internal anchor
WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction cs.LG · 2026-05-13 · unverdicted · none · ref 24 · internal anchor
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse cs.LG · 2026-05-12 · unverdicted · none · ref 17 · 2 links · internal anchor
AI agents handle individual data-loading and reformatting steps on neuroscience datasets but rarely complete fully error-free end-to-end pipelines, and AI judges are unreliable without ground-truth references.
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents physics.flu-dyn · 2026-05-07 · conditional · none · ref 30 · internal anchor
AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.
AcademiClaw: When Students Set Challenges for AI Agents cs.AI · 2026-05-04 · unverdicted · none · ref 18 · internal anchor
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
Evaluating LLM Agents on Automated Software Analysis Tasks cs.SE · 2026-04-13 · unverdicted · none · ref 53 · internal anchor
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models cs.CL · 2026-04-08 · unverdicted · none · ref 16 · internal anchor
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification cs.AI · 2026-04-05 · conditional · none · ref 13 · internal anchor
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"? cs.DB · 2026-02-25 · unverdicted · none · ref 48 · internal anchor
New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on large inputs.
Evalet: Evaluating Large Language Models through Functional Fragmentation cs.HC · 2025-09-14 · conditional · none · ref 76 · internal anchor
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
How Far Are We From True Auto-Research? cs.AI · 2026-05-18 · unverdicted · none · ref 24 · internal anchor
ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.
ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery cs.LG · 2026-05-16 · unverdicted · none · ref 23 · internal anchor
ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility cs.LG · 2026-05-15 · conditional · none · ref 32 · internal anchor
MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design cs.AI · 2026-05-15 · unverdicted · none · ref 66 · internal anchor
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review cs.DL · 2026-05-04 · unverdicted · none · ref 45 · 2 links · internal anchor
ARA uses LLMs to build workflow graphs linking sources, methods, and outputs in papers, then scores reproducibility, reaching ~61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.
Evaluation-driven Scaling for Scientific Discovery cs.LG · 2026-04-21 · unverdicted · none · ref 129 · internal anchor
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization cs.AI · 2026-04-14 · unverdicted · none · ref 21 · internal anchor
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.
In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 46 · internal anchor
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations physics.comp-ph · 2026-03-31 · unverdicted · none · ref 35 · internal anchor
QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperforming baselines.
Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches cs.SE · 2026-02-09 · unverdicted · none · ref 29 · internal anchor
Agent-based AI workflows repair injected reproducibility failures in R social-science code at 69-96% success, substantially outperforming prompt-based LLM approaches at 31-79%.
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics cs.CL · 2025-09-19 · unverdicted · none · ref 49 · internal anchor
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
RExBench: Can coding agents autonomously implement AI research extensions? cs.CL · 2025-06-27 · unverdicted · none · ref 42 · internal anchor
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators cs.MA · 2026-05-21 · unverdicted · none · ref 36 · internal anchor
Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment cs.SE · 2026-05-13 · unverdicted · none · ref 32 · internal anchor
ReproScore separates readiness (26 static sub-metrics) from outcome (execution probes) and shows near-zero correlation between them on 423 repositories, validating the separation.
RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers cs.AI · 2026-04-06 · unverdicted · none · ref 8 · internal anchor
RESCORE recovers task-coherent simulations from 40.7% of 500 CDC papers via a three-component LLM agent pipeline and claims a 10X speedup over manual human replication.
Kimi K2.5: Visual Agentic Intelligence cs.CL · 2026-02-02 · unverdicted · none · ref 55 · internal anchor
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases cs.SE · 2025-10-28 · unverdicted · none · ref 44 · internal anchor
CodeWiki presents a unified framework for repository-level documentation across seven languages using hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, outperforming DeepWiki by 4.73% on CodeWikiBench.
Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 69 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery cs.AI · 2026-05-22 · unverdicted · none · ref 109 · internal anchor
A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work cs.AI · 2026-05-20 · conditional · none · ref 32 · 2 links · internal anchor
QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.
Risk Reporting for Developers' Internal AI Model Use cs.CY · 2026-04-27 · unverdicted · none · ref 44 · internal anchor
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows cs.IR · 2026-01-19 · unverdicted · none · ref 38 · internal anchor
RAG-based LLM extraction reaches 89% accuracy on clinical trial protocols versus 62.6% for standalone models and cuts simulated workflow time by 40%.
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery cs.CL · 2026-04-07 · unreviewed · ref 8 · internal anchor

PaperBench: Evaluating AI's Ability to Replicate AI Research

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer