Fire-bench: Evaluating agents on the rediscovery of scientific insights

Wang Z, Zhang X, Goyal A, Pratt S, Ji J, Wu J, et al · 2026 · arXiv 2602.02905

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ReproRepo uses GitHub issues as natural supervision to benchmark LLM agents on detecting reproducibility blockers across 1,149 ML papers, with the top agent finding related issues for roughly 90% of cases.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

MLS-Bench is a 140-task benchmark that evaluates AI agents on inventing generalizable and scalable ML methods, finding that they lag human performance, especially in insight-driven invention rather than tuning.

Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

cs.AI · 2026-06-22 · unverdicted · novelty 6.0

Closed-loop LM-agent auto research finds some transferable gains on molecular property prediction benchmarks via external data, but model and feature edits selected on validation do not transfer.

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

cs.AI · 2026-05-25 · unverdicted · novelty 6.0

ScientistOne introduces Chain-of-Evidence and an audit system that achieves zero hallucinated references, perfect score verification, and top method-code alignment while matching or beating human experts on five frontier tasks and generalizing to six more.

CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yielding improved benchmark performance with auditable traces.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

cs.AI · 2026-05-22 · unverdicted · novelty 4.0

A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.
